Working in the life sciences industry, I often deal with users whose requests might be considered strange in other fields. For example, my organization has users asking for systems with 2 terabytes of RAM, and others asking for systems with 12 terabytes. To a normal system administrator who doesn’t run OLTP systems for a bank or brokerage, where you might find huge-memory database systems, this technical requirement seems silly. However, for gene sequence assembly and analysis, this much memory really is a requirement with longer sequences. Short-read assemblers like Velvet can chew through that much memory about as fast as the system can allocate it.
You don’t have to be on the forefront of bleeding-edge server technology to know that x86 systems with 12 terabytes of RAM simply don’t exist. With RAM density as it is right now, there’s simply no way to fit that many DIMMs on a board. However, some inventive software steps up to the plate.
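To put rough numbers on the DIMM problem, here is a back-of-the-envelope sketch. The DIMM capacities in it are hypothetical round figures chosen for illustration, not the specs of any particular board or module, and the function name `dimms_needed` is mine.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def dimms_needed(target_bytes: int, dimm_bytes: int) -> int:
    """Return how many DIMMs of dimm_bytes each are needed to reach target_bytes."""
    # Integer ceiling division, so a partial DIMM still counts as a whole slot.
    return -(-target_bytes // dimm_bytes)


# Hypothetical DIMM sizes in GiB (assumptions, not product data).
for dimm_gib in (8, 16, 32):
    slots = dimms_needed(12 * TIB, dimm_gib * GIB)
    print(f"{dimm_gib} GiB DIMMs: {slots} slots needed for 12 TiB")
```

Even with the largest of these assumed sizes, the slot count runs into the hundreds, which is why aggregating several hosts starts to look attractive.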
We’ve been meeting with a company called ScaleMP. ScaleMP is, in strict terms, a virtualization software vendor. However, unlike vendors such as VMware, ScaleMP specializes in using virtual machine monitors to aggregate CPU and memory resources among a number of InfiniBand-connected hosts, presenting a logical system with all of the combined CPU and memory resources of the aggregated physical machines. Through their black magic technology, they apparently do this without substantial overhead to the host, and the resulting virtual machine performs on par with a parallelized MPI solution running on operating systems installed on the bare metal of the physical nodes. The difference, of course, is that you have a very large coherent block of memory to work with. If you know Isilon’s storage architecture, the pattern should look familiar.
There’s an article from HPCwire written by Shai Fultheim, ScaleMP’s CEO, that sums up this approach a lot better than I could hope to. But the ten-cent version is that you can use this to virtualize compute, memory and I/O resources to present a very large system for single tasks that require tons of memory and extremely fast parallelism, or you can use it to aggregate an entire cluster into a single virtualized node that would completely eliminate the need for traditional cluster management tools.
I was thinking about this a little while ago when I posted “VMotion/Live Migration is not an HA feature”:
“Maybe we’ll see cache-coherent shared-memory virtual infrastructures running over InfiniBand, removing the network overhead that was pointed to as a problem by Rational Survivability.”
It started out as a sidenote, but it really got me thinking about the big picture. Why isn’t this a direction we’re seeing existing virtualization vendors moving in, vendors who currently embrace the partitioning approach? Storage vendors have historically worked from the idea that true virtualization involves both aggregation and partitioning. It’s not enough to simply present disks to multiple hosts. You aggregate them into storage pools, and then you carve up LUNs and present them to your storage network. Why aren’t we trying to make compute cycles commoditized for generalized workloads, instead of just specific programs written for message-passing interfaces?
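The aggregate-then-partition pattern from the storage world can be sketched in a few lines. This is a toy model only: the `StoragePool` class, its method names, and the disk and LUN sizes are all made up for illustration and don’t correspond to any vendor’s API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StoragePool:
    """Toy model: aggregate raw disks into one pool, then carve LUNs out of it."""
    disks_gb: List[int]
    luns: Dict[str, int] = field(default_factory=dict)

    @property
    def capacity_gb(self) -> int:
        # Aggregation: the pool is the sum of its member disks.
        return sum(self.disks_gb)

    @property
    def free_gb(self) -> int:
        return self.capacity_gb - sum(self.luns.values())

    def carve_lun(self, name: str, size_gb: int) -> None:
        # Partitioning: a LUN is a slice of the pool, not of any one disk.
        if size_gb <= 0:
            raise ValueError("LUN size must be positive")
        if name in self.luns:
            raise ValueError(f"LUN {name!r} already exists")
        if size_gb > self.free_gb:
            raise ValueError("not enough free capacity in the pool")
        self.luns[name] = size_gb


# Hypothetical sizes: three disks, none larger than 1000 GB.
pool = StoragePool(disks_gb=[500, 500, 1000])
pool.carve_lun("db01", 1200)  # bigger than any single member disk
pool.carve_lun("web01", 300)
print(pool.capacity_gb, pool.free_gb)
```

The point of the sketch is the `db01` LUN: it is larger than any one disk, which only works because aggregation happens before partitioning. The compute analogue would be a virtual machine larger than any one physical host.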
Vendors have heavy investments in distributed infrastructures, using tools like VMotion and DRS to balance resource utilization and maximize consolidation ratios. But is this really the optimal approach to this problem? What if you didn’t need to dynamically balance workloads because the hypervisor’s SMP scheduler would do it automatically on an enormous aggregated system? For day-to-day operations (as opposed to offsite migrations, where VMotion can still be rather useful), what if you were able to move virtual machines across an InfiniBand fabric as a simple in-memory copy, rather than sending the entire contents of a virtual machine’s memory over the network? What if all of your virtual page sharing was completely coherent across your virtualized compute grid, and you really could have one single OS instance in memory running your entire infrastructure?
Certainly there are a lot of complications and a lot of engineering in this approach. First, of course, is resiliency and failure isolation: how do you make sure that a single server failure doesn’t bring down every OS instance on the grid that happens to be running tasks on that system’s CPUs? (There are checkpointing approaches for existing large-scale SMP systems, which could probably be applied to the vSMP approach as well; however, this is a pretty academic discussion, and I’m not going to pretend to know how viable it is.) With resiliency in mind, what’s the best way of allocating and distributing resources so that a minimal amount of recovery has to occur in the event of a failure? It’s not useful to recover in this way if it takes longer than a regular clean boot.
This kind of engineering will take a very long time, but I think it’s inevitable. Virtualization vendors have gotten the host resource partitioning part down to the point where I don’t know if anything new can even happen in that space, but there are a lot more exciting things that can happen once the aggregation piece is layered underneath the hypervisor as we know it today.