VMotion/Live Migration is not an HA feature

I’m a couple of weeks behind the ball here, but I was a bit inspired by this (somewhat controversial) post over at Standalone Sysadmin:

I’m sorry. I know you probably paid a lot for that license, but if your infrastructure is relying on a machine’s ability to transition between VM hosts without rebooting as the crux of your high availability plan, you might want to reconsider.

Yesterday, Rational Survivability (a great all-over-the-place IT blog) had a post titled The Emotion of VMotion. It didn’t occur to me before reading this that my own previous search for a hypervisor that would do live migration was working directly against my own beliefs that uptime should only matter for services. Essentially, the infrastructure should be designed so that a single server down doesn’t contribute to the loss of availability.

That being said, live migration is a neat idea, and eventually it’s going to get to the point that it’s nearly instantaneous. When that happens, failovers will be next to invisible. Maybe we’ll have to reevaluate our approach in that case.

Until then, I read posts from people trying to rely on it to keep their infrastructures up and I worry that their approach is flawed.

Please, build your services for reliability, not just the underlying systems.

Now, I need to preface this by saying that I’m not missing the point of Matt’s post. There are a lot of administrators out there who treat live migration as a panacea for whatever reliability problems ail them. Anyone who has attempted to design real high-availability infrastructures is very aware that application-level clustering is more robust and typically more reliable than OS-level clustering, which is in turn more robust than hypervisor-level clustering. But these features don’t compete with each other. They each function as a different piece of the datacenter puzzle. And as Matt implies, the cost savings aren’t right for everyone, but they are right for some people.

Absolutely, without a doubt, clustered services are a wonderful, great idea — that’s why people have been using them for decades, and continue to use them. And even though VMotion makes it very easy to add some server-level resiliency to any host or service, the application-level clusters are becoming much easier to configure and maintain at the same time, thanks to great configuration management tools like Puppet, Chef, and Cfengine.

But the big picture is the entire ecosystem in which VMotion thrives. The big cost driver for virtualization in large datacenter environments is consolidation, and being able to run multiple workloads on the same piece of physical hardware is only the first step. Consolidation ratios improve substantially when you can transparently load-balance workloads in terms of network traffic, compute power and disk I/O, because you don’t have to worry about a single bottleneck breaking your carefully designed system. In addition to the raw server consolidation gains, you save substantially on engineering effort: there’s a lot less manual labor required to design a viable virtualized infrastructure, and a lot fewer things go wrong if you get the design wrong. And if you require compute capacity on demand (say the majority of your processing occurs during normal business hours and your servers stay mostly idle afterwards), a solution like DRS, through its Distributed Power Management feature, can power down your unused VMware hosts completely until the capacity is needed again.
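To make the consolidation argument concrete, here is a minimal sketch of the arithmetic. It is not how DRS actually places workloads; it uses a plain first-fit-decreasing packing heuristic as a stand-in, and the function name `hosts_needed`, the per-VM load figures and the host capacity of 100 are all made up for illustration.

```python
def hosts_needed(vm_loads, host_capacity):
    """Estimate how many hosts a set of VM loads requires.

    Uses first-fit decreasing: sort loads largest first, then place each
    VM on the first host that still has room. This is an illustrative
    heuristic only, not the algorithm any real hypervisor scheduler uses.
    """
    free = []  # remaining free capacity on each powered-on host
    for load in sorted(vm_loads, reverse=True):
        if load > host_capacity:
            raise ValueError("a single VM exceeds the capacity of one host")
        for i, room in enumerate(free):
            if load <= room:
                free[i] = room - load
                break
        else:
            # No existing host has room, so power on another one.
            free.append(host_capacity - load)
    return len(free)


# Hypothetical CPU demand per VM, as a percentage of one host's capacity.
business_hours = [60, 50, 45, 40, 30, 30, 20]
after_hours = [10, 5, 5, 5, 5, 5, 5]

peak_hosts = hosts_needed(business_hours, host_capacity=100)
idle_hosts = hosts_needed(after_hours, host_capacity=100)
print("hosts needed at peak:", peak_hosts)
print("hosts needed after hours:", idle_hosts)
print("hosts that could be powered down overnight:", peak_hosts - idle_hosts)
```

With these invented numbers, the same seven VMs fit on fewer hosts once demand drops. That only helps if the VMs can be moved between hosts without downtime, which is exactly what live migration provides.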

Sure, this isn’t appropriate for everyone. In a pie-in-the-sky IT infrastructure, grid services would provide uniform access to compute capacity and storage on demand using commodity hardware, like Google or Facebook or other players who rely heavily on things like Hadoop or MapReduce in order to scale their operations. But for most real businesses, which have a real investment in commercial off-the-shelf software like databases, ERP systems, CRM and other necessities, we need hypervisors to abstract away the problem and do the work that the COTS vendors won’t, even if the result isn’t as elegant as it should be. And I’m sure that as the hypervisor marketplace matures and consolidates, VMware, Citrix, Microsoft, Red Hat and other vendors will begin to do things with their platforms that we haven’t even thought of yet. Maybe we’ll see cache-coherent shared-memory virtual infrastructures running over InfiniBand, removing the network overhead that was pointed to as a problem by Rational Survivability. The possibilities are endless.

It seems like in this instance, Matt is railing more against the idea of boot-from-SAN than against VMotion itself, as boot-from-SAN is another way of solving the same problem: it adds resiliency against hardware failure, but not a ton else. In many ways, he’s right: if you neglect your systems documentation and proper server rebuild procedures in favor of a magical black box, your environment will become an unmaintainable mess. It’s the same argument that Luke Kanies has been making about using Puppet or other configuration management systems versus golden master images. In this respect, I think Matt is right to want to know his systems well enough to rebuild them from scratch. It also makes upgrades and other migrations much simpler and smoother.

But every tool is just that: a tool. Tools should be used as tools, and evaluated in terms of how effective they are. You shouldn’t throw away a perfectly good tool because it doesn’t live up to the hype you were promised. You should use it if it delivers a real return on investment.

Posted in Sysadmin.
