Monday, March 10, 2008

Virtualization and consolidation: Less Availability and Less Predictable Performance?

Oh No - say it isn't so! Will virtualization and consolidation put more workloads on the same hardware? When that hardware fails, won't it take out more of your workloads? Remind me, how did consolidation help me? Oh, and with more workloads running on single machines, what happens if all my workloads are busy and growing busier all the time? Won't my performance and throughput go down?

Well, in a nutshell, yes. Yes, that is, if all you do is squeeze workloads onto a smaller number of machines with no thought to performance, availability, and throughput. Even with services such as Amazon's EC2 or AWS, there is currently no published availability plan as of this post on the O'Reilly Radar. I think Bill Hammon's post here also gives a concise view of the problem statement. In Bill's post, he addresses two of the key issues that tend to accompany consolidation: availability and performance. I think there are several others as well, but these make a good start.

The problem is that, as with any new technology, consolidation and virtualization introduce some new challenges. In this case, base provisioning, management of an image catalog, and simple management of virtual images make consolidation look easy. Basically, you transition your current tools to a virtualized environment, and presto! you are well on your way towards cost savings, ease of management, and easy-to-access compute cycles for additional test and development activities.

But most people still manage each of those virtual machines as they would a physical machine, and we run more virtual machines on each physical machine - potentially, say, 10 virtual machines per physical machine. That means we have to manage N + 10 * N machines: N physical hosts plus ten virtual machines on each. Oh, and after consolidation, most people don't throw away the previous machines; they plan to re-use them as their workload grows.

Add to that the fact that most workloads being consolidated today are infrastructure or web services workloads, where the availability story is based on running multiple servers of the same type, with failover handled by something as simple as DNS or a first-to-respond policy. Consolidating all of your scale-out capacity on a few machines means that your scale-out policy is no longer going to work as you expect: the servers that were supposed to cover for each other now fail together.
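To make the management count and the failover problem concrete, here is a minimal Python sketch. The host names, replica names, and both helper functions are hypothetical, made up purely for illustration; the only numbers taken from the discussion above are the N + 10 * N count and the idea of several identical servers backing each other up.

```python
def managed_machines(physical_hosts: int, vms_per_host: int = 10) -> int:
    """Everything that needs managing: N hosts plus vms_per_host * N guests."""
    return physical_hosts + vms_per_host * physical_hosts


def surviving_replicas(placement: dict[str, str], failed_host: str) -> int:
    """Number of service replicas still running after one physical host fails."""
    return sum(1 for host in placement.values() if host != failed_host)


# Hypothetical placements of four identical web server replicas.
# Before consolidation: one replica per physical machine.
before = {"web1": "hostA", "web2": "hostB", "web3": "hostC", "web4": "hostD"}
# After a naive consolidation: every replica is a guest on the same host.
after = {"web1": "hostA", "web2": "hostA", "web3": "hostA", "web4": "hostA"}

print(managed_machines(5))                  # 5 hosts + 50 guests to manage
print(surviving_replicas(before, "hostA"))  # DNS failover still has targets
print(surviving_replicas(after, "hostA"))   # nothing left to fail over to
```

The point of the second function is simply that a scale-out availability story depends on where the replicas land, not just on how many of them there are.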

On the performance side of things, you have something of the opposite problem. Where previously you would have capacity for anywhere from twice to ten times (or more) the typical average workload, and sometimes as much as twice the capacity needed for peak performance, you now have the case where the aggregate of your ten virtual machines may run at 50% utilization on average, with peak capacity still only twice your average. But now, when two or three or four virtual machines are using something well in excess of their average, the performance of the entire set of virtual machines may be impacted. Today, most workloads which are being consolidated don't find that limit to be too onerous, but as more critical workloads move into the consolidated environment, the risk of oversubscribed physical machines will grow.
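A rough back-of-the-envelope sketch of that headroom problem, in Python. The ten virtual machines and the 50% aggregate average come from the paragraph above; the even split of 5% per virtual machine, the burst multiplier of four, and the function names are assumptions chosen only to illustrate the arithmetic.

```python
HOST_CAPACITY = 100  # percent of one physical machine
VM_COUNT = 10
VM_AVERAGE = 5       # assumed: each VM averages 5%, so the aggregate is 50%


def host_demand(bursting_vms: int, burst_multiplier: int) -> int:
    """Total demand (percent of host) when some VMs run above their average."""
    quiet_vms = VM_COUNT - bursting_vms
    return quiet_vms * VM_AVERAGE + bursting_vms * VM_AVERAGE * burst_multiplier


def is_oversubscribed(bursting_vms: int, burst_multiplier: int) -> bool:
    """True when the combined demand exceeds what the physical host can deliver."""
    return host_demand(bursting_vms, burst_multiplier) > HOST_CAPACITY


# Assumed burst: a busy VM uses four times its average.
for busy in range(5):
    print(busy, host_demand(busy, 4), is_oversubscribed(busy, 4))
```

Under these assumptions three busy virtual machines still fit (95% of the host), but the fourth pushes demand to 110%, and at that point every guest on the machine pays for it, not just the busy ones.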

So how do you combat these trends? Consolidation, despite these challenges, is still a huge potential money saver. One of the comments in the O'Reilly Radar link above suggested one solution: pair your internal infrastructure with a hosting service like Amazon's, and provide availability that way. Clever: although your degraded mode of operation may not be as predictable in terms of responsiveness and performance, at least you have a degraded mode of operation. Another solution is to build an HA solution into your virtual machines and manage your HA yourself. Of course, if all of Amazon's EC2 (or any other provider) is down, that doesn't help you much.

A better solution might be to look at the problem in terms of Cloud Computing (yeah, you saw this coming, I bet! :-). Within a cloud, you should conceptually be able to say "instantiate my virtual machine, and I'd like some of these properties." "These properties" might include some level of HA, some level of performance/throughput guarantee (an SLA), some level of backup, maybe even some concept of energy management/efficiency. So, if you have your own in-house cloud and a catalog of virtual machine images that you typically deploy, you would be able to specify your desired level of HA - how highly available and at what cost; your desired performance - what cap on machine resources you want, if any, or some level of guarantee of end-to-end throughput or latency; whether you could contribute to your data center's "green" goals by using energy-efficient hardware resources, etc.
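As a purely hypothetical illustration of what "instantiate my virtual machine, and I'd like some of these properties" could look like, here is a small Python sketch. None of these names or fields come from any real cloud API; they are invented here to mirror the properties listed above (HA level, resource cap, latency target, backup, energy efficiency).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VmRequest:
    """Hypothetical request: an image from the catalog plus desired properties."""
    image: str                               # name of an image in the catalog
    availability_nines: int = 2              # desired HA level, e.g. 4 for "4 nines"
    cpu_cap_percent: Optional[int] = None    # cap on machine resources, if any
    max_latency_ms: Optional[int] = None     # end-to-end latency target, if any
    backup: bool = False                     # include this VM in backups
    prefer_energy_efficient: bool = False    # place on "green" hardware if possible

    def __post_init__(self) -> None:
        # Reject requests that cannot be meaningful before they reach a scheduler.
        if self.availability_nines < 1:
            raise ValueError("availability_nines must be at least 1")
        if self.cpu_cap_percent is not None and not 1 <= self.cpu_cap_percent <= 100:
            raise ValueError("cpu_cap_percent must be between 1 and 100")


# A made-up request for a web server image with a four-nines HA target.
request = VmRequest(image="web-frontend", availability_nines=4,
                    cpu_cap_percent=25, backup=True)
print(request)
```

The idea is only that the person deploying the image states what they want, and the machinery behind the request decides how to deliver it.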

And, part of the point of a good "Cloud" implementation would be that it would handle these details for you. The best practices of HA would be patterns that could be deployed based in part on a high level specification such as "4 nines" or "5 nines" of availability, or performance could be managed by a workload manager actively migrating tasks within your pool of resources as needed. And no, that isn't technology that is 5 years out... Cloud Computing is a potentially disruptive technology in development right now...
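For a sense of what a specification like "4 nines" or "5 nines" actually asks for, here is a small Python sketch of the standard arithmetic: n nines means availability of 1 - 10^-n, so the allowed downtime is the remaining fraction of the year. The function name is made up for illustration, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability target of `nines` nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return MINUTES_PER_YEAR / 10 ** nines


for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes of downtime per year")
```

Four nines leaves a bit under an hour of downtime per year and five nines a little over five minutes, which is why you would want the HA patterns behind those numbers deployed for you rather than hand-built for every virtual machine.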


