Feb 22, 2012
Why overallocation makes cloud computing services impossible to compare
Posted by: Eric Novikoff
Tagged in: Cloud Usage
Various recent user surveys and performance measurements done on cloud computing systems show great variability over time as well as between services, which affects both the perceived reliability of the system and the effective price paid for cloud computing. The root cause of this performance variation is overallocation of resources. This blog entry will explore what overallocation is and how to minimize its effects.
First of all, let's remember that cloud computing's biggest pluses - savings and scalability - come from the fact that customers are accessing shared resources. Because resources are shared, any bottlenecks in the system will affect all users' performance when the system is stressed. Studies done at the University of Adelaide showed that Amazon's performance can vary by up to 10x over the course of the day, possibly due to demands on the EBS system, or simply overtaxing the gigabit ethernet connections on the individual servers.
Let's look at how this affects pricing. Assume that you are paying $0.10/hr for a cloud instance. If you log in (on Linux), you can use the 'top' command to see the I/O latency, shown as iowait (the "wa" value). Perhaps it's 10%. This would mean that your instance is spending 10% of its time waiting on I/O, which in a cloud that relies on network-attached storage largely means waiting for the network. Only 90% of the promised performance is usable, so you are effectively paying $0.10 / 0.9, or about $0.11/hr, for one promised unit of the instance's performance. However, at 3pm, when all the schoolkids come home and get on multiplayer games, the iowait jumps to 90% (yes, our customers have measured this). Now only 10% is usable, and you are paying $0.10 / 0.1 = $1.00/hr for the instance's promised performance. Perhaps you don't need that performance, but if you do, then you have to compensate by buying 9 more instances, and you will indeed be paying $1.00/hr.
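The arithmetic above can be written down as a small Python sketch. The function names are mine, chosen for illustration, and the model carries the same simplifying assumption as the example: time spent in iowait is treated as entirely lost.

```python
def effective_price(list_price_per_hr, iowait_percent):
    """Hourly price per unit of promised performance actually delivered.

    Assumption (same as the example in the text): time spent in iowait is
    entirely lost, so only (100 - iowait_percent)% of the instance is usable.
    """
    usable_percent = 100 - iowait_percent
    if not 0 < usable_percent <= 100:
        raise ValueError("iowait_percent must be in the range [0, 100)")
    return list_price_per_hr * 100 / usable_percent


def instances_needed(iowait_percent):
    """Whole instances required to obtain one promised unit of performance."""
    usable_percent = 100 - iowait_percent
    if not 0 < usable_percent <= 100:
        raise ValueError("iowait_percent must be in the range [0, 100)")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-100 // usable_percent)


if __name__ == "__main__":
    for iowait in (10, 90):
        price = effective_price(0.10, iowait)
        print(f"iowait {iowait}%: ${price:.2f}/hr effective, "
              f"{instances_needed(iowait)} instance(s) needed")
```

Real workloads rarely lose exactly the iowait fraction, so treat the result as a rough way to compare providers rather than a precise bill.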
The example above covered unintentional (or at least unplanned) network overallocation, but the other three resources that cloud vendors sell - storage, CPU, and RAM - can also be overallocated. Like network bandwidth, storage can be overallocated simply by placing too many demands on the centralized storage system or local disk (depending on the cloud architecture - see my blog on cloud architecture). Network and storage overallocations are hard for the vendor to address rapidly because fixing them requires hardware changes. However, depending on the hypervisor the vendor chooses, both CPU and RAM can be intentionally overallocated for a variety of purposes, usually associated with reducing costs and/or reducing pricing.
Here's how this works. If it supports overallocation, the hypervisor allows setting two allocation parameters: a "limit", or maximum size of the resource, and a "reservation", or minimum size. When a new instance is allocated, it is given the reservation amount, and then as it needs more, its usage can grow up to the limit. This allows many more instances to be allocated - actually over-allocated - on a server than would fit based on the promised instance sizes (the limits). Until the instance actually needs more resources, it operates the same way as if it were not overallocated. But when it asks for more resources and the server has none to spare, the hypervisor has two choices: deny them (which can cause the software to crash or operate very slowly) or move the instance to another server that has the resources available. Moving the instance (or "motioning" it, as some vendors like to say) causes congestion on the network and uses up CPU resources, resulting in slower performance for instances on the respective machines. And since both deciding that the move is necessary (generally by monitoring for a prolonged resource shortage) and doing the move take time, there is a delay between when the resources are needed and when they are available, which may affect some applications, especially those that experience peaky loads.
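To make the reservation/limit mechanism concrete, here is a minimal Python sketch of how many instances land on one server with and without overallocation. It is not any particular hypervisor's API: the `Instance` class, the function names, and the 64 GB host with 2 GB reservations and 8 GB limits are all hypothetical values picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    reservation_gb: int  # guaranteed minimum, handed out at creation
    limit_gb: int        # promised maximum the instance may grow to


def fits(host_capacity_gb, placed, new, overallocate):
    """Admission check for one host.

    With overallocation only reservations are counted; without it, the
    full limit of every instance must fit on the host.
    """
    if overallocate:
        committed = sum(i.reservation_gb for i in placed) + new.reservation_gb
    else:
        committed = sum(i.limit_gb for i in placed) + new.limit_gb
    return committed <= host_capacity_gb


def fill_host(host_capacity_gb, template, overallocate):
    """Place copies of `template` on the host until no more fit."""
    if template.reservation_gb <= 0 or template.limit_gb <= 0:
        raise ValueError("reservation and limit must be positive")
    placed = []
    while fits(host_capacity_gb, placed, template, overallocate):
        placed.append(Instance(template.reservation_gb, template.limit_gb))
    return placed


def overcommit_ratio(host_capacity_gb, placed):
    """Sum of promised limits divided by what the host physically has."""
    return sum(i.limit_gb for i in placed) / host_capacity_gb


if __name__ == "__main__":
    template = Instance(reservation_gb=2, limit_gb=8)  # hypothetical sizes
    for mode in (False, True):
        placed = fill_host(64, template, overallocate=mode)
        print(f"overallocate={mode}: {len(placed)} instances, "
              f"ratio {overcommit_ratio(64, placed):.1f}")
```

With these made-up numbers the overallocated host promises four times the RAM it has, which is fine only as long as most instances stay near their reservation.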
Despite its shortcomings, overallocation is common practice among cloud providers, and is likely responsible for some of the extremely low prices that some providers offer. Whether that overallocation is intentional or the result of overtaxed resource bottlenecks like networking or SAN controllers, performance can vary widely between cloud providers based on hardware design and overallocation policies, yet these policies affect performance in complex ways and are rarely if ever documented by the provider.
How do you avoid suffering the penalties of low performance and excessive cost per unit of usable resources that overallocation can cause? If you choose a cloud provider based on price, you will likely be suffering from overallocation, which requires that you set up an auto-scaling capability in your software as well as in the cloud management system, so that additional instances can be allocated automatically as the load exceeds the available resources, in order to keep performance constant. While this may address performance issues, it will not solve the problem of cost per usable unit of compute. It can also lead to some very complex architectures, which I have seen deployed in Amazon to get around its widely varying performance. In the end you may not see any cost savings from such workarounds, since they inevitably have a cost of implementation and operation.
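A rough Python sketch shows why auto-scaling keeps performance constant but not the bill. Everything in it is an assumption made for illustration: the function names, the 4 units of required capacity, and the daily profile in which instances deliver 90% of their promised performance for 20 hours and 10% for 4 hours.

```python
def instances_for_load(required_units, usable_percent):
    """Instances an auto-scaler must run to deliver `required_units`
    when each instance yields only `usable_percent` of what was promised."""
    if required_units < 0 or not 0 < usable_percent <= 100:
        raise ValueError("need required_units >= 0 and 0 < usable_percent <= 100")
    # Integer ceiling division; always keep at least one instance running.
    return max(1, -(-required_units * 100 // usable_percent))


def daily_cost(required_units, hourly_usable_percent, price_per_hr):
    """Total cost when the instance count is re-evaluated every hour."""
    instance_hours = sum(
        instances_for_load(required_units, p) for p in hourly_usable_percent
    )
    return instance_hours * price_per_hr


if __name__ == "__main__":
    profile = [90] * 20 + [10] * 4      # hypothetical day, one value per hour
    nominal = daily_cost(4, [100] * 24, 0.10)
    actual = daily_cost(4, profile, 0.10)
    print(f"nominal ${nominal:.2f}/day, with auto-scaling ${actual:.2f}/day")
```

The sketch ignores the time it takes new instances to start and the engineering cost of the scaling logic itself, both of which push the real number higher.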
The alternative is to choose a provider that does not intentionally overallocate resources, and addresses performance bottlenecks aggressively.






