It's been a few days now since Amazon's Virginia datacenter failed and took down hundreds of internet companies, including several high-visibility startups and platform-as-a-service providers. There has been a lot of talk about how this failure, which "was never supposed to happen," is a blow to cloud computing. However, I think it's more of a blow to the freewheeling marketing practices for cloud computing, and to those consumers of cloud who thought that it eliminated their responsibility to worry about infrastructure reliability. These words are a bit harsh, I know, but they reflect the reality that many of Amazon's customers have been remiss in managing their IT responsibilities. The silver lining is that I think this event will bring about changes in how cloud is deployed and, more importantly, how it is used, and that those changes will continue to drive its successful adoption.
To review what we know about the event: Amazon's storage subsystem encountered a failure (which I surmise was due to a code update) that caused it to replicate customers' data. The replication slowed storage down until the storage system was full, at which point it failed completely. Somehow this failure crossed multiple "availability zones," which were not supposed to all fail at the same time. With no storage accessible to them, applications failed. Even as late as today, some companies were saying that they were still missing data.
How could this happen? Well, from an infrastructure point of view, the problem is clearly due to monoculture. Monoculture is a term used in agriculture to describe planting your entire farm with the same crop. If an insect that likes that crop attacks your farm, you lose your whole crop. Similarly, Amazon, in order to get good economy of scale and minimize manual labor, has filled its datacenter with the same hardware, hooked up the same way, loaded with the same software, replicated over and over. Introduce a software bug into a release of its infrastructure management systems, and the automated distribution of that release will spread the bug across the entire datacenter - or wider, depending on maintenance policies.
However, this does not explain why so many of Amazon's customers were taken by surprise by the failure. From their expressions of surprise, it's pretty clear that many assumed that Amazon's assurances about the reliability of its infrastructure meant that they wouldn't have any problems. Yet from the first day that Amazon offered its service, it was quite clear that while the service as a whole was likely to remain up almost all the time, individual servers or even availability zones were not guaranteed to be that reliable, and that Amazon expected its customers to engineer around this limitation. Amazon's weak uptime guarantee of three and a half nines, backed by a small percentage discount that users could apply for at the end of the year, communicates even more clearly that it had no plans in place to keep individual servers running at high availability, nor did it plan to shoulder heavy financial responsibility for long outages. Even the introduction of automated failover features did little to improve this, since extremely long provisioning times and questionable availability of resources at peak times meant that customers could still experience noticeable failures.
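To put the uptime guarantee in perspective, here is a rough back-of-the-envelope calculation. It assumes that "three and a half nines" means 99.95% availability measured over a 365-day year; the function name is just an illustrative helper, not anything from Amazon's terms.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime that an availability target permits over a period."""
    return (1 - availability_percent / 100) * period_hours


targets = [
    ("three nines", 99.9),
    ("three and a half nines", 99.95),  # assumed reading of the guarantee
    ("four nines", 99.99),
]

for label, pct in targets:
    hours = allowed_downtime_hours(pct)
    print(f"{label} ({pct}%): {hours:.2f} hours of downtime per year")
```

Under those assumptions, three and a half nines leaves room for a little over four hours of downtime per year, so a guarantee at that level was never a promise of uninterrupted service.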
Of course, nobody expected multiple availability zones to fail at once, so customers whose applications had failover to another zone in the same datacenter expected to enjoy high uptime. But they should have expected a full-datacenter failure. There are many modes of systems failure that can take out an entire state-of-the-art datacenter: accidental destruction of the fiber lines to the datacenter, systemic power failures such as the generator failure a few years ago at 365 Main in San Francisco that took out a raft of popular startups, or a more recent Amazon full-datacenter failure due to a car crashing into a power line. I think the message here is that modern datacenters are reliable, but not infinitely reliable. Similarly, while Amazon's services simplify access to virtual infrastructure, that ease of access masks the fact that it is built out of physical infrastructure with definite vulnerabilities, which have not been automated away as completely as the hassle of setting up your own physical server.
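A simple model shows why failover between zones in one datacenter only helps against failures that are independent. This is a sketch with made-up numbers, not Amazon's actual figures: `zone_availability` and `shared_failure_prob` are hypothetical parameters, and the model assumes the application stays up as long as at least one of two zones is up and no datacenter-wide event is in progress.

```python
def two_zone_availability(zone_availability, shared_failure_prob=0.0):
    """Availability of an application that can run in either of two zones.

    zone_availability: fraction of time a single zone is up (hypothetical).
    shared_failure_prob: fraction of time a datacenter-wide event takes out
        both zones at once (hypothetical).
    """
    # Chance that both zones happen to be down at once for unrelated reasons.
    both_down_independently = (1 - zone_availability) ** 2
    # The application is up only if there is no shared event AND
    # at least one zone is up.
    return (1 - shared_failure_prob) * (1 - both_down_independently)


# Illustrative numbers only.
independent_only = two_zone_availability(0.9995)
with_shared_event = two_zone_availability(0.9995, shared_failure_prob=0.001)

print(f"independent failures only: {independent_only:.6%}")
print(f"with a shared failure mode: {with_shared_event:.6%}")
```

With these illustrative inputs, the second zone pushes availability well above that of a single zone when failures are independent, but even a small probability of a shared, datacenter-wide failure drags the result back below what one zone offered on its own. Guarding against that case means failing over to infrastructure that does not share the same failure modes.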
Puzzlingly, it seems that this simple understanding did not reach the management teams at the companies that suffered the most from last week's outage. This could be because the marketing of cloud seems to follow an unwritten rule that vendors will not discuss drawbacks and vulnerabilities, preferring to emphasize the low prices of cloud computing - which, paradoxically, are enabled by ignoring vulnerabilities that experienced IT shops would not let pass. When pressed, vendors often say that they won't release proprietary information for security purposes or to protect their intellectual property. However, experienced IT folk will know that varying degrees of these limitations are a basic feature of any infrastructure deployment, and must be engineered around at the deployment and application software layers. So the inescapable conclusion is that the management teams at many of Amazon's customers either chose to ignore the problem or felt that the costs of mitigating it were not worth the potentially small improvement in reliability. In other words, there was a gap between what Amazon could reasonably provide in the way of reliability and what these companies needed, and the companies did not fill that gap, whether intentionally or unintentionally. In my next post, I will explore how this has come to be, and how this problem will be solved by a combination of changes in both cloud customers and cloud vendors.