Mar 07, 2012
The Straight Dope About Cloud Downtime and the Myth of Perfection
Posted by: Eric Novikoff
Tagged in: Commentary
In the last 10 days, ENKI experienced two downtimes that seriously impacted some of our customers. What the two failures had in common is that they were due to unavoidable single points of failure in ENKI's infrastructure, points of failure that we had no control over. ENKI, like all top-tier cloud providers (and colo/datacenter providers), has taken painstaking care to remove all possible single points of failure in its infrastructure, from the power coming into the datacenter to the servers that run our customers' applications. In fact, ENKI has redundant power, cooling, security, storage, networking equipment, bandwidth links and service providers, servers, and interconnect. At first blush, it looks like nothing can fail without a backup system taking over.
However, there are still single points of failure lurking in the modern datacenter, and there is precious little that service providers can do about them. Customers CAN do something about them, and I'll get to that later. The source of those single points of failure is software. Not the customer's software (though that is the most common cause of downtime) but rather the software that is used to provide the cloud services. And when it fails, it is the service provider's worst nightmare: watching the customers you've labored alongside to grow their businesses suffer terribly, while there is little you can do besides working with your vendors to find a way around the defective software.
Eight days ago, one of our Juniper routers failed due to a known bug that Juniper had incorrectly categorized as "noncritical". The bug caused the router to go down without informing the paired redundant router that a failure had occurred. So no failover occurred, and as a result one of ENKI's datacenters was cut off from the internet. Lest you think this is a rare problem, just Google "Amazon router bug" and you'll find a few similar occurrences that have taken down Amazon's services. The search works with other major providers as well. They all rely on software that is inside hardware provided by major networking vendors, and that software is not perfect - as no software is. Because of the nature of the failure we experienced, the problem looked like a bandwidth provider issue, since our "failed" router was still active and passing some traffic. It took a long time to diagnose and repair. It was awful.
Today, ENKI experienced another problem in which our VMware hypervisor and manager (vSphere) crashed in one of our clusters, rendering some customers' machines unusable. VMware confirmed that this is an as-yet unrepaired bug. Even worse, to reset vCenter, VMware recommended that every server in the cluster be restarted so that all the v-switches would be reinitialized, which caused each customer in the cluster to experience a short downtime. Our vCenter management nodes are redundant, the databases under them are redundant, and of course our servers are redundant. But if the software fails to recognize its own failure as an event that it has to respond to, the failure will go unaddressed. This kind of failure is similar to the storage infrastructure failure that Amazon experienced a year ago, in which data was lost, or the Rackspace failure of two years ago. VMware is the premier virtualization system on the planet and well ahead of its competition - but not perfect. Today's downtime for the worst-affected customers continued even after VMware was fixed, since the crash triggered a bug in our Oracle SAN that made the LUNs that were open at the time of the crash unusable. Oracle had no idea how to repair the problem, and we finally failed over to the backup SAN - against their recommendation - to fix it. Once again, it was awful.
So despite our best efforts, we - ENKI and other cloud or colocation service providers - cannot in good conscience guarantee 100% uptime, or even anticipate all failures, because of the software bugs lurking in our management systems and datacenter hardware. We could offer a warranty against such failures, but that wouldn't stop them from happening, and even worse it would give our customers a false sense of security! In general, the industry's record is pretty good, and ENKI's record is better than average. But we are not immune to failures because, at the end of the day, every provider relies on software that represents a single point of failure. This will probably be the case for a long time, especially given the rapid pace of innovation in cloud, networking, and storage software, which introduces bugs.
What can cloud customers do if they need more reliability than service providers can offer? Many customers react to downtimes like this by considering building their own infrastructure. However, it will rely on the same technologies that the service providers use, and will be subject to the same failures - not to mention that the customer has to incur tremendous costs to go it alone and often can't afford the high-availability storage and networking that service providers use. Another strategy - moving to another service provider - seems like it will fix any recurring problems, but only until the next service provider encounters a bug.
There are two solutions customers can implement. The first is to make their software as restartable as possible. Suppose one of your servers goes down, taking the application with it. Then the cloud management system restarts it (assuming you don't have ephemeral instances like those in Amazon). Will your application come up? If not, you'll have a downtime much longer than the time your server was down. It pays to test the restartability of your servers before they fail. The downtimes experienced by many popular websites after a major outage have generally been 2-10x the length of the outage due to restartability issues.
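One way to test restartability is to restart a server on purpose and time how long the application takes to answer again. The Python sketch below is only an illustration of that idea, not a description of ENKI's tooling: the function names, the timeout values, and the health-check URL in the usage note are all assumptions made up for the example.

```python
import time
import urllib.request
from typing import Callable, Optional


def http_probe(url: str, timeout_s: float = 3.0) -> Callable[[], bool]:
    """Build a probe that returns True when `url` answers with HTTP 200.

    The URL is whatever health endpoint your application exposes
    (hypothetical here; the original post does not name one).
    """
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    return probe


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `probe` until it succeeds.

    Returns the seconds elapsed until the first success, or None if the
    application did not come back within `timeout_s`.
    """
    start = clock()
    while True:
        try:
            if probe():
                return clock() - start
        except Exception:
            # A probe that raises (connection refused, timeout, HTTP error)
            # simply means "not up yet".
            pass
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

In a restart drill you would reboot the server, then call something like `wait_until_healthy(http_probe("http://app.example.internal/health"))` and compare the result with how long the server itself was down. A result of `None`, or a recovery time far longer than the reboot, points to a restartability problem worth fixing before a real outage.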
The next and even more effective solution is diversifying critical applications across clusters. Every cloud provider or colo provider provisions their systems in "clusters", sometimes called "availability zones" or other terms. These clusters generally have separate storage and networking (though in the case of Amazon, what we saw last year was that storage was shared across the entire datacenter, crossing availability zones). By placing active/active or active/standby application deployments in different clusters, you can get the redundancy that allows you to keep running past any single point of failure, since it's highly unlikely that even the same piece of software will fail at the same time in an unrelated infrastructure cluster. With clusters hosted in the same location, the second copy of your app will easily be able to keep up with the primary one. Over a longer distance, you will need to make provisions for the delay in keeping the two copies in sync. The unfortunate side effect of this approach is cost, but it is still less than it would take to build two physical infrastructure locations from scratch. It also may require adapting or changing your software to allow for active/active or active/standby site pairing, and even potentially adding some global load balancing to use both sites at once. Whether you're hosted at ENKI, Amazon, or in your own infrastructure, this is the only way you can get past the reliability barrier of 3.5-4 nines that today's infrastructure tops out at.
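To put rough numbers on that reliability barrier, here is a small Python sketch of the arithmetic. It rests on assumptions that are not in the post itself: "3.5 nines" is read as 99.95% and "4 nines" as 99.99% (the common colloquial meaning), the clusters are assumed to fail independently of each other, and failover between them is assumed to be instant and always successful. Real deployments will fall short of this ideal, for example when clusters turn out to share storage.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availability: float, copies: int = 2) -> float:
    """Availability of an app that stays up while at least one copy is up.

    Assumes every copy has the same availability and that copies fail
    independently, which is an idealization.
    """
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    for label, a in [("3.5 nines (99.95%)", 0.9995), ("4 nines (99.99%)", 0.9999)]:
        single = downtime_hours_per_year(a)
        paired = downtime_hours_per_year(combined_availability(a, copies=2))
        print(f"{label}: one cluster {single:.2f} h/year, "
              f"two clusters {paired * 3600:.1f} s/year")
```

Under these assumptions a single cluster at 99.95% is down about 4.4 hours a year, while an application spread across two such clusters would in theory be down only for seconds. The gap between that theoretical figure and what you actually achieve comes down to how independent the clusters really are and how well your failover works.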
If you're contemplating deploying a high availability application, please contact us to talk about how we can assist you with it.






