May 24, 2011
Today I received an email from the Cloud Connect conference soliciting sponsorship. It began with the eye-opening comment, "Clouds are designed to fail - they're made of transient, unreliable components." It then went on to say that at the conference, an analyst would lead a discussion about architecting applications to work around expected failure.
But I'd like to go back and examine the opening comment. Are clouds really designed to fail? If you are a cloud aficionado, as I am, you have been keeping tabs on Amazon's EC2 since it was created, and you know that the statement is correct about EC2. Amazon approached the cloud from a very academic perspective in which, much like a RAID array, the components are assumed to be failure-prone and the only thing that can be guaranteed is the service as a whole (even across multiple datacenters). It is the users' responsibility to architect their application deployment around this principle. So it was no surprise to me when a datacenter-wide failure in EC2 brought down many of its customers.
However, Amazon isn't the only cloud out there, and it no longer defines what a cloud is or is supposed to be, even if it has a dominant market share. EC2's low reliability, low performance, and fend-for-yourself management aren't suitable for many (and I would argue most) cloud customers, and other vendors - including ENKI - have stepped in to offer alternatives. In fact, ENKI's cloud offerings have always been designed to recover from failure automatically, because we know our customers are often not interested in, or capable of, figuring out how to work around built-in failure modes. Just looking at the available technologies for building public or private clouds, CA's AppLogic and VMware have offered automatic failover for a long time, and newer offerings from smaller vendors are starting to offer it as well.
Now, I'll be the first to say that all systems have failure modes and there is no cloud or hosting solution that can offer 100% uptime - despite optimistic advertisements to the contrary. Disks fail, servers fail, SANs fail, networks fail. It's inevitable. However, customers of clouds that have self-healing infrastructure have a choice: accept the vendor's guaranteed per-server/service uptime level as their base infrastructure reliability, or architect a more highly redundant deployment that builds on that base level. For many, the 99.975% to 99.99% uptime that we guarantee at the operating system level is more than adequate for their business, especially considering that the bulk of their downtime is usually due to software issues. On the other hand, customers of clouds that offer guarantees only at the aggregate service level, and not at the operating system/VM level, do not have a choice: they must factor unbounded downtime into their systems architecture planning. And that requires skilled, experienced IT staff and developers, as well as increased complexity and cost for redundant cloud instances.
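To put those percentages in perspective, here is a rough back-of-the-envelope sketch in Python. It converts an uptime guarantee into hours of downtime per year, and estimates what a redundant deployment could add on top of a guaranteed base level. The function names are purely illustrative, and the redundancy figure assumes the servers fail independently of one another and that failover is instantaneous, which is an idealization: real deployments share networks, software, and operators, so actual results will be lower.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(uptime_percent: float) -> float:
    """Hours of downtime per year allowed by a given uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


def redundant_uptime_percent(uptime_percent: float, replicas: int) -> float:
    """Combined uptime of `replicas` servers when any single one is enough.

    Assumption (idealized): failures are independent, so the deployment
    is down only when every replica is down at the same time.
    """
    unavailability = 1 - uptime_percent / 100
    return (1 - unavailability ** replicas) * 100


if __name__ == "__main__":
    for pct in (99.975, 99.99):
        hours = annual_downtime_hours(pct)
        print(f"{pct}% uptime allows about {hours:.2f} hours of downtime per year")

    combined = redundant_uptime_percent(99.975, replicas=2)
    print(f"Two independent 99.975% servers: about {combined:.6f}% combined uptime")
```

By this arithmetic, 99.975% works out to a little over two hours of downtime a year and 99.99% to under one hour. Without a per-server guarantee there is no base figure to plug in at all, which is exactly the planning problem described above.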
All clouds will fail, but the ones designed to stay up will offer a very different customer experience from the ones that are designed to fail.









