Enki Blog

Managed Cloud Blog

Mar 07
2012

The Straight Dope About Cloud Downtime and the Myth of Perfection

Posted by: Eric Novikoff

Tagged in: Commentary

In the last 10 days, ENKI experienced two downtimes that seriously impacted some of our customers. What was common to the two failures is that they were due to unavoidable single points of failure in ENKI's infrastructure, points of failure that we had no control over. ENKI, like all top-tier cloud providers (and colo/datacenter providers), has taken painstaking care to remove all possible single points of failure in its infrastructure, from the power coming into the datacenter to the servers that run our customers' applications. In fact, ENKI has redundant power, cooling, security, storage, networking equipment, bandwidth links and service providers, servers, and interconnect. At first blush, it looks like nothing can fail without a backup system taking over.

However, there are still single points of failure lurking in the modern datacenter.  And there is precious little that service providers can do about it.   Customers CAN do something about it, and I'll get to that later.   The source of those single points of failure is software.  Not the customer's software (though that is the most common cause of downtime) but rather software that is used to provide the cloud services.  And when it fails, it is the service provider's worst nightmare: watching the customers at whose side you've labored to help them grow their businesses, suffering terribly while there is little you can do besides working with your vendors to figure out a way to work around the defective software.

Eight days ago, one of our Juniper routers failed due to a known bug that Juniper had incorrectly categorized as "noncritical". The bug took the router down without informing the paired redundant router that a failure had occurred, so no failover took place and one of ENKI's datacenters was cut off from the internet. Lest you think this is a rare problem, just Google "Amazon router bug" and you'll find a few similar occurrences that have taken down Amazon's services. The search works with other major providers as well. They all rely on software that is inside hardware provided by major networking vendors, and that software is not perfect - as no software is. Because of the nature of the failure we experienced, the problem looked like a bandwidth provider issue, since our "failed" router was still active and passing some traffic. It took a long time to diagnose and repair. It was awful.
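To make the failure mode concrete, here is a minimal sketch of the general idea of treating silence as failure, rather than waiting for a device to announce that it has failed. It is not how Juniper's redundancy protocol works, and the names `HeartbeatMonitor` and `should_take_over` are hypothetical. It also assumes the heartbeat is derived from an end-to-end check (for example, a probe that actually crosses the router), since a device that is "still active and passing some traffic" may keep answering simpler checks.

```python
import time


class HeartbeatMonitor:
    """Standby-side view of a peer: a missed deadline counts as failure."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = clock()

    def record_heartbeat(self):
        # Call this only when the peer has proven it is doing useful work,
        # e.g. after an end-to-end probe through it has succeeded.
        self.last_seen = self.clock()

    def peer_is_down(self):
        # No explicit "I failed" message is required from the peer.
        return self.clock() - self.last_seen > self.timeout_s


def should_take_over(monitor, peer_reported_failure):
    # Fail over on either signal; the second one covers a silent failure.
    return peer_reported_failure or monitor.peer_is_down()
```

The design choice is that the standby never depends on the failing component to report its own condition, which is exactly the dependency that broke in the incident described above.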

Today, ENKI experienced another problem in which our VMWare hypervisor and manager (VSphere) crashed in one of our clusters, rendering some customers' machines unusable. VMWare confirmed that this is an as-yet unrepaired bug. Even worse, to reset VCenter, VMWare recommended every server in the cluster be restarted so that all the v-switches would be reinitialized, which caused each customer in the cluster to experience a short downtime. Our VCenter management nodes are redundant, the databases under them redundant, and of course our servers are redundant. But if the software fails to recognize its own failures as an event that it has to respond to, the failure will go unaddressed. This kind of failure is similar to the storage infrastructure failure that Amazon experienced a year ago, in which data was lost, or the Rackspace failure of two years ago. VMWare is the premier virtualization system on the planet and well ahead of its competition - but not perfect. Today's downtime for the worst affected customers continued even after VMWare was fixed, since the crash triggered a bug in our Oracle SAN, causing it to make the LUNs open at the time of the crash unusable. Oracle had no idea how to repair the problem, and we finally failed over to the backup SAN - against their recommendation - to fix the problem. Once again, it was awful.

So despite our best efforts, we - ENKI and other cloud or colocation service providers - cannot in all honesty guarantee 100% uptime, or even anticipate all failures, because of the software bugs lurking in our management systems and datacenter hardware. We could offer a warranty against such failures, but that wouldn't stop them from happening, and even worse it would give our customers a false sense of security! In general, the industry's record is pretty good, and ENKI's record better than average. But despite our best efforts, we are not immune to failures because at the end of the day, every provider relies on software that represents a single point of failure. This will probably be the case for a long time, especially due to the rapid pace of innovation in cloud, networking, and storage software, which introduces bugs.

What can cloud customers do if they need more reliability than service providers can offer? Many customers react to downtimes like this by considering building their own infrastructure. However, it will rely on the same technologies that the service providers use, and will be subject to the same failures - not to mention that the customer has to incur tremendous costs to go it alone and often can't afford the high-availability storage and networking that service providers use. Another strategy - moving to another service provider - seems like it will fix any recurring problems, but only until the next service provider encounters a bug.

There are two solutions customers can implement.  The first is to make their software as restartable as possible.  Suppose one of your servers goes down, taking the application with it.  Then, the cloud management system restarts it (assuming you don't have ephemeral instances like those in Amazon.)  Will your application come up?   If not, you'll have a downtime much longer than the time your server was down.  It pays to test the restartability of your servers before they fail.  The downtimes experienced by many popular websites after a major outage have generally been 2-10x the length of the outage due to restartability issues.
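A restart drill along these lines can be sketched in a few lines. This is an illustration under assumptions, not a prescribed tool: `measure_recovery` is a hypothetical helper, and `restart` and `is_healthy` are callables you would supply yourself (for instance, a command that restarts your server and an HTTP check against your application).

```python
import time


def measure_recovery(restart, is_healthy, timeout_s=300.0, poll_s=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Restart a service and return the seconds until it reports healthy.

    Returns None if the service is still unhealthy after timeout_s.
    """
    started = clock()
    restart()
    while clock() - started <= timeout_s:
        if is_healthy():
            return clock() - started
        sleep(poll_s)
    return None
```

Running this regularly, and comparing the result with the time the server itself takes to come back, shows how much of a future outage would be caused by the application rather than by the infrastructure. A result of None means the application did not come back on its own within the timeout and needs manual steps that should be automated before the next real failure.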

The next and even more effective solution is diversifying critical applications across clusters. Every cloud provider or colo provider provisions their systems in "clusters", sometimes called "availability zones" or other terms. These clusters generally have separate storage and networking (though in the case of Amazon, what we saw last year was that storage was shared across the entire datacenter, crossing availability zones). By placing active/active or active/standby application deployments in different clusters, you can get the redundancy that allows you to keep running past any single point of failure, since it's highly unlikely that even the same piece of software will fail at the same time in an unrelated infrastructure cluster. With clusters hosted in the same location, the second copy of your app will easily be able to keep up with the primary one. Over a longer distance, you will need to make provisions for handling delays in concurrency. The unfortunate side effect of this approach is cost, but it is still less than it would take to build two physical infrastructure locations from scratch. It also may require adapting or changing your software to allow for active/active or active/passive site pairing, and even potentially adding some global load balancing to use both sites at once. Whether you're hosted at ENKI, Amazon, or in your own infrastructure, this is the only way you can get past the reliability barrier of 3.5-4 nines that today's infrastructure tops out at.
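The arithmetic behind this claim can be sketched as follows. Two assumptions are mine, not the post's: "3.5-4 nines" is read as 99.95% to 99.99% availability, and failures in the two clusters are treated as statistically independent, which is the best case (shared storage or a shared software bug would make the real figure worse).

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single, copies=2):
    """Availability of N independent copies where any one copy suffices."""
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability):
    return (1.0 - availability) * HOURS_PER_YEAR


for single in (0.9995, 0.9999):
    paired = combined_availability(single)
    print(f"{single:.2%} alone: {downtime_hours_per_year(single):.2f} h/yr; "
          f"two clusters: {downtime_hours_per_year(paired) * 3600:.1f} s/yr")
```

Under those assumptions a single 99.95% cluster is down for a little over four hours a year, while a pair is down only when both fail at once, which works out to seconds per year. The failover mechanism itself also has to work, so treat the paired figure as an upper bound.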

If you're contemplating deploying a high availability application, please contact us to talk about how we can assist you with it.

Sep 01
2011

Report From VMWorld: is the cloud industry getting ahead of itself?

Posted by: Eric Novikoff

Tagged in: Commentary

This week's VMWorld conference was a bit of a surprise to me. Held in Las Vegas, the expo was considerably smaller than previous years in San Francisco, perhaps due to lower marketing budgets (or higher costs of attendance), since vendors' booths were smaller and less ambitious than in previous years. But what was more interesting was the limited commercial focus of the event: provisioning. This year's exhibitor focus was on getting VMs into the cloud with a wealth of provisioning systems including VMWare's new version of VCloud Director and an updated VSphere suite, but also many third-party provisioning tools. And with all that provisioning, lots of new storage and networking capability will be needed, so there are plenty of hardware vendors selling servers, storage, and network gear. Storage, in particular, is taking the spotlight as public and private cloud providers are discovering that existing storage systems are not up to the task of serving the demands of a virtualized infrastructure loaded with a wide variety of applications. And to a lesser degree there was a focus on managing increasing quantities of virtual machines and storage, as both enterprises and cloud providers are seeing that "virtual sprawl" can turn into "cloud sprawl".

However what was more interesting was what was *missing*: innovation about what to do with all those VMs once they were deployed.   The problem of provisioning is essentially solved; lots of software exists to allow users to create new VMs on demand (even if it's still basically Beta software!).   Lots of hardware exists to facilitate that.   There are still horrific problems with scalability that only the largest cloud providers have by and large solved, but it is now only a matter of time until they are solved.   There's lots of innovation in storage and networking coming to market to solve them.  However, making those VMs useful is still an open field.

This issue is the one that I believe will define the next year of progress in Cloud Computing. The fundamental need of cloud users is to run applications with acceptable performance and uptime, and very low management effort. This is where the next wave of innovation will be focused. These products will improve productivity for all concerned. From the user's point of view, this will look like an evolution of VM provisioning into platform-as-a-service, with many more options available for deploying applications rather than just empty or "golden" VMs. For the cloud provider - internal or public - it will look like tools that make it easier to get customers what they want quickly, and keep them running without downtime. In particular, the current PaaS offerings - highly integrated but very inflexible and suffering from vendor lock-in - will be replaced with a more flexible set of tools that provision customers' VMs and multi-VM applications based on templates managed by vendors, but customized by the end-customer. In addition, as familiarity with application deployment and management in the cloud builds within the industry, cloud frameworks and management tools will offer standard options for application-dependent auto-scaling, disaster recovery, version updating, and failure response.

While this will be a many-year journey, I think the challenges that will be faced by cloud users and providers alike as enterprises start to move more mission-critical applications into the cloud will drive significant innovation and move the level of cloud services ever closer to true Virtual IT. On the other hand, reports from the field are that larger enterprises are still struggling with virtualization and not moving to the cloud as fast as the analysts are reporting - so they too are looking for more useful, integrated cloud services... in other words, Virtual IT. Since this is ENKI's vision, you can count on us being there with best-in-breed tools, a continued emphasis on a rich relationship with our customers, and some surprises that we're working on to help our customers make the most of their virtual infrastructure.

Aug 24
2011

Is Cloud Hype Beneficial?

Posted by: Eric Novikoff

Tagged in: Commentary

In his recent blog post, "Don't dismiss cloud computing hype; creative fog is what makes cloud work", Kevin Fogarty of ITWorld asserts that cloud hype is actually beneficial since it stimulates innovation and adoption by driving providers to stretch capabilities and customers to try new things (a short paraphrase.)   While this would be completely true in an academic environment in which some new theory or knowledge domain was being debated and developed, there are negative consequences in the real world for both providers and customers of cloud.

Being on the frontlines of delivering cloud as a managed cloud service provider, I feel compelled to take the customers' side in this. Sure, we as a provider have benefited from the hype, because it has brought us customers who thought we were the El Dorado of IT services, a land in which your every IT wish would come true (or at least the wishes made possible by all the hype). However, Kevin makes the implicit point in his article that the hype is good because potential cloud customers are too well-considered and conservative to fall for it, so it really benefits the cloud ecosystem by fostering innovation. My experience is that such customers are few and far between - even IT management in larger enterprises is often squeezed and troubled enough by their current situation that they will grasp at solutions that might seem like hype to those more well-considered, and cloud buyers at smaller enterprises and especially entrepreneurs often think the hype is all completely real. This puts us - as an infrastructure and platform cloud provider - in the position of having to compete with imaginary products like cloud services that are up 100% of the time. Sure, it drives our innovation to achieve that service level, but what do we say to someone who thinks it's currently possible for any random cloud deployment? In our case, we choose to educate as part of our sales process, even if the education process disappoints the customer or we find that they are best served by going elsewhere.

So it makes me wonder what happens to cloud customers who believed the hype and chose a provider that didn't bother to tell them that it was hype. The results can't be pretty, as we saw a few months ago when Amazon had a data center failure that only surprised those who believed the hype, rather than reading Amazon's actual service level agreement. So there lies the danger of the hype: it serves both the clients and the providers of cloud by building a shared delusion that allows them to achieve their business aims - until that delusion is pierced by reality.

This is very much the same situation as the '90s in which large enterprise suites were sold (and still are) with the hype that they will actually make the client company successful, instead of the all-too-common result of entrapping it in an endless miasma of deployment headaches. There was substance behind enterprise suites, and they didn't fade, but they did disappoint large numbers of buyers who are now being experimented on by SaaS companies instead, though this time around the challenges are different, resulting more from integration issues than configuration issues.

As the holder of our customers' IT responsibilities, we take them very seriously and are looking - much like the much-maligned IT departments Kevin compares cloud to - to reduce their risks and eliminate the effects of the ecosystem of experimentation that he is praising. Analysts are always talking about the watershed that will get large enterprises and those that think like them to move important parts of their IT to the cloud: perhaps it is realizing that cloud vendors aren't going to experiment on them with hyped features that will help them cross the chasm.

May 24
2011

Are clouds designed to fail?

Posted by: Eric Novikoff

Tagged in: Commentary

Today I received an email from the Cloud Connect conference soliciting sponsorship.  It began with the eye-opening comment, "Clouds are designed to fail - they're made of transient, unreliable components."  It then goes on to say that at the conference, an analyst will lead a discussion about architecting applications to work around expected failure.

But I'd like to go back and examine the opening comment.  Are clouds really designed to fail?   If you are a cloud aficionado as I am, you have been keeping tabs on Amazon's EC2 since it was created, and you'd know that the statement is correct about EC2.   Amazon approached the cloud from a very academic perspective, in which - much like a RAID array - the components are assumed to be failure prone and the only thing that can be guaranteed is the service as a whole (even across multiple datacenters) and that it is the users' responsibility to architect their application deployment around this principle.   So it was no surprise to me when a datacenter-wide failure in EC2 brought down many of its customers.  

However, Amazon isn't the only cloud out there, and it isn't even defining what cloud is or is supposed to be anymore, even if it has a dominant market share. EC2's low reliability, low performance, and fend-for-yourself management aren't suitable for many (and I would argue most) cloud customers, and other vendors - including ENKI - have stepped in to offer alternatives. In fact, ENKI's cloud offerings have always been designed to recover from failure automatically because we know our customers are often not interested in or capable of figuring out how to work around built-in failure modes. Just looking out at the available technologies for building public or private clouds, CA's Applogic and VMWare have offered automatic failover for a long time, and newer offerings from smaller vendors are starting to offer it as well.

Now, I'll be the first to say that all systems have failure modes and there is no cloud or hosting solution that can offer 100% uptime - despite optimistic advertisements to the contrary. Disks fail, servers fail, SANs fail, networks fail. It's inevitable. However, customers of clouds which have self-healing infrastructure have a choice: accept the vendor's guaranteed per-server/service uptime level as their base infrastructure reliability, or architect a more highly redundant deployment that can build on that base level. For many, the 99.975% to 99.99% uptime that we guarantee at the operating system level is more than adequate for their business, especially considering that the bulk of their downtime is usually due to software issues. On the other hand, customers of clouds that offer guarantees only at the aggregate service level and not at the operating system/VM level do not have a choice: they must factor unbounded downtime into their systems architecture planning. And that requires skilled, experienced IT staff and developers, as well as increased complexity and cost for redundant cloud instances.
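To put those guarantee levels in perspective, the sketch below converts them into minutes of downtime per year and then stacks a customer's own software on top. The 99.9% application figure is purely hypothetical, chosen only to illustrate the point about software issues, and multiplying the layers assumes their failures are independent.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability):
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*layers):
    """Availability of a stack in which every layer must be up at once."""
    total = 1.0
    for layer in layers:
        total *= layer
    return total


APP_AVAILABILITY = 0.999  # hypothetical figure for the customer's own software

for infra in (0.99975, 0.9999):
    stack = serial_availability(infra, APP_AVAILABILITY)
    print(f"infrastructure {infra:.3%}: "
          f"{downtime_minutes_per_year(infra):.0f} min/yr alone, "
          f"{downtime_minutes_per_year(stack):.0f} min/yr with the application")
```

With these numbers the guaranteed infrastructure level accounts for roughly one to two hours a year, while the hypothetical application layer adds several more, which is why the base guarantee is adequate for many businesses.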

All clouds will fail, but the ones designed to stay up will offer a very different customer experience from the ones that are designed to fail.

Apr 28
2011

The Cloud Ecosystem's Conspiracy of Silence

Posted by: Eric Novikoff

Tagged in: Commentary

After last week's meltdown at Amazon, a lot of people (including me) are talking about what needs to change in cloud computing to provide users with a greater degree of confidence in cloud and the vendors that provide it.   So far, I have focused on the customers of cloud as having a great deal of influence over the levels of service they experience, since ultimately they have the power whether it is in how they use the service or which provider they choose.  The more informed they are, the better the cloud ecosystem will perform.

However, there is a major problem with this approach (nothing's simple, right?)  Customers can relatively easily inform themselves (or hire employees, consultants, or professional services) to help them make the best use of cloud services.  However, what they cannot easily do is find out all the limitations, gotchas, constraints, tradeoffs, and misrepresentations that cloud vendors suffer from, much like the colo and hosting vendors have for over a decade.   For example, Amazon is essentially a "black box" so even a very IT-literate customer can't effectively engineer around their limitations.   And Amazon has a very supportable position in not releasing all its secrets to the world for fear of losing its competitive edge.  This obfuscation in the name of protection of intellectual property and market position goes well beyond Amazon, however.  Based on my experiences, every cloud vendor and managed services provider also is part of this conspiracy of silence.  And amazingly, what I hear from investors and analysts about problems with various cloud technologies is not published on the internet or spoken of in conferences.  There are many reasons for this, not the least of which are gag clauses that equipment and software vendors write into their contracts.

For example, here at ENKI we use a number of branded products and technologies to provide our services.    The sales/service/evaluation contracts we signed with these vendors specifically prevent us from sharing lists of bugs, performance analyses, or other damaging information about their products with the public.   Again, it's quite reasonable that vendors we do business with be protected from incorrect or malicious information or misinformation about their products.  However, if their products have serious flaws - something that is going to be universally true about anything that is new or changing rapidly - then our customers have a right to know the risks they are taking with using those products in our services.  But we have no way to let them know except to call them individually.   Our only choice is to stop or avoid using the vendor's product, which is something we have done a few times in our history.  We can't even announce why we are discontinuing the use of the product, according to most of our contracts.

Every day, I see ENKI's competitors touting this or that product or technology, many of which we have already discarded as fatally flawed from the perspective of reliability, security, or usability, yet we cannot have an open discussion about them on a public forum (though we're happy to do so individually!). And I think about the hapless customers signing up to use their services, only to face business-critical limitations later. Just today I got three marketing emails in my inbox from competitors using software systems to provide cloud services that suffer from horrific bugs.

While all this secrecy is understandable, and for the moment legally correct and enforceable, it is a disservice to the cloud-using and cloud-selling community.   You can't really choose a technology based on only its positive features!  Let's face it: every piece of software and hardware has flaws, but the ones that persist still offer enough value to keep people using them.  This isn't just true of cloud, but many IT products, in particular large, expensive software systems (which I'll also have to let remain unnamed) suffer from long-running outstanding bugs and terrible service.   I only see two ways out of this dilemma: either something blows up as it did last week, or software/cloud/hardware vendors permit and encourage a more open dialog as many Web2.0 companies have bravely begun doing. 

What can you, the cloud/IT/software-buying public, do about it? Not a lot, but you can start by letting go of the expectation of perfection, which drives vendors to try to hide bugs and problems. An easy way to do this is to look at your vendor from a relationship point of view: when the inevitable problems crop up, are they willing and able to respond? This is no more - and no less - than you'd expect from your own IT department, right?

All this secrecy puts us at ENKI in an uncomfortable bind, since our corporate values are based on openness and transparency as a means of honoring our customers.   So if you have questions about a cloud technology please contact me.  I'll be happy to share what I know.  But please, don't expect me to sign my name to it!

Copyright © 2011 ENKI LLC