Enki Blog

Managed Cloud Blog

Recent blog posts
Apr 26
2013

Comparing Amazon AWS Pricing to ENKI: A Real-World Case Study Showing 33% Savings

Posted by Eric Novikoff on Friday, 26 April 2013 in Blog

We recently completed a cost comparison between ENKI and Amazon AWS, starting with the monthly bill of one of our mid-sized customers, a social media/SaaS application company.  Their application consumes 51 virtual machine instances, 78 cores, and 236 GB of RAM.   In order to get an apples-to-apples comparison, we looked at the resource settings on each instance and found the AWS instance type that meets or exceeds both the RAM and CPU in order to determine the equivalent pricing.  We did this comparison for all the client's instances for both reserved and on-demand instances.  
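The matching step described above can be sketched in a few lines of Python. This is only an illustration: the catalog entries, type names, and hourly prices below are hypothetical placeholders (not real AWS rates), and choosing the cheapest qualifying type is an assumption about how to break ties when several types meet or exceed the requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    cores: int
    ram_gb: float
    hourly_usd: float


# Hypothetical catalog: names and prices are placeholders, not real AWS rates.
CATALOG = [
    InstanceType("small", 1, 2.0, 0.05),
    InstanceType("medium", 2, 4.0, 0.10),
    InstanceType("large", 4, 8.0, 0.20),
    InstanceType("highmem", 2, 16.0, 0.25),
]


def cheapest_match(cores, ram_gb, catalog=CATALOG):
    """Return the cheapest type that meets or exceeds both cores and RAM."""
    candidates = [t for t in catalog if t.cores >= cores and t.ram_gb >= ram_gb]
    if not candidates:
        raise ValueError(f"no instance type fits {cores} cores / {ram_gb} GB")
    return min(candidates, key=lambda t: t.hourly_usd)


def monthly_cost(instances, hours_per_month=730):
    """Sum the on-demand cost of the equivalent type for each (cores, ram_gb) pair."""
    hourly = sum(cheapest_match(c, r).hourly_usd for c, r in instances)
    return hourly * hours_per_month
```

Calling `monthly_cost` with a list of `(cores, ram_gb)` pairs, one per instance, gives the equivalent bill to compare against. Note that an instance needing 2 cores and 10 GB lands on the "highmem" type here: rounding up to a type that covers both dimensions is why the equivalent deployment can cost more than the raw totals suggest.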

ENKI's pricing ended up being 33% lower than Amazon AWS for this client, while delivering better performance and uptime due to our enterprise architecture.   If the client had chosen reserved instances, ENKI would have been 43% lower. The case study is available as a PDF for download.

Tagged in: Case Studies Pricing
Apr 4
2013

Crash or Aha! - two different philosophies on how to rightsize your cloud deployment.

Posted by Eric Novikoff on Thursday, 04 April 2013 in Blog

Adjusting the size of your deployment to match your business needs is the most obvious tool you have to control costs. Allocate too large a server or too many servers, and you’re going to waste money. Allocate too little, and your clients will think there is something wrong with your service and look elsewhere because of slow performance or crashing. It may be obvious that you want to avoid crashes that lead to downtime, but many of our clients have economized, and continue to economize, on resources to the point that their customer base is continuously alienated. It’s convenient to blame the cloud provider for inadequate performance, but since the Cloud gives you control over how many resources you allocate, in the end the decisions that lead to poor performance or uptime are in your hands. It inevitably will cost more to lose clients than to pay a bit more for resources.

[Figure: crash and overcompensation]

But how much do you need?

Unfortunately most new cloud customers have no idea what the appropriate resource allocation is for their application out of the gate because they haven’t had the chance to measure real-world usage. This presents a choice at initial deployment which we like to call “Aha vs.  Crash” (please see our white paper on controlling cloud costs).  You can choose to learn as much about your application as possible (“Aha!”), or you can choose to minimize resources for short-term savings, which will inevitably result in downtime (“Crash!”) – if only to resize and restart your instance.

We recommend the “Aha” approach of oversizing your cloud deployment initially to avoid the crashing, and then measuring it with an appropriate monitoring tool under real or simulated loads to get that all-important ratio of resources to demand at your chosen level of performance. Because cloud resources can be adjusted down as well as up, you aren’t locked into overpaying for long periods of time, but only until you have your data. After that, you can monitor usage, adjust resources based on measured load, and decide how you will scale up with demand, making adjustments as needed over time. I call this "adaptive allocation." By planning ahead, you can schedule downtime with your users for adjusting resource levels, making the “Aha” approach even more appealing.
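To make "adaptive allocation" a little more concrete, here is a minimal sketch of turning measured usage into a suggested allocation. The function name, the nearest-rank percentile, and the 25% headroom default are all illustrative assumptions, not a description of any particular monitoring tool; pick the percentile and headroom that match your chosen level of performance.

```python
import math


def rightsized_allocation(samples, headroom=0.25, percentile=95):
    """Suggest an allocation from measured usage samples (e.g. GB of RAM in use).

    Takes the nearest-rank percentile of the samples and adds a safety margin.
    Both the percentile and the headroom are assumptions to tune per application.
    """
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    # Nearest-rank percentile: smallest value with at least `percentile`% of
    # the samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    observed = ordered[rank - 1]
    return observed * (1 + headroom)
```

Run it over the samples collected while the deployment is deliberately oversized, then resize down to the suggested figure during a scheduled maintenance window and repeat as demand changes.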

Or, you can install some auto-scaling functions to adjust your resource levels based on measured loads.  However, there are plenty of gotchas with autoscaling as well, which can result in either ongoing crashes or expensive overallocation of resources.  I'll cover these in another blog article.

Tagged in: Cloud 101 Techniques
Mar 21
2013

The economics and future of PaaS

Posted by Eric Novikoff on Thursday, 21 March 2013 in Blog

I was having a discussion the other night with a friend and potential future client who is growing a startup with a bright future. He's currently hosted on Heroku and wondering how he's going to increase his uptime and get better control over his software deployment as he grows. The emails and chats went back and forth for a while and I realized we were writing a blog about PaaS in the process. I wanted to share it with you...

He asked me why I thought Heroku's reliability was so dependent on Amazon's reliability, because his goal is to surpass 4-nines of uptime and he's already seen that the Heroku platform suffers from Amazon's far lower reliability (see my blog about Amazon reliability). I'd never really thought about it, just assuming that Heroku didn't know how to do it. But I think the real reason is market forces that discourage such a solution. I have seen how we often get customers who want 5-nines of uptime but when they see that it will cost them 2-4x in cloud resources compared to the base reliability of our offering, they suddenly drop the requirement. Or, maybe they go to another provider that offers a "100% uptime" guarantee but actually delivers 3+ nines.
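For readers who don't think in "nines" every day, the arithmetic is simple enough to sketch. The function name is mine, and the calculation assumes a 365-day year with downtime spread however it happens to fall; 3 nines allows roughly 8.8 hours of downtime a year, 4 nines about 53 minutes, and 5 nines a little over 5 minutes.

```python
def allowed_downtime_minutes(nines, period_hours=24 * 365):
    """Minutes of downtime permitted per period at a given number of nines.

    For example, nines=4 means 99.99% availability, i.e. 0.01% unavailability.
    """
    unavailability = 10 ** (-nines)
    return period_hours * 60 * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes per year")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the jump from 3 to 5 nines gets expensive so quickly.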

So, there's an upper bound to cost that many cloud buyers have in their minds, which has been successfully set by Amazon. I am of course speaking of the cloud buyers Heroku appeals to, which are lean startups or enterprise departments. These customers don't want to build the human infrastructure to provide their own PaaS out of cloud infrastructure and open-source software - which is essentially what Heroku has done but tied together with a very nice user interface. I compare Heroku PaaS pricing to Amazon because every larger Heroku user complains about the pricing in comparison to Amazon: they're always aware of the cost they're paying for that convenience. Because of this upper bound, Heroku cannot reasonably sell a high reliability infrastructure on top of Amazon, something that is inherently possible to a large (but not complete) degree, though again few users have done so as evidenced by the moans and wails that issue with every major Amazon failure.

Another limiting factor for Heroku (or other PaaS providers) is that PaaS, while convenient especially if you don't have your own IT staff, provides only a limited subset of what an in-house IT group can do with respect to incident response, systems design, accommodating application architecture, etc.   At some point in an application's maturity, it becomes almost imperative that people are involved with maintaining it, especially tuning the deployment to match the requirements of the application.  This limits commercial PaaS to clients that are early on the maturity curve, or steadfastly determined not to hire IT staffing.  And with the advent of third-party PaaS tools like Standing Cloud or CliQr, you can make your own PaaS out of anyone's cloud (though once again, there's a fee for using the tool as a service.)   These new tools are adding quite a few management features, but ultimately they don't eliminate the need for trained system administrators on complex deployments.   I've seen our customers rely on similar features in CPanel or other management tools and back themselves into painful corners where their app couldn't be restarted without rebuilding the server.

As a result of these forces, plus infrastructure cloud providers slowly adding PaaS features, I see a limited future for add-on PaaS services like Heroku.  

But for now, PaaS, like Heroku and others, are a great way to launch an app into the cloud for the first time and run it mostly worry-free until it actually gets a lot of traction.  At that point, you'll need to decide how you want to involve the human element in managing the deployment - either because PaaS costs are higher than a small dedicated IT team, or because you need the flexibility of an IT team in adapting your deployment to your application.   ENKI was created to offer an alternative to building that IT Team, while still paying on a pay-as-you-go basis much like Heroku.

Feb 18
2013

Cloud 2.0 - the DevOps Revolution

Posted by Eric Novikoff on Monday, 18 February 2013 in Blog

I was recently at a content marketing seminar where I ran into Dave Nielsen, the co-founder of Cloud Camp.   We got to talking as we are always wont to do and he told me about his svDevOps meetup and the devops camps that he's been helping to organize.   As usual we realized we've been thinking the same way: the next phase of cloud computing is bringing the ease of use and cost savings of infrastructure as a service to IT as a service, and DevOps (integration of development and operations) is the next frontier.   

Our founder and CEO, Dave, wrote a paper a few years ago, "Why Cloud Computing Will Never Be Free", in which he asserted that the bulk of the costs of cloud computing are in the IT administration (not the computing itself) and that services will be the wave of the future that will make cloud infrastructure usable. Here we are 3 years later and cloud providers are still focusing on dollars per instance-hour (or gigabyte-hour as the case may be) but not on the true cost of operations. This is what the true Cloud 2.0 revolution will transcend. And DevOps, both as a field of knowledge and a green field for automation, will bring that transcendence. This is why we've been honing ENKI's services to truly deliver a "concierge cloud computing" experience to our clients. I believe this is the only approach that will deliver cloud deep into the enterprise - or companies that think like enterprises - rather than simply conquering the perimeter.

 

Tagged in: Commentary
Feb 18
2013

SSD Storage And The Cloud: Are Reliability AND Speed Both Possible?

Posted by Eric Novikoff on Monday, 18 February 2013 in Blog

Storage has always been a major challenge for ENKI in building a high-performance cloud: how do we achieve reliability AND speed?  

For worry-free reliability, we have consistently chosen not to place storage on the compute instances, unlike Amazon or Rackspace's sliced-dedicated-server clouds. If the storage is on the instance, any speed gains are offset by the possibility of data loss if the instance fails. On the other hand, concentrating the storage demands of many instances onto a common storage infrastructure gets you persistent instances with full failover restartability, but it requires the centralized storage to be very high speed and connected to the instances over fast networking. In other words, expensive. To solve these problems, we've so far chosen Infiniband networking coupled with SANs that accelerate access to storage using SSD caching for commonly-used data, offering a large fraction of the speed of full-SSD storage without the price.

But now, Amazon and other cloud providers are upping the available storage speed in their clouds by placing SSD storage into their servers.  Do these increased speeds - necessary for today's cloud database and transactional loads - warrant building in even further incentives to decentralizing storage?   I don't think so.  

First of all, today's applications - in order to reach the highest levels of transaction processing speed - are highly parallelized, meaning they run on multiple servers that have to exchange data in real-time.   So "landlocking" their data on the SSD inside a server actually serves to slow down the application, unless the storage can be fully fragmented (often called "sharded.")   In fact, with many applications the necessity for high speed synchronization between servers becomes so extreme that the networking speeds have to approach the memory access speeds to allow applications to scale linearly.   Very few cloud providers are putting that kind of networking in place because it's expensive.  

Second, placing SSD on the server doesn't solve the failover problem. In fact it makes it worse in practice because even more of the client's access-speed critical data will be placed on the server. The cloud provider who places SSD on the server is essentially dangling an irresistible treat in front of their customers, tempting them to leap off a dangerous cliff of unreliability.

There have been a few proposed distributed storage architectures which maintain access-critical data in local SSD cache on the server, but over time, these solutions have all become unidirectional storage products, used for media servers and such, because they didn't synchronize data between the separate local caches.   There haven't yet been products offered that offer distributed cache coherency for storage, especially because the customers with a full Infiniband or similar network structure tying their clouds together just haven't existed.

I think the solution to offering cloud with true enterprise-grade performance still remains with centralized storage: making SSD available as a shared resource, accessed over dedicated, fast networking. This is the approach that we are going to be offering for our new Santa Clara datacenter cloud cluster in Q2 '13. It also offers the additional benefit that the SAN can dynamically move the data to the most appropriate storage type (disks or SSD) depending on load, which reduces the overall cost of SSD storage for the cloud customer.

Tagged in: Technology
Feb 14
2013

Want to change clouds? Introducing Concierge Onboarding Services

Posted by Eric Novikoff on Thursday, 14 February 2013 in Blog

We recently met with one of our biggest boosters, a Fortune 500 CEO who has always been enthusiastic about ENKI's services.   He pointed out that many of his friends and former clients are unhappy with their choice of cloud providers, but are simply too busy, overworked, overwhelmed, and frazzled to even consider moving to another provider.   He recommended that we emphasize our capability to make migration to ENKI's cloud services, or onboarding new clients, effortless.

To that end, we decided to call out what we can do under the name, Concierge Onboarding.  Concierge, because like a real concierge, we take care of all the details of the onboarding process for you, including migration.  To address the objections that he brought up (and many of our prospects bring up) that keep people from changing cloud providers, we made the onboarding process guaranteed - both in time and money.    

Concierge Onboarding will cost you a fixed amount, agreed in advance.  

Concierge Onboarding will take a known amount of time which we will estimate in advance. Of course, things can happen or you can add more to our plate, and it might take longer. But we won't charge you anything - not for cloud computing resources or labor - if we can't make our initial estimate (unless of course you ask us to do more.)

Check out our cheeky page about it and let us know what you think!

Feb 3
2013

Should you fire your cloud vendor?

Posted by Eric Novikoff on Sunday, 03 February 2013 in Blog

I just read "Firing A Cloud Vendor" by Chris Nerney in ChannelproNetwork with great interest. This topic matters to me because most of our clients come from other cloud vendors, or at least have had bad experiences with cloud computing at some point, and of course I'd like ENKI to learn from these experiences. Chris' assertion is that the primary reason to fire your cloud vendor is if they breach SLA, while it's the client's responsibility to make sure the SLA is comprehensive and business-specific. While I agree with both of these points, the reasons to fire a cloud vendor are usually present when you first choose one. There are two important observations that our experience has shown us make a critical difference in what cloud vendor you choose, as well as whether firing the vendor will actually solve any problems you're experiencing. These two observations boil down to "know thyself" and "know thy application."

The first observation is that the cloud client must know whether their business and team require an intimate or hands-off relationship with the cloud vendor. Making the wrong choice at the beginning will inevitably cause you to be unhappy with the vendor because your expectations won't be met. If they are highly hands-off with only a web provisioning portal for cloud management and support staff that don't hold context with you or understand your application, you should adjust your expectations accordingly. In particular, you cannot expect a hands-off cloud vendor to be responsible for software crashes or misconfigurations on your servers, so you have to be able to administer them yourself. In the drive for the lowest per-hour compute pricing, there are many hands-off vendors out there, and if you have the internal IT skills to completely manage your own servers, they are a viable option. In this case, the SLA discussion is quite simple: did they meet their uptime promises, or any other promises relating to performance? Data loss, because of the self-management requirement, is always going to be at least in part your responsibility and not a sole reason for firing the vendor. The exception to this is, of course, if you discover that a hands-off vendor was the wrong choice for you. In that case, you should be careful that your contract allows you an early exit from your commitments. Many low-cost cloud vendors have no required commitment, but this simply mirrors the fact that they have no commitment to you either!

On the other hand, if the cloud vendor offers full management (what we call "operations services") then their performance to SLA requires - as Chris points out - a detailed SLA that you negotiate with them to describe how their services react to exceptional conditions or to work orders that are necessary for your service to perform reliably.  IT is very complex, and problems will inevitably occur that will violate your SLA requirements from time to time, but as Chris' article points out, if this happens regularly, something is wrong with the vendor or the relationship. If the vendor responds quickly to SLA violations and provides an effective plan to correct them, then you are getting better service than most enterprises get from a hand-picked internal IT team (as I've seen in my time at companies large and small!)  However, it's incumbent upon you to develop an intimate relationship with their team: let them know in advance of any requirements you have like large spikes in load, meet regularly with their team, and if possible share documents (a "runbook") that define how your infrastructure should be managed.  With such documents, the SLA becomes a requirement on the vendor for how well the runbook is managed and followed, which is much easier than enumerating every problem and response in the contract. 

Whether you choose a hands-off or hands-on cloud vendor, my experience is that the best predictor of satisfaction with a cloud vendor - or any vendor actually - is the intimacy of the relationship that you have with them. So even if you want a hands-off relationship, it may make sense to choose a vendor with a documented customer support process, 24x7 support, technical account managers, dedicated sales staff, and if your business warrants it, access to their executives.

The second observation is that if you don't understand your application well, you will have problems with your cloud vendor.   The dividing line between your responsibility to create a cloud-friendly, reliable application and the cloud vendor's responsibility to create a reliable platform to host it is never clear.   If the application crashes repeatedly, performs poorly, or suffers "inexplicable" errors, the cause could be the cloud vendor's infrastructure, but it is more likely within your application or the configuration of your servers.   We've found that about 80% of our clients' downtime is due to application or configuration problems.  There are well-known "problem" applications including WordPress (which crashes regularly if used as a web application rather than just a blogging or content management platform), or MySQL (which loses data if configured incorrectly.)  If you have one of these apps, your cloud vendor cannot be held responsible for your downtime BUT a good hands-on vendor will sit down with you and do a failure analysis and make recommendations that you can use to improve your system's reliability.   These recommendations may include changing technology or testing and coding practices if you write the software yourself, so starting the relationship with the cloud vendor early in your product cycle can avoid many problems.

If you understand your needs and the limitations of your application, and choose a cloud vendor accordingly, you should be able to avoid having to "fire" them in many of the cases that we see bringing customers to us, or even a few of the cases where we have had to part ways with clients.

Feb 1
2013

How to never lose a byte of data in the Cloud!

Posted by Eric Novikoff on Friday, 01 February 2013 in Blog

This week I was working with our Director of Sales, Ryan, on an "engajer" about ENKI.  It's a nifty viewer-directed video presentation technology from one of our customers (and we're now their client too - sounds like a commercial I saw somewhere!)  We were thinking over the things our clients have told us they like about ENKI to share in the video and realized that ENKI has never lost a byte of our clients' data in its 7 years of operation due to equipment or software failures.   It stunned us, considering the constant drumbeat of data loss stories from clients of cloud providers.  And it sort of scared us to think about shouting it out, since it seems so implausible and unbelievable.

We looked over our support records to find any hint of data loss tickets, but the only ones we found were due to the usual human errors: clients deleting data that wasn't backed up, ENKI's operations services not setting up client backups because clients never replied to requests for a backup plan when we initiated services, or software failures in client application stacks (MySQL figures prominently here) which went undetected until the retention window closed.   I thought I'd summarize some of our learnings around data loss to celebrate this milestone.

The primary reason we haven't lost any data is that we only provide centralized storage.   With centralized storage, ENKI controls the degree of redundancy applied to customer data.   Other cloud providers that store client data on the server (including each of the top three providers) require that the user take steps to back up data in order to protect against hardware failure by explicitly copying data off the server onto another form of storage.  By separating storage from the server, the server can fail without damaging the data.  Enterprises have known this for a long time, which is why they use SAN storage.  A wonderful consequence of this is that the client's virtual machines can be automatically restarted on another server, providing instance persistence (the virtual machine is never lost and its state is always preserved) even if it is shut down for a while by the user.  There are quite a few cloud providers who say that local storage is better, but really all they can claim is that it's faster than most centralized storage.  We chose instead to have the fastest possible centralized storage available, and connect it to our cloud servers with the fastest networking we could find - Infiniband or multiple 10Gb Ethernet links.

The secondary reason we haven't lost any data is that our centralized storage is "enterprise-grade" meaning it stores the data to a high standard of reliability - at least twice (usually three times) with automatic recovery of data that resides in damaged copies.  Our first-generation cloud management system, AppLogic, used on-server storage but mirrored it to other servers, so that effectively the data was fully separated from any failed server.  Our current SAN uses multiple disk arrays served by multiple controllers, each of which can fail without causing data loss.   Because of the centralized nature of the SAN, we can fill it with expensive "writezillas", which are an SSD cache that stores data to be written even in the event of a power failure, allowing the SAN - and hence our cloud - to come up on its feet with no data loss even in the unlikely event that the multiple power systems in our facilities all fail at once.

On the flip side, when our clients have lost data due to human error, the causes and corresponding fixes are relatively easy to implement:

1) Lack of backup.  For our self-managed clients, this is a constant danger.  No cloud provider will ensure that your software and data get backed up in the way you need unless you either do it yourself or engage them to do it for you with a clear set of requirements.   ENKI's operations services plans include backup, but if you don't tell us what you need, you can't be sure your data is backed up.

2) Lack of regular recovery testing.  The best backup plan is worthless unless you test it.   Many clients find this is a prohibitive amount of work and as a result have backup processes that actually don't protect them.  We find that for our operations services clients, we often have to beg them to take the time to work with us to validate a data restore.

3) Improperly matched backup requirements and plans.   If you need absolute confidence that your backup data is available, you probably want to store it in more than one place and retain it indefinitely.  ENKI offers multiple locations of backup storage, and picky clients often back up to a third party as well.  If your requirements are more modest, then simply backing up to a separate file system and storing it for a short while may suffice.  Since better backup costs more, you should be sure what all your users/stakeholders require and then set up (or outsource) the backup to meet those requirements.   All too often we see clients who store backups for a few months agonizing over trying to recover data from a few years ago because someone in their organization wasn't asked how long they needed data retained.

4) Thinking the cloud provider has "it all taken care of." We've had a few unmanaged clients run into this one, but I read about it regularly in analyses of client data loss at Amazon and other providers. Cloud clients often forget that Infrastructure-as-a-Service cloud is nothing more than virtualized hardware (though it varies greatly in performance and architecture between vendors, of course.) As nice as it is having someone else provide virtual servers for you, if nobody has it in their responsibility to back them up, they don't get backed up!
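Recovery testing (item 2 above) doesn't have to be elaborate to be useful. As a minimal sketch, assuming you have restored a backup into a scratch directory and still have the original files to compare against, you can check that every file came back with identical contents. The function names and directory layout here are hypothetical, and a real test for a database would also need to confirm that the application can actually start on the restored data.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing or differ in the restored copy."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file() or sha256_of(src) != sha256_of(restored):
            problems.append(str(rel))
    return problems
```

An empty list means every source file was found in the restore with matching contents; anything else is a list of files to investigate before you need the backup for real.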

Jan 27
2013

Cloud Vendor Comparison - Are We Ready For It?

Posted by Eric Novikoff on Sunday, 27 January 2013 in Blog

I just got an email from the LinkedIn Cloud Ventures discussion in which Neil McEvoy proposed a cloud vendor competitive matrix and asked what was important.   I'm still wondering if it's possible to generate such a matrix because of the lack of standards and excess of "cloud washing" in the industry.

One of the biggest requests that we get from our prospects is some sort of assurance that they will be getting the performance they desire with both compute and storage.  Even though performance is how we distinguish ourselves, most of the prospects don't even have the vocabulary to describe how the industry allocates performance to them, so a conversation is difficult.   Some of the variables include:

Compute:

  • RAM oversubscription
  • CPU oversubscription (or in our case, undersubscription because we offer free bursting)
  • Virtual core performance
  • Underlying hardware architecture (speed and vendor of CPU cores)
  • Ratio of unmeasured resources to charged resources (for example if resources are charged by RAM, how much CPU do you get per GB - which is highly variable even within a single vendor's offering)

Storage:

  • Available IOPS per connection
  • IOPS guarantees
  • Connection speed to storage
  • Storage performance oversubscription
  • Storage interconnect oversubscription (many still use 1GbE for example)
  • Raw storage speed (7200 RPM, 15k RPM, SSD etc.)
  • Availability of cache and how it's allocated to clients

Many of these factors are unfortunately held as proprietary by vendors which means that the matrix will never accurately predict performance without common language.  It also renders cost-saving gimmicks like cloud brokerages completely moot since the actual value (performance per dollar per hour) of each vendor varies so greatly that chasing the lowest price doesn't guarantee any savings.
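A toy calculation shows why chasing the lowest hourly price can mislead. The vendor names, benchmark scores, and prices below are entirely made up for illustration; in practice the hard part is getting a benchmark score that reflects your own workload, since vendors don't publish the underlying variables.

```python
def value_per_dollar(benchmark_score, hourly_usd):
    """Performance per dollar per hour: higher is better."""
    return benchmark_score / hourly_usd


# Hypothetical offers: (benchmark score on your workload, price per hour in USD).
OFFERS = {
    "vendor_a": (1000, 0.10),
    "vendor_b": (3000, 0.20),
}


def rank_by_value(offers):
    """Vendor names sorted from best to worst performance per dollar."""
    return sorted(offers, key=lambda name: value_per_dollar(*offers[name]), reverse=True)


def cheapest(offers):
    """Vendor name with the lowest hourly price, ignoring performance."""
    return min(offers, key=lambda name: offers[name][1])
```

With these made-up numbers the cheapest vendor per hour is `vendor_a`, but `vendor_b` delivers more performance per dollar, so a brokerage that only compares price would pick the worse deal.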

Jan 17
2013

Agile IT - Just what agile software development was dreaming about

Posted by Eric Novikoff on Thursday, 17 January 2013 in Blog

One of the challenges we have always had at ENKI in providing high-uptime managed services is to drive our clients towards software design and deployment processes that ensure production application uptime, which we cannot provide singlehandedly as a managed IT services vendor. Since our mission is to support growing or changing companies, we are often called upon to create IT processes that support rapid code development. The challenge is that software development departments are rewarded for releasing features (in other words rapid change) while IT service is measured on uptime (in other words controlled, slow change to reduce errors.) This inherent mismatch between us and our clients seems to set up a relationship of friction in which, in attempting to provide high reliability service, we constantly resist their efforts to move forward. However, is this traditional division between IT organizations and development organizations a foregone conclusion? Will it slow the development of cloud oriented application management? Perhaps, but I think there's another way. I'll call it Agile IT.

Traditionally, there are already accommodations that IT makes to support agile software development. Each of the short, high-frequency development sprints used by agile methodologies like Scrum may require changes in server setup, network topologies, and maintenance processes in order to reach the natural conclusion of a successful deployment of each software chunk. This has driven IT operations ("Ops") folks to insulate the production environments from the software development processes with multistage environments suited to each step in the development-deployment process, including dev, test, integration, staging, pre-production, and production environments. In fully virtualized cloud platforms like ENKI's PrimaCloud, these environments can be easily set up, copied, or torn down to match the requirements of the develop/deploy methodology. By controlling what goes into each environment through process, change management tools, configuration management tools, and business process automation tools, many companies have successfully developed DevOps methodologies that cope well with frequent releases without breaking. A diagram of such a process flow and its stages looks like this:

dev -> integration/test -> staging/pre-prod -> production

However, what happens in today's startup/intrapreneurship environments where the deployment architecture is being developed at the same time as the software?  Even the staged single-threaded methodologies I spoke about above break down.   What we've seen is that our clients may rewrite their software in each sprint to require different network topologies, which results in chaos if the software ends up being deployed into any environment that does not comply - especially of course their production environment!

However, Agile IT can take advantage of the power and flexibility of the cloud to solve this problem with more environments and more machines - yes, at greater cost than static deployment models, but also at much greater speed and accuracy which are crucial to startups or intrapreneurs.  Here's what an Agile IT process would look like:

Agile IT Scoreboard

Sprint 1  Phase        dev          integration  test         stage        release
          IT Status    decom        decom        active       active       active
          Environment  R1DEV        R1INT        R1TEST       R1STAGE      R1REL

Sprint 2  Phase                     dev          integration  test         stage        release
          IT Status                 active       active       active       setup        definition
          Environment               R2DEV        R2INT        R2TEST       R2STAGE      R2REL

Sprint 3  Phase                                  dev          integration  test         stage        release
          IT Status                              active       active       setup        definition   definition
          Environment                            R3DEV        R3INT        R3TEST       R3STAGE      R3REL

In the chart above, you can see how each environment - a virtual private datacenter or maybe just a server - goes through the stages of:

architecture definition and design (in concert with the software sprint) -> setup and deployment -> active use -> decommissioned/archived

This is a deployment approach that supports multiple deployment tracks occurring in parallel (even multiple development tracks if you have the staff!), yet allows each track to have a different deployment architecture.   Yes, there are going to be challenges even with all these environments, which include:

  • Managing multiple environments (this requires a VPDC-oriented management tool like vCloud Director)
  • Transferring production from one environment to another (this is optional - you can take downtime and convert the production environment from the prior release architecture to a duplicate of the next stage architecture during the downtime, but this causes problems of its own.)
  • Lots of well-defined communication between the development team and the Ops team, including detailed documentation of the deployment requirements of each phase and of course the schedule (which is represented above as the traditional "waterfall" diagram)
  • Agreement across all teams to follow the process (this is of course true of all processes!)
  • Liberal deployment of teamwork tools that allow everyone to share status, documentation, alerts, etc.
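The environment lifecycle described above (definition, setup, active use, decommissioned) can be sketched as a small state machine. This is only an illustration of the idea, not a tool ENKI ships; the class and state names are assumptions chosen to match the labels in the scoreboard.

```python
from enum import Enum


class EnvState(Enum):
    DEFINITION = "definition"   # architecture definition and design
    SETUP = "setup"             # setup and deployment
    ACTIVE = "active"           # active use
    DECOM = "decom"             # decommissioned/archived


# Allowed forward transitions; an environment never moves backwards.
NEXT_STATE = {
    EnvState.DEFINITION: EnvState.SETUP,
    EnvState.SETUP: EnvState.ACTIVE,
    EnvState.ACTIVE: EnvState.DECOM,
}


class Environment:
    """One per-sprint environment, such as R2STAGE (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.state = EnvState.DEFINITION

    def advance(self):
        """Move to the next lifecycle stage, refusing to go past decom."""
        if self.state is EnvState.DECOM:
            raise ValueError(f"{self.name} is already decommissioned")
        self.state = NEXT_STATE[self.state]
        return self.state


stage = Environment("R2STAGE")
stage.advance()
print(stage.name, stage.state.value)
```

Keeping the transitions in one table makes it easy for both the development and Ops teams to see which moves are legal, and a shared status board could be generated from the same objects.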

I'm still refining and defining Agile IT based upon requests from our clients that we support their agile software development processes with high reliability.  I'd love your feedback.

Jan 16
2013

Aloha and Mahalo AppLogic

Posted by Eric Novikoff on Wednesday, 16 January 2013 in Blog

On December 31st at 11:30pm, ENKI shut down our last AppLogic grid. We have a lot of fond feelings for AppLogic: it was the technology upon which we based our groundbreaking virtual colocation/virtual IT/managed cloud services since soon after our founding in 2006, and in the last few years it achieved amazing uptime, with clients experiencing 750+ days without any service interruptions due to unavailable virtual machines! However, almost from the beginning it also showed its limitations, which came down to two primary ones. First, the lack of separation between virtual private datacenters (AppLogic "applications") meant that ENKI had to retain ultimate management authority over client services, preventing clients from using the spectacular graphical VPDC design tool included with it. Second, because of its reliance on 1Gbit networking for its cross-system RAID file system and inter-VM communications, performance of the cluster went down as the number of servers or user load rose, creating the cross-customer performance impacts that are common with other cloud providers like AWS, but which ENKI resolved long ago to eliminate from our cloud services. In fact, it was these limitations that caused us to develop our high performance PrimaCloud service.

Over the last few years, we watched as CA continued to develop AppLogic for their usual target customers - enterprises - while ignoring the needs of cloud managed service providers such as ENKI. When their support level to us dropped to the point where we could not maintain our own customer SLAs, we knew we had to move our AppLogic customers to PrimaCloud. But this was easier said than done. Simply providing a new VM and asking self-managed customers to move themselves seemed harsh and risked damaging our relationship with them, as well as risking their business if they didn't meet the deadline. For our managed customers, we knew we had to do it ourselves.

So for the last 6 months, we've been moving clients from AppLogic to PrimaCloud. Fortunately, despite some large obstacles, we managed to complete the migration, with some sleepless nights over the holidays. We have a couple of interesting stories to tell about the heroics our team performed to make the moves happen for a couple of customers. I'll cover those in the next blog!

Tagged in: ENKI Information
Dec 28
2012

What is "Elastic Private Cloud"?

Posted by Eric Novikoff on Friday, 28 December 2012 in Blog

I've been reading a lot of articles lately about Elastic Private Clouds. But does such a thing even exist? What we've seen is that those companies that legitimately need private cloud are trying to comply with legal or business rules that require fully PHYSICALLY and ELECTRONICALLY separated infrastructure. From a compliance standpoint, this is best expressed as putting the hardware for the cloud into a separate cage or even building, with a unique and known set of people who can touch it. For example, the highest levels of PCI compliance have this requirement written into them. Physical separation is the simplest and most usable definition of "private cloud," despite the fact that many MSPs love to give it a different meaning.

If the hardware used to deliver a private cloud truly is physically separate from other parties' hardware, then you can never spin up new resources, as my friend Sean Tario at Open Spectrum likes to say, "as easy as buying a book online," because it cannot be done by logically moving an already running server from one domain to another as the public cloud does. Instead, it must be moved *physically* (in other words, installed, or turned on if already installed). Installation of course can't happen in real time, especially if you include the purchasing process for the hardware.

Because of this, "elastic private cloud" is an oxymoron. Sure, an MSP can achieve some level of logical separation using VLANs, switches, virtualization, etc., but that separation will not meet the line-in-the-sand criterion of physical separation, so it is not truly private, and therefore not an elastic private cloud. At best, it's SEMI-private - which may be more than enough privacy to give the business what it needs without eliminating the benefits of elasticity. Or, if the MSP can wheel in standby hardware from a stock it has, it is SEMI-elastic!

Sean makes the point that lots of companies go to public cloud providers and spin up resources without management permission... because their staff recognize that the rules against using public cloud are not based on business need.  However, if that need is compliance, those companies have to choose between elastic or private.

Sean says "WTF" to private elastic cloud - he sure has that right!

Tagged in: Cloud Industry
Oct 25
2012

Another hit to Amazon reliability

Posted by Eric Novikoff on Thursday, 25 October 2012 in Blog

Our recent article, The Crumbling Myth of Amazon Reliability, generated a lot of interest and comments. However, the table presented in that article went out of date two days ago when Amazon experienced another serious problem with their EBS (shared storage) infrastructure in their East Coast datacenter. Did this constitute "Amazon Downtime"? Well, I think so, because so many high-profile customers (including Reddit, numerous mobile apps, and even the Oxygen file sharing service I use to work remotely) were affected. For more discussion on this topic, please see the blog I wrote a few weeks ago on service health dashboards and what constitutes a failure.

This week's failure caused their 4 1/2 year uptime average to drop from 99.71% to 99.66%.

TechRepublic (see link below) also put in a telling comment about Amazon's service guarantee: it's relative.  Your refunds are proportional to how much worse your uptime is than their overall average.  In other words, they only guarantee that their service is as good as it actually is:

"Finally, all Amazon customers should pay attention to see if they are eligible to receive service credits. The current outage alone may result on a dip in uptime numbers that lead to this eligibility, so customers should make sure they put in their claims in the next 30 days."

Major Amazon Outages 2008-2012
(Does not cover individual instance failures or systems failures affecting small numbers of customers, which are not publicly documented.)

Date | Announced Length (hrs) | Affected Services | Location | Cause | Documentation
23-Oct-12 | 24 | Instances, EBS | US-East | EBS bugs/failure | http://www.techrepublic.com/blog/datacenter/the-amazon-web-service-outage-step-by-step/5846
29-Jun-12 | 7 | Instances, EBS | US-East | Power failure, Software bugs | http://gigaom.com/cloud/some-of-amazon-web-services-are-down-again/
13-Jun-12 | 8 | Instances, EBS | US-East | Power failure | http://gigaom.com/cloud/did-amazons-web-services-go-down
8-Aug-11 | 1 | Instances unreachable | US-East | Router failure, Software bug | http://www.crn.com/news/cloud/231500023/amazon-offers-explanations-apologies-for-dual-cloud-outages.htm
7-Aug-11 | 48 | Instances (5h), EBS (48h) | Ireland | Power failure, EBS bugs/failure | http://www.datacenterknowledge.com/archives/2011/08/07/lightning-in-dublin-knocks-amazon-microsoft-data-centers-offline/
21-Apr-11 | 37 | Instances (1h), EBS (37h) | US-East + other | Network failure, EBS bugs/failure, Router Maintenance | http://money.cnn.com/2011/04/21/technology/amazon_server_outage/index.htm ; http://arstechnica.com/business/2011/04/amazons-lengthy-cloud-outage-shows-the-danger-of-complexity/ ; http://www.syracuse.com/news/index.ssf/2011/04/amazon_failure_takes_down_site_1.html
21-Jul-08 | 8 | S3 | US-East | Network failure | http://gigaom.com/collaboration/s3-outage-aftermath/
15-Feb-08 | 5 | S3 | US-East | User overload, Software bugs | http://gigaom.com/2008/02/15/amazon-s3-service-goes-down/

Total downtime (hrs): 138
Hours in period: 40880
Uptime: 99.66%
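The uptime figure in the table can be reproduced with a few lines of Python. The outage lengths and the hours in the period are taken from the table itself; the function name is just an illustrative choice.

```python
# Announced outage lengths in hours, one entry per row of the table.
outage_hours = [24, 7, 8, 1, 48, 37, 8, 5]
hours_in_period = 40880  # as stated in the table


def uptime_percent(downtime_hours, period_hours):
    """Percentage of the period during which the service was up."""
    return 100.0 * (1 - downtime_hours / period_hours)


total_downtime = sum(outage_hours)
uptime = uptime_percent(total_downtime, hours_in_period)
print(f"{total_downtime} hours down, {uptime:.2f}% uptime")
```

Dropping the 24-hour October outage and using the earlier period of 39420 hours gives the 99.71% quoted for the previous version of the table.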

Tagged in: Cloud Industry
Oct 5
2012

DR, BC, and HA - what's the difference?

Posted by Eric Novikoff on Friday, 05 October 2012 in Blog

Many of our clients are concerned about keeping their applications running 100% of the time.  Since there are costs and capability tradeoffs for the various techniques to improve uptime, I wanted to offer a quick survey of the options.

HA, or High Availability, describes techniques used to keep an application running in the event of hardware or software failures. ENKI offers HA on its hardware (servers, storage and networking), making most hardware failures invisible or minimally disruptive to our clients (presuming the software in question can be restarted automatically).  However, hardware HA does not take into account 1) Software failures that can bring a server and application down; 2) System administration errors that can bring a server down; or 3) Network failures (including those at internet providers between ENKI and our clients and not associated with either party) that can interrupt connectivity to the hardware used to host the application.  To solve these problems, additional steps are required.

Basic High Availability

To add HA that reduces the impact of software, hardware, and management failures, you must deploy two completely separate instances of the application (ideally written separately from scratch, like the software in the Space Shuttle computers, so that the same bug cannot appear in both installations). Then, the software has to be enhanced to communicate between the two separately running applications so that both are "up to date" on what they are doing and on any stored data, including synchronizing any databases. The costs involved are those required to enhance the software and double the infrastructure necessary to run it. As a rule of thumb, the hosting costs alone will approximately double since there will be two of everything.

Business Continuance

To eliminate the impact of network interruptions, the two copies of running software for an HA implementation must be in two different places (datacenters). This is often called BC or business continuance. It introduces additional complexity: keeping the two running instances of software synchronized requires taking into account the internet delay between the two locations, which can become significant. The software must typically be enhanced to use distributed database technologies that lock a transaction in the database until the remote location indicates that it has received the update, so that both locations are working with the exact same data and state in case a failure interrupts one location. There are additional costs associated with deploying, supporting, testing, and potentially licensing more capable database technology, as well as deploying Global Load Balancing, which consists of commercial software running in both locations that directs customer requests away from the failed software deployment to the one which is still running. A typical rule of thumb is that recurring costs for BC are 2.5-3 times those of hosting a single application.

Disaster Recovery

Finally, we must discuss DR, or Disaster Recovery. This is a set of techniques used to save the stored data, state, and code of a running application so that if the application should fail in a way where it cannot be easily recovered or restarted, there is a second, current copy available in another location which can be brought on-line in a defined amount of time. Typically, DR recovery times are measured in hours, since the backup copy of the application must be deployed and brought into a running state. However, DR is the easiest to implement compared to BC or full HA, because the application need not be changed at all to accommodate two concurrently running copies. Costs are also the lowest, since until a failure occurs, the DR costs are similar to an off-site backup. However, it does not guarantee uninterrupted service.

Choosing Uptime Requirements That Are Right For Your Business

For each of HA, DR, and BC the costs are related to the recovery time objective or RTO - in other words, the amount of downtime you are willing to tolerate when a failure occurs. Because of the architectural changes to both the software and the deployment (the configuration of the hosting) for any of these techniques, and because the changes depend on the RTO, it is not possible to determine the exact costs until these choices are made.
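As a rough illustration of the rules of thumb above, the sketch below turns them into a recurring-cost range. The multipliers come from the text (about 2x for basic HA, 2.5-3x for BC); the base cost of 1000 is a hypothetical figure, and DR is left out because the text gives no multiplier for it beyond "similar to an off-site backup". Software enhancement costs are not included.

```python
# Recurring hosting cost multipliers relative to a single deployment.
# Values are the rules of thumb from the text, not exact pricing.
COST_MULTIPLIER = {
    "single": (1.0, 1.0),
    "HA": (2.0, 2.0),   # two of everything
    "BC": (2.5, 3.0),   # two locations plus replication and load balancing
}


def recurring_cost_range(base_monthly_cost, approach):
    """Return (low, high) estimated recurring cost for the approach."""
    low, high = COST_MULTIPLIER[approach]
    return base_monthly_cost * low, base_monthly_cost * high


base = 1000.0  # hypothetical monthly cost of hosting one copy
for approach in ("single", "HA", "BC"):
    low, high = recurring_cost_range(base, approach)
    print(f"{approach:>6}: {low:.0f} to {high:.0f} per month")
```

The real numbers depend on the RTO you choose, so treat the output as a starting point for a conversation rather than a quote.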

Sep 28
2012

Service Health Dashboards - What Do They Mean And How Can You Use Them?

Posted by Eric Novikoff on Friday, 28 September 2012 in Blog

With the growing focus on the reliability of cloud services, many cloud providers are now offering "service health dashboards" that report the current and historical status and uptime of their services.  However, these dashboards are at best misleading about reliability, and at worst they can create an atmosphere of distrust between a cloud provider and its clients. The reason for this isn't necessarily nefarious - it has to do with the fact that it is nearly impossible to report on each client's experience of a complex cloud service.

Let's start with a typical SaaS health dashboard from a major SaaS ERP provider:

[Image: CloudHealthDashboard]

When you look at this dashboard, the first conclusion you might come to is that YOUR system - whatever serves you when you log in - is being measured here. But a quick check of the number of "Application Requests" will tell you that's not so, since it's so large. Instead, it's the aggregate or average system uptime that's being reported. But what does that mean, and is it meaningful to you? This SaaS provider, like most, uses many servers to serve their clients. Some servers serve any client, and some serve a specific set of clients. If one of the specific ones dies, or something in the provider's network goes wrong, it will affect these numbers. When you see a "99.97", most likely a customer-facing server has failed and been down for a while, affecting all the customers who use it. Now, this page has no information on what the uptime numbers mean, so we have to assume that they refer to the percentage of Application Requests that were successfully served. However, given the failure mechanism, those requests most likely came from a small set of customers using the failed server. The number of failed requests in this instance is (1-.9997) * 219968978, or about 65990. That's a lot of problems! If the failed equipment hosted 100 customers with 10 users each, it implies that on average each user tried about 66 times to get the system to do something and it was unable to. When you use your accounting, payroll, support, or inventory system, how many times do you ask it to do something in a day? My guess is that the affected people probably got little or no work done that day. The 99.97 looks like just a "little" problem, but for the affected people, it could have been a very bad day.
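The arithmetic in this paragraph is easy to check in Python. The request count and uptime figure are the ones quoted above; the 100 customers with 10 users each are the same hypothetical used in the text.

```python
total_requests = 219_968_978   # "Application Requests" from the dashboard
reported_uptime = 0.9997       # the "99.97" figure

# Hypothetical population behind the failed server, as assumed in the text.
affected_customers = 100
users_per_customer = 10

failed_requests = (1 - reported_uptime) * total_requests
failed_per_user = failed_requests / (affected_customers * users_per_customer)

print(f"Failed requests: {failed_requests:,.0f}")
print(f"Failed requests per affected user: {failed_per_user:.0f}")
```

The point is that a small aggregate percentage can hide a large number of failures concentrated on a few customers.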

Let's take a look at another health dashboard, this time from Amazon Web Services:

[Image: AmazonHealth]

Amazon is having a good day in California, Virginia, and Oregon. However, if there is a problem, they simply say that the service is affected, partially down, or completely down, with some details on what the failure is. On one of their worst days, during a storage failure in 2011, the details said that some users were affected. Later on we found out that most users of Amazon's services hosted in Virginia were offline, and some had actually lost data. However, the health dashboard never reflected this, because of the sheer number of user accounts and the need to summarize the data.

As an example of this, one of my vendors - a software company - offers a hosted CRM service from their own datacenter and has been running a proof of concept in Amazon for a client who wanted to be hosted there. Yesterday (Sep 26) they lost all the data for their POC after running it for 8 months, due to a failure in the EC2 storage service (EBS). While Amazon claims that their expected failure rate is once every 200-1000 years, it happened to our vendor. But what does the claim mean? Is it one failure within all their locations and services every 200 years, or one failure per customer every 200 years? In either case, the health history dashboard for that date did not show the failure:

[Image: AmazonHistory]

Why is this?  I don't think it is intentional deception on Amazon's part, but rather that the granularity of their monitoring and reporting simply can't catch individual failures in servers, storage systems, or local networking equipment. It's very difficult to report failures accurately in such large infrastructure systems.

However, the consequences of this inability could be dire: most cloud providers offer some kind of guarantee based on their SLA. If the provider's monitoring can't catch it, you are on your own in proving that you are due compensation.  And of course, if you base your decision on whether to choose a cloud provider only on their historic dashboard data, you are not going to get an accurate idea of the worst case or even typical issues you might experience.  Instead, it may make more sense to talk to the provider about how your virtual infrastructure or services are monitored and reported on, though many providers don't offer that capability.  And it may make sense to clearly understand their compensation policies and what proof has to be given to engage them, since the health dashboard will not be proof enough.

Tagged in: Cloud Industry
Jul 17
2012

The Crumbling Myth of Amazon Reliability

Posted by Eric Novikoff on Tuesday, 17 July 2012 in Blog

The two recent Amazon outages in June have once again brought the topic of cloud reliability to the attention of cloud users and the media. Coming on top of numerous Amazon failures in 2011, they have pundits starting to ask if the actual architecture of Amazon's cloud is the source of its recurring problems. Up until recently, the average cloud user was firmly convinced that Amazon could do no wrong. This has been a challenge for us at ENKI, since we end up competing against a myth instead of a real service with real problems. So I thought it was time to sum up the problems at Amazon over the last few years and see what they mean.

If you've been following my blog, you'll know that I don't believe in the myth of a failure-free cloud service, and that ultimately going beyond cloud providers' uptime guarantees requires that the users themselves take responsibility for uptime by designing applications and deployments that take advantage of geographic diversity. And this has been borne out by Netflix, which has experienced far fewer downtimes than its host, Amazon. However, the bulk of cloud users still deploy on non-redundant, single-location servers, and that is the reliability they compare between providers. So let's look at that reliability from Amazon...

Amazon guarantees 99.95% uptime. If you get less than that, you can apply for a minimum 10% refund per month. However, they cannot possibly be achieving their own guarantee: 99.95% translates to about 4.38 hours per year of downtime. If you just look at their mammoth failure (documented below) in April 2011, which was officially declared repaired after 37 hours (though some customers experienced significantly longer or shorter downtimes), that single outage uses up about eight and a half years' worth of the downtime allowance, so they would need to have had no other failures for all of that time! As you can see from the list below, that hasn't been the case. In fact, adding up just the published failure times in Amazon's East Coast datacenter for 2011 and 2012 yields an uptime of only 99.7% - which has been clearly visible to millions of users of systems such as 4Square and other sites (and mobile apps) using Amazon.
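Here is a minimal Python sketch of the downtime-allowance arithmetic. It assumes a 365-day year (8760 hours); the function name is illustrative only.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def allowed_downtime_hours(sla_percent, period_hours=HOURS_PER_YEAR):
    """Downtime permitted in the period while still meeting the SLA."""
    return period_hours * (1 - sla_percent / 100)


yearly_allowance = allowed_downtime_hours(99.95)   # about 4.38 hours
april_2011_outage = 37                             # announced length in hours

# Years of otherwise perfect service needed to average out one outage.
years_to_absorb = april_2011_outage / yearly_allowance

print(f"Allowance: {yearly_allowance:.2f} hours per year")
print(f"One 37-hour outage consumes {years_to_absorb:.1f} years of allowance")
```

The same function can be used to see what any other guarantee means in practice, for example 99.9% or 99.99%.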

This is far, far lower than what is possible with well-designed cloud services or a well-designed redundant colocated server farm.   And there is a disturbing trend to the root causes of these failures: over and over, power failures have taken Amazon datacenters down, and recurring software failures have kept them down.   Is the AWS architecture too complicated to be reliable?

Major Amazon Outages 2008-2012
(Does not cover individual instance failures or systems failures affecting small numbers of customers, which are not publicly documented.)

Date | Announced Length (hrs) | Affected Services | Location | Cause | Documentation
29-Jun-12 | 7 | Instances, EBS | US-East | Power failure, Software bugs | http://gigaom.com/cloud/some-of-amazon-web-services-are-down-again/
13-Jun-12 | 8 | Instances, EBS | US-East | Power failure | http://gigaom.com/cloud/did-amazons-web-services-go-down
8-Aug-11 | 1 | Instances unreachable | US-East | Router failure, Software bug | http://www.crn.com/news/cloud/231500023/amazon-offers-explanations-apologies-for-dual-cloud-outages.htm
7-Aug-11 | 48 | Instances (5h), EBS (48h) | Ireland | Power failure, EBS bugs/failure | http://www.datacenterknowledge.com/archives/2011/08/07/lightning-in-dublin-knocks-amazon-microsoft-data-centers-offline/
21-Apr-11 | 37 | Instances (1h), EBS (37h) | US-East + other | Network failure, EBS bugs/failure, Router Maintenance | http://money.cnn.com/2011/04/21/technology/amazon_server_outage/index.htm ; http://arstechnica.com/business/2011/04/amazons-lengthy-cloud-outage-shows-the-danger-of-complexity/ ; http://www.syracuse.com/news/index.ssf/2011/04/amazon_failure_takes_down_site_1.html
21-Jul-08 | 8 | S3 | US-East | Network failure | http://gigaom.com/collaboration/s3-outage-aftermath/
15-Feb-08 | 5 | S3 | US-East | User overload, Software bugs | http://gigaom.com/2008/02/15/amazon-s3-service-goes-down/

Total downtime (hrs): 114
Hours in period: 39420
Uptime: 99.71%

Tagged in: Commentary
Jun 22
2012

Termites In Your Cloud? (How Paging can cost you big time)

Posted by Eric Novikoff on Friday, 22 June 2012 in Blog

I wanted to bring your attention to an issue that could be causing big trouble in your cloud deployment without being visible to you, much as termites eat away at the structure of your house unseen. It is paging, sometimes also called page faulting or swapping.

What is paging? It is a pathological state your cloud server can get into when the software running on it tries to use more memory (RAM) than is available. Modern operating systems (like Windows or Linux) offer Virtual Memory Management, which supplements your server's memory by storing data that doesn't fit in memory on your disk storage (in what is called "swap space"). This is an essential capability your cloud server should be set up with (check with your cloud provider) because it keeps the server from crashing when it runs out of memory, and some swap is required by most operating systems.

However, if your server tries to actually use disk as RAM, it runs into the problem that disk is up to one million times slower than memory.  The result is like trying to win the 100-yard dash while being forced to crawl the last 10 feet.

Paging can raise the effective cost of your cloud deployment dramatically. You are still paying the same rate per hour for your server as if it were not paging, but it's getting a lot less done. So, for example, if the paging slows your server down by 10 times, you are now paying ten times the amount per useful unit of work that you are asking your server to do! It also causes shared resources on the physical server that your cloud server resides on to get clogged with I/O operations, slowing the entire cloud down - which is why cloud providers don't like you to page, or may not even let you allocate swap space on your disk storage at all.
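The effect on cost per unit of work can be sketched as follows. The hourly rate and throughput figures are hypothetical; only the 10x slowdown comes from the example above.

```python
def cost_per_unit_of_work(hourly_rate, units_per_hour):
    """What each completed unit of work (request, job, ...) costs."""
    return hourly_rate / units_per_hour


hourly_rate = 0.20          # hypothetical price of the server per hour
normal_throughput = 1000    # hypothetical units of work per hour, no paging
slowdown = 10               # paging makes the server 10 times slower

normal_cost = cost_per_unit_of_work(hourly_rate, normal_throughput)
paging_cost = cost_per_unit_of_work(hourly_rate, normal_throughput / slowdown)

print(f"Cost per unit without paging: {normal_cost:.5f}")
print(f"Cost per unit while paging:   {paging_cost:.5f}")
print(f"Increase: {paging_cost / normal_cost:.0f}x")
```

The hourly bill does not change, which is exactly why the extra cost is so easy to overlook.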

Paging is much like termites eating away at your cloud server and the cloud itself; and like termites the damage is not easily visible because you can't easily see the consequences.  It's hard to know how fast your server would be if it wasn't paging!

The insidious thing about paging is that you control whether it happens by how much memory you allocate to your cloud server.  And because you pay for that memory, you have a visible incentive to keep it low, while paging is more invisible and will only cost you big money if you guess wrong.

The only effective way to know if you're swapping is to use a monitoring tool that keeps historic data on how often your server uses its swap space on disk, such as the Zyrion system we offer with PrimaView. However, if the paging is constant and extreme, running the "top" (load average) command on Linux or using the Windows performance monitor can show you how much "I/O Wait" (the amount of time your server spends waiting for I/O) or swap activity your server is experiencing.
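If you want a quick spot check on a Linux server without a monitoring product, one option is to sample the kernel's swap counters yourself. The sketch below is an assumption-laden illustration, not part of PrimaView: it reads the `pswpin` and `pswpout` counters (pages swapped in and out since boot) from `/proc/vmstat` twice and reports the rate in between. The function names and the 5-second interval are arbitrary choices.

```python
import time


def parse_swap_counters(vmstat_text):
    """Return (pswpin, pswpout) page counts from /proc/vmstat content."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


def swap_rate(interval_seconds=5.0, path="/proc/vmstat"):
    """Sample the counters twice; return pages swapped in/out per second."""
    with open(path) as f:
        in_before, out_before = parse_swap_counters(f.read())
    time.sleep(interval_seconds)
    with open(path) as f:
        in_after, out_after = parse_swap_counters(f.read())
    return ((in_after - in_before) / interval_seconds,
            (out_after - out_before) / interval_seconds)
```

Calling `swap_rate()` on a Linux host returns two numbers; sustained non-zero values suggest the server is actively paging. A one-off sample like this is no substitute for historic data, since paging often shows up only during load spikes.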

Correcting swap may cost you as little as 10-20% of your spend on your server, resulting in a doubling or more of performance.   Even more importantly, correcting it gives you the headroom to absorb spikes in load which would drive your server into a near standstill due to paging.
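To see why a modest increase in spend can pay off, here is a small sketch of the relative cost per unit of work after adding memory. The 20% extra spend and the 2x speedup are the illustrative figures from the paragraph above, not a guarantee for any particular workload.

```python
def relative_cost_per_unit(extra_spend_fraction, speedup):
    """Cost per unit of work after the fix, relative to before (1.0)."""
    return (1 + extra_spend_fraction) / speedup


after_fix = relative_cost_per_unit(extra_spend_fraction=0.20, speedup=2.0)
print(f"Cost per unit of work after the fix: {after_fix:.0%} of the original")
```

In this example you pay 20% more per hour, yet each unit of work costs noticeably less than it did while the server was paging.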

Fortunately ENKI's PrimaCloud customers are able to resize their running servers on-the-fly in most cases, without any downtime.   Other cloud technologies usually require some downtime to restart the server.   In a future article, "Crash vs. Aha" I'm going to detail some ways to save money while you figure out exactly how much RAM you need over time, even as your server(s) are experiencing increasing usage.

Just like with termites, being prepared can save you big.

 

Tagged in: Cloud Usage
Jun 13
2012

Use the cloud effectively: put the "Ops" back in DevOps

Posted by Eric Novikoff on Wednesday, 13 June 2012 in Blog

Over the last few months, I've been in a number of initial sales meetings with prospects and have talked to them about why it is so important that they have a good operations capability to design, manage, optimize, and secure their cloud deployments, and why outsourcing it is a superior alternative. This capability will not only save businesses tangible costs on staffing and efficient cloud deployment of their applications, but also on intangibles like uptime, perceived performance of their applications, and even the distraction of building a highly capable ops team when they don't have that knowledge in-house.

On a recent visit, Sean Tario, the CEO of Open Spectrum, Inc, an ENKI partner, came along and became very excited about the message and my back-of-the-envelope chickenscratch, since OSI like ENKI is dedicated to serving the startup/web2.0/entrepreneur community that can benefit from this knowledge.   I'd planned to write my thoughts up as a blog article, but Sean and his team have done it better than I could.    You can find their thoughts on why having good operations capability is important in their blog "A Cloud Divided".

Tagged in: Business Strategy
May 29
2012

Addressing Cloud Inefficiency

Posted by Eric Novikoff on Tuesday, 29 May 2012 in Blog

I recently read an article posted in the Cloud Hosting and Service Providers Forum on LinkedIn bringing up the topic of inefficiency in using cloud resources.   Somehow, I think we all expected that users would use a pay-per-use resource efficiently.  However, this is not the case.

Inefficiency in using the cloud is the result of not managing your resources effectively. In the enterprise, there are two reasons for this, both structural: one is inside the enterprise, the other outside. The one inside is that the people paying for cloud are not the ones deploying it, except in rare circumstances. This is an age-old problem of budgeting and cost control in large organizations that predates cloud and is a result of how enterprises manage their budgets. The structural problem outside the enterprise is in the industry: cloud is sold as being "nearly free" or "great savings." While this misapprehension has been catered to by many vendors, the fact is that replacing expensive, reliable, high-performance infrastructure with cloud that performs as well is still... expensive! And replacing it with cut-rate cloud that doesn't do the job is even more expensive, due to per-unit inefficiencies and fragmentation of unused CPU and memory inside fixed-size instances. There may be compelling cost savings, but the numbers still add up when you start to use a lot of cloud services.  So this misapprehension leads to waste because it reinforces the idea that the resources are free.

As prices of cloud start to rise with better comparative service metrics coming into use, there will be an incentive to "turn the lights out."  In the meantime, there are easy things you can do to save on cloud through improved efficiency: 

- Implement budget controls at the departmental level so that funds are used wisely.
- Conduct periodic usage reviews to catch unused server/storage sprawl.
- Use project lifecycle management to ensure that completed projects are not still burning resources.
- Avoid choosing cloud providers based on price alone: the hourly rate numbers don't reflect actual costs, especially if you have to buy a lot more instances/resources to compensate for poor performance, or if instance sizes are fixed, causing you to buy more than you need. Either can actually increase your costs.
- Write or buy more efficient software: poorly written software can change your costs by as much as 10x. A corollary to this is to examine your software configuration for inefficiencies (poor SQL queries, inadequate memory/cache buffers, etc.).
- Apply systems engineering reviews to your deployment architecture: a poorly architected deployment can waste up to 4x of your compute.  For example, bad decisions made on auto-scaling or parallelism can lead to extra instance deployments.

There's no one perfect solution, especially in an organization that makes resources available for "free" internally, but these can help, though they may require a culture change.
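
To illustrate the point about hourly rates and fixed instance sizes, here is a rough sketch of the arithmetic. Every number in it is hypothetical (the workload, the instance shapes, the hourly rates, and the `perf_factor` that models a slower vCPU), and the function names are mine, not any provider's API.

```python
import math


def instances_needed(required_cpu, required_ram_gb,
                     inst_cpu, inst_ram_gb, perf_factor=1.0):
    """Fixed-size instances needed to cover a workload.

    perf_factor scales the useful work per vCPU: 0.5 means each vCPU
    does half the work of the reference vCPU the workload was sized on.
    """
    by_cpu = math.ceil(required_cpu / (inst_cpu * perf_factor))
    by_ram = math.ceil(required_ram_gb / inst_ram_gb)
    return max(by_cpu, by_ram)


def monthly_cost(count, hourly_rate, hours=730):
    """Monthly bill for `count` instances at a flat hourly rate."""
    return count * hourly_rate * hours


# Hypothetical workload: 12 reference vCPUs and 24 GB of RAM.
need_cpu, need_ram = 12, 24

# Both hypothetical providers sell a fixed 4 vCPU / 16 GB instance.
providers = {
    "full-speed at $0.40/hr": (0.40, 1.0),
    "cut-rate at $0.25/hr": (0.25, 0.5),
}

for name, (rate, perf) in providers.items():
    count = instances_needed(need_cpu, need_ram, 4, 16, perf)
    unused_ram = count * 16 - need_ram
    print(f"{name}: {count} instances, "
          f"${monthly_cost(count, rate):,.0f}/month, "
          f"{unused_ram} GB RAM paid for but unused")
```

With these made-up numbers, the provider with the lower hourly rate needs twice as many instances to do the same work, ends up with the higher monthly bill, and leaves far more memory stranded inside fixed-size instances.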

Tagged in: Cloud Usage

Going beyond compliance: achieving true security in the Cloud

Posted by Eric Novikoff on Saturday, 17 March 2012 in Blog

One of the largest barriers to cloud deployments is the actual or perceived lack of security in the shared infrastructure used to provide cloud services.   Companies seeking to create applications that meet PCI, HIPAA, or FIPS standards are struggling with the twin challenges of actually meeting the requirements and finding auditors who are well enough versed in cloud technology to assess whether those requirements are met by a proposed cloud deployment.   On the other side of the fence, vendors are lining up to provide checklist-decorated services of supposedly regulation-compliant infrastructure that dangle the prospect of guaranteed compliance simply by choosing their hosting/cloud solution.   It's a mess!

At ENKI, what we've learned is that there are two basic considerations in meeting requirements for compliant cloud security: actually providing infrastructure that meets auditors' requirements AND providing our customers with security solutions that are simple, effective, and flexible enough that they can tell their clients or end users that they are not just meeting the letter of the law but are actually sure that the personally identifiable data is really safe.

Our most sophisticated security-conscious client, a HIPAA-compliant medical information processing service company, is headed by a CEO who explained to me that his major challenge is convincing his customers (medical clinics) that they can trust him, for which HIPAA and SSAE compliance simply weren't good enough.   He knew that he needed to secure his software by design, and then make sure that all the client data was completely inaccessible both at rest and in motion.   To accomplish the latter, his company has chosen High Cloud Security's data security product, which is going to be rolled out in its processing environments at ENKI.

I recently met with High Cloud Security's executive team (they're in Mountain View, like us) and was impressed with the product.  The architecture behind their system is to provide a storage republishing engine that handles all encryption, keeping unencrypted data completely out of the cloud storage infrastructure, while republishing secured block- or file-based storage to the client VMs.   This not only secures the customer's data, but also the VM itself and its swap space - everywhere that an image of the secured data might reside, even temporarily, including backups.   In addition, they offer a separate key management server with role-based access that can be locally or remotely hosted, allowing keys to be managed without the necessity of logging in and providing them to the application before it can run.  Key management can be entrusted to the cloud service provider, a third party, or even handled by the clients themselves.  While there is no 100% secure way of storing a key (who has the key to where the key is stored?), this solution allows you to choose who you'll trust with your keys, without having to manage them manually.   It completely eliminates carrying around printed keys, USB key storage sticks, or other ad-hoc solutions.   The one remaining challenge - securing the link from one VM to another or to storage - is addressed by having an in-VM version of their storage publisher software.   With this method, any operations that the cloud service provider applies to the protected data do not expose any privileged information - even moving the VM from one datacenter to another.
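
To show what separating key management from the application can look like, here is a toy sketch of a role-based key server. It is not High Cloud Security's product or API; the `KeyServer` class, the role names, and the token scheme are all hypothetical, and a real deployment would add authenticated transport, auditing, and hardened key storage.

```python
import hmac
import secrets


class KeyServer:
    """Toy key manager: holds data keys and releases them by role."""

    ROLES = ("key_admin", "storage_publisher")  # hypothetical role names

    def __init__(self):
        self._keys = {}    # volume name -> data key (bytes)
        self._tokens = {}  # principal name -> (role, token)

    def enroll(self, principal, role):
        """Register a principal under a role and hand back its token."""
        if role not in self.ROLES:
            raise ValueError(f"unknown role: {role}")
        token = secrets.token_hex(16)
        self._tokens[principal] = (role, token)
        return token

    def _role_of(self, principal, token):
        entry = self._tokens.get(principal)
        # compare_digest avoids leaking token contents through timing.
        if entry is None or not hmac.compare_digest(entry[1], token):
            raise PermissionError("unknown principal or bad token")
        return entry[0]

    def create_key(self, principal, token, volume):
        """Only a key_admin may create (or replace) a volume's key."""
        if self._role_of(principal, token) != "key_admin":
            raise PermissionError("only key_admin may create keys")
        self._keys[volume] = secrets.token_bytes(32)

    def fetch_key(self, principal, token, volume):
        """Only the storage publisher may fetch a key, with no human login."""
        if self._role_of(principal, token) != "storage_publisher":
            raise PermissionError("only storage_publisher may fetch keys")
        return self._keys[volume]


server = KeyServer()
admin_token = server.enroll("ops-admin", "key_admin")
publisher_token = server.enroll("publisher-vm", "storage_publisher")

server.create_key("ops-admin", admin_token, "patient-records")
key = server.fetch_key("publisher-vm", publisher_token, "patient-records")
print(f"publisher received a {len(key) * 8}-bit key")
```

In this sketch the party that runs `KeyServer` is the party you are trusting with your keys, whether that is the cloud provider, a third party, or you. The storage publisher picks up its key at startup without anyone logging in to supply it.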

Because of ENKI's "everything is virtual" approach to infrastructure, High Cloud Security's services are easy to deploy, and no restriction is placed on the flexibility of their key management: we can run it, you can run it, or you can have a third party run it. You can also use the storage republisher VM entirely for your private application, kept within a VLAN, so that no element of your cloud infrastructure shares unprotected data with another customer, which meets even the most stringent storage privacy requirements.

Please contact us to talk over your compliance requirements and how our virtual private cloud / virtual colocation architecture combined with High Cloud Security can make meeting your compliance requirements a snap.

Tagged in: Techniques

Copyright © 2011 ENKI LLC