For those of you who don’t already know, the Internet stood still for a few hours this Monday. It started as a small issue affecting a few instances of Amazon’s Elastic Compute Cloud (EC2) in Northern Virginia, and quickly snowballed into a full-blown outage. Services such as Reddit, Foursquare, Minecraft, Heroku, GitHub, Imgur, Pocket, HipChat, and Coursera were all affected by the outage, which lasted for several hours. While someone going by the handle Anonymous Own3r claimed responsibility for the outage, they weren’t actually behind it. Simply read the tweet where they claim responsibility and you’ll probably agree; rather than some master hacker, they seem to be a very troubled, very lonely fourteen-year-old boy with a very tenuous understanding of how networking actually functions.
No, this wasn’t an attack or a hacking attempt. It was a small issue with a critical application in Amazon’s cloud infrastructure. Therein lies the problem. Keep in mind, folks: this sort of thing has happened before, with far more serious consequences.
We spend a lot of time talking up cloud computing, and speaking to the benefits of multi-tenancy and infrastructure-as-a-service (IaaS). It’s easy to forget, amid all the advantages, that there are some very real, very serious drawbacks, especially if you’re putting all your eggs in one basket. The AWS outage underscored one of the most significant of these: when you sign on with an IaaS vendor, you’re putting the availability of your content and the stability of your website directly in their hands. If something goes wrong with their service, as happened with AWS on Monday, everything goes straight south, and there’s nothing you can really do except wait out the storm.
“This is a clear indication,” writes one user on TechCrunch, “that we should not be relying on ONE cloud service provider to host multiple websites. We need to get our heads out of this marketing-based “cloud” dream, and realize that the best solution is to either host our own network, or spread services between multiple data centers.”
Failover is the word of the day here, folks. To be honest, it’s faintly baffling how many of these large, well-established websites were apparently relying solely on Amazon’s platform, and how many of them experienced crippling downtime and service interruptions from Amazon’s outage. Why did none of them appear to have a readily available backup plan? It’s not like it’d be terribly difficult to pull off. Platforms such as RightScale are just the ticket for managing workloads across multiple clouds.
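To make the idea a little more concrete, here is a minimal sketch of failover in Python. It assumes you run the same service with more than one provider and that each copy exposes a health-check URL; the function names and the example URLs are hypothetical, made up purely for illustration, and a real setup would more likely handle this at the DNS or load balancer level.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def http_health_check(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection failures, timeouts, HTTP errors, and malformed URLs
        # are all treated as "this provider is down".
        return False


def pick_endpoint(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool] = http_health_check,
) -> Optional[str]:
    """Return the first healthy endpoint, in priority order, or None."""
    for url in endpoints:
        if is_healthy(url):
            return url
    return None


if __name__ == "__main__":
    # Hypothetical endpoints: a primary on one provider, a backup on another.
    candidates = [
        "https://primary.example.com/health",
        "https://backup.example.net/health",
    ]
    print(pick_endpoint(candidates))
```

The list is checked in priority order, so traffic only moves to the backup when the primary stops answering. The health check is passed in as a parameter so it can be swapped for something stricter than a plain HTTP request.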
The weaknesses of multi-tenancy and the importance of an established failover plan are only two of the takeaways from this whole snafu. There’s a third, final takeaway, though your mileage may vary: this whole incident actually makes for a very compelling argument in favor of implementing your own cloud solution, if you’ve the resources. See, the biggest problem with this outage was that no one impacted by it could really do much except damage control while they waited for Amazon to sort things out.
If you’re using your own cloud platform instead of someone else’s, then when an outage happens, it’s all on you: your services are only out for as long as it takes your IT staff to fix things. To be fair, there are a number of other advantages which make IaaS an excellent choice, but at the end of the day, it’s all about the resources available to you, and how you choose to use them.
At the very least, it’s important to have a backup plan.