If you Google “cloud outage,” you’ll receive a long list of news reports concerning Amazon, Rackspace, Microsoft, Google, and others. It’s a common occurrence for cloud service providers to experience downtime. Given this reality, what do you need to know about cloud outages?
- Everybody has outages. Nothing inherent in cloud architecture makes a cloud 100 percent reliable. The fact is, running a hugely scalable, automated, efficient service is a lot of work. There are lots of moving pieces, and sometimes things go wrong. One time it’s a hardware problem; another time, a software upgrade; still other times, a weather event. That’s why every major cloud service provider in the business has suffered an outage at one time or another. Enterprises are not immune either, so remember that when you are building a private or hybrid cloud. Finally, keep in mind that public providers have strong, natural incentives to keep their clouds working reliably.
- They are survivable. While everyone has outages, those outages don’t have to take down your application. It all comes down to application architecture. If you expect cloud failures to occur, you will architect your applications to use redundancy and data replication so that they can continue to operate in the face of outages. The folks at Netflix have turned this into a fine art, using a stateless architecture, multiple availability zones and geographical points of presence, and a robust database replication architecture to help the company’s streaming service survive multiple Amazon outages. If you’re simply “lifting and shifting” old-fashioned enterprise apps from the last decade into a public cloud provider, you should expect them to suffer the same problems they had when your own data center went down. Categorize your applications according to their availability requirements and create an appropriate strategy for each category.
- This, too, shall pass. Remember back in the 1990s when the Internet was new? Reliability was horrible. If a backbone service provider had a problem with a key fiber link, or a router got a bad upgrade and decided to blackhole all the traffic in the world, the Internet ground to a halt. In the 2000s, we also had malicious attacks from worms like Code Red, Nimda, and SQL Slammer. We learned how to deal with all those issues, and now we’re surprised (and even a bit angry) when the Internet doesn’t “just work.” We’ll get there with cloud computing, too, but we have to take our lumps first.
- Identify your detection mechanism. How will you know if your cloud-based application is having problems? Hopefully, it isn’t when your customers start calling your support line. System monitoring and end-user experience management software are helpful here. But be careful how you deploy these packages: if they depend on the same cloud they are monitoring, they will suffer the same outage and fail right when you need them the most.
- Ensure you have a reliable communication channel. Once you have detected that there is a problem, how are you going to communicate with your provider? You’ll need to get timely status updates and determine what actions they are taking to resolve the problem. Some providers will have live phone support; others will have a specific website or page with posted updates; still others might just have a dedicated Twitter feed. Some may use all these channels. Whatever the mechanism is, make sure you understand it and you’re comfortable with the service being offered.
- Management tools can help. Ideally, you’ve designed your app with a redundant, cloud-native architecture, similar to Netflix’s, that can power through an outage without a hiccup. But most apps won’t be ready for that. An alternative is to treat a cloud outage like a traditional enterprise disaster recovery exercise. The goal here is to get a copy of the application up and running in an alternative location (another availability zone, geographic region, cloud provider, or private cloud) quickly and restore service to users as soon as possible. You’ll need to think about database replication so that key data is already in place when the outage occurs. You can use a cloud management platform with application blueprints to quickly spin up an identical copy of the application in the new location. You’ll have the most flexibility if your cloud management platform can deal with multiple public service providers and multiple private cloud technologies, so you’re not limited to just hopping into the next availability zone of the same provider that is having the outage.
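To make the detection and recovery points a little more concrete, here is a minimal Python sketch of the decision logic, not a real monitoring or management product. Everything in it is an assumption made for illustration: the `Location` type, the `is_down` and `choose_failover` helpers, and the idea that each location exposes a probe (a function returning `True` when the application answers). The sketch assumes the probes run from somewhere outside the cloud being monitored, and it prefers a recovery location at a different provider over another zone of the provider that is having trouble.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Location:
    """A place the application can run. Both fields are hypothetical labels."""
    name: str      # e.g. "provider-a/zone-1"
    provider: str  # e.g. "provider-a"


Probe = Callable[[], bool]  # returns True when the application responds


def is_down(probe: Probe, attempts: int = 3) -> bool:
    """Report an outage only if every one of `attempts` probes fails.

    A probe that raises an exception counts as a failed probe.
    """
    for _ in range(attempts):
        try:
            if probe():
                return False
        except Exception:
            pass  # treat errors (timeouts, refused connections) as failures
    return True


def choose_failover(
    failed: Location,
    candidates: Sequence[Location],
    probe_for: Callable[[Location], Probe],
) -> Optional[Location]:
    """Pick a healthy recovery location, preferring a different provider."""
    healthy = [
        loc for loc in candidates
        if loc != failed and not is_down(probe_for(loc))
    ]
    # Stable sort: locations at another provider (key False) come first.
    healthy.sort(key=lambda loc: loc.provider == failed.provider)
    return healthy[0] if healthy else None
```

In practice, `probe_for` would wrap whatever health check your monitoring tool provides, and the location returned by `choose_failover` would be handed to the cloud management platform that deploys the application blueprint there. Requiring several consecutive failed probes is one simple way to avoid failing over on a single dropped request.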
Cloud outages are a fact of life. But they don’t have to be a showstopper for your applications or your users. With some basic planning and preparation, you can weather the storm.
These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.