How to avoid getting clobbered when your cloud host goes down
"No problem," I thought. "Everybody has glitches once in a while." So I decided I'd work on a different piece of content, and pulled up another browser window for the project management system we use to get the URL. The servers, I was told, were "receiving some TLC."
OK, what about that mailing list task I was going to take care of? Nope, that was down too.
As you probably know by now, all of these problems were due to a failure in one of Amazon Web Services' S3 storage data centers. According to the BBC, the outage even affected sites as large as Netflix, Spotify, and Airbnb.
Now, you may think I'm writing this to gloat -- after all, here at Mirantis we obviously talk a lot about OpenStack, and one of the things we often hear is "Oh, private cloud is too unreliable" -- but I'm not.
The thing is, public cloud isn't any more or less reliable than private cloud; it's just that you're not the one responsible for keeping it up and running.
And therein lies the problem.
If AWS S3 goes down, there is precisely zero you can do about it. Oh, it's not that there's nothing you can do to keep your application up; that's a different matter, which we'll get to in a moment. But there's nothing that you can do to get S3 (or EC2, Google Compute Engine, or whatever public cloud service we're talking about) back up and running. Chances are you won't even know there's an issue until it starts to affect you -- and your customers.
A while back my colleague Amar Kapadia compared the costs of a DIY private cloud with those of a vendor distribution and of a managed cloud service. In that calculation, he included the cost of downtime as part of the cost of DIY and vendor distribution-based private clouds. But really, as yesterday proved, no cloud -- even one operated by the largest public cloud provider in the world -- is beyond downtime. It's all in what you do about it.
So what can you do about it?
Have you heard the expression, "The best defense is a good offense"? Well, it’s true for cloud operations too. In an ideal situation, you will know exactly what's going on in your cloud at all times, and take action to solve problems BEFORE they happen. You'd want to know that the error rate for your storage is trending upwards before the data center fails, so you can troubleshoot and solve the problem. You'd want to know that a server is running slow so you can find out why and potentially replace it before it dies on you, possibly taking critical workloads with it.
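As a rough illustration of spotting a rising error rate before it turns into an outage, here is a minimal Python sketch. It assumes you already collect error-rate samples at regular intervals from whatever monitoring you use; the function names and the `min_slope` threshold are hypothetical and would need tuning for a real system.

```python
from statistics import mean


def error_rate_slope(samples):
    """Least-squares slope of the error rate per sampling interval."""
    n = len(samples)
    if n < 2:
        return 0.0  # not enough data to talk about a trend
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(samples)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def is_trending_up(samples, min_slope=0.001):
    """True if the error rate is climbing faster than min_slope per interval.

    min_slope is an assumed threshold chosen for illustration only.
    """
    return error_rate_slope(samples) > min_slope


# Hypothetical storage error rates (fraction of failed requests per interval).
recent = [0.010, 0.012, 0.015, 0.020, 0.030]
if is_trending_up(recent):
    print("Storage error rate is climbing; investigate before it fails.")
```

Fitting a line through a window of samples, rather than alerting on a single bad reading, is one simple way to catch a slow deterioration without reacting to every blip.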
And while we're at it, true cloud applications should be able to weather the storm of a dying hypervisor or even a storage failure; they are designed to be fault-tolerant. Pure play open cloud is about building your cloud and applications so that they're not even vulnerable to the failure of a data center.
But what does that mean?
What is Pure Play Open Cloud?
You'll be hearing a lot more about Pure Play Open Cloud in the coming months, but for the purposes of our discussion, it means the following:
Cloud-based infrastructure that's agnostic to the hardware and underlying data center (so it can run anywhere); based on open source software such as OpenStack, Kubernetes, and Ceph, plus networking software such as OpenContrail (so that there's no vendor lock-in, and you can move it between a hosted environment and your own); and managed as infrastructure-as-code, using CI/CD pipelines and so on, to enable reliability and scale.
Well, that's a mouthful! What does it mean in practice?
It means that the ideal situation is one in which you:
- Are not dependent on a single vendor or cloud
- Can react quickly to technical problems
- Have visibility into the underlying cloud
- Have support (and help) fixing issues before they become problems
Not being dependent on a single vendor or cloud
Part of the impetus behind the development of OpenStack was the realization that while Amazon Web Services enabled a whole new way of working, it had one major flaw: complete dependence on AWS.
The problems here were both technological and financial. AWS makes a point of trying to bring prices down overall, but the bigger you grow, the more your costs grow with you; there's just no way around that. And once you've decided that you need to do something else, if your entire infrastructure is built around AWS products and APIs, you're stuck.
A better situation would be to build your infrastructure and application in such a way that they're agnostic to the hardware and underlying infrastructure. If your application doesn't care whether it's running on AWS or OpenStack, then you can create an OpenStack infrastructure that serves as the base for your application, and use external resources such as AWS or GCE for emergency scaling -- or for damage control when something fails.
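One way to picture an application that "doesn't care" where it runs is to have it talk to a small interface of your own rather than to one provider's API directly. The Python sketch below is only an illustration: `ObjectStore`, `InMemoryStore`, and `publish_article` are hypothetical names, and the in-memory class stands in for real backends (S3, Swift, Ceph, and so on) that you would wrap behind the same two methods.

```python
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Hypothetical minimal storage interface the application codes against."""

    @abstractmethod
    def put(self, key, data):
        """Store bytes under key."""

    @abstractmethod
    def get(self, key):
        """Return the bytes stored under key."""


class InMemoryStore(ObjectStore):
    """Stand-in backend for this sketch; a real one would wrap a cloud SDK."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


def publish_article(store, slug, body):
    """Application code: knows only the ObjectStore interface, not the vendor."""
    key = f"articles/{slug}.txt"
    store.put(key, body.encode("utf-8"))
    return key


# The same application code runs unchanged against either backend.
private_cloud = InMemoryStore("private-cloud")
public_cloud = InMemoryStore("public-cloud")
for store in (private_cloud, public_cloud):
    publish_article(store, "cloud-outage", "How to avoid getting clobbered")
```

The trade-off is that you limit yourself to features every backend can provide, but in exchange, switching or adding a provider means writing one new wrapper class rather than rewriting the application.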
Reacting quickly to technical problems
In an ideal world, nobody would have been affected by the outage in AWS S3's us-east-1 region, because their applications would have been architected with a presence in multiple regions. That's what regions are for. Rarely, however, does this happen.
Build your applications so that they have -- or at the very least, CAN have -- a presence in multiple locations. Ideally, they're spread out by default, so if there's a problem in one "place", the application keeps running. This redundancy can get expensive, though, so the next best thing would be to have it detect a problem and switch over to a fail-safe or alternate region in case of emergency. At the bare minimum, you should be able to manually change over to a different option once a problem has been detected.
Preferably, this would happen before the situation becomes critical.
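The "detect a problem and switch over" option can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pattern: `call_with_failover` and `fake_fetch` are hypothetical names, the second region name is just an example, and a real implementation would also need timeouts, retries, and data that is actually replicated to the alternate region.

```python
class AllRegionsFailed(Exception):
    """Raised when every configured region has failed."""


def call_with_failover(regions, request):
    """Try request(region) for each region in order.

    Returns (region, result) from the first region that succeeds.
    """
    errors = {}
    for region in regions:
        try:
            return region, request(region)
        except Exception as exc:  # deliberately broad in this sketch
            errors[region] = exc
    raise AllRegionsFailed(errors)


def fake_fetch(region):
    """Simulated request: pretend the primary region is down."""
    if region == "us-east-1":
        raise ConnectionError("simulated outage")
    return f"served from {region}"


# Primary region first, alternate second (region list is an assumption).
region, response = call_with_failover(["us-east-1", "us-west-2"], fake_fetch)
print(region, response)
```

Keeping the region list as plain data also covers the bare-minimum case from above: if automatic detection isn't in place yet, changing over manually is a matter of reordering one list.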
Having visibility into the underlying cloud
Having visibility into the underlying cloud is one area where private or managed cloud definitely has the advantage over public cloud. After all, one of the basic tenets of cloud is that you don't necessarily care about the specific hardware running your application, which is fine -- unless you're responsible for keeping it running.
In that case, using tools such as StackLight (for OpenStack) or Prometheus (for Kubernetes) can give you insight into what's going on under the covers. You can see whether a problem is brewing, and if it is, you can troubleshoot to determine whether the problem is the cloud itself, or the applications running on it.
Once you determine that you do have a problem with your cloud (as opposed to the applications running on it), you can take action immediately.
Support (and help) fixing issues before they become problems
Preventing and fixing problems is, for many people, where the rubber hits the road. With a serious shortage of cloud experts, many companies are nervous about trusting their cloud to their own internal people.
It doesn't have to be that way.
While it would seem like the least expensive way of getting into cloud is the "do it yourself" approach -- after all, the software's free, right? -- long term, that's not necessarily true.
The traditional answer is to use a vendor distribution and purchase support, and that's definitely a viable option.
A second option that's becoming more common is the notion of "managed cloud." In this situation, your cloud may or may not be on your premises, but the important part is that it's overseen by experts who know the signs to look for and are able to make sure that your cloud maintains a certain SLA -- without taking away your control.
For example, Mirantis Managed OpenStack is a service that monitors your cloud 24/7 and can literally fix problems before they happen. It involves remote monitoring, a CI/CD infrastructure, KPI reporting, and even operational support, if necessary. But Mirantis Managed OpenStack is built around the notion of Build-Operate-Transfer: everything is based on open standards, so you're not locked in. When you're ready, you can transition to a lower level of support -- or even take over entirely, if you want.
What matters is that you have help that keeps you running without keeping you trapped.