Disaster recovery concepts: fault tolerance, high availability, backups, and more

Ben Dorman and Nick Chase - May 23, 2022

In today’s world, where it is so easy to become the target (or unintended victim) of a criminal or state-sponsored attack, having a disaster recovery plan is absolutely crucial. This blog explains the basic concepts of disaster recovery that you should understand before creating a disaster recovery plan, with great examples from NASA, AWS, chemical manufacturing, and more.

This content is derived from a recent webinar, “Now’s not the time to ignore Disaster Recovery. Learn what you need to know,” which you can watch on demand.

Featured presenters: Ben Dorman, Global Cloud Architect, and Nick Chase, Director of Technical Marketing 

What constitutes a disaster for disaster recovery?

Ben Dorman: We’re going to start by talking about what a disaster is, or what could possibly go wrong. Some disaster scenarios are very clear cut. If your data center is physically destroyed by a bomb landing on it or a fire in the area, then it’s pretty clear that you have a disaster on your hands, and you need to do something about it. But, more generally, a disaster could be a business continuity break, something that prevents you from doing business for an extended period of time. This could be caused by a network failure, hardware failure, or software failure. The other thing that’s not quite so obvious is that you may have various kinds of data loss. You may lose a database and find that it’s unrecoverable for whatever reason. But if you were to lose some large fraction of your customer data but not all of it, that could also constitute a disaster.

Finally, you’ll see a lot of press about ransomware and other cyber attacks where you aren’t able to function because someone has penetrated your systems and ransomed your data, making it inaccessible to you. Of course, there are other potential cyber attacks, too. What we’re going to do is go through how you determine whether you have to deal with a disaster, and what disaster recovery is and is not.

In particular, we need to understand that there are some politics involved in this, because at some point people have to make a decision as to whether to put the plan into effect. Finally, there are some differences if you are on a cloud-native platform rather than a traditional platform, and we need to discuss those as well.

What is business continuity?

Nick Chase: I think we should start at the beginning and talk about what business continuity is. I mean, ultimately, disaster recovery is part of business continuity itself.

In this webinar, we’re going to mostly talk about the technical aspects of disaster recovery, but a disaster could be something as simple as a key person leaving the company, maybe the one person who really understands all your processes goes and works for your competitor. That could be a disaster. You may want to have a plan for that.

The overall business continuity plan should be looking at every single thing that could go wrong. Again, we are in this case going to focus on disaster recovery for technical issues, but when you’re thinking about that, remember that you need to think about everything that could go wrong.

Ben Dorman: If I could just add to that, Nick, there is an aspect which is part technical and part people-related, which is the order in which you do things, and how you do this on an enterprise basis, not just for recovering one or more specific applications.

Nick Chase: That’s right. Let’s cover the one elephant in the room: backups. I always hear, “Well, I don’t need to worry about disaster recovery because I have backups.” How does that relate, Ben?

Ben Dorman: Well, backups are an element of disaster recovery but they’re not the whole story, of course. Clearly, if you have an environment loss, you need to be able to recover your environment, and you will need a backup to do that. But just having that does not make for disaster recovery. The other thing that defines disaster recovery is that it’s what you have to do if all of these kinds of countermeasures actually fail. We’re going to go through a few definitions, but you’ve heard about high availability and fault tolerance and backups. These are all valid and important strategies, but disaster recovery is what happens when none of that has managed to save you from some break in your business.

Nick Chase: Exactly, when all else fails.

So let’s go ahead and start with some definitions. What is disaster recovery itself? As we said, what is it besides what you do when everything else fails?

What is fault tolerance?

Ben Dorman: Okay, let’s talk quickly about these concepts. I’m going to start with fault tolerance. A fault-tolerant system is one which has a secondary system take over from a primary when that primary fails, and there is no noticeable gap in service. For example, there are heartbeat-based systems which, within a matter of milliseconds, detect that the primary system is not running, and they take over. Fault tolerance usually includes the idea that the state of the current process is maintained as the switch goes from the primary to the secondary, and the users don’t notice, so that’s fault tolerance.
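To make the heartbeat idea concrete, here is a minimal sketch in Python. It only shows the detection step: the secondary decides to take over when no heartbeat has arrived within a timeout. The class name, the function name, and the 50-millisecond timeout are assumptions made for illustration, and a real fault-tolerant system would also have to carry over the process state.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat received from the primary (hypothetical helper)."""
    timeout_s: float          # assumed threshold, e.g. 0.05 for 50 milliseconds
    last_beat: float = 0.0    # timestamp of the most recent heartbeat, in seconds

    def beat(self, now: float) -> None:
        # Called every time the primary reports that it is alive.
        self.last_beat = now

    def primary_alive(self, now: float) -> bool:
        # The primary counts as alive if its last heartbeat is recent enough.
        return (now - self.last_beat) <= self.timeout_s


def active_node(monitor: HeartbeatMonitor, now: float) -> str:
    # The secondary takes over as soon as the heartbeat goes stale.
    return "primary" if monitor.primary_alive(now) else "secondary"


monitor = HeartbeatMonitor(timeout_s=0.05)
monitor.beat(now=10.0)
print(active_node(monitor, now=10.02))  # heartbeat is still fresh
print(active_node(monitor, now=10.2))   # heartbeat is stale, secondary takes over
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch easy to test.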

What is high availability (HA)?

High availability is slightly different. It’s where you have that secondary system ready to kick into action within a short space of time, but the actual process data may be lost, and there may be a noticeable but short gap in whatever is in service. An example of that would be systems that had clustering. You may remember the Veritas clustering system; I believe they’re still around. I used to be involved in systems where, if the primary went down, the disk mounted on the secondary side and came back up within a minute or two. So that was a highly available system. With second servers on both sides, you didn’t really see too much interruption in service unless you had been doing something when it went down. It was not fault tolerant; it was highly available.

What is a backup?

A backup is a system which, obviously, allows you to recover a faulty database or, in these days of infrastructure-as-code, recover all the applications exactly as they were. That usually takes a space of time and has some interruption. This, obviously, is absolutely essential for recovering a system which has failed, that you have an alternative.

Nick Chase: All right. Those are kind of the basics. Let’s get into some of the more specific things.

What is a failover site?

Ben Dorman: First of all, the most common and nearly universal way of dealing with disaster recovery is having a failover site. It’s not the only strategy. If you remember watching NASA launch space shuttles, they had systems where you have multiple computers that all make the same calculation. If one of them is discrepant from the others, it gets ignored. That’s kind of like a voting system, but that’s incredibly expensive. If you have several billion dollars and several human lives at stake, you might want to do it that way. Most companies are not in that situation, so they have a failover site which the infrastructure switches to after some decisions are made and some infrastructure is modified in order to make the business come back up again as soon as possible.
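The voting idea can be sketched in a few lines of Python. This is only an illustration of majority voting among redundant results, not a description of how NASA’s systems were actually implemented; the function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(results):
    """Return the value that a strict majority of redundant computers agree on.

    A discrepant result is simply outvoted. If no value has a strict
    majority, the function raises instead of guessing.
    """
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise ValueError("no majority among redundant results")
    return value


# Three computers make the same calculation; one is discrepant and is ignored.
print(majority_vote([42, 42, 41]))
```

The cost shows up in the input list: every extra vote is another complete computer doing the same work.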

What is active-active?

An active-active system is one where your so-called secondary is active all the time. Although it sounds like that’s not really disaster recovery because it’s active all the time, in fact, there are some trade offs because in practice, the secondary is not as expensive in general, or as capable as the primary, and is not the primary source of the data. You need to recover from it after that switch, so it’s not completely seamless when you have an active-active system in general, unless you have complete parity, in which case you are running into some additional expense. 

What is active-passive?

An active-passive system is when the secondary is actually offline and can be brought up quite quickly, sometimes in a matter of seconds. Active-passive systems are also used when the process really requires that there is only one consumer for the data, and so you can’t have an active-active system.

Cold and warm standbys refer to the situation where the secondary is either switched off and reboots, or, in the case of warm, it’s there ready to go as soon as its traffic is directed to it. 

What is a recovery time objective (RTO)?

Recovery time objectives and recovery point objectives. Recovery time, by definition, is how long it takes to get recovered after it’s declared to be a disaster. The idea here is that when you set up your disaster recovery system, you state that as a criterion, that we design our disaster recovery system to be able to recover in three or four hours, or whatever the answer is. We heard recently of an organization that has a chemical manufacturing plant where that recovery time objective is three seconds, so it just depends ‒ 

Nick Chase: That’s crazy.

Ben Dorman: It is, isn’t it? But the situation was that the company in question was running some chemical manufacturing. If something really nasty happened, those chemicals could burn through stuff, and so you really needed to get back up really quickly.

Nick Chase: That’s when you are spending the money to have all the standbys you need.

Ben Dorman: Yeah, we point this out because this can be no joke, that the system really needs to be as close to fault tolerant as possible. In the case of a real disaster where your usual fault tolerance mechanisms have failed, you still have, because of the nature of the problem, some onus to recover very quickly.
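Because the recovery time objective is measured from the moment the disaster is declared, checking a recovery exercise against it is simple arithmetic. Here is a small Python sketch; the function names and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def recovery_time(declared_at: datetime, recovered_at: datetime) -> timedelta:
    # Recovery time runs from the disaster declaration to service restoration.
    if recovered_at < declared_at:
        raise ValueError("recovery cannot precede the declaration")
    return recovered_at - declared_at


def met_rto(declared_at: datetime, recovered_at: datetime, rto: timedelta) -> bool:
    # The objective is met when the actual recovery time does not exceed the RTO.
    return recovery_time(declared_at, recovered_at) <= rto


# Hypothetical drill: declared at 10:00, service restored at 13:30.
declared = datetime(2022, 5, 23, 10, 0)
recovered = datetime(2022, 5, 23, 13, 30)
print(met_rto(declared, recovered, rto=timedelta(hours=4)))    # within a four-hour RTO
print(met_rto(declared, recovered, rto=timedelta(seconds=3)))  # far outside a three-second RTO
```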

What is a recovery point objective (RPO)?

Moving on to recovery point objective. Every organization runs on data, and in order to make sure we don’t lose data, you use some form of replication. There are basically two kinds of replication: synchronous and asynchronous. In synchronous replication, the secondary is constantly being updated. That is, as part of any transaction, the secondary is updated. In fact, the primary doesn’t continue until the secondary data has been saved. Of course, there are some trade offs there because that can produce latency, as we’ll see in the next discussion. But, nevertheless, it can be so critical to ensure that you have no loss of data, that you have synchronous replication, and of course that is the most expensive by networking, by hardware. 

And then there is asynchronous replication, where that feed between the primary and secondary goes periodically, and the period of that depends on, again, what the recovery point objective is, that is how much data you can afford to lose. If you ask any business, they’ll say, “Oh no, we can’t afford to lose any data at all,” but in fact, the best rejoinder there is, “Okay, so how much are you prepared to pay for having a system that can’t lose data?” Generally speaking, there is a happy medium between what you’re prepared to pay and what you can accept.
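The difference between the two kinds of replication can be shown with a toy in-memory model in Python. This is a sketch under simplifying assumptions (no network, no failures, and hypothetical class and method names), not a real replication protocol. In synchronous mode a write does not return until the secondary has saved it. In asynchronous mode writes queue up until the next periodic flush, and whatever is still queued is what you would lose if the primary were destroyed.

```python
class Secondary:
    """Toy secondary site that just stores replicated key/value pairs."""

    def __init__(self):
        self.data = {}


class Primary:
    """Toy primary that replicates either synchronously or asynchronously."""

    def __init__(self, secondary: Secondary, synchronous: bool):
        self.data = {}
        self.secondary = secondary
        self.synchronous = synchronous
        self.pending = []  # writes not yet shipped to the secondary

    def write(self, key, value) -> None:
        self.data[key] = value
        if self.synchronous:
            # Synchronous: the secondary is updated as part of the transaction.
            self.secondary.data[key] = value
        else:
            # Asynchronous: the write is queued until the next periodic flush.
            self.pending.append((key, value))

    def flush(self) -> None:
        # Periodic feed from primary to secondary (asynchronous mode).
        for key, value in self.pending:
            self.secondary.data[key] = value
        self.pending.clear()

    def writes_at_risk(self) -> int:
        # Data that would be lost if the primary were destroyed right now.
        return len(self.pending)


sync_site = Primary(Secondary(), synchronous=True)
sync_site.write("order-1", "paid")
print(sync_site.writes_at_risk())   # nothing is waiting to be replicated

async_site = Primary(Secondary(), synchronous=False)
async_site.write("order-1", "paid")
print(async_site.writes_at_risk())  # one write is exposed until the next flush
```

The period between calls to `flush` is the knob that the recovery point objective controls: the longer it is, the more writes are at risk at any moment.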

Nick Chase: Just to be clear, if your recovery point objective is you’re willing to lose, say, six hours worth of data, then you need to have backups at least every six hours.

Ben Dorman: Exactly.
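That rule of thumb is easy to check against an actual backup schedule: the worst case is a failure just before the next backup, so the largest gap between consecutive backups is the most data you can lose. A short Python sketch, with hypothetical function names and made-up timestamps:

```python
from datetime import datetime, timedelta


def worst_case_loss(backup_times):
    """Largest gap between consecutive backups, i.e. the most data you could lose."""
    ordered = sorted(backup_times)
    if len(ordered) < 2:
        raise ValueError("need at least two backups to measure a gap")
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def meets_rpo(backup_times, rpo: timedelta) -> bool:
    # The schedule meets the RPO if no gap between backups exceeds it.
    return worst_case_loss(backup_times) <= rpo


day = datetime(2022, 5, 23)
every_six_hours = [day + timedelta(hours=h) for h in (0, 6, 12, 18)]
with_a_gap = [day + timedelta(hours=h) for h in (0, 6, 12, 20)]

print(meets_rpo(every_six_hours, rpo=timedelta(hours=6)))  # gaps never exceed six hours
print(meets_rpo(with_a_gap, rpo=timedelta(hours=6)))       # one eight-hour gap breaks the RPO
```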

What is shared-nothing infrastructure?

Shared-nothing infrastructure is a concept that’s actually quite obvious. Clearly, if your primary data center is destroyed, if your secondary shares anything at all with it, it is also destroyed. So shared-nothing means that the failover site has no dependency on anything that the primary site does. This goes anywhere from the situation where you have the failover site in the same data center but on a different rack, but that’s not good enough if they share a power source or a network, or both. It’s not good enough sometimes if the failover site is on the same power grid and wide area network. Generally speaking, when you say shared-nothing, you’re talking about data centers that are physically separated by some number of miles, or kilometers. 

But, there is also a constraint on that, of course, because there is network latency. For all the other things that we just talked about in order to have data replication or seamless networking between the two data centers, shared-nothing still requires you to have them not separated by too much. You can imagine two districts in the same city which have different power grids, different network providers, and so on and so forth, in order to make this failover site robust against the primary failure.
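One way to reason about shared-nothing is to write down what each site depends on and look for overlaps. The Python sketch below does exactly that with plain dictionaries; the dependency kinds and provider names are invented for illustration, and a real inventory would be much longer.

```python
def shared_dependencies(primary: dict, failover: dict) -> set:
    """Return the kinds of dependency that both sites get from the same provider."""
    return {
        kind
        for kind in primary.keys() & failover.keys()
        if primary[kind] == failover[kind]
    }


# Hypothetical inventory: dependency kind -> provider.
primary_site = {"power_grid": "grid-a", "network": "isp-1", "dns": "dns-x"}
failover_site = {"power_grid": "grid-b", "network": "isp-2", "dns": "dns-x"}

# Anything printed here is a single point of failure for both sites.
print(shared_dependencies(primary_site, failover_site))
```

An empty result only means that nothing you listed is shared; a dependency that is missing from the inventory can still take both sites down.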

Nick Chase: You need to be careful with that because I don’t know if you’ll remember, but there was an Amazon Web Services outage a while back, a couple of months ago, where so many things went down. Even for the people who took the care to put their failover sites in other regions, figuring that that would be safe, it turned out that there was one component over at Amazon Web Services that was shared between these things; it was DNS or something like that. It took everything down for some time until they could fix it, so you have to be careful.

Ben Dorman: If you read Amazon’s documentation even not terribly carefully, you see that they say, “We are responsible for the physical infrastructure. We have our sentries in front of our data centers. We synch up the power and so on, and you are responsible for outages caused by your software,” and so on and so forth. But, in practice, if Amazon does have such a failure, it’s still on you.

Nick Chase: Yes, that’s right.

Ben Dorman: It’s still your business that’s going to suffer.

Thanks for reading! You can watch the full webinar replay here.


Ben Dorman and Nick Chase

Ben Dorman is a Global Cloud Architect at Mirantis, and Nick Chase is Director of Technical Marketing at Mirantis.