Now's not the time to ignore Disaster Recovery — Your questions answered!

Ben Dorman and Nick Chase - May 26, 2022
More and more organizations are beginning to build their disaster recovery plans. In this blog, experts Ben Dorman and Nick Chase provide answers to your pressing questions about disaster recovery and how to build a disaster recovery plan.

The content of this blog is based on our recent webinar, Now's not the time to ignore Disaster Recovery. We didn't have time to address all the audience questions during the broadcast, so we've posted all the answers here.

You can watch the full webinar replay here.

When does the recovery time start running?

Ben Dorman: That would start running when you have declared that you have a disaster because, clearly, before that time, you don't even know that you're going to put the plan in effect.

Nick Chase: Yeah, and that's important to understand. It's like if the power goes out at 10:00 and you don't declare a disaster until 10:45, that's when it starts. It doesn't start at 10:00.

Ben Dorman: When you create the plan, you state the recovery time objective (RTO). That's the key thing to realize. The recovery time objective and the recovery point objective (RPO) are criteria. They're like the SLA (service-level agreement) for the plan. Clearly, as soon as you put the plan into effect, the clock starts ticking.
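To make the timing concrete, here is a minimal sketch in Python of the 10:00 / 10:45 example above. The 4-hour RTO and the function name are assumptions chosen for illustration; neither comes from the webinar.

```python
from datetime import datetime, timedelta


def rto_deadline(declared_at: datetime, rto: timedelta) -> datetime:
    """Return the time by which service must be restored.

    The clock starts when the disaster is declared, not when the outage began.
    """
    return declared_at + rto


# Timeline from the example: power fails at 10:00, disaster declared at 10:45.
outage_began = datetime(2022, 5, 26, 10, 0)
declared_at = datetime(2022, 5, 26, 10, 45)
rto = timedelta(hours=4)  # assumed value for illustration

deadline = rto_deadline(declared_at, rto)
print("Restore by:", deadline.strftime("%H:%M"))
print("Total downtime the business may see:", deadline - outage_began)
```

The second figure is the one the business actually feels: the time between the outage and the declaration is added on top of the RTO, which is one reason to decide in advance who is allowed to declare a disaster.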

If you have a 6 hour RPO, doesn't the backup have to be created and get sent offsite in under 6 hours so it can be protected? Probably means the interval is less than 6 hours.

Ben Dorman: Most of this is going to happen electronically nowadays, so if you're not dealing with physical media, there is probably no particular benefit in reducing the frequency. Depending on how much data you have, you could start one backup; as soon as that backup is finished, start the next one and get it ready for the next instance. Without nitpicking the details, your recovery is going to depend, to some extent, on how long it takes to actually back up the data. There is going to be a minimum time to do that. If you've got petabytes of data, it's going to take a while if you have to back it up in full.

Generally speaking, you're going to deal with some incremental system that depends on data rates and other things like that. There are a lot of technical issues here. Anticipating another question, in order to understand how far apart your data centers can be, you need to know something about the network bandwidth and the type of hardware you have connecting, and who is running that. These are decisions that have to be made in consultation with the network provider and the hardware engineers.
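The questioner's arithmetic can be sketched as follows. This is a simplified model and not a sizing method: it assumes each backup captures the state at the moment it starts, and that it only counts once it has been fully created and copied offsite. All names and figures are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(interval, backup_duration, offsite_transfer):
    """Worst case: disaster strikes just before the newest backup lands offsite,
    so the previous backup (one interval older) is the latest usable one."""
    return interval + backup_duration + offsite_transfer


def max_backup_interval(rpo, backup_duration, offsite_transfer):
    """Longest interval between backup starts that still meets the RPO."""
    slack = rpo - backup_duration - offsite_transfer
    if slack <= timedelta(0):
        raise ValueError("RPO cannot be met: backup plus transfer already uses it up")
    return slack


def full_backup_hours(size_bytes, gbit_per_s):
    """Rough time to move a full backup over a link at a sustained rate."""
    return size_bytes * 8 / (gbit_per_s * 1e9) / 3600


# Assumed figures: 6-hour RPO, 1 hour to create a backup, 30 minutes to ship it.
rpo = timedelta(hours=6)
interval = max_backup_interval(rpo, timedelta(hours=1), timedelta(minutes=30))
print("Start a backup at least every:", interval)

# One petabyte over a sustained 10 Gbit/s link (assumed figures).
print("Full backup of 1 PB takes about", round(full_backup_hours(1e15, 10)), "hours")
```

Under these assumptions the interval does have to be shorter than the RPO, and a full copy of petabyte-scale data takes days, which is why incremental schemes come up next.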

Nick Chase: Even if you have a large system and you use incremental backups, you are also going to want to make sure that you periodically take full backups, because now we're in a situation where we have ransomware that isn't necessarily noticed right away. We have these advanced, persistent threats, and you may wind up in a situation where your files are corrupted, and then they get backed up, and then your backup is also corrupted. You need to be able to go back to a previous state at some point in order to eliminate that, and you need to take that into consideration when you're making all of these plans, because that's going to be affected by your recovery point objective, as well.
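As a rough illustration of why older full backups need to be retained, the sketch below picks the set of backups to restore from once you learn when the corruption began. The `Backup` class, the schedule, and the dates are hypothetical; a real backup tool would expose this differently.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    kind: str  # "full" or "incremental"


def restore_chain(backups, corrupted_since):
    """Return the newest full backup taken before the corruption began,
    followed by the clean incrementals taken after it."""
    clean = sorted(
        (b for b in backups if b.taken_at < corrupted_since),
        key=lambda b: b.taken_at,
    )
    full_positions = [i for i, b in enumerate(clean) if b.kind == "full"]
    if not full_positions:
        raise LookupError("no full backup predates the corruption")
    return clean[full_positions[-1]:]


# Assumed schedule: a full backup on days 1 and 8, incrementals on other days.
backups = [
    Backup(datetime(2022, 5, day, 1, 0), "full" if day in (1, 8) else "incremental")
    for day in range(1, 11)
]

# Ransomware is noticed on day 10 but traced back to midday on day 6.
chain = restore_chain(backups, corrupted_since=datetime(2022, 5, 6, 12, 0))
for b in chain:
    print(b.taken_at.date(), b.kind)
```

In this scenario the most recent full backup (day 8) is already corrupted, so recovery depends on the day 1 full backup still being available. Retention periods therefore need to be at least as long as the time a threat can plausibly go unnoticed.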

Can you recommend us any paper which analyzes the optimum distance between the main datacenter and the disaster recovery one?

Ben Dorman: Just like any other piece of analysis, you need to understand the factors which affect your latency, if latency is the main issue, which it really is. We used to talk about messaging systems that were fault-tolerant, how far apart could the failover site be in order to make sure that you don't lose any messages through latency. It depends on networks, it depends on hardware ‒

Nick Chase: Yeah, and some database clustering systems require extremely low latency or they won't work at all.

Ben Dorman: Right, and the other thing is that some systems are not CPU-bound but I/O-bound, and so it depends very much on the physical hardware of the storage medium. This is less of an issue now, with solid-state storage rather than spinning disks. But, nevertheless, these are the factors you have to consider.

Nick Chase: Yes, and like you said, basically every situation is different.
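One factor can at least be bounded with simple physics. The sketch below estimates the minimum round-trip time that distance alone imposes, assuming light travels through optical fiber at roughly 200,000 km/s (about two thirds of its speed in a vacuum). It ignores routing detours, switching, and storage delays, so real latency will be higher. The function names and the 5 ms budget are assumptions.

```python
# Assumption: signal speed in optical fiber is about 200,000 km/s, i.e. 200 km/ms.
FIBER_KM_PER_MS = 200.0


def min_round_trip_ms(fiber_km):
    """Lower bound on round-trip time from propagation delay alone."""
    return 2 * fiber_km / FIBER_KM_PER_MS


def max_fiber_km(rtt_budget_ms):
    """Upper bound on fiber path length for a given round-trip latency budget."""
    return rtt_budget_ms * FIBER_KM_PER_MS / 2


print(min_round_trip_ms(100))  # at least 1.0 ms for 100 km of fiber
print(max_fiber_km(5))         # at most 500.0 km for a 5 ms budget
```

If a clustering system or synchronous replication scheme states a maximum round-trip latency, a bound like this gives a ceiling on site separation. The usable distance will be shorter once the network provider's actual path and the hardware delays are measured.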

Note: Below are some papers you may want to review. Bear in mind that they are 20 years old, the technical feasibility of their recommendations is controversial, and risks attributable to climate change have increased since then.

I develop a client's quantified DB for their energy infrastructure and their detailed plan for sustainable energy - how valuable would this blueprint data be for a client worried about DRP?

Nick Chase: Basically any time you have a situation where you have a database or an infrastructure, anything that is necessary for the business, anything that if it goes bad, the business will stop, you need that detailed plan for disaster recovery. You can't just say, "Well, we're just going to bring everything back up." You have to say exactly what you're going to bring up, the order in which you're going to bring it back up, how you're going to bring it back up, and so on. I hope that answers your question. Ben, did you have anything to add to that?

Ben Dorman: The basic idea is to go back to what the function of the plan is. I need to have a group of trained people execute the task of recovering, failing over the data and the application in order to get it up and running. Anything that you have that will assist in doing that is useful. What is most likely, however, is that the team is going to want a set of bullet points or flowcharts, implicit in that kind of documentation, that will allow them to act, because they're not going to have a lot of time to think.
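The point about stating exactly what comes back up, and in which order, can be sketched as a dependency map that is turned into a numbered checklist. The system names and dependencies here are hypothetical; `graphlib` is part of the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical systems, each mapped to the systems that must be up first.
DEPENDS_ON = {
    "network": set(),
    "storage": {"network"},
    "identity": {"network"},
    "database": {"storage"},
    "application": {"database", "identity"},
    "web_frontend": {"application"},
}


def recovery_order(depends_on):
    """Return a bring-up order in which every dependency precedes its dependents.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Print the order as the kind of numbered checklist a recovery team can follow.
for step, system in enumerate(recovery_order(DEPENDS_ON), start=1):
    print(f"{step}. {system}")
```

Keeping the dependencies as data rather than prose means the checklist can be regenerated whenever a system is added, and a circular dependency shows up as an error during planning instead of during a recovery.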

Nick Chase: Exactly. Let's talk about time to think for just a minute, because somebody just asked the question, "Should a Disaster Recovery Plan for an organization consider requirements set out by regulations and cyber insurance policies?" The issue of regulations is part of this because in Europe, if ATMs go down, banks have something like 48 hours to get those ATMs back up. The regulators don't care what's going on around them; that is very specifically defined by those regulations, so you need to take those regulations into consideration when you're building that plan.

Should a Disaster Recovery Plan for an organization consider requirements set out by regulations and cyber insurance policies?

Ben Dorman: The answer is "yes". Regulatory agencies to which you are subject will tell you if there are specific timelines or capabilities you are going to have to meet. For example, if you are a German bank, you may need to have your ATMs up within 48 hours of an outage in order to keep your license.

Would information from this webinar be a good guidance if a company is planning to do a DR Tabletop Exercise?

Nick Chase: Yes, all of this is good guidance for a DR tabletop exercise. The best guidance that I can give for a disaster recovery tabletop exercise is to make sure you do one.

Ben Dorman: Right. The first time I ever encountered this issue was about 17 or 18 years ago, when we were told to provision hardware for a disaster recovery environment, and so we told them which servers to buy, or whatever. Then we realized that the organization didn't have a clue about all the other things that we had mentioned, such as which order you exercise these things in, and who actually gets to make the decisions. There was no plan. Subsequently, as a result of considering that, the company did actually put together a plan and they also exercised it. So awareness is really important at the highest levels of the company that they need to produce a plan.

Just anticipating another question: we can chat separately about what a non-functional requirement is, but the key thing is this. When you do an application architecture, you have to look at the enterprise as a whole, as well as at what you need to do for a particular application. If you don't, you're not going to succeed, for the simple reason that in a disaster the whole enterprise is going down; therefore, you have to figure out how to bring the whole enterprise up, as well as how to bring up a specific application.

Is Disaster Recovery typically addressed as part of NFRs (non-functional requirements) or separately?

Ben Dorman: Disaster Recovery defines the qualities you need to ensure are in place after an incident. These can include both functional requirements (what a system does) and non-functional requirements (how a system is).

Another perspective is that of The Open Group, which encourages organizations to think in terms of Service Qualities as an alternative to NFRs. In this view, disaster recovery is one of the qualities of the Enterprise Architecture to be built into the platform when it is established, and specific applications are built to use an appropriate level of the disaster recovery "architecture" depending on their criticality and urgency. This is perhaps a more comprehensive approach, one that applies to all development within an enterprise.

What tool do you recommend for Business Impact Analysis to start the process?

Ben Dorman: There are a number of good tools that you can use for a Business Impact Analysis (BIA), depending on the size and complexity of your organization. Ready.gov gives you a good basic set of tools and worksheets, or, if you want to go the full monty, so to speak, you should consider purchasing the ISO 22317 specification.

You mentioned having a well-written disaster plan with device priority is critical to maintain business continuity. Could you propose a template that we could use instead of starting from scratch as there are hundreds of templates available online?

Ben Dorman: See the answer to the question above (Business Impact Analysis). Also, we stress the usability of the DR documentation, whose format will therefore depend somewhat on the internal workings of the organization.

Additionally, there are IT organizations with significant experience in this area who can be hired to create a plan for an organization. For example, this organization has well-developed materials available for download.

Thanks for reading! You can watch a full replay of the webinar here.


Ben Dorman and Nick Chase

Ben Dorman is a Global Cloud Architect at Mirantis. Nick Chase is Director of Technical Marketing at Mirantis.