Disaster Recovery Plan & Strategies
By Raaj Prasad
02/14/2024


Today’s enterprises depend heavily on technology to store, process, and transmit critical data and information in order to deliver their products and services to their customers. Any disruption to the technology infrastructure or the flow of data presents a risk to normal business operations. Such disruptions can occur due to natural disasters, power outages, or cyber attacks.

To minimize the impact of these incidents and keep the business running relatively smoothly, it’s critical to have a disaster recovery plan (DRP) in place.

A disaster recovery plan (DRP) is a documented and systematic approach to responding to and recovering from unexpected events that could affect critical systems and data. Its primary purpose is to minimize the downtime and disruptions caused by disasters and restore normal operations as quickly and efficiently as possible. This plan is focused on the IT infrastructure and includes details on data backup and recovery, communication protocols, and disaster response teams.

On the other hand, a business continuity plan (BCP) is a more comprehensive plan that covers all aspects of your business operations in the event of a disruption. This plan includes strategies for maintaining customer service, managing supply chains, and ensuring the safety of your employees. A BCP is designed to keep your business running smoothly even during a crisis, minimizing the impact on your customers, employees, and stakeholders.

In this post, we first note the differences between the two but focus more on the DRP.

Differences between DRP and BCP

Though sometimes used interchangeably and often in the same breath, DRP and BCP are two distinct strategies with unique objectives.

Firstly, a DRP is a set of procedures and protocols that an organization would follow to recover its IT infrastructure and data after a disaster or a disruptive event. The primary objective of a DRP is to minimize data loss and downtime, restore critical systems, and resume operations as quickly as possible.

On the other hand, a BCP is a broader strategy that focuses on the continuity of critical business operations and services during and after a disruptive event. The primary objective of a BCP is to ensure the organization’s survival and resilience by minimizing the impact of the disruption.

While DRP and BCP have different objectives, they are complementary strategies that work together to ensure business continuity. In fact, a DRP is an essential component of a BCP, and it is often incorporated into the overall BCP framework.

Steps to create a DRP

Creating a DRP involves several critical steps that organizations must follow to ensure a comprehensive and effective response plan. Below are some of the key steps:

  1. The first step is to identify the critical systems and data that are vital to the organization’s operations. This includes hardware, software, applications, data centers, and other infrastructure.
  2. Determine recovery time objectives (RTOs) and recovery point objectives (RPOs). RTOs and RPOs are two critical metrics that define the recovery objectives for the organization. RTOs define the acceptable downtime for each system or application, while RPOs define the acceptable data loss for each system or application.
  3. Develop procedures for data backup and restoration. Once the critical systems and data are identified, the next step is to create procedures for data backup and restoration. This includes defining backup schedules, storage locations, and restoration processes.
  4. Create a communication plan, which defines how the organization communicates with employees, stakeholders, customers, and vendors during and after a disaster. It should include contact information, escalation paths, and alternative communication channels. Keeping the affected customers informed is key to preserving customer confidence. If the main site is down, this may be accomplished through a lightweight DR communication site spun up with the appropriate DNS redirection to reach it.
  5. Finally, test and update the DRP regularly to ensure its effectiveness and relevance as conditions change and threats evolve. This includes conducting drills and simulations to identify gaps and improve the response plan.
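As a rough illustration of steps 2, 3, and 5, the Python sketch below checks a backup schedule and a drill result against a system's RTO and RPO. The class, field names, and sample values are hypothetical and do not come from any particular tool. The sketch assumes the worst case, in which a failure occurs just before the next backup, so the potential data loss equals the full backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class SystemObjectives:
    """Recovery objectives and observed figures for one system (hypothetical)."""
    name: str
    rto: timedelta                 # acceptable downtime
    rpo: timedelta                 # acceptable window of data loss
    backup_interval: timedelta     # time between consecutive backups
    drill_restore_time: timedelta  # restore duration measured in the last drill


def check_objectives(system: SystemObjectives) -> list[str]:
    """Return findings; an empty list means both objectives are met."""
    findings = []
    # Worst case: the failure happens just before the next backup,
    # so everything written since the previous backup is lost.
    if system.backup_interval > system.rpo:
        findings.append(f"{system.name}: backup interval exceeds RPO")
    if system.drill_restore_time > system.rto:
        findings.append(f"{system.name}: measured restore time exceeds RTO")
    return findings


# Illustrative values only.
orders_db = SystemObjectives(
    name="orders-db",
    rto=timedelta(hours=4),
    rpo=timedelta(minutes=15),
    backup_interval=timedelta(hours=1),
    drill_restore_time=timedelta(hours=2),
)
for finding in check_objectives(orders_db):
    print(finding)
```

In this example the hourly backup cannot satisfy a 15-minute RPO, so the check flags it. Running a check of this kind after every drill is one way to keep the plan aligned with the stated objectives.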

A well-developed DRP can help organizations prepare for and respond effectively to disasters, minimizing the impact on business operations. By following the steps outlined above, organizations can create a comprehensive and effective DRP that ensures business continuity and resilience. Regular testing and updating of the plan are also essential to ensuring its effectiveness and relevance over time.

Disaster scenarios and recovery strategies

It is important to distinguish between the strategies that are applicable to different types of disaster scenarios.

Localized disruptions are those that affect only one or a few sites, but not all. Examples are disruptions due to weather, power outages, or network outages. In such scenarios, a system architected for high availability or a distributed system would also serve as a DR scheme. Typical architectures, in increasing order of RTO/RPO and decreasing order of cost, are listed below:

  1. Multi-site Active/Active, where each site is fully operational and the data is fully replicated or synchronized across all sites. In this case, the user traffic will just have to be rerouted to the operating sites, and the overall service will remain available. If the sites are provisioned with enough headroom, they will be able to accommodate the additional traffic or workloads and not impact the overall service availability metrics.
  2. Warm Standby is an active/passive scheme whereby each site is fully functional from a feature/functionality perspective, but the passive site handles only a small fraction of the traffic or workload. In this scenario, the resources at the passive site would have to be scaled up during the recovery phase to accommodate the additional workload, which may increase the RTO. However, the RPO would not be impacted, as the data would have been replicated, just like in the Active/Active scheme.
  3. Pilot Light is also an active/passive scheme, whereby the data will be replicated but the compute elements will not be running and no workloads will be processed while on standby. Additional resources would have to be provisioned and scaled. Additionally, the application process would also have to be started to provide the service. This will further increase the RTO but not have any impact on the RPO.
  4. Backup/Restore is yet another active/passive scheme, whereby backups are taken regularly, but only after a disaster are the resources provisioned and scaled, the application processes started, and the data restored to provide the service. This not only increases the RTO even further but also has an impact on the RPO, as the backups are configured to occur at a certain frequency and any data written since the last backup would be lost.
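The headroom requirement mentioned for the Active/Active scheme can be estimated with simple arithmetic. The following Python sketch is only an approximation: it assumes that all sites have equal capacity and that traffic from a failed site is spread evenly across the survivors. The function names are made up for this example.

```python
def utilization_after_failover(sites: int, failed: int, utilization: float) -> float:
    """Per-site utilization once the failed sites' traffic moves to the survivors.

    `utilization` is the steady-state load per site as a fraction of capacity.
    """
    surviving = sites - failed
    if surviving <= 0:
        raise ValueError("at least one site must survive")
    total_load = sites * utilization  # load expressed in units of one site's capacity
    return total_load / surviving


def max_safe_utilization(sites: int, failed: int) -> float:
    """Highest steady-state utilization that still fits on the surviving sites."""
    if failed >= sites:
        raise ValueError("at least one site must survive")
    return (sites - failed) / sites


after = utilization_after_failover(sites=3, failed=1, utilization=0.6)
print(f"Utilization after losing one of three sites: {after:.0%}")
print(f"Safe steady-state ceiling: {max_safe_utilization(3, 1):.0%}")
```

Under these assumptions, three sites running at 60% would run at about 90% after losing one site, and a two-site deployment must stay at or below 50% to absorb a full site failure.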

Of the above schemes, Backup/Restore is a bit unique in that it also provides protection against human errors and actions, as there is no automatic propagation of data to other sites. However, the human actions in this case are not necessarily malicious in nature.

Conversely, because the first three schemes propagate data automatically to the other sites, they would not offer protection against malicious actors.
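The trade-offs among the four schemes can be captured in a small lookup, shown here as a Python sketch. The `rank` field simply encodes the ordering given in the list above (1 is the fastest and most expensive, 4 the slowest and cheapest), and `auto_replicates` records whether data changes propagate automatically to other sites. The data structure and the `choose` function are illustrative assumptions, not a standard model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rank: int              # 1 = lowest RTO and highest cost, 4 = highest RTO and lowest cost
    auto_replicates: bool  # True if data changes propagate automatically to other sites


STRATEGIES = [
    Strategy("Active/Active", 1, True),
    Strategy("Warm Standby", 2, True),
    Strategy("Pilot Light", 3, True),
    Strategy("Backup/Restore", 4, False),
]


def choose(max_rank: int, must_survive_bad_writes: bool) -> Optional[Strategy]:
    """Pick the cheapest scheme whose recovery rank is within `max_rank`.

    If `must_survive_bad_writes` is True, schemes that replicate data
    automatically are excluded, since they would also replicate the damage.
    """
    eligible = [
        s for s in STRATEGIES
        if s.rank <= max_rank and not (must_survive_bad_writes and s.auto_replicates)
    ]
    # A higher rank means a cheaper scheme, so take the highest eligible rank.
    return max(eligible, key=lambda s: s.rank, default=None)


print(choose(max_rank=2, must_survive_bad_writes=False))
print(choose(max_rank=3, must_survive_bad_writes=True))
```

The second call returns `None`, which reflects the point made above: no single scheme in this list offers both a low RTO and protection against erroneous or malicious writes, so organizations that need both typically keep backups alongside a replicated standby.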

Cyber attacks are another form of disruption, and they may take on different forms:

  1. Denial of service or distributed denial of service attacks, while causing the service to become sluggish or unresponsive, do not cause the loss of data. Once the source(s) of the attack are identified and neutralized, the service can return to normal without any additional recovery or restoration. Therefore, any of the above schemes may be used.
  2. Malware attacks are potentially the worst, as they corrupt, destroy, or render the data inaccessible. In such a scenario, any automatic data replication would cause the infection to propagate to all the other sites, thus making the low RPO HA schemes vulnerable and powerless as a DR mechanism. In such cases, only the backup/restore option becomes viable. However, additional steps are necessary to ensure the infection is not carried forward. The following precautions and steps are proposed:
     a. The backups are scanned for malware and stored in a different location from the production system, as most ransomware attacks include any connected storage. An alternative would be to use immutable WORM storage that guarantees that the backup cannot be altered or overwritten. Encryption at rest would also protect against the data being stolen or misused.
     b. The DR site must also be completely air-gapped from the affected production site. The infrastructure and the resources must be completely different, like the physical machines, virtual machines, clusters, etc. The applications at the DR site must also be started from known sterile and golden copies of executables, containers, deployment scripts, etc.
     c. The infected production site must be quarantined for investigation and potential isolation of the infection. In some cases, in-place recovery may be possible, but if it is a new ransomware, it may not be possible to decrypt and recover the data. In any case, nothing should be copied over until a thorough sanitation or a complete rebuild from scratch has been done.
     d. If the production site has to be rebuilt from scratch, the DR site must be capable of sustaining the service for the duration. Whether it should be scaled up to take the place of the primary site or function with reduced capability would be a business decision depending on the length of time to rebuild the primary site.
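Two of the malware precautions above lend themselves to a short sketch: choosing a restore point that predates the suspected compromise and passed a malware scan, and confirming that an artifact matches its known golden copy. The Python below is a minimal illustration under stated assumptions. The `Backup` record, the `scan_clean` flag, and the idea of recording a SHA-256 digest for each golden artifact are hypothetical; the malware scanner itself is out of scope, and only the standard `hashlib` module is used.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    scan_clean: bool  # result of an external malware scan (assumed to exist)


def pick_restore_point(backups: list[Backup], suspected_compromise: datetime) -> Optional[Backup]:
    """Return the newest backup taken before the compromise that scanned clean."""
    candidates = [
        b for b in backups
        if b.taken_at < suspected_compromise and b.scan_clean
    ]
    return max(candidates, key=lambda b: b.taken_at, default=None)


def matches_golden(artifact: bytes, golden_sha256: str) -> bool:
    """Check an executable, image, or script against its recorded golden digest."""
    return hashlib.sha256(artifact).hexdigest() == golden_sha256


backups = [
    Backup(datetime(2024, 2, 10, 2, 0), scan_clean=True),
    Backup(datetime(2024, 2, 11, 2, 0), scan_clean=True),
    Backup(datetime(2024, 2, 12, 2, 0), scan_clean=False),  # taken before detection, but infected
    Backup(datetime(2024, 2, 13, 2, 0), scan_clean=True),   # taken after the suspected compromise
]
restore_from = pick_restore_point(backups, datetime(2024, 2, 12, 12, 0))
print(restore_from)
```

Note that choosing an older restore point widens the effective data loss beyond the nominal RPO, which is part of the cost of recovering from a malware attack. The time of compromise is often uncertain, so it is prudent to err on the early side.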

Conclusion

To recap, a DRP is a set of procedures and protocols that organizations must follow to recover their IT infrastructure and data after a disaster or a disruptive event. On the other hand, a BCP is a broader strategy that focuses on the continuity of critical business operations and services during and after a disruptive event. Different DR strategies are applicable to different disaster scenarios. By incorporating DRP and BCP into their overall risk management strategy, organizations can increase their chances of successfully recovering from disruptions and ensuring business continuity.

Additional Resources

Creating and implementing a DRP and a BCP can seem daunting, but fortunately, there are many resources available to help organizations navigate the process. Here are links to some helpful resources.

  1. https://www.ready.gov/: A site of the U.S. Department of Homeland Security that provides information and resources to help businesses create their DRP and BCP. It includes templates, checklists, and guidance on developing and implementing these plans.
  2. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-207.pdf: The seminal NIST publication on Zero Trust Architecture.
  3. https://www.techtarget.com/: A site that links to many resources, including those related to disaster recovery, storage, and security.
  4. https://www.sans.org/: Provides information security training and cyber security certifications.
  5. https://nomiso.io/: Provides comprehensive DR solution design and implementation services for on-premises, cloud, and hybrid environments.

By leveraging these resources, organizations can design and implement an effective DRP and BCP that help them recover quickly from disruptions and ensure business continuity.