An Introduction to IT Disaster Recovery Planning
Risks to critical business operations due to systems outages have been, and will always be, a concern for most organizations. As a result, IT disaster recovery planning is critical to help reduce the likelihood of a system disruption, or reduce downtime if (when) a disruption does occur. So, if you’re looking for an introduction to IT disaster recovery planning, you’re in the right place!
This perspective presents how IT disaster recovery planning fits into the overall organizational Business Continuity Program; discusses common goals in developing Business Continuity and Disaster Recovery plans; and explores unique activities that must be considered when developing an IT Disaster Recovery Plan.
THE RELATIONSHIP BETWEEN BUSINESS CONTINUITY AND IT DISASTER RECOVERY
Disaster Recovery Planning is one component of a comprehensive Business Continuity Program. Other areas include Crisis, or Incident, Management and Business Continuity Planning, which is focused on the recovery of essential business operations. When developed appropriately, Business Continuity Plans and Disaster Recovery Plans work together to achieve overall recoverability. During times of business interruption, dependent systems are needed in order to support the resumption of essential business functions, likely from a remote location. In the event of a system outage, departmental Business Continuity Plans document workarounds to continue operations until the system is restored, as well as procedures to reconstruct the recovered system data to its most current status.
Prior to beginning the process of developing Disaster Recovery Plans, it is imperative to establish a solid foundation for your program by first securing management buy-in and support. Obtaining such support enables you to allocate the time needed to develop the IT Disaster Recovery Plans, while providing a level of assurance that organizational resources (people, tools, budget) will be made available to you to build the program. This effort should be done in tandem with the overall Business Continuity Program. Once you have obtained senior management approval, the next step in developing a comprehensive IT Disaster Recovery Plan is to identify the requirements necessary to build a viable disaster recovery strategy and establish the proper scope of the project.
IT DISASTER RECOVERY PLANNING PROCESS
BEGINNING THE ITDR PLANNING PROCESS
Conducting a Business Impact Analysis (BIA) is necessary to begin all business continuity and disaster recovery development projects because it sets the foundation for your strategy identification and selection effort to enable plan development. Specific to ITDR, the BIA identifies and assesses impacts to the organization in the event that essential operations cannot be performed due to a systems outage. These impacts can be categorized by the following: financial, reputational, regulatory and legal, operational, and, most importantly, external customer impacts. The BIA also identifies risks and the impact each would have on the organization. Using the results of the BIA, IT will be able to determine when applications need to be restored (Recovery Time Objectives, or RTOs), how they can be restored (strategy options), and whether the dependent operating departments can continue operations while the system is unavailable (and for how long).
For more information on this topic, please read: Ultimate Guide to Business Impact Analysis
It’s important to note that some organizations opt to conduct a formal Application Impact Analysis (AIA) to specifically address the impacts to the organization in the event of a system interruption. The losses are categorized in the same manner as the BIA, but may be quantified by such areas as:
- Need to increase staff (or provide for overtime) when implementing manual workarounds
- Lost information and the cost of reconstructing data
- Delays in accepting and fulfilling orders
- Impacts to the customer, resulting in increased reputational damage and possible loss of business
- Regulatory impacts and possible litigation due to negligence in not protecting the company’s information
EXPERT TIP: When embarking upon a BIA, you will be meeting with a number of departments throughout the organization to identify their activities and applications (both internally and externally hosted). Linking departments and activities to their required applications is virtually impossible if done manually (e.g., using a spreadsheet). Through the use of an automated tool, you can link each application to the departments and activities that use it, and then determine the best course of action for recovery. You can also take an alternative view, by reviewing the organization’s critical activities and creating a profile of the applications needed to support them.
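The two views described in this tip can be sketched in a few lines of Python. This is only an illustration of the underlying data structure, not a substitute for a business continuity tool; the departments, activities, and application names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical BIA interview results: (department, activity, application).
bia_records = [
    ("Finance", "Payroll processing", "ERP"),
    ("Finance", "Invoicing", "ERP"),
    ("Customer Service", "Order intake", "CRM"),
    ("Customer Service", "Order intake", "ERP"),
    ("Warehouse", "Shipping", "WMS"),
]


def applications_to_users(records):
    """Map each application to the (department, activity) pairs that rely on it."""
    index = defaultdict(set)
    for department, activity, application in records:
        index[application].add((department, activity))
    return index


def activity_profile(records, critical_activities):
    """Alternative view: the applications needed to support each critical activity."""
    profile = defaultdict(set)
    for _department, activity, application in records:
        if activity in critical_activities:
            profile[activity].add(application)
    return profile


by_app = applications_to_users(bia_records)
profile = activity_profile(bia_records, {"Order intake"})

for department, activity in sorted(by_app["ERP"]):
    print(f"ERP is used by {department} for {activity}")
print("Order intake depends on:", sorted(profile["Order intake"]))
```

Keeping the raw records as simple triples means both views are derived from the same interview data, so they cannot drift apart.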
Key to the outcome of the BIA or AIA is the identification of Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) for each critical application. RTOs address when the application is needed before significant impacts are recognized, and RPOs identify acceptable data loss (e.g., the point where data can be reconstructed without permanent loss).
Establishing RTOs and RPOs must be consistent with expressed business requirements. Any potential gap between business expectations and IT recovery targets could lead to confusion and operational or supply chain disruptions. The following actions should be taken to ensure a cohesive approach to IT Disaster Recovery strategies:
- Review business RTOs and IT RTOs to ensure agreement. If there are questions regarding the information, return to the BIA or AIA documentation and validate the accuracy of the information. For example, is the application absolutely required to support the product or service, or is it just a convenient tool?
- Present the application prioritization list to senior management before determining the strategies needed to meet the RTO/RPO. Ensure justification is provided for all systems.
- Once approved, strategies need to be determined for the recovery of the IT infrastructure (servers, network, etc.) as well as applications. Various strategies exist, depending upon the defined RTO and RPO.
- Data protection strategies address RPO. These include backup processes and intervals; redundancy; tape or disk; offsite storage; etc.
- Recovery of the data center’s infrastructure and applications is guided by RTO. Strategies include alternate recovery sites, such as internal sites (second data center; split server farm; etc.) and/or use of third-party providers (colocation; DRaaS; etc.).
- For applications hosted externally, establishing tight SLAs with third-party hosting providers is necessary (as well as auditing their IT Disaster Recovery Plan). Understand that you will likely have different recovery strategies for each application prioritization classification.
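As a rough illustration of the first review step and of the link between backup intervals and RPO, the sketch below flags applications whose targets do not line up. All application names and figures are invented, and it assumes that worst-case data loss under periodic backups is roughly one backup interval, which ignores backup duration and replication lag.

```python
from dataclasses import dataclass


@dataclass
class AppTargets:
    name: str
    business_rto_hours: float      # what the business stated in the BIA/AIA
    it_rto_hours: float            # what the current IT strategy can deliver
    rpo_hours: float               # acceptable data loss
    backup_interval_hours: float   # time between backups or replication points


def find_gaps(apps):
    """Return (application, kind) pairs where IT capability misses the target."""
    gaps = []
    for app in apps:
        if app.it_rto_hours > app.business_rto_hours:
            gaps.append((app.name, "RTO"))
        # Assumption: worst-case data loss is about one backup interval.
        if app.backup_interval_hours > app.rpo_hours:
            gaps.append((app.name, "RPO"))
    return gaps


# Hypothetical figures for illustration only.
apps = [
    AppTargets("ERP", business_rto_hours=4, it_rto_hours=8,
               rpo_hours=1, backup_interval_hours=1),
    AppTargets("CRM", business_rto_hours=24, it_rto_hours=12,
               rpo_hours=4, backup_interval_hours=24),
    AppTargets("WMS", business_rto_hours=48, it_rto_hours=24,
               rpo_hours=24, backup_interval_hours=24),
]

gaps = find_gaps(apps)
for name, kind in gaps:
    print(f"{name}: {kind} gap, review with the business owner")
```

Each flagged gap is a prompt to return to the BIA or AIA documentation, not an automatic decision to buy a faster strategy.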
Costs play a vital role in determining the strategy. What is ultimately determined to be feasible may not always be in line with business expectations. In these cases, meet with the business owners to ensure understanding in moving forward.
EXPERT NOTE: The use of third-party hosted systems may transfer the responsibility of maintaining the system to an external provider, but it does not transfer the responsibility for managing the risk of losing the system. For example, Appendix J of the FFIEC standards requires financial institutions to ensure the resiliency of their external technology service providers (TSPs).
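One simple way to frame the cost trade-off is to add each strategy's yearly cost to the downtime impact it leaves the business exposed to. The sketch below is a deliberately crude model: it assumes impact grows linearly with downtime and that a major outage occurs once every five years. The strategy names and all figures are hypothetical.

```python
def annual_exposure(hourly_impact, rto_hours, years_between_outages):
    """Expected yearly downtime impact, assuming impact grows linearly with time."""
    return hourly_impact * rto_hours / years_between_outages


# Hypothetical strategy options and figures for illustration only.
strategies = {
    "Tape restore at cold site": {"annual_cost": 20_000, "rto_hours": 72},
    "Warm site": {"annual_cost": 120_000, "rto_hours": 24},
    "Hot site with replication": {"annual_cost": 400_000, "rto_hours": 2},
}
hourly_impact = 15_000          # assumed business impact per hour of downtime
years_between_outages = 5       # assumed frequency of a major outage

totals = {
    name: option["annual_cost"]
    + annual_exposure(hourly_impact, option["rto_hours"], years_between_outages)
    for name, option in strategies.items()
}
best = min(totals, key=totals.get)

for name, total in totals.items():
    print(f"{name}: {total:,.0f} per year")
print("Lowest combined cost:", best)
```

Real impacts are rarely linear, and reputational or regulatory impacts may not reduce to a single hourly figure, so a result like this is a starting point for the conversation with business owners rather than the answer.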
Once you have determined the strategy for the recovery of your organization’s IT infrastructure and applications, it is time to begin the development of the IT Disaster Recovery Plan. The structure of these plans should include:
- High-level recovery strategies for the recovery of IT systems;
- Detailed IT Incident Response Procedures, including Activation Criteria noting how and when the decision to activate the plan will be made;
- Defined roles and responsibilities of the IT Recovery Team;
- Processes to communicate with internal and external customers;
- Detailed application recovery procedures, as well as validation and data reconstruction steps to ensure functionality of the system prior to turning over to the business units;
- Procedures to return to the primary production data center.
EXERCISING THE PLAN
Documenting the restoration of your IT infrastructure is never complete until you test the procedures developed, as exercising validates plan content and serves as a great training opportunity for those responsible for executing the plan in a recovery. Some organizations go through an initial tabletop exercise to discuss each recovery step documented and feel comfortable that initial recovery strategies will be met. However, the proof resides in an actual simulation, where the infrastructure and applications are recovered at the defined secondary site. In the exercise phase, a “crawl-walk-run” approach to testing is encouraged. Your first test should avoid impacting the production environment, and target application functionality. Subsequent test methods include recovery of interfacing systems, ultimately leading to recovery of the production environment at an alternate location.
The approach to testing is based upon your recovery strategies. For example, some financial institutions have a redundant secondary site for their critical applications, so testing involves failing over to that site for a period of days before actually failing back to the primary center. Many healthcare organizations I’ve worked with select applications for recovery testing during non-peak periods and ensure return to the primary site before the start of peak times.
EXPERT TIP: Include critical customers and end users in IT disaster recovery exercises. Taking part in testing activities increases end users’ awareness of the IT recovery process for their application, and it enables them to practice manual workaround procedures and discuss processes to reconstruct data, if required.
Whatever approach you take to testing, always ensure that your test plan includes validation of the steps to return to the primary production site. Many organizations fail to address this phase and encounter problems that force them to remain at the secondary site for longer than planned. For example, modifications made to the system during testing may need to be backed out when returning to the primary site. One prevalent issue during testing is handling data replication during the exercise. If replication is suspended during an exercise, the company’s recovery capabilities are seriously threatened. In addition, you want to ensure that test data does not make its way into the production environment when returning home. For additional information on this topic, please read: Failing Back Home Can Trip You Up.
The most successful disaster recovery exercises are those that identify areas to fix or improve. Frankly, it’s likely that “perfect” tests are those that involved a significant amount of pre-test configuration that did not represent the actual production environment.
Business continuity standards and overall best practice call for the exercising of your plan every year. Some organizations, such as banking institutions, exercise their plans more frequently.
In addition to the exercises, it is also necessary to ensure that the established RTOs and RPOs are updated, with obsolete applications replaced by newer systems. This review should be done more frequently, on a quarterly basis.
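The two cadences above (yearly exercises, quarterly RTO/RPO reviews) are easy to track with a small check. The sketch below is only an illustration; the dates are invented, and 91 days is used as an approximation of a quarter.

```python
from datetime import date, timedelta

EXERCISE_INTERVAL = timedelta(days=365)  # exercise the plan every year
REVIEW_INTERVAL = timedelta(days=91)     # review RTOs/RPOs roughly every quarter


def is_overdue(last_done, interval, today):
    """True when more than `interval` has passed since `last_done`."""
    return today - last_done > interval


# Hypothetical dates for illustration only.
today = date(2024, 3, 1)
last_exercise = date(2023, 1, 15)
last_rto_rpo_review = date(2024, 1, 10)

exercise_overdue = is_overdue(last_exercise, EXERCISE_INTERVAL, today)
review_overdue = is_overdue(last_rto_rpo_review, REVIEW_INTERVAL, today)

print("Plan exercise overdue:", exercise_overdue)
print("RTO/RPO review overdue:", review_overdue)
```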
Developing an effective IT Disaster Recovery Program requires input and contribution from many resources that reside outside of the IT department. Identify these resources and engage in constructive dialog that builds confidence among senior management that the investment is the right approach. Remember:
- The guiding principle in developing an effective IT Disaster Recovery Plan is to ensure compliance with overall company Business Continuity Programs, and to be inclusive of the operating departments that depend upon these systems to maintain productivity and profitability. One side – IT or Business Department – simply cannot approach Business Continuity alone.
- Conducting a cost analysis of impacts to the organization over time against the cost of strategy options helps senior management determine the acceptable approach in the IT disaster recovery program.
- Always ensure that the developed IT Disaster Recovery Plan is exercised frequently, and ensure that “Return Home” practices are addressed.
Business continuity and IT disaster recovery planning is all that we do. If you’re looking for help with building or improving your business continuity program, we can help! Please contact us today to get started. We look forward to hearing from you!