The IT DR program: a crucial, but not well understood, aspect of disaster recovery
- Published: Friday, 30 September 2016 09:09
While the hardware and software costs for disaster recovery are well understood, many organizations do not fully realize that a comprehensive IT DR program must first be in place if they are to execute the plan successfully in the event of an outage or disaster. An organization can have all the right IT DR hardware and software, but without a properly managed program, its efforts will fail.
Even the organizations that do have this understanding often underestimate the complexities involved in creating an IT DR program and the associated costs.
The DR program consists of the people, processes and tools necessary to implement the IT DR solution and manage its lifecycle. Because this implementation requires considerable expertise and experience, organizations must carefully consider the cost of developing in-house skill sets as well as the cost of purchasing, implementing, and maintaining their own hardware and software. They should then compare these costs with those of using a third-party managed recovery provider that specializes in IT disaster recovery services.
Only by understanding what goes into a full IT DR program, and the total cost of ownership (TCO) of both an in-house and a ‘selectively outsourced’ solution, can organizations make the right choice.
The DR program consists of five processes: application mapping; developing disaster recovery procedures; test planning and execution; post-test analysis; and recovery lifecycle management. The discussion below will address what each step involves.
Application mapping to determine interdependencies
Organizations typically automate complex business processes using multiple interdependent applications and databases. The application mapping process connects each business process to all the application software and hardware required to deliver that process and assigns the process a desired recovery time objective and recovery point objective (RTO/RPO). To ensure that business processes are recovered within the desired timeframe, all the supporting applications must also be recovered within this time period.
Organizations perform application mapping through collaboration between IT and the business units to understand the business processes, and the applications, databases, and hardware that support the processes. IT and each business unit must then determine the cost of downtime. This requires assigning a financial amount to each unit of time should an automated business process go down and determining how long the business can afford to be without this application.
Based on downtime costs, organizations can set recovery time objectives that specify how quickly each process must be recovered should it go down, and recovery point objectives that specify how much recent data the organization can afford to lose.
An example tiering of applications with their concomitant RTOs might look like the following:
- Tier 1: 0 – 4 hours RTO
- Tier 2: 4 – 12 hours RTO
- Tier 3: 12 – 24 hours RTO
- Tier 4: 24+ hours RTO
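As a rough sketch of how a downtime cost might translate into one of the example tiers above, the following Python snippet assumes a constant hourly cost of downtime and a maximum loss the business is prepared to absorb. The function names, the cost figures, and the choice to place boundary values (exactly 4, 12 or 24 hours) in the stricter tier are all illustrative assumptions, not part of the original white paper.

```python
# Illustrative only: tier boundaries follow the example list above;
# the cost figures and function names are assumptions.
TIERS = [(4, "Tier 1"), (12, "Tier 2"), (24, "Tier 3")]  # (upper bound in hours, tier)


def max_tolerable_downtime(cost_per_hour: float, max_acceptable_loss: float) -> float:
    """Hours of downtime before losses exceed what the business will accept."""
    if cost_per_hour <= 0:
        raise ValueError("cost_per_hour must be positive")
    return max_acceptable_loss / cost_per_hour


def tier_for_rto(rto_hours: float) -> str:
    """Map an RTO in hours to the example tiers; boundary values go to the stricter tier."""
    for upper_bound, tier in TIERS:
        if rto_hours <= upper_bound:
            return tier
    return "Tier 4"


# Hypothetical process: 25,000 lost per hour of downtime, 75,000 tolerable in total.
rto = max_tolerable_downtime(25_000, 75_000)
print(rto, tier_for_rto(rto))
```

In practice the cost of downtime is rarely linear, so a real assessment would likely replace the simple division with figures agreed between IT and each business unit.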
All of the applications in the same process need to be assigned the same RPO/RTO. If organizations don’t recover all of the interdependent applications and data necessary for a particular process at the same time, the entire process won’t recover properly. For example, a Tier 1 application may rely on a seemingly less critical Tier 4 database. If this database is recovered more slowly (or not at all), IT directors will find themselves explaining to the Board why their recovery effort failed.
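One way to catch this kind of mismatch is to push the strictest RTO down through the dependency map, so that every component inherits the tightest objective of anything that relies on it. The sketch below is a minimal illustration of that idea; the component names, RTO values, and dependency map are hypothetical.

```python
# Illustrative only: component names, RTO values, and the dependency map are hypothetical.

def effective_rtos(own_rto: dict, depends_on: dict) -> dict:
    """Return each component's RTO after inheriting the strictest RTO of its dependents."""
    effective = dict(own_rto)
    stack = list(own_rto)
    while stack:
        app = stack.pop()
        for dep in depends_on.get(app, []):
            # A dependency must be recovered at least as fast as whatever relies on it.
            if effective[app] < effective.get(dep, float("inf")):
                effective[dep] = effective[app]
                stack.append(dep)  # the tightened RTO may need to flow further down
    return effective


own_rto = {"order_portal": 4, "inventory_service": 12, "reporting_db": 48}
depends_on = {
    "order_portal": ["inventory_service"],
    "inventory_service": ["reporting_db"],
}
print(effective_rtos(own_rto, depends_on))
```

Here the database that looked like a Tier 4 system on its own ends up with a 4-hour RTO, because the Tier 1 portal cannot function without it.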
Recovery procedure development
Recovery procedures are the steps the organization must take to recover the data center and applications.
Developing recovery procedures involves writing a detailed plan or run-book that defines how to deal with the loss of various aspects of the network (databases, servers, bridges/routers, and communications links). This script should include specifications for who will arrange for repairs or reconstruction, communication procedures for the initial respondents, and instructions on how the data recovery process should proceed.
The script should also outline priorities for recovery (e.g. what should be recovered first). Once the repairs and data recovery have taken place, the procedures should include a checklist that organizations can use to verify that everything is back to normal.
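Recovery priorities can be derived rather than written by hand. The sketch below assumes a dependency map and a tier per component (both hypothetical) and produces an order in which nothing is recovered before the things it depends on, preferring lower tier numbers whenever there is a choice. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); it is one possible approach, not a method described in the white paper.

```python
import heapq
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def recovery_order(depends_on: dict, tier: dict) -> list:
    """Order components so dependencies come first, preferring lower tiers among ready items."""
    sorter = TopologicalSorter(depends_on)  # maps component -> components it depends on
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    ready, order = [], []
    while sorter.is_active():
        for name in sorter.get_ready():
            heapq.heappush(ready, (tier[name], name))
        _, name = heapq.heappop(ready)  # most critical component that can start now
        order.append(name)
        sorter.done(name)  # unlocks anything that was waiting on this component
    return order


# Hypothetical environment.
depends_on = {
    "order_portal": {"app_server", "orders_db"},
    "app_server": {"network"},
    "orders_db": {"network"},
    "intranet_wiki": {"network"},
}
tier = {"network": 1, "app_server": 1, "orders_db": 1, "order_portal": 1, "intranet_wiki": 4}
print(recovery_order(depends_on, tier))
```

The resulting list can seed the priority section of the script, with the verification checklist run against each component as it comes back.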
These procedures need to follow best practices and be kept up to date. Many organizations do not spend enough time developing complete recovery procedures to make sure their data center is fully recoverable.
For example, Sungard AS studies show that each application procedure takes approximately 19 hours to complete, and these procedures need to be continually updated based on production changes and testing results.
A DR specialist like Sungard AS can speed up and simplify the process of developing recovery procedures.
For example, based on hundreds of customer implementations, Sungard AS has developed a library containing thousands of best practices templates and modules. Instead of having to reinvent the wheel each time to develop test procedures, Sungard AS can take advantage of existing best practices-based procedures. Automated configuration tools then arrange these templates and modules into a customer-specific ‘procedure’.
Test planning and execution
Best practice is for organizations to test their recovery plan at least once or twice a year. By testing the entire recovery process, organizations can determine whether their recovery plan works, and can uncover and resolve mistakes before they affect an actual recovery effort.
Such testing also educates staff in managing disaster recovery situations.
Many IT organizations, however, do not adequately test their DR plans. Executing recovery tests always involves removing production subject matter experts from their day jobs. This fact often leads companies to test subsets of their applications to minimize the disruption to the production environment. Less than full testing, however, is not sufficient to ensure successful recovery.
To develop an optimal plan, organizations should develop tests that consider everything that might go wrong, come as close as possible to simulating a real-life incident, and have independent reviews and observers.
Test plans should include:
- Test goals to drive the tests and keep the process on track;
- Execution scenarios that define the equipment, standard operating procedures, or conditions needed to conduct the test; test execution assumptions; and an event or incident scenario;
- Instructions to participants;
- A communications directory with phone numbers, fax numbers, or email addresses of those whom the participants are likely to call;
- A list of participants, including a test design team, simulation team, evaluators, and test participants;
- Test briefings;
- Test debriefings;
- Written evaluations and reports.
Typically, these tests require a sizable team for test planning, startup testing, ongoing testing, and setup and teardown of the environment.
Post-test analysis
After testing, it’s important to take the time to understand what happened during the test in order to optimize your recovery procedures. Organizations must examine detailed logs after each test to identify any errors in procedures, eliminate the errors, retest the changed procedures, and then incorporate the changed procedures into the recovery plan, revising all existing disaster recovery documents. Organizations are often surprised to find that tests reveal significant gaps in their recovery scripts.
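Part of this analysis can be as simple as comparing the recovery times measured during the test with the RTO targets. The following sketch assumes the times have already been extracted from the test logs into a dictionary; the component names and figures are hypothetical.

```python
# Illustrative only: names and hours are hypothetical test results.

def rto_gaps(target_rto_hours: dict, measured_hours: dict) -> dict:
    """Return {component: overrun in hours}, or None where the component was never recovered."""
    gaps = {}
    for app, target in target_rto_hours.items():
        measured = measured_hours.get(app)
        if measured is None:
            gaps[app] = None  # no recovery recorded during the test
        elif measured > target:
            gaps[app] = measured - target
    return gaps


targets = {"order_portal": 4, "orders_db": 4, "reporting": 24}
measured = {"order_portal": 3.5, "orders_db": 6.0}  # "reporting" was not recovered
print(rto_gaps(targets, measured))
```

Each entry in the result points to a procedure that needs to be corrected, retested, and folded back into the recovery plan.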
Recovery lifecycle management
Changes are constant in a production environment: software is updated, patches are added, capacity grows, and so on. Organizations need to manage the lifecycle to ensure that these changes flow into the backup and disaster recovery environments. If changes are not synchronized with the recovery systems and plans, the restoration of systems and data can be significantly delayed or even fail. Many companies fail to manage and keep track of changes.
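A basic drift check illustrates the idea. The sketch below assumes that an inventory of component versions can be exported from both the production and the recovery environment; the inventory format, component names, and version strings are hypothetical.

```python
# Illustrative only: the inventory format and its contents are assumptions.

def find_drift(production: dict, recovery: dict) -> dict:
    """Return {component: (production_version, recovery_version)} wherever the two differ."""
    drift = {}
    for name in sorted(set(production) | set(recovery)):
        prod_version = production.get(name)  # None means absent from that environment
        dr_version = recovery.get(name)
        if prod_version != dr_version:
            drift[name] = (prod_version, dr_version)
    return drift


production = {"orders_db": "14.9", "app_server": "2.3.1", "cache_service": "7.0"}
recovery = {"orders_db": "14.7", "app_server": "2.3.1"}
print(find_drift(production, recovery))
```

Run after every production change, a check along these lines flags a patch or a newly added component that has not yet reached the recovery environment.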
Conclusion
Most organizations understand the need for an IT DR plan. However, when planning for IT DR, many fail to develop a comprehensive DR program.
While they typically plan for the required hardware and backup software, many organizations don’t understand that a disaster recovery program is essential to ensuring successful IT DR. Such a program includes the people and processes required to document the recovery procedures, test the procedures, execute the recovery at time of outage, and manage the ongoing lifecycle. In effect, no ‘set it and forget it’ technology exists when it comes to DR. These organizations are not fully aware of all the expertise, tasks, and technology required to implement an effective plan, nor do they have a full grasp of all the costs involved.
The author
This article is based on a white paper produced by Sungard AS. The original can be read here: How to develop a comprehensive and cost effective IT DR program. For more on testing and exercising you can read this Sungard AS blog: How to Conduct a Disaster Recovery Test.