Disaster recovery testing: how to get it right

Published: Wednesday, 15 May 2019 07:52

In an article prepared for Business Continuity Awareness Week, Ryan Weeks, chief information security officer at Datto, shares five tips that business managers and IT teams should follow to help ensure that disaster recovery testing efforts are effective.

Having a solid disaster recovery (DR) strategy in place is imperative – but if you don’t test it regularly, you still risk your business being hit hard if ransomware strikes or if there is a system outage. The purpose of IT disaster recovery testing is to pinpoint and fix any flaws in your DR plan well before you find yourself in a real disaster scenario.

To do this, you need to thoroughly scrutinize how well your plan performs, and allow enough time to resolve any issues before they impact the ability to restore operations in case of an emergency. Scheduled and frequent testing is the only way to be certain your organization can be back up and running quickly following an outage.

To help ensure your testing efforts are effective, follow these five key steps:

1. Make sure you choose technology that facilitates the all-important testing. Modern disaster recovery systems, for example, take frequent image-based backups and replicate server images to the cloud. When there is a primary server outage, operations can be restored directly from a backup instance of a virtual server. This so-called ‘instant recovery’ approach has fundamentally changed how DR testing is performed as it allows users to easily spin up virtual machines locally or in the cloud and test the ability to restore essential services such as email and database applications.

One word of caution: To avoid conducting ineffective tests, always refer to the vendor’s guidance first. Many disaster recovery vendors provide a pre-test checklist with specific tasks that must be performed prior to testing. Skipping those tasks can produce tests that yield inaccurate results, invalidating the entire testing process.

2. Define the scope of testing. For example, should the test be conducted in a cloud-based environment that mirrors the production environment, or is the scope broader? Some tests might even go beyond IT – such as testing an emergency generator.

There is no single ‘right’ approach; every organization will have to determine its own specific needs based on how much disruption it can tolerate during testing, and the amount of time and resources it can dedicate. However, cutting corners or running incomplete tests is not advisable, because issues that go undetected in testing may hamper restores later. While defining the test scope, it’s also important to remember that some of the more radical test methods carry a risk of data corruption or even data loss.

3. Decide how often to test. Here, too, there is no silver bullet. A test should be considered essential every time there has been a significant change to the production environment, while routine tests may take place quarterly or every six months, depending on the available resources. It is ultimately a matter of weighing up risks: some organizations might require more frequent testing.

4. Reporting and sharing the results of these tests demonstrates the value of the DR strategy to the management board and other stakeholders. This might be as part of a formal review meeting or a more informal email report, but at a minimum, it should include details of the test results and proof that any issues have been resolved, as well as confirming the ability to recover along with the ongoing validity of the DR strategy. Live testing, on the other hand, is not recommended, as depending on the outcome, this can actually decrease confidence in the DR plan.

5. Something that is all too easy to neglect is the comprehensive documentation of network topology, DR plans, testing processes and test results. However, documenting everything is important, and there are many tools on the market to help with this, ranging from fairly basic to highly comprehensive. The information captured should go beyond IT components and also include contact lists for support teams, technology vendors, and any other pertinent information that might be needed following a disaster event.
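
For teams that want to automate reminders, the scheduling rule from step 3 could be sketched roughly as follows. This is only an illustration, not part of any vendor's tooling: the function name `is_dr_test_due`, its parameters, and the 90-day default (standing in for "quarterly") are all assumptions made for this example.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed default: roughly quarterly. Use about 182 days for a six-monthly cycle.
ROUTINE_INTERVAL = timedelta(days=90)


def is_dr_test_due(
    last_test: date,
    today: date,
    last_significant_change: Optional[date] = None,
    interval: timedelta = ROUTINE_INTERVAL,
) -> bool:
    """Return True if a DR test should be scheduled (hypothetical helper)."""
    # A significant production change made after the last test
    # always calls for a new test, regardless of the routine cycle.
    if last_significant_change is not None and last_significant_change > last_test:
        return True
    # Otherwise, test once the routine interval has elapsed.
    return today - last_test >= interval


# Routine cycle not yet elapsed and no change recorded: no test needed.
print(is_dr_test_due(date(2019, 1, 10), date(2019, 3, 1)))
# A significant change after the last test triggers a test straight away.
print(is_dr_test_due(date(2019, 1, 10), date(2019, 3, 1),
                     last_significant_change=date(2019, 2, 20)))
```

What counts as a "significant change" remains a judgment call for each organization; the sketch simply assumes someone records the date when one occurs.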

Ransomware, user error, natural disasters: all of these are very real threats. Those businesses that can restore operations in the shortest timeframe will have a competitive edge. No plan and no system is ever failsafe, but by carefully performing regular DR tests, immediately dealing with any issues that are identified, and meticulously noting down all relevant information related to the DR plan, you should be in the best position possible to cope with all eventualities.

https://www.datto.com