Giving the recovery point objective some respect
- Published: Tuesday, 18 April 2017 08:07
Accurate recovery point objectives are essential for the recovery and restoration of systems in the time expected, but they are often neglected within business continuity and disaster recovery plans. Robert S. Emmel explains what RPOs are and how they can be calculated using various factors.
Activities associated with recovering from a disaster event are typically performed in a time-compressed and bounded manner. Even though the business may have analyzed, documented and (hopefully) tested the steps to be used in preparation for such a contingency, actual recovery and restoration times always seem to vary from what was originally planned. Some variance is to be expected, and it can stem from any number of factors. One factor we see among our clients contributing to longer-than-expected recovery times is the failure to accurately quantify and establish appropriate recovery point objectives (RPOs).
But what is an RPO and why should a business be concerned about its accuracy? For the sake of this article, let’s assume an RPO represents the aging of business information in the form of electronic data files (‘backups’) stored, generally, on tape or disk, whether in a corporate data center, third-party or cloud environment, and whether taken via a centralized IT organization or through some local means (e.g., a local area network (LAN)).
Business organizations ‘age’ their stored data when they determine how often a particular type of data (e.g., sales, manufacturing production, financial, accounting, etc.) should be copied and stored. For example, some data copies are performed every 24 hours while others must be performed in near real-time (i.e., a secondary and remote copy immediately following the primary production and local copy). The problem is that the longer the aging, the greater the amount of data that will be lost between the time of the technology infrastructure ‘down’ situation and the last backup. Stated another way, if the technology infrastructure were to be recovered (for whatever reason), the business would, depending upon the type and backup rate of data, be missing all production data input between the time of the down situation and the last time data was backed up. Obviously, there would be the potential for more data loss associated with a 24-hour backup than a near real-time backup.
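The arithmetic behind data aging can be sketched in a few lines of Python. This is only an illustration: the function names and the timestamps are assumptions made for the example and do not come from any particular backup product.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, outage: datetime) -> timedelta:
    """Return the span of production data input lost after an outage.

    Everything entered between the last backup and the 'down' situation
    is missing once the infrastructure is recovered from that backup.
    """
    if outage < last_backup:
        raise ValueError("outage precedes the last backup")
    return outage - last_backup


def worst_case_loss(backup_interval: timedelta) -> timedelta:
    """An outage just before the next backup loses nearly a full interval."""
    return backup_interval


# Hypothetical example: nightly backup at 23:00, outage at 08:07 the next day.
lost = data_loss_window(datetime(2017, 4, 17, 23, 0), datetime(2017, 4, 18, 8, 7))
print(lost)  # 9:07:00

# Worst case for a 24-hour schedule versus a 15-minute schedule.
print(worst_case_loss(timedelta(hours=24)))
print(worst_case_loss(timedelta(minutes=15)))
```

The worst case is bounded by the backup interval itself, which is why a 24-hour schedule carries more potential data loss than a near real-time one.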
Consequently, data aging is an important concept when the business determines how quickly backup data would be needed if there were some event requiring the recovery and rebuild of the business utilizing that aged data. The concern becomes paramount when the business endeavors to calculate the RPO and finds that the relationship between the type of backup, the infrastructure and the timeframe in which the backup is required is not one-to-one. In fact, as the time required for recovery of backed up data shrinks (e.g., moves from, say, 24 hours to 15 minutes), the associated storage infrastructure type and cost become higher, sometimes exponentially more than that typically associated with a 24-hour backup requirement.
Therefore, several factors or considerations can help a business organization take a logical approach to making a more informed RPO calculation. Some of these factors are presented further along in this article to help a business process or application owner determine the timeframes (and costs) the business should anticipate for its data backups given a storage type.
RPO and the BIA
When a business impact assessment / analysis (BIA) is performed, the focus is typically on generating a recovery time objective (RTO) that accurately portrays the timeframe in which a business process, group or application must be recovered following some disaster event. Just about all individuals within the business understand this process. But what about the RPO? It usually gets short shrift in any application or infrastructure recovery timing discussion, and our client project experience highlights this fact. Furthermore, the average business executive does not have a good understanding of why an RPO is important and how it is formulated.
Whereas the RTO is usually determined via some sort of formalized calculation (or sometimes through an IT or management edict), it also uses business logic and output from the BIA to arrive at what is, many times, a compromise in restoration timeframes. However, when attempting to quantify an RPO, most recovery planners, application owners and business process leaders use a very informal derivation process to facilitate an agreed upon value of time. In many instances an RPO is determined solely by the storage technology and approach used within the business. However, we believe development of an RPO should require the same level of calculation rigor as an RTO value.
But why is a more formally calculated RPO even important? While you may be able to generate any number of reasons for developing a more ‘formalized’ RPO timeframe, a prescribed calculation provides clearer insight into the costs, timing, infrastructure and level of effort required to save and retrieve production data, particularly during a disaster situation. In addition, this approach highlights the fact that most organizations focus on an application when they should be focusing on the business process supported by the application.
RPO calculation factors
So, to help an organization better define their appropriate RPO(s), the following calculation factors are presented for consideration. All or some of these factors can be used in an RPO calculation. In addition, this list of factors should not be considered exhaustive. Rather, these are just some of the factors we currently use in our RPO model during client engagements. These factors are presented here to illustrate the thinking behind the establishment of a viable RPO suitable for the business process given the costs and recovery timeframes typically associated with tape, disk and/or cloud storage options.
Business process transaction level: Each business process or group has a different rate at which endemic business transactions are realized, processed and saved. This is important because typically, but not always, those business processes with a high transaction rate require a quicker recovery time than other ‘slower’ business processes. This could be due to the increased potential for transaction loss or corruption (e.g., the number of lost opportunities to act or react with complete data), faster access to previously processed and/or saved data to complete a transaction, the increased need to quickly save a specific transaction, or the requirement to quickly reconstruct a failed or lost transaction resulting from some disaster or loss-of-availability event. Therefore, the two components to be considered within this RPO factor center on the level of process transaction throughput (e.g., high, medium, low), and the timeframe in which the business process typically requires access to, and reuse of, saved data to complete a transaction (e.g., one minute, one hour, 24 hours, 3 days, etc.).
Type of data backup: From a storage technology perspective, there are any number of types of data backup approaches. These approaches include backing up to disk, tape, or the cloud (using tape or disk as the primary storage infrastructure) and come with differing levels of speed, timing, cost, and data retrievability. As another factor in determining an RPO, consideration must be given to data type, data backup approach, and the ease (and speed) of being able to retrieve backup data following a disaster or extended outage event.
Frequency of data backup: In our experience, the IT organization typically assumes the lead in determining when data backups should be taken, which does not always provide the optimal solution for a given business process. On the other hand, IT is continually considering the costs and technologies associated with data backup infrastructures in an effort to accelerate the backup process while keeping operational costs in check. To aid in establishing a workable RPO, business process owners must determine how often data backups must be taken (e.g., continual, every 15 minutes, daily, etc.) given the expense constraint associated with shorter times between backups.
Size of backup file: Another factor to be considered when identifying an RPO is the size of the backup file used to recover the application(s) needed by a business process. Due to advances in storage technologies, backup file size has become somewhat less of a concern for data stored on disk. But for data stored to tape, backup file size is still somewhat of a concern due to the lower speed of the actual physical backup and time associated with the data retrieval process. And, with the price of disk space continuing to drop and available space increasing seemingly exponentially, the cost demarcation between disk and tape backup capabilities is becoming blurred. In any event, when determining an RPO, size of a backup file may need to be considered.
Supporting infrastructure approach: How an application is supported from an infrastructure approach should also be a consideration in determining an appropriate RPO for a business process. As data centers and their included infrastructure evolve and accelerate system resiliency capabilities, it is important to determine whether the application being supported is part of an Active-Active, Active-Passive or Active-Traditional Recovery infrastructure approach and its impact upon business processes. While costs associated with an Active-Active infrastructure approach are significantly higher than those of an Active-Traditional Recovery approach, the potential for extended data loss is extremely low with the first approach and appreciably higher with the latter approach, both of which impact business process recovery timeframes.
Dependent applications, business processes and infrastructure using the backed-up data: A clear majority of our clients only have a very cursory or high-level view as to the dependencies between their business process and the supporting IT applications and infrastructure. This lack of application inter-dependence or ‘affinity’ knowledge typically leads to IT stumbling during an enterprise or business process test or recovery activity as they quickly discover the intended applications will not ‘run’ without other applications and/or infrastructure being ‘live’, and the business processes cannot meet their RTO. Furthermore, when other dependent (inbound/outbound) applications are identified, their RPOs are rarely in congruence. Therefore, before calculating an appropriate RPO that would apply to all applications using the applicable backed up data, consideration must be given to determining whether appropriate application, middleware, and infrastructure inter-dependencies have been documented.
Network bandwidth and intended data transfer speeds: Network bandwidth used to be a significant issue when contemplating recovery capabilities. Today, however, network bandwidth is less of an issue but should still be taken into consideration when determining RPO requirements. A determination must be made for every application and its RPO whether network capacity would be adequate for business process support during a recovery event.
Location of the data backup: Location of a data backup really only becomes a major concern when a tape-based backup strategy is used, and the time between consecutive backups typically approaches 24 hours or more. The concern centers not only on the amount of process data ‘lost’ between the time of the event or disaster and the last time process data was backed up, but also on the passage of time to retrieve, transport and restore data to the point of the last backup. The former timeframe is the typical definition of RPO, while the latter timeframe is a factor in a process or application RTO calculation. With a disk-based storage solution, location of a data backup is typically not an issue. However, while location of the data backup will have a lower weighting in comparison to some of the other RPO calculation factors, it must still be taken into consideration when determining an appropriate RPO for a business process or application.
Ability to recreate lost or corrupted data: While it’s good business practice to back up production data, it may be even more important to determine the ability to recreate lost or corrupted data following an event or disaster. Therefore, an RPO calculation should consider the ease, degree and speed with which lost data could be recreated by the business process. Similarly, it should be ascertained whether the ability to recreate lost data can be accomplished with information from other applications and/or business processes that may not have been impacted in the event or disaster. Lastly, the ability to recreate lost data may be dependent upon the business priority of the data and length of time from the previous backup. In many cases, this RPO calculation factor will be one of the top two or three factors driving the final RPO figure. This is because as data criticality increases, the ability to quickly and accurately recreate data decreases, and the timeframe associated with re-creation significantly increases.
Impact to the business: Finally, calculating a viable RPO must at the very least include a determination of the impact to the business due to the amount of data potentially lost following an event or disaster. While a similar concept applies within the RTO calculation, the consideration here focuses on whether the impact of the loss of a given amount of business data is direct, measurable, or time-based (e.g., immediate, delayed, extended, etc.).
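Several of the factors above (frequency of data backup, size of backup file and network bandwidth) interact in a simple way: a backup schedule is only workable if each backup can finish before the next one is due. The Python sketch below illustrates that check. The function names, the 1 GB = 1000 MB convention and the sample figures are assumptions made for illustration, not measurements from any engagement.

```python
def backup_duration_hours(size_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to copy size_gb at a sustained throughput.

    Assumes 1 GB = 1000 MB and ignores compression, deduplication and
    incremental backups, all of which shorten real-world durations.
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = size_gb * 1000 / throughput_mb_per_s
    return seconds / 3600


def interval_is_feasible(size_gb: float, throughput_mb_per_s: float,
                         interval_hours: float) -> bool:
    """True if one full backup completes within the intended backup interval."""
    return backup_duration_hours(size_gb, throughput_mb_per_s) <= interval_hours


# Hypothetical figures: a 1,800 GB backup file over a 100 MB/s link.
hours = backup_duration_hours(1800, 100)
print(f"full backup takes {hours:.1f} hours")
print("daily schedule feasible:", interval_is_feasible(1800, 100, 24))
print("15-minute schedule feasible:", interval_is_feasible(1800, 100, 0.25))
```

Under these assumed figures a daily full backup fits comfortably, while a 15-minute schedule would need a different approach (for example, replication or incremental copies) rather than repeated full backups.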
Calculating the RPO
Now that some of the factors impacting the derivation of an RPO have been identified, it must be determined how these factors are formulated to develop an appropriate RPO. First, it must be stated that there is typically more than one approach to applying or integrating these factors. The way the factors are applied is typically dependent upon the goals and objectives of the business, process or application in question. Second, the RPO calculation needs to be based upon, or compared to, some other figure to ensure the resulting RPO is in alignment with designated recovery requirements. Our experience has shown the RTO to be that figure.
As stated above, there is usually more than one approach to applying or integrating these factors to develop an appropriate RPO. For the RPO calculation model we use with most of our clients, a sliding scale of values (e.g., low to high, one to four, etc.) is assigned to each factor to help quantify how the business, process or application environment impacts a factor. For example, when considering the Business Process Transaction Level factor, the sliding scale might vary from “…not transaction-based and only requiring data within a minimum of 24 hours following an outage…” to “…highly transaction-based and requiring immediate access to data for financial ‘position’ purposes…” And this scale may change depending upon the industry of the client.
Following the establishment of appropriate value scales for each factor, the individual factors must then be integrated such that the outcome is an RPO value. In the spirit of providing useful information that can be immediately applied while not giving away any ‘secret sauce’ calculations, we typically weight the importance of each factor, combine the factors through the application of a proven formula, and compare the result with the corresponding RTO to determine whether the RTO or resulting RPO must be ‘moved’ to better accommodate business needs.
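To make the general shape of such a model concrete, here is a minimal Python sketch of a weighted scoring approach. It is emphatically not the formula used in our engagements: the weights, the one-to-four scale direction (four being the most demanding), the score-to-RPO tiers and the review rule are all hypothetical values chosen purely for illustration.

```python
# Hypothetical weights per factor; the article does not disclose real ones.
WEIGHTS = {
    "transaction_level": 3.0,
    "ability_to_recreate": 3.0,
    "business_impact": 3.0,
    "dependencies": 2.0,
    "backup_frequency": 2.0,
    "infrastructure_approach": 2.0,
    "backup_type": 1.0,
    "backup_size": 1.0,
    "network_bandwidth": 1.0,
    "backup_location": 1.0,
}

# Hypothetical tiers: (minimum weighted score, candidate RPO in hours).
TIERS = [(3.5, 0.25), (2.5, 1.0), (1.5, 4.0), (1.0, 24.0)]


def weighted_score(scores: dict) -> float:
    """Weighted average of factor scores, each on a 1 (low) to 4 (high) scale."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the known factors")
    for name, value in scores.items():
        if not 1 <= value <= 4:
            raise ValueError(f"{name} must be between 1 and 4")
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return total / sum(WEIGHTS.values())


def candidate_rpo_hours(score: float) -> float:
    """Map a weighted score to a candidate RPO; higher scores mean tighter RPOs."""
    for floor, hours in TIERS:
        if score >= floor:
            return hours
    raise ValueError("score is below the scale minimum")


def needs_review(rpo_hours: float, rto_hours: float) -> bool:
    """Illustrative rule only: flag the pair when the RPO exceeds the RTO."""
    return rpo_hours > rto_hours


# Hypothetical process: three critical factors score 4, the rest score 2.
scores = {name: 2 for name in WEIGHTS}
scores.update(transaction_level=4, ability_to_recreate=4, business_impact=4)
rpo = candidate_rpo_hours(weighted_score(scores))
print(f"candidate RPO: {rpo} hours, review against 4-hour RTO: {needs_review(rpo, 4.0)}")
```

In practice the scale definitions, weights and tiers would be calibrated to the client's industry and business processes, and the comparison with the RTO would be a discussion point rather than a pass/fail test.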
Using the information provided in this article
This article has endeavored to present a more formalized method for calculating RPOs, which have typically been developed via a very informal derivation process to facilitate an agreed upon value of time for recovery of business process-specific data. The use of RPO calculation factors may help organizations identify and implement more appropriate and viable RPOs that better encompass business process requirements and IT delivery capabilities during a disaster scenario.
So, is that newly calculated business process or application RPO to be taken as a final figure? Not in all cases. This calculation may be accepted as the de facto RPO, but in most situations, it establishes a very strong base from which informed decisions can be made regarding when a business or process needs access to specific data following some extended outage event.
Robert Emmel is a Senior Principal and Global Business Continuity Management (BCM) lead and Subject Matter Expert (SME) for Accenture’s BCM practice. He has more than 25 years of experience encompassing a broad range of business, risk management, and information technology service and operational capabilities. Bob has performed in excess of 120 recovery planning engagements, more than 30 data center assessment and enhancement projects, and five data center consolidation/migration efforts. In addition, he has held interim IT operational and CIO positions with clients and organizations spanning a wide array of industries and global locations. A former naval officer, Bob has an undergraduate degree in Naval Science and Analytical Management, a master’s degree in Business Administration, and a doctorate in Management with a focus on data center risk identification and management. He holds the COAP, CBCP, CBCV, AFBCI, and CHS-III certifications. Contact Robert at email@example.com