Business continuity and service level agreements: a perfect marriage?
- Published: Wednesday, 01 April 2015 08:49
By Andrew Hiles.
Service level agreements (SLAs) and business continuity go hand-in-hand: or they should do!
Whether SLAs are implemented in support of a balanced scorecard, to align information and communications technology with business mission achievement, or as a stand-alone initiative, the strategic use of service level agreements can be a perfect way to justify investment in resilience and business continuity: an approach I have been advocating for over ten years.
How does it work?
First, define the business mission.
Take, as an example, a multinational company (call it Klenehost) selling miniature packs of soap, shampoo, hair conditioner and shower gel to the hotel industry. These are packaged in different ways and customized for specific hotel chains.
Klenehost states: “Our mission is to be the number one vendor, world-wide, of in-room hygiene products to the hotel industry.”
Fine – but what does that mean? Number one in what way? The biggest (by turnover)? The most profitable? Having seven of the top ten hotel groups as customers? Having a dominant market share in each of the geographic regions in which Klenehost operates? Having the products most liked by hotel guests?
Following board level discussion and business analysis, critical success factors (CSFs) are developed to reflect the board’s definition of mission achievement. High-level key performance indicators (KPIs) are established: these are the numbers and ratios that reflect whether the CSFs have been met. Examples of KPIs could be: return on investment; net profit; turnover; customer satisfaction ratings from hotel guests; return per employee; market share by geographic region; key account penetration; customer churn rates; employee satisfaction.
Initiatives can then be undertaken to put the necessary products, infrastructure, tools, methods, research etc. in place so that the mission may be achieved. Capacity plans and HR policies can be put in place to support mission delivery. Service specifications can be developed to ensure that services meet customer and business needs.
However, the problem with KPIs is that they are usually lagging indicators: you may only know whether you have hit the numbers when it is too late to take action to correct under-performance. The KPIs therefore have to be broken down into lower level business performance requirements and technical performance measurements: enter service level agreements.
Technical measurement is important, but only to the technicians who can use it to adjust the service so that it meets SLAs and hence supports business achievement of KPIs and, ultimately, of CSFs and the mission. Technical measurement is thus a leading indicator for ICT. SLAs for information and communications technology (ICT) used to be written in technical terms, typically reporting to end users on the platforms from which services were provided: mainframe, server, WAN or LAN availability and response, with minimal business content. Such SLAs reflected technical measurements over which the end user had no control and in which they had little interest. An analogy is the in-flight information provided to airline passengers: because the data is available, the passenger is told the outside temperature. What use is this information to the passenger? What are they supposed to do about it?
Technical achievement needs to be put into a business context, and the business or support unit needs ICT performance reported not at a technical platform level but in terms of overall service quality across all the platforms that support the business activity. The CFO may use PCs, LANs, servers, printers, a WAN and a mainframe: but these are just tools. As far as the CFO is concerned, the deliverable is what matters, not the tool. Are invoices issued on time? Are credit control systems working effectively? Are debtors chased promptly? Is the payroll out on time? Is the call center working at optimum effectiveness in handling the maximum number of calls, maximizing sales and minimizing customer churn?
The ICT technical performance measures need to be translated into business terms, since they then reflect whether or not ICT’s customers - the business or support units – are meeting their service levels and hence their KPIs. Timely production of business performance reports enables ICT’s customers to take any remedial action necessary to ensure each unit is on course to support overall mission achievement.
So far we have decomposed high-level metrics into technical performance measures to establish a direct chain of results running from technical performance, through the business performance it supports, to mission achievement. But we can do more. We can evaluate all of the ICT services, establishing how critical each is to business mission achievement and what the impact on the business would be if it failed.
This is best done with input from the business and support units. Ideally, a high-level business steering committee should be established, with representation from finance and marketing as well as from key operational and support areas. We identify the criticality and the recovery time objective (RTO) for each service (that is, the maximum length of time the organization can afford to be without the service). We can also establish the recovery point objective (RPO) (that is, the point to which data must be recovered – e.g. start of day, end of day, or to a checkpoint). The results of this process form the basis of the SLA requirements for availability and reliability (the number of incidents of outage) for each service.
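The output of such a business impact exercise can be captured in a simple structure that drives the tiering and recovery sequencing that follow. A minimal sketch in Python (the service names and figures are invented purely for illustration; real values come from the steering committee):

```python
from dataclasses import dataclass

@dataclass
class ServiceAssessment:
    """One row of the business impact analysis output."""
    service: str
    criticality: int    # tier number, 1 = most critical
    rto_hours: float    # maximum tolerable outage duration
    rpo_hours: float    # maximum tolerable data loss (0 = real-time mirroring)

# Hypothetical entries for illustration only.
assessments = [
    ServiceAssessment("order processing", criticality=1, rto_hours=0.07, rpo_hours=0.0),
    ServiceAssessment("payroll", criticality=3, rto_hours=24, rpo_hours=24),
    ServiceAssessment("archive reporting", criticality=5, rto_hours=72, rpo_hours=24),
]

# Sort so the most critical, shortest-RTO services head the recovery sequence.
recovery_order = sorted(assessments, key=lambda a: (a.criticality, a.rto_hours))
for a in recovery_order:
    print(f"{a.service}: tier {a.criticality}, RTO {a.rto_hours}h, RPO {a.rpo_hours}h")
```

Sorting on criticality and RTO together gives a defensible recovery sequence to put in front of the steering committee.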
The results can be sorted into tiers. A financial institution might, perhaps, define tiers as follows:
- Tier one: continuous availability requirement - maximum four minutes downtime a year.
- Tier two: high availability - maximum of one outage per year, maximum four hours outage per year.
- Tier three: recovery essential within 24 hours - maximum three outages per year.
- Tier four: recovery required within three days - maximum four outages per year.
- Tier five: delayed recovery – all other services.
There may be as many tiers as appropriate, with the requirements for each tier adapted to each particular organization’s needs.
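The arithmetic behind each tier is straightforward: an annual downtime allowance implies an availability percentage. A short Python sketch, using worst-case downtime figures derived from the illustrative tier definitions above (tiers three and four assume every permitted outage runs to the full recovery time objective):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def implied_availability(max_downtime_minutes_per_year: float) -> float:
    """Availability percentage implied by an annual downtime allowance."""
    return 100 * (1 - max_downtime_minutes_per_year / MINUTES_PER_YEAR)

# Worst-case annual downtime per tier (illustrative figures only).
tiers = {
    "tier one": 4,                 # four minutes a year
    "tier two": 4 * 60,            # one outage of up to four hours
    "tier three": 3 * 24 * 60,     # three outages, 24-hour RTO each
    "tier four": 4 * 3 * 24 * 60,  # four outages, three-day RTO each
}

for name, minutes in tiers.items():
    print(f"{name}: {implied_availability(minutes):.3f}% availability")
```

Expressing each tier as a percentage in this way makes it directly comparable with supplier SLA figures and with the infrastructure tier ratings discussed later.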
The next step is to review the quality of infrastructure at each site. Clearly, the quality of infrastructure has to be capable of supporting the availability and reliability requirement for the tier of service that is to be delivered to that site. Critical component failure analysis can be undertaken to establish the theoretical availability of the equipment, operating systems, applications and network on which the service depends.
A simplified example could be a service with one access route, a firewall, a web server, an application server and a database server.
Including operating systems, middleware and application software there may be, say, 15 components involved. In this case, each component is a potential single point of failure. If each component has a 99.98 percent availability, the theoretical availability of the overall service is the product of the component availabilities:

0.9998 to the power 15 = approximately 0.9970

The overall availability therefore works out at about 99.7 percent: roughly 26 hours of unavailability a year, or over two hours a month. And we have not included the resilience of the facility or the availability of people in this, so actual availability could be significantly lower. Clearly this could be unacceptable for a tier one continuous availability service as defined above.
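The serial-availability calculation can be checked in a few lines of Python (component count and availability as in the example above):

```python
# Fifteen components in series, each 99.98% available; the service is only up
# when every component is up, so the availabilities multiply.
component_availability = 0.9998
components = 15

service_availability = component_availability ** components
annual_downtime_hours = (1 - service_availability) * 365 * 24

print(f"service availability: {service_availability:.2%}")    # ~99.70%
print(f"annual downtime: {annual_downtime_hours:.1f} hours")  # ~26 hours
```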
To get continuous availability, the configuration might need duplication or triplication, with each configuration cross-linked by triangulated communications and geographically separated (e.g. one in New York, one in Dallas, one in Paris, France) so that the same physical disaster could not impact all configurations and any one could stand on its own. A cloud solution using spare capacity on existing equipment could be another, possibly cheaper, option but the same principle applies. The system could be affected not just by hardware or software failure, but also by overload. So capacity on demand and storage on demand could also be considered. Since zero downtime is the requirement, data has to be mirrored in real time: there is no time for traditional data recovery from off-site backup tapes. In this case, disaster recovery arrangements are not simply added on later: they are an integral part of the system design and build. This solution is expensive and similar resilience might be found cheaper in a public, hybrid, community or private cloud. However, the principles of replication, resilience, elastic capacity and geographic separation remain true.
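The benefit of duplication can be estimated with the same arithmetic, under the simplifying assumption that the two configurations fail independently (geographic separation makes this more plausible, but common-mode failures such as software bugs or overload would break the assumption). A sketch in Python:

```python
# Single chain of 15 components at 99.98% each, as in the earlier example.
chain_availability = 0.9998 ** 15          # ~0.9970
chain_unavailability = 1 - chain_availability

# Duplicated, independent configurations: the service is down only when
# both chains are down at once. This is a simplifying assumption -- it
# ignores common-mode failures shared by both sites.
duplicated_availability = 1 - chain_unavailability ** 2
annual_downtime_minutes = (1 - duplicated_availability) * 365 * 24 * 60

print(f"duplicated availability: {duplicated_availability:.5%}")
print(f"annual downtime: {annual_downtime_minutes:.1f} minutes")  # ~4.7 minutes
```

On these assumptions, duplication turns roughly 26 hours of annual downtime into under five minutes, which is why the tier one target of a few minutes a year demands this class of design.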
Contracts and SLAs with external suppliers need to reflect the appropriate tier rating: 99.5 percent availability allows only about 44 hours of downtime a year, which is unachievable if the maintenance contract merely allows four hours to get an engineer on site. The maintenance SLA needs a guaranteed four-hour fix time.
Once the requirement has been agreed by the business for each tier, it is simply a case of applying the rules. Clearly tier one services are going to require more funding than tier three or four: resilience costs money. However, the budget naturally follows the business decision.
The concept can be adapted to any organization.
By mapping the service tiers against the Uptime Institute’s infrastructure tiers, you can gain a high level of confidence in achieving the service levels and meeting demanding recovery time objectives.
The Uptime Institute tiers (I-IV) are progressive; each tier incorporates the requirements of all the lower tiers.
Tier I: Basic capacity
A Tier I data center provides dedicated site infrastructure to support information technology beyond an office setting. Tier I infrastructure includes a dedicated space for IT systems; an uninterruptible power supply (UPS) to filter power spikes, sags, and momentary outages; dedicated cooling equipment that won’t get shut down at the end of normal office hours; and an engine generator to protect IT functions from extended power outages. This provides availability of 99.671 percent, with an RTO of 28.8 hours.
Tier II: Redundant capacity components
Tier II facilities include redundant critical power and cooling components to provide select maintenance opportunities and an increased margin of safety against IT process disruptions that would result from site infrastructure equipment failures. The redundant components include power and cooling equipment such as UPS modules, chillers or pumps, and engine generators. Tier II provides 99.749 percent availability, with an RTO of 22 hours.
Tier III: Concurrently maintainable
A Tier III data center requires no shutdowns for equipment replacement and maintenance. A redundant delivery path for power and cooling is added to the redundant critical components of Tier II, so that each and every component needed to support the IT processing environment can be shut down and maintained without impact on the IT operation. Tier III gives 99.982 percent availability, with an RTO of 1.6 hours.
Tier IV: Fault tolerance
Tier IV site infrastructure builds on Tier III, adding the concept of fault tolerance to the site infrastructure topology. Fault tolerance means that when individual equipment failures or distribution path interruptions occur, the effects of the events are stopped short of the IT operations. Tier IV provides 99.995 percent availability, with an RTO of 0.4 hours.
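As a sanity check, each tier’s availability percentage should be consistent with its annual-downtime figure. A short Python sketch (availability figures as commonly cited for the Uptime Institute tiers; treat them as illustrative, and consult the Uptime Institute for definitive numbers):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760

# Commonly cited availability percentage for each Uptime Institute tier.
uptime_tiers = {
    "Tier I":   99.671,
    "Tier II":  99.749,
    "Tier III": 99.982,
    "Tier IV":  99.995,
}

for tier, availability in uptime_tiers.items():
    downtime_hours = (1 - availability / 100) * HOURS_PER_YEAR
    print(f"{tier}: {availability}% -> {downtime_hours:.1f} hours downtime a year")
```

Running the numbers reproduces the downtime figures quoted for each tier (28.8, 22, 1.6 and 0.4 hours a year respectively), which is a useful check when matching service tiers to facility tiers.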
An example from the construction industry
The construction industry is substantially less demanding than, for instance, banking. The example that follows reflects the development of SLAs for a construction company in which a central ICT function served six hundred different sites owned by six operating companies trading internationally.
Applications were allocated to application tiers in order to ensure that the infrastructure and support was provided to match the criticality of the application to the organization. Tier ratings were decided by business divisions who had to fund ICT accordingly. For new applications, the category influenced the design and resilience of equipment and infrastructure to be used.
Backup policy for each application tier was defined by the Corporate IT Backup Policy and was designed to facilitate recovery within the relevant recovery time objective.
Sites (i.e. the sites where the users reside) were categorized in accordance with infrastructure resilience criteria. The higher the level of resilience at a site, the more reliable the service would be and the more suitable the site would be to run high tier applications.
Site infrastructure needs to be appropriate to provide the necessary availability, resilience and response required by the application tiers.
The requirements were identified for each application tier.
Problem management is also facilitated by this approach. The application tier was also considered in assessing the severity of an issue reported to the service desk. Any loss of service or significant degradation of response to a tier one service is likely to have a high priority, whereas a tier four service may never justify severity level one (i.e. the most urgent) response.
Similarly, tier one services are likely to have a more demanding response service level than tier four services.
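A priority scheme of this kind can be made mechanical. A sketch in Python of one possible set of rules (the mapping itself is hypothetical, invented here to illustrate the approach, not the construction company’s actual scheme):

```python
def incident_severity(application_tier: int, service_lost: bool) -> int:
    """Map an incident to a service-desk severity level (1 = most urgent).

    Hypothetical rules: total loss of a tier one service is severity one,
    degradation ranks one level lower, and a tier four (or lower-tier)
    service never rises above severity three, whatever the impact.
    """
    severity = application_tier if service_lost else application_tier + 1
    return min(severity, 3) if application_tier >= 4 else min(severity, 4)

print(incident_severity(1, service_lost=True))   # 1: tier one outage, most urgent
print(incident_severity(4, service_lost=True))   # 3: tier four never reaches severity one
```

Encoding the rules this way lets the service desk assign priorities consistently instead of negotiating each incident from scratch.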
By following this approach, the result was:
- Alignment of ICT architecture and infrastructure with business requirements;
- Budget hassles avoided by adhering to the infrastructure categories and acceptance by the business that it had to invest in upgrading infrastructure to match the application tier requirements;
- Automatic acceptance of disaster recovery requirements related to the different application tiers;
- A single, compact 25-page SLA covering all ICT activities for the whole organization.