Sandi Hamilton looks at the importance of considering the human factor when planning and managing high availability environments for critical applications such as SQL Server. She considers inherent organizational issues which result in continuity gaps and discusses the communication and documentation practices that can reduce failures.
Introduction
Technology is continuously changing, driven by user demand for improved performance. This is true for high availability (HA) resources that protect and secure organizational data assets and ensure availability for critical applications such as SQL Server. Despite significant innovations in hardware and software, and the move of resources to the cloud, all of which help to reduce the risk of data loss and application downtime, failures continue to happen. The often-overlooked causes of application downtime are related to the human factor.
The solution to protecting your critical SQL Server applications is integrating best practices, planning, testing, and automated technology that limits human error and provides predictable results. This article looks at inherent organizational issues and discusses the communication and documentation practices that can reduce failures. It also discusses the importance of regularly exercising a disaster recovery plan that supports your business continuity in both test and QA environments, whether your solution is on-premises or in the cloud, and provides tips on how to make HA testing faster and easier.
Options for removing continuity gaps
There may be gaps in your operational processes where human activity could contribute to disrupting business-critical applications and your business continuity. Here are five key areas your organization should check:
- Roles and responsibilities
- Complexity of environments and testing upgrades
- Documentation and other communication
- Disaster recovery plan testing
- Nuances between on-premises and the cloud
Let’s take a look at how these can contribute to continuity risk and what you can do to close the gaps.
Roles and responsibilities
It is important to understand your IT organization. How do the various components interrelate? Who manages them?
Every IT group has specific roles for members of its team. These may vary according to the size of your organization. For example, there may be specific job roles responsible for applications, operating systems, networking, the cloud environment, storage, and so forth. Often, however, responsibility for these critical roles is less clear. It is often shared among two or three people sitting in different organizational silos, all of whom need to work together as a team to achieve operational success. Check that everybody understands not only their own responsibilities but also those of the other members. To keep things running smoothly, they should keep the other groups informed about any changes being implemented. For example, when the infrastructure team is planning to reconfigure system components, they should keep their SQL Server counterparts informed as well.
The size of the organization and the environment (physical, virtual, cloud or hybrid) can also come into play. There may be separate people handling each of these roles, or there may be only one or two people in the organization responsible for all of them. Do you know your SQL Server administrator? It could be you.
Roles and responsibilities could be different in different environments. For example, if the environment is in the cloud, then you may have a separate cloud administrator that is responsible for configuring and managing systems, separate from the other administrators. Also, different administrators may have different permissions on the systems. An application administrator may not have permission to update or change many of the operating system-level files. In this case, the admin will need to coordinate the changes to be made for the application at the OS level.
Not only do you need to track system and environment changes, but also changes to the teams managing them. Too often, the stability of these components drastically decreases because there is no transition between teams. The best practice is to plan a transition period whenever the original team changes. This period should include training the new team members on the systems, the environment, and the history of the project.
Create a central document (sometimes called a runbook) that records roles and responsibilities along with all the critical information needed to manage and maintain critical systems. Share the runbook with the new team. It is a living document that requires continual updating.
Another problem occurs when system administrator responsibilities are divided among several people. Because database administrators may have little-to-no knowledge of the operating system and its associated permissions, they will need to bring in the operating system administrators to help resolve problems on the systems. This can cause significant delays in getting problems debugged and corrected. The recommended solution is to ensure that database administrators have immediate access to the OS administrators so that problems are resolved quickly. With the admin role being split, it is very important to communicate plans to change or update your systems and environments to all necessary team members.
Complexity of environments and testing upgrades
Another factor to consider is the complexity of the environment. Environments running on Windows are typically less complex, whereas those running on Linux are often more complex. How old are the systems? What version of SQL Server is being used? Are there any legacy or custom applications involved? All of these questions need to be answered and well understood. Do you have any test systems along with your production systems?
It is critical to have a non-production environment to use for testing new versions, updates or configurations of applications, databases, and operating systems before deploying them to production. In the past, when rolling updates were the norm, some companies could not justify the cost of separate DEV and QA testing environments. But now, with the cloud and virtualization, separate environments are considerably easier and less costly to deploy and maintain. The best practice is to have at least one environment other than your production environment for testing. But if you have only a production environment, then consider upgrading the secondary node in your cluster first.
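As an illustration, before patching a node as part of a rolling upgrade you can confirm that the local replica is not currently serving as the primary. The following is a minimal sketch, assuming an Always On Availability Group and Python with the pyodbc driver; the connection details are placeholders for your own environment:

import pyodbc

# Connect locally to the node you intend to patch (hypothetical connection details).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;Trusted_Connection=yes;",
    autocommit=True,
)

# Ask SQL Server what role the local replica currently holds.
role = conn.execute(
    "SELECT role_desc FROM sys.dm_hadr_availability_replica_states WHERE is_local = 1;"
).fetchval()

if role == "SECONDARY":
    print("Local replica is a secondary. It is safe to apply the update here first.")
else:
    print(f"Local replica role is {role}. Fail over or pick another node before patching.")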
Documentation and other communication
For an IT organization to be effective, people must communicate effectively as well as document everything, including policies, procedures, and deployed environment configurations. Documentation, including your runbooks, should be continuously updated, stored in and accessible from a centralized location (including your DR location), and kept in both digital and physical formats. Not having physical documentation can be a problem if your system is down and your recovery plan exists only in digital format.
How well do IT groups communicate? There is a wide range amongst organizations. At the lower end of the spectrum, where there are limited open lines of communication, it can be frustrating to get things done in a timely manner, such as finding a storage admin to provide more storage or a server admin to provision a server. At the other end of the spectrum are highly communicative groups, where the teams work together, planning and meeting on a regular basis, led by senior management so everyone knows where the priorities are.
Disaster recovery plan testing
Business continuity depends on always being prepared. The only real way to achieve that is by proactively planning and continuously testing your systems. DR planning must be included in your runbook. The best practice that we have seen with our customers is scheduling tests for disaster recovery and business continuity before going into production. Then, once their systems are in production, they test during maintenance outages by simulating and inducing failures.
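For example, a controlled failover of an Always On Availability Group during a maintenance window is one way to induce a failure safely and verify that clients reconnect as expected. A minimal sketch, assuming Python with pyodbc; the server and availability group names are hypothetical:

import pyodbc

# Hypothetical names: the secondary replica to promote and the availability group.
SECONDARY = "sqlnode2"
AG_NAME = "AG_Sales"

# A planned manual failover (no data loss) must be issued on the target secondary replica.
conn = pyodbc.connect(
    f"DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={SECONDARY};Trusted_Connection=yes;",
    autocommit=True,
)
conn.execute(f"ALTER AVAILABILITY GROUP [{AG_NAME}] FAILOVER;")
print(f"Failover of {AG_NAME} to {SECONDARY} initiated.")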
Consider the following scenario. A customer is halfway into building redundancy, with two systems, power, and everything else in the same rack. The power goes out at the facility; there is no disaster recovery system at a different location to fail over to, no power, and no backup. This illustrates the need to include DR at the beginning of your projects and to consider all of the environments that are involved.
Disaster recovery really focuses on the IT infrastructure: the systems that need to be brought back online. The business continuity plan should guide and inform the IT department about which systems are business critical and define recovery time objectives (RTO) and recovery point objectives (RPO), so those systems can be back up and running within those specific metrics. Unfortunately, many companies believe testing is difficult to do because of the time, cost, and disruption to operations. At a minimum, you should perform DR testing annually.
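Between full DR tests, one way to keep an eye on RPO is to monitor how far each secondary replica lags behind the primary. A minimal sketch, assuming an Always On Availability Group and Python with pyodbc, run against the primary replica; the columns come from the sys.dm_hadr_database_replica_states DMV:

import pyodbc

# Hypothetical primary replica name; adjust to your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlnode1;Trusted_Connection=yes;"
)

rows = conn.execute(
    """
    SELECT ar.replica_server_name,
           drs.database_id,
           drs.log_send_queue_size,   -- KB of log not yet sent to the secondary
           drs.last_commit_time       -- time of the last transaction hardened on that replica
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.availability_replicas AS ar ON ar.replica_id = drs.replica_id
    WHERE drs.is_primary_replica = 0;
    """
).fetchall()

for r in rows:
    print(f"{r.replica_server_name} db {r.database_id}: "
          f"{r.log_send_queue_size} KB unsent, last commit {r.last_commit_time}")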
Nuances between on-premises and the cloud
Your IT organization should also be aware of the nuances of on-prem versus cloud configuration. For example, what if Dev and Test systems are on prem and production is hybrid or in the cloud? Are these the same? If you're running production in the cloud, you want to have Test and Dev in the cloud. If you're running in a VMware environment, you want Test and Dev on VMware as well. Of course, this isn't always possible. In any case, your team should be knowledgeable about the cloud, especially as it relates to high availability.
In many ways, high availability is the same in the cloud as it is on premises. For instance, high availability solutions such as Always On Availability Groups and SQL Server failover cluster instances are still relevant. But there are typically some extra steps that need to be taken in the cloud, such as configuring internal load balancers for client redirection in Azure, dealing with multi-subnet failover in AWS, or configuring SANless cluster solutions to overcome the lack of shared storage.
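To take one of those examples: when a listener or clustered instance spans subnets, which is common in AWS and in multi-zone cloud deployments, clients should enable multi-subnet failover so the driver tries all of the listener's IP addresses in parallel and reconnects quickly after a failover. A minimal sketch, assuming Python with pyodbc and a hypothetical listener name:

import pyodbc

# MultiSubnetFailover=Yes tells the ODBC driver to attempt every IP address
# registered for the listener in parallel, shortening reconnect time after
# a cross-subnet failover. The listener and database names are placeholders.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=ag-listener.example.internal;"
    "DATABASE=SalesDB;"
    "MultiSubnetFailover=Yes;"
    "Trusted_Connection=yes;"
)
print(conn.execute("SELECT @@SERVERNAME;").fetchval())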
Configuration management and change tracking do not go away in the cloud. They actually become more important and more complex because there are more components to keep track of. Not only do you have your operating systems, applications, and databases; you also have all of the configuration management and change tracking of the cloud components to deal with. Changes to your environments, such as OS, database, application, and cloud component upgrades, should be understood, planned, and tracked. Make sure you understand the changes before they are implemented in your environment. Check that the upgrades are compatible with and supported by the other components. You should also develop procedures for upgrades and get them reviewed by the affected groups.
Also, make sure you test your planned upgrades in your test environments first, before rolling them into production. Often, customer environments become unstable due to a lack of change tracking. For example, a system crashes because a kernel upgrade was installed that was not certified as supported with the other environment components. Don't let this situation happen to you. Make sure that you plan and track your system and environment changes.
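One lightweight way to support that change tracking is to snapshot the key version information for each environment on a schedule and keep the output with your runbook. A minimal sketch, assuming Python with pyodbc; sys.dm_os_host_info is available from SQL Server 2017 onward, and the log file name is a placeholder:

import pyodbc
from datetime import datetime, timezone

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;Trusted_Connection=yes;"
)

sql_version = conn.execute("SELECT @@VERSION;").fetchval()
host = conn.execute(
    "SELECT host_platform, host_distribution, host_release FROM sys.dm_os_host_info;"
).fetchone()

# Append a dated record to a simple log kept alongside the runbook.
with open("environment_snapshot.log", "a") as f:
    f.write(f"--- {datetime.now(timezone.utc).isoformat()} ---\n")
    f.write(f"SQL Server: {sql_version}\n")
    f.write(f"Host: {host.host_platform} {host.host_distribution} {host.host_release}\n")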
Be aware of what happens when the cloud service provider wants to do planned maintenance. Cloud service providers will notify you if the infrastructure hosting your particular virtual machine is going to undergo planned maintenance that has some amount of downtime associated with it. The notification will vary by cloud provider. Once you are notified, ensure that your team has a way to migrate the workload off the systems that are going to be affected.
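In Azure, for example, these notifications can also be picked up programmatically from inside the guest through the Instance Metadata Service's Scheduled Events endpoint; other providers have their own equivalents. A minimal sketch in Python using the documented Azure endpoint:

import requests

# Azure Instance Metadata Service: Scheduled Events, polled from inside the VM.
URL = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

events = requests.get(URL, headers={"Metadata": "true"}).json().get("Events", [])

for e in events:
    # EventType examples include Reboot, Redeploy, Freeze, Preempt, and Terminate.
    print(f"{e['EventType']} affecting {e['Resources']} not before {e.get('NotBefore', 'n/a')}")

if events:
    print("Planned maintenance detected: start migrating workloads off the affected systems.")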
Typically, in high availability environments, you're going to have systems that are replicated between availability zones, or even between regions, and you're going to want to move those workloads off the impacted availability zone or region into one that's not being affected.
The author
Sandi Hamilton, Director of Customer Support, SIOS Technology.
For more than 20 years, Sandi Hamilton has helped IT teams in hundreds of organizations implement and manage high availability environments to protect their critical applications, including SQL Server, SAP, HANA, and Oracle. Sandi holds a BSc in Electrical Engineering from the University of Florida.