There are various options for SQL Server disaster recovery but each comes with its own tradeoffs and you need to understand the nature of these in order to select the approach that’s best for your organization. Dave Bermingham offers some help with this area…
Whether it’s on-premises, in the cloud, or a combination of both, something is going to happen to your production SQL Server system. You don’t know what (or when or why), but something is going to happen. While it’s useful to game out the things that are most likely to cause your database to go dark, it’s much more important to develop a plan to recover your SQL Server database quickly and efficiently — particularly if it goes offline for other than the main reasons you’ve imagined.
The good news is that there are multiple ways to recover your SQL Server system and get it back up and running. The bad news is there are tradeoffs – involving potential for data loss, time to recovery, operational complexity, and operational cost (to name only a few) – and you need to understand the nature of those tradeoffs in order to select the disaster recovery (DR) approach that’s best for your organization.
Evaluating your options
Four approaches to SQL Server disaster recovery stand out as readily accessible:
- Log shipping
- Failover clusters
- Availability Groups
- SANLess failover clusters
All four approaches require at least two sets of geographically separate servers configured to support your SQL Server needs (a third set in a third location would provide even greater security from a DR perspective). One of these sets will support your day-to-day SQL Server needs. It’s your primary instance of SQL Server. Another (and a third, if available), will stand in a geographically separate location. This is a secondary or backup instance that can be brought online to support your SQL Server production environment if the primary instance becomes unavailable.
Beyond that, each of these four approaches to DR provide different benefits. The question is, which one will best support your organization if you find yourself in a true disaster situation?
Log shipping
Log shipping is at once the least costly and most cumbersome approach of the four. You need to create offsite backup copies of your database on a regular basis and, between backups, you need to save remote copies of all your SQL Server transaction log files. These log files act as a kind of roadmap that SQL Server can use to roll forward a backup version of the database, which is effectively what you must do in a log shipping failover scenario.
From a DR perspective, log shipping is imperfect. If the last log file was saved 20 minutes before your server went offline, you may not be able to reconstruct those last 20 minutes of transactions — and that could represent a substantial data loss. Similarly, if your backup is relatively old, the process of rolling that database forward so that it resembles the state of the database before it became unavailable can take a long time. Log shipping lacks an automated failover management component, too, which means that in addition to having to rebuild your database on your backup servers you’re also going to have to reconfigure parts of your infrastructure manually to point users to the instance of SQL Server running on your backup infrastructure. The bottom line? Log shipping can save you money, but if your recovery time objectives (RTOs) are to have your SQL Server database up and running on backup servers in less than a few hours, log shipping may not provide the support you desire.
Failover clusters
Failover clusters have been around for quite a while in the world of Microsoft Windows Server. Traditionally, they consist of two or more servers (in separate locations) that can interact with shared storage (often configured as a storage area network [SAN]). If the system supporting the primary instance of SQL Server goes offline, the Windows Server Failover Clustering (WSFC) service (which is built into Windows Server), automatically fails over to the backup infrastructure and performs all the necessary administrative tasks to redirect users to the new instance of SQL Server running on the now-wide-awake backup server. All this can take just seconds and requires no administrator intervention, and because the individual nodes in the failover cluster all interacted with a shared SAN resource, the ‘new’ primary server can simply pick up where the original left off.
But there are tradeoffs here, too, when it comes to DR. One glaring one is the SAN. It is vulnerable as a single point of failure. If the primary instance of SQL Server goes offline because an event compromises the entire data center /centre, a SAN residing in the same data center will likely be compromised as well. Then it doesn’t matter how quickly the secondary servers come online. If the SAN is unreachable, there won’t be any database for the secondary servers to access.
One other tradeoff: because of the underlying technologies informing the cloud, it’s not possible to share a SAN in the cloud. As we’ll see, though, a SANLess failover cluster is an alternative to be considered.
Availability Groups
For some organizations, Availability Groups (AGs) can overcome the DR tradeoffs lurking in traditional failover clusters. AGs have been a part of SQL Server since 2012, and they take advantage of the WSFC services resident in Windows Server to automate failover from a primary SQL Server instance to a secondary instance when detecting a critical failure. AGs overcome the single-point-of-failure vulnerability by eliminating the SAN. Instead, AGs take advantage of local attached high-speed storage and provide services that replicate your SQL Server databases to all the cluster nodes in the AG. Note, too, that by eliminating the shared storage component of the failover cluster, AGs provide a path for configuring a highly responsive DR solution in the cloud.
But AGs can prove a costly DR solution for other reasons. Using AGs to support more than a few SQL Server databases requires you to use the very costly Enterprise edition of SQL Server – even if the Standard edition of SQL Server is all your application really requires. Another potentially important tradeoff is this: AGs replicate only the user-defined SQL databases. Other important databases, including the master database (MSDB) and those supporting agent jobs, logins, and passwords, are not replicated from the primary to the secondary storage sites. That may be fine if failover lasts only a few minutes, but if this is a true break-glass event and your primary infrastructure may be unavailable for some time, you run the risk of business interruption because you have an incomplete replica of your SQL Server instance.
SANLess failover clusters
A fourth approach to configuring SQL Server for DR is to use what is known as a SANLess failover cluster. Like an AG, a SANLess failover cluster can operate on-premises or in the cloud (or in a hybrid topology) and it consists of multiple geographically separate servers (or VMs), each configured with local high-speed storage where a replica of the SQL Server databases reside. Unlike an AG, however, the SANLess failover cluster replicates all your SQL Server databases, which eliminates the risk of losing your MSDB and other important databases.
You’ll need to license a third-party tool to create and maintain the SANLess failover cluster. But, the SANLess cluster management software works with any edition of Windows or SQL Server that Microsoft still supports, so you don’t need to spend extra money on the Enterprise edition of either product if your applications don’t otherwise require it. The SANLess cluster software works with WSFC services, so you can take advantage of the automation in Windows to failover from one cluster node to another quickly and cleanly. And, when the primary server and its storage system comes back online, the SANLess cluster management tools automatically reverse the mirror and replicate all the updated database transactions from the backup server to the primary server, making it easy to fail back over to the primary SQL Server environment (assuming you want to do so) whenever you are confident that the primary server environment is stable.
The author
Dave Bermingham is the Senior Technical Evangelist at SIOS Technology. He is recognized within the technology community as a high-availability expert and has been honored to be elected a Microsoft MVP for the past ten years: six years as a Cluster MVP and four years as a Cloud and Datacenter Management MVP. Dave holds numerous technical certifications and has more than thirty years of IT experience, including in finance, healthcare, and education.