Achieving high availability for SAP HANA

Published: Wednesday, 31 August 2022 08:48

Organizations across the globe rely on SAP ERP systems to maintain their essential applications. As deadlines draw nearer for moving to SAP’s HANA database, IT teams need to consider the potential complexity of implementing high availability and disaster recovery for these systems, says Ian Allton.

SAP has been looking to transform its customers’ ERP systems to the new cloud-based HANA environment for over a decade. Though customers have succeeded in getting the support deadline for SAP’s legacy on-premises software extended to 2027 - an impressive feat, considering that SAP set the first deadline for 2017 - most industry experts recognize that the writing is on the wall. In five years, the businesses still leaning on SAP ERP to run their most critical systems will need to transition to HANA or pay a premium for support.

Therefore, it is vital that those currently leaning on SAP ERP consider how they will achieve the same high availability with HANA - ideally before they’re faced with unplanned downtime. This planning must begin as soon as possible, as making the switch is not necessarily a simple project. Achieving the top-tier ‘four nines’ standard of high availability (99.99 percent uptime) under the rather memory-intensive HANA environment comes with many challenges.

Fortunately, enterprises can overcome these with well-designed architecture, the right technical expertise, and careful planning.

How to avoid split-brain issues

When discussing high availability for HANA, we typically talk about HANA running on a two-node cluster. In a cloud, both nodes are typically configured in a ‘SANless’ cluster using SAP HANA System Replication. That is, each node has its own local storage. SAP replication software is used to synchronize storage among all cluster nodes, so if the primary node fails, the secondary can immediately step in and take over, accessing an identical copy of the primary node’s storage data.
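The replication scheme described above can be sketched in a few lines. This is a deliberately simplified toy illustration of synchronous replication - the `Node` and `Primary` classes and the key/value data are hypothetical, not SAP's implementation - showing why the secondary always holds an identical copy at failover time:

```python
# Toy sketch of synchronous replication (hypothetical classes, not SAP's
# implementation): the primary acknowledges a write only after the
# secondary has also applied it, so a failover always finds identical data.

class Node:
    def __init__(self):
        self.storage = {}          # each node has its own local storage


class Primary(Node):
    def __init__(self, secondary):
        super().__init__()
        self.secondary = secondary

    def write(self, key, value):
        self.storage[key] = value
        self.secondary.storage[key] = value   # replicate before acknowledging
        return "ack"


secondary = Node()
primary = Primary(secondary)
primary.write("order-42", "paid")
assert secondary.storage == primary.storage   # failover sees identical data
```

The essential design point is that the acknowledgement happens only after replication completes; that ordering is what lets the secondary step in without data loss.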

This switch happens so quickly that the client shouldn’t notice, making it a proven setup for maintaining uptime. However, mirroring nodes like this has a unique challenge - namely, avoiding a ‘split-brain’.

Ordinarily, when a primary node crashes, the secondary will quickly assume ownership of the data. This ensures that at any given time, there is only one copy of the data being changed. However, if the network connection between the nodes fails, both nodes can try to claim ownership. This can lead to a situation where two different nodes are both changing data without being aware of the other – putting the data at risk of corruption.

The question is, which node owns the data?

A common way to avoid this split-brain scenario is to utilize a ‘witness node’. In this configuration, a small third node is used to monitor the status of both cluster nodes and designate which is the ‘primary’ node, preventing any confusion and risk of corrupted data.  

Another, similar method is to use a ‘file share’ witness. This method uses a file share as the third entity that designates the ‘primary’ node.
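The majority-vote idea behind both witness approaches can be sketched as follows. The member names and the quorum rule here are hypothetical simplifications - real clustering software adds heartbeats, fencing, and timeouts - but the core logic is that a node may hold the primary role only if it can reach a strict majority of the three members:

```python
# Minimal sketch of witness-based arbitration (hypothetical names and
# logic): with three members, at most one side of a network partition
# can ever reach a majority, so split-brain is ruled out.

def has_quorum(reachable_members, all_members):
    """A node may hold the primary role only if it can reach a strict
    majority of the cluster membership (itself included)."""
    return len(reachable_members & all_members) > len(all_members) / 2


# Hypothetical three-member cluster: two HANA nodes plus a witness.
MEMBERS = {"node-a", "node-b", "witness"}

# Network partition: node-a is isolated; node-b can still see the witness.
assert not has_quorum({"node-a"}, MEMBERS)          # node-a must step down
assert has_quorum({"node-b", "witness"}, MEMBERS)   # node-b may take over
```

Because the cluster has an odd number of members, the two sides of a partition can never both hold a majority, so only one node can ever claim ownership of the data.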

Breadcrumbs

Another key to ensuring the maximum possible uptime comes from accepting that servers are physical machines and will eventually fail. A certain number of failures and crashes are inevitable. However, this does not mean that we should ignore them. Instead, we must ensure that we learn as much from them as possible.

For this reason, any discussion about setting up an SAP HANA environment should include a suitable logging mechanism - a breadcrumb trail that you can follow back to the point of failure. This allows the support team to determine what failed, when the failures started happening, and what might have triggered them.

Of course, many engineers recognize that this is as much a question of skills as architecture. Reading and understanding log messages is something of an art form. However, once you can piece together the origins of any crashes, you will be able to work to prevent them from repeating.

One of the best tools to achieve this is root cause analysis (RCA). This process tries to trace things back to the root cause of a crash, such as a change in a user’s script or an incorrectly set up hierarchy, and can assist significantly in identifying issues.
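As a toy example of following the breadcrumb trail, the sketch below scans a hypothetical log format for the earliest ERROR entry - the starting point for root cause analysis. The log format and messages are invented for illustration; real HANA trace files are far richer:

```python
# Sketch of walking a breadcrumb trail: find the earliest ERROR entry,
# which is where root cause analysis begins. The log format below is a
# hypothetical simplification, not an actual HANA trace format.
import re
from datetime import datetime

LINE = re.compile(r"^(\S+ \S+) (\w+) (.*)$")   # "timestamp LEVEL message"


def first_failure(log_lines):
    """Return (timestamp, message) of the earliest ERROR entry, or None."""
    for line in log_lines:
        m = LINE.match(line)
        if m and m.group(2) == "ERROR":
            return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"), m.group(3)
    return None


log = [
    "2022-08-31 02:13:59 INFO replication in sync",
    "2022-08-31 02:14:05 ERROR indexserver crashed",
    "2022-08-31 02:14:06 ERROR failover initiated",
]
ts, msg = first_failure(log)   # earliest ERROR, not the later symptoms
```

The point is the discipline, not the code: later errors (the failover itself) are symptoms, and the trail must be followed back to the first failure to find the trigger.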

Embrace automation

Orchestrating a successful failover in a high availability cluster is intrinsically complex, as it not only touches network, storage, compute, operating system, and applications/databases but also requires adherence to application-specific requirements for boot order, location of services, and other parameters. Many clustering software solutions, particularly those in Linux environments, require a high degree of manual scripting and tuning.

A key to maintaining high availability in any system is to use clustering software that automates complex configurations and has application-specific intelligence. It is also critically important that you use clustering software that has passed stringent SAP testing and is fully SAP-certified for HANA high availability.

In doing so, you can ensure the reliability of the failover process and guarantee uptime for your HANA database.

The nature of server operations means that a significant portion of failures will happen when you - and the rest of the IT team - are asleep, on vacation, or on a plane to deal with issues at another site. For this reason, you should also consider software that provides automatic failover as soon as you start thinking about making a move to the HANA environment.

Automated processes can run in the background, switching between nodes and generating logs as needed. A well-designed system should keep human intervention to a bare minimum and still run effectively.
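A bare-bones sketch of such a background process is shown below. The health probe and promotion hooks are hypothetical stand-ins for what certified clustering software provides; the sketch shows only the shape of the loop - repeated probes, a breadcrumb log, and an automatic promotion after consecutive failures:

```python
# Sketch of an automated failover loop (hypothetical hooks, not any
# vendor's product): poll the primary, log each missed probe as a
# breadcrumb, and promote the secondary after `checks` consecutive misses.
import time


def monitor(is_primary_healthy, promote_secondary, log, checks=3, interval=0.0):
    misses = 0
    while misses < checks:
        if is_primary_healthy():
            misses = 0                                 # healthy probe resets the count
        else:
            misses += 1
            log.append(f"probe failed ({misses}/{checks})")
        time.sleep(interval)                           # pacing between health checks
    log.append("promoting secondary")
    promote_secondary()


# Demo: the (hypothetical) primary answers twice, then goes silent.
events = []
probes = iter([True, True, False, False, False])
monitor(lambda: next(probes), lambda: events.append("takeover"), events)
```

Requiring several consecutive misses before promoting avoids failing over on a single dropped heartbeat, while the appended log entries leave the breadcrumb trail discussed above for anyone investigating afterwards.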

Preparation is key

The truth of the matter is that achieving high availability is not easy. Server nodes inevitably fail and disasters happen.

Therefore, for any HANA high availability solution, you need to think about how you will deal with failures ahead of time. Too many IT recovery systems work in theory but are never tested in real life.

If businesses wish to maintain high availability in their SAP HANA system when the on-premises support deadline hits in 2027, they need to begin thinking about how and when to move to the new HANA environment as soon as possible.

The author

Ian Allton is Solutions Architect at SIOS Technology Corp. He has more than 20 years of experience in helping enterprise IT teams implement high availability and disaster protection strategies to protect their mission-critical Linux applications from downtime. Before joining SIOS, Ian worked for Hitachi, where he served as Principal Technical Consultant in the Americas File and Content Global Services Practice and as Master Pre-Sales Technical Consultant.