A letter from Visa to the UK Treasury Select Committee, documenting details behind the recent outage which left millions of people unable to complete card transactions, reinforces a critical challenge that organizations face when exposed to a ‘partial failure’ of IT infrastructure. This is according to Peter Groucutt, managing director of Databarracks.
Visa revealed that a ‘rare defect’ in a switch caused a partial failure in its primary UK data centre. The issue delayed its secondary data centre from assuming responsibility for handling all of its card transactions and was the root cause of millions of failed card transactions over a 10-hour period on Friday 1st June 2018.
In the wake of the outage, the Treasury Select Committee contacted the payments firm, seeking clarification of the cause of the outage and assurances as to what action Visa is taking to prevent a repeat. Amongst the findings, a number of lessons can be learned, says Groucutt:
“Businesses are often better prepared for a complete outage than for ‘partial failures’. When a system fails completely, the process to failover is more clearly defined. Partial failures, however, make that changeover decision difficult. Once the problem has been identified, you have to decide either to fully switch to the secondary system or to fix the problem on the primary. Defining the point at which to failover is specific to each organization and the issue you are dealing with.
“A switch issue, for instance, will require a different response to a natural disaster. An organization with good incident and crisis management will have these processes in place – decisions will already have been made and documented, so in the event of an incident, the business knows exactly what to do.
“In practice, a business might decide that it can’t tolerate an outage of longer than four hours. If it takes two hours to become fully operational at a second site, that leaves a window of just two hours to fix the issue before committing to failover.
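The arithmetic in that example can be sketched as a simple calculation: the time available to attempt a fix on the primary is the outage tolerance minus the time needed to fail over. This is only an illustration of Groucutt's figures; the function name is hypothetical.

```python
def fix_window(max_outage_hours: float, failover_hours: float) -> float:
    """Hours available to attempt a fix on the primary site before
    failover must begin, to stay within the outage tolerance."""
    return max_outage_hours - failover_hours

# Groucutt's example: a four-hour maximum tolerable outage and a
# two-hour failover leave a two-hour window to fix the primary.
window = fix_window(max_outage_hours=4, failover_hours=2)
print(window)  # 2
```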
“We would expect Visa to have a very mature incident management process in place, and based on the reports, that was absolutely the case. Partial failures can be very difficult to plan for and manage, but the issue was identified and response protocols were initiated.
“The lesson Visa can take from the incident is that it wasn’t prepared for this particular partial failure and should address this by building new processes to allow the backup switch to take over. We can all do the same.
“It is a good idea to include issues like this in your testing. It’s not just switches – we’ve seen exactly this issue with UPS systems and generators too. An organization will have a testing schedule for each of these technologies, so it’s important to include the impact of partial failures in these tests. A business should think about how quickly it can identify what the issue is and, importantly, the actions which then need to be taken: either fix the problem and recover or, alternatively, manually take the system offline and failover to a secondary site.”
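The fix-or-failover decision described in the quotes above can be sketched as simple runbook logic. This is a minimal illustration, not anything Visa or Databarracks published; the component states and thresholds are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ComponentStatus:
    name: str
    responding: bool   # component answers at all
    degraded: bool     # responding, but with errors or latency

def decide_action(status: ComponentStatus,
                  fix_window_hours: float,
                  estimated_fix_hours: float) -> str:
    """Choose between fixing the primary in place and failing over.

    A complete failure follows the clearly defined failover path;
    a partial failure fails over only when the estimated fix time
    exceeds the window left before failover must begin.
    """
    if not status.responding:
        return "fail over to secondary"          # complete outage
    if status.degraded:
        if estimated_fix_hours <= fix_window_hours:
            return "fix primary in place"        # partial failure, fixable in time
        return "take primary offline and fail over"
    return "no action"                           # healthy

# A degraded switch whose fix is estimated at 3 hours, with only a
# 2-hour window before failover must start:
switch = ComponentStatus("primary-switch", responding=True, degraded=True)
print(decide_action(switch, fix_window_hours=2, estimated_fix_hours=3))
# take primary offline and fail over
```

The point of writing the decision down like this, as Groucutt argues, is that it is made and documented before the incident, not during it.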