On 12th December 2014, NATS, the UK's leading provider of air traffic control services, experienced a failure in its Swanwick flight data system. The outage resulted in widespread flight delays and cancellations. A report has now been published which details the events behind the outage and the subsequent business continuity response.
Written by an enquiry panel led by Sir Robert Walmsley, the report finds that:
- The failure occurred on 12th December because of a latent software fault that had been present since the 1990s. The fault lay in the software’s performance of a check on the maximum permitted number of Controller and Supervisor roles.
- The system error was triggered by a number of new Controller roles that had been added to the system the day before.
- The standard practice in NATS is that engineering recovery is coordinated through a group of designated engineers, known as the Engineering Technical Incident Cell (ETIC), drawn from those available in the Systems Control Centre adjacent to the Operations Room. While some recovery actions are automated, ETIC manually control all key recovery actions, e.g. the restoration of data, to ensure that decisions are made with due and careful deliberation; this is important, as the wrong decisions could have further degraded performance.
- Identifying a software fault in such a large system (the total application exceeds 2 million lines of code), within only a few hours, is a surprising and impressive achievement. This was made possible because system logs contain details of the interactions at the workstations.
The detailed 93-page report is available here as a PDF and should be of interest to business continuity managers whatever their sector. It shows how legacy systems can have unanticipated impacts, and gives useful details about the business continuity plans and strategies that were in place at the time of the incident.
The report makes clear that although this was a high-profile incident which caused difficulties for NATS' direct customers and the supply chain, it was undoubtedly a business continuity success. Without a strong recovery team response and the pre-planned procedures that were in place, the incident and disruption would have been much worse.