Downtime has been a bugbear for organizations ever since IT systems first came into use, but according to a University of Chicago study, the causes of downtime are still often unknown. Doron Pinhas asks why this is still the case and looks at improvements that can be made.
Benjamin Franklin said: “Three may keep a secret, if two of them are dead.” Yet there are apparently some secrets that time hasn't revealed. One thing we haven't figured out is why service outages occur. A study by the University of Chicago, which lists the usual reasons one would expect to see for outages, concludes that the biggest reason for outages is ‘unknown’.
That's especially telling, since IT activities, one would presume, are recorded in log files and lend themselves to analysis. The study examined hundreds of service outages at companies that provide services via the cloud. And while the ‘usual suspects’ (bugs, upgrades, network or power issues, etc.) figured significantly, ‘unknown’ was still the single largest reason reported by IT administrators for outages. In other words, even after a thorough, even forensic, investigation, they couldn't figure out what caused the outage.
That's unacceptable, because outages can cost companies millions, and without knowledge of what caused an outage, there is no way to know how to prevent the next one.
As we all know, downtime is costly, but it's not just about the money. Outages can mean a loss of reputation if the service is consumer facing (banking, finance, insurance, for example). Customers expect 24/7 availability in many cases, and when they don't get it, they tend to vent, often complaining on social media. For companies where the affected service is internal, the cost could be high in reduced motivation, greater frustration, and lower productivity from employees. And if the service is offered to a restricted professional community – which is likely to be paying for access to it – it's a given that when news of the outage gets out, the service's competitors will be lining up to convince customers to go with their ‘more reliable’ offering.
So, there is a great deal at stake for companies in preventing outages – and one can be sure they are throwing all possible resources into making sure outages do not happen. Yet ‘unknown’ remains the biggest reason for outages – meaning that, for all their efforts, companies are not getting to the bottom of the problem, nor are they coming up with ways to guarantee that outages do not happen again.
Why is it so difficult to track down the cause of outages? Because in a typical, complex, modern IT infrastructure, new features are introduced all the time, and must be configured correctly across all IT layers (compute, storage, networking, orchestration) in order to achieve resilience. This task, prodigious to begin with, is further complicated by the almost daily stream of updates and changes made by in-house and third-party IT teams, some of which can affect the system in unpredicted ways or lead to misconfigurations. Due to the online nature of most modern applications, it isn't possible to pause for thorough testing after each change – so it's no wonder the risk remains an unknown quantity. As a result, maintaining the highest levels of IT resilience is becoming increasingly hard. In one sense, it's a disaster waiting to happen.
What can be done? It's obvious that IT personnel can't solve this by manual effort alone. There are just too many things to look for, too many potential points of failure. If an IT team were to try to figure out the reason for an outage on their own, it could potentially take months – and with the costs of outages so high and ‘three 9s’ uptime (99.9%) or more demanded today, that obviously won't fly.
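To put ‘three 9s’ in concrete terms, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and it assumes a 365-day year and that availability is measured over the whole year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (3 means 99.9%)."""
    unavailability = 10 ** -nines  # e.g. 0.001 for three nines
    return MINUTES_PER_YEAR * unavailability


for n in (2, 3, 4, 5):
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines: {minutes:.1f} minutes/year ({minutes / 60:.2f} hours)")
```

At three nines, the whole year's budget comes to under nine hours, which is why an investigation that drags on for months offers little protection against the next incident.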
Instead, IT teams need to implement the right processes and introduce quality assurance automation, using tools that can do the detective work for them. Fortunately, there are excellent IT resilience assurance tools available today that can automatically parse through systems, determine where a point of failure has been introduced, and help ensure ‘three 9s’ service. Key capabilities to look for in these tools include support for a wide set of IT layers and technology stacks; the ability to connect to existing ITSM and CMDB tools and to provide business awareness; and built-in libraries of industry best practices and knowledge.
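As a rough idea of what such an automated check might look like at its very simplest, here is a toy Python sketch. It does not reflect how any particular product works: the `Component` record, the inventory entries, and the single "fewer than two instances" rule are all hypothetical, chosen only to show a machine scanning every layer for a point of failure.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str      # hypothetical component name
    layer: str     # "compute", "storage", "networking" or "orchestration"
    replicas: int  # number of independent instances backing the component


def find_single_points_of_failure(inventory):
    """Return every component that has fewer than two independent instances."""
    return [c for c in inventory if c.replicas < 2]


# Made-up inventory; in practice this would come from a CMDB or discovery scan.
inventory = [
    Component("web-frontend", "compute", 3),
    Component("orders-db-volume", "storage", 1),
    Component("core-switch", "networking", 2),
    Component("scheduler", "orchestration", 1),
]

for c in find_single_points_of_failure(inventory):
    print(f"{c.layer}: {c.name} has no redundancy")
```

A real tool would apply a large library of such rules, but the principle is the same: the check runs after every change, rather than after an outage.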
When IT downtime strikes, there's a tendency to over-analyze the incident – with teams poring through log files and conducting a forensic investigation into what happened and how to make sure it doesn't happen again. It's an understandable, perhaps even natural, response – but as we've seen, there's little to be gained from this kind of retrospection. Instead, the lesson IT teams should take from an outage is that they need to adopt a proactive approach in order to prevent another one – ensuring that all bases are covered, so that any potential outage is nipped in the bud before it happens.
It's not just about adopting the appropriate tools; it's about adopting an appropriate attitude – moving from trying to figure out the unknown and groping about in the dark for answers, to adopting a system that can measure and predict the variables that cause outages. It's a journey from the unknown to the known – and with that knowledge, organizations can ensure that they don't become victims of outages.
Doron Pinhas is CTO of Continuity Software.