Go back to basics to avoid network outages
- Published: Friday, 03 April 2015 09:37
By Joel Dolisy.
Last year, some of the largest and most well-known brands across the globe, including Google, Facebook and Twitter, experienced interruptions to their services due to network outages. Whether these organizations experienced downtime due to internal network errors or full-blown [Distributed] Denial of Service ([D]DoS) attacks, the costs to their reputations and, in some cases, their revenues, proved significant.
Which is the greater threat?
While media reports tend to hype up the presence of hackers, the reality is that most outages originate inside an organization’s own network. A recent Gartner study projected that by 2015, 80 percent of outages impacting mission-critical services will be caused by people and process issues, and more than 50 percent of those outages will be caused by change/configuration/release integration and hand-off issues. In fact, both Xbox LIVE and Facebook recently suffered network outages from configuration errors during routine maintenance, and while the state of China blamed its outage on hackers, some independent watchers believe it was actually due to an internal configuration error in the firewall.
Indeed, one of the leading sources of network outages is human intervention, in the form of configuration errors injected during routine maintenance. In other words, good old human error.
That’s not to say that organizations shouldn’t prepare for external threats.
Stay safe, go back to basics
There are high-tech ways of mitigating risks and keeping networks up that cost a great deal of money, and there are also low-tech, low-budget ways of reducing network outages, even if outages cannot be completely eliminated. The latter include:
1) Checks and balances. Common sense dictates that system changes should be reviewed by another pair of eyes, but not all organizations do this. The best practice of code reviews in software development has proven to increase code quality and significantly reduce the number of errors injected; operations teams should adopt the same practice.
2) Monitor, monitor, monitor. Ensure systems are monitored properly before any changes are made so that a good baseline is available, making errors more easily detectable. Alerts should be properly configured so that IT teams can respond quickly if the health, availability or performance of a system is impacted negatively following a change. The alerts should also be reviewed regularly to ensure they reflect the SLAs and other requirements dictated by business needs.
3) Have a back-up plan. Make sure a solid fall-back mechanism is in place so that the network can revert to its last known good configuration once a problem is detected.
4) Keep things simple. An error that is part of a series of changes affecting multiple parts of the IT infrastructure can make it difficult to isolate and remediate problems. Break down massive changes into smaller, more manageable chunks that can be reverted atomically.
5) Build in room for error. It’s surprising how often IT teams go full steam ahead in rolling out changes without thinking about how they will revert to the previous state should errors occur. These teams should assume errors will happen, and create an action plan in advance for addressing those errors when they do.
6) Communication, the old-fashioned way. Any application or system owners impacted by changes should be notified before the changes occur, including the scope of the change and the timeframe. That will prompt the owners to be vigilant for abnormal application or system behaviour.
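As a rough illustration of points 3 to 5 above, the following Python sketch applies a series of small changes one at a time, takes a snapshot before each, and reverts only the change that fails a health check. Everything in it is hypothetical: the configuration is a plain dictionary, and apply_changes, is_healthy and the example changes are illustrative names, not part of any real network management product. A real rollout would push changes to devices and query a monitoring system instead.

```python
import copy
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]
# A change is a description plus a function that mutates the config in place.
Change = Tuple[str, Callable[[Config], None]]


def apply_changes(config: Config,
                  changes: List[Change],
                  is_healthy: Callable[[Config], bool]) -> Tuple[Config, List[str]]:
    """Apply changes one by one, reverting the first one that fails the check.

    Returns the resulting configuration and a log of what happened.
    The caller's original configuration is never modified.
    """
    log: List[str] = []
    current = copy.deepcopy(config)
    for description, change in changes:
        snapshot = copy.deepcopy(current)  # fall-back point taken before the change
        change(current)
        if is_healthy(current):
            log.append(f"applied: {description}")
        else:
            current = snapshot  # revert only this small change
            log.append(f"reverted: {description}")
            break  # stop the rollout so the failure can be investigated
    return current, log


# Hypothetical example changes and health check (assumptions for illustration).
def set_mtu(cfg: Config) -> None:
    cfg["mtu"] = "9000"


def drop_default_route(cfg: Config) -> None:
    del cfg["default_route"]  # the deliberate mistake in this example


def set_dns(cfg: Config) -> None:
    cfg["dns"] = "192.0.2.53"


def is_healthy(cfg: Config) -> bool:
    return "default_route" in cfg


if __name__ == "__main__":
    base = {"default_route": "192.0.2.1", "mtu": "1500"}
    plan = [("raise MTU", set_mtu),
            ("remove default route", drop_default_route),
            ("set DNS", set_dns)]
    result, log = apply_changes(base, plan, is_healthy)
    print(result)
    print(log)
```

Because each change is small and has its own snapshot, the faulty step is undone without discarding the earlier, successful one, and the remaining steps are held back until someone has looked at the failure.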
Beware of additional threats
DoS attacks can originate internally through a Trojan horse or virus impacting one or more internal systems. Externally, [D]DoS attacks originate from multiple systems on the Internet acting in an orchestrated manner to bring down publicly facing systems. Mitigating these threats will require more sophistication, but nevertheless, following tried-and-true best practices will still be key to protecting networks:
1) Strengthen your shield. The first level of defence is ensuring firewalls are configured properly and systems are patched with the latest security updates. Will these steps prevent a successful attack? No, but they are basics that many organizations ignore, leaving themselves vulnerable.
2) Keep vigilant. Appropriately monitor the firewalls and key systems in your network to detect abnormal events that usually accompany [D]DoS attacks, including high connection counts and high CPU and bandwidth utilisation. Different monitoring systems provide different ways to define what ‘normal’ means, ranging from setting up completely manual thresholds to learning from past data to identify the normal range of operation. No matter the system, it is important for the IT team to understand the different thresholds being used by the systems and how those evolve over time. These systems should be capable of alerting IT staff to abnormal network behaviours and events.
3) Use appropriate technology. It can be difficult to figure out which data stream(s) to monitor in order to determine a baseline for normal behaviour. Leveraging deep packet inspection or flow-based technology to monitor network behaviour provides a live picture of network traffic, minimising the time it takes to detect abnormal behaviour.
4) Assign responsibility. Ownership empowers and confers accountability. It is extremely important to designate someone in the IT organization to be responsible for the security of the company. That individual should be involved in security assessments and analyses, and be consulted any time there is suspicion of a security-related attack. This person should also be responsible for staying abreast of the security threats that could affect the business, and for briefing and educating the rest of the organization. This strategy does not relieve the IT team at large of its security responsibilities, but merely puts someone in charge of the effort.
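To make the idea of learning what ‘normal’ means from past data (the "Keep vigilant" step above) a little more concrete, here is a minimal Python sketch. It assumes one simple approach among many: treating anything more than k standard deviations above the historical mean as abnormal. The function names, the value of k and the sample connection counts are all invented for illustration, and commercial monitoring tools may well use more sophisticated methods.

```python
from statistics import mean, stdev
from typing import List, Sequence


def learn_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Return an upper alert threshold of mean + k sample standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    return mean(history) + k * stdev(history)


def find_anomalies(samples: Sequence[float], threshold: float) -> List[int]:
    """Return the indexes of samples that exceed the threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


if __name__ == "__main__":
    # Hypothetical connection counts per minute during normal operation.
    history = [100, 110, 90, 105, 95]
    threshold = learn_threshold(history)

    # Hypothetical live readings; the third one is a sudden spike.
    live = [98, 120, 450, 101]
    for index in find_anomalies(live, threshold):
        print(f"ALERT: sample {index} = {live[index]} exceeds {threshold:.1f}")
```

Whatever method is used, the learned threshold is only as good as the history behind it, which is why the baseline should be recalculated as traffic patterns change and reviewed by the team that relies on the alerts.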
Many companies tend to overlook the security risks associated with various attacks on their infrastructure and are therefore underprepared to face threats to their network, whether malicious or benign. While the basic steps described above aren’t a silver bullet, they can go a long way towards shoring up your network against unplanned outages.
Joel Dolisy is SVP, CTO and CIO at SolarWinds.