Rice University lab develops a new circuit-switching technology to help maintain availability after a switch failure
- Published: Friday, 17 August 2018 08:42
ShareBackup, a strategy developed by researchers at a Rice University lab, allows shared backup switches in data centers to take on network traffic within a fraction of a second after a software or hardware switch failure.
Rice computer scientist Eugene Ng, who led a team working on ShareBackup, said that the idea would solve a common annoyance among data professionals, scientists and everyone else who relies on a network to deliver results day in and day out.
"A data network consists of servers and network switches," said Ng, a professor of computer science and electrical and computer engineering. "Switches move data packets to where they need to go. But things fail, especially in large-scale data centers with thousands of pieces of hardware."
The usual response to a failed switch is to shunt the flow of data to another line. "Generally, the network has multiple paths for connecting servers so, just like if there's a closure on the highway, we'd drive around it. This is a conventional, natural approach that makes a lot of sense: You reroute around the failure to get where you need to go."
But sometimes that other road is congested and everything slows down. "Data centers aren't the Internet; they're not about people surfing websites," Ng said. "They're about supporting data-intensive applications like data mining or machine learning. And a lot of these applications have stringent performance deadlines, so blindly rerouting traffic could be the wrong thing to do in a data center."
Rather than the expensive option of installing redundant switches throughout a network, the ShareBackup strategy would put fast switches and software in strategic locations that could pick up the traffic from a failed switch in under a millisecond. Once the failure is resolved, the software frees the backup switch to handle another failure.
The switchover is fast enough that most users would never know that part of the system had failed: the failure-recovery time is 0.73 milliseconds, including latency from hardware and control systems.
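To make the sharing idea concrete, here is a minimal sketch of the bookkeeping described above: a small pool of backup switches serves a larger group of switches, a failure claims an idle backup, and a repair returns it to the pool. This is an illustration only, not the researchers' implementation. The `BackupPool` class, its method names and the switch names are all hypothetical, and the sketch models none of the circuit-switching hardware or its timing.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BackupPool:
    """Hypothetical pool of backup switches shared by a group of switches."""

    free: list[str]  # idle backup switches
    in_use: dict[str, str] = field(default_factory=dict)  # failed switch -> backup

    def fail_over(self, failed_switch: str) -> str | None:
        """Assign an idle backup to carry the failed switch's traffic."""
        if failed_switch in self.in_use:
            return self.in_use[failed_switch]  # already covered by a backup
        if not self.free:
            return None  # pool exhausted; what happens next is outside this sketch
        backup = self.free.pop()
        self.in_use[failed_switch] = backup
        return backup

    def repair(self, failed_switch: str) -> None:
        """Return the backup to the pool once the original switch works again."""
        backup = self.in_use.pop(failed_switch, None)
        if backup is not None:
            self.free.append(backup)


# One backup shared by several switches (all names are made up).
pool = BackupPool(free=["backup-1"])
first = pool.fail_over("edge-7")   # the only backup takes over for edge-7
second = pool.fail_over("edge-9")  # no backup left while edge-7 is still down
pool.repair("edge-7")              # edge-7 is fixed, so the backup is freed
third = pool.fail_over("edge-9")   # the same backup now covers edge-9
```

The sketch shows why sharing is cheaper than full redundancy: one backup can cover many switches, as long as failures within the group rarely overlap.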
Ng says that ShareBackup could save data centers time and money not only by maintaining full bandwidth but also by helping to analyze problems, including misconfigurations that commonly lead to network failure.
For more details, read ‘Masking failures from application performance in data center networks with shareable backup’.