1 November 2019 – A team, consisting of Computer Science PhD student Raj Joshi, Associate Professor Chan Mun Choon and Associate Professor Ben Leong, won the Best Paper Award at the 27th IEEE International Conference on Network Protocols (ICNP) 2019.
ICNP is the premier conference for network protocol research. The conference was held from 7 to 10 October in Chicago, USA. Of the 212 papers submitted, 41 were accepted for the conference. The team won the award for their innovative work on reducing the impact of link failures in data centre networks. Raj, A/P Chan and A/P Leong co-authored the paper with National University of Defense Technology (NUDT) PhD student and NUS Computing visiting student Qu Ting, as well as NUDT Professors Guo Deke and Zhong Liu.
“Internet services such as web searches and e-commerce sites rely heavily on data centre computing,” said Raj, on behalf of the team. “When data centre networks fail to meet the response deadline, they can adversely impact the user experience and affect business profits.”
One common type of network failure is link failure, which occurs when packets of data travelling across a network fail to reach their destination because a network link has failed. This results in network disruption, slow network service and even a loss of network connectivity. Over the course of their research, the team found that while there are link failure management techniques designed to solve this problem, they struggle to completely eliminate packet loss. As a result, network connections take much longer to respond, which affects the user experience.
Instead of finding ways to completely eliminate packet loss during link failures, the team concluded that packet loss is inevitable and chose to find ways to mask its effects. To this end, the team developed Shared Queue Ring (SQR) – an on-switch mechanism that performs in-network packet loss recovery. The mechanism completely masks packet loss from its end hosts by diverting the affected flows to alternative paths seamlessly. “We found that if we can do packet loss recovery within the network itself, it is possible to completely mask the effects of packet loss and its long recovery time,” explained Raj.
According to the team, the key idea of SQR is to continuously cache a small number of recently transmitted packets on each switch and to retransmit the cached packets onto an alternative path when a link failure occurs. “Data centres have been using the multiple paths idea for several years now but no one has developed a mechanism that seamlessly switches a failed packet path to an alternative path,” Raj added.
“We implemented our mechanism on a programmable switch ASIC and evaluations from the test showed that SQR was able to completely mask link failures and reduce the long recovery period,” said Raj.
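The caching-and-retransmission idea described above can be illustrated with a toy model. The Python sketch below is only an illustration of the general principle, not the team's implementation, which runs on a programmable switch ASIC. The class name `ToySwitch`, the `cache_size` parameter and the two log lists are hypothetical names introduced for this example.

```python
from collections import deque


class ToySwitch:
    """Toy model of a switch that caches recently forwarded packets.

    Hypothetical illustration only; not the SQR implementation.
    """

    def __init__(self, cache_size=4):
        # Bounded cache: once full, the oldest packet is dropped automatically.
        self.cache = deque(maxlen=cache_size)
        self.primary_up = True
        self.primary_log = []  # packets placed on the primary link
        self.backup_log = []   # packets placed on the alternative path

    def forward(self, packet):
        """Cache the packet, then send it on whichever path is in use."""
        self.cache.append(packet)
        if self.primary_up:
            self.primary_log.append(packet)
        else:
            self.backup_log.append(packet)

    def on_link_failure(self):
        """Switch to the alternative path and resend the cached packets."""
        self.primary_up = False
        # Recently sent packets may have been lost on the failed link,
        # so they are retransmitted in their original order.
        self.backup_log.extend(self.cache)
        self.cache.clear()


switch = ToySwitch(cache_size=3)
for packet_id in range(5):
    switch.forward(packet_id)   # packets 0..4 go out on the primary link
switch.on_link_failure()        # cached packets 2, 3, 4 are resent
switch.forward(5)               # new traffic now uses the alternative path
print(switch.backup_log)
```

In this simplified model some retransmitted packets may already have reached their destination, so a receiver could see duplicates; the sketch makes no attempt to handle that, nor to model how a real switch detects the failure.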
“It was a mere serendipity how this project came along,” said A/P Ben Leong. “Ting was a visiting student at NUS and had previously worked on network updates. Raj is my PhD student and he has been working on programmable switch ASICs. This project would not have been possible without their combined experience in these two fields.”