Abstract
Shareable backup is an economical and effective way to mask failures from application performance. A small number of backup switches are shared network-wide for repairing failures on demand so that the network quickly recovers to its full capacity without applications noticing the failures. This approach avoids complications and ineffectiveness of rerouting. We propose ShareBackup as a prototype architecture to realize this concept and present the detailed design. We implement ShareBackup on a hardware testbed. Its failure recovery takes merely 0.73ms, causing no disruption to routing; and it accelerates Spark and Tez jobs by up to 4.1× under failures. Large-scale simulations with real data center traffic and failure model show that ShareBackup reduces the percentage of job flows prolonged by failures from 47.2% to as little as 0.78%. In all our experiments, the results for ShareBackup have little difference from the no-failure case.
Wu, D., Huang, X. S., Xia, Y., Dzinamarira, S., Sun, X. S., & Ng, T. S. E. (2018). Masking failures from application performance in data center networks with shareable backup. In SIGCOMM 2018 - Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (pp. 176–190). Association for Computing Machinery. https://doi.org/10.1145/3230543.3230577