Today, root cause analysis of failures in data centers is mostly done through manual inspection. More often than not, customers blame the network as the culprit. However, other components of the system might have caused these failures. To troubleshoot, huge volumes of data are collected over the entire data center. Correlating such large volumes of diverse data collected from different vantage points is a daunting task even for the most skilled technicians. In this paper, we revisit the question: how much can you infer about a failure in the data center using TCP statistics collected at one of the endpoints? Using an agent that captures TCP statistics we devised a classification algorithm that identifies the root cause of failure using this information at a single endpoint. Using insights derived from this classification algorithm we identify dominant TCP metrics that indicate where/why problems occur in the network. We validate and test these methods using data that we collect over a period of six months in the Azure production cloud.
CITATION STYLE
Arzani, B., Ciraci, S., Loo, B. T., Schuster, A., & Outhred, G. (2016). Taking the blame game out of data centers operations with NetPoirot. In SIGCOMM 2016 - Proceedings of the 2016 ACM Conference on Special Interest Group on Data Communication (pp. 440–453). Association for Computing Machinery, Inc. https://doi.org/10.1145/2934872.2934884
Mendeley helps you to discover research relevant for your work.