Taking the blame game out of data centers operations with NetPoirot

Abstract

Today, root cause analysis of failures in data centers is mostly done through manual inspection. More often than not, customers blame the network. However, other components of the system might have caused these failures. To troubleshoot, huge volumes of data are collected over the entire data center. Correlating such large volumes of diverse data collected from different vantage points is a daunting task even for the most skilled technicians. In this paper, we revisit the question: how much can you infer about a failure in the data center using TCP statistics collected at one of the endpoints? Using an agent that captures TCP statistics, we devised a classification algorithm that identifies the root cause of failure using this information at a single endpoint. Using insights derived from this classification algorithm, we identify dominant TCP metrics that indicate where and why problems occur in the network. We validate and test these methods using data that we collect over a period of six months in the Azure production cloud.
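The abstract does not describe the classification algorithm itself, so the following is only an illustrative sketch of the general idea: labeling a failure from TCP statistics observed at a single endpoint. It swaps in a plain nearest-centroid classifier, which is not claimed to be the authors' method. The feature names, the class labels, and the toy training values are all hypothetical and are assumed to be pre-normalized to the range [0, 1].

```python
from collections import defaultdict
from math import sqrt

# Hypothetical feature order; the abstract does not list the metrics used.
FEATURES = ("retransmit_rate", "mean_rtt", "zero_window_rate")

# Toy, made-up training vectors (normalized to [0, 1]) with hypothetical labels.
TRAINING = [
    ((0.9, 0.8, 0.1), "network"),
    ((0.8, 0.9, 0.0), "network"),
    ((0.1, 0.2, 0.9), "client"),
    ((0.0, 0.1, 0.8), "client"),
    ((0.1, 0.7, 0.1), "server"),
    ((0.2, 0.8, 0.2), "server"),
]


def fit_centroids(samples):
    """Return a dict mapping each label to the mean of its feature vectors."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(centroids, vec):
    """Return the label whose centroid is closest (Euclidean) to vec."""
    def dist(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centroids, key=lambda label: dist(centroids[label], vec))


if __name__ == "__main__":
    centroids = fit_centroids(TRAINING)
    # One new observation from a single endpoint (again, made-up values).
    print(classify(centroids, (0.85, 0.9, 0.05)))
```

A real system would need far richer features and a learned model validated on production traces; the sketch only shows that per-endpoint TCP statistics can be treated as a feature vector and mapped to a failure category.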

Cite

Arzani, B., Ciraci, S., Loo, B. T., Schuster, A., & Outhred, G. (2016). Taking the blame game out of data centers operations with NetPoirot. In SIGCOMM 2016 - Proceedings of the 2016 ACM Conference on Special Interest Group on Data Communication (pp. 440–453). Association for Computing Machinery, Inc. https://doi.org/10.1145/2934872.2934884
