Automating failure testing research at internet scale

Peter Alvaro; Kolton Andrus; Chris Sanden; Casey Rosenthal; Ali Basiri; Lorin Hochstein

Conference Proceedings

Proceedings of the 7th ACM Symposium on Cloud Computing, SoCC 2016 (2016) 17-28

DOI: 10.1145/2987550.2987555

31 Citations

58 Readers

Abstract

Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. In order to build confidence that fault-tolerant systems are correctly implemented, Netflix (and similar enterprises) regularly run failure drills in which faults are deliberately injected in their production system. The combinatorial space of failure scenarios is too large to explore exhaustively. Existing failure testing approaches either explore the space of potential failures randomly or exploit the "hunches" of domain experts to guide the search. Random strategies waste resources testing "uninteresting" faults, while programmer-guided approaches are only as good as human intuition and only scale with human effort. In this paper, we describe how we adapted and implemented a research prototype called lineage-driven fault injection (LDFI) to automate failure testing at Netflix. Along the way, we describe the challenges that arose adapting the LDFI model to the complex and dynamic realities of the Netflix architecture. We show how we implemented the adapted algorithm as a service atop the existing tracing and fault injection infrastructure, and present early results.

Cite (APA)

Alvaro, P., Andrus, K., Sanden, C., Rosenthal, C., Basiri, A., & Hochstein, L. (2016). Automating failure testing research at internet scale. In Proceedings of the 7th ACM Symposium on Cloud Computing, SoCC 2016 (pp. 17–28). Association for Computing Machinery, Inc. https://doi.org/10.1145/2987550.2987555
