Metastable failures in distributed systems

Abstract

We describe metastable failures, a failure pattern in distributed systems. Currently, metastable failures manifest themselves as black swan events: they are outliers because nothing in the past points to their possibility, they have a severe impact, and they are much easier to explain in hindsight than to predict. Although instances of metastable failures can look different on the surface, deeper analysis shows that they can be understood within the same framework. We introduce a framework for thinking about metastable failures, apply it to examples observed during years of operating distributed systems at scale, and survey ad hoc techniques developed post factum for making systems resilient to known metastable failures. A systematic approach for building systems that are robust against unknown metastable failures remains an open problem.

Cite

Bronson, N., Aghayev, A., Charapko, A., & Zhu, T. (2021). Metastable failures in distributed systems. In HotOS 2021 - Proceedings of the 2021 Workshop on Hot Topics in Operating Systems (pp. 221–227). Association for Computing Machinery, Inc. https://doi.org/10.1145/3458336.3465286
