Recursive restartability: Turning the reboot sledgehammer into a scalpel


Abstract

Even after decades of software engineering research, complex computer systems still fail, primarily due to nondeterministic bugs that are typically resolved by rebooting. Conceding that Heisenbugs will remain a fact of life, we propose a systematic investigation of restarts as "high availability medicine." In this paper we show how recursive restartability (RR) - the ability of a system to gracefully tolerate restarts at multiple levels - improves fault tolerance, reduces time-to-repair, and enables system designers to build flexible, highly available software infrastructures. Using several examples of widely deployed software systems, we identify properties that are required of RR systems and outline an agenda for turning the recursive restartability philosophy into a practical software structuring tool. Finally, we describe infrastructural support for RR systems, along with initial ideas on how to analyze and benchmark such systems.

Citation (APA)

Candea, G., & Fox, A. (2001). Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the Workshop on Hot Topics in Operating Systems - HOTOS (pp. 125–130). https://doi.org/10.1109/hotos.2001.990072
