Recursive restartability: turning the reboot sledgehammer into a scalpel

  • Candea G
  • Fox A

Even after decades of software engineering research, complex computer systems still fail, primarily due to nondeterministic bugs that are typically resolved by rebooting. Conceding that Heisenbugs will remain a fact of life, we propose a systematic investigation of restarts as "high availability medicine." In this paper we show how recursive restartability (RR), the ability of a system to gracefully tolerate restarts at multiple levels, improves fault tolerance, reduces time-to-repair, and enables system designers to build flexible, highly available software infrastructures. Using several examples of widely deployed software systems, we identify properties that are required of RR systems and outline an agenda for turning the recursive restartability philosophy into a practical software structuring tool. Finally, we describe infrastructural support for RR systems, along with initial ideas on how to analyze and benchmark such systems.
