Coordinated checkpointing without direct coordination

  • Neves N
  • Fuchs W

Abstract

Coordinated checkpointing is a well-known method to achieve fault tolerance in distributed systems. Long-running parallel applications and high-availability applications are two potential users of checkpointing, although with different requirements. Parallel applications need low failure-free overheads, and high-availability applications require fast and bounded recoveries. In this paper we describe a new coordinated checkpoint protocol capable of satisfying both types of applications. The protocol uses time to avoid all types of direct coordination (e.g., message exchanges and message tagging), reducing the overheads to almost a minimum. To ensure that rapid recoveries can be attained, the protocol guarantees small checkpoint latencies. The protocol was implemented and tested on a cluster of workstations connected by a 155 Mbit/sec ATM. Experimental results show that the protocol overheads are very small.

Cite

Neves, N., & Fuchs, W. K. (2002). Coordinated checkpointing without direct coordination (pp. 23–31). Institute of Electrical and Electronics Engineers (IEEE). https://doi.org/10.1109/ipds.1998.707706
