Coordinated checkpointing without direct coordination

N. Neves; W.K. Fuchs

Conference Proceedings

DOI: 10.1109/ipds.1998.707706

Abstract

Coordinated checkpointing is a well-known method to achieve fault tolerance in distributed systems. Long running parallel applications and high-availability applications are two potential users of checkpointing, although with different requirements. Parallel applications need low failure-free overheads, and high-availability applications require fast and bounded recoveries. In this paper we describe a new coordinated checkpoint protocol capable of satisfying both types of applications. The protocol uses time to avoid all types of direct coordination (e.g., message exchanges and message tagging), reducing the overheads to almost a minimum. To ensure that rapid recoveries can be attained the protocol guarantees small checkpoint latencies. The protocol was implemented and tested on a cluster of workstations connected by a 155 Mbit/sec ATM. Experimental results show that the protocol overheads are very small.
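
The abstract does not spell out how time replaces message exchanges and tagging. As a rough illustration only, the sketch below shows one way a time-based scheme could work: every process checkpoints when a loosely synchronized local timer fires, then withholds sends for a short window so that no message can reach a peer that has not checkpointed yet. The names `ClockBounds`, `max_deviation`, `send_blackout`, and `may_send`, and the bound "initial deviation plus twice the drift", are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockBounds:
    """Hypothetical timing assumptions; none of these values come from the paper."""
    initial_deviation: float  # max timer offset right after resynchronization (s)
    drift_rate: float         # max clock drift per second of real time
    min_delivery: float       # minimum message delivery time (s)


def max_deviation(b: ClockBounds, elapsed: float) -> float:
    """Worst-case gap between two processes' timers after `elapsed` seconds."""
    # Two clocks may drift in opposite directions, hence the factor of 2.
    return b.initial_deviation + 2.0 * b.drift_rate * elapsed


def send_blackout(b: ClockBounds, elapsed: float) -> float:
    """How long a process withholds sends after its own checkpoint timer fires."""
    # A message needs at least `min_delivery` to arrive, so that much of the
    # deviation is already covered by the network itself.
    return max(0.0, max_deviation(b, elapsed) - b.min_delivery)


def may_send(now: float, checkpoint_time: float, b: ClockBounds, elapsed: float) -> bool:
    """True if a send at local time `now` falls outside the blackout window.

    Messages sent before the checkpoint and delivered after the receiver's
    checkpoint (in-transit messages) are not handled by this sketch.
    """
    return now < checkpoint_time or now >= checkpoint_time + send_blackout(b, elapsed)
```

With an assumed 1 ms initial deviation, a drift rate of 1e-5, 100 s since the last resynchronization, and a 0.2 ms minimum delivery time, the window in this sketch comes to about 2.8 ms per checkpoint, which suggests why such a scheme can keep failure-free overhead low.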

Cite

Neves, N., & Fuchs, W. K. (2002). Coordinated checkpointing without direct coordination (pp. 23–31). Institute of Electrical and Electronics Engineers (IEEE). https://doi.org/10.1109/ipds.1998.707706
