Abstract
Coordinated checkpointing is a well-known method to achieve fault tolerance in distributed systems. Long running parallel applications and high-availability applications are two potential users of checkpointing, although with different requirements. Parallel applications need low failure-free overheads, and high-availability applications require fast and bounded recoveries. In this paper we describe a new coordinated checkpoint protocol capable of satisfying both types of applications. The protocol uses time to avoid all types of direct coordination (e.g., message exchanges and message tagging), reducing the overheads to almost a minimum. To ensure that rapid recoveries can be attained the protocol guarantees small checkpoint latencies. The protocol was implemented and tested on a cluster of workstations connected by a 155 Mbit/sec ATM. Experimental results show that the protocol overheads are very small.
Neves, N., & Fuchs, W. K. (2002). Coordinated checkpointing without direct coordination (pp. 23–31). Institute of Electrical and Electronics Engineers (IEEE). https://doi.org/10.1109/ipds.1998.707706