Implementing rollback-recovery coordinated checkpoints

Abstract

Recovering from processor failures is an important problem in the design of reliable distributed systems. Processes must coordinate so that the set of local checkpoints taken by the individual processes forms a consistent global checkpoint (a recovery line). This allows the system to resume operation from a consistent global state when recovering from a failure. This paper presents results from the implementation of a distributed rollback-recovery algorithm that is transparent (it imposes no special requirements on applications) and coordinated (non-blocking). Because it does not block applications, overhead during failure-free operation is reduced. Furthermore, rollback can be executed quickly, because a recovery line is always available and well identified. Our preliminary experimental results show that the algorithm imposes very low performance overhead (less than 2%) and exhibits a strong dependence on checkpoint size. We are currently studying implementation optimizations to reduce checkpoint latency. © Springer-Verlag Berlin Heidelberg 2005.
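To make the notion of a consistent global checkpoint concrete, the following is a minimal sketch of the standard consistency condition: no local checkpoint may record the receipt of a message that no checkpoint records as sent (an "orphan" message). It is not the algorithm implemented in the paper. The names `LocalCheckpoint`, `orphan_messages`, and `is_consistent`, and the representation of messages as string identifiers, are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class LocalCheckpoint:
    """Hypothetical record of one process's saved state:
    the ids of messages it had sent and received when the checkpoint was taken."""
    process: str
    sent: set = field(default_factory=set)
    received: set = field(default_factory=set)


def orphan_messages(checkpoints):
    """Return ids of messages recorded as received by some checkpoint
    but not recorded as sent by any checkpoint in the set."""
    all_sent = set()
    for cp in checkpoints:
        all_sent |= cp.sent
    orphans = set()
    for cp in checkpoints:
        orphans |= cp.received - all_sent
    return orphans


def is_consistent(checkpoints):
    """A set of local checkpoints is treated as a recovery line
    if it contains no orphan messages."""
    return not orphan_messages(checkpoints)


# P checkpoints after sending m1; Q checkpoints after receiving m1.
p_late = LocalCheckpoint("P", sent={"m1"})
q = LocalCheckpoint("Q", received={"m1"})
print(is_consistent([p_late, q]))   # True

# P checkpoints before sending m1, so Q's receipt of m1 is an orphan.
p_early = LocalCheckpoint("P")
print(is_consistent([p_early, q]))  # False
```

In the second case, rolling back to `p_early` and `q` would leave Q having received a message that P never sent, which is why the pair cannot serve as a recovery line. A message recorded as sent but not yet received (in transit) does not violate this condition.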

Cite

Buligon, C., Cechin, S., & Jansch-Pôrto, I. (2005). Implementing rollback-recovery coordinated checkpoints. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3563 LNCS, pp. 246–257). Springer Verlag. https://doi.org/10.1007/11533962_22
