Implementing rollback-recovery coordinated checkpoints

Clairton Buligon; Sérgio Cechin; Ingrid Jansch-Pôrto

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2005) 3563 LNCS 246-257

DOI: 10.1007/11533962_22

Abstract

Recovering from processor failures in distributed systems is an important problem in the design of reliable systems. The processes should coordinate their operation to guarantee that the set of local checkpoints taken by the individual processes forms a consistent global checkpoint (recovery line). This allows the system to resume operation from a consistent global state when recovering from a failure. This paper presents the results of implementing a transparent (no special requirements on applications) and coordinated (non-blocking) distributed rollback-recovery algorithm. Because it does not block applications, overhead during failure-free operation is reduced. Furthermore, the rollback procedure can be executed quickly, as a recovery line is always available and well identified. Our preliminary experimental results show that the algorithm causes very low performance overhead (less than 2%), with a high dependency on the checkpoint size. We are now studying implementation optimizations to reduce checkpoint latency. © Springer-Verlag Berlin Heidelberg 2005.

Cite

Buligon, C., Cechin, S., & Jansch-Pôrto, I. (2005). Implementing rollback-recovery coordinated checkpoints. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3563 LNCS, pp. 246–257). Springer Verlag. https://doi.org/10.1007/11533962_22
