Optimised Recovery with a Coordinated Checkpoint/Rollback Protocol for Domain Decomposition Applications

Xavier Besseron; Thierry Gautier

Conference Proceedings

Communications in Computer and Information Science (2008) 14 497-506

DOI: 10.1007/978-3-540-87477-5_53

Abstract

Fault-tolerance protocols play an important role in today's long-running scientific parallel applications. The probability of a failure can be significant because of the number of unreliable components involved in an execution. In this paper we present our approach and preliminary results for a new checkpoint/rollback protocol based on a coordinated scheme. One feature of this protocol is that fault recovery requires only a partial restart of the other processes, thanks to the availability of an abstract representation of the execution. Simulations on a domain decomposition application show that the amount of computation required to restart and the number of processes involved are reduced compared to the classical global rollback protocol. © Springer-Verlag Berlin Heidelberg 2008.
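
To make the idea of a partial restart concrete, the following is a minimal illustrative sketch, not the authors' protocol. It assumes a hypothetical 1D domain decomposition in which the state of a subdomain after iteration i depends only on the states of itself and its two neighbours after iteration i-1, a coordinated checkpoint that saved every subdomain after iteration `checkpoint_iter`, and surviving processes that keep only their current state. The function name `tasks_to_reexecute` and all parameters are assumptions introduced for illustration. Under these assumptions, the tasks to re-run form a dependency cone rooted at the failed subdomain, which can be much smaller than a global rollback.

```python
def tasks_to_reexecute(n_domains, failed, checkpoint_iter, failure_iter):
    """Return the set of (domain, iteration) tasks needed to rebuild the
    state of `failed` after `failure_iter`, starting from a coordinated
    checkpoint taken after `checkpoint_iter`.

    Assumed dependency model (hypothetical): task (d, i) reads the outputs
    of tasks (d-1, i-1), (d, i-1) and (d+1, i-1).
    """
    needed = set()
    frontier = {failed}  # domains whose output is required at this iteration
    # Walk backwards from the failure to the first iteration after the checkpoint.
    for it in range(failure_iter, checkpoint_iter, -1):
        for d in frontier:
            needed.add((d, it))
        # Inputs of the current frontier come from its neighbours one step earlier.
        frontier = {
            nb
            for d in frontier
            for nb in (d - 1, d, d + 1)
            if 0 <= nb < n_domains
        }
    return needed


def global_rollback_cost(n_domains, checkpoint_iter, failure_iter):
    """Number of tasks re-run when every domain rolls back to the checkpoint."""
    return n_domains * (failure_iter - checkpoint_iter)


if __name__ == "__main__":
    tasks = tasks_to_reexecute(n_domains=10, failed=5, checkpoint_iter=0, failure_iter=3)
    involved = sorted({d for d, _ in tasks})
    print("tasks to re-run:", len(tasks))
    print("domains involved:", involved)
    print("global rollback tasks:", global_rollback_cost(10, 0, 3))
```

In this toy setting, the gap between the two costs grows with the number of subdomains and shrinks as the interval between checkpoint and failure gets longer, since the cone widens by one neighbour per iteration.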

Cite

Besseron, X., & Gautier, T. (2008). Optimised Recovery with a Coordinated Checkpoint/Rollback Protocol for Domain Decomposition Applications. In Communications in Computer and Information Science (Vol. 14, pp. 497–506). https://doi.org/10.1007/978-3-540-87477-5_53
