Optimised Recovery with a Coordinated Checkpoint/Rollback Protocol for Domain Decomposition Applications

Abstract

Fault-tolerance protocols play an important role in today's long-running scientific parallel applications. The probability of a failure can be significant because of the number of unreliable components involved in an execution. In this paper we present our approach and preliminary results for a new checkpoint/rollback protocol based on a coordinated scheme. One feature of this protocol is that fault recovery requires only a partial restart of the other processes, thanks to the availability of an abstract representation of the execution. Simulations on a domain decomposition application show that the amount of computation required to restart and the number of processes involved are reduced compared to the classical global rollback protocol. © Springer-Verlag Berlin Heidelberg 2008.

Cite

Besseron, X., & Gautier, T. (2008). Optimised Recovery with a Coordinated Checkpoint/Rollback Protocol for Domain Decomposition Applications. In Communications in Computer and Information Science (Vol. 14, pp. 497–506). https://doi.org/10.1007/978-3-540-87477-5_53
