Fault-tolerance protocols play an important role in today's long-running parallel scientific applications. The probability of a failure can be significant because of the number of unreliable components involved in an execution. In this paper we present our approach and preliminary results for a new checkpoint/rollback protocol based on a coordinated scheme. One feature of this protocol is that fault recovery requires only a partial restart of the other processes, thanks to the availability of an abstract representation of the execution. Simulations on a domain decomposition application show that the amount of computation required to restart and the number of processes involved are reduced compared with the classical global rollback protocol. © Springer-Verlag Berlin Heidelberg 2008.
Citation
Besseron, X., & Gautier, T. (2008). Optimised Recovery with a Coordinated Checkpoint/Rollback Protocol for Domain Decomposition Applications. In Communications in Computer and Information Science (Vol. 14, pp. 497–506). https://doi.org/10.1007/978-3-540-87477-5_53