Impact of over-decomposition on coordinated checkpoint/rollback protocol

Abstract

Failure-free execution will become rare on future exascale computers, making fault tolerance an active field of research. In this paper, we study how decomposing an application into much more parallelism than the physical parallelism available affects the rollback step of fault-tolerant coordinated protocols. This over-decomposition gives the runtime a better opportunity to balance the workload after a failure, without the need for spare nodes and while preserving performance. We show that the overhead on normal execution remains low for relevant over-decomposition factors. With over-decomposition, restarting execution on the remaining nodes after a failure performs very well compared to the classic decomposition approach: our experiments show that the execution time after restart can be reduced by 42%. We also consider a partial restart protocol that reduces the amount of lost work after a failure by tracking task dependencies inside processes. In some cases, thanks to over-decomposition, the partial restart time can represent only 54% of the global restart time. © 2012 Springer-Verlag Berlin Heidelberg.
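To illustrate the core idea, the following C++ snippet is a minimal sketch (not the authors' Kaapi-based protocol) of why over-decomposition helps after a failure: with several tasks per node, a failed node's tasks can be redistributed round-robin over the survivors, keeping the load balanced without spare nodes. The node count and the over-decomposition factor of 4 are assumptions chosen for illustration.

```cpp
#include <iostream>
#include <vector>

int main() {
    const int nodes  = 8;  // physical parallelism
    const int factor = 4;  // hypothetical over-decomposition factor
    const int tasks  = nodes * factor;

    // Initial distribution: task t is placed on node t % nodes.
    std::vector<std::vector<int>> owner(nodes);
    for (int t = 0; t < tasks; ++t)
        owner[t % nodes].push_back(t);

    // Simulate the failure of node 0: after a coordinated rollback to the
    // last checkpoint, its tasks are rebalanced over the remaining nodes.
    std::vector<int> orphaned = owner[0];
    owner[0].clear();
    for (size_t i = 0; i < orphaned.size(); ++i)
        owner[1 + i % (nodes - 1)].push_back(orphaned[i]);

    // Report per-node load: each survivor carries at most one extra task,
    // so no single node becomes a bottleneck.
    for (int n = 1; n < nodes; ++n)
        std::cout << "node " << n << ": " << owner[n].size() << " tasks\n";
}
```

With a classic one-task-per-node decomposition, the failed node's entire share would land on a single survivor, roughly doubling its load; a larger over-decomposition factor spreads that share in finer grains, which is the effect the paper's restart experiments measure.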

Citation (APA)

Besseron, X., & Gautier, T. (2012). Impact of over-decomposition on coordinated checkpoint/rollback protocol. In Lecture Notes in Computer Science (Vol. 7156, pp. 322–332). Springer-Verlag. https://doi.org/10.1007/978-3-642-29740-3_36
