Replication is more efficient than you think

Anne Benoit; Thomas Herault; Valentin Le Fèvre; Yves Robert

Conference Proceedings

International Conference for High Performance Computing, Networking, Storage and Analysis, SC (2019)

DOI: 10.1145/3295500.3356171

Abstract

This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables the application to survive many fail-stop errors, thereby allowing for longer checkpointing periods. Previously published works use replication with the no-restart strategy, which works as follows: (i) compute the application Mean Time To Interruption (MTTI) $M$ as a function of the number of processor pairs and the individual processor Mean Time Between Failures (MTBF); (ii) use checkpointing period $T_{\text{MTTI}}^{\text{no}} = \sqrt{2MC}$ à la Young/Daly, where $C$ is the checkpoint duration; and (iii) never restart failed processors until the application crashes. We introduce the restart strategy where failed processors are restarted after each checkpoint. We compute the optimal checkpointing period $T_{\text{opt}}^{\text{rs}}$ for this strategy, which is much larger than $T_{\text{MTTI}}^{\text{no}}$, thereby decreasing I/O pressure. We show through simulations that using $T_{\text{opt}}^{\text{rs}}$ and the restart strategy, instead of $T_{\text{MTTI}}^{\text{no}}$ and the usual no-restart strategy, significantly decreases the overhead induced by replication.
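
As a rough illustration of step (ii), the Young/Daly period for the no-restart strategy is the square root of 2MC. The sketch below only evaluates that formula; the function name and the numeric values are hypothetical and are not taken from the paper, and the MTTI M is assumed to be given rather than derived from the number of processor pairs and the MTBF.

```python
import math


def young_daly_period(mtti: float, checkpoint_cost: float) -> float:
    """Return sqrt(2 * M * C), the Young/Daly checkpointing period.

    mtti            -- application Mean Time To Interruption M (assumed given)
    checkpoint_cost -- checkpoint duration C, in the same time unit as M
    """
    if mtti <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtti and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtti * checkpoint_cost)


# Hypothetical values, for illustration only (not from the paper).
base_period = young_daly_period(mtti=50.0, checkpoint_cost=1.0)

# The period grows with the square root of M: a 4x larger MTTI
# (for example, thanks to replication) doubles the period.
longer_period = young_daly_period(mtti=200.0, checkpoint_cost=1.0)
```

Because the period scales with the square root of M, anything that raises the MTTI lengthens the interval between checkpoints, which is the effect the abstract attributes to replication. The optimal period for the restart strategy is derived in the paper itself and is not reproduced here.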

Cite

Benoit, A., Herault, T., Le Fèvre, V., & Robert, Y. (2019). Replication is more efficient than you think. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC. IEEE Computer Society. https://doi.org/10.1145/3295500.3356171
