Replication is more efficient than you think

Abstract

This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables the application to survive many fail-stop errors, thereby allowing for longer checkpointing periods. Previously published works use replication with the no-restart strategy, which works as follows: (i) compute the application Mean Time To Interruption (MTTI) $M$ as a function of the number of processor pairs and the individual processor Mean Time Between Failures (MTBF); (ii) use checkpointing period $T_{\text{MTTI}}^{\text{no}} = \sqrt{2MC}$ à la Young/Daly, where $C$ is the checkpoint duration; and (iii) never restart failed processors until the application crashes. We introduce the restart strategy, where failed processors are restarted after each checkpoint. We compute the optimal checkpointing period $T_{\text{opt}}^{\text{rs}}$ for this strategy, which is much larger than $T_{\text{MTTI}}^{\text{no}}$, thereby decreasing I/O pressure. We show through simulations that using $T_{\text{opt}}^{\text{rs}}$ and the restart strategy, instead of $T_{\text{MTTI}}^{\text{no}}$ and the usual no-restart strategy, significantly decreases the overhead induced by replication.
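
As a rough illustration of step (ii) of the no-restart strategy, the Young/Daly period can be evaluated directly once the MTTI and the checkpoint duration are known. The sketch below is not taken from the paper: the function name `young_daly_period` and the numeric values are hypothetical, and the MTTI is passed in as a given number because the abstract does not state how it is derived from the number of processor pairs and the MTBF.

```python
import math


def young_daly_period(mtti: float, checkpoint_cost: float) -> float:
    """Return the Young/Daly checkpointing period sqrt(2 * M * C).

    mtti            -- application Mean Time To Interruption M (seconds)
    checkpoint_cost -- checkpoint duration C (seconds)

    Hypothetical helper for illustration; both inputs must be positive
    and expressed in the same time unit.
    """
    if mtti <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtti and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtti * checkpoint_cost)


# Illustrative values only (not from the paper): M = 20,000 s, C = 100 s.
period = young_daly_period(mtti=20_000.0, checkpoint_cost=100.0)
print(f"Checkpoint every {period:.0f} s")
```

Because the period grows with the square root of the MTTI, a fourfold increase in MTTI only doubles the interval between checkpoints under this formula.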

Cite

Benoit, A., Herault, T., Le Fèvre, V., & Robert, Y. (2019). Replication is more efficient than you think. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC. IEEE Computer Society. https://doi.org/10.1145/3295500.3356171
