Abstract
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables the application to survive many fail-stop errors, thereby allowing for longer checkpointing periods. Previously published works use replication with the no-restart strategy, which works as follows: (i) compute the application Mean Time To Interruption (MTTI) $M$ as a function of the number of processor pairs and the individual processor Mean Time Between Failures (MTBF); (ii) use checkpointing period $T_{\text{MTTI}}^{\text{no}} = \sqrt{2MC}$ à la Young/Daly, where $C$ is the checkpoint duration; and (iii) never restart failed processors until the application crashes. We introduce the restart strategy where failed processors are restarted after each checkpoint. We compute the optimal checkpointing period $T_{\text{opt}}^{\text{rs}}$ for this strategy, which is much larger than $T_{\text{MTTI}}^{\text{no}}$, thereby decreasing I/O pressure. We show through simulations that using $T_{\text{opt}}^{\text{rs}}$ and the restart strategy, instead of $T_{\text{MTTI}}^{\text{no}}$ and the usual no-restart strategy, significantly decreases the overhead induced by replication.
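As a rough illustration of step (ii), the Young/Daly period for the no-restart strategy can be evaluated directly from $M$ and $C$. The sketch below is not taken from the paper: the function name `young_daly_period` and the numeric values are hypothetical, and $M$ is supplied as an input because the abstract does not state how it is derived from the number of processor pairs and the MTBF. The restart-strategy period is likewise not sketched, since the abstract gives no formula for it.

```python
import math


def young_daly_period(mtti: float, checkpoint_cost: float) -> float:
    """Return the Young/Daly checkpointing period sqrt(2 * M * C).

    mtti            -- application MTTI M (hypothetical input, in seconds)
    checkpoint_cost -- checkpoint duration C (same time unit as mtti)
    """
    if mtti <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtti and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtti * checkpoint_cost)


if __name__ == "__main__":
    # Illustrative values only (assumptions, not results from the paper):
    # M = 7200 s and C = 100 s.
    period = young_daly_period(7200.0, 100.0)
    print(f"Checkpoint every {period:.0f} s")
```

Because the period grows with the square root of $M$, a fourfold increase in the MTTI only doubles the interval between checkpoints.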
Cite
Benoit, A., Herault, T., Fèvre, V. L., & Robert, Y. (2019). Replication is more efficient than you think. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC. IEEE Computer Society. https://doi.org/10.1145/3295500.3356171