A scalable asynchronous replication-based strategy for fault tolerant MPI applications

John Paul Walters; Vipin Chaudhary

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 4873 LNCS, pp. 257–268 (2007)

DOI: 10.1007/978-3-540-77220-0_26

Abstract

As computational clusters grow in size, their mean time to failure decreases. Checkpointing is typically used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for the checkpoints, which severely limits their scalability. We propose a scalable replication-based MPI checkpointing facility built on LAM/MPI. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, SAN-based solutions, and a commercial parallel file system, and show that they do not scale, particularly beyond 64 CPUs. Using the NAS Parallel Benchmarks and the High Performance LINPACK benchmark in tests of up to 256 nodes, we demonstrate that checkpointing and replication can be achieved with much lower overhead than current techniques incur. © Springer-Verlag Berlin Heidelberg 2007.
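The idea of replacing central checkpoint storage with peer replication can be illustrated with a small sketch. This is not the authors' implementation: the ring-style placement rule, the function names `choose_replicas`, `placement`, and `recoverable`, and the one-rank-per-node simplification are all assumptions made for illustration, since the abstract does not say how replica holders are chosen.

```python
def choose_replicas(rank, world_size, num_replicas):
    """Return the peer ranks that hold copies of `rank`'s checkpoint.

    Hypothetical ring placement: the next `num_replicas` ranks, wrapping
    around. The abstract does not specify the paper's placement policy.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank out of range")
    if not 0 < num_replicas < world_size:
        raise ValueError("need between 1 and world_size - 1 replicas")
    return [(rank + i) % world_size for i in range(1, num_replicas + 1)]


def placement(world_size, num_replicas):
    """Map every rank to the peers storing its checkpoint replicas."""
    return {
        rank: choose_replicas(rank, world_size, num_replicas)
        for rank in range(world_size)
    }


def recoverable(placement_map, failed):
    """Check whether every rank's checkpoint survives the given failures.

    Assumes one rank per node: a surviving rank still has its local copy,
    while a failed rank needs at least one surviving replica holder.
    """
    failed = set(failed)
    return all(
        rank not in failed or any(peer not in failed for peer in peers)
        for rank, peers in placement_map.items()
    )


if __name__ == "__main__":
    layout = placement(world_size=8, num_replicas=2)
    print(layout[7])
    print(recoverable(layout, failed={0}))
    print(recoverable(layout, failed={0, 1, 2}))
```

With two replicas per rank, a single node failure is always recoverable in this model, whereas losing a rank together with both of its replica holders is not. No storage server appears anywhere in the layout, which is the property the abstract emphasizes.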

Citation (APA)

Walters, J. P., & Chaudhary, V. (2007). A scalable asynchronous replication-based strategy for fault tolerant MPI applications. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4873 LNCS, pp. 257–268). Springer Verlag. https://doi.org/10.1007/978-3-540-77220-0_26
