Running MPI applications on clusters and Grid deployments that suffer from node and network failures motivates the use of fault-tolerant MPI implementations. We present MPICH-V2 (the second protocol of the MPICH-V project), an automatic fault-tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in-transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender-based message logging, and remote reliable logging of message logical clocks. This paper presents the architecture of MPICH-V2, its theoretical foundation, and the performance of the implementation. We compare MPICH-V2 to MPICH-V1 and MPICH-P4, evaluating a) its point-to-point performance, b) its performance on the NAS benchmarks, and c) application performance when many faults occur during execution. Experimental results demonstrate that MPICH-V2 provides performance close to that of MPICH-P4 for applications using large messages, while dramatically reducing the number of reliable nodes required compared to MPICH-V1. © 2003 ACM.
Bouteiller, A., Cappello, F., Hérault, T., Krawezik, G., Lemarinier, P., & Magniette, F. (2003). MPICH-V2: A fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, SC 2003. https://doi.org/10.1145/1048935.1050176