Fault tolerant MapReduce-MPI for HPC clusters

Yanfei Guo; Wesley Bland; Pavan Balaji; Xiaobo Zhou

Conference Proceedings (Open Access)

International Conference for High Performance Computing, Networking, Storage and Analysis, SC (2015) 15-20-November-2015

DOI: 10.1145/2807591.2807617

27 Citations

41 Readers

Abstract

Building MapReduce applications using the Message-Passing Interface (MPI) enables us to exploit the performance of large HPC clusters for big data analytics. However, due to the lack of native fault tolerance support in MPI and the incompatibility between the MapReduce fault tolerance model and HPC schedulers, it is very hard to provide a fault tolerant MapReduce runtime for HPC clusters. We propose and develop FT-MRMPI, the first fault tolerant MapReduce framework on MPI for HPC clusters. We discover a unique way to perform failure detection and recovery by exploiting the current MPI semantics and the newly proposed user-level failure mitigation. We design and develop a checkpoint/restart model for fault tolerant MapReduce in MPI. We further tailor a detect/resume model to conserve work for more efficient fault tolerance. Experimental results on a 256-node HPC cluster show that FT-MRMPI effectively masks failures and reduces the job completion time by 39%.

Cite

Guo, Y., Bland, W., Balaji, P., & Zhou, X. (2015). Fault tolerant MapReduce-MPI for HPC clusters. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC (Vol. 15-20-November-2015). IEEE Computer Society. https://doi.org/10.1145/2807591.2807617
