NR-MPI: A non-stop and fault resilient MPI supporting programmer defined data backup and restore for E-scale super computing systems

Abstract

Fault resilience has become a major issue for HPC systems, particularly in view of future E-scale systems, which will consist of millions of CPU cores and other components. MPI-level fault tolerance constructs, such as ULFM, are being proposed to support software-level fault tolerance. However, there are few systematic evaluations by application programmers using benchmarks or pseudo applications. This paper proposes NR-MPI, a Non-stop and Fault Resilient MPI that supports programmer-defined data backup and restore. To help programmers write fault-tolerant programs, NR-MPI provides a set of friendly programming interfaces and a state transition diagram for data backup and restore. This paper focuses on the design, implementation and evaluation of NR-MPI. Specifically, it emphasizes failure detection in the MPI library, the friendly programming interface extensions for NR-MPI, and examples of fault-tolerant programs based on NR-MPI. Furthermore, to support failure recovery of applications, NR-MPI implements data backup interfaces based on double in-memory checkpoint/restart. We conduct experiments with both the NPB benchmarks and Sweep3D on the TH supercomputer at NSCC-TJ. Experimental results show that NR-MPI-based fault-tolerant programs can recover from failures online without restarting, and that the overhead is small even for applications running on tens of thousands of cores.
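
The abstract does not describe the backup interfaces in detail, so the following is only a rough sketch of the general idea behind double in-memory checkpointing, under the assumption that each process keeps one copy of its checkpoint in its own memory and a second copy in the memory of a partner ("buddy") process. It simulates the ranks inside a single Python process and does not use MPI. All names here (`Rank`, `buddy_of`, `backup`, `restore`, `run_demo`) are hypothetical and are not NR-MPI's API.

```python
import copy


class Rank:
    """Simulated process: application state plus two in-memory checkpoint slots."""

    def __init__(self, rank_id, state):
        self.rank_id = rank_id
        self.state = state
        self.local_ckpt = None   # copy of this rank's own state
        self.buddy_ckpt = None   # copy of the state of the rank that uses us as buddy


def buddy_of(rank_id, n):
    """Assumed pairing rule: the next rank in a ring holds our second copy."""
    return (rank_id + 1) % n


def backup(ranks):
    """Store every rank's state twice: locally and on its buddy."""
    n = len(ranks)
    for r in ranks:
        r.local_ckpt = copy.deepcopy(r.state)
        ranks[buddy_of(r.rank_id, n)].buddy_ckpt = copy.deepcopy(r.state)


def restore(ranks, failed_id):
    """Rebuild the failed rank from its buddy and roll survivors back."""
    n = len(ranks)
    buddy = ranks[buddy_of(failed_id, n)]
    # The replacement takes over the failed rank's id and its last saved state.
    ranks[failed_id] = Rank(failed_id, copy.deepcopy(buddy.buddy_ckpt))
    # Survivors return to their own local checkpoint so all ranks are consistent.
    for r in ranks:
        if r.rank_id != failed_id:
            r.state = copy.deepcopy(r.local_ckpt)
    # Re-establish both copies, since the failed rank's slots were lost.
    backup(ranks)


def run_demo():
    ranks = [Rank(i, {"step": 0, "data": [i] * 3}) for i in range(4)]
    backup(ranks)
    # Progress past the checkpoint, then lose rank 2 and everything in its memory.
    for r in ranks:
        r.state["step"] = 5
        r.state["data"][0] = -1
    ranks[2] = None
    restore(ranks, 2)
    return ranks


if __name__ == "__main__":
    for r in run_demo():
        print(r.rank_id, r.state)
```

In this sketch a single failure is survivable because the lost rank's data still exists on its buddy; losing a rank and its buddy at the same time would not be recoverable without a further copy.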

Cite

Suo, G., Lu, Y., Liao, X., Xie, M., & Cao, H. (2016). NR-MPI: A non-stop and fault resilient MPI supporting programmer defined data backup and restore for E-scale super computing systems. Supercomputing Frontiers and Innovations, 3(1), 4–21. https://doi.org/10.14529/jsfi160101
