Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance

7Citations
Citations of this article
9Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deploying an application incurs overhead by tearing down and re-instating execution, and possibly limiting checkpointing retrieval from slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to withstand process and node failures shows that Reinit++ recovers much faster than restarting, up to 6×, or ULFM, up to 3×, and that it scales excellently as the number of MPI processes grows.

Cite

CITATION STYLE

APA

Georgakoudis, G., Guo, L., & Laguna, I. (2020). Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12151 LNCS, pp. 536–554). Springer. https://doi.org/10.1007/978-3-030-50743-5_27

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free