Middleware to Manage Fault Tolerance Using Semi-Coordinated Checkpoints

Abstract

Compute node failures are becoming a normal event for many long-running and scalable MPI applications. Staying within the MPI standard and building on existing fault-tolerance methods, we developed a methodology that lets applications tolerate failures by creating semi-coordinated checkpoints within the RADIC architecture. To this end, we developed the ULSC2-RADIC middleware, which divides the application into independent MPI worlds, one per compute node, and uses the DMTCP checkpoint library in a semi-coordinated environment. We ran experiments with scientific applications and the NAS Parallel Benchmarks to assess the overhead and to verify functionality when a node fails. We also evaluated the computational cost of semi-coordinated checkpoints compared with coordinated checkpoints.
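
To make the grouping idea in the abstract concrete, the following is a minimal Python sketch, not the ULSC2-RADIC implementation. It assumes a hypothetical mapping from process rank to node name and shows which processes would have to synchronize for a checkpoint under each scheme: all ranks together in the coordinated case, and only the ranks that share a node (one "world" per node) in the semi-coordinated case. The function names and the layout are illustrative assumptions, and the handling of messages that cross worlds is left out entirely.

```python
from collections import defaultdict


def partition_into_worlds(rank_to_node):
    """Group global ranks by the compute node that hosts them.

    Each resulting group stands in for one independent "world";
    this is an illustrative model, not an MPI call.
    """
    worlds = defaultdict(list)
    for rank, node in sorted(rank_to_node.items()):
        worlds[node].append(rank)
    return dict(worlds)


def checkpoint_groups(rank_to_node, mode):
    """Return the groups of ranks that synchronize to take a checkpoint."""
    if mode == "coordinated":
        # Every rank in the application takes part in one global checkpoint.
        return [sorted(rank_to_node)]
    if mode == "semi-coordinated":
        # Ranks coordinate only with the other ranks on the same node.
        return list(partition_into_worlds(rank_to_node).values())
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical layout: four ranks spread over two nodes.
layout = {0: "node-a", 1: "node-a", 2: "node-b", 3: "node-b"}

if __name__ == "__main__":
    print(checkpoint_groups(layout, "coordinated"))
    print(checkpoint_groups(layout, "semi-coordinated"))
```

In this toy model the semi-coordinated groups never span nodes, so each node's checkpoint can be taken without waiting on any other node.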

Cite

Wong, A., Heymann, E., Rexachs, D., & Luque, E. (2021). Middleware to Manage Fault Tolerance Using Semi-Coordinated Checkpoints. IEEE Transactions on Parallel and Distributed Systems, 32(2), 254–268. https://doi.org/10.1109/TPDS.2020.3015615
