Check-pointing approach for fault tolerance in OpenSHMEM

Abstract

Fault tolerance for long-running applications is critical to guard against failure of either compute resources or the network. Accomplishing this in software is non-trivial, and a one-sided communication library like OpenSHMEM adds a further level of complexity, since there is no matching communication call at the target processing element (PE). In this paper we explore a fault-tolerance scheme based on check-point and restart that caters to the one-sided nature of the PGAS programming model while leveraging features specific to OpenSHMEM. Through a working implementation with a 1-D Jacobi code, we show that the approach is scalable and provides considerable savings in computational resources.
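
To make the check-point and restart idea concrete, here is a minimal single-process sketch in Python. It is not the authors' scheme and does not use OpenSHMEM: the function names (`jacobi_step`, `run`), the in-memory check-point, and the simulated failure are all assumptions made for illustration. It only shows the general pattern of saving state periodically and rolling back to the last saved state after a failure.

```python
def jacobi_step(u):
    """One 1-D Jacobi sweep; the two boundary values stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def run(u0, iterations, checkpoint_every=None, fail_at=None):
    """Run Jacobi sweeps with optional check-pointing and one simulated failure.

    Hypothetical helper: returns (final_state, sweeps_executed).
    """
    u = u0[:]
    saved_iter, saved_u = 0, u0[:]  # the initial state acts as check-point 0
    it = 0
    sweeps = 0
    failed = False
    while it < iterations:
        if fail_at is not None and not failed and it == fail_at:
            # Simulated failure: discard live state, restart from the check-point.
            failed = True
            it, u = saved_iter, saved_u[:]
            continue
        u = jacobi_step(u)
        it += 1
        sweeps += 1
        if checkpoint_every and it % checkpoint_every == 0:
            # Take a check-point: remember the iteration count and a copy of the data.
            saved_iter, saved_u = it, u[:]
    return u, sweeps


if __name__ == "__main__":
    u0 = [1.0] + [0.0] * 8 + [1.0]
    _, with_ckpt = run(u0, 50, checkpoint_every=10, fail_at=37)
    _, without_ckpt = run(u0, 50, fail_at=37)
    print("sweeps with check-points:", with_ckpt)
    print("sweeps without check-points:", without_ckpt)
```

In this toy setting, a failure at iteration 37 with a check-point every 10 iterations repeats only the sweeps since iteration 30, whereas the run without check-points repeats everything from the start. A real one-sided implementation would additionally have to decide where check-point data lives and how remote writes in flight are handled, which this sketch does not attempt.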

Cite

Hao, P., Pophale, S., Shamis, P., Curtis, T., & Chapman, B. (2015). Check-pointing approach for fault tolerance in OpenSHMEM. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9397, pp. 36–52). Springer Verlag. https://doi.org/10.1007/978-3-319-26428-8_3
