Check-pointing approach for fault tolerance in OpenSHMEM

Pengfei Hao; Swaroop Pophale; Pavel Shamis; Tony Curtis; Barbara Chapman

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2015), vol. 9397, pp. 36-52

DOI: 10.1007/978-3-319-26428-8_3

Abstract

Fault tolerance for long-running applications is critical to guard against failure of either compute resources or the network. Accomplishing this in software is non-trivial, and a one-sided communication library such as OpenSHMEM adds a further level of complexity, since there is no matching communication call at the target processing element (PE). In this paper we explore a fault-tolerance scheme based on check-point and restart that caters to the one-sided nature of the PGAS programming model while leveraging features specific to OpenSHMEM. Through a working implementation with a 1-D Jacobi code, we show that the approach is scalable and provides considerable savings in computational resources.
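
The general check-point and restart idea mentioned in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's implementation and does not use OpenSHMEM: the function names (`jacobi_step`, `run`), the in-memory checkpoint, the checkpoint interval, and the simulated failure point are all assumptions made for illustration. It only shows why rolling back to the last checkpoint saves recomputation compared with restarting a 1-D Jacobi iteration from the beginning.

```python
def jacobi_step(u):
    """One 1-D Jacobi sweep: each interior point becomes the mean of its neighbours."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def run(u0, steps, checkpoint_every, fail_at=None):
    """Run `steps` sweeps, saving a checkpoint every `checkpoint_every` iterations.

    If `fail_at` is given, a single failure is simulated when the iteration
    counter reaches that value: the state is discarded and restored from the
    most recent checkpoint (hypothetical in-memory copy, not a real PE restart).
    Returns the final grid and the number of iterations that had to be redone.
    """
    u = u0[:]
    checkpoint = (0, u0[:])  # (iteration, copy of the grid)
    recomputed = 0
    failed = False
    it = 0
    while it < steps:
        if fail_at is not None and not failed and it == fail_at:
            failed = True
            saved_it, saved_u = checkpoint
            recomputed = it - saved_it  # work lost since the last checkpoint
            it, u = saved_it, saved_u[:]
            continue
        u = jacobi_step(u)
        it += 1
        if it % checkpoint_every == 0:
            checkpoint = (it, u[:])
    return u, recomputed


if __name__ == "__main__":
    u0 = [1.0] + [0.0] * 8 + [1.0]  # fixed boundary values at both ends
    clean, _ = run(u0, steps=50, checkpoint_every=10)
    recovered, redo = run(u0, steps=50, checkpoint_every=10, fail_at=37)
    print("same result after recovery:", recovered == clean)
    print("iterations recomputed:", redo)
```

In this sketch a failure at iteration 37 with a checkpoint every 10 iterations rolls back to iteration 30, so 7 iterations are redone instead of 37. A real one-sided setting is harder than this, because remote puts and gets can modify a PE's memory without any matching call on that PE, which is the complication the abstract points to.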

Citation (APA)

Hao, P., Pophale, S., Shamis, P., Curtis, T., & Chapman, B. (2015). Check-pointing approach for fault tolerance in OpenSHMEM. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9397, pp. 36–52). Springer Verlag. https://doi.org/10.1007/978-3-319-26428-8_3
