Fault tolerance for long-running applications is critical to guard against failure of either compute resources or the network. Accomplishing this in software is non-trivial, and implementing a working model for a one-sided communication library like OpenSHMEM adds a further level of complexity, since there is no matching communication call at the target processing element (PE). In this paper we explore a fault tolerance scheme based on check-point and restart that caters to the one-sided nature of the PGAS programming model while leveraging features specific to OpenSHMEM. Through a working implementation with the 1-D Jacobi code, we show that the approach is scalable and provides considerable savings in computational resources.
Hao, P., Pophale, S., Shamis, P., Curtis, T., & Chapman, B. (2015). Check-pointing approach for fault tolerance in OpenSHMEM. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9397, pp. 36–52). Springer Verlag. https://doi.org/10.1007/978-3-319-26428-8_3