Record and replay techniques for HPC systems: A survey

Dylan Chapp; Kento Sato; Dong H. Ahn; Michela Taufer

Journal Article, Open Access

Supercomputing Frontiers and Innovations (2018) 5(1) 11-30

DOI: 10.14529/jsfi180102

Abstract

Record-and-replay techniques provide the ability to record executions of nondeterministic applications and re-execute them identically. These techniques find use in debugging, reproducibility, and fault tolerance, especially in the presence of nondeterministic factors such as message races. Record-and-replay techniques are highly diverse in terms of the fidelity of replay they provide, the assumptions they make about the recorded application, the programming models they target, and the runtime overheads they impose. In the high-performance computing (HPC) environment, all of these factors must be considered in concert, which presents additional implementation challenges. In this manuscript, we survey record-and-replay techniques in terms of the programming models they target and the workloads on which they were evaluated, providing a categorization of these techniques that benefits application developers and researchers targeting exascale challenges. The survey answers three questions: What are the gaps in the existing space of record-and-replay techniques? What is the roadmap to widespread use of record-and-replay on production-scale HPC workloads? And what are the critical open problems that must be addressed to make record-and-replay viable at exascale?

Cite

Chapp, D., Sato, K., Ahn, D. H., & Taufer, M. (2018). Record and replay techniques for HPC systems: A survey. Supercomputing Frontiers and Innovations, 5(1), 11–30. https://doi.org/10.14529/jsfi180102
