McrEngine: A scalable checkpointing system using data-aware aggregation and compression

Tanzima Zerin Islam; Kathryn Mohror; Saurabh Bagchi; Adam Moody; Bronis R. De Supinski; Rudolf Eigenmann

Conference Proceedings, Open Access

Scientific Programming (2013) 21(3-4) 149-163

DOI: 10.1155/2013/341672

Abstract

High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely-used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression. © 2013 - IOS Press and the authors. All rights reserved.
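
The contrast the abstract draws between data-aware aggregation and simple concatenation can be illustrated with a minimal sketch. It is not mcrEngine's implementation: it assumes each checkpoint is a plain dictionary of variable name to raw bytes (standing in for the variable metadata that HDF5 or netCDF would provide), uses zlib as a stand-in compressor, and all function names (`aggregate_data_aware`, `aggregate_naive`, `restore`, `make_checkpoint`) are hypothetical.

```python
import struct
import zlib


def aggregate_data_aware(checkpoints):
    """Merge per-process checkpoints variable by variable, then compress.

    `checkpoints` holds one dict per process, mapping a variable name to
    its raw bytes. Assumption: every process writes the same variables.
    """
    names = sorted(checkpoints[0])
    # Layout index: per variable, the byte length contributed by each process.
    layout = [(name, [len(ckpt[name]) for ckpt in checkpoints]) for name in names]
    # Place the same variable from every process side by side.
    merged = b"".join(ckpt[name] for name in names for ckpt in checkpoints)
    return layout, zlib.compress(merged)


def aggregate_naive(checkpoints):
    """Baseline: concatenate whole checkpoints process by process, then compress."""
    merged = b"".join(ckpt[name] for ckpt in checkpoints for name in sorted(ckpt))
    return zlib.compress(merged)


def restore(layout, compressed):
    """Rebuild the per-process checkpoints from a data-aware aggregate."""
    merged = zlib.decompress(compressed)
    nprocs = len(layout[0][1])
    checkpoints = [{} for _ in range(nprocs)]
    offset = 0
    for name, lengths in layout:
        for rank, length in enumerate(lengths):
            checkpoints[rank][name] = merged[offset:offset + length]
            offset += length
    return checkpoints


def make_checkpoint(rank, n=1024):
    """Synthetic checkpoint for one process (illustrative data only)."""
    return {
        "cell_id": struct.pack(f"<{n}i", *range(rank * n, (rank + 1) * n)),
        "temperature": struct.pack(f"<{n}d", *(300.0 + 0.5 * (i % 7) for i in range(n))),
    }
```

Calling `aggregate_data_aware([make_checkpoint(r) for r in range(4)])` returns the layout index and the compressed aggregate, and `restore` inverts it on restart. Comparing `len()` of the two compressed outputs on real checkpoint data is one way to see whether grouping similar variables helps; the synthetic data here says nothing about the gains reported in the paper.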

Cite

Islam, T. Z., Mohror, K., Bagchi, S., Moody, A., De Supinski, B. R., & Eigenmann, R. (2013). McrEngine: A scalable checkpointing system using data-aware aggregation and compression. In Scientific Programming (Vol. 21, pp. 149–163). Hindawi Limited. https://doi.org/10.1155/2013/341672
