McrEngine: A scalable checkpointing system using data-aware aggregation and compression

Abstract

High-performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression. © 2013 - IOS Press and the authors. All rights reserved.
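The contrast the abstract draws between simple concatenation and data-aware aggregation can be illustrated with a small sketch. This is not mcrEngine's implementation: it assumes synthetic checkpoints held as Python dictionaries instead of HDF5 or netCDF files, uses zlib as a stand-in compressor, and every name in it (`make_checkpoints`, `pack`, `unpack`, the variable names) is hypothetical. It only shows the general idea that placing the same variable from different processes next to each other can expose redundancy that a compressor with a limited window (32 KiB for DEFLATE) would otherwise miss.

```python
import json
import random
import struct
import zlib


def make_checkpoints(num_procs=4, var_size=20_000, seed=0):
    """Build synthetic per-process checkpoints as {variable name: bytes}.

    Assumption for illustration: each variable is nearly identical across
    processes (a shared random base with a few bytes perturbed per process).
    """
    rng = random.Random(seed)
    names = ["temperature", "pressure", "density"]  # hypothetical variables
    base = {n: bytearray(rng.getrandbits(8) for _ in range(var_size))
            for n in names}
    checkpoints = []
    for _ in range(num_procs):
        ckpt = {}
        for n in names:
            data = bytearray(base[n])
            for _ in range(50):  # perturb 50 positions per process
                data[rng.randrange(var_size)] = rng.getrandbits(8)
            ckpt[n] = bytes(data)
        checkpoints.append(ckpt)
    return checkpoints


def pack(checkpoints, group_by_variable):
    """Aggregate all checkpoints into one compressed blob.

    group_by_variable=False: simple concatenation, process by process.
    group_by_variable=True:  same-named variables from all processes are
                             laid out adjacently before compression.
    """
    names = list(checkpoints[0])
    procs = range(len(checkpoints))
    if group_by_variable:
        order = [(p, n) for n in names for p in procs]
    else:
        order = [(p, n) for p in procs for n in names]
    # The index records where each piece belongs so restart can undo the layout.
    index = [[p, n, len(checkpoints[p][n])] for p, n in order]
    header = json.dumps(index).encode()
    payload = b"".join(checkpoints[p][n] for p, n in order)
    return zlib.compress(struct.pack("<I", len(header)) + header + payload)


def unpack(blob):
    """Restart path: decompress and rebuild the per-process checkpoints."""
    raw = zlib.decompress(blob)
    (hlen,) = struct.unpack_from("<I", raw, 0)
    index = json.loads(raw[4:4 + hlen])
    num_procs = 1 + max(p for p, _, _ in index)
    checkpoints = [{} for _ in range(num_procs)]
    offset = 4 + hlen
    for p, n, size in index:
        checkpoints[p][n] = raw[offset:offset + size]
        offset += size
    return checkpoints


if __name__ == "__main__":
    ckpts = make_checkpoints()
    plain = pack(ckpts, group_by_variable=False)
    aware = pack(ckpts, group_by_variable=True)
    print("concatenate, then compress:", len(plain), "bytes")
    print("group by variable, then compress:", len(aware), "bytes")
```

In this synthetic setup each process's checkpoint is about 60 KB, so in process order the matching variables sit farther apart than the DEFLATE window and are compressed almost independently. Grouping by variable puts them 20 KB apart, within the window. Real checkpoints will behave differently, and the figures reported in the abstract come from the authors' own evaluation, not from this sketch.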

Cite

Islam, T. Z., Mohror, K., Bagchi, S., Moody, A., De Supinski, B. R., & Eigenmann, R. (2013). McrEngine: A scalable checkpointing system using data-aware aggregation and compression. In Scientific Programming (Vol. 21, pp. 149–163). Hindawi Limited. https://doi.org/10.1155/2013/341672
