The increasing size and complexity of high-performance computing systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future-generation systems. In this work, we explore the feasibility of checkpoint data compression to reduce checkpoint commit latency and storage overheads. Leveraging a simple model for checkpoint compression viability, we conclude that checkpoint data compression should be considered as part of a scalable checkpoint/restart solution, and we discuss additional scenarios and improvements that may make checkpoint data compression even more viable. © 2012 Springer-Verlag Berlin Heidelberg.
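The viability model referenced above can be illustrated with a back-of-the-envelope calculation. The following sketch is an assumption for illustration (not the paper's exact model, and all rates are hypothetical): compression pays off when the time to compress a checkpoint plus the time to write the smaller compressed image is less than the time to write the raw checkpoint.

```python
def compression_viable(compress_rate_mbs, write_rate_mbs, compression_factor):
    """Return True if compressing before committing reduces total latency.

    compress_rate_mbs: throughput of the compressor (MB/s)
    write_rate_mbs: throughput of the checkpoint storage path (MB/s)
    compression_factor: raw size / compressed size (e.g. 2.0 for 2x)
    """
    t_raw = 1.0 / write_rate_mbs                           # per-MB raw write time
    t_comp = 1.0 / compress_rate_mbs                       # per-MB compression time
    t_write = (1.0 / compression_factor) / write_rate_mbs  # per-MB compressed write time
    return t_comp + t_write < t_raw

# Hypothetical example: a 500 MB/s compressor, a 100 MB/s storage path,
# and a 2x compression ratio -> 2 ms + 5 ms < 10 ms per MB, so viable.
print(compression_viable(500.0, 100.0, 2.0))  # True
```

Under this simple condition, a slow compressor (e.g. 50 MB/s against the same storage path) would not be viable, since compressing would cost more time than it saves on the write.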
CITATION STYLE
Ibtesham, D., Arnold, D., Ferreira, K. B., & Bridges, P. G. (2012). On the viability of checkpoint compression for extreme scale fault tolerance. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7156 LNCS, pp. 302–311). Springer Verlag. https://doi.org/10.1007/978-3-642-29740-3_34