The increasing size and complexity of high-performance computing systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future-generation systems. In this work, we explore the feasibility of checkpoint data compression to reduce checkpoint commit latency and storage overheads. Leveraging a simple model for checkpoint compression viability, we conclude that checkpoint data compression should be considered as part of a scalable checkpoint/restart solution, and we discuss additional scenarios and improvements that may make checkpoint data compression even more viable. © 2012 Springer-Verlag Berlin Heidelberg.
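The viability model referenced above can be illustrated with a back-of-the-envelope calculation. The following sketch is an assumption for illustration (not the paper's exact model, and all rates are hypothetical): compression pays off when the time to compress a checkpoint plus the time to write the smaller compressed image is less than the time to write the raw checkpoint.

```python
def compression_viable(compress_rate_mbs, write_rate_mbs, compression_factor):
    """Return True if compressing before committing reduces total latency.

    compress_rate_mbs: throughput of the compressor (MB/s)
    write_rate_mbs: throughput of the checkpoint storage path (MB/s)
    compression_factor: raw size / compressed size (e.g. 2.0 for 2x)
    """
    t_raw = 1.0 / write_rate_mbs                           # per-MB raw write time
    t_comp = 1.0 / compress_rate_mbs                       # per-MB compression time
    t_write = (1.0 / compression_factor) / write_rate_mbs  # per-MB compressed write time
    return t_comp + t_write < t_raw

# Hypothetical example: a 500 MB/s compressor, a 100 MB/s storage path,
# and a 2x compression ratio -> 2 ms + 5 ms < 10 ms per MB, so viable.
print(compression_viable(500.0, 100.0, 2.0))  # True
```

Under this simple condition, a slow compressor (e.g. 50 MB/s against the same storage path) would not be viable, since compressing would cost more time than it saves on the write.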
CITATION STYLE
Ibtesham, D., Arnold, D., Ferreira, K. B., & Bridges, P. G. (2012). On the viability of checkpoint compression for extreme scale fault tolerance. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7156 LNCS, pp. 302–311). Springer Verlag. https://doi.org/10.1007/978-3-642-29740-3_34