On the viability of checkpoint compression for extreme scale fault tolerance



Abstract

The increasing size and complexity of high-performance computing systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future-generation systems. In this work, we explore the feasibility of checkpoint data compression to reduce checkpoint commit latency and storage overheads. Leveraging a simple model for checkpoint compression viability, we conclude that checkpoint data compression should be considered as a part of a scalable checkpoint/restart solution, and we discuss additional scenarios and improvements that may make checkpoint data compression even more viable. © 2012 Springer-Verlag Berlin Heidelberg.

Citation (APA)

Ibtesham, D., Arnold, D., Ferreira, K. B., & Bridges, P. G. (2012). On the viability of checkpoint compression for extreme scale fault tolerance. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7156 LNCS, pp. 302–311). Springer Verlag. https://doi.org/10.1007/978-3-642-29740-3_34
