Cooperative application/OS DRAM fault recovery

20Citations
Citations of this article
18Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application / OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results. © 2012 Springer-Verlag Berlin Heidelberg.

Cite

CITATION STYLE

APA

Bridges, P. G., Hoemmen, M., Ferreira, K. B., Heroux, M. A., Soltero, P., & Brightwell, R. (2012). Cooperative application/OS DRAM fault recovery. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7156 LNCS, pp. 241–250). Springer Verlag. https://doi.org/10.1007/978-3-642-29740-3_28

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free