Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods currently in use, such as rollback recovery, are poorly suited to many of the expected errors, such as DRAM failures. As a result, applications will need to address these resilience challenges themselves to use future systems effectively. In this paper, we describe work on a cross-layer application/OS framework for handling uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results. © 2012 Springer-Verlag Berlin Heidelberg.
CITATION STYLE
Bridges, P. G., Hoemmen, M., Ferreira, K. B., Heroux, M. A., Soltero, P., & Brightwell, R. (2012). Cooperative application/OS DRAM fault recovery. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7156 LNCS, pp. 241–250). Springer Verlag. https://doi.org/10.1007/978-3-642-29740-3_28