A case for adaptive redundancy for HPC resilience

Abstract

Redundancy in both space and time has been widely used to detect, and in some cases correct, errors in High Performance Computing (HPC) systems. With the HPC community seeking exascale-class supercomputers by the end of the decade, unrealistic expectations for correct system behavior will result in exorbitant costs in terms of performance lost and energy expended. Resilience strategies will need to strike a balance between fault coverage and the overheads incurred. In this work, we propose an adaptive approach that combines application-level knowledge with runtime inference about the fault tolerance state of the system to dynamically enable redundant multithreading (RMT). Our approach is based on simple programming language extensions, tightly integrated with a compiler infrastructure and a runtime framework that together make it possible to manage the performance overheads of redundant computation. © 2014 Springer-Verlag Berlin Heidelberg.
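
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of enabling redundant execution adaptively. Every name in it (`should_replicate`, `run_adaptive`, `critical`, `observed_fault_rate`, `threshold`) is hypothetical and does not come from the paper: the `critical` flag stands in for application-level knowledge, and the fault-rate comparison stands in for runtime inference about the system's fault tolerance state.

```python
from concurrent.futures import ThreadPoolExecutor


class RedundancyMismatch(RuntimeError):
    """Raised when the two redundant executions disagree."""


def should_replicate(critical, observed_fault_rate, threshold):
    # Hypothetical policy (an assumption, not the paper's): replicate only
    # code the programmer marked critical, and only while the observed
    # fault rate is at or above the threshold.
    return critical and observed_fault_rate >= threshold


def run_adaptive(fn, *args, critical=False, observed_fault_rate=0.0,
                 threshold=1e-3):
    """Run fn(*args) once, or twice with comparison if redundancy is enabled.

    Returns (result, replicated), where replicated says whether the
    redundant path was taken.
    """
    if not should_replicate(critical, observed_fault_rate, threshold):
        return fn(*args), False

    # Redundant path: execute the same computation in two threads and
    # compare the results to detect a silent error.
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(fn, *args)
        second = pool.submit(fn, *args)
        a, b = first.result(), second.result()

    if a != b:
        raise RedundancyMismatch("redundant executions produced different results")
    return a, True
```

In this sketch the cost of duplication is paid only when both conditions hold, which mirrors the trade-off between fault coverage and overhead that the abstract describes. The sketch detects a mismatch but does not attempt correction, and it assumes `fn` is deterministic and free of side effects.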

Cite

Hukerikar, S., Diniz, P. C., & Lucas, R. F. (2014). A case for adaptive redundancy for HPC resilience. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8374 LNCS, pp. 690–697). Springer Verlag. https://doi.org/10.1007/978-3-642-54420-0_67