A case for adaptive redundancy for HPC resilience

Saurabh Hukerikar; Pedro C. Diniz; Robert F. Lucas

Conference Proceedings, Open Access


Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2014), Vol. 8374 LNCS, pp. 690-697

DOI: 10.1007/978-3-642-54420-0_67


Abstract

Redundancy in both space and time has been widely used to detect, and in some cases correct, errors in High Performance Computing (HPC) systems. With the HPC community seeking exascale-class supercomputers by the end of the decade, unrealistic expectations of correct system behavior will result in exorbitant costs in terms of lost performance and expended energy. Resilience strategies will need to strike a balance between fault coverage and the overheads incurred. In this work, we propose an adaptive approach that combines application-level knowledge with runtime inference about the fault tolerance state of the system to dynamically enable redundant multithreading (RMT). Our approach is based on simple programming language extensions, tightly integrated with a compiler infrastructure and a runtime framework that enables managing the performance overheads of redundant computation. © 2014 Springer-Verlag Berlin Heidelberg.
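The idea in the abstract, enabling redundant multithreading only when application knowledge and the runtime's view of system health both call for it, can be illustrated with a minimal sketch. This is not the authors' implementation: the paper describes language extensions, a compiler, and a runtime framework, none of which are shown here. Every name below (`should_duplicate`, `run_adaptive`, `robust_region`, `observed_fault_rate`, `threshold`) is a hypothetical stand-in, and the threshold policy is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class RedundancyMismatch(Exception):
    """Raised when the two redundant executions disagree."""


def should_duplicate(robust_region, observed_fault_rate, threshold):
    # Hypothetical policy (an assumption, not taken from the paper):
    # duplicate only code the programmer marked as needing protection,
    # and only while the runtime's fault-rate estimate is at or above
    # the threshold.
    return robust_region and observed_fault_rate >= threshold


def run_adaptive(fn, args, robust_region, observed_fault_rate, threshold=1e-3):
    """Run fn(*args) once, or twice with a comparison, per the policy.

    Returns (result, was_duplicated).
    """
    if not should_duplicate(robust_region, observed_fault_rate, threshold):
        # Cheap path: no redundancy, no overhead.
        return fn(*args), False

    # Redundant multithreading: the same computation on two threads.
    with ThreadPoolExecutor(max_workers=2) as pool:
        leading = pool.submit(fn, *args)
        trailing = pool.submit(fn, *args)
        a, b = leading.result(), trailing.result()

    if a != b:
        # Two copies can detect an error but cannot tell which is correct.
        raise RedundancyMismatch(f"outputs differ: {a!r} vs {b!r}")
    return a, True


def dot(xs, ys):
    # Small deterministic kernel used only to exercise the sketch.
    return sum(x * y for x, y in zip(xs, ys))
```

Calling `run_adaptive(dot, ([1, 2, 3], [4, 5, 6]), True, 1e-2)` takes the duplicated path, while the same call with a fault-rate estimate below the threshold, or with `robust_region=False`, runs the kernel once. Duplication with comparison only detects errors; correcting them would need a third copy or a re-execution step.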

Citation (APA)

Hukerikar, S., Diniz, P. C., & Lucas, R. F. (2014). A case for adaptive redundancy for HPC resilience. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8374 LNCS, pp. 690–697). Springer Verlag. https://doi.org/10.1007/978-3-642-54420-0_67
