Pattern-based modeling of high-performance computing resilience

1Citations
Citations of this article
5Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

With the growing scale and complexity of high-performance computing (HPC) systems, resilience solutions that ensure continuity of service despite frequent errors and component failures must be methodically designed to balance the reliability requirements with the overheads to performance and power. Design patterns enable a structured approach to the development of resilience solutions, providing hardware and software designers with the building block elements for the rapid development of novel solutions and for adapting existing technologies for emerging, extreme-scale HPC environments. In this paper, we develop analytical models that enable designers to evaluate the reliability and performance characteristics of the design patterns. These models are particularly useful in building a unified framework that analyzes and compares various resilience solutions built using a combination of patterns.

Cite

CITATION STYLE

APA

Hukerikar, S., & Engelmann, C. (2018). Pattern-based modeling of high-performance computing resilience. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10659 LNCS, pp. 557–568). Springer Verlag. https://doi.org/10.1007/978-3-319-75178-8_45

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free