Towards a model to estimate the reliability of large-scale hybrid supercomputers


Abstract

Supercomputers are a fundamental tool for advancing our understanding of the universe. State-of-the-art scientific simulations, big data analyses, and machine learning workloads require high performance computing platforms. These infrastructures have recently grown through the addition of thousands of newly designed components, calling their resiliency into question. It is therefore crucial to solidify our knowledge of how supercomputers fail, and recent studies have highlighted the importance of characterizing failures on supercomputers. This paper aims to model the component failures of a supercomputer using Mixed Weibull distributions. The model is built from a real-life, multi-year failure record of a leadership-class supercomputer. Using several key observations from the data, we designed an analytical model that is robust enough to represent each of the main components of a supercomputer, yet flexible enough to allow the composition of the machine to be altered in order to predict the resilience of future or hypothetical systems.
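To make the idea of a Mixed Weibull model concrete, here is a minimal Python sketch of how the reliability (survival probability) of a component type could be written as a weighted sum of Weibull survival functions, and how per-component curves might be combined for a whole machine. It is an illustration only, not the paper's fitted model. The function names, the component types, and every parameter value are hypothetical, and the series-system combination (independent components, any single failure counts as a system failure) is an assumption that the abstract does not state.

```python
import math


def mixed_weibull_reliability(t, mixture):
    """Survival probability R(t) of a Weibull mixture.

    mixture is a list of (weight, shape, scale) tuples; weights must sum to 1.
    R(t) = sum_i weight_i * exp(-(t / scale_i) ** shape_i)
    """
    total_weight = sum(weight for weight, _, _ in mixture)
    if not math.isclose(total_weight, 1.0):
        raise ValueError("mixture weights must sum to 1")
    return sum(
        weight * math.exp(-((t / scale) ** shape))
        for weight, shape, scale in mixture
    )


def system_reliability(t, machine):
    """Probability that no component has failed by time t.

    machine is a list of (count, mixture) pairs, one per component type.
    ASSUMPTION: components fail independently and the system is a series
    system, so the per-component survival probabilities multiply.
    """
    reliability = 1.0
    for count, mixture in machine:
        reliability *= mixed_weibull_reliability(t, mixture) ** count
    return reliability


# Hypothetical parameters (hours), chosen only for illustration.
gpu = [(0.3, 0.7, 5_000.0), (0.7, 1.5, 60_000.0)]  # early-failure + wear-out modes
cpu = [(1.0, 1.0, 200_000.0)]                      # single exponential-like mode

# Hypothetical machine composition: 100 GPUs and 100 CPUs.
machine = [(100, gpu), (100, cpu)]

if __name__ == "__main__":
    for hours in (1, 24, 168):
        print(hours, system_reliability(hours, machine))
```

Changing the counts or mixtures in `machine` is one simple way to mimic what the abstract describes as altering the composition of the machine to study a future or hypothetical system.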

Cite

Rojas, E., Meneses, E., Jones, T., & Maxwell, D. (2020). Towards a model to estimate the reliability of large-scale hybrid supercomputers. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12247 LNCS, pp. 37–51). Springer. https://doi.org/10.1007/978-3-030-57675-2_3
