This paper proposes a heuristic to improve the analysis of supercomputer error logs. The heuristic estimates the measurement error induced by clustering error events and uses that estimate to drive the analysis, with the goal of reducing clustering-induced errors and quantifying how much they affect the measurements. The heuristic is validated against 40 synthetic datasets, covering systems ranging from 16k to 256k nodes under different failure assumptions. We show that i) to accurately analyze the complex failure behavior of large computing systems, multiple time windows need to be adopted at the granularity of node subsystems, e.g., memory and I/O, and ii) for large systems, the classical single-time-window analysis can overestimate the MTBF by more than 150%, while the proposed heuristic can decrease the measurement error by one order of magnitude. © 2013 Springer-Verlag.
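To make the role of the time window concrete, the following sketch (not the paper's implementation; log timestamps and the window value are hypothetical) shows the classic single-time-window coalescence the abstract refers to: events closer than the window are merged into one failure cluster, so a too-large window merges distinct failures, lowers the failure count, and inflates the MTBF estimate.

```python
def coalesce(timestamps, window):
    """Merge events closer than `window` seconds into one failure cluster."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= window:
            clusters[-1].append(t)   # within window: same failure
        else:
            clusters.append([t])     # gap exceeds window: new failure
    return clusters

def mtbf(timestamps, window):
    """Mean time between the start times of coalesced failure clusters."""
    starts = [c[0] for c in coalesce(timestamps, window)]
    if len(starts) < 2:
        return float("inf")
    return (starts[-1] - starts[0]) / (len(starts) - 1)

# Hypothetical log: two bursts of error events, one hour apart (seconds).
log = [0, 5, 8, 3600, 3605]
print(len(coalesce(log, 60)))    # 60 s window -> 2 failures
print(len(coalesce(log, 7200)))  # oversized window -> 1 failure (merged)
```

With the 60 s window the two bursts are correctly seen as separate failures; the oversized window collapses them into one, which is the kind of overestimation the multiple-window heuristic is designed to detect and bound.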
CITATION STYLE
Di Martino, C. (2013). One size does not fit all: Clustering supercomputer failures using a multiple time window approach. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7905 LNCS, pp. 302–316). https://doi.org/10.1007/978-3-642-38750-0_23