One size does not fit all: Clustering supercomputer failures using a multiple time window approach

12Citations
Citations of this article
8Readers
Mendeley users who have this article in their library.
Get full text

Abstract

This paper proposes a heuristic to improve the analysis of supercomputers error logs. The heuristic is able to estimate the error on the measurement induced by the clustering process of error events and consequently drive the analysis. The goal is to reduce errors induced by the clustering and be able to estimate how much they affect the measurements. The heuristic is validated against 40 synthetic datasets, for different systems ranging from 16k to 256k nodes under different failure assumptions. We show that i) to accurately analyze the complex failure behavior of large computing systems, multiple time windows need to be adopted at the granularity of node subsystems, e.g. memory and I/O, and ii) for large systems, the classical single time window analysis can overestimate the MTBF by more than 150%, while the proposed heuristic can decrease the measurement error of one order of magnitude. © 2013 Springer-Verlag.

Cite

CITATION STYLE

APA

Di Martino, C. (2013). One size does not fit all: Clustering supercomputer failures using a multiple time window approach. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7905 LNCS, pp. 302–316). https://doi.org/10.1007/978-3-642-38750-0_23

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free