Scalable approach to failure analysis of high-performance computing systems

Doaa Shawky

Journal Article, Open Access

ETRI Journal (2014) 36(6) 1023-1031

DOI: 10.4218/etrij.14.0113.1133

Abstract

Failure analysis is necessary to clarify the root cause of a failure, predict the next time a failure may occur, and improve the performance and reliability of a system. However, it is not an easy task to analyze and interpret failure data, especially for complex systems. Usually, these data are represented using many attributes, and sometimes they are inconsistent and ambiguous. In this paper, we present a scalable approach for the analysis and interpretation of failure data of high-performance computing systems. The approach employs rough sets theory (RST) for this task. The application of RST to a large publicly available set of failure data highlights the main attributes responsible for the root cause of a failure. In addition, it is used to analyze other failure characteristics, such as time between failures, repair times, workload running on a failed node, and failure category. Experimental results show the scalability of the presented approach and its ability to reveal dependencies among different failure characteristics.
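The abstract does not spell out how rough sets theory exposes dependencies among failure attributes. As a minimal sketch of the standard RST notions only (lower and upper approximations, and the dependency degree of a decision attribute on a set of condition attributes), the following Python may help. It is not the paper's implementation; the attribute names (`workload`, `repair`, `cause`) and the four toy records are hypothetical and are not taken from the failure data set the paper analyzes.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes: rows that share the same
    values on every attribute in attrs are indiscernible."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(i)
    return list(classes.values())


def lower_approximation(rows, attrs, target):
    """Rows that certainly belong to target: the union of equivalence
    classes fully contained in target."""
    result = set()
    for block in partition(rows, attrs):
        if block <= target:
            result |= block
    return result


def upper_approximation(rows, attrs, target):
    """Rows that possibly belong to target: the union of equivalence
    classes that intersect target."""
    result = set()
    for block in partition(rows, attrs):
        if block & target:
            result |= block
    return result


def dependency_degree(rows, condition_attrs, decision_attr):
    """Fraction of rows whose decision value is determined unambiguously
    by the condition attributes (size of the positive region / all rows)."""
    positive = set()
    for decision_block in partition(rows, [decision_attr]):
        positive |= lower_approximation(rows, condition_attrs, decision_block)
    return len(positive) / len(rows)


# Hypothetical toy failure records (illustrative only).
records = [
    {"workload": "compute", "repair": "short", "cause": "hardware"},
    {"workload": "compute", "repair": "short", "cause": "software"},
    {"workload": "graphics", "repair": "long", "cause": "hardware"},
    {"workload": "frontend", "repair": "short", "cause": "network"},
]

hardware_rows = {i for i, r in enumerate(records) if r["cause"] == "hardware"}
certain = lower_approximation(records, ["workload", "repair"], hardware_rows)
possible = upper_approximation(records, ["workload", "repair"], hardware_rows)
degree = dependency_degree(records, ["workload", "repair"], "cause")
```

In this toy table the first two records agree on both condition attributes but differ in cause, so neither can be classified with certainty. A dependency degree below 1 signals exactly this kind of inconsistency, which is the situation the abstract describes as inconsistent and ambiguous failure data.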

Cite

Shawky, D. (2014). Scalable approach to failure analysis of high-performance computing systems. ETRI Journal, 36(6), 1023–1031. https://doi.org/10.4218/etrij.14.0113.1133
