Abstract
Failure analysis is necessary to clarify the root cause of a failure, predict when the next failure may occur, and improve the performance and reliability of a system. However, analyzing and interpreting failure data is difficult, especially for complex systems: the data are usually represented using many attributes and are sometimes inconsistent and ambiguous. In this paper, we present a scalable approach for the analysis and interpretation of failure data of high-performance computing systems. The approach employs rough sets theory (RST) for this task. Applying RST to a large publicly available set of failure data highlights the main attributes responsible for the root cause of a failure. RST is also used to analyze other failure characteristics, such as time between failures, repair times, workload running on a failed node, and failure category. Experimental results show the scalability of the presented approach and its ability to reveal dependencies among different failure characteristics.
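
To make the idea of attribute dependency concrete, the following is a minimal sketch of the standard rough-set dependency degree, the fraction of records whose condition-attribute values determine the decision attribute unambiguously. It is not the paper's implementation; the function names, attribute names, and toy records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into indiscernibility classes over `attrs`."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, cond, dec):
    """Dependency degree: |positive region of `dec` given `cond`| / |rows|."""
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, cond):
        # A block belongs to the positive region only when every row in it
        # has the same decision value (the block is consistent).
        if len({rows[i][dec] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


# Hypothetical toy failure records; NOT taken from the paper's data set.
records = [
    {"workload": "compute", "node_type": "A", "root_cause": "hardware"},
    {"workload": "compute", "node_type": "A", "root_cause": "hardware"},
    {"workload": "graphics", "node_type": "A", "root_cause": "software"},
    {"workload": "graphics", "node_type": "B", "root_cause": "software"},
    {"workload": "compute", "node_type": "B", "root_cause": "network"},
    {"workload": "compute", "node_type": "B", "root_cause": "hardware"},
]

if __name__ == "__main__":
    for cond in (["workload"], ["node_type"], ["workload", "node_type"]):
        print(cond, dependency(records, cond, "root_cause"))
```

A higher value means the chosen attributes explain the decision attribute for more records. Inconsistent records, those that agree on every condition attribute but differ in the decision, are excluded from the positive region rather than discarded, which is how rough sets accommodate ambiguous data.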
Shawky, D. (2014). Scalable approach to failure analysis of high-performance computing systems. ETRI Journal, 36(6), 1023–1031. https://doi.org/10.4218/etrij.14.0113.1133