Abstract
In this paper, we explore the use of Graph Neural Networks (GNNs) for anomaly anticipation in high-performance computing (HPC) systems. We propose a GNN-based approach that leverages the structure of the HPC system, in particular the physical proximity of the compute nodes, to facilitate anomaly anticipation. We frame the task of forecasting the availability of the compute nodes as a supervised prediction problem: the GNN predicts the probability that a compute node will fail within a fixed-length future window. We empirically demonstrate the viability of the GNN-based approach through experiments on the production Tier-0 supercomputer hosted at the datacenter facilities of CINECA, the largest Italian provider of HPC. The results are promising: the approach achieves anomaly detection capabilities on par with other techniques from the literature (with a special focus on those tested on real, production data) and, more significantly, strong results in anomaly prediction.
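The supervised framing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the window length, the graph construction, or the network architecture. The function names (`future_window_labels`, `proximity_edges`, `mean_aggregate`), the rack layout, and the use of a single scalar feature per node are all assumptions made for illustration.

```python
def future_window_labels(anomalies, window):
    """Build prediction targets from a binary anomaly series of one node.

    Label t is 1 if any anomaly occurs in steps t+1 .. t+window, else 0.
    The last `window` steps are dropped because their future is unknown.
    """
    labels = []
    for t in range(len(anomalies) - window):
        future = anomalies[t + 1 : t + 1 + window]
        labels.append(1 if any(future) else 0)
    return labels


def proximity_edges(nodes_per_rack, num_racks):
    """Hypothetical proximity graph: link physically adjacent nodes in a rack."""
    edges = []
    for rack in range(num_racks):
        base = rack * nodes_per_rack
        for i in range(nodes_per_rack - 1):
            edges.append((base + i, base + i + 1))
    return edges


def mean_aggregate(features, edges):
    """One neighbourhood-averaging step over an undirected graph.

    Each node's new value is the mean of its own value and its neighbours'
    values, which is the aggregation idea behind a simple GNN layer
    (learned weights and non-linearities are omitted here).
    """
    neighbours = {i: [i] for i in range(len(features))}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    return [
        sum(features[j] for j in neighbours[i]) / len(neighbours[i])
        for i in range(len(features))
    ]


if __name__ == "__main__":
    # One node's anomaly history and a window of two future steps.
    print(future_window_labels([0, 0, 0, 1, 0, 0], window=2))
    # Two racks of three nodes each, linked by adjacency within the rack.
    edges = proximity_edges(nodes_per_rack=3, num_racks=2)
    print(edges)
    # A high reading on node 0 spreads to its rack neighbour only.
    print(mean_aggregate([1.0, 0.0, 0.0, 0.0, 0.0, 0.0], edges))
```

In a full model, the aggregated features would feed a trained classifier that outputs a failure probability per node, with the future-window labels as training targets.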
Cite
Molan, M., Ahmed Khan, J., Borghesi, A., & Bartolini, A. (2023). Graph Neural Networks for Anomaly Anticipation in HPC Systems. In ICPE 2023 - Companion of the 2023 ACM/SPEC International Conference on Performance Engineering (pp. 239–244). Association for Computing Machinery, Inc. https://doi.org/10.1145/3578245.3585335