Graph Neural Networks for Anomaly Anticipation in HPC Systems

Abstract

In this paper, we explore the use of Graph Neural Networks (GNNs) for anomaly anticipation in high-performance computing (HPC) systems. We propose a GNN-based approach that leverages the structure of the HPC system (particularly the physical proximity of the compute nodes) to facilitate anomaly anticipation. We frame the task of forecasting the availability of the compute nodes as a supervised prediction problem: the GNN predicts the probability that a compute node will fail within a fixed-length future window. We empirically demonstrate the viability of the GNN-based approach by conducting experiments on the production Tier-0 supercomputer hosted at the CINECA datacenter facilities, the largest Italian provider of HPC. The results are promising: anomaly detection capabilities are on par with other techniques from the literature (with a special focus on those tested on real production data) and, more significantly, the approach achieves strong results in anomaly prediction.
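
The supervised framing described in the abstract can be illustrated with a minimal sketch. It assumes a per-node binary anomaly time series and builds, for each time step, a label saying whether an anomaly occurs within the next `window` steps. It also builds a simple edge list that links physically adjacent nodes. The function names, the window length, and the line-graph notion of proximity are assumptions made for illustration; the abstract does not specify these details.

```python
from typing import List, Tuple


def future_window_labels(anomalies: List[int], window: int) -> List[int]:
    """Return labels[t] = 1 if an anomaly occurs in steps t+1 .. t+window.

    `anomalies` is a hypothetical per-node series of 0/1 flags, one per
    time step. The label at step t looks only at the future, not at t itself.
    """
    labels = [0] * len(anomalies)
    next_anomaly = None  # index of the nearest anomaly strictly after t
    # Scan backwards so the nearest future anomaly is always known.
    for t in range(len(anomalies) - 1, -1, -1):
        if next_anomaly is not None and next_anomaly - t <= window:
            labels[t] = 1
        if anomalies[t]:
            next_anomaly = t
    return labels


def proximity_edges(num_nodes: int) -> List[Tuple[int, int]]:
    """Connect each node to its immediate neighbour in a linear ordering.

    Assumption: nodes are indexed by physical position (for example, slot
    order within a rack), so consecutive indices are physically adjacent.
    """
    return [(i, i + 1) for i in range(num_nodes - 1)]


if __name__ == "__main__":
    series = [0, 0, 0, 1, 0, 0]
    print(future_window_labels(series, window=2))
    print(proximity_edges(4))
```

Under these assumptions, the labels serve as per-node prediction targets and the edge list defines which nodes exchange information in the graph model.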

Cite

Molan, M., Ahmed Khan, J., Borghesi, A., & Bartolini, A. (2023). Graph Neural Networks for Anomaly Anticipation in HPC Systems. In ICPE 2023 - Companion of the 2023 ACM/SPEC International Conference on Performance Engineering (pp. 239–244). Association for Computing Machinery, Inc. https://doi.org/10.1145/3578245.3585335
