How Do ML Jobs Fail in Datacenters? Analysis of a Long-Term Dataset from an HPC Cluster

Xiaoyu Chu; Sacheendra Talluri; Laurens Versluis; Alexandru Iosup

Conference Proceedings, Open Access

ICPE 2023 - Companion of the 2023 ACM/SPEC International Conference on Performance Engineering (2023) 263-268

DOI: 10.1145/3578245.3584726

Abstract

Reliable job execution is important in High Performance Computing (HPC) clusters. Understanding the failure distribution and failure patterns of jobs helps HPC cluster managers design better systems and helps users design fault-tolerant systems. Machine learning is an increasingly popular workload on HPC clusters. However, there is little information on the failure characteristics of machine learning jobs on HPC clusters, or on how they differ from those of the workloads such clusters were previously used for. The goal of our work is to improve the understanding of machine learning job failures in HPC clusters. We collect and analyze job data spanning the whole of 2022 and covering over 2 million jobs. We analyze basic statistical characteristics, the time pattern of failures, the resource waste caused by failures, and their autocorrelation. Among our findings, machine learning jobs fail at a higher rate than non-ML jobs and waste much more CPU time per job when they fail.

Cite

Chu, X., Talluri, S., Versluis, L., & Iosup, A. (2023). How Do ML Jobs Fail in Datacenters? Analysis of a Long-Term Dataset from an HPC Cluster. In ICPE 2023 - Companion of the 2023 ACM/SPEC International Conference on Performance Engineering (pp. 263–268). Association for Computing Machinery, Inc. https://doi.org/10.1145/3578245.3584726
