How Do ML Jobs Fail in Datacenters? Analysis of a Long-Term Dataset from an HPC Cluster


Abstract

Reliable job execution is important in High-Performance Computing (HPC) clusters. Understanding the distribution and patterns of job failures helps HPC cluster operators design better systems, and helps users build fault-tolerant applications. Machine learning (ML) is an increasingly popular workload for HPC clusters, yet little is known about the failure characteristics of ML jobs on such clusters, or how they differ from those of the workloads these clusters previously ran. The goal of our work is to improve the understanding of ML job failures in HPC clusters. We collect and analyze job data spanning the whole of 2022, covering over 2 million jobs. We analyze basic statistical characteristics of failures, their time patterns, the resource waste they cause, and their autocorrelation. Among our findings: ML jobs fail at a higher rate than non-ML jobs, and waste much more CPU-time per job when they fail.
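The metrics named in the abstract (failure rate, CPU-time wasted by failed jobs, and autocorrelation of failures) can be sketched over a job log as below. This is a minimal illustration with an invented record schema (`is_ml`, `failed`, `cpu_seconds`); it is not the paper's dataset or methodology.

```python
from statistics import mean

# Hypothetical job log entries; the field names are illustrative,
# not the schema used in the paper's dataset.
jobs = [
    {"is_ml": True,  "failed": True,  "cpu_seconds": 7200},
    {"is_ml": True,  "failed": False, "cpu_seconds": 3600},
    {"is_ml": False, "failed": True,  "cpu_seconds": 600},
    {"is_ml": False, "failed": False, "cpu_seconds": 1200},
    {"is_ml": False, "failed": False, "cpu_seconds": 800},
]

def failure_rate(jobs):
    """Fraction of jobs that terminated in failure."""
    return sum(j["failed"] for j in jobs) / len(jobs)

def wasted_cpu_per_failed_job(jobs):
    """Mean CPU-time (seconds) consumed by failed jobs, i.e. work lost per failure."""
    failed = [j["cpu_seconds"] for j in jobs if j["failed"]]
    return mean(failed) if failed else 0.0

def autocorr(xs, lag=1):
    """Sample autocorrelation at the given lag, e.g. of daily failure counts."""
    mu = mean(xs)
    num = sum((xs[i] - mu) * (xs[i + lag] - mu) for i in range(len(xs) - lag))
    den = sum((x - mu) ** 2 for x in xs)
    return num / den

ml_jobs = [j for j in jobs if j["is_ml"]]
other   = [j for j in jobs if not j["is_ml"]]
print(failure_rate(ml_jobs), failure_rate(other))   # compare ML vs non-ML failure rates
print(wasted_cpu_per_failed_job(ml_jobs))           # CPU-time lost per failed ML job
print(autocorr([3, 5, 2, 6, 1, 7], lag=1))          # e.g. daily failure counts
```

With real data, these aggregates would be computed per job class over the full year of logs; alternating high/low failure counts yield a negative lag-1 autocorrelation, while bursty failures yield a positive one.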

Citation (APA)

Chu, X., Talluri, S., Versluis, L., & Iosup, A. (2023). How Do ML Jobs Fail in Datacenters? Analysis of a Long-Term Dataset from an HPC Cluster. In ICPE 2023 - Companion of the 2023 ACM/SPEC International Conference on Performance Engineering (pp. 263–268). Association for Computing Machinery, Inc. https://doi.org/10.1145/3578245.3584726
