Early Termination of Failed HPC Jobs Through Machine and Deep Learning

Michał Zasadziński; Victor Muntés-Mulero; Marc Solé; David Carrera; Thomas Ludwig

Conference Proceedings (Open Access)

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2018), vol. 11014 LNCS, pp. 163–177

DOI: 10.1007/978-3-319-96983-1_12

Abstract

Failed jobs in a supercomputer not only waste CPU time and energy but also reduce the work efficiency of users. Mining data collected during the operation of data centers helps to find patterns that explain failures and can be used to predict them. Automating system reactions, e.g., early termination of jobs, when software failures are predicted not only increases availability and reduces operating cost but also frees administrators’ and users’ time. In this paper, we explore a unique dataset containing the topology, operation metrics, and job scheduler history from the petascale Mistral supercomputer. We use decision trees to extract the system features most relevant to the final state of a job. Then, we successfully train a neural net to predict job evolution based on power time series of nodes. Finally, we evaluate the effect on CPU time saving for static and dynamic job termination policies.

Citation (APA)

Zasadziński, M., Muntés-Mulero, V., Solé, M., Carrera, D., & Ludwig, T. (2018). Early Termination of Failed HPC Jobs Through Machine and Deep Learning. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11014 LNCS, pp. 163–177). Springer Verlag. https://doi.org/10.1007/978-3-319-96983-1_12
