Towards Practical Machine Learning Frameworks for Performance Diagnostics in Supercomputers

Abstract

Supercomputers are highly sophisticated computing systems designed to handle complex and computationally intensive tasks. Despite their tremendous efficiency, performance problems still arise due to various factors, such as load imbalance, network congestion, and software-related issues. Monitoring frameworks are commonly used to collect telemetry data, which helps to identify potential issues before they become critical and to debug problems once they occur. However, telemetry analytics is essentially a big data problem, and it is becoming increasingly difficult to manage because terabytes of telemetry data are collected daily. Owing to the limitations of manual analysis, recent analytics frameworks use automated machine learning (ML) to identify patterns and anomalies in this data, enabling system administrators and users to act quickly to resolve performance problems. This paper explores the benefits and challenges of ML-based frameworks that automate performance diagnostics, focusing in particular on labeled training data requirements and deployment challenges. We argue that ML-based frameworks can achieve desirable performance diagnosis results while reducing the need for large labeled data sets, and we demonstrate successful prototypes that are suitable for rapid deployment on real-world systems.

Cite

Aksar, B., Sencan, E., Schwaller, B., Leung, V. J., Brandt, J., Kulis, B., … Coskun, A. K. (2023). Towards Practical Machine Learning Frameworks for Performance Diagnostics in Supercomputers. In AI4Sys 2023 - Proceedings of the 1st Workshop on AI for Systems (pp. 1–6). Association for Computing Machinery, Inc. https://doi.org/10.1145/3588982.3603609