On-the-fly calculation of performance metrics with adaptive time resolution for HPC compute jobs

Konstantin Stefanov; Vadim Voevodin

Conference Proceedings

Communications in Computer and Information Science (2019) 965 609-619

DOI: 10.1007/978-3-030-05807-4_52

Abstract

Performance monitoring is a method for debugging performance issues in different types of applications. It uses various performance metrics obtained from the servers the application runs on, and may also use metrics produced by the application itself. The common approach to building performance monitoring systems is to store all the data in a database, then retrieve the data that correspond to a specific job and analyze that portion of the data. This approach works well when the data stream is not very large. For a large performance monitoring data stream, it incurs heavy I/O and imposes high requirements on the storage systems that process the data. In this paper we propose an adaptive on-the-fly approach to performance monitoring of High Performance Computing (HPC) compute jobs which significantly reduces the volume of data written to storage. We used this approach to implement a performance monitoring system for an HPC cluster to monitor compute jobs. The output of our performance monitoring system is a time-series graph representing aggregated performance metrics for the job. The time resolution of the resulting graph is adaptive and depends on the duration of the analyzed job.
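The abstract does not say how the adaptive resolution is obtained, so the following is only an illustrative sketch of one way such on-the-fly aggregation could work, not the authors' implementation. It assumes a fixed budget of time bins per job: each sample is folded into a running sum, count, minimum and maximum, and whenever the job outlives the current bins, adjacent bins are merged and the bin width doubles. The class name `AdaptiveSeries` and the parameters `max_bins` and `base_width` are hypothetical.

```python
class AdaptiveSeries:
    """Hypothetical fixed-size series; bin width doubles as the job runs longer."""

    def __init__(self, max_bins=4, base_width=1.0):
        self.max_bins = max_bins      # assumed budget of points per job graph
        self.width = base_width       # current bin width in seconds
        self.bins = []                # each bin: [sum, count, min, max] or None

    def add(self, t, value):
        """Fold one sample taken t seconds after job start into its bin."""
        idx = int(t // self.width)
        while idx >= self.max_bins:   # job outgrew the bins: coarsen resolution
            self._halve_resolution()
            idx = int(t // self.width)
        while len(self.bins) <= idx:  # pad with empty bins up to idx
            self.bins.append(None)
        b = self.bins[idx]
        if b is None:
            self.bins[idx] = [value, 1, value, value]
        else:
            b[0] += value
            b[1] += 1
            b[2] = min(b[2], value)
            b[3] = max(b[3], value)

    def _halve_resolution(self):
        """Merge adjacent bin pairs and double the bin width."""
        merged = []
        for i in range(0, len(self.bins), 2):
            pair = [b for b in self.bins[i:i + 2] if b is not None]
            if not pair:
                merged.append(None)
            else:
                merged.append([
                    sum(b[0] for b in pair),
                    sum(b[1] for b in pair),
                    min(b[2] for b in pair),
                    max(b[3] for b in pair),
                ])
        self.bins = merged
        self.width *= 2

    def averages(self):
        """Return (bin start time, mean value) for every non-empty bin."""
        return [(i * self.width, b[0] / b[1])
                for i, b in enumerate(self.bins) if b is not None]


series = AdaptiveSeries(max_bins=4, base_width=1.0)
for second in range(8):               # one sample per second, value == time
    series.add(float(second), float(second))
print(series.width, series.averages())
```

In this sketch only the per-bin aggregates are kept, so the amount of data retained per job is bounded by `max_bins` regardless of job duration; raw samples are never stored.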

Cite

Stefanov, K., & Voevodin, V. (2019). On-the-fly calculation of performance metrics with adaptive time resolution for HPC compute jobs. In Communications in Computer and Information Science (Vol. 965, pp. 609–619). Springer Verlag. https://doi.org/10.1007/978-3-030-05807-4_52
