On-the-fly calculation of performance metrics with adaptive time resolution for HPC compute jobs

Abstract

Performance monitoring is a method for debugging performance issues in different types of applications. It uses various performance metrics obtained from the servers the application runs on, and may also use metrics produced by the application itself. The common approach to building performance monitoring systems is to store all the data in a database, then retrieve the data that correspond to a specific job and analyze that portion of the data. This approach works well when the data stream is not very large. For a large performance monitoring data stream, it incurs heavy I/O and imposes high requirements on the storage systems that process the data. In this paper we propose an adaptive on-the-fly approach to performance monitoring of High Performance Computing (HPC) compute jobs which significantly reduces the data stream that has to be written to storage. We used this approach to implement a performance monitoring system for an HPC cluster to monitor compute jobs. The output of our performance monitoring system is a time-series graph representing aggregated performance metrics for the job. The time resolution of the resulting graph is adaptive and depends on the duration of the analyzed job.
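The abstract does not spell out how the adaptive time resolution is computed, so the following Python sketch is only one plausible way to obtain that behavior, not the authors' implementation. It assumes a fixed budget of time buckets per job: samples are aggregated as they arrive, and whenever the job outlives the budget, the bucket width doubles and adjacent buckets are merged. The names `Bucket`, `AdaptiveSeries`, `base_width`, and `max_buckets` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """Running aggregate of the samples that fall into one time interval."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "Bucket") -> None:
        self.count += other.count
        self.total += other.total
        self.minimum = min(self.minimum, other.minimum)
        self.maximum = max(self.maximum, other.maximum)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


class AdaptiveSeries:
    """Aggregates one metric of one job on the fly (hypothetical scheme).

    At most `max_buckets` buckets are kept, so memory and output size stay
    bounded regardless of how long the job runs.
    Assumes every timestamp is >= `start`.
    """

    def __init__(self, start: float, base_width: float = 1.0,
                 max_buckets: int = 4) -> None:
        self.start = start
        self.width = base_width
        self.max_buckets = max_buckets
        self.buckets: dict[int, Bucket] = {}

    def add(self, timestamp: float, value: float) -> None:
        index = int((timestamp - self.start) // self.width)
        # Halve the resolution until the sample fits into the bucket budget.
        while index >= self.max_buckets:
            self._coarsen()
            index = int((timestamp - self.start) // self.width)
        self.buckets.setdefault(index, Bucket()).add(value)

    def _coarsen(self) -> None:
        # Merge buckets 2k and 2k+1 into bucket k and double the width.
        merged: dict[int, Bucket] = {}
        for index, bucket in self.buckets.items():
            merged.setdefault(index // 2, Bucket()).merge(bucket)
        self.buckets = merged
        self.width *= 2

    def points(self) -> list[tuple[float, float]]:
        """Return (bucket start time, mean value) pairs in time order."""
        return [(self.start + i * self.width, b.mean)
                for i, b in sorted(self.buckets.items())]


series = AdaptiveSeries(start=0.0, base_width=1.0, max_buckets=4)
for t in range(8):
    series.add(float(t), float(t))
print(series.width, series.points())
```

In this sketch a short job keeps its fine-grained buckets, while a long job ends up with the same number of coarser ones, so only a bounded set of aggregates per job would ever need to be written to storage.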

Cite

Stefanov, K., & Voevodin, V. (2019). On-the-fly calculation of performance metrics with adaptive time resolution for HPC compute jobs. In Communications in Computer and Information Science (Vol. 965, pp. 609–619). Springer Verlag. https://doi.org/10.1007/978-3-030-05807-4_52
