Collecting, monitoring, and analyzing facility and systems data at the National Energy Research Scientific Computing Center

26Citations
Citations of this article
19Readers
Mendeley users who have this article in their library.
Get full text

Abstract

As high-performance computing (HPC) resources continue to grow in size and complexity, so too does the volume and velocity of the operational data that is associated with them. At such scales, new mechanisms and technologies are required to continuously gather, store, and analyze this data in near-real time from heterogeneous and distributed sources without impacting the underlying data center operations or HPC resource utilization. In this paper, we describe our experiences in designing and implementing an infrastructure for extreme-scale operational data collection, known as the Operations Monitoring and Notification Infrastructure (OMNI) at the National Energy Research Scientific Computing (NERSC) center at Lawrence Berkeley National Laboratory. OMNI currently holds over 522 billion records of online operational data (totaling over 125TB) and can ingest new data points at an average rate of 25,000 data points per second. Using OMNI as a central repository, facilities and environmental data can be seamlessly integrated and correlated with machine metrics, job scheduler information, network errors, and more, providing a holistic view of data center operations. To demonstrate the value of real-time operational data collection, we present a number of real-world case studies for which having OMNI data readily available led to key operational insights at NERSC. The case results include a reduction in the downtime of an HPC system during a facility transition, as well as a $2.5 million electrical substation savings for the next-generation Perlmutter HPC system.

Cite

CITATION STYLE

APA

Bautista, E., Romanus, M., Davis, T., Whitney, C., & Kubaska, T. (2019). Collecting, monitoring, and analyzing facility and systems data at the National Energy Research Scientific Computing Center. In ACM International Conference Proceeding Series. Association for Computing Machinery. https://doi.org/10.1145/3339186.3339213

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free