GUIDE: A Scalable Information Directory Service to Collect, Federate, and Analyze Logs for Operational Insights into a Leadership HPC Facility

Abstract

In this paper, we describe the GUIDE framework used to collect, federate, and analyze log data from the Oak Ridge Leadership Computing Facility (OLCF), and how we use that data to derive insights into facility operations. We collect system logs and extract monitoring data at every level of the various OLCF subsystems, and have developed a suite of pre-processing tools to make the raw data consumable. The cleansed logs are then ingested and federated into a central, scalable data warehouse, Splunk, that offers storage, indexing, querying, and visualization capabilities. We have further developed and deployed a set of tools to analyze these multiple disparate log streams in concert and derive operational insights. We describe our experience from developing and deploying the GUIDE infrastructure, and deriving valuable insights on the various subsystems, based on two years of operations in the production OLCF environment.

CCS CONCEPTS
• General and reference → Performance; Measurement; Reliability; Metrics
• Information systems → Data warehouses; Extraction, transformation and loading; Data analytics; Data mining

Cite

Vazhkudai, S. S., Miller, R., Tiwari, D., Zimmer, C., Wang, F., Oral, S., … Steinert, D. (2017). GUIDE: A Scalable Information Directory Service to Collect, Federate, and Analyze Logs for Operational Insights into a Leadership HPC Facility. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC (Vol. 2017-November). IEEE Computer Society. https://doi.org/10.1145/3126908.3126946
