GPU Behavior on a Large HPC Cluster

Nathan Debardeleben; Sean Blanchard; Laura Monroe; Phil Romero; Daryl Grunau; Craig Idler; Cornell Wright

Conference Proceedings (Open Access)

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2014) 8374 LNCS 680-689

DOI: 10.1007/978-3-642-54420-0_66

Abstract

We discuss observed characteristics of GPUs deployed as accelerators in an HPC cluster at Los Alamos National Laboratory (LANL). GPUs offer a high theoretical FLOPS rate and are reasonably inexpensive and readily available, but they are relatively new to HPC, which demands both consistently high performance across nodes and consistently low error rates. We modified a standard acceptance procedure to test GPU performance, error rate, and reliability characteristics, and ran the test suite on a Fermi HPC cluster at LANL. We describe our testing methodology and present results relevant to deploying GPUs in an HPC environment. We show performance variability, power usage variability (the two possibly related), and some reliability concerns on the GPUs tested. We argue for rigorous testing of these devices in deployment as a way of characterizing their behavior. © 2014 Springer-Verlag Berlin Heidelberg.

Cite

Debardeleben, N., Blanchard, S., Monroe, L., Romero, P., Grunau, D., Idler, C., & Wright, C. (2014). GPU Behavior on a Large HPC Cluster. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8374 LNCS, pp. 680–689). Springer Verlag. https://doi.org/10.1007/978-3-642-54420-0_66
