Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units

Juan David Guerrero-Balaguera; Josie E. Rodriguez Condia; Fernando F. Dos Santos; Matteo Sonza Reorda; Paolo Rech

Conference Proceedings (Open Access)

International Conference for High Performance Computing, Networking, Storage and Analysis, SC (2023)

DOI: 10.1145/3581784.3607086

Abstract

Modern Graphics Processing Units (GPUs) are expected to operate for many years, which exposes the hardware to aging (i.e., permanent faults arising after the end-of-manufacturing test). Hence, techniques to assess the impact of permanent faults in GPUs are strongly required, especially in safety-critical domains. This paper presents a method to evaluate permanent faults in the GPU's scheduler and control units, together with the first figures to quantify these effects. We inject 5.83 × 10^5 permanent faults in the gate-level units of a GPU model. Then, we map the observed error categories to software errors by instrumenting 13 applications and two convolutional neural networks, injecting more than 1.65 × 10^5 permanent errors (1,000 errors per application), reducing evaluation times from several years to hundreds of hours. Our results highlight that faults in GPU parallelism management units impact software execution parameters. Moreover, errors in resource management or instruction codes hang the code, while 45% of errors induce silent data corruption.

Cite

Guerrero-Balaguera, J. D., Condia, J. E. R., Dos Santos, F. F., Reorda, M. S., & Rech, P. (2023). Understanding the Effects of Permanent Faults in GPU’s Parallelism Management and Control Units. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC. IEEE Computer Society. https://doi.org/10.1145/3581784.3607086
