SUGAR: Speeding Up GPGPU Application Resilience Estimation with Input Sizing


Abstract

As Graphics Processing Units (GPUs) become a de facto solution for accelerating a wide range of applications, their reliable operation is increasingly important. One of the major challenges in GPU reliability is accurately measuring GPGPU application error resilience. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and uses a large amount of potentially unreliable compute and memory resources on the GPU. Because the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on application error resilience is impractical. Instead, application resilience is evaluated via extensive fault injection campaigns that sample this large fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by judicious input sizing. We show that analyzing a small fraction of the input is sufficient to estimate application resilience with high accuracy and to dramatically reduce the duration of experimentation. Key to our estimation methodology is the discovery of repeating patterns as a function of the input size. Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction (DI) count at the thread level, we discover patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases that we examine in this paper, this new resilience estimation provides significant speedups (up to 1336 times, and 97.0 times on average) while keeping estimation errors below 1%. For details, see the full version of this SIGMETRICS paper.
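
The role of the thread-level DI count can be illustrated with a minimal Python sketch. It is not the SUGAR methodology itself, only a toy under stated assumptions: threads with the same DI count are treated as one class, each class has a resilience value (taken here to mean the fraction of injected faults that leave the output correct) obtained from some separate fault injection run, and each class is weighted by threads times DI count as a rough proxy for its share of fault sites. The function names and all numbers are hypothetical.

```python
from collections import Counter


def group_threads_by_di(di_counts):
    """Map each distinct dynamic-instruction (DI) count to the number of threads with it."""
    return Counter(di_counts)


def estimate_resilience(di_counts, class_resilience):
    """Weighted average of per-class resilience.

    Assumption: the weight of a class is threads * DI count, used here as a
    rough proxy for the number of fault sites the class contributes.
    """
    groups = group_threads_by_di(di_counts)
    total_weight = sum(di * n for di, n in groups.items())
    weighted_sum = sum(di * n * class_resilience[di] for di, n in groups.items())
    return weighted_sum / total_weight


# Hypothetical data: 6 threads execute 100 instructions, 2 threads execute 200.
di_counts = [100] * 6 + [200] * 2
# Hypothetical per-class resilience, e.g. measured on one representative thread each.
class_resilience = {100: 0.90, 200: 0.80}

estimate = estimate_resilience(di_counts, class_resilience)
print(f"estimated resilience: {estimate:.2f}")
```

In this toy setting only one representative per DI class needs fault injection, so the cost depends on the number of distinct classes rather than on the number of threads.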

Cite

Yang, L., Nie, B., Jog, A., & Smirni, E. (2021). SUGAR: Speeding Up GPGPU Application Resilience Estimation with Input Sizing. In Performance Evaluation Review (Vol. 49, pp. 45–46). Association for Computing Machinery. https://doi.org/10.1145/3410220.3453917