Abstract
Graphics Processing Units (GPUs) rely on the memory hierarchy and Thread-Level Parallelism (TLP) to tolerate off-chip memory latency, which is a significant bottleneck for memory-bound applications. However, parallel threads generate a large number of memory requests, which increases the average memory latency and degrades cache performance due to high contention. Prefetching is an effective technique for reducing memory access latency, and prior research shows the positive impact of stride-based prefetching on GPU performance. However, existing prefetching methods rely only on fixed strides. To address this limitation, this paper proposes a new prefetching technique, Snake, which is built upon chains of variable strides and uses throttling and memory-decoupling strategies. Snake achieves 80% coverage and 75% accuracy in prefetching demand memory requests, improving both overall performance and energy consumption by 17% for memory-bound General-Purpose Graphics Processing Unit (GPGPU) applications.
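To illustrate the core idea of chain-based prefetching with variable strides, here is a minimal sketch. It is not the paper's Snake microarchitecture (it omits the throttling and memory-decoupling mechanisms the abstract mentions); the `ChainPrefetcher` class, its parameters, and the per-stream tracking scheme are all assumptions made for illustration. The sketch records a short chain of recent strides per access stream and predicts upcoming addresses by replaying that chain, which lets it follow patterns a single fixed stride cannot capture.

```python
from collections import defaultdict, deque


class ChainPrefetcher:
    """Illustrative variable-stride chain prefetcher (hypothetical sketch,
    not the paper's Snake design)."""

    def __init__(self, chain_len=4, degree=2):
        self.chain_len = chain_len        # max strides kept per chain
        self.degree = degree              # prefetches issued per access
        self.last_addr = {}               # stream id -> last address seen
        self.chains = defaultdict(deque)  # stream id -> recent strides

    def access(self, stream, addr):
        """Record a demand access; return predicted prefetch addresses."""
        prefetches = []
        if stream in self.last_addr:
            stride = addr - self.last_addr[stream]
            chain = self.chains[stream]
            chain.append(stride)
            if len(chain) > self.chain_len:
                chain.popleft()
            # Replay the stride chain to predict the next `degree`
            # addresses; a repeating stride pattern is followed exactly
            # once the chain has captured one full period.
            next_addr = addr
            for i in range(self.degree):
                next_addr += chain[i % len(chain)]
                prefetches.append(next_addr)
        self.last_addr[stream] = addr
        return prefetches


# Example: an alternating 8/16-byte stride pattern that a fixed-stride
# prefetcher would mispredict half the time.
p = ChainPrefetcher(chain_len=4, degree=2)
p.access(0, 0)                 # first access: no history, no prefetch
print(p.access(0, 8))          # stride 8 learned
print(p.access(0, 24))         # stride 16 learned; chain is now [8, 16]
```

Once the chain holds the full `[8, 16]` period, the replay step predicts the alternating addresses correctly, which is the intuition behind preferring stride chains over a single fixed stride.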
Citation
Mostofi, S., Falahati, H., Mahani, N., Lotfi-Kamran, P., & Sarbazi-Azad, H. (2023). Snake: A Variable-length Chain-based Prefetching for GPUs. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2023 (pp. 728–741). Association for Computing Machinery, Inc. https://doi.org/10.1145/3613424.3623782