Locality-driven dynamic GPU cache bypassing

Chao Li; Shuaiwen Leon Song; Hongwen Dai; Albert Sidelnik; Siva Kumar Sastry Hari; Huiyang Zhou

Conference Proceedings


Proceedings of the International Conference on Supercomputing (2015) 2015-June 67-77

DOI: 10.1145/2751205.2751237


Abstract

This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth and low-latency data accesses. However, the high number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that the memory access streams to L1 D-caches for many applications contain a significant number of requests with low reuse, which greatly reduces cache efficacy. Existing GPU cache management schemes are either based on conditional/reactive solutions or hit-rate based designs specifically developed for CPU last-level caches, which can limit overall performance. To overcome these challenges, we propose an efficient locality monitoring mechanism to dynamically filter the access stream on cache insertion such that only the data with high reuse and short reuse distances are stored in the L1 D-cache. Specifically, we present a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions. Results show that our proposed design can dramatically reduce cache contention and achieve up to 56.8% and an average of 30.3% performance improvement over the baseline architecture for a range of highly optimized cache-unfriendly applications, with minor area overhead and better energy efficiency. Our design also significantly outperforms the state-of-the-art CPU and GPU bypassing schemes (especially for irregular applications), without generating extra L2 and DRAM level contention.
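The idea of filtering on cache insertion can be illustrated with a small software model. The sketch below is not the paper's hardware design, which the abstract does not specify in detail; it assumes a fully associative LRU cache, a separate tag-only table (`seen_tags`) that remembers recently missed lines, and a hypothetical `reuse_threshold` that a line must reach before it is allowed into the cache. All names, sizes, and the synthetic access stream are illustrative assumptions.

```python
from collections import OrderedDict


def lru_insert(table, key, value, capacity):
    """Insert key as most recently used, evicting the LRU entry if full."""
    if key not in table and len(table) >= capacity:
        table.popitem(last=False)
    table[key] = value
    table.move_to_end(key)


def count_hits(stream, cache_lines, filter_lines=0, reuse_threshold=1):
    """Count cache hits; filter_lines == 0 models an always-insert baseline."""
    cache = OrderedDict()      # data cache: line address -> None
    seen_tags = OrderedDict()  # tag-only filter: line address -> misses seen
    hits = 0
    for addr in stream:
        if addr in cache:
            cache.move_to_end(addr)
            hits += 1
        elif filter_lines == 0:
            lru_insert(cache, addr, None, cache_lines)
        else:
            # Misses observed for this line while its tag stayed in the filter.
            prior = seen_tags.pop(addr, 0)
            if prior >= reuse_threshold:
                # Reused within the filter window: allow it into the cache.
                lru_insert(cache, addr, None, cache_lines)
            else:
                # No evidence of reuse yet: bypass the cache, remember the tag.
                lru_insert(seen_tags, addr, prior + 1, filter_lines)
    return hits


def make_stream(iterations):
    """Four reused lines, each followed by three lines that are never reused."""
    stream = []
    for i in range(iterations):
        stream.append(i % 4)
        stream.extend(1000 + 3 * i + k for k in range(3))
    return stream


stream = make_stream(100)
baseline_hits = count_hits(stream, cache_lines=4)
filtered_hits = count_hits(stream, cache_lines=4, filter_lines=16)
print("baseline hits:", baseline_hits, "filtered hits:", filtered_hits)
```

In this synthetic stream the streaming lines evict the four reused lines from the always-insert cache before they are accessed again, whereas the filtered variant keeps the streaming lines out and retains the reused lines once they have been promoted. The numbers say nothing about real GPU workloads; they only show the mechanism.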


Citation (APA)

Li, C., Song, S. L., Dai, H., Sidelnik, A., Hari, S. K. S., & Zhou, H. (2015). Locality-driven dynamic GPU cache bypassing. In Proceedings of the International Conference on Supercomputing (Vol. 2015-June, pp. 67–77). Association for Computing Machinery. https://doi.org/10.1145/2751205.2751237
