Analyzing and leveraging shared L1 caches in GPUs

17 citations · 28 readers (Mendeley)
Abstract

Graphics Processing Units (GPUs) concurrently execute thousands of threads, which makes them effective at achieving high throughput for a wide range of applications. However, the memory wall often limits peak throughput. GPUs use caches to address this limitation, and hence several prior works have focused on improving cache hit rates, which in turn can improve throughput for memory-intensive applications. However, almost all of the prior works assume a conventional cache hierarchy in which each GPU core has a private local L1 cache and all cores share the L2 cache. Our analysis shows that this canonical organization does not allow optimal utilization of caches, because the private nature of L1 caches allows multiple copies of the same cache line to be replicated across cores.

We introduce a new shared L1 cache organization in which all cores collectively cache a single copy of the data at only one location (core), leading to zero data replication. We achieve this by allowing each core to cache only a non-overlapping slice of the entire address range. Such a design significantly improves the collective L1 hit rates, but it incurs latency overheads from additional communication when a core requests data that is not allowed to be present in its own cache. While many workloads can tolerate this additional latency, several workloads show performance sensitivities. Therefore, we develop lightweight communication-optimization techniques and a run-time mechanism that considers the latency-tolerance characteristics of applications to decide which applications should execute with the private versus the shared L1 cache organization, and reconfigures the caches accordingly. In effect, we achieve significant performance and energy-efficiency improvements, at a modest hardware cost, for applications that prefer the shared organization, with little to no impact on other applications.
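The core idea in the abstract — each core caches only a non-overlapping slice of the address range, so a cache line has exactly one "home" core and is never replicated — can be sketched as follows. This is a minimal illustration, not the paper's actual hardware design; the core count, line size, and interleaving function are all assumptions for the example (real designs may use a different hash).

```python
# Sketch of a shared-L1 organization with zero data replication:
# every cache line maps to exactly one home core, so no two L1s
# ever hold copies of the same line.
# NUM_CORES and LINE_SIZE are illustrative values, not from the paper.

NUM_CORES = 16    # assumed number of GPU cores
LINE_SIZE = 128   # assumed cache-line size in bytes

def home_core(address: int) -> int:
    """Map a byte address to the single core allowed to cache its line."""
    line = address // LINE_SIZE
    return line % NUM_CORES   # simple line interleaving; a real design may hash

def is_remote(requesting_core: int, address: int) -> bool:
    """True when another core owns the line, i.e. the request must travel
    to the home core and pays the extra communication latency."""
    return home_core(address) != requesting_core
```

Under this mapping, a hit anywhere in the collective L1 space counts toward the shared hit rate, but a request whose home core differs from the requesting core incurs the additional communication latency that the paper's run-time mechanism weighs when choosing between the private and shared organizations.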

Citation (APA)

Ibrahim, M. A., Kayiran, O., Eckert, Y., Loh, G. H., & Jog, A. (2020). Analyzing and leveraging shared L1 caches in GPUs. In Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT (pp. 167–173). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1145/3410463.3414623
