Task-parallel programming models with input annotation-based concurrency extraction at runtime present a promising paradigm for programming multicore processors. Through management of dependencies, task assignments, and orchestration, these models markedly simplify the programming effort for parallelization while exposing higher levels of concurrency. In this paper we show that for multicores with a shared last-level cache (LLC), the concurrency extraction framework can be used to improve the shared LLC performance. Based on the input annotations for future tasks, the runtime instructs the hardware to prioritize data blocks with future reuse while evicting blocks with no future reuse. These instructions allow the hardware to preserve all the blocks for at least some of the future tasks and evict dead blocks. This leads to a considerable improvement in cache efficiency over what is achieved by hardware-only replacement policies, which can replace blocks for all future tasks resulting in poor hit-rates for all future tasks. The proposed hardware-software technique leads to a mean improvement of 18% in application performance and a mean reduction of 26% in misses over a shared LLC managed by the Least Recently Used replacement policy for a set of input-annotated task-parallel programs using the OmpSs programming model implemented on the NANOS++ runtime. In contrast, the state-of-the-art thread-based partitioning scheme suffers an average performance loss of 2% and an average increase of 15% in misses over the baseline.
CITATION STYLE
Pan, A., & Pai, V. S. (2015). Runtime-driven shared last-level cache management for task-parallel programs. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC (Vol. 15-20-November-2015). IEEE Computer Society. https://doi.org/10.1145/2807591.2807625
Mendeley helps you to discover research relevant for your work.