We present Roca, a technique to reduce the opportunity cost of integrating non-programmable, high-throughput accelerators in general-purpose architectures. Roca exploits the insight that non-programmable accelerators are mostly made of private local memories (PLMs), which are key to the accelerators' performance and energy efficiency. Roca transparently exposes the PLMs of otherwise unused accelerators to the cache substrate, thereby allowing the system to extract utility from accelerators even when they cannot directly speed up the system's workload. Roca adds low complexity to existing accelerator designs, requires minimal modifications to the cache substrate, and incurs a modest area overhead that is almost entirely due to additional tag storage. We quantify the utility of Roca by comparing the returns of investing area in either regular last-level cache banks or Roca-enabled accelerators. Through simulation of non-accelerated multiprogrammed workloads on a 16-core system, we show that extending a 2MB S-NUCA baseline with typical accelerators (i.e., whose area is 66% memory) into a 6MB Roca-enabled last-level cache can, on average, realize 70% of the performance and 68% of the energy-efficiency benefits of a same-area 8MB S-NUCA configuration, in addition to the potential orders-of-magnitude efficiency and performance improvements that the added accelerators provide to workloads suitable for acceleration.
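The capacity figures in the abstract follow from a simple area-accounting argument. The sketch below walks through that arithmetic under illustrative assumptions not spelled out in the abstract (cache capacity taken as linearly proportional to SRAM area, tag-storage overhead ignored); it is a back-of-the-envelope reading, not the paper's evaluation methodology.

```python
# Back-of-the-envelope area accounting for the configurations compared in the
# abstract. The linear area-to-capacity scaling and the omission of tag
# overhead are illustrative assumptions, not figures taken from the paper.

BASELINE_LLC_MB = 2        # 2MB S-NUCA baseline last-level cache
SAME_AREA_LLC_MB = 8       # 8MB S-NUCA built from the same total silicon area
PLM_AREA_FRACTION = 0.66   # "typical" accelerators: ~66% of their area is memory

# Extra area beyond the baseline, expressed as MB-equivalents of plain cache SRAM.
added_area_as_cache_mb = SAME_AREA_LLC_MB - BASELINE_LLC_MB   # 6 MB-equivalent

# If that same area is spent on accelerators instead, only their PLM portion
# can be exposed to the cache substrate by a Roca-like scheme.
plm_exposed_mb = added_area_as_cache_mb * PLM_AREA_FRACTION   # ~4 MB

roca_llc_mb = BASELINE_LLC_MB + plm_exposed_mb                # ~6 MB total
print(f"Roca-enabled LLC: ~{roca_llc_mb:.0f}MB vs. {SAME_AREA_LLC_MB}MB same-area S-NUCA")
```

Under these assumptions, investing the extra area in accelerators rather than cache banks yields roughly a 6MB Roca-enabled LLC competing against the 8MB all-cache alternative, which is the comparison the reported 70%/68% figures refer to.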
Cota, E. G., Mantovani, P., & Carloni, L. P. (2016). Exploiting private local memories to reduce the opportunity cost of accelerator integration. In Proceedings of the International Conference on Supercomputing (ICS '16). Association for Computing Machinery. https://doi.org/10.1145/2925426.2926258