Abstract
In spite of several recent advancements, data movement in modern CNN accelerators remains a significant bottleneck. Architectures like Eyeriss implement large scratchpads within individual processing elements, while architectures like TPU v1 implement large systolic arrays and large monolithic caches. Several data movements in these prior works are therefore across long wires, and account for much of the energy consumption. In this work, we design a new wire-aware CNN accelerator, WAX, that employs a deep and distributed memory hierarchy, thus enabling data movement over short wires in the common case. An array of computational units, each with a small set of registers, is placed adjacent to a subarray of a large cache to form a single tile. Shift operations among these registers allow for high reuse with little wire traversal overhead. This approach optimizes the common case, where register fetches and access to a few-kilobyte buffer can be performed at very low cost. Operations beyond the tile require traversal over the cache's H-tree interconnect, but represent the uncommon case. For high reuse of operands, we introduce a family of new data mappings and dataflows. The best dataflow, WAXFlow-3, achieves a 2× improvement in performance and a 2.6-4.4× reduction in energy, relative to Eyeriss. As more WAX tiles are added, performance scales well until 128 tiles.
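The idea of reusing operands through register shifts can be illustrated with a toy 1D convolution. The sketch below is not the WAXFlow-3 dataflow and does not model the WAX tile; it is a minimal illustration that assumes a single row of registers loaded once from an adjacent buffer, with each shift standing in for a short neighbor-to-neighbor wire. The function name `shift_reuse_conv1d` and the `buffer_reads` counter are hypothetical and introduced only for this example.

```python
def shift_reuse_conv1d(activations, weights):
    """Toy 1D convolution (valid, correlation form) over a shifting register row.

    Assumption: one register per activation, filled once from a nearby buffer.
    Returns the partial sums and the number of buffer reads performed.
    """
    n, k = len(activations), len(weights)
    out_len = n - k + 1
    regs = list(activations)      # one buffer read per activation
    buffer_reads = n
    psums = [0] * out_len
    for w in weights:
        # Each output position multiplies the value currently in its register.
        for i in range(out_len):
            psums[i] += w * regs[i]
        # Shift every value one register to the left (rotation keeps the list
        # length fixed; wrapped values are never used by a valid output).
        regs = regs[1:] + regs[:1]
    return psums, buffer_reads


def naive_conv1d(activations, weights):
    """Reference version that re-reads the buffer for every multiply."""
    n, k = len(activations), len(weights)
    out, reads = [], 0
    for i in range(n - k + 1):
        acc = 0
        for j in range(k):
            acc += weights[j] * activations[i + j]
            reads += 1
        out.append(acc)
    return out, reads
```

In this toy model, both versions produce the same outputs, but the shifting version reads each activation from the buffer once, while the naive version performs one buffer read per multiply.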
Gudaparthi, S., Narayanan, S., Balasubramonian, R., Giacomin, E., Kambalasubramanyam, H., & Gaillardon, P. E. (2019). Wire-aware architecture and dataflow for CNN accelerators. In Proceedings of the Annual International Symposium on Microarchitecture, MICRO (pp. 1–13). IEEE Computer Society. https://doi.org/10.1145/3352460.3358316