We offer a novel approach to matrix-matrix multiplication on computing platforms with memory hierarchies. Constant-bandwidth (CB) blocks improve computation throughput for architectures limited by external memory bandwidth. By configuring the shape and size of CB blocks operating from within any memory hierarchy level (e.g., internal SRAM), we achieve high throughput while holding external bandwidth (e.g., with DRAM) constant. We explain how, surprisingly, CB blocks can maintain constant external bandwidth as computation throughput increases. Analogous to partitioning a cake into pieces, we dub our CB-partitioned system CAKE. We show that CAKE outperforms state-of-the-art libraries in computation time on real-world systems where external bandwidth is a bottleneck, demonstrating CAKE's ability to address the memory wall. CAKE achieves superior performance by directly using theoretically optimal CB-partitioned blocks in tiling and scheduling, obviating the need for extensive design search.
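The abstract does not give the CB block shape, so the following is only a back-of-the-envelope sketch of why constant external bandwidth is possible, under our own assumptions (not necessarily the paper's exact definition): p cores each perform one multiply-accumulate (MAC) per cycle, every block dimension scales linearly with p (the hypothetical `scaled_block` rule below), and C partial sums stay resident in local memory so only the A and B tiles are streamed from external memory. Compute then grows as p^3 and throughput as p, so block time grows as p^2, exactly matching the p^2 growth of A/B traffic. In this model the cost is local memory that grows quadratically with p.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class Block:
    m: int  # rows of the A and C tiles
    k: int  # shared (reduction) dimension
    n: int  # columns of the B and C tiles


def scaled_block(p: int, a: int, b: int, c: int) -> Block:
    # Hypothetical scaling rule (assumption): each block dimension grows
    # linearly with the number of cores p.
    return Block(m=p * a, k=p * b, n=p * c)


def external_bandwidth(block: Block, throughput: int) -> Fraction:
    """Words fetched from external memory per cycle while computing one block.

    Assumes A (m x k) and B (k x n) are streamed in, while the C (m x n)
    partial sums stay in local memory and never cross the external link.
    """
    macs = block.m * block.k * block.n
    cycles = Fraction(macs, throughput)  # throughput = MACs per cycle
    io_words = block.m * block.k + block.k * block.n
    return io_words / cycles


def local_memory_words(block: Block) -> int:
    # Local storage for the resident C tile plus the A and B tiles.
    return block.m * block.n + block.m * block.k + block.k * block.n


for p in (1, 2, 4, 8):
    blk = scaled_block(p, a=4, b=4, c=4)
    # With p cores at 1 MAC/cycle each, throughput = p.
    print(p, external_bandwidth(blk, throughput=p), local_memory_words(blk))
# prints:
# 1 1/2 48
# 2 1/2 192
# 4 1/2 768
# 8 1/2 3072
```

Under these assumptions the bandwidth simplifies to 1/c + 1/a words per cycle, independent of p, while local memory grows as 48·p² words for this choice of a, b, c.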
Kung, H. T., Natesh, V., & Sabot, A. (2021). CAKE: Matrix multiplication using constant-bandwidth blocks. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC. IEEE Computer Society. https://doi.org/10.1145/3458817.3476166