SIMD2: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM

Yunan Zhang; Po An Tsai; Hung Wei Tseng

Conference Proceedings, Open Access

Proceedings - International Symposium on Computer Architecture (2022) 552-566

DOI: 10.1145/3470496.3527411

Abstract

Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ in only the core operation; for example, using add-minimum instead of multiply-add. Algorithms with a semiring-like structure therefore have potential to be accelerated by a general-purpose matrix operation architecture, instead of common MXUs. In this paper, we propose SIMD2, a new programming paradigm to support generalized matrix operations with a semiring-like structure. SIMD2 instructions accelerate eight more types of matrix operations, in addition to matrix multiplications. Since SIMD2 instructions resemble a matrix-multiplication instruction, we are able to build SIMD2 architecture on top of any MXU architecture with minimal modifications. We developed a framework that emulates and validates SIMD2 using NVIDIA GPUs with Tensor Cores. Across 8 applications, SIMD2 provides up to 38.59× speedup and more than 6.94× on average over optimized CUDA programs, with only 5% of full-chip area overhead.
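The idea that these algorithms "differ in only the core operation" can be illustrated with a small sketch. The Python below is not the SIMD2 instruction set or the authors' emulation framework; it is a plain scalar reference, and the names `semiring_mma`, `otimes`, and `oplus` are hypothetical. It assumes the operation has the form D = C ⊕ (A ⊗ B), with the two operators passed in as parameters. Choosing (+, ×) gives ordinary matrix multiply-accumulate, while choosing (min, +) gives the add-minimum variant mentioned in the abstract, which relaxes shortest-path distances. The last lines split the inner dimension into two tiles to show why this structure permits tiling.

```python
from typing import Callable, List

Matrix = List[List[float]]
INF = float("inf")  # stands for "no edge" in the shortest-path example


def semiring_mma(
    a: Matrix,
    b: Matrix,
    c: Matrix,
    otimes: Callable[[float, float], float],
    oplus: Callable[[float, float], float],
) -> Matrix:
    """Return D with D[i][j] = C[i][j] (+) A[i][k] (x) B[k][j], reduced over k.

    `otimes` plays the role of multiply and `oplus` the role of add.
    Both names are illustrative assumptions, not taken from the paper.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [row[:] for row in c]  # start from the accumulator C
    for i in range(rows):
        for j in range(cols):
            acc = d[i][j]
            for k in range(inner):
                acc = oplus(acc, otimes(a[i][k], b[k][j]))
            d[i][j] = acc
    return d


# Ordinary GEMM: (x) is multiply, (+) is add.
x = [[1.0, 2.0], [3.0, 4.0]]
y = [[5.0, 6.0], [7.0, 8.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
gemm = semiring_mma(x, y, zeros, lambda p, q: p * q, lambda p, q: p + q)

# Add-minimum (min-plus): (x) is add, (+) is min.
# graph[i][j] is the weight of edge i -> j; one step relaxes all
# paths of at most two edges, which is enough for this 3-node graph.
graph = [
    [0.0, 3.0, INF],
    [INF, 0.0, 1.0],
    [2.0, INF, 0.0],
]
shortest = semiring_mma(graph, graph, graph, lambda p, q: p + q, min)

# Tiling over the inner dimension: process k = 0..1 first, then k = 2,
# feeding the partial result back in as the accumulator.
left = [row[:2] for row in graph]
right = [row[2:] for row in graph]
partial = semiring_mma(left, graph[:2], graph, lambda p, q: p + q, min)
tiled = semiring_mma(right, graph[2:], partial, lambda p, q: p + q, min)
```

In this sketch the loop nest is identical for both cases and only the two scalar operators change. The tiled result matches the untiled one because `min` is associative and commutative, so partial reductions can be combined in any order.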

Citation (APA)

Zhang, Y., Tsai, P. A., & Tseng, H. W. (2022). SIMD2: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM. In Proceedings - International Symposium on Computer Architecture (pp. 552–566). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1145/3470496.3527411
