Abstract
The success of DNNs comes at the expense of excessive memory/computation cost, which can be addressed by exploiting reduced precision and sparsity jointly. Existing sparse GPU kernels, however, fail to achieve practical speedup over cublasHgemm, cuBLAS's dense GEMM, under half precision. Kernels for fine-grained sparsity suffer from low data reuse, while those for coarse-grained sparsity are limited by the tension between kernel performance and model quality across grain sizes. We propose column-vector-sparse-encoding, which achieves a smaller grain size than block sparsity at the same reuse rate. Column-vector-sparse-encoding can be applied to both SpMM and SDDMM, two major sparse DNN operations. We also introduce Tensor-Core-based 1D Octet Tiling, which has efficient memory-access and computation patterns under small grain sizes. Based on these techniques, we design SpMM and SDDMM kernels that achieve a 1.71-7.19x speedup over cuSPARSE. Practical speedup over cublasHgemm is achieved under 70% and 90% sparsity with a 4x1 grain size and half precision.
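To make the encoding concrete, below is a minimal CPU reference sketch of a 4x1 column-vector sparse format driving SpMM. It is a plain C++ illustration under assumed row-major layouts; the names CVSMatrix, encode, and spmm are hypothetical, and the paper's actual kernels run on Tensor Cores with their own on-device format.

    // Illustrative sketch (an assumption, not the paper's format): a 4x1
    // column-vector sparse encoding and a scalar SpMM reference,
    // computing C = A_sparse * B on the CPU.
    #include <cstdio>
    #include <vector>

    constexpr int V = 4;  // grain size: 4x1 column vectors

    struct CVSMatrix {  // hypothetical container; rows must be a multiple of V
        int rows, cols;
        // For each group of V consecutive rows: the column indices of its
        // nonzero 4x1 vectors, and the V values of each kept vector.
        std::vector<std::vector<int>>   col_idx;  // [rows/V][#vectors]
        std::vector<std::vector<float>> vals;     // [rows/V][#vectors * V]
    };

    // Keep a 4x1 vector whenever any of its V entries is nonzero.
    CVSMatrix encode(const std::vector<float>& A, int rows, int cols) {
        CVSMatrix m{rows, cols, {}, {}};
        m.col_idx.resize(rows / V);
        m.vals.resize(rows / V);
        for (int g = 0; g < rows / V; ++g)
            for (int c = 0; c < cols; ++c) {
                bool nz = false;
                for (int r = 0; r < V; ++r)
                    nz |= A[(g * V + r) * cols + c] != 0.0f;
                if (!nz) continue;
                m.col_idx[g].push_back(c);
                for (int r = 0; r < V; ++r)
                    m.vals[g].push_back(A[(g * V + r) * cols + c]);
            }
        return m;
    }

    // SpMM reference: C (rows x n) = A_sparse (rows x cols) * B (cols x n).
    void spmm(const CVSMatrix& A, const std::vector<float>& B, int n,
              std::vector<float>& C) {
        C.assign(A.rows * n, 0.0f);
        for (int g = 0; g < A.rows / V; ++g)
            for (size_t v = 0; v < A.col_idx[g].size(); ++v) {
                int c = A.col_idx[g][v];  // one row of B, reused by V rows of A
                for (int r = 0; r < V; ++r)
                    for (int j = 0; j < n; ++j)
                        C[(g * V + r) * n + j] +=
                            A.vals[g][v * V + r] * B[c * n + j];
            }
    }

    int main() {
        // 4x4 matrix whose only nonzeros form one 4x1 vector in column 2.
        std::vector<float> A(16, 0.0f);
        for (int r = 0; r < 4; ++r) A[r * 4 + 2] = float(r + 1);
        CVSMatrix S = encode(A, 4, 4);
        std::vector<float> B = {1, 2, 3, 4, 5, 6, 7, 8};  // 4x2, row-major
        std::vector<float> C;
        spmm(S, B, 2, C);
        printf("C[0][0] = %g (expected 5)\n", C[0]);  // 1 * B[2][0] = 5
        return 0;
    }

Each stored 4x1 vector shares one column index, so a fetched row of B is reused by all four rows in the grain. This is the reuse advantage the abstract attributes to vector encoding over scalar fine-grained sparsity, obtained at a grain small enough (4x1 rather than a larger block) to limit the impact on model quality.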
Citation
Chen, Z., Qu, Z., Liu, L., Ding, Y., & Xie, Y. (2021). Efficient Tensor Core-based GPU kernels for structured sparsity under reduced precision. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society. https://doi.org/10.1145/3458817.3476182