Efficient Tensor Core-Based GPU Kernels for Structured Sparsity Under Reduced Precision

46 citations · 23 Mendeley readers

Abstract

The success of DNNs comes at the expense of excessive memory and computation cost, which can be addressed by jointly exploiting reduced precision and sparsity. Existing sparse GPU kernels, however, fail to achieve practical speedup over cublasHgemm at half precision: kernels for fine-grained sparsity suffer from low data reuse, while kernels for coarse-grained sparsity are caught in a trade-off between kernel performance and model quality across grain sizes. We propose column-vector-sparse-encoding, which has a smaller grain size than block sparsity at the same reuse rate. Column-vector-sparse-encoding can be applied to both SpMM and SDDMM, the two major sparse DNN operations. We also introduce Tensor-Core-based 1D Octet Tiling, which provides efficient memory access and computation patterns at small grain sizes. Based on these techniques, we design SpMM and SDDMM kernels that achieve a 1.71-7.19x speedup over cuSPARSE, and practical speedup over cublasHgemm at 70% and 90% sparsity with a 4x1 grain size in half precision.
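To make the encoding concrete, below is a minimal functional sketch of SpMM (C = A x B) where the sparse matrix A is stored as 4x1 column vectors, written in plain CUDA. The format layout (values, col_idx, row_ptr), the kernel name, and all parameters are illustrative assumptions inferred from the abstract, not the paper's actual implementation; the sketch uses ordinary CUDA cores and float arithmetic rather than the paper's Tensor Core, half-precision pipeline.

// Sketch of SpMM (C = A * B) with a hypothetical column-vector sparse
// encoding using V = 4x1 vertical vectors, loosely following the format
// described in the abstract. Functional reference only.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

constexpr int V = 4;  // grain height: 4x1 column vectors

// A (M x K) is encoded per "stripe" of V consecutive rows:
//   row_ptr[s]..row_ptr[s+1]  -> nonzero vectors of stripe s
//   col_idx[j]                -> column of vector j in A
//   values[j*V + v]           -> v-th element of vector j
__global__ void vector_sparse_spmm(const float* values, const int* col_idx,
                                   const int* row_ptr, const float* B,
                                   float* C, int M, int N) {
  int s = blockIdx.y;                            // stripe of V rows of A/C
  int n = blockIdx.x * blockDim.x + threadIdx.x; // output column
  if (n >= N || s * V >= M) return;

  float acc[V] = {0.f, 0.f, 0.f, 0.f};
  for (int j = row_ptr[s]; j < row_ptr[s + 1]; ++j) {
    float b = B[col_idx[j] * N + n];             // one B element reused V times
    for (int v = 0; v < V; ++v)
      acc[v] += values[j * V + v] * b;
  }
  for (int v = 0; v < V; ++v)
    C[(s * V + v) * N + n] = acc[v];
}

int main() {
  // Tiny example: M=4 (one stripe), K=8, N=2, two nonzero 4x1 vectors.
  const int M = 4, K = 8, N = 2;
  float h_vals[8] = {1, 2, 3, 4, 5, 6, 7, 8};    // vectors at columns 0 and 5
  int h_cols[2] = {0, 5};
  int h_rptr[2] = {0, 2};

  float *vals, *B, *C; int *cols, *rptr;
  cudaMallocManaged(&vals, sizeof(h_vals));
  cudaMallocManaged(&cols, sizeof(h_cols));
  cudaMallocManaged(&rptr, sizeof(h_rptr));
  cudaMallocManaged(&B, K * N * sizeof(float));
  cudaMallocManaged(&C, M * N * sizeof(float));
  memcpy(vals, h_vals, sizeof(h_vals));
  memcpy(cols, h_cols, sizeof(h_cols));
  memcpy(rptr, h_rptr, sizeof(h_rptr));
  for (int i = 0; i < K * N; ++i) B[i] = 1.0f;   // all-ones B for easy checking

  dim3 grid((N + 31) / 32, M / V);
  vector_sparse_spmm<<<grid, 32>>>(vals, cols, rptr, B, C, M, N);
  cudaDeviceSynchronize();
  for (int r = 0; r < M; ++r)                    // expect rows 6, 8, 10, 12
    printf("%.0f %.0f\n", C[r * N], C[r * N + 1]);
  return 0;
}

Note how each loaded element of B is reused across the four rows of a vector: this is the reuse advantage the abstract claims over unstructured 1x1 sparsity, while the 4x1 grain stays much smaller than a 2D block of comparable reuse.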

Citation (APA)

Chen, Z., Qu, Z., Liu, L., Ding, Y., & Xie, Y. (2021). Efficient Tensor Core-based GPU kernels for structured sparsity under reduced precision. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC. IEEE Computer Society. https://doi.org/10.1145/3458817.3476182
