HMC-TRAN: A Tensor-core Inspired Hierarchical Model Compression for Transformer-based DNNs on GPU


Abstract

Although Transformer-based deep learning models have been widely used in many natural language processing (NLP) tasks as well as computer vision, they suffer from gigantic model size and long latency. Network pruning can reduce the computational cost and model size. However, existing works mainly focus on irregular (sparse) pruning, which often causes irregular computations and extra indices per remaining weight. In this work, we propose a Tensor-core inspired hierarchical model compression method to push the performance limit on modern GPUs. We present two modes of a two-step process. In the first mode, we use a Tensor-core aware block-based weight pruning method to exploit model sparsity in a coarse-grained manner and then use low-rank decomposition [33] to further reduce the weight storage in a fine-grained manner. In the second mode, we first use irregular pruning to achieve a highly sparse model and then apply the Tensor-core aware weight constraint on the sparse model to decompose the sparse matrix into several smaller but Tensor-core friendly sub-matrices. Experiments on Transformer and BERT-Base models show that the proposed method outperforms the state of the art.

Citation (APA)

Huang, S., Chen, S., Peng, H., Manu, D., Kong, Z., Yuan, G., … Ding, C. (2021). HMC-TRAN: A Tensor-core Inspired Hierarchical Model Compression for Transformer-based DNNs on GPU. In Proceedings of the ACM Great Lakes Symposium on VLSI, GLSVLSI (pp. 169–174). Association for Computing Machinery. https://doi.org/10.1145/3453688.3461740
