Structured Pruning for Efficient Generative Pre-trained Language Models

Abstract

The increasing sizes of large generative Pre-trained Language Models (PLMs) hinder their deployment in real-world applications. To obtain efficient PLMs, previous studies mostly focus on pruning the attention heads and feed-forward networks (FFNs) of the Transformer. Nevertheless, we find that in generative PLMs, the hidden dimension shared by many other modules (e.g., the embedding layer and layer normalization) contains persistent outliers regardless of the network input. In this study, we propose SIMPLE, a new structured pruning framework for generative PLMs that comprehensively investigates all of the above compressible components. To identify redundant network structures, we assign learnable masks over compressible components followed by sparse training. Various sizes of PLMs can be flexibly extracted via different thresholds, and are then task-specifically fine-tuned for further improvement. Extensive experiments on language modeling, summarization and machine translation validate the effectiveness of the proposed method. For example, the pruned BART brings a 1.51×/6.96× inference speedup on GPU/CPU with a 67% size reduction, and can be further combined with quantization for more than 25× compression.
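To make the mask-then-threshold recipe in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of learnable-mask structured pruning on a toy FFN block. It is not the authors' released implementation: the module name MaskedFFN, the plain L1 sparsity penalty, its 1e-3 weight, and the 0.05 threshold are all illustrative assumptions.

import torch
import torch.nn as nn

class MaskedFFN(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(hidden_dim, ffn_dim)
        self.fc2 = nn.Linear(ffn_dim, hidden_dim)
        # One learnable mask value per FFN neuron and per hidden dimension.
        self.ffn_mask = nn.Parameter(torch.ones(ffn_dim))
        self.hidden_mask = nn.Parameter(torch.ones(hidden_dim))

    def forward(self, x):
        h = torch.relu(self.fc1(x)) * self.ffn_mask   # mask FFN neurons
        return self.fc2(h) * self.hidden_mask         # mask shared hidden dims

    def sparsity_loss(self):
        # L1 penalty pushes redundant masks toward zero during sparse training.
        return self.ffn_mask.abs().sum() + self.hidden_mask.abs().sum()

# Sparse training on a toy regression objective (a stand-in for the task loss).
torch.manual_seed(0)
model = MaskedFFN(hidden_dim=64, ffn_dim=256)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 64), torch.randn(32, 64)
for _ in range(200):
    loss = (model(x) - y).pow(2).mean() + 1e-3 * model.sparsity_loss()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Threshold the learned masks to extract a smaller subnetwork; different
# thresholds yield different model sizes, which would then be fine-tuned.
keep = (model.ffn_mask.abs() > 0.05).nonzero(as_tuple=True)[0]
print(f"kept {keep.numel()} of {model.ffn_mask.numel()} FFN neurons")

Per the abstract, the paper applies this idea across attention heads, FFNs, and the hidden dimension shared by the embedding and layer-normalization modules of the full generative PLM, with each extracted subnetwork subsequently fine-tuned on the target task.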

Citation (APA)

Tao, C., Hou, L., Bai, H., Wei, J., Jiang, X., Liu, Q., … Wong, N. (2023). Structured Pruning for Efficient Generative Pre-trained Language Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 10880–10895). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.692
