Enhancing Scalability of Pre-trained Language Models via Efficient Parameter Sharing

Abstract

In this paper, we propose a highly parameter-efficient approach to scaling pre-trained language models (PLMs) to greater depths. Unlike prior work that shares all parameters or introduces extra blocks, we design a more capable parameter-sharing architecture based on the matrix product operator (MPO), an efficient tensor decomposition method that factorizes a parameter matrix into a set of local tensors. Based on this decomposition, we share the important local tensor across all layers to reduce the model size, while keeping layer-specific tensors (also using adapters) to enhance adaptation flexibility. To improve model training, we further propose a stable initialization algorithm tailored to the MPO-based architecture. Extensive experiments demonstrate the effectiveness of the proposed model in enhancing scalability and achieving higher performance: with fewer parameters than BERT-base, we successfully scale the model depth by a factor of 4x and even achieve a GLUE score 0.1 points higher than BERT-large. The code to reproduce the results of this paper can be found at https://github.com/RUCAIBox/MPOBERT-code.
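To make the MPO idea concrete, below is a minimal sketch of MPO decomposition implemented as a sequence of truncated SVDs (the tensor-train construction specialized to matrices), using only NumPy. The function name `mpo_decompose`, the factor shapes, and the rank cap are illustrative assumptions for this sketch, not the API of the MPOBERT-code repository.

```python
# A minimal sketch of MPO (matrix product operator) decomposition via
# sequential truncated SVDs. Shapes, the rank cap, and all names here
# are illustrative assumptions, not taken from MPOBERT-code.
import numpy as np

def mpo_decompose(W, in_dims, out_dims, max_rank=64):
    """Factorize W (prod(in_dims) x prod(out_dims)) into a list of
    local 4-D tensors T_k with shape (r_{k-1}, in_k, out_k, r_k)."""
    n = len(in_dims)
    assert len(out_dims) == n
    assert W.shape == (int(np.prod(in_dims)), int(np.prod(out_dims)))

    # Reshape to (i1..in, j1..jn), then interleave input/output modes
    # so each index pair (i_k, j_k) sits next to each other.
    T = W.reshape(list(in_dims) + list(out_dims))
    perm = [p for k in range(n) for p in (k, n + k)]
    T = T.transpose(perm)

    cores, r_prev = [], 1
    for k in range(n - 1):
        # Split off the (r_{k-1}, i_k, j_k) block and SVD the remainder.
        rows = r_prev * in_dims[k] * out_dims[k]
        M = T.reshape(rows, -1)
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        r = min(max_rank, len(S))
        cores.append(U[:, :r].reshape(r_prev, in_dims[k], out_dims[k], r))
        T = S[:r, None] * Vt[:r]  # carry the rest forward to the next core
        r_prev = r
    cores.append(T.reshape(r_prev, in_dims[-1], out_dims[-1], 1))
    return cores

# Example: decompose a 768x768 weight into 3 local tensors.
W = np.random.randn(768, 768)
cores = mpo_decompose(W, in_dims=(8, 12, 8), out_dims=(8, 12, 8))
print([c.shape for c in cores])
# -> [(1, 8, 8, 64), (64, 12, 12, 64), (64, 8, 8, 1)]
```

In the paper's scheme, the large central tensor produced by such a decomposition is shared across all layers, while the small outer (auxiliary) tensors remain layer-specific; this sketch only shows how a single weight matrix can be split into those local tensors.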

Cite

APA

Liu, P., Gao, Z. F., Chen, Y., Zhao, W. X., & Wen, J. R. (2023). Enhancing Scalability of Pre-trained Language Models via Efficient Parameter Sharing. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 13771–13785). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-emnlp.920
