Lessons on Parameter Sharing across Layers in Transformers

Abstract

We propose a novel parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique that shares the parameters of one layer across all layers, as in Universal Transformers (Dehghani et al., 2019), to improve efficiency. We propose three strategies for assigning parameters to layers: SEQUENCE, CYCLE, and CYCLE (REV). Experimental results show that the proposed strategies are efficient with respect to both parameter size and computational time on machine translation tasks. We also demonstrate that the proposed strategies remain effective in settings with large amounts of training data, such as the recent WMT competitions. Moreover, we show that the proposed strategies are more efficient than the previous approach (Dehghani et al., 2019) on automatic speech recognition and language modeling tasks as well.
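To make the three strategies concrete, the following Python sketch shows one plausible way to map M unique parameter sets onto N layers, assuming N is a multiple of M. The function names and the exact assignment rules here are our reading of the paper, not the authors' released code.

# Illustration of the three layer-assignment strategies (our sketch, not the
# authors' implementation): given M unique parameter sets and N total layers,
# each function returns, per layer, the index of the parameter set it uses.

def sequence_assignment(m, n):
    # SEQUENCE: consecutive layers share parameters,
    # e.g. M=3, N=6 -> [0, 0, 1, 1, 2, 2]
    return [i * m // n for i in range(n)]

def cycle_assignment(m, n):
    # CYCLE: repeat the M parameter sets in order,
    # e.g. M=3, N=6 -> [0, 1, 2, 0, 1, 2]
    return [i % m for i in range(n)]

def cycle_rev_assignment(m, n):
    # CYCLE (REV): like CYCLE, but the final cycle runs in reverse order,
    # e.g. M=3, N=6 -> [0, 1, 2, 2, 1, 0]
    assignment = [i % m for i in range(n - m)]
    assignment += list(reversed(range(m)))
    return assignment

if __name__ == "__main__":
    for name, fn in [("SEQUENCE", sequence_assignment),
                     ("CYCLE", cycle_assignment),
                     ("CYCLE (REV)", cycle_rev_assignment)]:
        print(name, fn(3, 6))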

Citation (APA)

Takase, S., & Kiyono, S. (2023). Lessons on Parameter Sharing across Layers in Transformers. In Proceedings of the Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP) (pp. 78–90). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.sustainlp-1.5
