Lessons on Parameter Sharing across Layers in Transformers

Abstract

We propose a novel parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique that shares the parameters of one layer across all layers, as in Universal Transformers (Dehghani et al., 2019), to improve efficiency. We propose three strategies for assigning parameters to layers: SEQUENCE, CYCLE, and CYCLE (REV). Experimental results show that the proposed strategies are efficient with respect to both parameter size and computational time on machine translation tasks. We also demonstrate that the proposed strategies remain effective in settings with large amounts of training data, such as the recent WMT competitions. Moreover, we show that the proposed strategies are more efficient than the previous approach (Dehghani et al., 2019) on automatic speech recognition and language modeling tasks as well.
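To make the three strategies concrete, the following Python sketch shows one plausible way to map M unique parameter sets onto N layers, assuming N is a multiple of M. The function names and the exact assignment rules here are our reading of the paper, not the authors' released code.

# Illustration of the three layer-assignment strategies (our sketch, not the
# authors' implementation): given M unique parameter sets and N total layers,
# each function returns, per layer, the index of the parameter set it uses.

def sequence_assignment(m, n):
    # SEQUENCE: consecutive layers share parameters,
    # e.g. M=3, N=6 -> [0, 0, 1, 1, 2, 2]
    return [i * m // n for i in range(n)]

def cycle_assignment(m, n):
    # CYCLE: repeat the M parameter sets in order,
    # e.g. M=3, N=6 -> [0, 1, 2, 0, 1, 2]
    return [i % m for i in range(n)]

def cycle_rev_assignment(m, n):
    # CYCLE (REV): like CYCLE, but the final cycle runs in reverse order,
    # e.g. M=3, N=6 -> [0, 1, 2, 2, 1, 0]
    assignment = [i % m for i in range(n - m)]
    assignment += list(reversed(range(m)))
    return assignment

if __name__ == "__main__":
    for name, fn in [("SEQUENCE", sequence_assignment),
                     ("CYCLE", cycle_assignment),
                     ("CYCLE (REV)", cycle_rev_assignment)]:
        print(name, fn(3, 6))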

Citation (APA)

Takase, S., & Kiyono, S. (2023). Lessons on Parameter Sharing across Layers in Transformers. In Proceedings of the Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP) (pp. 78–90). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.sustainlp-1.5
