Pretrained Bidirectional Distillation for Machine Translation

2 citations · 16 Mendeley readers

Abstract

Knowledge transfer can boost neural machine translation (NMT), for example by fine-tuning a pretrained masked language model (LM). However, fine-tuning may suffer from the forgetting problem and from the structural inconsistency between pretrained LMs and NMT models. Knowledge distillation (KD) is a potential remedy for these issues, yet few studies have investigated transferring language knowledge from pretrained LMs to NMT models through KD. In this paper, we propose Pretrained Bidirectional Distillation (PBD) for NMT, which efficiently transfers bidirectional language knowledge from masked language pretraining to NMT models. Its efficiency and effectiveness stem from a globally defined, bidirectional context-aware distillation objective: bidirectional language knowledge of the entire sequence is transferred to the NMT model concurrently with translation training. Specifically, we propose self-distilled masked language pretraining to obtain the PBD objective, and we design PBD losses that use this objective to distill language knowledge, in the form of token probabilities, into the encoder and decoder of an NMT model. Extensive experiments show that pretrained bidirectional distillation significantly improves translation quality and achieves competitive or better results than previous pretrain-finetune or unified multilingual translation methods in supervised, unsupervised, and zero-shot scenarios, indicating that it is an effective and efficient way to transfer language knowledge from pretrained LMs to NMT models.
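To make the token-probability distillation idea concrete, the following is a minimal PyTorch sketch of a bidirectional, token-level distillation loss of the kind the abstract describes: per-token probabilities from a pretrained masked LM (the teacher) supervise per-token predictions from an NMT encoder or decoder projection head (the student) over the whole sequence at once. The function and tensor names are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def pbd_distillation_loss(teacher_probs, student_logits, pad_mask):
        # teacher_probs:  (batch, seq_len, vocab) probabilities from the pretrained masked LM
        # student_logits: (batch, seq_len, vocab) logits from the NMT encoder/decoder projection head
        # pad_mask:       (batch, seq_len), 1 for real tokens, 0 for padding
        student_log_probs = F.log_softmax(student_logits, dim=-1)
        # Token-level KL(teacher || student), summed over the vocabulary.
        kl = F.kl_div(student_log_probs, teacher_probs, reduction="none").sum(dim=-1)
        # Average over non-padding positions so every token in the sequence contributes concurrently.
        mask = pad_mask.to(kl.dtype)
        return (kl * mask).sum() / mask.sum().clamp(min=1)

In training, a loss of this form would be added to the usual translation cross-entropy, so the language knowledge is distilled while the NMT model is being trained on parallel data rather than in a separate fine-tuning stage.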

Cite (APA)

Zhuang, Y., & Tu, M. (2023). Pretrained Bidirectional Distillation for Machine Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 1132–1145). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.63
