BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation

34 citations · 125 readers on Mendeley

Abstract

The success of bidirectional encoders using masked language models, such as BERT, on numerous natural language processing tasks has prompted researchers to try to incorporate these pre-trained models into neural machine translation (NMT) systems. However, proposed methods for incorporating pre-trained models are non-trivial and mainly focus on BERT, leaving unexamined the impact that other pre-trained models may have on translation performance. In this paper, we demonstrate that simply using the output (contextualized embeddings) of a tailored and suitable bilingual pre-trained language model (dubbed BiBERT) as the input of the NMT encoder achieves state-of-the-art translation performance. Moreover, we propose a stochastic layer selection approach and a dual-directional translation model to ensure sufficient utilization of the contextualized embeddings. Without using back translation, our best models achieve BLEU scores of 30.45 for En→De and 38.61 for De→En on the IWSLT'14 dataset, and 31.26 for En→De and 34.94 for De→En on the WMT'14 dataset, which exceed all previously published numbers.
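The core recipe the abstract describes, feeding a pre-trained language model's contextualized embeddings into an NMT encoder and stochastically choosing which layer's output to use during training, can be sketched as follows. This is a minimal illustration only: the checkpoint name, the uniform layer-sampling rule, and the helper function are assumptions for exposition, not the paper's exact implementation.

```python
# Minimal sketch (assumptions, not the paper's exact implementation): use the
# contextualized embeddings of a pre-trained bilingual LM as NMT encoder input,
# and during training stochastically select which hidden layer to draw them from.
import random

import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint; the paper's BiBERT is a custom bilingual model.
CHECKPOINT = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
lm = AutoModel.from_pretrained(CHECKPOINT, output_hidden_states=True)
lm.eval()  # treated here as a frozen feature extractor


def contextualized_embeddings(sentences, training=True):
    """Return one layer's hidden states to be fed to the NMT encoder."""
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden_states = lm(**batch).hidden_states  # (embeddings, layer 1, ..., layer L)
    if training:
        layer = random.randrange(1, len(hidden_states))  # stochastic layer selection
    else:
        layer = len(hidden_states) - 1  # default to the last layer at test time
    return hidden_states[layer]  # shape: (batch, seq_len, hidden_size)


# Example: these embeddings would replace the NMT encoder's own token embeddings.
src_repr = contextualized_embeddings(["Guten Morgen!"], training=False)
```

Intuitively, sampling a layer at random during training discourages the downstream NMT encoder from relying on any single layer's representation, which is one plausible way to make fuller use of the pre-trained model's hidden states.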

Citation (APA)

Xu, H., Van Durme, B., & Murray, K. (2021). BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021) (pp. 6663–6675). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.534
