Explicit cross-lingual pre-training for unsupervised machine translation

24 citations · 140 Mendeley readers
Abstract

Pre-training has proven to be effective in unsupervised machine translation due to its ability to model deep context information in cross-lingual scenarios. However, the cross-lingual information obtained from shared BPE spaces is inexplicit and limited. In this paper, we propose a novel cross-lingual pre-training method for unsupervised machine translation by incorporating explicit cross-lingual training signals. Specifically, we first calculate cross-lingual n-gram embeddings and infer an n-gram translation table from them. With those n-gram translation pairs, we propose a new pre-training model called Cross-lingual Masked Language Model (CMLM), which randomly chooses source n-grams in the input text stream and predicts their translation candidates at each time step. Experiments show that our method can incorporate beneficial cross-lingual information into pre-trained models. Taking pre-trained CMLM models as the encoder and decoder, we significantly improve the performance of unsupervised machine translation. Our code is available at https://github.com/Imagist-Shuo/CMLM.
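The abstract describes two concrete steps: inferring an n-gram translation table from cross-lingual n-gram embeddings, and masking chosen source n-grams so the pre-trained model predicts their translation candidates. The sketch below illustrates both ideas under strong simplifications; the toy embeddings, the function names (build_translation_table, make_cmlm_example), and the masking heuristic are assumptions made for illustration, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of the two steps described in the abstract.
# All names, vectors, and hyper-parameters are illustrative assumptions.
import random
import numpy as np

MASK = "<mask>"

def build_translation_table(src_vecs, tgt_vecs, k=2):
    """Map each source n-gram to its k nearest target n-grams by cosine
    similarity in a shared cross-lingual embedding space (toy vectors here)."""
    tgt_items = list(tgt_vecs.items())
    tgt_mat = np.stack([v / np.linalg.norm(v) for _, v in tgt_items])
    table = {}
    for src, v in src_vecs.items():
        sims = tgt_mat @ (v / np.linalg.norm(v))
        top = np.argsort(-sims)[:k]
        table[src] = [tgt_items[i][0] for i in top]
    return table

def make_cmlm_example(tokens, table, max_ngram=3, mask_prob=0.15):
    """Randomly pick source n-grams that appear in the translation table,
    mask them in the input stream, and record their target-side translation
    candidates as the labels to predict at those positions."""
    inp, labels, i = list(tokens), {}, 0
    while i < len(tokens):
        masked = False
        if random.random() < mask_prob:
            for n in range(max_ngram, 0, -1):        # prefer longer n-grams
                piece = tokens[i:i + n]
                if len(piece) < n:
                    continue
                ngram = " ".join(piece)
                if ngram in table:
                    for j in range(i, i + n):
                        inp[j] = MASK
                    labels[i] = table[ngram]         # translation candidates
                    i += n
                    masked = True
                    break
        if not masked:
            i += 1
    return inp, labels

# Toy usage: a two-entry "embedding space" and one training example.
src_vecs = {"machine translation": np.array([1.0, 0.1]),
            "deep": np.array([0.2, 1.0])}
tgt_vecs = {"traduction automatique": np.array([0.9, 0.2]),
            "profond": np.array([0.1, 1.0])}
table = build_translation_table(src_vecs, tgt_vecs, k=1)
inp, labels = make_cmlm_example("we study deep machine translation".split(),
                                table, mask_prob=1.0)
# inp    -> ['we', 'study', '<mask>', '<mask>', '<mask>']
# labels -> {2: ['profond'], 3: ['traduction automatique']}
```

In the paper's setup, a model pre-trained to recover such translation candidates at the masked positions is then used to initialize the encoder and decoder of the unsupervised translation system, which is how the explicit cross-lingual signal is injected.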

Citation (APA)
Ren, S., Wu, Y., Liu, S., Zhou, M., & Ma, S. (2019). Explicit cross-lingual pre-training for unsupervised machine translation. In EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference (pp. 770–779). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1071
