Hierarchical document encoder for parallel corpus mining


Abstract

We explore using multilingual document embeddings for nearest neighbor mining of parallel data. Three document-level representations are investigated: (i) document embeddings generated by simply averaging multilingual sentence embeddings; (ii) a neural bag-of-words (BoW) document encoding model; (iii) a hierarchical multilingual document encoder (HiDE) that builds on our sentence-level model. The results show that document embeddings derived from sentence-level averaging are surprisingly effective on clean datasets, but suggest that models trained hierarchically at the document level are more effective on noisy data. Analysis experiments demonstrate that our hierarchical models are very robust to variations in the underlying sentence embedding quality. Using document embeddings trained with HiDE achieves state-of-the-art performance on United Nations (UN) parallel document mining: 94.9% P@1 for en-fr and 97.3% P@1 for en-es.
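To make the mining setup concrete, the sketch below illustrates approach (i) and the evaluation metric: a document embedding is taken as the mean of its sentence embeddings, candidate pairs are mined by cosine nearest neighbor, and P@1 is the fraction of source documents whose top-ranked target is the gold match. This is a minimal illustration of the general technique, not the paper's actual models or code; the normalization choice and function names are assumptions.

```python
import numpy as np

def average_doc_embedding(sentence_embeddings):
    """Approach (i): represent a document as the L2-normalized mean of its
    multilingual sentence embeddings. Input is an (n_sentences, dim) array.
    (The averaging/normalization details here are illustrative assumptions.)"""
    doc = np.mean(sentence_embeddings, axis=0)
    return doc / np.linalg.norm(doc)

def mine_parallel_docs(src_docs, tgt_docs):
    """Nearest-neighbor mining: for each source document embedding, return
    the index of the most similar target document by cosine similarity
    (both matrices are assumed row-wise unit-normalized)."""
    sims = src_docs @ tgt_docs.T  # (n_src, n_tgt) cosine similarity matrix
    return np.argmax(sims, axis=1)

def precision_at_1(predictions):
    """P@1 when the gold alignment is the identity (source i <-> target i)."""
    gold = np.arange(len(predictions))
    return float(np.mean(predictions == gold))
```

In practice, exhaustive similarity search over millions of documents would be replaced by approximate nearest-neighbor indexing; the brute-force matrix product above is only for clarity.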

Citation (APA)

Guo, M., Yang, Y., Stevens, K., Cer, D., Ge, H., Sung, Y. H., … Kurzweil, R. (2019). Hierarchical document encoder for parallel corpus mining. In WMT 2019 - 4th Conference on Machine Translation, Proceedings of the Conference (Vol. 1, pp. 64–72). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w19-5207
