We introduce Vecalign, a novel bilingual sentence alignment method which is linear in time and space with respect to the number of sentences being aligned and which requires only bilingual sentence embeddings. On a standard German-French test set, Vecalign outperforms the previous state-of-the-art method (which has quadratic time complexity and requires a machine translation system) by 5 F1 points. It substantially outperforms the popular Hunalign toolkit at recovering Bible verse alignments in medium- to low-resource language pairs, and it improves downstream MT quality by 1.7 and 1.6 BLEU in Sinhala→English and Nepali→English, respectively, compared to the Hunalign-based Paracrawl pipeline.
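To make the linear-time claim concrete, here is a minimal sketch of embedding-based alignment in which the dynamic program only visits cells within a fixed-width band around the diagonal, so time and memory grow with the number of sentences rather than with its square. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not describe the search procedure, and the function names, the `band` and `skip_cost` parameters, the cosine-distance cost, and the restriction to 1-1 alignments with skips are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def banded_align(src, tgt, band=2, skip_cost=0.6):
    """Align two lists of sentence embeddings with a banded DP.

    Hypothetical sketch: only cells within `band` of the (rescaled)
    diagonal are visited, so work is O(len(src) * band).
    Returns a list of (src_index, tgt_index) 1-1 pairs.
    """
    n, m = len(src), len(tgt)
    cost = {(0, 0): 0.0}  # best cost of aligning src[:i] with tgt[:j]
    back = {}             # back-pointer to the previous cell
    for i in range(n + 1):
        # Centre of the band for row i, rescaled to the target length.
        centre = round(i * m / n) if n else 0
        for j in range(max(0, centre - band), min(m, centre + band) + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0 and (i - 1, j - 1) in cost:
                # 1-1 alignment, scored by cosine distance.
                step = 1.0 - cosine(src[i - 1], tgt[j - 1])
                candidates.append((cost[(i - 1, j - 1)] + step, (i - 1, j - 1)))
            if i > 0 and (i - 1, j) in cost:
                # Leave a source sentence unaligned.
                candidates.append((cost[(i - 1, j)] + skip_cost, (i - 1, j)))
            if j > 0 and (i, j - 1) in cost:
                # Leave a target sentence unaligned.
                candidates.append((cost[(i, j - 1)] + skip_cost, (i, j - 1)))
            if candidates:
                cost[(i, j)], back[(i, j)] = min(candidates)
    if (n, m) not in cost:
        raise ValueError("band too narrow for these document lengths")
    # Walk the back-pointers, keeping only the diagonal (1-1) steps.
    pairs = []
    cell = (n, m)
    while cell != (0, 0):
        prev = back[cell]
        if prev == (cell[0] - 1, cell[1] - 1):
            pairs.append(prev)
        cell = prev
    return pairs[::-1]
```

With toy one-hot "embeddings", a target document that contains one extra sentence is handled by skipping it: `banded_align([a, b, c], [a, x, b, c])` pairs source sentences 0, 1, 2 with target sentences 0, 2, 3. A fixed band is a simplification: it fails when the two documents differ greatly in length or when the true alignment drifts far from the diagonal, which is why the sketch raises an error instead of guessing.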
CITATION STYLE
Thompson, B., & Koehn, P. (2019). Vecalign: Improved sentence alignment in linear time and space. In EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference (pp. 1342–1348). Association for Computational Linguistics. https://doi.org/10.18653/v1/d19-1136