Bilingual lexicon induction from non-parallel data with minimal supervision

Abstract

Building bilingual lexica from non-parallel data is a longstanding research problem in natural language processing, and a solution could benefit thousands of resource-scarce languages that lack parallel data. Recent advances in continuous word representations have opened up new possibilities for this task, e.g. by establishing a cross-lingual mapping between word embeddings via a seed lexicon. This method is, however, unreliable when only a limited number of seeds is available, which is a reasonable setting for resource-scarce languages. We tackle this limitation by introducing a novel matching mechanism into bilingual word representation learning. It captures extra translation pairs exposed by the seeds to incrementally improve the bilingual word embeddings. In our experiments, we find that the matching mechanism substantially improves the quality of the bilingual vector space, which in turn allows us to induce better bilingual lexica with as few as 10 seeds.
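
As a rough illustration of the seed-lexicon approach that the abstract takes as its starting point (and not of the authors' matching mechanism, which the abstract does not specify), the sketch below fits a least-squares linear map from source to target embeddings using the seed pairs, then induces a translation for each source word by cosine nearest neighbour. Everything in it is an assumption made for illustration: the function names, the toy 2-dimensional vectors, the English and Spanish words, and the choice of a plain least-squares map solved with a closed-form 2x2 inverse.

```python
import math

def fit_linear_map(src_vecs, tgt_vecs):
    """Least-squares 2x2 matrix W such that x @ W approximates y for each seed pair."""
    a = [[0.0, 0.0], [0.0, 0.0]]  # accumulates X^T X
    b = [[0.0, 0.0], [0.0, 0.0]]  # accumulates X^T Y
    for x, y in zip(src_vecs, tgt_vecs):
        for i in range(2):
            for j in range(2):
                a[i][j] += x[i] * x[j]
                b[i][j] += x[i] * y[j]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if abs(det) < 1e-12:
        raise ValueError("seed vectors do not span the 2-D space")
    inv = [[a[1][1] / det, -a[0][1] / det],
           [-a[1][0] / det, a[0][0] / det]]
    # W = (X^T X)^{-1} X^T Y
    return [[sum(inv[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def apply_map(w, x):
    """Map one source vector into the target space (row vector times W)."""
    return [x[0] * w[0][j] + x[1] * w[1][j] for j in range(2)]

def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return dot / norm if norm else 0.0

def induce_lexicon(src_emb, tgt_emb, seeds):
    """Fit the map on the seed pairs, then pick the nearest target word for each source word."""
    w = fit_linear_map([src_emb[s] for s, _ in seeds],
                       [tgt_emb[t] for _, t in seeds])
    lexicon = {}
    for word, vec in src_emb.items():
        mapped = apply_map(w, vec)
        lexicon[word] = max(tgt_emb, key=lambda t: cosine(mapped, tgt_emb[t]))
    return lexicon

# Toy data (hypothetical): the target space is the source space rotated by 90 degrees.
src_emb = {"dog": [1.0, 0.0], "cat": [0.0, 1.0],
           "house": [1.0, 1.0], "tree": [1.0, -1.0]}
tgt_emb = {"perro": [0.0, 1.0], "gato": [-1.0, 0.0],
           "casa": [-1.0, 1.0], "arbol": [1.0, 1.0]}
seeds = [("dog", "perro"), ("cat", "gato")]  # only two seed pairs

lexicon = induce_lexicon(src_emb, tgt_emb, seeds)
print(lexicon)
```

In this toy space the two seeds determine the map exactly, so the held-out words "house" and "tree" are matched to "casa" and "arbol". With real embeddings and very few seeds the fitted map is poorly constrained, which is the unreliability the abstract describes.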

Cite

Zhang, M., Peng, H., Liu, Y., Luan, H., & Sun, M. (2017). Bilingual lexicon induction from non-parallel data with minimal supervision. In 31st AAAI Conference on Artificial Intelligence, AAAI 2017 (pp. 3379–3385). AAAI Press. https://doi.org/10.1609/aaai.v31i1.10988
