Substring-based unsupervised transliteration with phonetic and contextual knowledge

Abstract

We propose an unsupervised approach for substring-based transliteration which incorporates two new sources of knowledge in the learning process: (i) context, by learning substring mappings as opposed to single-character mappings, and (ii) phonetic features, which capture cross-lingual character similarity via prior distributions. Our approach is a two-stage, iterative bootstrapping solution, which vastly outperforms the state-of-the-art unsupervised transliteration method of Ravi and Knight (2009) and outperforms a rule-based baseline by up to 50% in top-1 accuracy on multiple language pairs. We show that substring-based models are superior to character-based models, and observe that their top-10 accuracy is comparable to the top-1 accuracy of supervised systems. Our method requires only a phonemic representation of the words. Such a representation is available for many language-script combinations with a high grapheme-to-phoneme correspondence, e.g., scripts of Indian languages derived from the Brahmi script. Hence, Indian languages were the focus of our experiments. For other languages, a grapheme-to-phoneme converter would be required.
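The abstract reports results as top-1 and top-10 accuracy. As a minimal sketch of how such a metric is commonly computed (not the authors' evaluation code), the Python snippet below counts a source word as correct if its reference transliteration appears among the first k ranked candidates. The function name `top_k_accuracy` and the toy data are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def top_k_accuracy(
    candidates: Dict[str, List[str]],
    references: Dict[str, str],
    k: int,
) -> float:
    """Fraction of source words whose reference is in the first k candidates.

    `candidates` maps each source word to its ranked candidate list
    (best first); `references` maps each source word to one reference
    transliteration. Both structures are illustrative assumptions.
    """
    if not references:
        return 0.0
    hits = sum(
        1
        for src, ref in references.items()
        if ref in candidates.get(src, [])[:k]
    )
    return hits / len(references)


# Hypothetical toy data: three source words with ranked candidates.
candidates = {
    "word_a": ["x1", "x2", "x3"],  # reference ranked first
    "word_b": ["y2", "y1"],        # reference ranked second
    "word_c": ["z3", "z4"],        # reference not produced at all
}
references = {"word_a": "x1", "word_b": "y1", "word_c": "z1"}

top1 = top_k_accuracy(candidates, references, k=1)
top2 = top_k_accuracy(candidates, references, k=2)
```

On this toy data only `word_a` is correct at k=1, while `word_b` also counts at k=2, which illustrates why a system's top-10 accuracy can be much higher than its top-1 accuracy.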

Cite

Kunchukuttan, A., Bhattacharyya, P., & Khapra, M. M. (2016). Substring-based unsupervised transliteration with phonetic and contextual knowledge. In CoNLL 2016 - 20th SIGNLL Conference on Computational Natural Language Learning, Proceedings (pp. 270–279). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/k16-1027
