We propose an unsupervised approach to substring-based transliteration that incorporates two new sources of knowledge into the learning process: (i) context, by learning substring mappings rather than single-character mappings, and (ii) phonetic features, which capture cross-lingual character similarity via prior distributions. Our approach is a two-stage, iterative bootstrapping solution. It vastly outperforms the state-of-the-art unsupervised transliteration method of Ravi and Knight (2009) and outperforms a rule-based baseline by up to 50% in top-1 accuracy on multiple language pairs. We show that substring-based models are superior to character-based models, and observe that their top-10 accuracy is comparable to the top-1 accuracy of supervised systems. Our method requires only a phonemic representation of the words. Such a representation is readily available for the many language-script combinations with a high grapheme-to-phoneme correspondence, e.g., the scripts of Indian languages derived from the Brahmi script. Hence, Indian languages were the focus of our experiments. For other languages, a grapheme-to-phoneme converter would be required.
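To make point (ii) concrete, the following is a minimal sketch of how phonetic features could be turned into a prior distribution over target characters. It is an illustration under stated assumptions, not the authors' implementation: the feature inventory, the feature values, the placeholder characters, the use of cosine similarity, and the names `cosine` and `phonetic_prior` are all hypothetical.

```python
import math

# Hypothetical binary phonetic features, in the order:
# (voiced, aspirated, nasal, labial, dental, velar).
# The characters and values are placeholders, not taken from the paper.
SOURCE_FEATURES = {
    "p": (0, 0, 0, 1, 0, 0),
    "b": (1, 0, 0, 1, 0, 0),
    "k": (0, 0, 0, 0, 0, 1),
}
TARGET_FEATURES = {
    "P": (0, 0, 0, 1, 0, 0),
    "B": (1, 0, 0, 1, 0, 0),
    "M": (1, 0, 1, 1, 0, 0),
    "K": (0, 0, 0, 0, 0, 1),
}


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def phonetic_prior(src_char, smoothing=0.01):
    """Return a distribution over target characters for one source character.

    Each target character is scored by its phonetic similarity to the source
    character plus a small smoothing constant (an assumption made here so that
    no mapping gets exactly zero mass); the scores are then normalized.
    """
    src_vec = SOURCE_FEATURES[src_char]
    scores = {
        tgt: cosine(src_vec, tgt_vec) + smoothing
        for tgt, tgt_vec in TARGET_FEATURES.items()
    }
    total = sum(scores.values())
    return {tgt: score / total for tgt, score in scores.items()}


prior = phonetic_prior("b")
best = max(prior, key=prior.get)  # the phonetically closest target character
```

In this toy setup, a prior of this kind would bias an unsupervised learner toward mapping each source character to phonetically similar target characters before any transliteration pairs have been observed.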
CITATION
Kunchukuttan, A., Bhattacharyya, P., & Khapra, M. M. (2016). Substring-based unsupervised transliteration with phonetic and contextual knowledge. In CoNLL 2016 - 20th SIGNLL Conference on Computational Natural Language Learning, Proceedings (pp. 270–279). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/k16-1027