Learning multi character alignment rules and classification of training data for transliteration

Dipankar Bose; Sudeshna Sarkar

Conference Proceedings

NEWS 2009 - 2009 Named Entities Workshop: Shared Task on Transliteration at the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL-IJCNLP 2009 (2009) 61-64

DOI: 10.3115/1699705.1699721

Abstract

We address transliteration between Indian languages and English, especially for named entities. We use an EM algorithm to learn the alignment between the languages. We find that there are many ambiguities in the rules that map characters in the source language to the corresponding characters in the target language. Some of these ambiguities can be handled by capturing context, both by learning multi-character alignments and by using character n-gram models. We also observed that words in the source script may have originated from different languages. Instead of learning one model for the language pair, we propose using multiple models and a classifier that decides which model to use. A contribution of this work is that the models and classifiers are learned in a completely unsupervised manner. Using our system, we were able to obtain quite accurate transliteration models.
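To make the idea of learning character alignments with EM concrete, here is a minimal sketch. It is not the authors' system: it assumes a plain IBM Model 1 style formulation over single characters, whereas the abstract describes multi-character alignments and n-gram context on top of this. The function name `train_char_alignment`, the iteration count, and the toy word pairs (Latin letters standing in for two scripts) are all hypothetical.

```python
from collections import defaultdict


def train_char_alignment(pairs, iterations=10):
    """Estimate prob[s][t] = P(target char t | source char s) with EM.

    Assumption: single-character, IBM Model 1 style alignment. This is a
    simplification for illustration, not the method of the paper.
    """
    src_vocab = {c for src, _ in pairs for c in src}
    tgt_vocab = {c for _, tgt in pairs for c in tgt}

    # Start from a uniform distribution over target characters.
    prob = {s: {t: 1.0 / len(tgt_vocab) for t in tgt_vocab} for s in src_vocab}

    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)

        # E-step: distribute each target character over the source
        # characters of the same word, in proportion to current estimates.
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(prob[s][t] for s in src)
                for s in src:
                    frac = prob[s][t] / norm
                    count[s][t] += frac
                    total[s] += frac

        # M-step: renormalize the expected counts per source character.
        for s in src_vocab:
            for t in tgt_vocab:
                prob[s][t] = count[s][t] / total[s] if total[s] > 0 else 0.0

    return prob


# Hypothetical toy data: "a" should align with "x", and "b" with "y".
toy_pairs = [("ab", "xy"), ("a", "x"), ("b", "y")]
model = train_char_alignment(toy_pairs)
best_for_a = max(model["a"], key=model["a"].get)
```

On the toy data, the unambiguous pairs pull the probability mass for the ambiguous pair toward the consistent mapping, which is the behavior EM alignment relies on. Handling the ambiguities mentioned in the abstract would require extending the units from single characters to character sequences.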

Cite

Bose, D., & Sarkar, S. (2009). Learning multi character alignment rules and classification of training data for transliteration. In NEWS 2009 - 2009 Named Entities Workshop: Shared Task on Transliteration at the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL-IJCNLP 2009 (pp. 61–64). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1699705.1699721
