Learning multi character alignment rules and classification of training data for transliteration

2Citations
Citations of this article
72Readers
Mendeley users who have this article in their library.
Get full text

Abstract

We address the issues of transliteration between Indian languages and English, especially for named entities. We use an EM algorithm to learn the alignment between the languages. We find that there are lot of ambiguities in the rules mapping the characters in the source language to the corresponding characters in the target language. Some of these ambiguities can be handled by capturing context by learning multi-character based alignments and use of character n-gram models. We observed that a word in the source script may have actually originated from different languages. Instead of learning one model for the language pair, we propose that one may use multiple models and a classifier to decide which model to use. A contribution of this work is that the models and classifiers are learned in a completely unsupervised manner. Using our system we were able to get quite accurate transliteration models.

Cite

CITATION STYLE

APA

Bose, D., & Sarkar, S. (2009). Learning multi character alignment rules and classification of training data for transliteration. In NEWS 2009 - 2009 Named Entities Workshop: Shared Task on Transliteration at the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL-IJCNLP 2009 (pp. 61–64). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1699705.1699721

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free