We present an efficient method for automatically transforming spoken-language text into standard written-language text for various dialects of Tamil. Our work is novel in that it explicitly addresses the need for processing dialectal and spoken Tamil. Written-language equivalents of dialectal and spoken forms are obtained using Finite State Transducers (FSTs), which replace spoken-language suffixes with the appropriate written-language suffixes. Agglutination and compounding in the resulting text are handled by a Conditional Random Field (CRF) based word boundary identifier. Essential Sandhi corrections are carried out by a heuristic Sandhi Corrector, which normalizes the segmented words into simpler, sensible words. In experimental evaluations, the dialectal spoken-to-written transformer (DSWT) achieved an encouraging accuracy of over 85% on the transformation task and also improved the translation quality of a Tamil-English machine translation system by 40%. It should be noted that there is no prior published computational work on processing Tamil dialects. Ours is the first attempt to study various dialects of Tamil from a computational point of view. Thus, the work reported here is pioneering.
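The suffix-replacement step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses a plain longest-suffix lookup as a stand-in for a compiled FST, and the rule table, the rough romanization, and the function names `to_written` and `transform` are all hypothetical, chosen only for illustration. The CRF word boundary identifier and the Sandhi Corrector are not modelled here.

```python
# Toy stand-in for the FST suffix-replacement stage (an assumption, not the paper's code).
# Hypothetical rule table: spoken-language suffix -> written-language suffix,
# in a rough romanization. These pairs are illustrative, not the paper's rule set.
SUFFIX_RULES = {
    "aanga": "aargal",
    "nga": "rgal",
}


def to_written(word, rules=SUFFIX_RULES):
    """Replace the longest matching spoken suffix with its written counterpart."""
    # Try longer suffixes first so the most specific rule wins.
    for spoken in sorted(rules, key=len, reverse=True):
        # Require a non-empty stem so a bare suffix is left untouched.
        if word.endswith(spoken) and len(word) > len(spoken):
            return word[: -len(spoken)] + rules[spoken]
    # No rule applies: the word is returned unchanged.
    return word


def transform(sentence, rules=SUFFIX_RULES):
    """Apply suffix replacement to each whitespace-separated token."""
    return " ".join(to_written(token, rules) for token in sentence.split())


if __name__ == "__main__":
    print(transform("avanga vandhaanga"))
```

A real transducer would encode such rules as state transitions and could be composed with further stages; the dictionary lookup here only mirrors the input-to-output mapping of a single suffix rule set.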
CITATION STYLE
Marimuthu, K., & Devi, S. L. (2014). Automatic conversion of dialectal Tamil text to standard written Tamil text using FSTs. In 2014 Joint Meeting of SIGMORPHON and SIGFSM MORPHFSM 2014, Proceedings (pp. 37–45). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/w14-2805