Sandhi splitting is a fundamental step in natural language processing (NLP) applications for languages with agglutinative morphology. This paper presents a statistical approach to building a sandhi splitter for agglutinative languages. The input to the model is a valid string in the language, and the output is a split of that string into one or more meaningful words. The approach comprises two stages, Segmentation and Word generation, both of which use conditional random fields (CRFs). Our approach is robust and language independent. Results for two Dravidian languages, Telugu and Malayalam, show accuracies of 89.07% and 90.50%, respectively.
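The two-stage pipeline described in the abstract can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the character-window features, the `S`/`N` label scheme, the function names, and the romanized Telugu example (`ramudu` + `atadu` joining as `ramudatadu`) are not taken from the paper. The two trained CRFs are replaced by hard-coded stand-ins so the example runs without any external library.

```python
from typing import Callable, Dict, List


def char_features(text: str, i: int, window: int = 2) -> Dict[str, str]:
    """Context features for the character at position i (an assumed feature set)."""
    feats = {"char": text[i]}
    for off in range(1, window + 1):
        feats[f"prev{off}"] = text[i - off] if i - off >= 0 else "<s>"
        feats[f"next{off}"] = text[i + off] if i + off < len(text) else "</s>"
    return feats


def segment(text: str, labels: List[str]) -> List[str]:
    """Stage 1 output: cut after every character labelled 'S'; 'N' means no split."""
    if len(labels) != len(text):
        raise ValueError("need exactly one label per character")
    pieces, start = [], 0
    for i, label in enumerate(labels):
        # A split label on the final character would create an empty piece, so skip it.
        if label == "S" and i + 1 < len(text):
            pieces.append(text[start:i + 1])
            start = i + 1
    pieces.append(text[start:])
    return pieces


def split_sandhi(
    text: str,
    label_chars: Callable[[List[Dict[str, str]]], List[str]],
    restore_word: Callable[[str], str],
) -> List[str]:
    """Run segmentation, then word generation, on one input string."""
    features = [char_features(text, i) for i in range(len(text))]
    labels = label_chars(features)  # stage 1: a sequence labeller such as a CRF
    return [restore_word(piece) for piece in segment(text, labels)]  # stage 2


def toy_labeler(features: List[Dict[str, str]]) -> List[str]:
    # Hard-coded stand-in for a trained segmentation model: split after the 5th character.
    return ["S" if i == 4 else "N" for i in range(len(features))]


def toy_restorer(piece: str) -> str:
    # Hard-coded stand-in for a trained word-generation model.
    return {"ramud": "ramudu"}.get(piece, piece)


print(split_sandhi("ramudatadu", toy_labeler, toy_restorer))
# prints ['ramudu', 'atadu']
```

Passing the two models in as callables keeps the pipeline shape separate from the choice of learner, so a trained sequence labeller could replace either stand-in without changing `split_sandhi`.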
Kuncham, P., Nelakuditi, K., Nallani, S., & Mamidi, R. (2015). Statistical sandhi splitter for agglutinative languages. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9041, pp. 164–172). Springer Verlag. https://doi.org/10.1007/978-3-319-18111-0_13