An Unsupervised Approach to Develop Stemmer

Mohd. Shahid Husain

Journal Article, Open Access

International Journal on Natural Language Computing (2012) 1(2) 15-23

DOI: 10.5121/ijnlc.2012.1202

Abstract

This paper presents an unsupervised approach to developing a stemmer, applied to the Urdu and Marathi languages. Over the last few years, a wide range of information in Indian regional languages has become available on the web as e-data. Access to these data repositories remains low, however, because few efficient search engines or retrieval systems support these languages. Automatic information processing and retrieval has therefore become an urgent requirement. The system is trained on datasets taken from CRULP [22] and a Marathi corpus [23]. Two approaches are proposed for generating suffix rules: frequency-based stripping and length-based stripping. The evaluation was carried out on 1200 words extracted from the Emille corpus. The experimental results show that for Urdu, the frequency-based suffix generation approach gives a maximum accuracy of 85.36%, whereas the length-based suffix stripping algorithm gives a maximum accuracy of 79.76%. For Marathi, the system gives 63.5% accuracy with frequency-based stripping and achieves a maximum accuracy of 82.5% with length-based suffix stripping.
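
The abstract does not spell out the two suffix-stripping strategies, so the following is only a minimal sketch of one plausible reading, not the author's implementation. It assumes that candidate suffixes are collected by splitting every training word at every position, that frequency-based stripping removes the matching suffix seen most often in training, and that length-based stripping removes the longest matching suffix. The function names, the `min_stem_len` and `min_freq` parameters, and the toy English word list are all hypothetical and stand in for the Urdu and Marathi training data.

```python
from collections import Counter


def count_suffixes(words, min_stem_len=2):
    """Count candidate suffixes by splitting each distinct word at every
    position that leaves a stem of at least min_stem_len characters."""
    counts = Counter()
    for word in set(words):
        for i in range(min_stem_len, len(word)):
            counts[word[i:]] += 1
    return counts


def build_suffix_list(counts, min_freq=2):
    """Keep only suffixes seen at least min_freq times (assumed threshold)."""
    return {suffix for suffix, freq in counts.items() if freq >= min_freq}


def stem_by_frequency(word, counts, suffixes, min_stem_len=2):
    """Strip the known suffix with the highest training frequency;
    ties are broken in favour of the longer suffix."""
    candidates = [word[i:] for i in range(min_stem_len, len(word))
                  if word[i:] in suffixes]
    if not candidates:
        return word
    best = max(candidates, key=lambda s: (counts[s], len(s)))
    return word[:len(word) - len(best)]


def stem_by_length(word, suffixes, min_stem_len=2):
    """Strip the longest known suffix that leaves a long-enough stem."""
    for i in range(min_stem_len, len(word)):  # smallest i = longest suffix
        if word[i:] in suffixes:
            return word[:i]
    return word


# Toy training data (hypothetical; the paper trains on Urdu and Marathi corpora).
training_words = ["walking", "walked", "walks",
                  "talking", "talked", "talks",
                  "jumping", "jumped", "jumps"]
suffix_counts = count_suffixes(training_words)
suffix_list = build_suffix_list(suffix_counts, min_freq=3)

print(stem_by_frequency("walking", suffix_counts, suffix_list))
print(stem_by_length("jumped", suffix_list))
```

Under these assumptions the two strategies differ only in how they choose among matching suffixes, which is one way the reported accuracies could diverge between languages with different suffix-length distributions.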

Cite

Shahid Husain, Mohd. (2012). An Unsupervised Approach to Develop Stemmer. International Journal on Natural Language Computing, 1(2), 15–23. https://doi.org/10.5121/ijnlc.2012.1202
