Stemming is considered crucial in many NLP and IR applications. In the absence of any linguistic information, stemming is a challenging task. Stemming of words using suffixes of a language as linguistic information is in comparison an easier problem. In this work we considered stemming as a process of obtaining minimum number of lexicon from an unannotated corpus by using a suffix set. We proved that the exact lexicon reduction problem is NP-hard and came up with a polynomial time approximation. One probabilistic model that minimizes the stem distributional entropy is also proposed for stemming. Performances of these models are analyzed using an unannotated corpus and a suffix set of Malayalam, a morphologically rich language of India belonging to the Dravidian family. © 2012 Springer-Verlag.
CITATION STYLE
Vasudevan, N., & Bhattacharyya, P. (2012). Optimal stem identification in presence of suffix list. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7181 LNCS, pp. 92–103). https://doi.org/10.1007/978-3-642-28604-9_8
Mendeley helps you to discover research relevant for your work.