A thesaurus is a reference work that lists words grouped together according to similarity of meaning (containing synonyms and sometimes antonyms), in contrast to a dictionary, which contains definitions and pronunciations. This paper proposes an innovative approach to improve the classification performance of Persian texts considering a very large thesaurus. The paper proposes a flexible method to recognize and categorize the Persian texts employing a thesaurus as a helpful knowledge. In the corpus, when utilizing the thesaurus the method obtains a more representative set of word-frequencies comparing to those obtained when the method disables the thesaurus. Two types of word relationships are considered in our used thesaurus. This is the first attempt to use a Persian thesaurus in the field of Persian information retrieval. The k-nearest neighbor classifier, decision tree classifier and k-means clustering algorithm are employed as classifier over the frequency based features. Experimental results indicate enabling thesaurus causes the method significantly outperforms in text classification and clustering.
CITATION STYLE
Iman Jamnejad, M., Heidarzadegan, A., & Meshki, M. (2014). Text Recognition with k-means Clustering. Research in Computing Science, 84(1), 29–40. https://doi.org/10.13053/rcs-84-1-3
Mendeley helps you to discover research relevant for your work.