Analysis of similarity measures with WordNet based text document clustering

Nadella Sandhya; A. Govardhan

Conference Proceedings

Advances in Intelligent and Soft Computing (2012) 132 AISC 703-714

DOI: 10.1007/978-3-642-27443-5_80

16 Citations

7 Readers

Abstract

Text document clustering helps reorganize large document collections into a smaller number of manageable clusters. Although several clustering methods and associated similarity measures have been proposed, partitional clustering algorithms are reported to perform well on document clustering. The cosine function is usually used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. Word meanings represent the topics of documents better than word forms do, so we incorporate an ontology into the text clustering algorithm. In this research, a WordNet-based document representation is built by assigning each word a part-of-speech (POS) tag and by enriching the 'bag-of-words' representation with synset concepts, the synonym sets introduced by WordNet. After the words in the 'bag of words' are replaced with their respective synset IDs, a variant of the K-Means algorithm is used for document clustering. We then compare three popular similarity measures (cosine, Pearson correlation coefficient, and extended Jaccard) in conjunction with two vector space representations of documents (term frequency and term frequency-inverse document frequency). © 2012 Springer-Verlag GmbH Berlin Heidelberg.
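The three similarity measures and two vector space representations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the TF-IDF weighting uses idf = log(N / df), which is one common variant and an assumption here, and the tokens are plain strings standing in for WordNet synset IDs. The WordNet lookup and the K-Means step are not shown.

```python
import math
from collections import Counter


def tf_vectors(docs):
    """Raw term-frequency vectors over a shared, sorted vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    counts = [Counter(d) for d in docs]
    return vocab, [[float(c[t]) for t in vocab] for c in counts]


def tfidf_vectors(docs):
    """TF-IDF with idf = log(N / df) (assumed variant, not from the paper)."""
    vocab, tf = tf_vectors(docs)
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for v in tf if v[i] > 0) for i in range(len(vocab))]
    idf = [math.log(n / d) for d in df]
    return vocab, [[x * w for x, w in zip(v, idf)] for v in tf]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def pearson(a, b):
    """Pearson correlation, computed as the cosine of mean-centered vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])


def extended_jaccard(a, b):
    """Extended (Tanimoto) Jaccard: a.b / (|a|^2 + |b|^2 - a.b)."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0


# Toy corpus: each document is a list of tokens (stand-ins for synset IDs).
docs = [
    ["car", "engine", "car"],
    ["car", "engine", "road"],
    ["fruit", "apple"],
]
vocab, tf = tf_vectors(docs)
_, tfidf = tfidf_vectors(docs)

print(cosine(tf[0], tf[1]))
print(pearson(tf[0], tf[1]))
print(extended_jaccard(tf[0], tf[1]))
print(cosine(tfidf[0], tfidf[1]))
```

Any of the three functions could serve as the similarity in a K-Means style assignment step; swapping `tf` for `tfidf` changes only the input vectors, which is the kind of combination the comparison in the abstract covers.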

Cite

Sandhya, N., & Govardhan, A. (2012). Analysis of similarity measures with WordNet based text document clustering. In Advances in Intelligent and Soft Computing (Vol. 132 AISC, pp. 703–714). Springer Verlag. https://doi.org/10.1007/978-3-642-27443-5_80
