Semantic Document Clustering Using a Similarity Graph

Abstract

Document clustering addresses the problem of identifying groups of similar documents without human supervision. Unlike most existing solutions, which cluster documents by keyword matching, we propose an algorithm that considers the meaning of the terms in the documents. For example, a document that contains the words dog and cat multiple times may be placed in the same category as a document that contains the word pet, even if the two documents have only noise words in common. Our semantic clustering algorithm is based on a similarity graph that stores the degree of semantic relationship between terms (extracted from WordNet), where a term can be a word or a phrase. We experimentally validate our algorithm on the Reuters-21578 benchmark, which contains 11,362 newswire stories grouped into 82 categories by human judgment. We apply the k-means clustering algorithm to group the documents twice: once with a similarity metric based on keyword matching and once with a metric that uses the similarity graph. We show that the second approach yields higher precision and recall, meaning that it more closely matches the human categorization.
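
The dog/cat/pet example can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the graph below is a hypothetical two-edge dictionary with made-up weights (the paper derives its graph from WordNet), and `expand`, `cosine`, and `term_vector` are illustrative helper names. The sketch only contrasts a keyword-matching score with a score that first spreads term weight along graph edges.

```python
import math
from collections import Counter

# Hypothetical similarity graph: term -> {related term: weight in [0, 1]}.
# The weights are invented for illustration, not taken from WordNet.
SIMILARITY_GRAPH = {
    "dog": {"pet": 0.6},
    "cat": {"pet": 0.6},
}


def term_vector(text):
    """Bag-of-words term counts for a whitespace-tokenized document."""
    return Counter(text.lower().split())


def expand(vector, graph):
    """Add weighted counts for terms that are graph neighbors of the document's terms."""
    expanded = Counter(vector)
    for term, count in vector.items():
        for neighbor, weight in graph.get(term, {}).items():
            expanded[neighbor] += count * weight
    return expanded


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(value * b[term] for term, value in a.items() if term in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = term_vector("dog cat dog cat")
doc_b = term_vector("pet")

# Keyword matching: the documents share no terms.
keyword_score = cosine(doc_a, doc_b)

# Graph-based: "dog" and "cat" both contribute weight to "pet".
semantic_score = cosine(
    expand(doc_a, SIMILARITY_GRAPH), expand(doc_b, SIMILARITY_GRAPH)
)

print(keyword_score, semantic_score)
```

Under these assumed weights the keyword score is zero while the graph-based score is positive, so a clustering step such as k-means that uses the second score could place the two documents in the same group.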

Cite

Stanchev, L. (2016). Semantic Document Clustering Using a Similarity Graph. In Proceedings - 2016 IEEE 10th International Conference on Semantic Computing, ICSC 2016 (pp. 1–8). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICSC.2016.8
