Semantic Document Clustering Using a Similarity Graph

Lubomir Stanchev

Conference Proceedings

Proceedings - 2016 IEEE 10th International Conference on Semantic Computing, ICSC 2016 (2016) 1-8

DOI: 10.1109/ICSC.2016.8

Abstract

Document clustering addresses the problem of identifying groups of similar documents without human supervision. Unlike most existing solutions, which cluster documents by keyword matching, we propose an algorithm that considers the meaning of the terms in the documents. For example, a document that contains the words dog and cat multiple times may be placed in the same category as a document that contains the word pet, even if the two documents share only noise words. Our semantic clustering algorithm is based on a similarity graph that stores the degree of semantic relationship between terms (extracted from WordNet), where a term can be a word or a phrase. We experimentally validate our algorithm on the Reuters-21578 benchmark, which contains 11,362 newswire stories that are grouped into 82 categories using human judgment. We apply the k-means clustering algorithm to group the documents twice: once with a similarity metric based on keyword matching and once with a metric that uses the similarity graph. We show that the second approach produces higher precision and recall, which means that it more closely matches the results of the human categorization.
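
The dog/cat/pet example can be made concrete with a small sketch. This is not the paper's algorithm: the graph weights, the function names, and the expansion rule (adding weighted neighbor terms to a bag-of-words vector before taking cosine similarity) are all assumptions chosen for illustration, and the paper's graph is derived from WordNet rather than written by hand.

```python
from collections import Counter
from math import sqrt

# Hypothetical term-to-term weights. The paper extracts its graph from
# WordNet; these hand-picked numbers are for illustration only.
SIMILARITY_GRAPH = {
    "dog": {"pet": 0.8},
    "cat": {"pet": 0.8},
    "pet": {"dog": 0.5, "cat": 0.5},
}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_vector(text):
    """Plain bag-of-words counts (keyword matching)."""
    return Counter(text.lower().split())


def expanded_vector(text):
    """Bag of words plus weighted counts for graph neighbors (assumed rule)."""
    counts = keyword_vector(text)
    vec = dict(counts)
    for term, count in counts.items():
        for neighbor, weight in SIMILARITY_GRAPH.get(term, {}).items():
            vec[neighbor] = vec.get(neighbor, 0.0) + weight * count
    return vec


doc_a = "dog cat dog cat"
doc_b = "pet pet"

# The two documents share no terms, so keyword matching sees no similarity.
keyword_score = cosine(keyword_vector(doc_a), keyword_vector(doc_b))
# After expansion through the graph, both documents carry weight on
# dog, cat, and pet, so the score becomes positive.
semantic_score = cosine(expanded_vector(doc_a), expanded_vector(doc_b))

print(keyword_score, semantic_score)
```

Either score could then serve as the similarity metric inside a k-means loop, which is the comparison the abstract describes.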

Cite

Stanchev, L. (2016). Semantic Document Clustering Using a Similarity Graph. In Proceedings - 2016 IEEE 10th International Conference on Semantic Computing, ICSC 2016 (pp. 1–8). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICSC.2016.8
