Semantically enhanced text stemmer (SETS) for cross-domain document clustering

Ivan Stankov; Diman Todorov; Rossitza Setchi

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2013), Vol. 7828 LNAI, pp. 108–118

DOI: 10.1007/978-3-642-37343-5_12

Abstract

This paper focuses on processing cross-domain document repositories, a task complicated by word ambiguity and by the fact that monosemic words are more domain-oriented than polysemic ones. It describes a semantically enhanced text normalization algorithm (SETS) aimed at improving document clustering, and investigates the performance of the sk-means clustering algorithm across domains by comparing the cluster coherence produced with semantic-based and traditional (TF-IDF-based) document representations. The evaluation is conducted on 20 generic sub-domains of a thousand documents each, randomly selected from the Reuters-21578 corpus. The experimental results demonstrate that SETS produces more coherent clusters than text normalization with the Porter stemmer. In addition, semantic-based text normalization is shown to be resistant to noise, which is often introduced in the index aggregation stage. © 2013 Springer-Verlag.
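
To make the traditional baseline in the abstract concrete, the following is a minimal sketch of TF-IDF document vectors clustered by cosine similarity, assuming that "sk-means" refers to spherical k-means. It is not the authors' implementation: the function names, the deterministic initialization from the first k documents, and the iteration limit are illustrative assumptions, and neither Porter stemming nor the SETS normalization step is reproduced here.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def tfidf_vectors(docs):
    """Build unit-length TF-IDF vectors from already tokenized documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(normalize({t: c * math.log(n / df[t]) for t, c in tf.items()}))
    return vectors


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def spherical_kmeans(vectors, k, iterations=20):
    """Assign each vector to the centroid with the highest cosine similarity.

    Assumption for reproducibility: centroids start as the first k vectors.
    """
    centroids = [dict(v) for v in vectors[:k]]
    labels = None
    for _ in range(iterations):
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == j:
                    for t, w in v.items():
                        total[t] += w
            if total:
                # Re-project the summed members onto the unit sphere.
                centroids[j] = normalize(total)
    return labels
```

Swapping the tokens passed to `tfidf_vectors` for stemmed or semantically normalized tokens is where a comparison of the kind the abstract describes would plug in; the clustering step itself stays unchanged.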

Cite

Stankov, I., Todorov, D., & Setchi, R. (2013). Semantically enhanced text stemmer (SETS) for cross-domain document clustering. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7828 LNAI, pp. 108–118). https://doi.org/10.1007/978-3-642-37343-5_12