Pairwise document similarity measure based on present term set

Marzieh Oghbaie; Morteza Mohammadi Zanjireh

Journal Article, Open Access

Journal of Big Data (2018) 5(1)

DOI: 10.1186/s40537-018-0163-2

Abstract

Measuring pairwise document similarity is an essential operation in various text mining tasks. Most similarity measures judge the similarity between two documents based on the term weights and the information content that the two documents share. However, they are insufficient when several documents have an identical degree of similarity to a particular document. This paper introduces a novel text document similarity measure based on the term weights and the number of terms that appear in at least one of the two documents. The effectiveness of our measure is evaluated on two real-world document collections for a variety of text mining tasks, such as text document classification, clustering, and near-duplicate detection. The performance of our measure is compared with that of some popular measures. The experimental results show that our proposed similarity measure yields more accurate results.
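The tie problem mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the paper's formula, so the code below is not the authors' measure: the function names, the toy term-weight vectors, and the `(shared + 1) / (union + 1)` scaling are all assumptions made purely for illustration. It shows two documents that have the same cosine similarity to a query document, and how counting the terms present in at least one of the two documents could separate them.

```python
import math


def cosine(a, b):
    """Standard cosine similarity between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def union_term_count(a, b):
    """Number of terms that appear in at least one of the two documents."""
    return sum(1 for x, y in zip(a, b) if x > 0 or y > 0)


def shared_term_count(a, b):
    """Number of terms that appear in both documents."""
    return sum(1 for x, y in zip(a, b) if x > 0 and y > 0)


def scaled_similarity(a, b):
    """Hypothetical tie-breaker (an assumption, not the paper's formula):
    cosine similarity scaled by a ratio of term-presence counts."""
    ratio = (shared_term_count(a, b) + 1) / (union_term_count(a, b) + 1)
    return cosine(a, b) * ratio


# Toy term-weight vectors over a five-term vocabulary (made-up values).
query = [1, 1, 0, 0, 0]
doc_a = [2, 0, 0, 0, 0]
doc_b = [1, 1, 1, 1, 0]

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(
        name,
        "cosine:", round(cosine(query, doc), 4),
        "union terms:", union_term_count(query, doc),
        "scaled:", round(scaled_similarity(query, doc), 4),
    )
```

In this toy setup the two documents are indistinguishable by cosine similarity alone, while the term-presence counts differ, which is the kind of extra signal the abstract describes.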

Cite

Oghbaie, M., & Mohammadi Zanjireh, M. (2018). Pairwise document similarity measure based on present term set. Journal of Big Data, 5(1). https://doi.org/10.1186/s40537-018-0163-2
