Near-Duplicate Document Detection Using Semantic-Based Similarity Measure: A Novel Approach

Rajendra Kumar Roul; Jajati Keshari Sahoo

Conference Proceedings

Advances in Intelligent Systems and Computing (2020) 990 543-558

DOI: 10.1007/978-981-13-8676-3_46

Abstract

Detecting near-duplicate pages, especially on the basis of their semantic content, is an important concern in information retrieval. It is needed to avoid redundancy in the search results returned for a query and to facilitate ranking the documents in order of their semantic similarity. Although much work has been done on near-duplicate page detection based on content similarity, as the existing literature shows, semantic similarity remains relatively unexplored. In this paper, a novel technique is proposed to detect whether two documents belonging to a corpus have near-duplicate semantic content, and a heuristic method is introduced to rank the documents by their semantic similarity scores. This is achieved by using the proposed technique to compute a semantic-based similarity between two documents and then applying an averaging mechanism to associate a similarity score with each document in the corpus. Empirical results on DUC datasets demonstrate the effectiveness of the proposed approach.
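The abstract does not give the similarity measure or the exact averaging rule, so the following is only a minimal sketch of the general pipeline it describes: score document pairs, flag pairs above a threshold as near-duplicates, and average each document's pairwise scores to rank the corpus. The bag-of-words cosine used here is a stand-in, not the paper's semantic measure, and the function names, the `threshold` value of 0.8, and the choice to average over all other documents are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(a: str, b: str) -> float:
    """Stand-in measure (assumption): cosine over bag-of-words term counts.

    The paper proposes a semantic-based measure; any function returning a
    score in [0, 1] for two documents could be passed in its place.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def near_duplicate_pairs(docs, threshold=0.8, sim=cosine_similarity):
    """Return index pairs whose similarity meets the (hypothetical) threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if sim(docs[i], docs[j]) >= threshold
    ]


def average_scores(docs, sim=cosine_similarity):
    """Average each document's similarity to every other document."""
    n = len(docs)
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        s = sim(docs[i], docs[j])
        totals[i] += s
        totals[j] += s
    return [t / (n - 1) if n > 1 else 0.0 for t in totals]


def rank_by_score(docs, sim=cosine_similarity):
    """Return document indices ordered from highest to lowest average score."""
    scores = average_scores(docs, sim)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


docs = [
    "the cat sat on the mat",
    "the cat sat on the mat today",
    "stock prices fell sharply",
]
pairs = near_duplicate_pairs(docs)
scores = average_scores(docs)
ranking = rank_by_score(docs)
```

In this toy corpus the first two documents share almost all of their terms, so they form the only flagged pair, and the unrelated third document receives an average score of zero and ranks last. Swapping `sim` for a semantic measure changes the scores but not the structure of the pipeline.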

Cite

Roul, R. K., & Sahoo, J. K. (2020). Near-Duplicate Document Detection Using Semantic-Based Similarity Measure: A Novel Approach. In Advances in Intelligent Systems and Computing (Vol. 990, pp. 543–558). Springer. https://doi.org/10.1007/978-981-13-8676-3_46
