Near-Duplicate Document Detection Using Semantic-Based Similarity Measure: A Novel Approach

Abstract

Detecting near-duplicate pages, especially on the basis of their semantic content, is a relevant concern in information retrieval. It is needed to avoid redundancy in the search results returned for a query and to facilitate ranking documents in order of their semantic similarity. Although much work has been done on near-duplicate page detection based on content similarity, as the existing literature shows, semantic similarity remains relatively unexplored. This paper proposes a novel technique to detect whether two documents in a corpus have near-duplicate semantic content, and introduces a heuristic method to rank the documents by their semantic similarity scores. This is achieved by computing a semantic-based similarity between two documents and applying an averaging mechanism to associate a similarity score with each document in the corpus. Empirical results on DUC datasets demonstrate the effectiveness of the proposed approach.
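
The pipeline the abstract outlines (a pairwise similarity, a near-duplicate decision, and an averaged per-document score used for ranking) can be sketched roughly as follows. The abstract does not specify the semantic measure, so this sketch swaps in a plain token-count cosine similarity, which is lexical rather than semantic. The names `cosine` and `rank_corpus` and the `threshold` value of 0.8 are assumptions made for illustration, not details from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two token-count vectors.

    Stand-in only: the paper's semantic measure is not given in the abstract.
    """
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_corpus(docs, threshold=0.8):
    """Return (near-duplicate pairs, per-document average score, ranking).

    `threshold` is a hypothetical cut-off chosen for this sketch.
    """
    vecs = [Counter(d.lower().split()) for d in docs]
    n = len(vecs)

    # Symmetric pairwise similarity matrix; the diagonal is left at 0.0
    # so a document's similarity to itself does not enter its average.
    sim = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = cosine(vecs[i], vecs[j])

    # Pairs whose similarity reaches the threshold are flagged as near-duplicates.
    near_dups = [(i, j) for i, j in combinations(range(n), 2) if sim[i][j] >= threshold]

    # Averaging step: mean similarity of each document to all the others.
    avg = [sum(row) / (n - 1) if n > 1 else 0.0 for row in sim]

    # Rank document indices from highest to lowest average score.
    ranking = sorted(range(n), key=lambda i: avg[i], reverse=True)
    return near_dups, avg, ranking
```

Calling `rank_corpus` on a list of document strings returns the flagged index pairs, the averaged scores, and the document indices in ranked order. Replacing `cosine` with an embedding-based or knowledge-based measure would bring the sketch closer to a semantic comparison while leaving the averaging and ranking steps unchanged.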

Citation (APA)

Roul, R. K., & Sahoo, J. K. (2020). Near-Duplicate Document Detection Using Semantic-Based Similarity Measure: A Novel Approach. In Advances in Intelligent Systems and Computing (Vol. 990, pp. 543–558). Springer. https://doi.org/10.1007/978-981-13-8676-3_46