Efficient approach for near duplicate document detection using textual and conceptual based techniques


Abstract

With the rapid growth and use of the World Wide Web, a huge number of duplicate web pages now exist. For a search engine to return results free of duplicates, those duplicates must be detected and eliminated. The proposed approach combines the strengths of state-of-the-art duplicate detection algorithms, namely Shingling and Simhash, to detect and eliminate near-duplicate web pages efficiently while accounting for important factors such as word order. In addition, it employs Latent Semantic Indexing (LSI) to detect conceptually similar documents, which text-based duplicate detection techniques such as Shingling and Simhash often miss. As similarity measures between two documents, the approach uses Hamming distance for textual duplicate detection and cosine similarity for conceptual duplicate detection. Performance is evaluated by comparing the F-measure of the proposed approach with that of the traditional Simhash technique. Experimental results show that the proposed approach can outperform traditional Simhash. © Springer International Publishing Switzerland 2014.
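The textual side of the abstract (Shingling, Simhash, and Hamming distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the shingle size `k=3`, the 64-bit fingerprint width, and the use of MD5 as the per-shingle hash are all assumptions chosen for illustration. Hashing word-level shingles rather than single words is one simple way to make the fingerprint sensitive to word order.

```python
import hashlib

def shingles(text, k=3):
    """Return the word-level k-shingles (overlapping k-word windows) of text.
    k=3 is an illustrative assumption, not a value from the paper."""
    words = text.lower().split()
    if len(words) < k:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

def simhash(text, k=3, bits=64):
    """Compute a Simhash fingerprint over the shingles of text.
    Each shingle votes +1/-1 on every bit position; the sign of the
    total decides the corresponding fingerprint bit."""
    counts = [0] * bits
    for sh in shingles(text, k):
        digest = hashlib.md5(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy dog near the river side"
    # A small distance suggests near-duplicates; the cutoff is application-specific.
    print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```

Two documents would be flagged as near duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; the abstract does not state the threshold used, so any value here would have to be tuned. The conceptual (LSI plus cosine similarity) stage is not covered by this sketch.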

Citation (APA)

Roul, R. K., Mittal, S., & Joshi, P. (2014). Efficient approach for near duplicate document detection using textual and conceptual based techniques. In Smart Innovation, Systems and Technologies (Vol. 27, pp. 195–203). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-07353-8_23
