Efficient approach for near duplicate document detection using textual and conceptual based techniques

Rajendra Kumar Roul; Sahil Mittal; Pravin Joshi

Conference Proceedings

Smart Innovation, Systems and Technologies (2014) 27(VOL 1) 195-203

DOI: 10.1007/978-3-319-07353-8_23

4 Citations

3 Readers

Abstract

With the rapid growth and use of the World Wide Web, the number of duplicate web pages has become very large. To help search engines return results free of duplicates, these pages must be detected and eliminated. The proposed approach combines the strengths of state-of-the-art duplicate detection algorithms such as Shingling and Simhash to efficiently detect and eliminate near-duplicate web pages while taking into account important factors such as word order. In addition, it employs Latent Semantic Indexing (LSI) to detect conceptually similar documents, which textual duplicate detection techniques such as Shingling and Simhash often miss. As similarity measures between two documents, the approach uses Hamming distance for textual duplicate detection and cosine similarity for conceptual duplicate detection. Performance is evaluated by comparing the F-measure of the proposed approach with that of the traditional Simhash technique. Experimental results show that the proposed approach can outperform traditional Simhash. © Springer International Publishing Switzerland 2014.
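To make the textual side of the abstract concrete, the following is a minimal sketch of how word shingles, a Simhash fingerprint, and Hamming distance can fit together. It is not the authors' implementation: the function names, the shingle size `k=3`, the 64-bit fingerprint width, and the use of MD5 as the per-shingle hash are all assumptions chosen for illustration, and the abstract does not say how the paper combines the two techniques. The LSI and cosine-similarity stage is not shown.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of text (k is an assumed parameter).

    Using contiguous word windows means that word order affects the features.
    """
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle, or nothing if the text is empty.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text, k=3, bits=64):
    """Compute a Simhash-style fingerprint over the k-word shingles of text.

    bits must be a multiple of 8 and at most 128, because the per-shingle
    hash is taken from the leading bytes of an MD5 digest (an assumption).
    """
    counts = [0] * bits
    for sh in shingles(text, k):
        digest = hashlib.md5(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each shingle votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when the positive votes win.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as textual near duplicates when `hamming_distance(simhash(doc_a), simhash(doc_b))` falls below a chosen threshold; the threshold value is not given in the abstract and would need to be tuned.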

Cite

Roul, R. K., Mittal, S., & Joshi, P. (2014). Efficient approach for near duplicate document detection using textual and conceptual based techniques. In Smart Innovation, Systems and Technologies (Vol. 27, pp. 195–203). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-07353-8_23
