A fusion of algorithms in near duplicate document detection


Abstract

With the rapid development of the World Wide Web, a huge number of fully or partially duplicated pages exist on the Internet. Returning these near-duplicate results to users greatly degrades the user experience. When deploying digital libraries, the protection of intellectual property and the removal of duplicate content must also be considered. This paper fuses several state-of-the-art algorithms to achieve better performance. We first introduce the three major algorithms in duplicate document detection (shingling, I-Match, and simhash) and their subsequent developments. We take sequences of words (shingles) as the features of the simhash algorithm. We then incorporate the random-lexicon-based multi-fingerprint generation method into the shingling-based simhash algorithm, yielding the shingling-based multi-fingerprint simhash algorithm. We performed preliminary experiments on a synthetic dataset derived from the "China-US Million Book Digital Library Project". The experimental results demonstrate the efficiency of these algorithms. © 2012 Springer-Verlag.
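As background for the approach the abstract describes, the following is a minimal sketch of simhash computed over word shingles. It is not the paper's implementation: the shingle length (3 words), the 64-bit fingerprint width, the use of MD5 as the per-feature hash, and all function names are illustrative assumptions.

```python
import hashlib

def shingles(text, k=3):
    """Split text into overlapping k-word shingles (k=3 is an illustrative choice)."""
    words = text.lower().split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

def simhash(features, bits=64):
    """Simhash: for each feature hash, add +1/-1 per bit position,
    then keep the sign of each accumulated component."""
    v = [0] * bits
    for f in features:
        # MD5 stands in for any uniform hash; take the first bits/8 bytes.
        h = int.from_bytes(hashlib.md5(f.encode()).digest()[:bits // 8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Hamming distance between two fingerprints."""
    return bin(a ^ b).count("1")
```

Near-duplicate documents share most shingles and therefore yield fingerprints with a small Hamming distance, which is what makes simhash usable for duplicate detection at scale.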

Citation (APA)

Fan, J., & Huang, T. (2012). A fusion of algorithms in near duplicate document detection. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7104 LNAI, pp. 234–242). https://doi.org/10.1007/978-3-642-28320-8_20
