With the rapid growth of the World Wide Web, a huge number of fully or partially duplicated pages exist on the Internet. Returning such near-duplicate results to users greatly degrades the user experience. When deploying digital libraries, the protection of intellectual property and the removal of duplicate content must also be considered. This paper fuses several state-of-the-art algorithms to achieve better performance. We first introduce the three major algorithms in duplicate document detection (shingling, I-Match, and simhash) and their subsequent developments. We take sequences of words (shingles) as the features of the simhash algorithm. We then incorporate the random-lexicon-based multi-fingerprint generation method into the shingling-based simhash algorithm, which we name the shingling-based multi-fingerprint simhash algorithm. We ran preliminary experiments on a synthetic dataset derived from the "China-US Million Book Digital Library Project". The experimental results demonstrate the effectiveness of these algorithms. © 2012 Springer-Verlag.
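The core idea the abstract builds on, simhash over word shingles, can be sketched briefly. The following is a minimal illustration, not the authors' implementation: it uses MD5 as the per-shingle hash and a shingle length of 3, both of which are assumptions made here for concreteness.

```python
import hashlib

def shingles(text, k=3):
    # Overlapping k-word shingles: the word-sequence features that the
    # shingling-based simhash variant feeds into simhash.
    words = text.lower().split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

def simhash(text, k=3, bits=64):
    # Classic simhash: hash every feature, and for each bit position
    # accumulate +1 if that bit is set, -1 otherwise. The sign of each
    # accumulator gives one bit of the final fingerprint.
    v = [0] * bits
    for sh in shingles(text, k):
        # MD5 truncated to 64 bits stands in for the feature hash here.
        h = int.from_bytes(hashlib.md5(sh.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a, b):
    # Near-duplicates are detected by a small Hamming distance
    # between fingerprints.
    return bin(a ^ b).count("1")
```

Because similar shingle sets push the bit accumulators in similar directions, documents that share most of their shingles end up with fingerprints at a small Hamming distance, while unrelated documents typically differ in many bit positions.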
Fan, J., & Huang, T. (2012). A fusion of algorithms in near duplicate document detection. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7104 LNAI, pp. 234–242). https://doi.org/10.1007/978-3-642-28320-8_20