Identifying and filtering near-duplicate documents

Andrei Z. Broder

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 1848, pp. 1–10 (2000)

DOI: 10.1007/3-540-45123-4_1

266 Citations

179 Readers

Abstract

The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed-size "sketch" for each document. For a large collection of documents (say hundreds of millions) the size of this sketch is of the order of a few hundred bytes per document. However, for efficient large-scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other words, it suffices to determine whether the resemblance is above a certain threshold. In this talk we show how this determination can be made using a "sample" of less than 50 bytes per document. The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set (of strings) intersection problem, and second, the relative size of intersections is evaluated by a process of random sampling that can be done independently for each document. The process of estimating the relative size of intersection of sets and the threshold test discussed above can be applied to arbitrary sets, and thus might be of independent interest. The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.
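The two aspects named in the abstract (resemblance as a set intersection problem, and estimation by per-document random sampling) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the word-level shingling, the shingle width `w=4`, the salted BLAKE2b hashes standing in for random permutations, the sketch length `k=100`, and the threshold `0.5` are all assumptions chosen for illustration. It also does not reproduce the under-50-byte sample mentioned in the abstract.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences in text (w is an assumed value)."""
    words = text.lower().split()
    if len(words) < w:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Exact resemblance of two sets: |A intersect B| / |A union B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(item, seed):
    """64-bit hash of item, salted by seed (an illustrative stand-in for a random permutation)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def sketch(items, k=100):
    """Fixed-size sketch: the minimum hash under each of k salted hash functions.

    Computed from one document alone, so sketches can be built independently.
    """
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(x, seed) for x in items) for seed in range(k)]


def estimate(s1, s2):
    """Estimate resemblance as the fraction of sketch positions that agree."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)


def is_near_duplicate(s1, s2, threshold=0.5):
    """Threshold test on the estimated resemblance (threshold is an assumed value)."""
    return estimate(s1, s2) >= threshold
```

For example, `shingles("a b c d e f")` and `shingles("a b c d e g")` share two of four distinct shingles, so `resemblance` returns 0.5 for that pair. The estimate from `sketch` approaches the exact value as `k` grows, at the cost of a larger sketch per document.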

Cite

CITATION STYLE

APA

Broder, A. Z. (2000). Identifying and filtering near-duplicate documents. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1848, pp. 1–10). Springer Verlag. https://doi.org/10.1007/3-540-45123-4_1
