Compact features for detection of near-duplicates in distributed retrieval

Yaniv Bernstein; Milad Shokouhi; Justin Zobel

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2006) 4209 LNCS 110-121

DOI: 10.1007/11880561_10

13 Citations

17 Readers

Abstract

In distributed information retrieval, answers from separate collections are combined into a single result set. However, the collections may overlap. The fact that the collections are distributed means that it is not in general feasible to prune duplicate and near-duplicate documents at index time. In this paper we introduce and analyze the grainy hash vector, a compact document representation that can be used to efficiently prune duplicate and near-duplicate documents from result lists. We demonstrate that, for a modest bandwidth and computational cost, many near-duplicates can be accurately removed from result lists produced by a cooperative distributed information retrieval system. © Springer-Verlag Berlin Heidelberg 2006.

Cite

CITATION STYLE

APA

Bernstein, Y., Shokouhi, M., & Zobel, J. (2006). Compact features for detection of near-duplicates in distributed retrieval. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4209 LNCS, pp. 110–121). Springer-Verlag. https://doi.org/10.1007/11880561_10
