Compact features for detection of near-duplicates in distributed retrieval

Abstract

In distributed information retrieval, answers from separate collections are combined into a single result set. However, the collections may overlap. The fact that the collections are distributed means that it is not in general feasible to prune duplicate and near-duplicate documents at index time. In this paper we introduce and analyze the grainy hash vector, a compact document representation that can be used to efficiently prune duplicate and near-duplicate documents from result lists. We demonstrate that, for a modest bandwidth and computational cost, many near-duplicates can be accurately removed from result lists produced by a cooperative distributed information retrieval system. © Springer-Verlag Berlin Heidelberg 2006.

Cite

Bernstein, Y., Shokouhi, M., & Zobel, J. (2006). Compact features for detection of near-duplicates in distributed retrieval. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4209 LNCS, pp. 110–121). Springer Verlag. https://doi.org/10.1007/11880561_10
