Constructing a text corpus for inexact duplicate detection


Abstract

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its negative effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling inexact duplicate documents.

Cite

Conrad, J. G., & Schriber, C. R. (2004). Constructing a text corpus for inexact duplicate detection. In Proceedings of Sheffield SIGIR - Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 582–583). Association for Computing Machinery (ACM). https://doi.org/10.1145/1008992.1009131
