As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its negative effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling inexact duplicate documents.
Conrad, J. G., & Schriber, C. R. (2004). Constructing a text corpus for inexact duplicate detection. In Proceedings of Sheffield SIGIR - Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 582–583). Association for Computing Machinery (ACM). https://doi.org/10.1145/1008992.1009131