Online duplicate document detection: signature reliability in a dynamic retrieval environment

  • Conrad J
  • Guo X
  • Schriber C

Abstract

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in this work is to investigate the phenomenon and determine one or more approaches that minimize its impact on search results. Recent work has focused on using some form of signature to characterize a document in order to reduce the complexity of document comparisons. A representative technique constructs a ‘fingerprint’ of the rarest or richest features in a document using collection statistics as criteria for feature selection. One of the challenges of this approach, however, arises from the fact that in production environments, collections of documents are always changing, with new documents, or new versions of documents, arriving frequently, and other documents periodically removed. When an enterprise proceeds to freeze a training collection in order to stabilize the underlying repository of such features and its associated collection statistics, issues of coverage and completeness arise. We show that even with very large training collections possessing extremely high feature correlations before and after updates, underlying fingerprints remain sensitive to subtle changes. We explore alternative solutions that benefit from the development of massive meta-collections made up of sizable components from multiple domains. This technique appears to offer a practical foundation for fingerprint stability. We also consider mechanisms for updating training collections while mitigating signature instability.

Our research is divided into three parts. We begin with a study of the distribution of duplicate types in two broad-ranging news collections consisting of approximately 50 million documents. We then examine the utility of document signatures in addressing identical or nearly identical duplicate documents and their sensitivity to collection updates. Finally, we investigate a flexible method of characterizing and comparing documents in order to permit the identification of non-identical duplicates. This method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.
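
The fingerprint idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "rarest features" means the terms with the lowest document frequency in a frozen training collection, that terms missing from that collection are skipped, and that the signature is a SHA-1 hash of the selected terms. The function names (`tokenize`, `document_frequencies`, `fingerprint`), the parameter `k`, and the toy collection are all hypothetical.

```python
import hashlib
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequencies(collection):
    """Count, for each term, how many documents in the collection contain it."""
    df = Counter()
    for doc in collection:
        df.update(set(tokenize(doc)))
    return df


def fingerprint(doc, df, k=3):
    """Select the k rarest known terms of doc and hash them.

    Assumption: rarity is measured by document frequency, ties are broken
    alphabetically, and terms absent from the frozen statistics are skipped.
    """
    known = {t for t in tokenize(doc) if t in df}
    selected = sorted(known, key=lambda t: (df[t], t))[:k]
    # Hash the selected terms in a canonical (sorted) order.
    canonical = " ".join(sorted(selected))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    return selected, digest


# Hypothetical frozen training collection.
training = [
    "court rules on merger appeal",
    "court delays merger ruling",
    "senate votes on budget",
    "budget appeal fails in senate",
]
# The same collection after two new documents arrive.
updated = training + ["new rules on appeal", "rules for budget appeal"]

doc = "court rules on budget appeal"
near_copy = "Court rules on budget appeal."

df_before = document_frequencies(training)
df_after = document_frequencies(updated)

print(fingerprint(doc, df_before))
print(fingerprint(near_copy, df_before))
print(fingerprint(doc, df_after))
```

In this toy setting the document and its trivially reformatted copy receive the same signature under the frozen statistics, while adding only two documents to the collection changes which terms count as rarest and therefore changes the signature of the unchanged document. That is a small-scale analogue of the sensitivity to collection updates that the abstract reports.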

Author-supplied keywords

  • data management
  • doc signatures
  • duplicate document detection

Authors

  • J.G. Conrad

  • X.S. Guo

  • C.P. Schriber
