Online duplicate document detection: signature reliability in a dynamic retrieval environment

  • Conrad J
  • Guo X
  • Schriber C

Abstract

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in this work is to investigate the phenomenon and determine one or more approaches that minimize its impact on search results. Recent work has focused on using some form of signature to characterize a document in order to reduce the complexity of document comparisons. A representative technique constructs a ‘fingerprint’ of the rarest or richest features in a document using collection statistics as criteria for feature selection. One of the challenges of this approach, however, arises from the fact that in production environments, collections of documents are always changing, with new documents, or new versions of documents, arriving frequently, and other documents periodically removed. When an enterprise proceeds to freeze a training collection in order to stabilize the underlying repository of such features and its associated collection statistics, issues of coverage and completeness arise. We show that even with very large training collections possessing extremely high feature correlations before and after updates, underlying fingerprints remain sensitive to subtle changes. We explore alternative solutions that benefit from the development of massive meta-collections made up of sizable components from multiple domains. This technique appears to offer a practical foundation for fingerprint stability. We also consider mechanisms for updating training collections while mitigating signature instability. Our research is divided into three parts. We begin with a study of the distribution of duplicate types in two broad-ranging news collections consisting of approximately 50 million documents. We then examine the utility of document signatures in addressing identical or nearly identical duplicate documents and their sensitivity to collection updates. Finally, we investigate a flexible method of characterizing and comparing documents in order to permit the identification of non-identical duplicates. This method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.
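To make the fingerprint idea and its sensitivity to collection updates concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, the choice of document frequency as the rarity statistic, the value of `k`, and the use of SHA-1 are all assumptions made for illustration only.

```python
import hashlib
from collections import Counter


def document_frequencies(collection):
    """Count, for each term, how many documents in the collection contain it."""
    df = Counter()
    for text in collection:
        df.update(set(text.lower().split()))
    return df


def fingerprint(text, df, k=2):
    """Return the k rarest known terms of a document and a hash over them.

    Rarity is approximated by low document frequency in the (frozen)
    training collection; terms unseen in training are ignored.
    """
    terms = {t for t in text.lower().split() if t in df}
    # Sort by document frequency, then alphabetically, so ties break deterministically.
    rarest = sorted(terms, key=lambda t: (df[t], t))[:k]
    digest = hashlib.sha1(" ".join(rarest).encode("utf-8")).hexdigest()
    return rarest, digest


# Hypothetical toy training collection (illustrative data only).
training = [
    "court rules on merger appeal",
    "merger talks stall in court",
    "senate votes on budget appeal",
]
df_frozen = document_frequencies(training)

doc = "Court rules on merger budget"
terms_before, sig_before = fingerprint(doc, df_frozen)

# After an update, formerly rare terms become common and the selection shifts.
df_updated = document_frequencies(
    training + ["budget rules change", "new budget rules announced"]
)
terms_after, sig_after = fingerprint(doc, df_updated)

print(terms_before, terms_after, sig_before != sig_after)
```

In this toy setting the same document receives a different signature once the collection statistics change, which mirrors the instability the abstract describes; an exact duplicate compared against the same frozen statistics still maps to the same signature.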

Author-supplied keywords

  • data management
  • doc signatures
  • duplicate document detection
