As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in this work is to investigate the phenomenon and determine one or more approaches that minimize its impact on search results. Recent work has focused on using some form of signature to characterize a document in order to reduce the complexity of document comparisons. A representative technique constructs a ‘fingerprint’ of the rarest or richest features in a document using collection statistics as criteria for feature selection. One of the challenges of this approach, however, arises from the fact that in production environments, collections of documents are always changing, with new documents, or new versions of documents, arriving frequently, and other documents periodically removed. When an enterprise proceeds to freeze a training collection in order to stabilize the underlying repository of such features and its associated collection statistics, issues of coverage and completeness arise. We show that even with very large training collections possessing extremely high feature correlations before and after updates, underlying fingerprints remain sensitive to subtle changes. We explore alternative solutions that benefit from the development of massive meta-collections made up of sizable components from multiple domains. This technique appears to offer a practical foundation for fingerprint stability. We also consider mechanisms for updating training collections while mitigating signature instability. Our research is divided into three parts. We begin with a study of the distribution of duplicate types in two broad-ranging news collections consisting of approximately 50 million documents.
We then examine the utility of document signatures in addressing identical or nearly identical duplicate documents and their sensitivity to collection updates. Finally, we investigate a flexible method of characterizing and comparing documents in order to permit the identification of non-identical duplicates. This method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.
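
To make the fingerprint idea and its sensitivity to collection updates concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes that rarity is measured by a smoothed inverse document frequency over a training collection, that the fingerprint is a SHA-1 hash of the k rarest terms, and that k = 3; the function names, the toy documents, and these parameter choices are all hypothetical.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequencies(collection):
    """Count, for each term, how many documents in the collection contain it."""
    df = Counter()
    for doc in collection:
        df.update(set(tokenize(doc)))
    return df


def fingerprint(text, df, n_docs, k=3):
    """Hash the k rarest terms of `text` (assumed criterion: smoothed IDF)."""
    terms = set(tokenize(text))
    # Rarer terms get a higher IDF; ties are broken alphabetically so the
    # selection is deterministic.
    ranked = sorted(
        terms,
        key=lambda t: (-math.log((n_docs + 1) / (df[t] + 1)), t),
    )
    selected = sorted(ranked[:k])
    digest = hashlib.sha1(" ".join(selected).encode("utf-8")).hexdigest()
    return digest, selected


# Toy training collection (hypothetical data).
training = [
    "stocks rally as markets rebound",
    "markets fall as oil prices rise",
    "oil prices rise again",
]
doc = "Stocks rally as oil markets rebound"
doc_copy = "stocks rally as oil markets rebound!"  # same tokens, trivial edits

df_before = document_frequencies(training)
fp_before, sel_before = fingerprint(doc, df_before, len(training))
fp_copy, _ = fingerprint(doc_copy, df_before, len(training))

# The collection is updated: two new documents make "stocks" and "rally" common.
updated = training + ["stocks rally on earnings", "stocks rally continues"]
df_after = document_frequencies(updated)
fp_after, sel_after = fingerprint(doc, df_after, len(updated))

print("same statistics, duplicate matches:", fp_before == fp_copy)
print("selected before update:", sel_before)
print("selected after update: ", sel_after)
print("fingerprint stable across update:", fp_before == fp_after)
```

In this toy setting the duplicate matches as long as both fingerprints are computed against the same statistics, but adding two documents changes which terms count as rarest, so the same document receives a different fingerprint. This is the kind of instability that freezing a training collection, or drawing statistics from a much larger meta-collection, is meant to reduce.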