The case of the duplicate documents: measurement, search, and science

Justin Zobel; Yaniv Bernstein

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2006) 3841 LNCS 26-39

DOI: 10.1007/11610113_4

Abstract

Many of the documents in large text collections are duplicates and versions of each other. In recent research, we developed new methods for finding such duplicates; however, as there was no directly comparable prior work, we had no measure of whether we had succeeded. Worse, the concept of "duplicate" not only proved difficult to define, but on reflection was not logically defensible. Our investigation highlighted a paradox of computer science research: objective measurement of outcomes involves a subjective choice of preferred measure; and attempts to define measures can easily founder in circular reasoning. Also, some measures are abstractions that simplify complex real-world phenomena, so success by a measure may not be meaningful outside the context of the research. These are not merely academic concerns, but are significant problems in the design of research projects. In this paper, the case of the duplicate documents is used to explore whether and when it is reasonable to claim that research is successful. © Springer-Verlag Berlin Heidelberg 2006.

Citation (APA)

Zobel, J., & Bernstein, Y. (2006). The case of the duplicate documents: measurement, search, and science. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3841 LNCS, pp. 26–39). https://doi.org/10.1007/11610113_4
