Detecting similar documents using salient terms

James W. Cooper; Anni R. Coden; Eric W. Brown

Conference Proceedings

International Conference on Information and Knowledge Management, Proceedings (2002) 245-251

DOI: 10.1145/584792.584835

34 Citations

25 Readers

Abstract

We describe a system for rapidly determining document similarity among a set of documents obtained from an information retrieval (IR) system. We obtain a ranked list of the most important terms in each document using a rapid phrase recognizer system. We store these in a database and compute document similarity using a simple database query. If the number of terms found to not be contained in both documents is less than some predetermined threshold compared to the total number of terms in the document, these documents are determined to be very similar. We compare this to the shingles approach.
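The thresholded comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the salient terms for each document have already been extracted (the phrase recognizer is not reproduced here), uses an in-memory SQLite table as a stand-in for the database, and normalizes the unshared-term count by the number of distinct terms across both documents, which is one possible reading of "the total number of terms in the document". The table name `doc_terms`, the function names, and the default threshold of 0.1 are all hypothetical.

```python
import sqlite3


def build_term_db(doc_terms):
    """Store (doc_id, term) pairs in an in-memory SQLite table.

    doc_terms maps a document id to its salient terms, which are assumed
    to have been extracted beforehand by some phrase recognizer.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_terms (doc_id TEXT, term TEXT, "
        "PRIMARY KEY (doc_id, term))"
    )
    rows = [(doc_id, term)
            for doc_id, terms in doc_terms.items()
            for term in set(terms)]  # one row per distinct term
    conn.executemany("INSERT INTO doc_terms VALUES (?, ?)", rows)
    return conn


def _term_count(conn, doc_id):
    return conn.execute(
        "SELECT COUNT(*) FROM doc_terms WHERE doc_id = ?", (doc_id,)
    ).fetchone()[0]


def unshared_fraction(conn, doc_a, doc_b):
    """Fraction of the pair's distinct terms that appear in only one document."""
    shared = conn.execute(
        "SELECT COUNT(*) FROM doc_terms AS a "
        "JOIN doc_terms AS b ON a.term = b.term "
        "WHERE a.doc_id = ? AND b.doc_id = ?",
        (doc_a, doc_b),
    ).fetchone()[0]
    total_a = _term_count(conn, doc_a)
    total_b = _term_count(conn, doc_b)
    total = total_a + total_b - shared  # distinct terms across both documents
    if total == 0:
        return 1.0  # no terms at all: treat as no evidence of similarity
    unshared = (total_a - shared) + (total_b - shared)
    return unshared / total


def very_similar(conn, doc_a, doc_b, threshold=0.1):
    """Assumed decision rule: unshared fraction below a preset threshold."""
    return unshared_fraction(conn, doc_a, doc_b) < threshold
```

For example, two documents whose stored term sets are identical give an unshared fraction of 0 and are flagged as very similar, while two documents with four terms each that differ in one term give 2/5 and are not flagged at the assumed threshold.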

Citation (APA)

Cooper, J. W., Coden, A. R., & Brown, E. W. (2002). Detecting similar documents using salient terms. In International Conference on Information and Knowledge Management, Proceedings (pp. 245–251). Association for Computing Machinery (ACM). https://doi.org/10.1145/584792.584835
