Detecting similar documents using salient terms

  • Cooper J
  • Coden A
  • Brown E

Abstract

We describe a system for rapidly determining document similarity among a set of documents obtained from an information retrieval (IR) system. We obtain a ranked list of the most important terms in each document using a rapid phrase recognizer system. We store these terms in a database and compute document similarity using a simple database query. If the number of terms not shared by two documents is below a predetermined threshold relative to the total number of terms in the document, the two documents are judged to be very similar. We compare this method to the shingles approach.
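
As an illustration only, the database-query step described in the abstract might be sketched as follows. This is not the authors' implementation: the table name doc_terms, the function find_similar_pairs, the default threshold, and the exact ratio (unshared terms divided by the combined term count of the two documents) are all assumptions made for the example, and the salient terms are taken as given rather than produced by a phrase recognizer.

```python
import sqlite3


def find_similar_pairs(doc_terms, threshold=0.2):
    """Return (doc_a, doc_b) pairs whose unshared-term fraction is below threshold.

    doc_terms maps a document id to its list of salient terms.
    Hypothetical helper: the schema and the ratio are assumptions, not the paper's.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_terms (doc_id TEXT, term TEXT, PRIMARY KEY (doc_id, term))"
    )
    # One row per (document, distinct term).
    rows = [(doc_id, term) for doc_id, terms in doc_terms.items() for term in set(terms)]
    conn.executemany("INSERT INTO doc_terms VALUES (?, ?)", rows)

    # A self-join on term counts the terms each pair of documents shares.
    # Pairs with no shared term never appear, which is fine: their
    # unshared fraction is 1.0 and cannot fall below a threshold under 1.
    query = """
        WITH counts AS (
            SELECT doc_id, COUNT(*) AS n FROM doc_terms GROUP BY doc_id
        ),
        shared AS (
            SELECT a.doc_id AS doc_a, b.doc_id AS doc_b, COUNT(*) AS n_shared
            FROM doc_terms a
            JOIN doc_terms b ON a.term = b.term AND a.doc_id < b.doc_id
            GROUP BY a.doc_id, b.doc_id
        )
        SELECT s.doc_a, s.doc_b
        FROM shared s
        JOIN counts ca ON ca.doc_id = s.doc_a
        JOIN counts cb ON cb.doc_id = s.doc_b
        WHERE (ca.n + cb.n - 2.0 * s.n_shared) / (ca.n + cb.n) < ?
        ORDER BY s.doc_a, s.doc_b
    """
    pairs = conn.execute(query, (threshold,)).fetchall()
    conn.close()
    return pairs


# Made-up example data: d1 and d2 share three of their four terms.
docs = {
    "d1": ["text mining", "shingles", "database", "phrase recognizer"],
    "d2": ["text mining", "shingles", "database", "search engine"],
    "d3": ["weather", "rainfall"],
}
print(find_similar_pairs(docs, threshold=0.3))
```

With these made-up inputs, d1 and d2 each have one unshared term out of eight terms in total, a fraction of 0.25, so the pair is reported at a threshold of 0.3 and dropped at 0.2. A stricter reading of the abstract might normalize by the term count of a single document instead; the sketch makes no claim about which the paper uses.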

Author-supplied keywords

  • databases
  • document similarity
  • duplicate documents
  • shingles
  • text mining

Authors

  • James W Cooper
  • Anni R Coden
  • Eric W Brown
