SCAM: A Copy Detection Mechanism for Digital Documents

Narayanan Shivakumar, Hector Garcia-Molina
Department of Computer Science
Stanford University
Stanford, CA 94305-2140
{shiva, hector}@cs.stanford.edu

Abstract

Copy detection in Digital Libraries may provide the necessary guarantees for publishers and newsfeed services to offer valuable on-line data. We consider the case for a registration server that maintains registered documents against which new documents can be checked for overlap. In this paper we present a new scheme for detecting copies based on comparing the word frequency occurrences of the new document against those of registered documents. We also report on an experimental comparison between our proposed scheme and COPS [6], a detection scheme based on sentence overlap. The tests involve over a million comparisons of netnews articles and show that in general the new scheme performs better in detecting documents that have partial overlap.

Keywords: Copy Detection, Plagiarism, Registration Server, Databases.

This material is based upon work supported by the National Science Foundation under Cooperative Agreement IRI-9411306. Funding for this cooperative agreement is also provided by ARPA, NASA, and the industrial partners of the Stanford Digital Libraries Project. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the other sponsors. This work was supported by an equipment grant from Digital Equipment Corporation.

1 Introduction

A Digital Library provides users with on-line access to digitized news articles, books, and other information. In this environment, a user may easily redistribute this digital information on bulletin boards and mailing lists. Unless this problem is "solved," few publishers or authors will place valuable information in these Digital Libraries.
Most existing techniques that address this problem fall into two categories: copy prevention and copy detection. Copy prevention schemes include physical isolation of the information (e.g., by placing it on a stand-alone CD-ROM system), use of special-purpose hardware for authorization [18], and active documents, which are essentially documents encapsulated by programs [10]. We believe that prevention techniques may be cumbersome, may get in the way of the honest user [6], and may make it difficult to share information. Furthermore, prevention schemes are not always bulletproof, since documents may be recorded by using software emulators [6].

The other approach is not to place restrictions on the distribution of documents, but to detect illegal copies. Detection schemes fall into two categories, signature based and registration based. In signature based schemes, a "signature" is added to the document, and this signature can be used to trace the origins of the document. For example, one popular approach is to incorporate watermarks such as word spacings and checksums into documents [5, 4, 22, 7, 3].

Signature schemes have two weaknesses: (a) the signatures often can be removed automatically, leading to untraceable documents, and (b) they are not useful for detecting partial overlap. For these reasons we advocate registration based copy detection schemes. With these schemes original documents are registered and stored in a repository [17, 2]. Subsequent documents that are produced are compared
against the pre-registered documents for partial or complete overlap. This check can be initiated by a person, e.g., a program committee member checking if a conference submission overlaps significantly with previous papers, or automatically by a program, e.g., a bulletin board or electronic mail gateway checking messages going through to see if they include copies of copyrighted articles. The repository of registered documents can be compacted in a variety of ways [6] and periodically distributed to mail gateways and bulletin boards so that checks can be done locally. Another application of registration copy detection is for filtering duplicate messages often found in newsgroups and mailing lists [25].

There are a number of ways to detect duplication with registered documents. In COPS [6], registered documents are broken up into sentences or sequences of sentences, and are stored in the registration server. Subsequent query documents are broken up in the same way and are compared against the registered documents. If a query document shares more than a given threshold of matching sentences (or sequences of sentences) with a registered document, the user is notified.

Another scheme is presented in [14], where the problem of finding "similar" files is addressed. The mechanism works by selecting a few words as anchors and computing checksums of a following window of characters for comparison. It is mainly intended for file management applications and the detection of files that are very similar, but not for detection of small text overlaps.

Registration schemes can also be broken. For example, with COPS, a user can modify a large number of sentences, e.g., by adding or changing a word, rendering the new document untraceable to the original. However, this requires substantial manual work, and for this reason we believe registration based copy detection is superior to signature based schemes.
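As a rough illustration of the sentence-based check described above for COPS [6], and of how small edits can defeat it, consider the following minimal sketch. It is not the COPS implementation: the naive sentence splitter, the function names, the threshold value, and the sample texts are all assumptions made for this example.

```python
import re


def sentences(text):
    """Split text into a set of normalized sentences.

    Assumption: a naive splitter on '.', '!' and '?'; a real system
    would need far more careful sentence detection.
    """
    parts = re.split(r"[.!?]+", text.lower())
    return {" ".join(p.split()) for p in parts if p.strip()}


def sentence_overlap(query, registered):
    """Fraction of the query's sentences that also occur in the registered text."""
    query_sentences = sentences(query)
    if not query_sentences:
        return 0.0
    shared = query_sentences & sentences(registered)
    return len(shared) / len(query_sentences)


THRESHOLD = 0.5  # hypothetical notification threshold

registered = ("Copy detection is useful. Publishers need guarantees. "
              "Libraries store documents.")
exact_copy = "Publishers need guarantees. Libraries store documents."
# The same two sentences, each with one word added.
edited_copy = ("Publishers need strong guarantees. "
               "Libraries store many documents.")

print(sentence_overlap(exact_copy, registered) > THRESHOLD)   # True
print(sentence_overlap(edited_copy, registered) > THRESHOLD)  # False
```

In this toy case the verbatim copy is flagged, while adding a single word to each sentence drops the overlap to zero, which is the kind of evasion discussed above.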
Although COPS has been shown to work well [6], it does have some problems. In particular, it has some difficulties in detecting sentences. Often equations, figures, and abbreviations confuse it. Also, checking for overlap involves many random probes into the registration database, and is expensive. For these reasons, we have explored alternative schemes.

In this paper we present a comparison scheme based on the word occurrence frequencies of documents. Conceptually, we compute a vector that gives the frequency with which each possible word occurs in the new document. Then we compare this vector against "similar" vectors in the database of registered documents. This is very similar to how Information Retrieval (IR) systems compute document similarities [20], except that we use a new similarity measure that more accurately characterizes copy overlap, while traditional IR systems look for semantic similarity. Several schemes have been proposed to enhance IR schemes, such as use of signature files [8], lexical analysis [1], stoplists [13, 9], stemming algorithms [12, 15], thesaurus [21] and ranking algorithms [19]. Since our approach is based on IR, such schemes are orthogonal to our model, and one or more of these schemes could be used to enhance our document comparison mechanism.

Our scheme is based on words, which are easier to detect than sentences, and hence may be more accurate, especially for informal documents. We also believe that word access patterns have more locality than sentence access patterns and this may lead to improved performance in some cases. However, our main motivation in choosing words is that sentence based mechanisms such as COPS cannot detect partial sentence overlaps. Hence we believe that word based schemes may be superior to sentence based mechanisms in detecting plagiarism in documents.
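The word-frequency idea can be sketched in a few lines. The example below builds a frequency vector for each document and compares two vectors with the cosine measure commonly used in traditional vector-space IR. It does not implement the new similarity measure mentioned above, which is not defined in this section. The tokenizer, the function names, and the sample texts are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def frequency_vector(text):
    """Map each word to its number of occurrences.

    Assumption: words are lowercase runs of letters and digits.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Traditional IR cosine measure between two frequency vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


registered = frequency_vector(
    "Publishers need guarantees. Libraries store documents.")
# The same sentences, each with one word added.
edited_copy = frequency_vector(
    "Publishers need strong guarantees. Libraries store many documents.")
unrelated = frequency_vector("The weather is nice today.")

print(round(cosine_similarity(registered, edited_copy), 3))  # 0.866
print(cosine_similarity(registered, unrelated))              # 0.0
```

Here the edited copy shares no complete sentence with the registered text, yet its word-frequency vector remains close to the original's, which illustrates why a word-based comparison can still register partial sentence overlap.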
To support our claims, we present results comparing COPS against our prototype SCAM (Stanford Copy Analysis Mechanism) on 1233 × 1233 netnews article pairs, and show that in general SCAM performs better than COPS in detecting instances of plagiarism. However, we also note that SCAM reports more false positives than COPS, where false positives are pairs of documents that are reported to be possible instances of plagiarism, but are not. We also compare SCAM against a traditional vector-based IR scheme on the same 1233 netnews articles, and show that SCAM again performs better in detecting document overlaps.

2 Copy Detection Preliminaries

In this section, we present the architecture of a generic copy detection server and introduce relevant