Citation pattern matching algorithms for citation-based plagiarism detection: greedy citation tiling, citation chunking and longest common citation sequence

by Bela Gipp, Norman Meuschke
Proceedings of the 11th ACM Symposium on Document Engineering DocEng2011 (2011)


Citation Pattern Matching Algorithms for Citation-based
Plagiarism Detection: Greedy Citation Tiling, Citation
Chunking and Longest Common Citation Sequence
Bela Gipp
OvGU, Germany & UC Berkeley, USA
gipp@berkeley.edu
Norman Meuschke
OvGU, Germany & UC Berkeley, USA
meuschke@berkeley.edu

ABSTRACT
Plagiarism Detection Systems have been developed to locate
instances of plagiarism e.g. within scientific papers. Studies have
shown that the existing approaches deliver reasonable results in
identifying copy&paste plagiarism, but fail to detect more
sophisticated forms such as paraphrased, translated or idea
plagiarism. The authors of this paper demonstrated in recent
studies [4, 15] that the detection rate can be significantly
improved by not only relying on text analysis, but by additionally
analyzing the citations of a document. Citations are valuable
language independent markers that are similar to a fingerprint. In
fact, our examinations of real world cases have shown that the
order of citations in a document often remains similar even if the
text has been strongly paraphrased or translated in order to
disguise plagiarism.
This paper introduces three algorithms and discusses their
suitability for the purpose of Citation-based Plagiarism Detection.
Due to the numerous ways in which plagiarism can occur, these
algorithms need to be versatile. They must be capable of detecting
transpositions, scaling and combinations in a local and global
form. The algorithms are coined Greedy Citation Tiling, Citation
Chunking and Longest Common Citation Sequence. The
evaluation showed that common forms of plagiarism can be
detected reliably if these algorithms are combined.
Categories and Subject Descriptors
H.3.3 [Clustering]: INFORMATION STORAGE AND
RETRIEVAL – Information Search and Retrieval.
General Terms
Algorithms, Experimentation, Measurement, Languages
Keywords
Plagiarism Detection Systems, Citation-based, Citation Order
Analysis, Citation Pattern Analysis

1. INTRODUCTION
Plagiarism describes the appropriation of other persons’ ideas,
intellectual or creative work and passing them off as one’s own [7].
To include the act of self-plagiarism (see 2.1), we broaden the
scope of the term and define academic plagiarism as using words
and/or ideas from other sources without due acknowledgement
imposed by academic principles.
It is a particularly common problem among college students
worldwide, but also notably present among established
researchers. In a self-report study among ~82,000 students about
40% of undergraduates and ~25% of graduates engaged in
plagiarism within 12 months prior to the study [29]. Results of
other studies range as high as ~90% of the subjects self-reporting
acts of plagiarism [27].
In academia numerous cases of plagiarism have become publicly
known. An automated plagiarism check of ~285,000 scientific
texts of arXiv.org yielded more than 500 documents very likely to
have been plagiarized. In addition, 30,000 documents (~20% of
the collection) were found to be very likely duplicates or
containing: “[…] excessive self-plagiarism […]” [43, p. 12].
The existing approaches for plagiarism detection have their
weaknesses. Using the words of Weber-Wulff, the organizer of
regular comparisons for productive Plagiarism Detection Systems
(PDS), the current state of available systems can be summarized
as follows: “[…] PDS find copies, not plagiarism.” [50, p. 6].
The paper is structured as follows. After giving an overview of
different forms of plagiarism, the detection approaches currently
used and a discussion of their strengths and weaknesses, the
Citation-based Plagiarism Detection approach is briefly presented.
Subsequently, the newly developed algorithms for Citation-based
Plagiarism Detection are introduced, evaluated and their
suitability for detecting different forms of plagiarism is discussed.
Finally, the suitability of the presented approaches is
demonstrated using real cases of plagiarism.
2. RELATED WORK
2.1 Forms of Plagiarism
Observations of plagiarism behavior in practice reveal a number
of commonly found methods for illegitimate text usage, which are
characterized below.
Copy&Paste (c&p) plagiarism specifies the act of taking over text
verbatim from another author [49].
Disguised plagiarism subsumes practices intended to mask copied
segments [26]. Four different masking techniques have been
identified. These are:
∙ Shake&Paste (s&p) plagiarism is characterized by copying and
merging sentences or paragraphs from different sources with
slight adjustments necessary for forming a coherent text [49];
∙ Expansive plagiarism refers to the insertion of additional text
into or in addition to copied segments [26];

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies
are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
DocEng’11, September 19–22, 2011, Mountain View, CA, USA.
Copyright 2011 ACM 978-1-4503-0863-2/11/09…$10.00.
∙ Contractive plagiarism describes the summary or trimming of
copied material [26];
∙ Mosaic plagiarism encompasses merging text segments
from different sources and obfuscating the plagiarism by
changing word order, substituting words with synonyms or
inserting/deleting filler words [26, 49].
Technical disguise summarizes techniques for hiding plagiarized
content from being automatically detected by exploiting
weaknesses of current text-based analysis methods, e.g. by
substituting characters with graphically identical symbols from
foreign alphabets or inserting letters in white font color [20].
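
The character-substitution trick can be illustrated with a minimal Python sketch. It shows why an exact string comparison misses a word in which one Latin letter has been replaced by a visually identical Cyrillic one. The helper `scripts_used` and the idea of flagging mixed-script words are assumptions made for this illustration; they are not taken from [20].

```python
import unicodedata


def scripts_used(word: str) -> set[str]:
    """Return the script names found in a word.

    Assumption: the first token of a Unicode character name
    (e.g. "LATIN", "CYRILLIC") is a good enough script label here.
    """
    return {unicodedata.name(ch).split()[0] for ch in word if ch.isalpha()}


original = "plagiarism"
# U+0430 CYRILLIC SMALL LETTER A replaces the Latin "a"; both render alike.
disguised = "plagi\u0430rism"

# Exact matching treats the two words as different strings.
exact_match = original == disguised

# A word that mixes scripts is a hint that characters were substituted.
mixed_script = len(scripts_used(disguised)) > 1
```

Here `exact_match` is false although both words look the same on screen, while `mixed_script` is true only for the disguised word.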
Undue paraphrasing defines the intentional rewriting of foreign
thoughts in the vocabulary and style of the plagiarist without
giving due credit, in order to conceal the original source [26].
Translated plagiarism is defined as the manual or automated
conversion of content from one language to another intended to
cover its origin [49].
Idea plagiarism encompasses the usage of a broader foreign
concept without due source acknowledgement [28]. Examples are
the appropriation of research approaches and methods,
experimental setups, argumentative structures, background
sources etc. [13].
Self-plagiarism characterizes the partial or complete reuse of
one’s own previous writings not being justified by scientific goals,
e.g. for presenting updates or providing access to a larger
community, but primarily serving the author, e.g. for artificially
increasing citation counts [5, 11].
2.2 Existing Plagiarism Detection Approaches
Plagiarism Detection (PD) is a hypernym for computer-based
procedures supporting the identification of plagiarism incidents.
Existing PD methods can be categorized into external and
intrinsic approaches [26, 45].
External PD methods compare a suspicious document to a
collection of genuine works. Different comparison strategies have
been proposed in this context.
String matching procedures [2, 32, 52] aim to identify the longest
pairs of identical text strings. These strings are treated as
indicators for potential plagiarism if the share they represent with
regard to the overall text exceeds a chosen threshold. Suffix
document models, such as suffix trees or suffix arrays, have
mostly been used for that purpose in the context of PD.
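
The threshold idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses `difflib.SequenceMatcher` from the standard library instead of a suffix tree or suffix array, it compares word tokens rather than characters, and the function name `matched_share` and the parameter `min_words` are hypothetical.

```python
from difflib import SequenceMatcher


def matched_share(suspicious: str, source: str, min_words: int = 3) -> float:
    """Share of the suspicious text covered by identical word runs.

    Only runs of at least `min_words` identical words are counted.
    SequenceMatcher returns an ordered, non-overlapping alignment, so
    reordered passages may be missed by this simple sketch.
    """
    a = suspicious.lower().split()
    b = source.lower().split()
    if not a:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_words
    )
    return matched / len(a)


share = matched_share(
    "plagiarism detection systems find copies not plagiarism",
    "experts say detection systems find copies quite well",
)
is_suspicious = share > 0.5  # the threshold value is an arbitrary assumption
```

In this example four of the seven words of the first text form one identical run, so `share` is 4/7.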
The strength of substring matching methods is their perfect
detection accuracy with regard to literal text overlaps. Their major
drawbacks are the relative difficulty of detecting disguised
plagiarisms as well as the required computational effort. The
former fact is intuitive when recalling the exact matching
approach of the detection procedure. The latter barrier results from
the use of suffix data structures. The most space-efficient suffix
tree [25], suffix array [24] and suffix vector [33] implementations
allow searching in linear time and require on average ~8n of
storage space, with n being the number of symbols in the original
document. String B-Trees allow searching in O(log n), but also
require multiple times the storage space of the original document
[25]. This renders them impracticable for most large document
collections.
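
To make the suffix data structure more concrete, the following Python sketch builds a naive suffix array and uses binary search to test whether a pattern occurs in a text. It is an assumption-laden toy, not one of the space-efficient implementations cited above: construction by sorting all suffixes is far slower than the cited methods, and the lookup runs in O(m log n) for a pattern of length m. The function names are hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of `text`, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text: str, suffix_array: list[int], pattern: str) -> bool:
    """Binary search for the first suffix whose prefix is >= pattern."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo])


text = "banana"
suffix_array = build_suffix_array(text)
```

The array stores one integer per symbol of the text, which hints at why suffix structures need several times the storage of the original document.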
Employing vector space retrieval based on different term units
has been proposed e.g. by [9, 40, 22]. Vector space models (VSM)
are a standard, highly performance-tuned Information Retrieval
(IR) concept that can overcome the effort-related limitations of
elaborate string matching. VSM consider a set of terms, which
commonly has been extracted from the whole document or larger
parts of the text, for similarity computation. Therefore, vector
space retrieval methods, just like string matching, are classified as
global similarity assessments [47].
The well-known cosine measure is a widely used similarity
function in PD settings as it is for other IR tasks. More complex
similarity functions tend to incorporate semantic information e.g.
by considering word synonyms [21] or pre-computing semantic
relations [48] between terms. The aforementioned papers show
that such considerations can increase detection performance, at
the cost of significantly increasing the computational effort
required. In the experiments reported in [3] considering synonyms
improved the F-measure of the respective detection procedures by
2-3 times. However, the runtime required for doing so increased
by more than 27 times.
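
As a minimal sketch of the cosine measure, the following Python code compares two texts using raw term-frequency vectors. Whitespace tokenization, the absence of term weighting and synonym handling, and the function name `cosine_similarity` are simplifying assumptions; the cited systems use more elaborate term units and weights.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    terms_a = Counter(doc_a.lower().split())
    terms_b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(terms_a[t] * terms_b[t] for t in terms_a.keys() & terms_b.keys())
    norm_a = math.sqrt(sum(c * c for c in terms_a.values()))
    norm_b = math.sqrt(sum(c * c for c in terms_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


same_terms = cosine_similarity("citation order analysis", "analysis order citation")
half_overlap = cosine_similarity("a b", "a c")
```

Because the vectors ignore word order, `same_terms` is (up to rounding) 1 even though the word sequence differs, which reflects the global nature of the assessment.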
The detection performance of VSM-based PDS depends on
the individual plagiarism incident to be analyzed and the
parameter configuration, e.g. term unit and term selection
strategy, of the specific detection method [18, p. 155]. However,
the global similarity assessment of VSMs tends to be detrimental
to detection accuracy in PD settings. Verbatim plagiarism is more
commonly related to smaller, confined segments of a document,
which favors local similarity analysis [47].
Fingerprinting methods, being the most widely used PD
approach, perform a local similarity assessment. They aim to form
a representative digest of a document by selecting a set of
multiple substrings from it. The set represents the fingerprint; its
elements are called minutiae [19]. Mathematical, hash-like
functions can be applied on minutiae for transforming them into
more space efficient byte strings [12].
A suspicious document is checked for plagiarism by computing its
fingerprint and querying each minutia against a pre-computed index
of fingerprints for all documents of a reference collection.
Minutiae found matching with those of other documents indicate
shared text segments and suggest potential plagiarism upon
exceeding a certain similarity threshold [6].
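
The fingerprinting procedure described above can be sketched as follows. This Python example is a simplified illustration under stated assumptions: it hashes every word 3-gram with MD5 (real systems usually select only a subset of chunks), and the names `fingerprint`, `build_index` and `candidates` as well as the threshold of 0.5 are hypothetical.

```python
import hashlib
from collections import Counter


def fingerprint(text: str, n: int = 3) -> set[str]:
    """Hash every word n-gram of the text; each hash is one minutia."""
    words = text.lower().split()
    chunks = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.md5(chunk.encode("utf-8")).hexdigest() for chunk in chunks}


def build_index(collection: dict[str, str]) -> dict[str, set[str]]:
    """Map each minutia to the ids of the reference documents containing it."""
    index: dict[str, set[str]] = {}
    for doc_id, text in collection.items():
        for minutia in fingerprint(text):
            index.setdefault(minutia, set()).add(doc_id)
    return index


def candidates(suspicious: str, index: dict[str, set[str]],
               threshold: float = 0.5) -> dict[str, float]:
    """Reference documents sharing at least `threshold` of the minutiae."""
    minutiae = fingerprint(suspicious)
    if not minutiae:
        return {}
    hits = Counter(doc_id for m in minutiae for doc_id in index.get(m, ()))
    shares = {doc_id: count / len(minutiae) for doc_id, count in hits.items()}
    return {doc_id: s for doc_id, s in shares.items() if s >= threshold}


index = build_index({
    "src": "the order of citations often remains similar after paraphrasing",
    "other": "vector space models compare term weights globally",
})
result = candidates(
    "we note the order of citations often remains similar today", index
)
```

In this example five of the eight word 3-grams of the suspicious text also occur in `src`, so `result` contains only `src` with a share of 0.625.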
The inherent challenge of fingerprinting is finding a document
representation that reduces computational effort to a suitable
dimension, while limiting the information loss incurred to achieve
acceptable detection accuracy [31]. A number of parameters, e.g.
the chunking strategy, chunk size (granularity of the fingerprint)
or number of minutiae (resolution of the fingerprint), reflect that
challenge. There is no definite answer to the question of which
parameter combination is the best, since this choice is strongly
dependent on the nature and size of the collection as well as the
amount and form of plagiarism.
Conventional fingerprinting methods implicitly encode the term
order of a document in proportion to the length of the chosen text
chunk. STEIN proposes an approach, termed fuzzy-fingerprinting,
which disregards term order by using a VSM of document terms
instead of substrings for minutia computation [44].
Fuzzy-Fingerprints are primarily targeted at reducing
computational effort. Compared to fingerprinting using
word-3-grams and an MD5 hash function they can be computed >5
times faster, but have been shown to be inferior in detection
accuracy [47].
Intrinsic PD methods, as opposed to the approaches presented so far,
do not depend on the existence of a reference corpus. They