
Comparative evaluation of text- and citation-based plagiarism detection approaches using GuttenPlag

by B Gipp, N Meuschke, J Beel
Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (2010)


Citation Based Plagiarism Detection - A New Approach to
Identify Plagiarized Work Language Independently

Bela Gipp
UC Berkeley / OvGU
102 South Hall, Berkeley
+1 (510) 859-3860
gipp@berkeley.edu
Jöran Beel
UC Berkeley / OvGU
102 South Hall, Berkeley
+1 (510) 859-3860
beel@berkeley.edu
ABSTRACT
This paper describes a new approach to detecting plagiarism, as
well as scientific documents that have been read but not cited. In
contrast to existing approaches, which analyze documents' words
but ignore their citations, this approach is based on citation
analysis. It allows duplicate and plagiarism detection even if a
document has been paraphrased or translated, since the relative
position of citations remains similar. Although this approach can
in many cases detect plagiarized work that the traditional
approaches could not detect automatically, it should be considered
an extension rather than a substitute. Whereas the known text
analysis methods can detect copied or, to a certain degree,
modified passages, the proposed approach requires longer passages
with at least two citations in order to create a digital
fingerprint.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval – Clustering.
General Terms
Algorithms, Measurement, Languages
Keywords
Plagiarism Detection, Duplicate Detection, Citation Analysis,
Citation Order Analysis, Language Independent
1. INTRODUCTION
Plagiarism is defined as the 'use or close imitation of the language
and thoughts of another author and the representation of them as
one's own original work.'¹
Plenty of websites addressing students and scholars give advice
on how to ensure that plagiarized text cannot be identified by a
plagiarism detection system such as copyscape.com. The most
common advice given is to paraphrase and use synonyms, or even
copy from sources that were written in another language.
Plagiarism detection services responded by integrating
dictionaries and sophisticated data analysis methods. However,
these systems still have unsatisfactory detection rates if text is
paraphrased or translated, as shown at the International
Competition on Plagiarism Detection in 2009 [6].

¹ Random House Compact Unabridged Dictionary, 1996

Copyright is held by the author/owner(s).
HT'10, June 13–16, 2010, Toronto, Ontario, Canada.
ACM 978-1-4503-0041-4/10/06.

2. RELATED WORK
Hundreds of papers have been published covering sophisticated
approaches to detect plagiarism, and dozens of applications have
been developed. All of them use more or less sophisticated
approaches to analyze the text, but ignore the citations used [3],
[6]. These approaches deliver excellent results in detecting copied
text passages, but fail if text has been paraphrased or translated,
for example from German to English. Instead of analyzing the words
of a document, this paper suggests analyzing its citations.
To our knowledge, applying citation analysis approaches to detect
plagiarism has not yet been attempted. Several citation analysis
approaches, however, have been developed as a measure of
subject relatedness. In 1963, Kessler introduced the concept of
bibliographic coupling [2]. Document A and Document B are
bibliographically coupled if they cite one or more documents in
common. Figure 1 illustrates this approach: Documents A and B
are related because they both cite Documents 1, 2 and 3.
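
As a minimal sketch of bibliographic coupling as defined above, assume each document is represented simply by a list of reference identifiers. The function name coupling_strength and the sample reference lists are hypothetical illustrations, not taken from Kessler [2] or from this paper. The coupling strength is computed here as the number of references the two documents share:

```python
def coupling_strength(refs_a, refs_b):
    """Return the number of references that two documents have in common.

    refs_a and refs_b are iterables of reference identifiers. Duplicate
    entries within one list are counted once.
    """
    return len(set(refs_a) & set(refs_b))


# Hypothetical reference lists mirroring the situation in Figure 1:
# Documents A and B both cite Documents 1, 2 and 3.
doc_a = ["[1]", "[2]", "[3]"]
doc_b = ["[1]", "[2]", "[3]"]
# A third, made-up document that shares only one reference with A.
doc_c = ["[3]", "[4]"]

print(coupling_strength(doc_a, doc_b))  # 3
print(coupling_strength(doc_a, doc_c))  # 1
```

Two documents with a strength of at least one are bibliographically coupled in the sense used above. A higher value indicates more shared references.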

[Figure omitted. Left panel: Doc A and Doc B both cite documents [1], [2] and [3]. Right panel: documents [1], [2] and [3] each cite both Doc A and Doc B.]

Figure 1: Bibliographic coupling (left) and co-citation (right)
A variation of bibliographic coupling, called co-citation, was
proposed by Marshakova [4] and Small [5]. Two documents are
"co-cited" when at least one document cites both. This approach is
illustrated on the right in Figure 1: Documents A and B are related
because both are cited by Documents 1, 2 and 3. The more
co-citations two documents receive, the more related they are. A
further development of this approach is Citation Proximity
Analysis, which identifies related documents by the co-occurrence
of their citations, taking the proximity of those citations to each
other into account [1].
All approaches allow the calculation of the coupling strength and
Preprint of: Bela Gipp and Joeran Beel. Citation Based Plagiarism Detection – A New Approach to Identify Plagiarized Work Language Independently.
In Proceedings of the 21st ACM Conference on Hypertext and Hypermedia. ACM, June 2010. Downloaded from http://www.sciplore.org
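
The co-citation measure described in Section 2 can be sketched in the same spirit. The example assumes a mapping from each citing document to its reference list. The function name cocitation_counts and the sample data are hypothetical, and the sketch covers plain co-citation counting only, not the proximity weighting of Citation Proximity Analysis [1]:

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_docs):
    """Count, for each unordered pair of cited documents, how many
    citing documents list both of them as references.

    citing_docs maps a citing document's identifier to the list of
    documents it cites.
    """
    counts = Counter()
    for refs in citing_docs.values():
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical data mirroring the right-hand side of Figure 1:
# Documents 1, 2 and 3 each cite both A and B. Document 2 also cites
# a made-up document C.
citing_docs = {
    "1": ["A", "B"],
    "2": ["A", "B", "C"],
    "3": ["B", "A"],
}

counts = cocitation_counts(citing_docs)
print(counts[("A", "B")])  # 3
print(counts[("A", "C")])  # 1
```

Following the text above, the pair with the larger count (A and B) would be considered more closely related than the pair co-cited only once.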
