Automatic generation of summary obfuscation corpus for plagiarism detection

Sabino Miranda-Jiménez; Efstathios Stamatatos

Journal Article, Open Access

Acta Polytechnica Hungarica (2017) 14(3) 99-112

DOI: 10.12700/APH.14.3.2017.3.6

Abstract

In this paper, we describe an approach to creating a summary obfuscation corpus for the task of plagiarism detection. Our method is based on English-language data from the Document Understanding Conferences of 2001 and 2006. An unattributed summary used within someone else’s document is considered a kind of plagiarism because the original author’s ideas are still present in a succinct form. To create the corpus, we use a Named Entity Recognizer (NER) to identify the entities within an original document, its associated summaries, and target documents. These entities, together with similar paragraphs in the target documents, are then used to build fake suspicious documents and plagiarized documents. The corpus was tested in a plagiarism detection competition.
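
The abstract gives only the outline of the construction, so the following Python sketch is an illustration under stated assumptions, not the authors' implementation. It assumes that "similar paragraphs" are chosen by entity overlap (Jaccard similarity here) and that the summary is inserted right after the best-matching paragraph of a target document; neither detail is confirmed by the abstract. The function names are hypothetical, and a crude capitalized-word regex stands in for a real Named Entity Recognizer.

```python
import re


def extract_entities(text):
    """Crude stand-in for a NER: every capitalized word counts as an entity.

    Assumption for illustration only; a real recognizer would also return
    entity types and would not pick up sentence-initial common words.
    """
    return set(re.findall(r"\b[A-Z][a-z]+\b", text))


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_paragraph(summary, paragraphs):
    """Index of the paragraph whose entities overlap most with the summary."""
    summary_entities = extract_entities(summary)
    scores = [jaccard(summary_entities, extract_entities(p)) for p in paragraphs]
    return max(range(len(paragraphs)), key=lambda i: scores[i])


def make_suspicious_document(summary, target_paragraphs):
    """Insert the summary after the most similar target paragraph.

    Returns the new paragraph list and the position of the inserted summary,
    which can serve as the ground-truth annotation of the plagiarized span.
    """
    best = most_similar_paragraph(summary, target_paragraphs)
    position = best + 1
    document = target_paragraphs[:position] + [summary] + target_paragraphs[position:]
    return document, position
```

Calling `make_suspicious_document(summary, paragraphs)` with a summary string and a list of target paragraphs returns the fake suspicious document together with the insertion position. Recording that position alongside the source document would give the pairing a detection system is later scored against.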

Miranda-Jiménez, S., & Stamatatos, E. (2017). Automatic generation of summary obfuscation corpus for plagiarism detection. Acta Polytechnica Hungarica, 14(3), 99–112. https://doi.org/10.12700/APH.14.3.2017.3.6