In this paper, we describe an approach to creating a summary obfuscation corpus for the task of plagiarism detection. Our method draws on data from the Document Understanding Conferences (DUC) of 2001 and 2006, for the English language. An unattributed summary used within someone else’s document is considered a kind of plagiarism because the original author’s ideas are still present, in succinct form. To create the corpus, we use a Named Entity Recognizer (NER) to identify the entities within an original document, its associated summaries, and target documents. These entities, together with similar paragraphs in target documents, are then used to build fake suspicious documents and plagiarized documents. The corpus was tested in a plagiarism detection competition.
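The abstract does not specify how entities are matched against paragraphs in target documents, so the following is only a minimal sketch of one plausible reading: score each target paragraph by the entities it shares with a summary and pick the best match. Everything here is an assumption made for illustration, including the function names, the Jaccard overlap score, and the capitalized-word heuristic that stands in for a real Named Entity Recognizer.

```python
import re

# Assumption: a crude stand-in for a real NER. It treats runs of capitalized
# words as entities, so it also picks up sentence-initial words such as "The".
ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def extract_entities(text):
    """Return the set of capitalized word runs found in text."""
    return set(ENTITY_PATTERN.findall(text))


def entity_overlap(a, b):
    """Jaccard overlap between two entity sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_paragraph(summary, paragraphs):
    """Return (index, score) of the paragraph sharing the most entities
    with the summary. Hypothetical helper, not the authors' procedure."""
    summary_entities = extract_entities(summary)
    scores = [
        entity_overlap(summary_entities, extract_entities(p))
        for p in paragraphs
    ]
    best = max(range(len(paragraphs)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    summary = "Hurricane Andrew struck Florida in 1992."
    target_paragraphs = [
        "The stock market fell sharply on Monday.",
        "Damage from Hurricane Andrew across Florida was severe.",
    ]
    index, score = most_similar_paragraph(summary, target_paragraphs)
    print(index, round(score, 3))
```

In a fuller pipeline, the selected paragraph would be the place in a target document where a summary could be inserted to form a suspicious document. A trained NER model would replace the regular expression used here.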
CITATION STYLE
Miranda-Jiménez, S., & Stamatatos, E. (2017). Automatic generation of summary obfuscation corpus for plagiarism detection. Acta Polytechnica Hungarica, 14(3), 99–112. https://doi.org/10.12700/APH.14.3.2017.3.6