TF-IDF-inspired detection for cross-language source code plagiarism and collusion

Abstract

Several computing courses allow students to choose the programming language they use to complete a programming task. This can lead to cross-language code plagiarism and collusion, in which a copied code file is rewritten in another programming language. In response, this paper proposes a detection technique that can accurately compare code files written in different programming languages while requiring only limited effort to accommodate each language at the development stage. The only language-dependent feature used in the technique is a source code tokenizer; no code conversion is applied. The impact of coincidental similarity is reduced by applying a TF-IDF-inspired weighting in which rare matches are prioritized. Our evaluation shows that the technique outperforms common techniques in academia for handling language-conversion disguises. Furthermore, it is comparable to these techniques when dealing with conventional disguises.
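The idea of prioritizing rare matches can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not give the weighting formula, so the code below assumes token bigrams, a smoothed IDF weight of log(N / df) + 1, and a weighted Dice coefficient. The token names (ID, ASSIGN, LOOP, and so on) and the helpers ngrams, idf_weights, and weighted_similarity are hypothetical, standing in for the output of per-language tokenizers that map code to a shared token vocabulary.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Return the list of contiguous token n-grams (as tuples)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def idf_weights(all_token_lists, n=2):
    """Assumed weighting: log(N / df) + 1, so n-grams that occur in
    fewer submissions receive a larger weight."""
    num_docs = len(all_token_lists)
    doc_freq = Counter()
    for tokens in all_token_lists:
        doc_freq.update(set(ngrams(tokens, n)))  # count each n-gram once per submission
    return {gram: math.log(num_docs / df) + 1.0 for gram, df in doc_freq.items()}


def weighted_similarity(tokens_a, tokens_b, weights, n=2):
    """Weighted Dice coefficient over shared n-grams, in the range [0, 1]."""
    count_a = Counter(ngrams(tokens_a, n))
    count_b = Counter(ngrams(tokens_b, n))
    shared = sum(
        min(count_a[g], count_b[g]) * weights.get(g, 1.0)
        for g in count_a.keys() & count_b.keys()
    )
    total = sum(c * weights.get(g, 1.0) for g, c in count_a.items())
    total += sum(c * weights.get(g, 1.0) for g, c in count_b.items())
    return 2.0 * shared / total if total else 0.0


# Hypothetical token streams; in practice each would come from a
# language-specific tokenizer that emits language-neutral token types.
submissions = {
    "alice": ["ID", "ASSIGN", "NUM", "LOOP", "ID", "MUL", "ID"],
    "bob":   ["ID", "ASSIGN", "NUM", "LOOP", "ID", "MUL", "ID"],
    "carol": ["ID", "ASSIGN", "NUM", "CALL", "ID"],
}

weights = idf_weights(list(submissions.values()))
sim_alice_bob = weighted_similarity(submissions["alice"], submissions["bob"], weights)
sim_alice_carol = weighted_similarity(submissions["alice"], submissions["carol"], weights)
```

In this sketch, the bigram (ID, ASSIGN) appears in every submission and therefore contributes little, while bigrams shared by only two submissions contribute more. That is the sense in which coincidental matches are down-weighted.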

Cite

Karnalim, O. (2020). TF-IDF-inspired detection for cross-language source code plagiarism and collusion. Computer Science, 21(1), 97–121. https://doi.org/10.7494/csci.2020.21.1.3389
