Abstract
Several computing courses allow students to choose which programming language they use to complete a programming task. This can lead to cross-language code plagiarism and collusion, in which a copied code file is rewritten in another programming language. In response, this paper proposes a detection technique that can accurately compare code files written in different programming languages while requiring only limited effort to accommodate each language at the development stage. The only language-dependent component of the technique is a source code tokenizer; no code conversion is applied. The impact of coincidental similarity is reduced by applying a TF-IDF-inspired weighting in which rare matches are prioritized. Our evaluation shows that the technique outperforms techniques commonly used in academia at handling language-conversion disguises. Furthermore, it is comparable to these techniques when dealing with conventional disguises.
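To make the weighting idea concrete, the following is a minimal sketch of how rare token matches could be prioritized over common ones. It is not the paper's implementation: the regex-based `tokenize` is a hypothetical stand-in for the language-specific tokenizers, and the use of token trigrams, the `log(N / df) + 1` weight, and the Dice-style `weighted_similarity` score are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(code):
    # Hypothetical generic tokenizer: identifiers, integer literals, and
    # single punctuation characters. A real system would use one tokenizer
    # per supported language.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def ngrams(tokens, n=3):
    # Overlapping token n-grams serve as the units that are matched.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def idf_weights(docs):
    # docs: one Counter of n-grams per submission.
    # N-grams found in few submissions get a larger weight (assumed formula).
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())
    return {gram: math.log(n_docs / count) + 1.0 for gram, count in df.items()}


def weighted_similarity(a, b, idf):
    # Dice-style overlap in which every n-gram occurrence counts by its weight.
    shared = sum(min(a[g], b[g]) * idf[g] for g in a.keys() & b.keys())
    total = (sum(c * idf[g] for g, c in a.items())
             + sum(c * idf[g] for g, c in b.items()))
    return 2 * shared / total if total else 0.0
```

Under these assumptions, an n-gram that appears in nearly every submission (for example, common boilerplate) contributes little to the score, while an n-gram shared by only two submissions contributes more, which is the effect the abstract attributes to the TF-IDF-inspired weighting.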
Karnalim, O. (2020). TF-IDF-inspired detection for cross-language source code plagiarism and collusion. Computer Science, 21(1), 97–121. https://doi.org/10.7494/csci.2020.21.1.3389