TF-IDF-inspired detection for cross-language source code plagiarism and collusion

Oscar Karnalim

Journal Article · Open Access

Computer Science (2020) 21(1) 97-121

DOI: 10.7494/csci.2020.21.1.3389

Abstract

Several computing courses allow students to choose which programming language they want to use for completing a programming task. This can lead to cross-language code plagiarism and collusion in which the copied code file is rewritten in another programming language. In response, this paper proposes a detection technique that is able to accurately compare code files written in various programming languages but with limited effort in accommodating such languages at the development stage. The only language-dependent feature used in the technique is a source code tokenizer; no code conversion is applied. The impact of coincidental similarity is reduced by applying a TF-IDF-inspired weighting in which rare matches are prioritized. Our evaluation shows that the technique outperforms common techniques in academia for handling language-conversion disguises. Furthermore, it is comparable to these techniques when dealing with conventional disguises.
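The abstract does not give the weighting formula, so the following is only a generic sketch of the idea that rare token matches should count for more than common ones. It assumes each submission has already been reduced by a language-specific tokenizer to a stream of generalized tokens; the token names (`LOOP`, `ID`, `ASSIGN`, and so on), the bigram size, the `log(N / df)` weight, and all function names are illustrative assumptions, not the paper's actual design.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Return the list of contiguous token n-grams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def idf_weights(docs):
    """Map each n-gram to log(N / df), so rare n-grams get larger weights."""
    n_docs = len(docs)
    df = Counter()
    for grams in docs:
        df.update(set(grams))  # count each n-gram once per submission
    return {g: math.log(n_docs / count) for g, count in df.items()}


def weighted_similarity(a, b, idf):
    """IDF-weighted overlap of two n-gram multisets, normalised to [0, 1]."""
    ca, cb = Counter(a), Counter(b)
    shared = sum(min(ca[g], cb[g]) * idf.get(g, 0.0) for g in ca.keys() & cb.keys())
    total = sum(max(ca[g], cb[g]) * idf.get(g, 0.0) for g in ca.keys() | cb.keys())
    return shared / total if total else 0.0


# Hypothetical generalized token streams for three submissions.
a = ngrams(["LOOP", "ID", "ASSIGN", "ID", "PLUS", "NUM"])
b = ngrams(["LOOP", "ID", "ASSIGN", "ID", "PLUS", "NUM"])
c = ngrams(["ID", "ASSIGN", "NUM", "PRINT", "ID"])

idf = idf_weights([a, b, c])
sim_ab = weighted_similarity(a, b, idf)
sim_ac = weighted_similarity(a, c, idf)
```

In this toy corpus the bigram `("ID", "ASSIGN")` appears in every submission, so its weight is zero and it contributes nothing to `sim_ac`, while the bigrams shared only by `a` and `b` dominate `sim_ab`. That is the sense in which coincidental matches are down-weighted here.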

Cite

Karnalim, O. (2020). TF-IDF-inspired detection for cross-language source code plagiarism and collusion. Computer Science, 21(1), 97–121. https://doi.org/10.7494/csci.2020.21.1.3389
