Classification feature sets for source code plagiarism detection in Java

3Citations
Citations of this article
14Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

In programming learning environments, the pressure of delivering many programming assignments makes plagiarism the easiest solution. This highly threatens the learning process; therefore, the need of an automatic, fast, and accurate detection of source code plagiarism becomes essential. To detect whether a pair of Java files is plagiarized, this paper proposes four classification feature sets: (i) structural histogram features, histogram-based features for summarizing similarity matrices; (ii) lexical per-class features, extracted from a lexical similarity matrix between the classes of the two compared files based on character 3-grams; (iii) structural counting features, twelve counting features representing the code structure; and (iv) modified original features: a set of modifications on the features of the used baseline. The results show that the best feature sets in F-measure are the structural histogram features and the lexical per-class features combined, which improve the F-measure by 4% compared to the baseline. The added features slow down the execution time. However, it is still efficient, given that it can classify 70k pairs in 23 min. In addition, we partially re-annotated the SOurce COde Re-use dataset. After the re-annotation, the F-measure of both the baseline and our work is improved, and our work achieves an F-measure of 93.6%, which is 7.5% higher than the new F-measure of the baseline. In addition, some remarks and recommendations are provided for using the SOurce COde Re-use dataset as a benchmark.

Cite

CITATION STYLE

APA

Hosam, E., Hadhoud, M., Atiya, A., & Fayek, M. (2022). Classification feature sets for source code plagiarism detection in Java. Journal of Engineering and Applied Science, 69(1). https://doi.org/10.1186/s44147-022-00155-8

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free