Measuring performance of n-gram and jaccard-similarity metrics in document plagiarism application

Abstract

String-based similarity metrics are mainly used to measure the lexical similarity between words, based on their string sequences and character composition. This research aimed to build an application that can identify the similarity between documents. The program employed two lexical algorithms, N-gram and Jaccard, to check document similarity. The study focused on analysing the algorithms' performance in terms of accuracy, sensitivity, and efficiency. The datasets used in this research were final thesis documents written in Indonesian and English. Experimental results revealed that the Jaccard algorithm performed better than N-gram in terms of accuracy and sensitivity. Despite this superior performance, Jaccard took longer than N-gram to process documents. Furthermore, the results indicated that comparing documents across languages did affect the measured degree of similarity.
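
As a rough illustration of the two kinds of lexical metric named in the abstract, the following Python sketch computes a Jaccard similarity over word sets and an n-gram similarity over character trigrams. The abstract does not state how the authors tokenised the documents or scored n-gram overlap, so everything here is an assumption: the function names (char_ngrams, jaccard, word_jaccard, ngram_dice) are hypothetical, and the Dice coefficient is used as one common way to score n-gram overlap, not necessarily the formula used in the paper.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased string.

    Runs of whitespace are collapsed to single spaces first (an assumption;
    the paper's preprocessing is not described in the abstract).
    """
    normalised = " ".join(text.lower().split())
    return {normalised[i:i + n] for i in range(len(normalised) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A n B| / |A u B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def word_jaccard(doc1, doc2):
    """Jaccard similarity over the sets of whitespace-separated words."""
    return jaccard(set(doc1.lower().split()), set(doc2.lower().split()))


def ngram_dice(doc1, doc2, n=3):
    """Dice coefficient over character n-gram sets (assumed scoring rule)."""
    g1, g2 = char_ngrams(doc1, n), char_ngrams(doc2, n)
    total = len(g1) + len(g2)
    return 2 * len(g1 & g2) / total if total else 0.0


if __name__ == "__main__":
    a = "the cat sat on the mat"
    b = "the cat sat on the rug"
    print(word_jaccard(a, b), ngram_dice(a, b))
```

Both functions return a score between 0 and 1, where 1 means the two token sets are identical. Neither translates the text, which is consistent with the abstract's observation that cross-language comparison affects the similarity score.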

Cite

Diana, N. E., & Hanana Ulfa, I. (2019). Measuring performance of n-gram and jaccard-similarity metrics in document plagiarism application. In Journal of Physics: Conference Series (Vol. 1196). Institute of Physics Publishing. https://doi.org/10.1088/1742-6596/1196/1/012069
