Abstract
String-based similarity metrics are mainly used to measure the lexical similarity between words based on their string sequences and character compositions. This research aimed to build an application that can identify the similarity between documents. The program employed two lexical-based algorithms, N-gram and Jaccard, to check document similarity. The authors focused on analysing the algorithms' performance in terms of accuracy, sensitivity, and efficiency. The datasets used in this research were final thesis documents in Indonesian and English. Experimental results revealed that the Jaccard algorithm performed better than N-gram in terms of accuracy and sensitivity. Despite its superior performance, Jaccard took longer than N-gram to process documents. Furthermore, the results indicated that comparing documents across languages affected the degree of similarity detected.
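To make the two families of metrics concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not say how either score was computed, so several choices here are assumptions: character trigrams (`n=3`), the Dice coefficient as the N-gram score, whitespace tokenisation for Jaccard, and all function names (`char_ngrams`, `ngram_similarity`, `jaccard_similarity`), which are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, whitespace-normalised text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_similarity(doc1, doc2, n=3):
    """Dice coefficient over character n-gram sets.

    Assumption: the paper's exact N-gram scoring is not given in the abstract;
    Dice is used here only as one common choice.
    """
    g1, g2 = char_ngrams(doc1, n), char_ngrams(doc2, n)
    if not g1 and not g2:
        return 1.0  # two texts too short to yield n-grams are treated as identical
    return 2 * len(g1 & g2) / (len(g1) + len(g2))


def jaccard_similarity(doc1, doc2):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over sets of word tokens.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    a, b = set(doc1.lower().split()), set(doc2.lower().split())
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(ngram_similarity("abcd", "bcde"))
    print(jaccard_similarity("the cat sat", "the cat ran"))
```

Both functions return a score between 0 and 1, where 1 means the two inputs produce identical sets. Because both operate on surface strings only, a document and its translation share few tokens or n-grams, which is consistent with the abstract's observation about cross-language documents.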
Cite
Diana, N. E., & Hanana Ulfa, I. (2019). Measuring performance of n-gram and jaccard-similarity metrics in document plagiarism application. In Journal of Physics: Conference Series (Vol. 1196). Institute of Physics Publishing. https://doi.org/10.1088/1742-6596/1196/1/012069