Scientific institutions define plagiarism as claiming someone else's ideas or efforts as one's own without citing the sources. Plagiarism detection systems typically apply a text similarity algorithm to find sentences shared between source and suspicious documents. They do so either by matching the sentences directly or by embedding each sentence as a vector, using TF-IDF or a similar method, and then calculating the distance or similarity between the source and suspect sentence vectors. Cosine similarity is one method for determining that distance. An unsupervised machine learning technique such as K-means can be used to cluster the documents so that only related documents are selected for detection. In this paper, a plagiarism detection application was created and tested on several text document types, including DOC, DOCX, and PDF files of research papers collected from the web to build the source corpus. To calculate the level of similarity between the suspicious article and the corpus of source articles, the TF-IDF text encoding approach is used with NLP, K-means clustering, and cosine similarity algorithms. The proposed application was evaluated on five different documents and produced different plagiarism ratios: 0.27 for the first document, 0.15 for the second, 0.19 for the third, 0.42 for the fourth, and 0.37 for the fifth. The generated detailed plagiarism ratio report presents the percentage of plagiarism in the suspicious document. The application then decides whether the suspicious document is acceptable by comparing this ratio against a threshold value.
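The sentence-level comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the smoothed IDF formula, the `sentence_threshold` value, and the definition of the ratio as the fraction of suspect sentences whose best cosine match reaches that threshold are all assumptions made for illustration. The K-means step that narrows the corpus to related documents is omitted here, so the sketch assumes the source sentences have already been selected.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (dict: term -> weight) per sentence."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant); terms found everywhere keep weight 1.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: freq * idf[term] for term, freq in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def plagiarism_ratio(source_sentences, suspect_sentences, sentence_threshold=0.8):
    """Fraction of suspect sentences whose best source match reaches the threshold.

    Both the ratio definition and the default threshold are assumptions.
    """
    if not suspect_sentences:
        return 0.0
    # Fit TF-IDF over all sentences so both sides share one vocabulary and IDF.
    vectors = tfidf_vectors(list(source_sentences) + list(suspect_sentences))
    src = vectors[: len(source_sentences)]
    sus = vectors[len(source_sentences):]
    flagged = 0
    for s in sus:
        best = max((cosine(s, r) for r in src), default=0.0)
        if best >= sentence_threshold:
            flagged += 1
    return flagged / len(sus)


if __name__ == "__main__":
    source = [
        "The cat sat on the mat.",
        "Plagiarism detection uses text similarity.",
    ]
    suspect = [
        "The cat sat on the mat.",
        "Completely unrelated words appear here.",
    ]
    ratio = plagiarism_ratio(source, suspect)
    # Hypothetical document-level threshold for the accept/reject decision.
    print(ratio, "acceptable" if ratio <= 0.3 else "not acceptable")
```

Fitting TF-IDF over source and suspect sentences together keeps the two sets of vectors in the same term space, which is what makes the cosine values comparable. A production system would more likely use an established library for vectorization and clustering.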
CITATION STYLE
Saeed, A. A. M., & Taqa, A. Y. (2022). A proposed approach for plagiarism detection in Article documents. SinkrOn, 7(2), 568–578. https://doi.org/10.33395/sinkron.v7i2.11381