The sequence labeling approach for text alignment of plagiarism detection

Leilei Kong; Zhongyuan Han; Haoliang Qi

Journal Article, Open Access

KSII Transactions on Internet and Information Systems (2019) 13(9) 4814-4832

DOI: 10.3837/tiis.2019.09.026

Abstract

Plagiarism detection increasingly relies on text alignment, which extracts the plagiarized passages from a pair consisting of a suspicious document and its source document. Heuristic methods have achieved excellent performance in text alignment. However, further improvement of these methods depends mainly on expert experience, which limits their capacity for continuous improvement. Machine learning may be a suitable way to address this problem. Considering the positional relations and the context of text segment pairs, we formalize the text alignment task as a sequence labeling problem, improving on current methods at the model level. Specifically, this paper proposes using a probabilistic graphical model to tag the observed sequence of text segment pairs, and presents a sequence labeling approach for text alignment in plagiarism detection based on Conditional Random Fields. The proposed approach is evaluated on the PAN@CLEF 2012 artificial high obfuscation plagiarism corpus and the simulated paraphrase plagiarism corpus, and compared with the methods that achieved the best performance in PAN@CLEF 2012, 2013 and 2014. Experimental results demonstrate that the proposed approach significantly outperforms the state-of-the-art methods.
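
To illustrate what treating text alignment as sequence labeling can look like, here is a minimal sketch. It is not the authors' implementation: the label set (`O`, `P`), the single Jaccard feature, and the hand-set emission and transition scores are all assumptions made for illustration. A real Conditional Random Field would learn such weights from annotated data. The sketch scores a sequence of segment pairs and decodes the best label sequence with Viterbi, so the label of each pair depends on its neighbors as well as on its own similarity.

```python
from typing import Dict, List, Tuple

# Hypothetical label set: O = not plagiarized, P = plagiarized.
LABELS = ("O", "P")


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two text segments."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def emission(sim: float, label: str) -> float:
    """Hand-set emission score (assumption; a trained CRF would learn this)."""
    return (sim - 0.5) * 4.0 if label == "P" else (0.5 - sim) * 4.0


def transition(prev: str, cur: str) -> float:
    """Hand-set transition score that rewards neighboring pairs sharing a label."""
    return 1.0 if prev == cur else -1.0


def viterbi(sims: List[float]) -> List[str]:
    """Return the highest-scoring label sequence for a sequence of similarities."""
    if not sims:
        return []
    scores: List[Dict[str, float]] = [{l: emission(sims[0], l) for l in LABELS}]
    back: List[Dict[str, str]] = []
    for sim in sims[1:]:
        row: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for cur in LABELS:
            # Best previous label for reaching `cur` at this position.
            best_prev = max(LABELS, key=lambda p: scores[-1][p] + transition(p, cur))
            row[cur] = scores[-1][best_prev] + transition(best_prev, cur) + emission(sim, cur)
            ptr[cur] = best_prev
        scores.append(row)
        back.append(ptr)
    # Follow back-pointers from the best final label.
    path = [max(LABELS, key=lambda l: scores[-1][l])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def label_pairs(pairs: List[Tuple[str, str]]) -> List[str]:
    """Label each (suspicious segment, source segment) pair in sequence order."""
    return viterbi([jaccard(s, t) for s, t in pairs])
```

With the similarity sequence `[0.1, 0.9, 0.45, 0.9, 0.1]`, `viterbi` labels the middle pair as `P` even though its own similarity is below 0.5, because both neighbors are strongly plagiarized. This context effect is the kind of position and neighborhood information that a sequence model can exploit and that a per-pair threshold cannot.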

Kong, L., Han, Z., & Qi, H. (2019). The sequence labeling approach for text alignment of plagiarism detection. KSII Transactions on Internet and Information Systems, 13(9), 4814–4832. https://doi.org/10.3837/tiis.2019.09.026
