Research on similarity detection of massive text based on semantic fingerprint

Abstract

To find required information quickly and efficiently in massive text collections, this paper proposes a method that combines semantic fingerprints with cosine distance. After preprocessing the Chinese texts, the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm is used to extract the feature words of each text. The Simhash algorithm then screens the texts to produce a set of candidates, and finally the candidates are compared using cosine distance in a second similarity pass to extract the most similar texts. Compared with using the Simhash algorithm alone, the proposed method greatly improves accuracy and recall when texts have been modified, while still meeting the requirements of similarity detection on massive texts. Combining semantic fingerprints with cosine distance therefore effectively compensates for the high false positive rate of the Simhash algorithm and is better suited to similarity detection of massive texts.
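
The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the smoothed IDF formula, the use of MD5 as the per-term hash, the 64-bit fingerprint length, the `max_hamming` threshold, and all function names are assumptions made for illustration. The input is assumed to be already segmented into tokens, since Chinese text would first need a word segmenter.

```python
import hashlib
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (docs are token lists).

    Uses a smoothed IDF; the exact weighting in the paper is not given,
    so this formula is an assumption.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def simhash(weights, bits=64):
    """Build a Simhash fingerprint from weighted terms."""
    acc = [0.0] * bits
    for term, w in weights.items():
        # Hypothetical choice: first 8 bytes of MD5 as a 64-bit term hash.
        h = int.from_bytes(hashlib.md5(term.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i in range(bits) if acc[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def most_similar(query, vectors, max_hamming=3, bits=64):
    """Stage 1: keep texts whose fingerprint is within max_hamming bits.
    Stage 2: rank those candidates by cosine similarity, best first.

    max_hamming=3 is an illustrative threshold, not a value from the paper.
    """
    prints = [simhash(v, bits) for v in vectors]
    candidates = [
        i for i in range(len(vectors))
        if i != query and hamming(prints[query], prints[i]) <= max_hamming
    ]
    scored = [(i, cosine(vectors[query], vectors[i])) for i in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this sketch the fingerprint comparison acts only as a cheap filter, and the cosine pass is what decides the final ranking, which is the role the abstract assigns to the second similarity step in reducing Simhash false positives.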

Cite

Jin, X., Zhang, S., Liu, J., & Guan, H. (2017). Research on similarity detection of massive text based on semantic fingerprint. In Proceedings of Science (Vol. 2017-December). Sissa Medialab Srl. https://doi.org/10.22323/1.300.0009
