Research on similarity detection of massive text based on semantic fingerprint

Xiaolin Jin; Shuwu Zhang; Jie Liu; Hu Guan

Conference Proceedings


Proceedings of Science (2017) 2017-December

DOI: 10.22323/1.300.0009


Abstract

To find required information quickly and efficiently in massive text collections, this paper proposes a method that combines semantic fingerprints with cosine distance. After preprocessing the Chinese texts, the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm is used to extract the feature words of each text. The texts are then screened initially with the Simhash algorithm, and the resulting candidate texts are compared a second time using cosine distance to extract the most similar texts. Compared with using the Simhash algorithm alone, the proposed method greatly improves accuracy and recall when texts have been modified, and it also meets the requirements of similarity detection on massive texts. The combination of semantic fingerprint and cosine distance therefore effectively compensates for the high false positive rate of the Simhash algorithm and is better suited to similarity detection of massive texts in practice.
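The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the texts are already segmented into tokens (the Chinese preprocessing step is omitted), uses a smoothed IDF, hashes terms with MD5 truncated to 64 bits, and picks a Hamming threshold of 24 purely for illustration. The function names (`tfidf_vectors`, `simhash`, `most_similar`) and the `max_hamming` and `top_k` parameters are hypothetical.

```python
import hashlib
import math
from collections import Counter

BITS = 64  # fingerprint length; an assumption, not taken from the paper


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Smoothed IDF (assumed variant) keeps every weight strictly positive.
        vectors.append({
            t: (c / len(d)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def simhash(weights):
    """Weighted Simhash: each term votes +w or -w on every bit position."""
    acc = [0.0] * BITS
    for term, w in weights.items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, v in enumerate(acc):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def most_similar(query_idx, docs, max_hamming=24, top_k=3):
    """Stage 1: Simhash screening. Stage 2: cosine re-ranking of candidates."""
    vecs = tfidf_vectors(docs)
    fps = [simhash(v) for v in vecs]
    candidates = [
        i for i in range(len(docs))
        if i != query_idx and hamming(fps[i], fps[query_idx]) <= max_hamming
    ]
    scored = [(i, cosine(vecs[query_idx], vecs[i])) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The coarse fingerprint comparison is cheap and narrows the search to a small candidate set, while the cosine step on the TF-IDF vectors filters out candidates that passed the Hamming threshold by chance, which is the false positive problem the abstract attributes to Simhash alone.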

Citation (APA)

Jin, X., Zhang, S., Liu, J., & Guan, H. (2017). Research on similarity detection of massive text based on semantic fingerprint. In Proceedings of Science (Vol. 2017-December). Sissa Medialab Srl. https://doi.org/10.22323/1.300.0009
