Word importance-based similarity of documents metric (WISDM)

Viktor Botev; Kaloyan Marinov; Florian Schäfer

Conference Proceedings (Open Access)

ACM International Conference Proceeding Series (2017) 17-23

DOI: 10.1145/3127526.3127530

Abstract

We present the Word importance-based similarity of documents metric (WISDM), a fast and scalable novel method for document similarity/distance computation for analysis of scientific documents. It is based on recent advancements in the area of word embeddings. WISDM combines learned word vectors together with traditional count-based models for document similarity computation, eventually achieving state-of-the-art performance and precision. The novel method first selects from two text documents those words that carry the most information and forms a word set for each document respectively. Then it relies on an existing word embeddings model to get the vector representations of the selected words. In the final step, it computes the closeness of the two sets of word vector representations, fit into a matrix, using a correlation coefficient. The presented metric was evaluated on three tasks, relevant to the analysis of scientific documents, and three data sets of open access scientific research. The results demonstrate that WISDM achieves significant performance speed-up in comparison to state-of-the-art metrics with a very marginal drop in precision.
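
The three steps described in the abstract (select the most informative words, look up their vectors, compare the two resulting matrices with a correlation coefficient) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how word importance is scored or which correlation coefficient is used, so the sketch assumes a tf * idf score and the RV coefficient, and it relies on a hand-made toy embedding table. All names (`top_words`, `rv_coefficient`, `wisdm_like_similarity`, `EMBEDDINGS`, `IDF`) are hypothetical.

```python
import math
from collections import Counter

# Toy 3-dimensional "embeddings" and idf weights, invented for illustration only.
EMBEDDINGS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "protein": [0.0, 0.9, 0.2],
    "folding": [0.1, 0.8, 0.3],
    "the": [0.3, 0.3, 0.4],
}
IDF = {"neural": 2.0, "network": 1.8, "protein": 2.2, "folding": 2.1, "the": 0.1}


def top_words(tokens, idf, embeddings, k):
    """Step 1 (assumed scoring): keep the k in-vocabulary words with the highest tf * idf."""
    tf = Counter(t for t in tokens if t in embeddings)
    ranked = sorted(tf, key=lambda w: (-tf[w] * idf.get(w, 0.0), w))
    return ranked[:k]


def centered(vec):
    """Subtract the mean of a vector's components from each component."""
    mean = sum(vec) / len(vec)
    return [x - mean for x in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pairwise_energy(A, B):
    """Sum of squared dot products over all vector pairs, i.e. ||A^T B||_F^2."""
    return sum(dot(a, b) ** 2 for a in A for b in B)


def rv_coefficient(A, B):
    """Step 3 (assumed coefficient): RV coefficient between two sets of word vectors.

    Each word vector is treated as one column of a matrix whose rows are the
    embedding dimensions, so the two sets may contain different numbers of words.
    """
    A = [centered(a) for a in A]
    B = [centered(b) for b in B]
    denom = math.sqrt(pairwise_energy(A, A) * pairwise_energy(B, B))
    return pairwise_energy(A, B) / denom if denom > 0 else 0.0


def wisdm_like_similarity(tokens_a, tokens_b, idf=IDF, embeddings=EMBEDDINGS, k=2):
    # Step 2: map the selected words to their vector representations.
    vecs_a = [embeddings[w] for w in top_words(tokens_a, idf, embeddings, k)]
    vecs_b = [embeddings[w] for w in top_words(tokens_b, idf, embeddings, k)]
    return rv_coefficient(vecs_a, vecs_b)


doc_ml = "the neural network the neural".split()
doc_bio = "the protein folding the protein".split()

same = wisdm_like_similarity(doc_ml, doc_ml)
cross = wisdm_like_similarity(doc_ml, doc_bio)
print(f"same document: {same:.3f}, different topics: {cross:.3f}")
```

Because only `k` words per document enter the comparison, the cost of the final step depends on `k` and the embedding dimension rather than on document length, which is consistent with the speed-up the abstract reports. The choice of `k` here is arbitrary.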

Citation (APA)

Botev, V., Marinov, K., & Schäfer, F. (2017). Word importance-based similarity of documents metric (WISDM). In ACM International Conference Proceeding Series (pp. 17–23). Association for Computing Machinery. https://doi.org/10.1145/3127526.3127530
