Word importance-based similarity of documents metric (WISDM)

Abstract

We present the Word importance-based similarity of documents metric (WISDM), a fast and scalable novel method for document similarity/distance computation for analysis of scientific documents. It is based on recent advancements in the area of word embeddings. WISDM combines learned word vectors together with traditional count-based models for document similarity computation, eventually achieving state-of-the-art performance and precision. The novel method first selects from two text documents those words that carry the most information and forms a word set for each document respectively. Then it relies on an existing word embeddings model to get the vector representations of the selected words. In the final step, it computes the closeness of the two sets of word vector representations, fit into a matrix, using a correlation coefficient. The presented metric was evaluated on three tasks, relevant to the analysis of scientific documents, and three data sets of open access scientific research. The results demonstrate that WISDM achieves significant performance speed-up in comparison to state-of-the-art metrics with a very marginal drop in precision.
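The three steps described in the abstract (select the most informative words, look up their vectors, compare the two vector sets with a correlation coefficient) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how word importance is scored or which correlation coefficient is used, so the sketch assumes a smoothed TF-IDF weight for importance and the RV coefficient for matrix correlation. The function names (`top_words`, `word_matrix`, `rv_coefficient`), the toy corpus, and the random stand-in embeddings are all hypothetical; a real setup would load a pretrained embedding model instead.

```python
import math
from collections import Counter

import numpy as np


def top_words(tokens, doc_freq, n_docs, k):
    """Pick the k highest-scoring words of one document.

    Assumption: importance is a smoothed TF-IDF weight; the abstract
    only says the words that "carry the most information" are kept.
    """
    tf = Counter(tokens)
    score = {
        w: c * (math.log((1 + n_docs) / (1 + doc_freq.get(w, 0))) + 1.0)
        for w, c in tf.items()
    }
    # Ties are broken alphabetically so the selection is deterministic.
    return sorted(score, key=lambda w: (-score[w], w))[:k]


def word_matrix(words, embeddings):
    """Stack the vectors of the selected words as columns (dim x n_words)."""
    return np.column_stack([embeddings[w] for w in words if w in embeddings])


def rv_coefficient(X, Y):
    """RV coefficient between two matrices with the same number of rows.

    Assumption: the RV coefficient stands in for the unspecified
    "correlation coefficient". Rows are embedding dimensions, so the
    two documents may contribute different numbers of words (columns).
    """
    X = X - X.mean(axis=0)  # center each word vector
    Y = Y - Y.mean(axis=0)
    Sx, Sy = X @ X.T, Y @ Y.T
    denom = math.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))
    return float(np.trace(Sx @ Sy) / denom)


# Toy corpus of tokenized documents (hypothetical data).
docs = [
    "word embeddings capture word meaning".split(),
    "document similarity with word vectors".split(),
    "citation graphs for research papers".split(),
]
doc_freq = Counter(w for d in docs for w in set(d))

# Random vectors stand in for a pretrained embedding model.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=8) for w in doc_freq}

selected = [top_words(d, doc_freq, len(docs), k=3) for d in docs]
A, B = (word_matrix(s, embeddings) for s in selected[:2])
similarity = rv_coefficient(A, B)
print(selected[0], selected[1], round(similarity, 3))
```

Because only `k` words per document reach the comparison step, the matrices stay small regardless of document length, which is consistent with the speed-up the abstract reports. With random stand-in vectors the printed similarity carries no semantic meaning.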

Botev, V., Marinov, K., & Schäfer, F. (2017). Word importance-based similarity of documents metric (WISDM). In ACM International Conference Proceeding Series (pp. 17–23). Association for Computing Machinery. https://doi.org/10.1145/3127526.3127530
