We present the Word importance-based similarity of documents metric (WISDM), a fast and scalable novel method for document similarity/distance computation for analysis of scientific documents. It is based on recent advancements in the area of word embeddings. WISDM combines learned word vectors together with traditional count-based models for document similarity computation, eventually achieving state-of-the-art performance and precision. The novel method first selects from two text documents those words that carry the most information and forms a word set for each document respectively. Then it relies on an existing word embeddings model to get the vector representations of the selected words. In the final step, it computes the closeness of the two sets of word vector representations, fit into a matrix, using a correlation coefficient. The presented metric was evaluated on three tasks, relevant to the analysis of scientific documents, and three data sets of open access scientific research. The results demonstrate that WISDM achieves significant performance speed-up in comparison to state-of-the-art metrics with a very marginal drop in precision.
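The three steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names neither the word-importance measure nor the correlation coefficient, so the sketch assumes plain term frequency for word selection and an uncentered RV-style matrix correlation for the final step. The toy `EMBEDDINGS` table, the helper names, and the choice to stack word vectors as columns (so both matrices share the embedding dimension as rows) are likewise illustrative assumptions.

```python
from collections import Counter

import numpy as np

# Hypothetical toy embeddings; a real run would load a pretrained model.
rng = np.random.default_rng(0)
VOCAB = ["graph", "neural", "network", "protein",
         "folding", "embedding", "citation", "cell"]
EMBEDDINGS = {word: rng.normal(size=5) for word in VOCAB}


def select_words(text, k=3):
    """Step 1 (assumed measure): keep the k most frequent in-vocabulary words."""
    counts = Counter(w for w in text.lower().split() if w in EMBEDDINGS)
    # Sort by descending count, then alphabetically, for a deterministic result.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def to_matrix(words):
    """Step 2: stack the word vectors as columns, shape (dim, n_words)."""
    return np.column_stack([EMBEDDINGS[w] for w in words])


def rv_coefficient(x, y):
    """Step 3 (assumed coefficient): uncentered RV-style correlation.

    x and y must have the same number of rows (the embedding dimension).
    """
    sx = x @ x.T
    sy = y @ y.T
    denom = np.sqrt(np.trace(sx @ sx) * np.trace(sy @ sy))
    return float(np.trace(sx @ sy) / denom)


def similarity(doc_a, doc_b, k=3):
    """Chain the three steps for a pair of documents."""
    return rv_coefficient(to_matrix(select_words(doc_a, k)),
                          to_matrix(select_words(doc_b, k)))
```

Under these assumptions the score is symmetric, lies between 0 and 1, and equals 1 when a document is compared with itself. The sketch expects each document to contain at least one in-vocabulary word.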
CITATION STYLE
Botev, V., Marinov, K., & Schäfer, F. (2017). Word importance-based similarity of documents metric (WISDM). In ACM International Conference Proceeding Series (pp. 17–23). Association for Computing Machinery. https://doi.org/10.1145/3127526.3127530