Corpus specificity in LSA and word2vec: The role of out-of-domain documents

Abstract

Despite the popularity of word embeddings, the precise way in which they acquire semantic relations between words remains unclear. In the present article, we investigate whether the capacity of LSA and word2vec to identify relevant semantic relations increases with corpus size. One intuitive hypothesis is that this capacity should increase as the amount of data increases. However, if the corpus grows in topics that are not specific to the domain of interest, the signal-to-noise ratio may weaken. Here we investigate the effect of corpus specificity and size on word embeddings. To do so, we study two ways of progressively eliminating documents: the elimination of random documents vs. the elimination of documents unrelated to a specific task. We show that word2vec can take advantage of all the documents, obtaining its best performance when it is trained with the whole corpus. In contrast, specializing the training corpus (removing out-of-domain documents), accompanied by a decrease in dimensionality, can increase LSA word-representation quality while reducing processing time. From a cognitive-modeling point of view, we point out that LSA's word-knowledge acquisition may not be efficiently exploiting higher-order co-occurrences and global relations, whereas word2vec does.
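
The two document-elimination strategies contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. In particular, the abstract does not say how a document is judged "unrelated to a specific task"; the sketch assumes that a document is in-domain when it contains at least one word from a task vocabulary. The function names `is_in_domain`, `drop_random`, and `drop_out_of_domain`, and the toy corpus, are hypothetical.

```python
import random


def is_in_domain(doc, task_words):
    """Assumption: a document is in-domain if it shares a word with the task."""
    return bool(task_words & set(doc.lower().split()))


def drop_random(docs, n_keep, seed=0):
    """Shrink the corpus to n_keep documents chosen uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(docs, n_keep)


def drop_out_of_domain(docs, task_words, n_keep, seed=0):
    """Shrink the corpus to n_keep documents, discarding out-of-domain ones first."""
    rng = random.Random(seed)
    in_domain = [d for d in docs if is_in_domain(d, task_words)]
    out_domain = [d for d in docs if not is_in_domain(d, task_words)]
    if n_keep >= len(in_domain):
        # Keep every in-domain document and fill up with random out-of-domain ones.
        return in_domain + rng.sample(out_domain, n_keep - len(in_domain))
    # Fewer slots than in-domain documents: sample among the in-domain ones.
    return rng.sample(in_domain, n_keep)


# Toy corpus (hypothetical data, for illustration only).
docs = [
    "the cat chased the dog",
    "stock prices fell sharply",
    "a dog barked at the mailman",
    "parliament passed the budget",
]
task_words = {"cat", "dog"}

random_subset = drop_random(docs, n_keep=2)
specialized_subset = drop_out_of_domain(docs, task_words, n_keep=2)
```

In a comparison of the kind the abstract describes, each reduced corpus would then be used to train an LSA or word2vec model and scored on the task; that training step is left out here.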

Cite

Altszyler, E., Sigman, M., & Slezak, D. F. (2018). Corpus specificity in LSA and word2vec: The role of out-of-domain documents. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 1–10). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w18-3001
