Detecting topics in documents by clustering word vectors

Guilherme Raiol de Miranda; Rodrigo Pasti; Leandro Nunes de Castro

Conference Proceedings

Advances in Intelligent Systems and Computing (2020) 1003 235-243

DOI: 10.1007/978-3-030-23887-2_27

12 Citations

25 Readers

Abstract

The automatic detection of topics in a set of documents is one of the most challenging and useful tasks in Natural Language Processing. Word2Vec has proven to be an effective tool for the distributed representation of words (word embeddings) usually applied to find their linguistic context. This paper proposes the use of a Self-Organizing Map (SOM) to cluster the word vectors generated by Word2Vec so as to find topics in the texts. After running SOM, a k-means algorithm is applied to separate the SOM output grid neurons into k clusters, such that the words mapped into each centroid represent the topics of that cluster. Our approach was tested on a benchmark text dataset with 19,997 texts and 20 groups. The results showed that the method is capable of finding the expected groups, sometimes merging some of them that deal with similar topics.
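
The pipeline described in the abstract (word vectors, then a SOM, then k-means over the SOM neurons) can be illustrated with a minimal sketch. It is not the authors' implementation. The toy 2-D vectors and example words stand in for real Word2Vec output, and the grid size, learning-rate schedule, and every function name (`train_som`, `kmeans`, `detect_topics`, `toy_embeddings`) are assumptions made for illustration only.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(points, x):
    """Index of the point closest to x (the best-matching unit for a SOM)."""
    return min(range(len(points)), key=lambda j: sq_dist(points[j], x))


def train_som(vectors, rows, cols, epochs=30, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM; returns neuron weights in row-major order."""
    rng = random.Random(seed)
    # Initialise each neuron with a copy of a randomly chosen input vector.
    weights = [list(rng.choice(vectors)) for _ in range(rows * cols)]
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    total, step = epochs * len(vectors), 0
    for _ in range(epochs):
        order = list(vectors)
        rng.shuffle(order)
        for x in order:
            # Linearly decay the learning rate and neighbourhood radius.
            decay = 1.0 - step / total
            lr, sigma = lr0 * decay, max(sigma0 * decay, 0.3)
            br, bc = coords[nearest(weights, x)]
            for (r, c), w in zip(coords, weights):
                # Gaussian neighbourhood centred on the best-matching unit.
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
                for i in range(len(w)):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [nearest(centroids, p) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def detect_topics(words, vectors, rows=4, cols=4, k=3):
    """Map words to SOM neurons, cluster the neurons, and group words by cluster."""
    weights = train_som(vectors, rows, cols)
    neuron_labels, _ = kmeans(weights, k)
    topics = {c: [] for c in range(k)}
    for word, vec in zip(words, vectors):
        topics[neuron_labels[nearest(weights, vec)]].append(word)
    return topics


def toy_embeddings(seed=1):
    """Hypothetical stand-in for Word2Vec: three noisy 2-D groups of words."""
    rng = random.Random(seed)
    groups = {
        (0.0, 0.0): ["hockey", "baseball", "team", "season"],
        (5.0, 5.0): ["orbit", "launch", "shuttle", "moon"],
        (0.0, 5.0): ["disk", "driver", "windows", "memory"],
    }
    words, vectors = [], []
    for (cx, cy), members in groups.items():
        for w in members:
            words.append(w)
            vectors.append([cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)])
    return words, vectors


words, vectors = toy_embeddings()
topics = detect_topics(words, vectors)
for cluster_id, members in topics.items():
    print(cluster_id, members)
```

To try this on real text, `toy_embeddings` would be replaced by the vocabulary and vectors of a trained Word2Vec model; the SOM grid size and `k` would then need tuning for the corpus.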

Cite

de Miranda, G. R., Pasti, R., & de Castro, L. N. (2020). Detecting topics in documents by clustering word vectors. In Advances in Intelligent Systems and Computing (Vol. 1003, pp. 235–243). Springer Verlag. https://doi.org/10.1007/978-3-030-23887-2_27
