Performance comparison of TF*IDF, LDA and paragraph vector for document classification


Abstract

To meet the speed and effectiveness requirements of document classification in Web 2.0, the most direct strategy is to reduce the dimensionality of the document representation without much information loss. Topic models and neural network language models are the two main strategies for representing documents in a low-dimensional space. To compare the effectiveness of the bag-of-words model, topic models and neural network language models for document classification, TF*IDF, latent Dirichlet allocation (LDA) and the Paragraph Vector model are selected. Support vector machine classifiers are built on the vectors generated by each of the three methods, and their performance is evaluated on English and Chinese document collections. The experimental results show that TF*IDF outperforms LDA and Paragraph Vector, but its high-dimensional vectors cost considerable time and memory. Furthermore, cross validation reveals that stop-word elimination and the size of the training set significantly affect the performance of LDA and Paragraph Vector, and that Paragraph Vector shows the potential to surpass the other two methods. Finally, suggestions on stop-word elimination and training data size for LDA and Paragraph Vector are provided.
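The pipeline the abstract describes (vectorize documents, then train an SVM per representation) can be sketched as follows. This is not the authors' code: it is a minimal illustration using scikit-learn, covering the TF*IDF and LDA representations; the Paragraph Vector model would typically be trained separately (e.g. gensim's Doc2Vec) and is omitted here. The toy corpus and labels are invented for demonstration.

```python
# Hedged sketch, assuming scikit-learn: compare a high-dimensional
# TF*IDF representation with a low-dimensional LDA topic representation,
# each feeding a linear SVM classifier.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical, not from the paper's collections)
docs = [
    "the cat sat on the mat",
    "dogs are loyal pets",
    "stock prices fell sharply today",
    "the market rallied after the report",
]
labels = [0, 0, 1, 1]  # toy classes: pets vs. finance

# TF*IDF: sparse vectors with one dimension per vocabulary term
X_tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# LDA: dense topic-proportion vectors over raw term counts
counts = CountVectorizer(stop_words="english").fit_transform(docs)
X_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# One linear SVM per representation, as in the paper's setup
clf_tfidf = LinearSVC().fit(X_tfidf, labels)
clf_lda = LinearSVC().fit(X_lda, labels)

print("TF*IDF dims:", X_tfidf.shape[1], "| LDA dims:", X_lda.shape[1])
```

The dimensionality gap visible here (vocabulary size vs. a handful of topics) is the trade-off the paper quantifies: TF*IDF classifies more accurately but costs far more time and memory than the low-dimensional models.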

CITATION STYLE

APA

Chen, J., Yuan, P., Zhou, X., & Tang, X. (2016). Performance comparison of TF*IDF, LDA and paragraph vector for document classification. In Communications in Computer and Information Science (Vol. 660, pp. 225–235). Springer Verlag. https://doi.org/10.1007/978-981-10-2857-1_20
