Performance comparison of TF*IDF, LDA and paragraph vector for document classification


Abstract

To meet the speed and effectiveness requirements of document classification in Web 2.0, the most direct strategy is to reduce the dimensionality of the document representation without much information loss. Topic models and neural network language models are the two main strategies for representing documents in a low-dimensional space. To compare the effectiveness of the bag-of-words model, topic models and neural network language models for document classification, TF*IDF, latent Dirichlet allocation (LDA) and the Paragraph Vector model are selected. Support vector machine classifiers are built on the vectors generated by each of the three methods, and their performance is evaluated on English and Chinese document collections. The experimental results show that TF*IDF outperforms LDA and Paragraph Vector, but its high-dimensional vectors cost considerable time and memory. Furthermore, cross validation reveals that stop-word elimination and the size of the training set significantly affect the performance of LDA and Paragraph Vector, and that Paragraph Vector shows the potential to surpass the other two methods. Finally, suggestions on stop-word elimination and training data size for LDA and Paragraph Vector are provided.
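The pipeline the abstract describes (vectorize documents, then train an SVM per representation) can be sketched as follows. This is not the authors' code: it is a minimal illustration using scikit-learn, covering the TF*IDF and LDA representations; the Paragraph Vector model would typically be trained separately (e.g. gensim's Doc2Vec) and is omitted here. The toy corpus and labels are invented for demonstration.

```python
# Hedged sketch, assuming scikit-learn: compare a high-dimensional
# TF*IDF representation with a low-dimensional LDA topic representation,
# each feeding a linear SVM classifier.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical, not from the paper's collections)
docs = [
    "the cat sat on the mat",
    "dogs are loyal pets",
    "stock prices fell sharply today",
    "the market rallied after the report",
]
labels = [0, 0, 1, 1]  # toy classes: pets vs. finance

# TF*IDF: sparse vectors with one dimension per vocabulary term
X_tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# LDA: dense topic-proportion vectors over raw term counts
counts = CountVectorizer(stop_words="english").fit_transform(docs)
X_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# One linear SVM per representation, as in the paper's setup
clf_tfidf = LinearSVC().fit(X_tfidf, labels)
clf_lda = LinearSVC().fit(X_lda, labels)

print("TF*IDF dims:", X_tfidf.shape[1], "| LDA dims:", X_lda.shape[1])
```

The dimensionality gap visible here (vocabulary size vs. a handful of topics) is the trade-off the paper quantifies: TF*IDF classifies more accurately but costs far more time and memory than the low-dimensional models.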

CITATION STYLE

APA

Chen, J., Yuan, P., Zhou, X., & Tang, X. (2016). Performance comparison of TF*IDF, LDA and paragraph vector for document classification. In Communications in Computer and Information Science (Vol. 660, pp. 225–235). Springer Verlag. https://doi.org/10.1007/978-981-10-2857-1_20
