Efficiency of SVM classifier with Word2Vec and Doc2Vec models

Maria Mihaela Truşcă

Journal Article, Open Access

Proceedings of the International Conference on Applied Statistics (2019) 1(1) 496-503

DOI: 10.2478/icas-2019-0043


Abstract

The Support Vector Machine (SVM) has been one of the most intensively used text classifiers since its development. However, its performance depends not only on its features but also on data preprocessing and model tuning. The main purpose of this paper is to compare the efficiency of several Support Vector Machine models that use the TF-IDF approach and the Word2Vec and Doc2Vec neural networks for text data representation. Besides the data vectorization process, I try to enhance the models’ efficiency by identifying which kind of kernel fits the data better, or whether it is better to opt for the linear case. My results show that, for the “Reuters 21578” dataset, the nonlinear Support Vector Machine is more efficient when text data are converted into numerical attributes using Word2Vec models rather than TF-IDF or Doc2Vec representations. When the data are assumed to be linearly separable, the TF-IDF representation outperforms all other options. Surprisingly, Doc2Vec models have the lowest performance and provide satisfactory results only in terms of computational cost. This paper shows that while Word2Vec models are truly efficient for text data representation, Doc2Vec neural networks are unable to exceed even the TF-IDF representation. This evidence contradicts the common idea that Doc2Vec models should provide better insight into the training data domain than Word2Vec models, and certainly than the TF-IDF index.
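To make the contrast between the representations concrete, the following is a minimal sketch of how a document can be turned into a fixed-length numeric vector in two ways: with TF-IDF weights, and by averaging per-word embeddings. It is an illustration under stated assumptions, not the paper's pipeline. The three-document corpus and the two-dimensional embedding table are hypothetical stand-ins (the paper uses Reuters 21578 and trained Word2Vec models), the IDF formula is the plain unsmoothed variant, and mean pooling of word vectors is assumed here because the abstract does not say how word vectors are combined into document vectors.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the paper itself uses the Reuters 21578 dataset.
docs = [
    "oil prices rise".split(),
    "oil exports fall".split(),
    "wheat prices fall".split(),
]


def tfidf_vectors(corpus):
    """Return (vocabulary, one TF-IDF vector per document)."""
    vocab = sorted({w for doc in corpus for w in doc})
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    df = Counter(w for doc in corpus for w in set(doc))
    # Plain, unsmoothed IDF (an assumption; libraries often smooth it).
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        # Relative term frequency times IDF, one entry per vocabulary term.
        vectors.append([tf[w] / len(doc) * idf[w] for w in vocab])
    return vocab, vectors


def mean_word_vector(doc, embeddings, dim):
    """Average the embeddings of the known words into one document vector."""
    known = [embeddings[w] for w in doc if w in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical 2-dimensional embeddings standing in for trained Word2Vec output.
toy_embeddings = {
    "oil": [1.0, 0.0],
    "wheat": [0.0, 1.0],
    "prices": [0.5, 0.5],
}

vocab, tfidf = tfidf_vectors(docs)
pooled = [mean_word_vector(d, toy_embeddings, dim=2) for d in docs]
```

The TF-IDF vectors have one dimension per vocabulary term and are mostly zeros, while the pooled vectors are dense and have the dimensionality of the embedding. Either kind of vector could then be passed to a linear or kernel SVM, which is the comparison the abstract describes.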

Citation (APA)

Truşcă, M. M. (2019). Efficiency of SVM classifier with Word2Vec and Doc2Vec models. Proceedings of the International Conference on Applied Statistics, 1(1), 496–503. https://doi.org/10.2478/icas-2019-0043
