With the rapid development of the Internet, the volume of high-dimensional text data has grown rapidly, and building an efficient and extensible text classification algorithm has become a hot topic in data mining. To address the high feature dimensionality, data sparsity, and long computation time of the traditional SVM classification algorithm based on TF-IDF (Term Frequency-Inverse Document Frequency), we propose a novel hybrid system for Chinese text classification, CSVM, which does not depend on hand-designed features or domain knowledge. First, words are encoded by constructing a text vocabulary of size m for the input language and quantizing each word with 1-of-m encoding. Second, we use a CNN (Convolutional Neural Network) to extract morphological features from the character vectors of each word, and then obtain the semantic features of each word vector by training on large-scale text material. Finally, text classification is carried out with a multi-class SVM classifier. In tests on a text dataset with 10 categories, the experimental results show that the CSVM algorithm is more effective than other traditional Chinese text classification algorithms.
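The 1-of-m encoding step described in the abstract can be illustrated with a minimal sketch. It assumes that the encoded units are individual characters and that the vocabulary is built from a small in-memory corpus; the function names `build_vocabulary` and `one_of_m` and the sample strings are hypothetical and are not taken from the paper.

```python
def build_vocabulary(texts):
    """Map each distinct character to an index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab


def one_of_m(text, vocab):
    """Return one row per character; each row is a 1-of-m vector of length m."""
    m = len(vocab)
    rows = []
    for ch in text:
        row = [0] * m
        idx = vocab.get(ch)
        if idx is not None:  # assumption: unseen characters stay all-zero rows
            row[idx] = 1
        rows.append(row)
    return rows


# Hypothetical toy corpus; a real vocabulary would come from the training set.
corpus = ["文本分类", "分类算法"]
vocab = build_vocabulary(corpus)   # m = 6 distinct characters
encoded = one_of_m("分类", vocab)  # 2 rows, each of length 6
```

Each row is a sparse vector of length m with at most one nonzero entry. In a pipeline like the one the abstract outlines, a matrix of such rows would be the input to the convolutional layers, whose learned features are then passed to the SVM.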
CITATION STYLE
Wu, H., Li, D., & Cheng, M. (2019). Chinese text classification based on character-level CNN and SVM. In Communications in Computer and Information Science (Vol. 986, pp. 227–238). Springer Verlag. https://doi.org/10.1007/978-981-13-6473-0_20