Word segmentation is commonly used as a preprocessing step for representing Chinese text when building a text classification system. We have found that a representation based on segmented words may lose features that are valuable for classification, regardless of whether the segmentation is correct. To preserve these features, we propose representing Chinese text with character-based N-grams, which yields a much larger feature space. To address the sparsity of N-gram data, we use an L1-regularized logistic regression (L1-LR) model to classify Chinese text, for better generalization and interpretability. Experimental results show that the proposed method outperforms state-of-the-art methods. Further qualitative analysis also shows that the character-based N-gram representation with L1-LR is reasonable and effective for text classification. © Springer-Verlag 2013.
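The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes character N-grams are counted directly from the raw string (no segmenter), and it fits the L1-regularized logistic regression by plain proximal gradient descent, a solver chosen here only for brevity. The function names, the toy documents, the labels, and the hyperparameters (`lam`, `lr`, `epochs`, `n_max`) are all hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n_max=2):
    """Count every character n-gram of length 1..n_max in the raw string."""
    grams = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams

def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def train_l1_lr(docs, labels, lam=0.1, lr=0.1, epochs=500, n_max=2):
    """Fit L1-regularized logistic regression with proximal gradient descent.

    labels are 0 or 1; lam is the L1 strength (hypothetical value).
    The bias term is left unregularized.
    """
    feats = [char_ngrams(d, n_max) for d in docs]
    weights, bias, m = {}, 0.0, len(docs)
    for _ in range(epochs):
        grad, grad_b = Counter(), 0.0
        for x, y in zip(feats, labels):
            z = bias + sum(weights.get(g, 0.0) * c for g, c in x.items())
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted prob minus label
            grad_b += err
            for g, c in x.items():
                grad[g] += err * c
        bias -= lr * grad_b / m
        for g, gval in grad.items():
            w = soft_threshold(weights.get(g, 0.0) - lr * gval / m, lr * lam)
            if w != 0.0:
                weights[g] = w
            else:
                weights.pop(g, None)  # keep the model sparse: drop exact zeros
    return weights, bias

def predict_proba(text, weights, bias, n_max=2):
    """Probability of class 1 for an unsegmented string."""
    z = bias + sum(weights.get(g, 0.0) * c
                   for g, c in char_ngrams(text, n_max).items())
    return 1.0 / (1.0 + math.exp(-z))

# Toy data, invented for illustration: 1 = finance, 0 = sports.
docs = ["股票市场上涨", "股票基金下跌", "足球比赛精彩", "篮球比赛激烈"]
labels = [1, 1, 0, 0]
weights, bias = train_l1_lr(docs, labels)
vocab = set().union(*(char_ngrams(d) for d in docs))
print(len(weights), "of", len(vocab), "n-gram features kept")
```

Because the L1 penalty drives many weights to exactly zero, the surviving N-grams can be inspected directly, which is the kind of interpretability the abstract refers to.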
Fu, Q., Dai, X., Huang, S., & Chen, J. (2013). Forgetting word segmentation in Chinese text classification with L1-regularized logistic regression. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7819 LNAI, pp. 245–255). https://doi.org/10.1007/978-3-642-37456-2_21