Forgetting word segmentation in Chinese text classification with L1-regularized logistic regression

Abstract

Word segmentation is commonly used as a preprocessing step for Chinese text representation when building a text classification system. We have found that Chinese text representation based on segmented words may lose some features that are valuable for classification, regardless of whether the segmentation results are correct. To preserve these features, we propose using character-based N-grams to represent Chinese text in a larger-scale feature space. Considering the sparsity problem of N-gram data, we suggest the L1-regularized logistic regression (L1-LR) model for classifying Chinese text, for better generalization and interpretation. The experimental results demonstrate that our proposed method achieves better performance than state-of-the-art methods. Further qualitative analysis also shows that character-based N-gram representation with L1-LR is reasonable and effective for text classification. © Springer-Verlag 2013.
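
The idea in the abstract can be illustrated with a minimal Python sketch: extract character N-grams directly from unsegmented text, then fit a binary logistic regression with an L1 penalty so that most N-gram weights are driven to zero. This is not the authors' implementation. The solver (proximal gradient descent with soft thresholding), the function names, the hyperparameters (`lam`, `lr`, `epochs`, `n_max`), and the four toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n_max=2):
    """Count all character n-grams of length 1..n_max; no word segmentation."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(counts, weights, bias):
    """Linear score of a sparse n-gram count vector."""
    return bias + sum(weights.get(f, 0.0) * v for f, v in counts.items())


def train_l1_lr(docs, labels, lam=0.05, lr=0.5, epochs=200, n_max=2):
    """Binary L1-regularized logistic regression via proximal gradient descent.

    The bias term is left unpenalized (an assumption of this sketch).
    """
    feats = [char_ngrams(d, n_max) for d in docs]
    weights, bias, m = {}, 0.0, len(docs)
    for _ in range(epochs):
        grad, grad_b = Counter(), 0.0
        for x, y in zip(feats, labels):
            err = sigmoid(score(x, weights, bias)) - y
            grad_b += err / m
            for f, v in x.items():
                grad[f] += err * v / m
        bias -= lr * grad_b
        # Gradient step followed by soft thresholding; zero weights are dropped.
        for f in set(grad) | set(weights):
            w = soft_threshold(weights.get(f, 0.0) - lr * grad[f], lr * lam)
            if w != 0.0:
                weights[f] = w
            else:
                weights.pop(f, None)
    return weights, bias


def predict(text, weights, bias, n_max=2):
    """Return 1 if the linear score is positive, else 0."""
    return 1 if score(char_ngrams(text, n_max), weights, bias) > 0 else 0


# Hypothetical toy data: 1 = sports, 0 = finance.
docs = ["足球比赛很精彩", "篮球比赛开始了", "股票市场上涨", "银行股票下跌"]
labels = [1, 1, 0, 0]
weights, bias = train_l1_lr(docs, labels)
predictions = [predict(d, weights, bias) for d in docs]
```

Inspecting `weights` after training shows which character N-grams survive the L1 penalty, which is the kind of interpretability the abstract refers to. On a real corpus, a dedicated solver and a held-out evaluation set would be needed; this sketch only shows the mechanics.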

Cite

Fu, Q., Dai, X., Huang, S., & Chen, J. (2013). Forgetting word segmentation in Chinese text classification with L1-regularized logistic regression. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7819 LNAI, pp. 245–255). https://doi.org/10.1007/978-3-642-37456-2_21
