Forgetting word segmentation in Chinese text classification with L1-regularized logistic regression

Qiang Fu; Xinyu Dai; Shujian Huang; Jiajun Chen

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2013) 7819 LNAI(PART 2) 245-255

DOI: 10.1007/978-3-642-37456-2_21

Abstract

Word segmentation is a common preprocessing step for Chinese text representation when building a text classification system. We have found that representing Chinese text by segmented words may lose features that are valuable for classification, regardless of whether the segmentation results are correct. To preserve these features, we propose representing Chinese text with character-based N-grams in a larger feature space. To address the sparsity of N-gram data, we suggest the L1-regularized logistic regression (L1-LR) model for classifying Chinese text, for better generalization and interpretability. The experimental results demonstrate that the proposed method achieves better performance than state-of-the-art methods. Further qualitative analysis also shows that character-based N-gram representation with L1-LR is reasonable and effective for text classification. © Springer-Verlag 2013.
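The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the N-gram lengths, the solver, or the regularization strength, so the function names (`char_ngrams`, `train_l1_lr`, `predict_proba`), the proximal gradient solver, the hyperparameter values, and the four toy documents below are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_max=2):
    """Count every character n-gram of length 1..n_max; no word segmentation."""
    chars = [c for c in text if not c.isspace()]
    grams = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(chars) - n + 1):
            grams["".join(chars[i:i + n])] += 1
    return grams


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    """Proximal operator of t * |w|: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def predict_proba(text, weights, bias, n_max=2):
    """Probability that `text` belongs to the positive class."""
    grams = char_ngrams(text, n_max)
    return sigmoid(bias + sum(weights.get(g, 0.0) * c for g, c in grams.items()))


def train_l1_lr(docs, labels, lam=0.05, lr=0.2, epochs=500, n_max=2):
    """Proximal gradient descent on mean log-loss + lam * ||w||_1 (bias unpenalized)."""
    feats = [char_ngrams(d, n_max) for d in docs]
    weights, bias, m = {}, 0.0, len(docs)
    for _ in range(epochs):
        grad, grad_b = defaultdict(float), 0.0
        for x, y in zip(feats, labels):
            z = bias + sum(weights.get(g, 0.0) * c for g, c in x.items())
            err = sigmoid(z) - y
            grad_b += err
            for g, c in x.items():
                grad[g] += err * c
        bias -= lr * grad_b / m
        for g, dg in grad.items():
            # Gradient step on the log-loss, then the L1 shrinkage step.
            w = soft_threshold(weights.get(g, 0.0) - lr * dg / m, lr * lam)
            if w != 0.0:
                weights[g] = w
            else:
                weights.pop(g, None)  # zeroed features are dropped entirely
    return weights, bias


# Toy data invented for this sketch (not from the paper): 1 = sports, 0 = finance.
docs = ["足球比赛很精彩", "篮球比赛结束了", "股票市场上涨", "银行利率下调"]
labels = [1, 1, 0, 0]
weights, bias = train_l1_lr(docs, labels)
print(sorted(weights, key=lambda g: -abs(weights[g])))  # n-grams that survive L1
print(predict_proba("排球比赛", weights, bias))
```

Because the L1 penalty sets many weights exactly to zero, the N-grams that remain in `weights` can be read off directly, which is one way to interpret the abstract's remark about interpretability. For real corpora a library solver would likely be preferable to this hand-written loop.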

Cite

Fu, Q., Dai, X., Huang, S., & Chen, J. (2013). Forgetting word segmentation in Chinese text classification with L1-regularized logistic regression. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7819 LNAI, pp. 245–255). https://doi.org/10.1007/978-3-642-37456-2_21
