A comparison and semi-quantitative analysis of words and character-bigrams as features in Chinese text categorization

Jingyang Li; Maosong Sun; Xian Zhang

Conference Proceedings, Open Access

COLING/ACL 2006 - 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (2006), Vol. 1, pp. 545–552

DOI: 10.3115/1220175.1220244

25 Citations

85 Readers

Abstract

Words and character-bigrams are both used as features in Chinese text processing tasks, but no systematic comparison or analysis of their value as features for Chinese text categorization has been reported so far. We carry out a full performance comparison between them through experiments on various document collections (including a manually word-segmented corpus as a gold standard), together with a semi-quantitative analysis that elucidates the characteristics of their behavior. We also try to provide some preliminary clues for feature term choice (in most cases, character-bigrams are better than words) and dimensionality setting in text categorization systems. © 2006 Association for Computational Linguistics.
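To make the two feature types concrete, the following is a minimal sketch of how character-bigram and word features might be counted for one short Chinese string. It is an illustration only and is not taken from the paper: the function names, the sample string, and the hand-written word segmentation are all assumptions, and a real system would obtain word boundaries from a segmenter or a manually segmented corpus.

```python
from collections import Counter


def char_bigrams(text: str) -> list[str]:
    """Return every overlapping pair of adjacent characters in text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def count_features(terms: list[str]) -> Counter:
    """Count how often each feature term occurs (a bag-of-terms vector)."""
    return Counter(terms)


# Hypothetical sample: "Chinese text categorization".
sample = "中文文本分类"

# Character-bigram features need no word segmentation.
bigram_counts = count_features(char_bigrams(sample))

# Word features require a segmentation; this one is written by hand
# for illustration and is an assumption, not the output of a real segmenter.
hand_segmented_words = ["中文", "文本", "分类"]
word_counts = count_features(hand_segmented_words)

print(sorted(bigram_counts))
print(sorted(word_counts))
```

In this toy case the bigram set contains every word in the hand segmentation plus bigrams that straddle word boundaries, which hints at why the two feature types differ in dimensionality.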

Cite

Li, J., Sun, M., & Zhang, X. (2006). A comparison and semi-quantitative analysis of words and character-bigrams as features in Chinese text categorization. In COLING/ACL 2006 - 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Vol. 1, pp. 545–552). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1220175.1220244
