Words and character-bigrams are both used as features in Chinese text processing tasks, but no systematic comparison or analysis of their value as features for Chinese text categorization has been reported so far. We present a full performance comparison between them through experiments on various document collections (including a manually word-segmented corpus as a gold standard), together with a semi-quantitative analysis that elucidates the characteristics of their behavior. We also try to provide some preliminary guidance on the choice of feature terms (in most cases, character-bigrams are better than words) and on dimensionality setting in text categorization systems. © 2006 Association for Computational Linguistics.
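To make the two feature types concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that character-bigrams are overlapping pairs of adjacent Chinese characters that do not cross punctuation, and that word features come from text already segmented with spaces. The function names `char_bigrams` and `word_features` and the sample sentence are hypothetical.

```python
from collections import Counter


def char_bigrams(text):
    """Return overlapping bigrams of adjacent CJK characters.

    Assumption: bigrams never span punctuation, whitespace, or other
    characters outside the CJK Unified Ideographs block.
    """
    feats = []
    run = []  # current run of consecutive CJK characters
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":
            run.append(ch)
        else:
            feats.extend(a + b for a, b in zip(run, run[1:]))
            run = []
    feats.extend(a + b for a, b in zip(run, run[1:]))
    return feats


def word_features(segmented_text):
    """Return word tokens from text already segmented with spaces (assumed)."""
    return segmented_text.split()


# Hypothetical sample sentence, segmented by hand for illustration.
segmented = "我们 比较 两 种 特征"
raw = segmented.replace(" ", "")

bigram_counts = Counter(char_bigrams(raw))
word_counts = Counter(word_features(segmented))

print(sorted(bigram_counts))
print(sorted(word_counts))
```

Note that bigram extraction needs no word segmenter, whereas word features depend on one; the sketch sidesteps this by starting from pre-segmented text.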
CITATION STYLE
Li, J., Sun, M., & Zhang, X. (2006). A comparison and semi-quantitative analysis of words and character-bigrams as features in Chinese text categorization. In COLING/ACL 2006 - 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Vol. 1, pp. 545–552). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1220175.1220244