Raising high-degree overlapped character bigrams into trigrams for dimensionality reduction in Chinese text categorization


Abstract

High dimensionality of the feature space is a crucial obstacle for Automated Text Categorization. Based on the characteristics of Chinese character N-grams, this paper shows that a kind of redundancy arises from feature overlapping. Focusing on Chinese character bigrams, the paper introduces the concept of δ-overlapping between two bigrams and proposes a new dimensionality reduction method, called δ-Overlapped Raising (δ-OR), which raises δ-overlapped bigrams into their corresponding trigrams. Moreover, the paper designs a two-stage dimensionality reduction strategy for Chinese bigrams that integrates a filtering method based on the Chi-CIG score function with the δ-OR method. Experimental results on a large-scale Chinese document collection indicate that, following the first stage of reduction, δ-OR at the second stage can significantly reduce the dimension of the feature space without sacrificing categorization effectiveness. We believe that the above methodology would be language-independent. © Springer-Verlag 2004.
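
The raising step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formal definition of δ-overlapping, so the criterion used here is an assumption made purely for illustration, not the authors' definition: two bigrams c1c2 and c2c3 are treated as δ-overlapped when the trigram c1c2c3 accounts for at least a fraction δ of the occurrences of each bigram. The function names `char_ngrams` and `delta_overlapped_raise` are hypothetical, and the first-stage Chi-CIG filtering is not shown.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def delta_overlapped_raise(docs, delta):
    """Replace heavily overlapping bigram pairs with their trigram.

    ASSUMPTION (not from the paper): bigrams c1c2 and c2c3 count as
    delta-overlapped when the trigram c1c2c3 covers at least `delta`
    of the occurrences of each of the two bigrams.
    """
    bigram_counts = Counter()
    trigram_counts = Counter()
    for doc in docs:
        bigram_counts.update(char_ngrams(doc, 2))
        trigram_counts.update(char_ngrams(doc, 3))

    # Map each qualifying trigram to the pair of bigrams it replaces.
    raised = {}
    for tri, tri_n in trigram_counts.items():
        left, right = tri[:2], tri[1:]
        if (tri_n >= delta * bigram_counts[left]
                and tri_n >= delta * bigram_counts[right]):
            raised[tri] = (left, right)

    # Drop the absorbed bigrams and add the trigrams that replace them.
    absorbed = {bigram for pair in raised.values() for bigram in pair}
    features = (set(bigram_counts) - absorbed) | set(raised)
    return features, raised


features, raised = delta_overlapped_raise(["abc", "abc", "abd"], delta=0.6)
print(sorted(features), raised)
```

In this toy corpus the three bigram features (ab, bc, bd) shrink to two features (abc, bd), since each raised trigram stands in for two bigrams. How the paper handles a bigram that overlaps with several neighbours is not stated in the abstract, so the sketch simply drops every absorbed bigram.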

Cite

Xue, D., & Sun, M. (2004). Raising high-degree overlapped character bigrams into trigrams for dimensionality reduction in Chinese text categorization. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2945, 584–595. https://doi.org/10.1007/978-3-540-24630-5_72
