The high dimensionality of the feature space is a major obstacle for automated text categorization. Drawing on the characteristics of Chinese character N-grams, this paper shows that overlapping features introduce a kind of redundancy. Focusing on Chinese character bigrams, it introduces the concept of δ-overlapping between two bigrams and proposes a new dimensionality reduction method, δ-Overlapped Raising (δ-OR), which raises δ-overlapped bigrams into their corresponding trigrams. The paper also designs a two-stage dimensionality reduction strategy for Chinese bigrams that combines a filtering method based on the Chi-CIG score function with δ-OR. Experimental results on a large-scale Chinese document collection indicate that, applied after the first-stage filtering, δ-OR in the second stage can significantly reduce the dimension of the feature space without sacrificing categorization effectiveness. We believe that the methodology should be language-independent. © Springer-Verlag 2004.
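The abstract does not give the formal definition of δ-overlapping, so the following is only a minimal sketch of the raising idea under a stated assumption: two bigrams c1c2 and c2c3 are treated as δ-overlapped when the trigram c1c2c3 accounts for at least a fraction δ of the occurrences of each of them. The function names `char_ngrams`, `overlap_degree`, and `delta_or` and this overlap measure are hypothetical illustrations, not the authors' definitions, and the first-stage Chi-CIG filtering is not modelled.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all character n-grams of `text` (empty if text is shorter than n)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def overlap_degree(trigram, bigram_counts, trigram_counts):
    """ASSUMED overlap measure (not taken from the paper): the smaller of the
    two ratios count(trigram) / count(bigram) for the trigram's two bigrams."""
    left, right = trigram[:2], trigram[1:]
    tri = trigram_counts[trigram]
    return min(tri / bigram_counts[left], tri / bigram_counts[right])


def delta_or(corpus, delta):
    """Sketch of raising: start from all bigram features and, for every trigram
    whose two constituent bigrams are delta-overlapped (under the assumption
    above), drop those bigrams and keep the trigram instead."""
    bigram_counts = Counter()
    trigram_counts = Counter()
    for doc in corpus:
        bigram_counts.update(char_ngrams(doc, 2))
        trigram_counts.update(char_ngrams(doc, 3))

    features = set(bigram_counts)
    for tri in trigram_counts:
        if overlap_degree(tri, bigram_counts, trigram_counts) >= delta:
            features.discard(tri[:2])
            features.discard(tri[1:])
            features.add(tri)
    return features


if __name__ == "__main__":
    docs = ["abc", "abc", "abd"]
    print(sorted(delta_or(docs, 0.6)))
    print(sorted(delta_or(docs, 1.0)))
```

In this toy corpus the bigrams "ab" and "bc" mostly occur together as "abc", so a threshold of 0.6 replaces the two bigrams with one trigram, while a threshold of 1.0 leaves the bigram features unchanged. The same functions work on Chinese text, since Python strings are indexed by character.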
CITATION STYLE
Xue, D., & Sun, M. (2004). Raising high-degree overlapped character bigrams into trigrams for dimensionality reduction in Chinese text categorization. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2945, 584–595. https://doi.org/10.1007/978-3-540-24630-5_72