A hybrid statistical data pre-processing approach for language-independent text classification

2Citations
Citations of this article
7Readers
Mendeley users who have this article in their library.
Get full text

Abstract

Data pre-processing is an important topic in Text Classification (TC). It aims to convert the original textual data in a data-mining-ready structure, where the most significant text-features that serve to differentiate between text-categories are identified. Broadly speaking, textual data pre-processing techniques can be divided into three groups: (i) linguistic, (ii) statistical, and (iii) hybrid (i) & (ii). With regard to language-independent TC, our study relates to the statistical aspect only. The nature of textual data pre-processing includes: Document-base Representation (DR) and Feature Selection (FS). In this paper, we propose a hybrid statistical FS approach that integrates two existing (statistical FS) techniques, DIAAF (Darmstadt Indexing Approach Association Factor) and GSSC (Galavotti(Sebastiani(Simi Coefficient). Our proposed approach is presented under a statistical "bag of phrases" DR setting. The experimental results, based on the well-established associative text classification approach, demonstrate that our proposed technique outperforms existing mechanisms with respect to the accuracy of classification. © 2009 Springer.

Cite

CITATION STYLE

APA

Wang, Y. J., Coenen, F., & Sanderson, R. (2009). A hybrid statistical data pre-processing approach for language-independent text classification. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5678 LNAI, pp. 338–349). https://doi.org/10.1007/978-3-642-03348-3_33

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free