Automatic text classification using neural network and statistical approaches

Abstract

Automatic classification is crucial for text retrieval, knowledge management, and decision making because it converts text from raw data into real knowledge. This paper introduces automatic text classification, starting with improved preprocessing algorithms that are common to the different classifiers, followed by enhanced algorithms for two classifiers: a statistical one and a neural network based one. The preprocessing algorithm includes word feature extraction, stop word removal, and enhanced word stemming. For the statistical classifier, weighting techniques are introduced to enhance classification, with the conclusion that combining Term Frequency x Inverse Document Frequency (TFxIDF) with Category Frequency (CF) gives the best classification results. For the neural network based classifier, a classification model is proposed using an artificial neural network trained with the backpropagation learning algorithm. Because the feature space of textual data is typically high dimensional, scalability is poor if the neural network is trained on the raw high-dimensional data. To improve the scalability of the proposed model, four dimensionality reduction techniques are proposed that reduce the feature space to a much lower-dimensional input space for the neural network classifier. The first three are domain-dependent term selection methods: the Document Frequency (DF) method, the Category Frequency-Document Frequency (CF-DF) method, and the TFxIDF method. The fourth is a domain-independent feature extraction method based on a statistical multivariate data analysis technique, Principal Component Analysis (PCA), and it performed best in the experiments conducted. The proposed classifiers were tested in experiments on a subset of the Reuters-21578 test collection for text classification.
Although this paper uses English as the language under study in order to make use of the standard Reuters-21578 collection, the proposed model could be applied to other languages.
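
To make two of the techniques named in the abstract concrete, the following is a minimal sketch of Document Frequency (DF) term selection followed by TFxIDF weighting. It is not the paper's implementation: the abstract does not give the exact formulas, so raw term counts, a natural-log IDF of log(N / df), the `min_df` threshold, the function names, and the three toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def select_by_df(docs, min_df=2):
    """DF term selection: keep terms found in at least min_df documents.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    df = document_frequency(docs)
    return sorted(term for term, count in df.items() if count >= min_df)


def tfidf(docs, vocabulary):
    """Weight each vocabulary term as tf * log(N / df) for every document.

    Assumes every vocabulary term occurs in at least one document.
    """
    n_docs = len(docs)
    df = document_frequency(docs)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # missing terms count as 0
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vectors


# Toy, already tokenized and stemmed documents (hypothetical data).
docs = [
    ["oil", "price", "rise"],
    ["oil", "export", "fall"],
    ["wheat", "price", "rise"],
]
vocab = select_by_df(docs, min_df=2)
weights = tfidf(docs, vocab)
```

Here `vocab` keeps only the terms shared by at least two documents, so each document becomes a three-element vector instead of a six-element one. That is the kind of reduced input space the abstract describes feeding to the neural network classifier.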

Cite

ElGhazaly, T. (2018). Automatic text classification using neural network and statistical approaches. In Studies in Computational Intelligence (Vol. 740, pp. 351–369). Springer Verlag. https://doi.org/10.1007/978-3-319-67056-0_17
