Investigating the Impact of Preprocessing Techniques and Representation Models on Arabic Text Classification using Machine Learning


Abstract

Arabic Text Classification (ATC) is a crucial step for various Natural Language Processing (NLP) applications. It emerged as a response to the exponential growth of online content such as social media posts and review comments. This study evaluates how preprocessing techniques and representation models affect the effectiveness of ATC using Machine Learning (ML). Generally, ATC performance depends on various factors, such as stemming during preprocessing, feature extraction and selection, and the nature of the dataset. To enhance overall classification performance, preprocessing is primarily employed to transform each Arabic term into its root form and to reduce the dimensionality of the representation. In the representation of Arabic text, feature extraction and selection are imperative, as they significantly enhance the performance of ATC. This study implements the chosen classifiers using various feature selection algorithms. Classification outcomes are assessed by comparing several classifiers: Multinomial Naive Bayes (MNB), Bernoulli Naive Bayes (BNB), Stochastic Gradient Descent (SGD), Support Vector Classifier (SVC), Logistic Regression (LR), and Linear Support Vector Classifier (LSVC). These ML classifiers are assessed on benchmark datasets of short and long Arabic text, namely the BBC Arabic corpus and the COVID-19 dataset. The findings indicate that classification efficacy is significantly influenced by the preprocessing methods, the representation model, the classification algorithm, and the characteristics of the datasets. In most cases, SGD and LSVC outperformed the other classifiers on the datasets under consideration when significant features were selected.
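
The pipeline the abstract describes (preprocess, represent, classify) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several pieces are assumptions made for illustration: the normalization rules, the tiny stop-word list, the four toy documents, and the `sport`/`health` labels are all hypothetical. A light prefix stripper stands in for the root-based stemming the abstract mentions. A hand-written multinomial Naive Bayes with add-one smoothing stands in for the MNB classifier. No feature selection step is included.

```python
import math
import re
from collections import Counter, defaultdict

# Arabic diacritics (U+064B..U+0652) and the tatweel elongation mark (U+0640).
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Hypothetical, deliberately tiny stop-word list, stored in normalized form.
STOP_WORDS = {"في", "من", "عن", "علي", "الي"}


def normalize(text):
    """Remove diacritics and unify common letter variants (assumed rules)."""
    text = DIACRITICS.sub("", text)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")  # ta marbuta -> ha
    return text


def light_stem(token):
    """Strip the definite article when at least three letters remain.

    This is light stemming, a simpler stand-in for root extraction.
    """
    if token.startswith("\u0627\u0644") and len(token) > 4:
        return token[2:]
    return token


def preprocess(text):
    """Normalize, tokenize on Arabic letters, drop stop words, then stem."""
    tokens = re.findall("[\u0621-\u064A]+", normalize(text))
    return [light_stem(t) for t in tokens if t not in STOP_WORDS]


class TinyMultinomialNB:
    """Multinomial Naive Bayes over term counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(preprocess(doc))
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, doc):
        # Terms never seen in training are ignored.
        tokens = [t for t in preprocess(doc) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus, for illustration only.
train_docs = [
    "فاز الفريق في المباراة",
    "سجل اللاعب هدفا في المباراة",
    "ارتفاع إصابات كورونا في المدينة",
    "لقاح كورونا متوفر في المستشفى",
]
train_labels = ["sport", "sport", "health", "health"]
model = TinyMultinomialNB().fit(train_docs, train_labels)
```

Calling `model.predict("اللاعب سجل هدفا")` on this toy corpus selects the `sport` class. A realistic experiment would swap in a proper Arabic stemmer, a weighted representation such as TF-IDF, a feature selection step, and library implementations of the classifiers listed in the abstract.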

Cite

Masadeh, M., Moustapha, A., Sharada, B., Hanumanthappa, J., Hemachandran, K., Chola, C., & Muaad, A. Y. (2024). Investigating the Impact of Preprocessing Techniques and Representation Models on Arabic Text Classification using Machine Learning. International Journal of Advanced Computer Science and Applications, 15(1), 1115–1123. https://doi.org/10.14569/IJACSA.2024.01501110
