Investigating the Impact of Preprocessing Techniques and Representation Models on Arabic Text Classification using Machine Learning

Mahmoud Masadeh; A. Moustapha; B. Sharada; J. Hanumanthappa; K. Hemachandran; Channabasava Chola; Abdullah Y. Muaad

Journal Article, Open Access

International Journal of Advanced Computer Science and Applications (2024) 15(1) 1115-1123

DOI: 10.14569/IJACSA.2024.01501110

8 Citations

18 Readers

Abstract

Arabic Text Classification (ATC) is a crucial step for various Natural Language Processing (NLP) applications. It emerged as a response to the exponential growth of online content such as social media posts and review comments. This study evaluates how preprocessing techniques and representation models affect the effectiveness of ATC using Machine Learning (ML). Generally, ATC performance depends on several factors, such as stemming during preprocessing, feature extraction and selection, and the nature of the dataset. To enhance overall classification performance, preprocessing is primarily used to transform each Arabic term into its root form and to reduce the dimensionality of the representation. In the representation of Arabic text, feature extraction and selection are essential, as they significantly enhance the performance of ATC. This study implements the chosen classifiers with various feature selection algorithms. Classification outcomes are assessed by comparing several classifiers: Multinomial Naive Bayes (MNB), Bernoulli Naive Bayes (BNB), Stochastic Gradient Descent (SGD), Support Vector Classifier (SVC), Logistic Regression (LR), and Linear Support Vector Classifier (LSVC). These ML classifiers are evaluated on two benchmark datasets covering short and long Arabic text: the BBC Arabic corpus and the COVID-19 dataset. The findings indicate that classification efficacy is significantly influenced by the preprocessing methods, the representation model, the classification algorithm, and the characteristics of the datasets. In most cases, SGD and LSVC outperformed the other classifiers on the datasets under consideration when significant features were selected.
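The abstract's claim that preprocessing reduces the dimensionality of the representation can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a simple light stemmer (diacritic removal, alef normalization, and stripping of a few common affixes) rather than true root extraction, and the affix lists and the names `normalize`, `light_stem`, and `vocabulary` are hypothetical choices made for illustration.

```python
import re

# Short-vowel diacritics (U+064B..U+0652) and tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Alef with hamza above / hamza below / madda, mapped to bare alef.
ALEF_VARIANTS = re.compile(r"[\u0623\u0625\u0622]")

# Hypothetical, deliberately tiny affix lists (an assumption, not from the paper).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")


def normalize(token):
    """Remove diacritics and unify common letter variants."""
    token = DIACRITICS.sub("", token)
    token = ALEF_VARIANTS.sub("\u0627", token)
    return token.replace("\u0649", "\u064A")  # alef maqsura -> ya


def light_stem(token, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    token = normalize(token)
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= min_len:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_len:
            token = token[:-len(suffix)]
            break
    return token


def vocabulary(tokens, stem=False):
    """Return the set of distinct features, with or without stemming."""
    return {light_stem(t) if stem else t for t in tokens}


# Four surface forms of the same word collapse into one feature.
tokens = ["الكتاب", "والكتاب", "بالكتاب", "كتاب"]
print(len(vocabulary(tokens)), len(vocabulary(tokens, stem=True)))  # 4 1
```

A smaller vocabulary means fewer columns in any bag-of-words or TF-IDF representation built on top of it, which is the dimensionality reduction the abstract refers to. A production system would more likely rely on an established Arabic stemmer than on hand-written affix lists.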

Cite (APA)

Masadeh, M., Moustapha, A., Sharada, B., Hanumanthappa, J., Hemachandran, K., Chola, C., & Muaad, A. Y. (2024). Investigating the Impact of Preprocessing Techniques and Representation Models on Arabic Text Classification using Machine Learning. International Journal of Advanced Computer Science and Applications, 15(1), 1115–1123. https://doi.org/10.14569/IJACSA.2024.01501110
