Enhancing Natural Language Processing in Somali Text Classification: A Comprehensive Framework for Stop Word Removal

Abstract

Text classification is a prominent field of study in information retrieval and natural language processing (NLP), and a stop word list is a crucial component of it. Such a list identifies frequently occurring words that have little relevance for classification and are therefore removed during pre-processing. Although various stop word lists have been devised for English, no standardized stop word list tailored to Somali text classification has yet been established. This research presents a comprehensive framework for stop word removal in Somali, aiming to enhance the effectiveness of various NLP tasks. The proposed methodology comprises several steps: noise identification, noise removal, character normalization, data masking, tokenization, POS tagging, and lemmatization. Applied to a substantial dataset containing 79,741,231 tokens and 71,871,585 words, the framework demonstrates its capability to identify and eliminate stop words, thereby reducing the vector space and improving the performance of NLP algorithms. The research highlights the unique linguistic features of Somali, such as contextual variation and morphological complexity, and discusses potential applications of the developed stop word list in sentiment analysis, information retrieval, and document classification. This work contributes insights to the field of language technology, particularly for underrepresented languages, and paves the way for further advances in NLP models tailored to diverse linguistic contexts.
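
As a rough illustration of the kind of pipeline the abstract describes, the Python sketch below normalizes text, masks URLs and e-mail addresses, tokenizes on whitespace, and then flags tokens that appear in most documents as stop word candidates. It is not the authors' implementation. The abstract does not state how stop words are selected, so the document-frequency threshold is an assumption, as are all function names and the toy sentences. POS tagging and lemmatization are omitted because they would require Somali-specific resources.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical patterns for this sketch; the paper's actual rules are not given in the abstract.
URL_OR_EMAIL = re.compile(r"https?://\S+|\S+@\S+\.\S+")
NON_LETTER = re.compile(r"[^a-z'\s]")  # keep Latin letters and the apostrophe


def preprocess(text):
    """Normalize, mask, and tokenize one document (illustrative steps only)."""
    text = unicodedata.normalize("NFKC", text).lower()  # character normalization
    text = URL_OR_EMAIL.sub(" ", text)                  # mask URLs and e-mail addresses
    text = NON_LETTER.sub(" ", text)                    # drop digits, punctuation, symbols
    return text.split()                                 # whitespace tokenization


def candidate_stop_words(documents, min_doc_ratio=0.8):
    """Return tokens that occur in at least min_doc_ratio of the documents.

    Assumption: document frequency is used as the selection criterion.
    """
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(preprocess(doc)))  # count each token once per document
    threshold = min_doc_ratio * len(documents)
    return {tok for tok, df in doc_freq.items() if df >= threshold}


def remove_stop_words(text, stop_words):
    """Tokenize text and drop every token found in stop_words."""
    return [tok for tok in preprocess(text) if tok not in stop_words]


# Toy sentences for illustration only; not drawn from the paper's corpus.
docs = [
    "Cali iyo Faadumo waa arday",
    "Muqdisho waa magaalo iyo caasimad",
    "Buug iyo qalin waa alaab",
]
stops = candidate_stop_words(docs, min_doc_ratio=1.0)
print(sorted(stops))                       # ['iyo', 'waa']
print(remove_stop_words(docs[0], stops))   # ['cali', 'faadumo', 'arday']
```

Dropping the flagged tokens before vectorization shrinks the vocabulary, which is the vector space reduction the abstract refers to. On a real corpus the threshold would need tuning, and frequency alone may wrongly flag content words that happen to dominate the domain.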

Cite

Abdirahman, A. A., Hashi, A. O., Dahir, U. M., Elmi, M. A., & Rodriguez, O. E. R. (2023). Enhancing Natural Language Processing in Somali Text Classification: A Comprehensive Framework for Stop Word Removal. International Journal of Engineering Trends and Technology, 71(12), 40–49. https://doi.org/10.14445/22315381/IJETT-V71I12P205
