Rule-based text normalization for malay social media texts

17Citations
Citations of this article
72Readers
Mendeley users who have this article in their library.

Abstract

—Malay social media text is a text written on social media networks like Twitter. Commonly, this text comprises nonstandard words, filled with dialects, foreign languages, word abbreviations, grammatical neglect, spelling errors, and many more. It is well known that this type of text is difficult to process due to its high noise and distinct text structure. Such problems can be resolved using rigorous text normalization, which is critical before any technique can be implemented and evaluated on social media text. In this paper, an improved normalization method towards Malay social media text was proposed by converting non-standard Malay words using a rule-based model. The method normalizes common language words often used by Malaysian users, such as non-standard Malay (like dialect and slangs), Romanized Arabic, and English words. Thus, a Malay text normalizer was proposed using a set of rules that extend across different domains of natural language processing (NLP) and is expected to address the challenges of processing Malay social media text. This study implements the proposed Malay text normalizer in a Part-of-Speech (POS) tagging application to evaluate the normalizer’s performance. The implementation demonstrates a substantial improvement in the POS tagging efficiency over several pre-processing stages, with an improvement of accuracy up to 31.8%. The increase of accuracy in the POS tagging indicates two main points. First, the Malay text normalizer’s rules improve the performance of a Malay text normalizer on social media text. Second, our proposed Malay text normalizer has successfully improved the POS tagging percentage and demonstrates the importance of normalized pre-processing in any NLP application.

Cite

CITATION STYLE

APA

Ariffin, S. N. A. N., & Tiun, S. (2020). Rule-based text normalization for malay social media texts. International Journal of Advanced Computer Science and Applications, 11(10), 156–162. https://doi.org/10.14569/IJACSA.2020.0111021

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free