Employing a Multilingual Transformer Model for Segmenting Unpunctuated Arabic Text

Abstract

Long unpunctuated texts containing complex sentences are a stumbling block when processing low-resource languages. Approaches that segment lengthy texts lacking proper punctuation into simple candidate sentences are therefore a vital preprocessing step in many hard-to-solve NLP applications. To this end, we propose a preprocessing solution for segmenting unpunctuated Arabic texts into potentially independent clauses. The solution consists of (1) a punctuation detection model built on top of a multilingual BERT-based model, and (2) a set of generic linguistic rules for validating the resulting segmentation. We further optimize how these linguistic rules are applied using a greedy-like algorithm. We call the proposed solution PDTS (Punctuation Detector for Text Segmentation). For evaluation, we show how PDTS can be employed as a text tokenizer for unpunctuated documents (i.e., documents that mimic transcribed audio-to-text output). Experimental findings across two evaluation protocols (an ablation study and a human-based judgment) demonstrate that PDTS is effective in terms of both performance quality and computational cost. In particular, PDTS reaches an average F-Measure score of approximately (Formula presented.), a minimum improvement of roughly (Formula presented.) over state-of-the-art competitor models.
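
The two-stage idea described in the abstract (score candidate boundaries, then validate them with rules in a greedy order) can be illustrated with a minimal sketch. This is not the authors' PDTS implementation: the per-token `boundary_scores` stand in for the output of a punctuation detection model, and `min_length_rule`, `greedy_segment`, and the `threshold` parameter are hypothetical names and choices introduced only for illustration.

```python
from typing import Callable, List, Sequence


def min_length_rule(segment: Sequence[str], min_tokens: int = 2) -> bool:
    """Hypothetical validation rule: a clause needs at least `min_tokens` tokens."""
    return len(segment) >= min_tokens


def greedy_segment(
    tokens: Sequence[str],
    boundary_scores: Sequence[float],
    is_valid: Callable[[Sequence[str]], bool] = min_length_rule,
    threshold: float = 0.5,
) -> List[List[str]]:
    """Cut after tokens[i] when boundary_scores[i] >= threshold and both
    segments created by the cut still pass `is_valid`.

    `boundary_scores` is an assumed stand-in for a model's per-token
    probability that a punctuation mark follows the token.
    """
    if len(tokens) != len(boundary_scores):
        raise ValueError("one score per token is required")
    if not tokens:
        return []

    # Candidate cut points, most confident first (the greedy ordering).
    # The last token is skipped: a cut after it would create nothing.
    candidates = sorted(
        (i for i, s in enumerate(boundary_scores[:-1]) if s >= threshold),
        key=lambda i: boundary_scores[i],
        reverse=True,
    )

    cuts: List[int] = []  # accepted cut indices; a cut falls after tokens[i]
    for i in candidates:
        trial = sorted(cuts + [i])
        pos = trial.index(i)
        # Bounds of the segment that this cut would split in two.
        start = trial[pos - 1] + 1 if pos > 0 else 0
        end = trial[pos + 1] + 1 if pos + 1 < len(trial) else len(tokens)
        left, right = tokens[start : i + 1], tokens[i + 1 : end]
        if is_valid(left) and is_valid(right):
            cuts = trial  # accept the cut; otherwise it is discarded

    bounds = [0] + [c + 1 for c in cuts] + [len(tokens)]
    return [list(tokens[a:b]) for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # Placeholder tokens and made-up scores, purely for demonstration.
    words = ["w1", "w2", "w3", "w4", "w5"]
    scores = [0.1, 0.9, 0.8, 0.2, 0.0]
    print(greedy_segment(words, scores))
```

In this sketch, a high-scoring boundary is rejected when accepting it would leave a segment that fails the rule, which is one simple way a rule-validation step can override raw model predictions. Real linguistic rules for Arabic would replace the length check.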

Cite

Alshanqiti, A. M., Albouq, S., Alkhodre, A. B., Namoun, A., & Nabil, E. (2022). Employing a Multilingual Transformer Model for Segmenting Unpunctuated Arabic Text. Applied Sciences (Switzerland), 12(20). https://doi.org/10.3390/app122010559
