Data Augmentation Methods for Low-Resource Orthographic Syllabification

Abstract

An n-gram syllabification model generally produces a high error rate for a low-resource language such as Indonesian because many n-grams are out-of-vocabulary (OOV). This paper proposes a combination of three data augmentation methods to address the problem: swapping consonant-graphemes, flipping onsets, and transposing nuclei. An investigation on 50k Indonesian words shows that the combination drastically increases the number of both unigrams and bigrams. In previous work, flipping onsets was shown to improve standard bigram syllabification, giving a relative decrease in syllable error rate (SER) of up to 18.02%, while swapping consonant-graphemes gave a relative decrease in SER of up to 31.39%. In this research, a new augmentation method based on transposing nuclei is proposed and combined with the flipping and swapping procedures to address the weakness of bigram syllabification in handling OOV bigrams. An evaluation using k-fold cross-validation (k-FCV) with k = 5 on 50 thousand formal Indonesian words shows that the proposed combination of the three procedures gives a relative decrease of up to 37.63% in the mean SER of the standard bigram model. The proposed model is comparable to the fuzzy k-nearest neighbor in every class (FkNNC)-based model. It is worse than the state-of-the-art model, which combines bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNN), and conditional random fields (CRF), but it has low complexity.
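The quantities reported in the abstract can be made concrete with a minimal Python sketch. It is not the authors' code. The function names (`relative_decrease`, `char_bigrams`, `oov_bigram_rate`, `k_fold_indices`) are hypothetical, the use of character bigrams with `#` as a word-boundary marker is a simplifying assumption about how n-grams are counted, and the contiguous fold split is only one possible way to set up 5-fold cross-validation. The SER values in the usage note are toy numbers, not results from the paper.

```python
from typing import Iterable, List, Tuple


def relative_decrease(baseline_ser: float, new_ser: float) -> float:
    """Relative SER decrease in percent: 100 * (baseline - new) / baseline."""
    if baseline_ser <= 0:
        raise ValueError("baseline_ser must be positive")
    return 100.0 * (baseline_ser - new_ser) / baseline_ser


def char_bigrams(word: str) -> List[str]:
    """Character bigrams of a word, padded with '#' (an assumed boundary marker)."""
    padded = f"#{word}#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def oov_bigram_rate(train_words: Iterable[str], test_words: Iterable[str]) -> float:
    """Fraction of test bigram tokens that never occur in the training words."""
    seen = {bg for word in train_words for bg in char_bigrams(word)}
    test = [bg for word in test_words for bg in char_bigrams(word)]
    if not test:
        return 0.0
    return sum(bg not in seen for bg in test) / len(test)


def k_fold_indices(n_items: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_items) into k contiguous folds; return (train, test) index lists."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        folds.append((train, test))
        start += size
    return folds
```

As a toy illustration, `relative_decrease(0.08, 0.05)` gives a 37.5% relative decrease, and `k_fold_indices(50000, 5)` yields five folds with 10,000 test words and 40,000 training words each. A data augmentation method that adds plausible new words to the training side would be expected to lower `oov_bigram_rate` on the held-out fold.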

Cite

Suyanto, S., Lhaksmana, K. M., Bijaksana, M. A., & Kurniawan, A. (2020). Data Augmentation Methods for Low-Resource Orthographic Syllabification. IEEE Access, 8, 147399–147406. https://doi.org/10.1109/ACCESS.2020.3015778
