An n-gram syllabification model generally produces a high error rate for a low-resource language such as Indonesian because of the high rate of out-of-vocabulary (OOV) n-grams. In this paper, a combination of three data augmentation methods is proposed to address the problem: swapping consonant-graphemes, flipping onsets, and transposing nuclei. An investigation on 50k Indonesian words shows that combining the three methods drastically increases the number of both unigrams and bigrams. The earlier onset-flipping procedure has been shown to improve standard bigram syllabification, giving a relative reduction in syllable error rate (SER) of up to 18.02%, while the earlier consonant-grapheme swapping has been shown to give a relative SER reduction of up to 31.39%. In this research, a new augmentation method based on transposing nuclei is proposed and combined with both the flipping and swapping procedures to tackle the weakness of bigram syllabification in handling OOV bigrams. An evaluation using k-fold cross-validation (k-FCV) with k = 5 on 50 thousand formal Indonesian words shows that the proposed combination of the three procedures gives a relative reduction of up to 37.63% in the mean SER produced by the standard bigram model. The proposed model is comparable to the model based on fuzzy k-nearest neighbor in every class (FkNNC). It is worse than the state-of-the-art model, which combines bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNN), and conditional random fields (CRF), but it offers low complexity.
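The two quantities the abstract relies on, the OOV bigram rate and the relative SER reduction, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of syllables as the n-gram unit, and the toy words and error rates are all assumptions made for illustration only.

```python
def bigrams(units):
    """Return adjacent pairs from a sequence of units (assumed here: syllables)."""
    return list(zip(units, units[1:]))


def oov_bigram_rate(train_words, test_words):
    """Fraction of bigram tokens in test_words that never occur in train_words."""
    seen = {bg for word in train_words for bg in bigrams(word)}
    test = [bg for word in test_words for bg in bigrams(word)]
    if not test:
        return 0.0
    return sum(bg not in seen for bg in test) / len(test)


def relative_reduction(baseline_ser, new_ser):
    """Relative decrease in SER, in percent, with respect to the baseline."""
    if baseline_ser <= 0:
        raise ValueError("baseline_ser must be positive")
    return 100.0 * (baseline_ser - new_ser) / baseline_ser


if __name__ == "__main__":
    # Hypothetical toy data: each word is a list of syllables.
    train = [["ma", "kan"], ["ma", "in"]]
    test = [["ma", "kan"], ["ma", "lam"]]
    print(oov_bigram_rate(train, test))

    # Hypothetical SER values, not taken from the paper.
    print(relative_reduction(4.0, 3.0))
```

Under these assumptions, augmentation would act by adding extra entries to `train`, which can only lower the OOV rate measured on a fixed test set; whether SER improves as well has to be checked empirically, as the paper does with 5-fold cross-validation.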
CITATION
Suyanto, S., Lhaksmana, K. M., Bijaksana, M. A., & Kurniawan, A. (2020). Data Augmentation Methods for Low-Resource Orthographic Syllabification. IEEE Access, 8, 147399–147406. https://doi.org/10.1109/ACCESS.2020.3015778