Pre-trained transformer-based language models for Sundanese

Abstract

The Sundanese language has over 32 million speakers worldwide, yet it has reaped little benefit from recent advances in natural language understanding. As with other low-resource languages, the only practical alternative has been to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In subsequent analyses, our models benefited strongly from the size of the Sundanese pre-training corpus and did not exhibit socially biased behavior. We have released our models for other researchers and practitioners to use.
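Since the checkpoints are released for public use, a minimal sketch of how one might load such a monolingual model with the Hugging Face transformers library is shown below. The hub identifier is an assumption for illustration, not confirmed by the abstract; the actual IDs are listed in the paper's release links.

```python
# Minimal sketch: querying a Sundanese masked language model with the
# Hugging Face transformers library. The hub ID below is an assumed,
# illustrative identifier; consult the paper for the released checkpoints.
from transformers import pipeline

MODEL_ID = "w11wo/sundanese-roberta-base"  # assumed/illustrative hub ID

# Fill-mask pipeline: predict the masked token in a Sundanese sentence.
fill_mask = pipeline("fill-mask", model=MODEL_ID)

# "Abdi hoyong <mask>." ~ "I want to <mask>." (RoBERTa-style mask token)
for prediction in fill_mask("Abdi hoyong <mask>."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```

The same checkpoints could likewise be fine-tuned for a downstream text classification task, e.g. by loading them through AutoModelForSequenceClassification instead of the pipeline helper.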

Citation (APA)

Wongso, W., Lucky, H., & Suhartono, D. (2022). Pre-trained transformer-based language models for Sundanese. Journal of Big Data, 9(1). https://doi.org/10.1186/s40537-022-00590-7
