Not Enough Data to Pre-train Your Language Model? MT to the Rescue!

Citations: 6 · Mendeley readers: 13

Abstract

In recent years, pre-trained transformer-based language models (LMs) have become a key resource for implementing most NLP tasks. However, pre-training such models demands large text collections, which are not available for most languages. In this paper, we study the use of machine-translated corpora for pre-training LMs. We answer the following research questions: RQ1: Is MT-based data an alternative to real data for learning an LM? RQ2: Can real data be complemented with translated data to improve the resulting LM? To answer these two questions, several BERT models for Basque have been trained, combining real data with synthetic data translated from Spanish. The evaluation, carried out on 9 NLU tasks, indicates that models trained exclusively on translated data offer competitive results. Furthermore, models trained with real data can be improved with synthetic data, although further research is needed on the matter.
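To make the recipe concrete, the sketch below shows one way to pre-train a BERT-style model with masked language modelling on a mix of real Basque text and Basque text machine-translated from Spanish, using the Hugging Face transformers and datasets libraries. This is not the authors' actual training setup: file names, tokenizer path, model size, and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch: masked-LM pre-training on real + machine-translated text.
# Corpus paths, tokenizer path and hyperparameters are assumptions, not the paper's setup.
from datasets import load_dataset, concatenate_datasets
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Two plain-text corpora, one segment per line (assumed file names).
real = load_dataset("text", data_files={"train": "eu_real.txt"})["train"]
synthetic = load_dataset("text", data_files={"train": "eu_translated_from_es.txt"})["train"]
corpus = concatenate_datasets([real, synthetic]).shuffle(seed=42)

# A WordPiece tokenizer trained beforehand on the same corpus (assumed local path).
tokenizer = BertTokenizerFast.from_pretrained("./eu-tokenizer")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Randomly initialized BERT-base-sized model sharing the tokenizer's vocabulary.
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

# Standard 15% token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="./bert-eu-mt",
    per_device_train_batch_size=32,
    num_train_epochs=1,      # illustrative; real pre-training runs far longer
    learning_rate=1e-4,
    save_steps=10_000,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```

The same script covers both research questions by changing the data mix: translated data only (RQ1) or real data concatenated with translated data (RQ2).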

Citation (APA)

Urbizu, G., San Vicente, I., Saralegi, X., & Corral, A. (2023). Not Enough Data to Pre-train Your Language Model? MT to the Rescue! In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 3826–3836). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.235
