Preparing the Vuk’uzenzele and ZA-gov-multilingual South African multilingual corpora

Abstract

This paper introduces two multilingual, government-themed corpora covering South African languages. The corpora were built by collecting editions of the South African Government newspaper (Vuk’uzenzele) and South African government speeches (ZA-gov-multilingual), which are translated into all 11 official South African languages. The corpora can be used for a range of downstream NLP tasks. They were created to allow researchers to study the language used in South African government publications, with a focus on understanding how South African government officials communicate with their constituents. In this paper we describe the process of gathering, cleaning, and releasing the corpora. We create parallel sentence corpora for Neural Machine Translation (NMT) tasks using Language-Agnostic Sentence Representations (LASER) embeddings. With these aligned sentences we then provide NMT benchmarks for 9 indigenous languages by fine-tuning a massively multilingual pre-trained language model.

Cite

Lastrucci, R., Dzingirai, I., Rajab, J., Madodonga, A., Shingange, M., Njini, D., & Marivate, V. (2023). Preparing the Vuk’uzenzele and ZA-gov-multilingual South African multilingual corpora. In 4th Workshop on Resources for African Indigenous Languages, RAIL 2023 - Proceedings of the Workshop (pp. 18–25). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.rail-1.3