Deep learning and low-resource languages: How much data is enough? A case study of three linguistically distinct South African languages

Abstract

In this paper, we present a case study of three under-resourced, linguistically distinct South African languages (Afrikaans, isiZulu, and Sesotho sa Leboa) to investigate how data size and the linguistic nature of a language influence the performance of different embedding types. Our experimental setup consists of training embeddings on increasing amounts of data and then evaluating the impact of data size on the downstream task of part-of-speech tagging. We find that relatively little data can produce useful representations for this specific task for all three languages. Our analysis also shows that the influence of linguistic and orthographic differences between languages should not be underestimated: morphologically complex, conjunctively written languages (isiZulu in our case) need substantially more data to achieve good results, while disjunctively written languages require substantially less. This holds not only for the data used to train the embedding model, but also for the annotated training material for the task at hand. It is therefore imperative to know the characteristics of the language you are working on in order to make linguistically informed choices about the amount of data and the type of embeddings to use.
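
The "increasing amounts of data" part of the setup can be illustrated with a minimal sketch. It is not the authors' code: the function name `nested_subsets`, the token budgets, and the choice to shuffle sentences before slicing are all assumptions made for illustration. The sketch builds nested corpus subsets so that each larger subset contains the smaller ones, which keeps results at different data sizes comparable.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Sentence = List[str]  # a tokenized sentence


def nested_subsets(
    sentences: Sequence[Sentence],
    token_budgets: Sequence[int],
    seed: int = 0,
) -> Iterator[Tuple[int, List[Sentence]]]:
    """Yield (budget, subset) pairs; each subset extends the previous one.

    Hypothetical helper: sentences are shuffled once with a fixed seed,
    then added until the running token count reaches each budget.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)

    subset: List[Sentence] = []
    n_tokens = 0
    position = 0
    for budget in sorted(token_budgets):
        # Grow the current subset until it holds at least `budget` tokens
        # or the corpus is exhausted.
        while position < len(shuffled) and n_tokens < budget:
            subset.append(shuffled[position])
            n_tokens += len(shuffled[position])
            position += 1
        yield budget, list(subset)
```

In an experiment of the kind the abstract describes, each yielded subset would be used to train one embedding model, and the tagger built on those embeddings would then be scored on a fixed test set. Those steps depend on the embedding type and tagger chosen, so they are left out here.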

Cite

Eiselen, R., & Gaustad, T. (2023). Deep learning and low-resource languages: How much data is enough? A case study of three linguistically distinct South African languages. In 4th Workshop on Resources for African Indigenous Languages, RAIL 2023 - Proceedings of the Workshop (pp. 42–53). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.rail-1.6
