Generating Varied Training Corpora in Runyankore Using a Combined Semantic and Syntactic, Pattern-Grammar-based Approach

Abstract

Machine learning algorithms have been applied to achieve high levels of accuracy in tasks associated with the processing of natural language. However, these algorithms require large amounts of training data in order to perform efficiently. Since most Bantu languages lack the required training corpora because they are computationally under-resourced, we investigated how to generate a large varied training corpus in Runyankore, a Bantu language indigenous to Uganda. We found the use of a combined semantic and syntactic, pattern and grammar-based approach to be applicable to this purpose, and used it to generate one million sentences, both labelled and unlabelled, which can be applied as training data for machine learning algorithms. The generated text was evaluated in two ways: (1) assessing the semantics encoded in word embeddings obtained from the generated text, which showed correct word similarity; and (2) applying the labelled data to tasks such as sentiment analysis, which achieved satisfactory levels of accuracy.

Cite

Byamugisha, J. (2020). Generating Varied Training Corpora in Runyankore Using a Combined Semantic and Syntactic, Pattern-Grammar-based Approach. In INLG 2020 - 13th International Conference on Natural Language Generation, Proceedings (pp. 273–282). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.inlg-1.34
