Understanding Native Language Identification for Brazilian Indigenous Languages

Abstract

We investigate native language identification (LangID) for Brazilian Indigenous Languages (BILs), using the Bible as training data. Our research extends previous work by presenting two analyses of how well Bible-based LangID generalizes to non-biblical data. First, with newly collected non-biblical datasets, we show that such a LangID can still provide reasonable accuracy for languages with more established writing standards, such as Guarani Mbya and Kaingang, but that accuracy can drop drastically depending on the language. Then, we applied the LangID to a large set of texts, about 13M sentences from the Portuguese Wikipedia, to understand the difficulty factors that may arise from such a task in practice. The main outcome is that the lack of handling of other American indigenous languages can considerably affect the precision for BILs, suggesting the need for a joint effort with related languages from the Americas.

Cite

Cavalin, P., Domingues, P. H., Nogima, J., & Pinhanez, C. (2023). Understanding Native Language Identification for Brazilian Indigenous Languages. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 12–18). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.americasnlp-1.3
