Introducing UberText 2.0: A Corpus of Modern Ukrainian at Scale

Dmytro Chaplynskyi

Conference Proceedings

EACL 2023 - 2nd Ukrainian Natural Language Processing Workshop, UNLP 2023 - Proceedings of the Workshop (2023) 1-10

DOI: 10.18653/v1/2023.unlp-1.1

Abstract

This paper addresses the need for massive corpora for a low-resource language by presenting the publicly available UberText 2.0 corpus for the Ukrainian language and discussing the methodology of its construction. While collecting and maintaining such a corpus is primarily a data extraction and data engineering task, the corpus itself provides a solid foundation for natural language processing tasks. It can enable the creation of contemporary language models and word embeddings, resulting in better performance on numerous downstream tasks for the Ukrainian language. In addition, the paper and the software developed can serve as guidance and a model solution for other low-resource languages. The resulting corpus is available for download on the project page. It contains 3.274 billion tokens across 8.59 million texts and takes up 32 gigabytes of space.

Cite

Chaplynskyi, D. (2023). Introducing UberText 2.0: A Corpus of Modern Ukrainian at Scale. In EACL 2023 - 2nd Ukrainian Natural Language Processing Workshop, UNLP 2023 - Proceedings of the Workshop (pp. 1–10). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.unlp-1.1
