GPT-2 Metadata Pretraining Towards Instruction Finetuning for Ukrainian

Volodymyr Kyrylov; Dmytro Chaplynskyi

Conference Proceedings, Open Access

EACL 2023 - 2nd Ukrainian Natural Language Processing Workshop, UNLP 2023 - Proceedings of the Workshop (2023) 32-39

DOI: 10.18653/v1/2023.unlp-1.4

Abstract

We explore pretraining unidirectional language models on 4B tokens from UberText 2.0, the largest curated corpus of Ukrainian. We enrich document text by surrounding it with weakly structured metadata, such as title, tags, and publication year, which enables both metadata-conditioned text generation and text-conditioned metadata prediction. We pretrain GPT-2 Small, Medium, and Large models on a single GPU and report training times, bits per character (BPC) on BrUK, and BERTScore and BLEURT on titles for 1000 News from the Future. Next, we format POS and NER datasets as instructions and train low-rank attention adapters, performing these tasks as constrained text generation. We release our models for the community at https://github.com/proger/uk4b.
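
The idea of surrounding document text with metadata can be illustrated with a minimal sketch. The field labels, the `key: value` line layout, and the helper name `wrap_with_metadata` are assumptions made for illustration only; the abstract does not specify the actual format used by the authors. Fields placed before the text let a model generate text conditioned on metadata, while fields placed after the text let it predict metadata from the text.

```python
def wrap_with_metadata(text, metadata, before=(), after=()):
    """Surround a document with metadata lines.

    Hypothetical layout: one "key: value" line per field.
    Keys listed in `before` are emitted ahead of the text,
    keys listed in `after` follow it. Missing keys are skipped.
    """
    def render(keys):
        return [f"{key}: {metadata[key]}" for key in keys if key in metadata]

    return "\n".join(render(before) + [text] + render(after))


# Example: the year conditions the text, the title is predicted from it.
example = wrap_with_metadata(
    "Body text of the article.",
    {"title": "Example headline", "year": 2023},
    before=("year",),
    after=("title",),
)
print(example)
```

Varying which fields go before and which go after the text across training documents is one plausible way a single model could learn both directions at once.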

Cite

Kyrylov, V., & Chaplynskyi, D. (2023). GPT-2 Metadata Pretraining Towards Instruction Finetuning for Ukrainian. In EACL 2023 - 2nd Ukrainian Natural Language Processing Workshop, UNLP 2023 - Proceedings of the Workshop (pp. 32–39). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.unlp-1.4
