Murreviikko – A Dialectologically Annotated and Normalized Dataset of Finnish Tweets

Olli Kuparinen

Conference Proceedings

ACL 2023 - 10th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2023 - Proceedings of the Workshop (2023) 31-39

DOI: 10.18653/v1/2023.vardial-1.3

Abstract

This paper presents Murreviikko, a dataset of dialectal Finnish tweets which have been dialectologically annotated and manually normalized to a standard form. The dataset can be used as a test set for dialect identification and dialect-to-standard normalization, for instance. We evaluate the dataset on the normalization task, comparing an existing normalization model built on a spoken dialect corpus and three newly trained models with different architectures. We find that there are significant differences in normalization difficulty between the dialects, and that a character-level statistical machine translation model performs best on the Murreviikko tweet dataset.

Citation (APA)

Kuparinen, O. (2023). Murreviikko – A Dialectologically Annotated and Normalized Dataset of Finnish Tweets. In ACL 2023 - 10th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2023 - Proceedings of the Workshop (pp. 31–39). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.vardial-1.3
