Abstract
This paper presents Murreviikko, a dataset of dialectal Finnish tweets that have been dialectologically annotated and manually normalized to a standard form. The dataset can serve as a test set for tasks such as dialect identification and dialect-to-standard normalization. We evaluate the dataset on the normalization task, comparing an existing normalization model built on a spoken dialect corpus with three newly trained models of different architectures. We find significant differences in normalization difficulty between the dialects, and that a character-level statistical machine translation model performs best on the Murreviikko tweet dataset.
Cite
Kuparinen, O. (2023). Murreviikko – A Dialectologically Annotated and Normalized Dataset of Finnish Tweets. In ACL 2023 - 10th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2023 - Proceedings of the Workshop (pp. 31–39). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.vardial-1.3