Murreviikko – A Dialectologically Annotated and Normalized Dataset of Finnish Tweets

Abstract

This paper presents Murreviikko, a dataset of dialectal Finnish tweets that have been dialectologically annotated and manually normalized to a standard form. The dataset can serve as a test set for tasks such as dialect identification and dialect-to-standard normalization. We evaluate the dataset on the normalization task, comparing an existing normalization model built on a spoken dialect corpus with three newly trained models with different architectures. We find that there are significant differences in normalization difficulty between the dialects, and that a character-level statistical machine translation model performs best on the Murreviikko tweet dataset.

Cite

Kuparinen, O. (2023). Murreviikko – A Dialectologically Annotated and Normalized Dataset of Finnish Tweets. In ACL 2023 - 10th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2023 - Proceedings of the Workshop (pp. 31–39). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.vardial-1.3
