POS-tagging different varieties of Occitan with single-dialect resources

Abstract

In this study, we tackle the question of POS-tagging written Occitan, a lesser-resourced language with multiple dialects, each containing several varieties. For POS-tagging, we use a supervised machine learning approach, which requires annotated training and evaluation corpora and, optionally, a lexicon; all of these were prepared as part of the study. Although we evaluate on two dialects of Occitan, Lengadocian and Gascon, the training material and lexicon cover only Lengadocian. We conclude that reasonable results (>89% accuracy) are possible with a very limited training corpus (2500 tokens), as long as this is compensated by intensive use of the lexicon. Results are much lower across dialects, that is, when Gascon is tagged with Lengadocian-only resources, and pointers are provided for improvement. Finally, we compare the relative contribution of more training material vs. a larger lexicon, and conclude that, within our configuration, spending effort on lexicon construction yields higher returns.
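To make the role of the lexicon concrete, the following is a minimal sketch, not the tagger used in the study: a most-frequent-tag baseline that falls back on a lexicon for word forms unseen in training. The function names, the tag labels, the default tag, and the three-word toy data are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Learn the most frequent tag for each word form seen in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(tokens, model, lexicon, default_tag="NOUN"):
    """Tag tokens: trained model first, then lexicon, then a default tag."""
    tagged = []
    for token in tokens:
        key = token.lower()
        if key in model:
            tagged.append((token, model[key]))
        elif key in lexicon:
            # Hypothetical convention: the lexicon maps a form to a list
            # of possible tags, and we simply take the first one.
            tagged.append((token, lexicon[key][0]))
        else:
            tagged.append((token, default_tag))
    return tagged


def accuracy(predicted, gold):
    """Share of tokens whose predicted tag matches the gold tag."""
    correct = sum(p[1] == g[1] for p, g in zip(predicted, gold))
    return correct / len(gold)


# Invented toy data, standing in for a tiny training corpus and a lexicon.
train = [[("lo", "DET"), ("gat", "NOUN"), ("manja", "VERB")]]
lexicon = {"can": ["NOUN"], "dormís": ["VERB"]}
gold = [("lo", "DET"), ("can", "NOUN"), ("dormís", "VERB")]
tokens = [word for word, _ in gold]

model = train_unigram_tagger(train)
without_lexicon = accuracy(tag(tokens, model, {}), gold)
with_lexicon = accuracy(tag(tokens, model, lexicon), gold)
```

In this toy setting the lexicon supplies the tag for a verb the training data never contained, which is the kind of compensation for a small training corpus that the abstract describes. A real supervised tagger would use the lexicon as one feature among many rather than as a hard fallback.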

Cite

Vergez-Couret, M., Urieli, A., & Informatique, J. (2014). POS-tagging different varieties of Occitan with single-dialect resources. In 1st Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, VarDial 2014 at the 25th International Conference on Computational Linguistics: System Demonstrations, COLING 2014 - Proceedings (pp. 21–29). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/w14-5303
