In this study, we tackle the question of POS tagging written Occitan, a lesser-resourced language with multiple dialects, each containing several varieties. For POS tagging, we use a supervised machine learning approach, which requires annotated training and evaluation corpora and, optionally, a lexicon, all of which were prepared as part of the study. Although we evaluate two dialects of Occitan, Lengadocian and Gascon, the training material and lexicon cover only Lengadocian. We conclude that reasonable results (>89% accuracy) are possible with a very limited training corpus (2500 tokens), as long as this is compensated for by intensive use of the lexicon. Results are much lower across dialects, that is, when the Lengadocian resources are applied to Gascon, and we provide pointers for improvement. Finally, we compare the relative contribution of more training material vs. a larger lexicon, and conclude that, within our configuration, spending effort on lexicon construction yields higher returns.
Vergez-Couret, M., & Urieli, A. (2014). Pos-tagging different varieties of Occitan with single-dialect resources. In 1st Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, VarDial 2014 at the 25th International Conference on Computational Linguistics: System Demonstrations, COLING 2014 - Proceedings (pp. 21–29). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/w14-5303