POS-tagging different varieties of Occitan with single-dialect resources

Marianne Vergez-Couret; Assaf Urieli (Joliciel Informatique)

Conference Proceedings

1st Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, VarDial 2014 at the 25th International Conference on Computational Linguistics: System Demonstrations, COLING 2014 - Proceedings (2014) 21-29

DOI: 10.3115/v1/w14-5303

Abstract

In this study, we tackle the question of POS-tagging written Occitan, a lesser-resourced language with multiple dialects, each containing several varieties. For POS-tagging, we use a supervised machine learning approach, which requires annotated training and evaluation corpora and optionally a lexicon, all of which were prepared as part of the study. Although we evaluate two dialects of Occitan, Lengadocian and Gascon, the training material and lexicon concern only Lengadocian. We conclude that reasonable results (>89% accuracy) are possible with a very limited training corpus (2500 tokens), as long as it is compensated by intensive use of the lexicon. Results are much lower across dialects, and pointers are provided for improvement. Finally, we compare the relative contribution of more training material vs. a larger lexicon, and conclude that within our configuration, spending effort on lexicon construction yields higher returns.
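The idea that a lexicon can compensate for a very small training corpus can be illustrated with a toy sketch. This is not the tagger used in the study: it is a simple lookup baseline, and the function names, the tag labels (`DET`, `NOUN`, `VERB`), and the three-word training sentence are all assumptions made for illustration. Words seen in training get their most frequent training tag, unseen words fall back to the lexicon, and anything else gets a default tag.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each lower-cased word form to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(tokens, word_to_tag, lexicon=None, default_tag="NOUN"):
    """Tag tokens: training lookup first, then lexicon, then a default tag.

    `lexicon` maps a word form to a list of possible tags; this sketch simply
    takes the first one. A real supervised tagger would more likely use the
    lexicon entries as classifier features.
    """
    lexicon = lexicon or {}
    predicted = []
    for token in tokens:
        key = token.lower()
        if key in word_to_tag:
            predicted.append(word_to_tag[key])
        elif key in lexicon:
            predicted.append(lexicon[key][0])
        else:
            predicted.append(default_tag)
    return predicted


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy data (illustrative only, not taken from the study's corpora).
training = [[("lo", "DET"), ("gat", "NOUN"), ("manja", "VERB")]]
lexicon = {"la": ["DET"], "filha": ["NOUN"], "canta": ["VERB"]}

test_tokens = ["la", "filha", "canta"]
gold_tags = ["DET", "NOUN", "VERB"]

model = train_baseline(training)
acc_without = accuracy(tag_tokens(test_tokens, model), gold_tags)
acc_with = accuracy(tag_tokens(test_tokens, model, lexicon), gold_tags)
print(f"without lexicon: {acc_without:.2f}, with lexicon: {acc_with:.2f}")
```

In this toy setting none of the test words occur in the training sentence, so the lexicon alone determines whether they are tagged correctly. It shows only the direction of the effect described in the abstract and says nothing about its size on real data.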

Cite (APA)

Vergez-Couret, M., & Urieli, A. (2014). POS-tagging different varieties of Occitan with single-dialect resources. In 1st Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, VarDial 2014 at the 25th International Conference on Computational Linguistics: System Demonstrations, COLING 2014 - Proceedings (pp. 21–29). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/w14-5303
