Abstract
When developing NLP tools for low-resource languages, one is often confronted with the lack of annotated data. We propose to circumvent this bottleneck by training a supervised HMM tagger on a closely related language for which annotated data are available, and translating the words in the tagger parameter files into the low-resource language. The translation dictionaries are created with unsupervised lexicon induction techniques that rely only on raw textual data. We obtain a tagging accuracy of up to 89.08% using a Spanish tagger adapted to Catalan, which is 30.66% above the performance of an unadapted Spanish tagger, and 8.88% below the performance of a supervised tagger trained on annotated Catalan data. Furthermore, we evaluate our model on several Romance, Germanic and Slavic languages and obtain tagging accuracies of up to 92%.
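The adaptation step described above, translating the words in the tagger's parameter files, can be illustrated with a minimal sketch. It assumes the emission parameters are held as a nested dictionary mapping each tag to word probabilities and that the induced lexicon is a simple one-to-one word mapping. The function name `adapt_emissions`, the data layout, and the toy Spanish and Catalan entries are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def adapt_emissions(emissions, lexicon):
    """Rewrite the word keys of an emission table using a bilingual lexicon.

    emissions: {tag: {source_word: probability}} (assumed layout)
    lexicon:   {source_word: target_word} (assumed one-to-one mapping)
    Words missing from the lexicon are kept unchanged, which is one
    possible fallback for identical cognates; it is an assumption here.
    """
    adapted = {}
    for tag, word_probs in emissions.items():
        merged = defaultdict(float)
        for word, prob in word_probs.items():
            target = lexicon.get(word, word)
            # If two source words map to the same target word,
            # their probability mass is summed.
            merged[target] += prob
        adapted[tag] = dict(merged)
    return adapted


# Toy example: hypothetical Spanish emission parameters and lexicon.
spanish_emissions = {
    "NOUN": {"casa": 0.6, "noche": 0.4},
    "DET": {"el": 0.7, "la": 0.3},
}
induced_lexicon = {"noche": "nit"}

catalan_emissions = adapt_emissions(spanish_emissions, induced_lexicon)
```

The transition probabilities between tags are left untouched in this sketch, on the assumption that closely related languages share similar tag sequences; only the word side of the model is replaced.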
Cite
Scherrer, Y. (2014). Unsupervised adaptation of supervised part-of-speech taggers for closely related languages. In 1st Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, VarDial 2014 at the 25th International Conference on Computational Linguistics: System Demonstrations, COLING 2014 - Proceedings (pp. 30–38). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/w14-5304