This paper describes our contribution to two challenges in data-driven lemmatization. We approach lemmatization in the framework of a two-stage process, where first lemma candidates are generated and afterwards a ranker chooses the most probable lemma from these candidates. The first challenge is that languages with rich morphology like Modern German can feature morphological changes of different kinds, in particular word-internal modification. This makes the generation of the correct lemma a harder task than just removing suffixes (stemming). The second challenge that we address is spelling variation as it appears in non-standard texts. We experiment with different generators that are specifically tailored to deal with these two challenges. We show in an oracle setting that there is a possible increase in lemmatization accuracy of 14% with our methods to generate lemma candidates on Middle Low German, a group of historical dialects of German (1200-1650 AD). Using a log-linear model to choose the correct lemma from the set, we obtain an actual increase of 5.56%.
CITATION STYLE
Barteld, F., Schröder, I., & Zinsmeister, H. (2016). Dealing with word-internal modification and spelling variation in data-driven lemmatization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 52–62). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w16-2106
Mendeley helps you to discover research relevant for your work.