Dealing with word-internal modification and spelling variation in data-driven lemmatization

Fabian Barteld; Ingrid Schröder; Heike Zinsmeister

Conference Proceedings, Open Access

Proceedings of the Annual Meeting of the Association for Computational Linguistics (2016) 52-62

DOI: 10.18653/v1/w16-2106


Abstract

This paper describes our contribution to two challenges in data-driven lemmatization. We approach lemmatization in the framework of a two-stage process, where first lemma candidates are generated and afterwards a ranker chooses the most probable lemma from these candidates. The first challenge is that languages with rich morphology like Modern German can feature morphological changes of different kinds, in particular word-internal modification. This makes the generation of the correct lemma a harder task than just removing suffixes (stemming). The second challenge that we address is spelling variation as it appears in non-standard texts. We experiment with different generators that are specifically tailored to deal with these two challenges. We show in an oracle setting that there is a possible increase in lemmatization accuracy of 14% with our methods to generate lemma candidates on Middle Low German, a group of historical dialects of German (1200-1650 AD). Using a log-linear model to choose the correct lemma from the set, we obtain an actual increase of 5.56%.
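
The two-stage process described above (generate lemma candidates, then let a ranker choose among them) can be illustrated with a minimal sketch. This is not the authors' implementation: the suffix-rewrite generator, the two features (`log_rule_count`, `in_lexicon`), the weights, and the toy Modern German training pairs are all assumptions made for illustration. The generator here only rewrites suffixes, which is exactly the limitation the abstract points out; a generator that handles word-internal modification or spelling variation would take the place of `generate`.

```python
import math
from collections import Counter


def suffix_rule(form, lemma):
    """Derive a (strip, add) suffix rule that maps form to lemma.

    The rule is whatever remains after removing the longest common prefix.
    """
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]


def learn_rules(pairs):
    """Count how often each suffix rule occurs in (form, lemma) training pairs."""
    return Counter(suffix_rule(form, lemma) for form, lemma in pairs)


def generate(form, rules):
    """Stage 1: apply every applicable rule and collect lemma candidates.

    Returns a dict mapping each candidate to the summed count of the
    rules that produced it.
    """
    candidates = {}
    for (strip, add), count in rules.items():
        if form.endswith(strip):
            candidate = form[:len(form) - len(strip)] + add
            candidates[candidate] = candidates.get(candidate, 0) + count
    return candidates


def rank(candidates, lexicon, weights):
    """Stage 2: log-linear ranking, p(c) proportional to exp(w . f(c)).

    The features are illustrative assumptions, not those of the paper.
    """
    scores = {}
    for candidate, count in candidates.items():
        features = {
            "log_rule_count": math.log(count),
            "in_lexicon": 1.0 if candidate in lexicon else 0.0,
        }
        scores[candidate] = sum(weights[name] * value
                                for name, value in features.items())
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}


# Hypothetical toy training data (form, lemma).
training_pairs = [("gehst", "gehen"), ("sagst", "sagen"), ("fest", "fest")]
rules = learn_rules(training_pairs)

candidates = generate("machst", rules)
probabilities = rank(candidates, lexicon={"machen"},
                     weights={"log_rule_count": 1.0, "in_lexicon": 2.0})
best = max(probabilities, key=probabilities.get)
```

In this sketch the oracle setting mentioned in the abstract corresponds to asking whether the correct lemma appears in `candidates` at all, while the reported actual increase depends on whether the ranker then puts it first.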

Cite

Barteld, F., Schröder, I., & Zinsmeister, H. (2016). Dealing with word-internal modification and spelling variation in data-driven lemmatization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 52–62). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w16-2106
