Dealing with word-internal modification and spelling variation in data-driven lemmatization

Abstract

This paper describes our contribution to two challenges in data-driven lemmatization. We treat lemmatization as a two-stage process: lemma candidates are generated first, and a ranker then chooses the most probable lemma from these candidates. The first challenge is that languages with rich morphology, such as Modern German, can feature morphological changes of different kinds, in particular word-internal modification. This makes generating the correct lemma harder than simply removing suffixes (stemming). The second challenge is spelling variation as it appears in non-standard texts. We experiment with different generators that are specifically tailored to these two challenges. In an oracle setting, we show that our candidate generation methods allow a possible increase in lemmatization accuracy of 14% on Middle Low German, a group of historical dialects of German (1200–1650 AD). Using a log-linear model to choose the correct lemma from the candidate set, we obtain an actual increase of 5.56%.
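The two-stage setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the suffix-rule generator, the two features, the hand-set weights, and the Modern German toy data are all assumptions made for illustration. The generator here is deliberately the plain suffix-based baseline, so it shows why word-internal modification lowers the oracle ceiling.

```python
import math
from collections import Counter

def suffix_rule(form, lemma):
    """Strip the longest common prefix; return (form_suffix, lemma_suffix)."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]

def learn_rules(pairs):
    """Count the suffix rules observed in (form, lemma) training pairs."""
    return Counter(suffix_rule(form, lemma) for form, lemma in pairs)

def generate(form, rules):
    """Stage 1: apply every matching suffix rule to get lemma candidates.
    Maps each candidate to the highest count of a rule that produced it."""
    candidates = {}
    for (form_suffix, lemma_suffix), count in rules.items():
        if form.endswith(form_suffix):
            stem = form[:len(form) - len(form_suffix)]
            candidate = stem + lemma_suffix
            candidates[candidate] = max(candidates.get(candidate, 0), count)
    return candidates

def rank(candidates, lexicon, weights):
    """Stage 2: log-linear ranker, p(c) proportional to exp(w . f(c)).
    The features and weights are hypothetical, not taken from the paper."""
    scores = {}
    for candidate, count in candidates.items():
        features = {
            "log_rule_count": math.log(count),
            "in_lexicon": 1.0 if candidate in lexicon else 0.0,
        }
        scores[candidate] = sum(weights[k] * v for k, v in features.items())
    z = sum(math.exp(s) for s in scores.values())
    probs = {c: math.exp(s) / z for c, s in scores.items()}
    return max(probs, key=probs.get), probs

def oracle_accuracy(pairs, rules):
    """Share of forms whose gold lemma is among the generated candidates,
    i.e. the accuracy a perfect ranker could reach."""
    hits = sum(1 for form, lemma in pairs if lemma in generate(form, rules))
    return hits / len(pairs)

# Toy data (illustrative only).
TRAIN = [("Tagen", "Tag"), ("Frauen", "Frau"),
         ("laufen", "laufen"), ("Kindern", "Kind")]
LEXICON = {"Hund", "Haus"}
WEIGHTS = {"log_rule_count": 1.0, "in_lexicon": 2.0}

rules = learn_rules(TRAIN)
best, probs = rank(generate("Hunden", rules), LEXICON, WEIGHTS)
print(best, probs)
# "Häuser" -> "Haus" needs a word-internal change (ä -> a), which
# suffix rules cannot produce, so the gold lemma is never generated.
print(oracle_accuracy([("Hunden", "Hund"), ("Häuser", "Haus")], rules))
```

In this sketch, no ranker can recover "Haus" from "Häuser" because the candidate is never generated. Generators that also cover word-internal modification and spelling variation raise the oracle accuracy, and the ranker then determines how much of that ceiling is actually realized.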

Cite

Barteld, F., Schröder, I., & Zinsmeister, H. (2016). Dealing with word-internal modification and spelling variation in data-driven lemmatization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 52–62). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w16-2106
