We consider the statistical lemmatization problem, in which lemmatizers are trained on (word form, lemma) pairs. In particular, we consider this problem for ancient Latin, a language with a high degree of morphological variability. We investigate whether general-purpose string-to-string transduction models are suitable for this task, and find that they typically perform (much) better than more restricted lemmatization techniques and heuristics based on suffix transformations. We also test experimentally whether string transduction systems that perform well on one string-to-string translation task (here, grapheme-to-phoneme conversion, G2P) also perform well on another (here, lemmatization), and vice versa. We find that a joint n-gram model performs better on G2P than our own discriminative model, but that this relationship is reversed for lemmatization. Finally, we investigate how the learned lemmatizers can complement lexicon-based systems, e.g., by tackling the out-of-vocabulary (OOV) problem and/or the disambiguation problem.
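To make the comparison concrete, the following is a minimal sketch of the kind of restricted suffix-transformation baseline the abstract contrasts with general string transducers. It is not the system evaluated in the paper: the function names (`suffix_rule`, `train`, `lemmatize`) and the toy Latin pairs are assumptions chosen purely for illustration.

```python
from collections import Counter, defaultdict


def suffix_rule(form, lemma):
    """Strip the longest common prefix and return (form suffix, lemma suffix)."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]


def train(pairs):
    """Count, for each observed form suffix, which lemma suffixes replace it."""
    rules = defaultdict(Counter)
    for form, lemma in pairs:
        old, new = suffix_rule(form, lemma)
        rules[old][new] += 1
    return rules


def lemmatize(form, rules):
    """Apply the most frequent rule for the longest known suffix of `form`."""
    for k in range(len(form), -1, -1):
        old = form[len(form) - k:]
        if old in rules:
            new = rules[old].most_common(1)[0][0]
            return form[:len(form) - k] + new
    return form  # no rule matches: fall back to the surface form


# Hypothetical toy training data (word form, lemma); not from the paper.
pairs = [
    ("rosam", "rosa"),
    ("rosae", "rosa"),
    ("amicum", "amicus"),
    ("amici", "amicus"),
    ("dominum", "dominus"),
]
rules = train(pairs)

print(lemmatize("servum", rules))   # unseen second-declension form
print(lemmatize("puellam", rules))  # unseen first-declension form
```

On this toy data the suffix `m` is rewritten to `s` more often than it is deleted, so `servum` maps to `servus` while `puellam` is mapped to the incorrect `puellas`. A context-free suffix rule cannot tell the two declensions apart, which is one plausible reason models that condition on more of the string could do better.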
CITATION STYLE
Eger, S. (2015). Designing and comparing G2P-type lemmatizers for a morphology-rich language. In Communications in Computer and Information Science (Vol. 537, pp. 27–40). Springer Verlag. https://doi.org/10.1007/978-3-319-23980-4_2