Poor man's stemming: Unsupervised recognition of same-stem words

7Citations
Citations of this article
17Readers
Mendeley users who have this article in their library.
Get full text

Abstract

We present a new fully unsupervised human-intervention-free algorithm for stemming for an open class of languages. Since it does not rely on existing large data collections or other linguistic resources than raw text it is especially attractive for low-density languages. The stemming problem is formulated as a decision whether two given words are variants of the same stem and requires that, if so, there is a concatenative relation between the two. The underlying theory makes no assumptions on whether the language uses a lot of morphology or not, whether it is prefixing or suffixing, or whether affixes are long or short. It does however make the assumption that 1. salient affixes have to be frequent, 2. words essentially are variable length sequences of random characters, and furthermore 3. that a heuristic on what constitutes a systematic affix alteration is valid. Tested on four typologically distant languages, the stemmer shows very promising results in an evaluation against a human-made gold standard. © Springer-Verlag Berlin Heidelberg 2006.

Cite

CITATION STYLE

APA

Hammarström, H. (2006). Poor man’s stemming: Unsupervised recognition of same-stem words. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4182 LNCS, pp. 323–337). Springer Verlag. https://doi.org/10.1007/11880592_25

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free