Poor man's stemming: Unsupervised recognition of same-stem words

Harald Hammarström

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 4182 LNCS, pp. 323-337 (2006)

DOI: 10.1007/11880592_25

Abstract

We present a new, fully unsupervised, human-intervention-free stemming algorithm for an open class of languages. Since it relies on no existing large data collections or linguistic resources other than raw text, it is especially attractive for low-density languages. The stemming problem is formulated as a decision on whether two given words are variants of the same stem, and it requires that, if so, there is a concatenative relation between the two. The underlying theory makes no assumptions about whether the language uses a lot of morphology or not, whether it is prefixing or suffixing, or whether affixes are long or short. It does, however, assume that (1) salient affixes have to be frequent, (2) words are essentially variable-length sequences of random characters, and (3) a heuristic on what constitutes a systematic affix alteration is valid. Tested on four typologically distant languages, the stemmer shows very promising results in an evaluation against a human-made gold standard. © Springer-Verlag Berlin Heidelberg 2006.
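The pairwise formulation in the abstract can be illustrated with a toy sketch. It is not the paper's algorithm: it handles suffixes only (the paper also covers prefixing languages), and the function names `suffix_counts` and `same_stem`, as well as the thresholds `max_len`, `min_count`, and `min_stem`, are hypothetical choices made for this illustration. It only shows the general idea of treating two words as same-stem when they split into a shared stem plus affixes that are frequent in raw text.

```python
from collections import Counter


def suffix_counts(words, max_len=4):
    """Count how many distinct words end in each suffix of up to max_len characters."""
    counts = Counter()
    for word in set(words):
        # k = 0 records the empty suffix, so a bare stem can pair with an inflected form.
        for k in range(0, min(max_len, len(word) - 1) + 1):
            counts[word[len(word) - k:]] += 1
    return counts


def same_stem(w1, w2, counts, min_count=2, min_stem=3):
    """Guess whether w1 and w2 are a shared stem plus two frequent suffixes.

    Illustrative heuristic only; the thresholds are assumptions, not values from the paper.
    """
    # The longest common prefix serves as the candidate stem.
    stem_len = 0
    for a, b in zip(w1, w2):
        if a != b:
            break
        stem_len += 1
    if stem_len < min_stem:
        return False
    # Concatenative relation: each word is the stem followed by one residual suffix.
    s1, s2 = w1[stem_len:], w2[stem_len:]
    return counts[s1] >= min_count and counts[s2] >= min_count


words = ["walk", "walks", "walked", "talk", "talks", "talked", "table"]
counts = suffix_counts(words)
print(same_stem("walks", "walked", counts))  # True: stem "walk" + "s" / "ed"
print(same_stem("talk", "table", counts))    # False: common prefix "ta" is too short
```

A fixed frequency threshold is the crudest possible stand-in for the abstract's first assumption (salient affixes are frequent); the second and third assumptions, about random character sequences and systematic affix alteration, are not modelled here at all.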

Cite (APA)

Hammarström, H. (2006). Poor man’s stemming: Unsupervised recognition of same-stem words. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4182 LNCS, pp. 323–337). Springer Verlag. https://doi.org/10.1007/11880592_25
