Putting successor variety stemming to work

Benno Stein; Martin Potthast

Conference Proceedings

Studies in Classification, Data Analysis, and Knowledge Organization (2007) 367-374

DOI: 10.1007/978-3-540-70981-7_41

11 Citations

15 Readers

Abstract

Stemming algorithms find canonical forms for inflected words, e.g., for declined nouns or conjugated verbs. Since such a unification of words with respect to gender, number, tense, and case is a language-specific issue, stemming algorithms operationalize a set of linguistically motivated rules for the language in question. The best-known rule-based algorithm for the English language is from Porter (1980). The paper presents a statistical stemming approach which is based on the analysis of the distribution of word prefixes in a document collection, and which is thus largely language-independent. In particular, our approach tackles the problem of index construction for multi-lingual documents. Related work on statistical stemming either focuses on stemming quality (such as Bachin et al. (2002) or Bordag (2005)) or investigates runtime performance (Mayfield and McNamee (2003), for example), but none of it provides a reasonable tradeoff between the two. For selected retrieval tasks under vector-based document models we report new results on stemming quality and collection size dependency. Interestingly, successor variety stemming has neither been investigated under similarity concerns for index construction nor been applied as a technology in current retrieval applications. The results show that this disregard is not justified.
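
To make the idea of analyzing word-prefix distributions concrete, the following is a minimal sketch of successor variety stemming in Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy vocabulary, the end-of-word marker `$`, the `min_stem_len` parameter, and the "cut at the peak successor variety" heuristic are all hypothetical choices made for this example. The abstract does not say which cutoff strategy the paper uses.

```python
from collections import defaultdict


def successor_varieties(vocabulary):
    """Map each prefix to the set of distinct characters that follow it."""
    successors = defaultdict(set)
    for word in vocabulary:
        for i in range(1, len(word)):
            successors[word[:i]].add(word[i])
        # Assumption: the end of a word counts as one more successor ("$").
        successors[word].add("$")
    return successors


def stem(word, successors, min_stem_len=3):
    """Cut the word after the prefix with the highest successor variety.

    Assumption: a simple peak heuristic; ties keep the shorter prefix.
    """
    best_cut, best_sv = len(word), 1
    for i in range(min_stem_len, len(word) + 1):
        sv = len(successors.get(word[:i], ()))
        if sv > best_sv:
            best_cut, best_sv = i, sv
    return word[:best_cut]


# Toy collection vocabulary (hypothetical data for illustration only).
vocab = ["read", "reads", "reading", "reader", "readable", "real", "red"]
succ = successor_varieties(vocab)

for w in ["reads", "reading", "reader", "readable"]:
    print(w, "->", stem(w, succ))  # each of these maps to "read"
```

In this toy vocabulary the prefix "read" is followed by five distinct successors (s, i, e, a, and the end marker), more than any other prefix of the example words, so the cut falls there. No language-specific suffix rules are involved, which is the sense in which such an approach is largely language-independent. The heuristic is sensitive to collection size: with this small vocabulary, "real" is cut to "rea".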

Cite

Stein, B., & Potthast, M. (2007). Putting successor variety stemming to work. In Studies in Classification, Data Analysis, and Knowledge Organization (pp. 367–374). Kluwer Academic Publishers. https://doi.org/10.1007/978-3-540-70981-7_41
