Automatic semantic subject indexing of web documents in highly inflected languages

Reetta Sinkkilä; Osma Suominen; Eero Hyvönen

Conference Proceedings, Open Access

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2011) 6643 LNCS(PART 1) 215-229

DOI: 10.1007/978-3-642-21034-1_15

8 Citations

16 Readers

Abstract

Structured semantic metadata about unstructured web documents can be created using automatic subject indexing methods, avoiding laborious manual indexing. A successful automatic subject indexing tool for the web should work with texts in multiple languages and be independent of the domain of discourse of the documents and controlled vocabularies. However, analyzing text written in a highly inflected language requires word form normalization that goes beyond rule-based stemming algorithms. We have tested the state-of-the-art automatic indexing tool Maui on Finnish texts using three stemming and lemmatization algorithms and tested it with documents and vocabularies of different domains. Both of the lemmatization algorithms we tested performed significantly better than a rule-based stemmer, and the subject indexing quality was found to be comparable to that of human indexers. © 2011 Springer-Verlag Berlin Heidelberg.

Citation (APA)

Sinkkilä, R., Suominen, O., & Hyvönen, E. (2011). Automatic semantic subject indexing of web documents in highly inflected languages. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6643 LNCS, pp. 215–229). https://doi.org/10.1007/978-3-642-21034-1_15
