Using String Kernels for Classification of Slovenian Web Documents

  • Fortuna B
  • Mladenič D

Abstract

In this paper we present an approach for classifying web pages obtained from the Slovenian Internet directory, where web sites covering different topics are organized into a topic ontology. We tested two different methods for representing text documents, both in combination with the linear SVM classification algorithm. The first representation is a standard bag-of-words approach with TFIDF weights and cosine similarity as the similarity measure. We compared this to String kernels, where text documents are compared not by words but by substrings. This removes the need for stemming or lemmatisation, which can be an important issue when documents are in languages other than English and tools for stemming or lemmatisation are unavailable or expensive to build or learn. In highly inflected natural languages, such as Slovene, the same word can have many different forms, so String kernels have an advantage over bag-of-words. We show that, for classification of documents written in a highly inflected natural language, String kernels significantly outperform the standard bag-of-words representation. Our experiments also show that the advantage of String kernels is more evident for domains with unbalanced class distribution.
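
The difference between the two representations can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple p-spectrum kernel (contiguous substrings of length p, here p = 3) as a stand-in for the String kernel, uses raw counts rather than TFIDF weights, and the two Slovene phrases and all function names are made up for the illustration.

```python
from collections import Counter
from math import sqrt


def bow_features(text):
    """Bag-of-words: count whole whitespace-separated tokens."""
    return Counter(text.lower().split())


def spectrum_features(text, p=3):
    """p-spectrum (assumed stand-in for a String kernel):
    count every contiguous substring of length p."""
    s = text.lower()
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Two hypothetical phrases that differ only in inflection.
doc_a = "nova knjiga"
doc_b = "novo knjigo"

bow_sim = cosine(bow_features(doc_a), bow_features(doc_b))
str_sim = cosine(spectrum_features(doc_a), spectrum_features(doc_b))

print("bag-of-words similarity:", bow_sim)
print("3-spectrum similarity:  ", round(str_sim, 3))
```

Without stemming or lemmatisation the two phrases share no whole word, so the bag-of-words similarity is zero, while the substring representation still registers the shared stems. Either similarity could in principle be passed to an SVM as a precomputed kernel, although the paper's actual kernel and parameters may differ from this sketch.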


Fortuna, B., & Mladenič, D. (2006). Using String Kernels for Classification of Slovenian Web Documents. In From Data and Information Analysis to Knowledge Engineering (pp. 358–365). Springer-Verlag. https://doi.org/10.1007/3-540-31314-1_43
