Topic classification problem solving for morphologically complex languages

Jurgita Kapočiūtė-Dzikienė; Tomas Krilavičius

Conference Proceedings

Communications in Computer and Information Science (2016) 639 511-524

DOI: 10.1007/978-3-319-46254-7_41

Abstract

This paper addresses topic classification for two morphologically complex languages, Lithuanian and Russian, using popular supervised machine learning techniques. We experimentally investigated two text classification methods and a wide variety of feature types covering different levels of abstraction: character, lexical, and morpho-syntactic. To obtain comparable results for both languages, we kept the experimental conditions as similar as possible: the datasets were composed of normative texts taken from news portals, contained similar topics, and had the same number of texts in each topic. The best results (an accuracy of ~0.86) were achieved with the Support Vector Machine method using token lemmas as the feature representation. The character feature type, which captures relevant patterns of complex inflectional morphology without any external morphological tools, was the second best. Since these findings hold for both Lithuanian and Russian, we assume they should also hold for the entire group of Baltic and Slavic languages.
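To illustrate why character-level features can capture inflectional patterns without a morphological analyzer, the following is a minimal sketch and not the authors' implementation. The n-gram length of 3, the `_` boundary padding, and the helper names `char_ngrams` and `jaccard` are assumptions made for this example; the abstract does not state the settings used in the paper.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams inside each whitespace-separated token."""
    counts = Counter()
    for token in text.lower().split():
        # Pad with "_" so word-initial and word-final n-grams stay distinct
        # (an assumption of this sketch, not a detail taken from the paper).
        padded = f"_{token}_"
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def jaccard(a: Counter, b: Counter) -> float:
    """Fraction of distinct n-grams that two feature sets have in common."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


# Two inflected forms of the Lithuanian noun "miestas" (city):
# nominative singular and genitive singular.
nominative = char_ngrams("miestas")
genitive = char_ngrams("miesto")

# The n-grams covering the shared stem survive the change of ending.
shared = sorted(set(nominative) & set(genitive))
print(shared)
print(round(jaccard(nominative, genitive), 3))
```

The two forms differ only in their endings, so the n-grams drawn from the stem are identical. A classifier trained on such counts can therefore relate different inflections of a word without lemmatization, which is consistent with the character feature type ranking second after lemmas.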

Cite

Kapočiūtė-Dzikienė, J., & Krilavičius, T. (2016). Topic classification problem solving for morphologically complex languages. In Communications in Computer and Information Science (Vol. 639, pp. 511–524). Springer Verlag. https://doi.org/10.1007/978-3-319-46254-7_41