Bag-of-words, bag-of-topics and word-to-vec based subject classification of text documents in Polish - A comparative study

Abstract

This paper deals with the problem of classifying Polish-language documents by subject category. We compare four state-of-the-art approaches to this task, which differ primarily in the way the documents are represented by feature vectors. Two of the methods use a frequency-of-words or frequency-of-topics representation of the documents and rely on Natural Language Processing (NLP) technology to pre-process the raw text. The two alternative methods do not involve NLP technology: they construct feature vectors using a vector representation of words (the Word2Vec method) or using a frequency of topics derived from the raw text. The four approaches are evaluated on 3 corpora with 5, 34 and 25 subject categories respectively and with different levels of class discrimination. The results suggest that no single method outperforms the others in all tests; however, tests with a large number of training observations seem to favour the NLP-free Word2Vec method.
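
To make the difference between the representations concrete, the following Python sketch builds two feature vectors for one tokenized document: a frequency-of-words vector and a document vector obtained from word vectors. It is only an illustration under stated assumptions, not the authors' pipeline. The function names, the three-word vocabulary and the two-dimensional embeddings are hypothetical, and averaging the word vectors is an assumed way of combining them that the abstract does not specify.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one document."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[w] / total for w in vocabulary]


def mean_word_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding.

    Averaging is an assumption made for this sketch; tokens without
    an embedding are skipped.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Toy document and vocabulary (hypothetical data).
tokens = "kot pies kot dom".split()
vocabulary = ["kot", "pies", "dom"]

# Made-up 2-dimensional embeddings; a trained Word2Vec model would
# supply real vectors with far more dimensions.
toy_embeddings = {"kot": [1.0, 0.0], "pies": [0.0, 1.0]}

bow = bag_of_words(tokens, vocabulary)
doc_vec = mean_word_vector(tokens, toy_embeddings, dim=2)
```

Either vector could then be passed to any standard classifier. The length of the first grows with the vocabulary, while the length of the second is fixed by the embedding dimension.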

Walkowiak, T., Datko, S., & Maciejewski, H. (2019). Bag-of-words, bag-of-topics and word-to-vec based subject classification of text documents in Polish - A comparative study. In Advances in Intelligent Systems and Computing (Vol. 761, pp. 526–535). Springer Verlag. https://doi.org/10.1007/978-3-319-91446-6_49
