This paper addresses the classification of Polish-language documents by subject category. We compare four state-of-the-art approaches to this task, which differ primarily in how the documents are represented as feature vectors. Two of the methods use a frequency-of-words or frequency-of-topics representation of the documents and rely on Natural Language Processing (NLP) tools to pre-process the raw text. The other two methods do not involve NLP tools: they construct feature vectors from vector representations of words (the Word2Vec method) or from frequencies of topics derived from the raw text. The four approaches are evaluated on three corpora with 5, 34 and 25 subject categories, respectively, and with different levels of class discrimination. The results suggest that no single method outperforms the others in all tests; however, tests with a large number of training observations seem to favour the NLP-free Word2Vec method.
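To make the difference between the document representations concrete, the following minimal sketch contrasts a frequency-of-words vector with a vector built from word embeddings. It is an illustration only, not the authors' implementation: the function names, the toy vocabulary, the two-dimensional toy embeddings, and the choice to average word vectors into one document vector are all assumptions made here.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Relative frequency of each vocabulary word in the document."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty documents
    return [counts[word] / total for word in vocabulary]


def mean_word_vector(tokens, embeddings, dim):
    """Average the embedding vectors of all tokens that have an embedding.

    Averaging is an assumption of this sketch; it is one common way to turn
    per-word vectors into a single document vector.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical toy data: a tokenized sentence, a tiny vocabulary, and
# made-up 2-dimensional embeddings (real Word2Vec vectors are learned).
tokens = "kot pije mleko a pies pije wodę".split()
vocabulary = ["kot", "pies", "pije"]
toy_embeddings = {"kot": [1.0, 0.0], "pies": [0.0, 1.0]}

bow = bag_of_words(tokens, vocabulary)
avg = mean_word_vector(tokens, toy_embeddings, dim=2)
```

Either vector could then be passed to any standard classifier. In the NLP-based setting the tokens would first be pre-processed (for example lemmatized), whereas the NLP-free setting works directly on the raw text.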
CITATION STYLE
Walkowiak, T., Datko, S., & Maciejewski, H. (2019). Bag-of-words, bag-of-topics and word-to-vec based subject classification of text documents in Polish - A comparative study. In Advances in Intelligent Systems and Computing (Vol. 761, pp. 526–535). Springer Verlag. https://doi.org/10.1007/978-3-319-91446-6_49