Abstract
This paper presents an implementation of a system for subject classification of text documents based on the Apache Spark distributed computing framework. Classification of text documents starts with the generation of high-dimensional feature vectors from the documents, a task carried out with natural language processing methods and tools. The next steps involve reducing the dimensionality of the feature vectors and training classifiers. In the paper we show how these consecutive steps can be realized on the Apache Spark platform, which is dedicated to distributed processing of big data. We illustrate the proposed method with a sample classifier that predicts the subject category of a document in the English-language Wikipedia.
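The three steps named in the abstract (feature vector generation, dimensionality reduction, classifier training) can be sketched on a single machine as follows. This is a minimal illustration in plain Python, not the authors' Spark implementation: the toy corpus, the stop-word list, the document-frequency score used to select terms, and the choice of a multinomial naive Bayes classifier are all assumptions made for the example, since the abstract does not name the specific methods. On Spark, each step would instead be expressed as a distributed transformation over the document collection.

```python
import math
import re
from collections import Counter, defaultdict

# Toy labelled corpus, invented for illustration (not data from the paper).
DOCS = [
    ("The striker scored a goal in the football match", "sport"),
    ("The team won the league after a tense match", "sport"),
    ("The goalkeeper saved a penalty in the final", "sport"),
    ("The processor executes instructions from memory", "computing"),
    ("The compiler translates source code for the processor", "computing"),
    ("Distributed software stores data in cluster memory", "computing"),
]

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "the", "in", "for", "from", "after"}


def tokenize(text):
    """Lower-case the text, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def select_vocabulary(docs, k):
    """Dimensionality reduction (assumed method): keep the k terms whose
    document frequency differs most between classes."""
    labels = sorted({label for _, label in docs})
    doc_freq = defaultdict(Counter)  # term -> label -> number of documents
    for text, label in docs:
        for term in set(tokenize(text)):
            doc_freq[term][label] += 1

    def score(term):
        counts = [doc_freq[term][label] for label in labels]
        return max(counts) - min(counts)

    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(doc_freq, key=lambda t: (-score(t), t))
    return {term: i for i, term in enumerate(ranked[:k])}


def vectorize(text, vocab):
    """Feature generation: term counts restricted to the selected vocabulary."""
    vec = [0] * len(vocab)
    for term in tokenize(text):
        if term in vocab:
            vec[vocab[term]] += 1
    return vec


def train_naive_bayes(vectors, labels):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    n_features = len(vectors[0])
    class_docs = Counter(labels)
    term_counts = {c: [0] * n_features for c in class_docs}
    for vec, label in zip(vectors, labels):
        for i, v in enumerate(vec):
            term_counts[label][i] += v
    model = {}
    for c, counts in term_counts.items():
        total = sum(counts) + n_features
        log_prior = math.log(class_docs[c] / len(labels))
        log_likelihood = [math.log((n + 1) / total) for n in counts]
        model[c] = (log_prior, log_likelihood)
    return model


def predict(model, vec):
    """Return the class with the highest log-posterior for the vector."""
    def log_posterior(c):
        log_prior, log_likelihood = model[c]
        return log_prior + sum(v * w for v, w in zip(vec, log_likelihood))
    return max(model, key=log_posterior)


vocab = select_vocabulary(DOCS, k=8)
model = train_naive_bayes(
    [vectorize(text, vocab) for text, _ in DOCS],
    [label for _, label in DOCS],
)
print(predict(model, vectorize("The match ended with a late goal", vocab)))
```

The sketch selects the vocabulary before building vectors, so the vectors are already low-dimensional; a pipeline that first builds the full high-dimensional vectors and then reduces them, as the abstract describes, yields the same result for this kind of term selection.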
Semberecki, P., & Maciejewski, H. (2016). Distributed classification of text documents on Apache Spark platform. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9692, pp. 621–630). Springer Verlag. https://doi.org/10.1007/978-3-319-39378-0_53