Distributed classification of text documents on Apache Spark platform

Piotr Semberecki; Henryk Maciejewski

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2016) 9692 621-630

DOI: 10.1007/978-3-319-39378-0_53

22 Citations

10 Readers

Abstract

This paper presents an implementation of a system for subject classification of text documents based on the Apache Spark distributed computing framework. Classification of text documents starts with the generation of high-dimensional feature vectors from the documents, a task carried out with natural language processing methods and tools. The next steps are reducing the dimensionality of the feature vectors and training classifiers. We show how these consecutive steps can be realized on the Apache Spark platform, which is dedicated to distributed processing of big data. We illustrate the proposed method with a sample classifier that predicts the subject category of a document in the English-language Wikipedia.
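The three steps named in the abstract (feature generation, dimensionality reduction, classifier training) can be illustrated with a minimal single-machine sketch in plain Python. This is not the authors' implementation and does not use Spark: the abstract does not state which feature extraction, reduction, or classification methods the paper applies, so feature hashing, document-frequency selection, and multinomial Naive Bayes are stand-ins chosen for brevity. All function names, the bucket count, and the toy documents are hypothetical.

```python
import math
import zlib
from collections import Counter, defaultdict

NUM_BUCKETS = 1 << 18  # assumed size of the hashed feature space


def featurize(text, num_buckets=NUM_BUCKETS):
    """Step 1: map a document to a sparse term-frequency vector via feature hashing."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    # crc32 is used instead of hash() so bucket ids are stable across runs
    return Counter(zlib.crc32(t.encode("utf-8")) % num_buckets for t in tokens)


def select_features(vectors, k):
    """Step 2: reduce dimensionality by keeping the k buckets seen in the most documents."""
    doc_freq = Counter()
    for vec in vectors:
        doc_freq.update(vec.keys())
    # sort by (-count, bucket id) so ties are broken deterministically
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {bucket for bucket, _ in ranked[:k]}


def project(vec, kept):
    """Drop every bucket that was not selected in step 2."""
    return {b: c for b, c in vec.items() if b in kept}


def train_naive_bayes(vectors, labels, kept, alpha=1.0):
    """Step 3: fit a multinomial Naive Bayes model with additive smoothing."""
    class_docs = Counter(labels)
    bucket_counts = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        bucket_counts[label].update(project(vec, kept))
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(bucket_counts[label].values()) + alpha * len(kept)
        log_prior = math.log(n_docs / len(labels))
        log_lik = {b: math.log((bucket_counts[label][b] + alpha) / total) for b in kept}
        model[label] = (log_prior, log_lik)
    return model


def predict(model, vec, kept):
    """Return the label with the highest log-posterior score."""
    reduced = project(vec, kept)
    scores = {
        label: log_prior + sum(c * log_lik[b] for b, c in reduced.items())
        for label, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)


# Toy corpus (hypothetical; the paper uses English-language Wikipedia articles)
docs = [
    ("the striker scored a goal in the football match", "sport"),
    ("the team won the football league title", "sport"),
    ("the orchestra played a symphony at the concert", "music"),
    ("the band released a new album and concert tour", "music"),
]
vectors = [featurize(text) for text, _ in docs]
labels = [label for _, label in docs]
kept = select_features(vectors, k=64)
model = train_naive_bayes(vectors, labels, kept)
prediction = predict(model, featurize("a football goal"), kept)
```

In a distributed setting, the per-document featurization would typically run as a map over a partitioned dataset and the counting steps as aggregations. How the paper maps these steps onto Spark is described in the full text, not in the abstract.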

Cite (APA)

Semberecki, P., & Maciejewski, H. (2016). Distributed classification of text documents on Apache Spark platform. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9692, pp. 621–630). Springer Verlag. https://doi.org/10.1007/978-3-319-39378-0_53
