Distributed classification of text documents on Apache Spark platform

Abstract

This paper presents an implementation of a system for subject classification of text documents based on the Apache Spark distributed computing framework. Classification of text documents starts with the generation of high-dimensional feature vectors from the documents, a task carried out with methods and tools for natural language processing. The next steps involve reducing the dimensionality of the feature vectors and training classifiers. In the paper we show how these consecutive steps can be realized on the Apache Spark platform, which is dedicated to distributed processing of big data. We illustrate the proposed method with a sample classifier that predicts the subject category of a document in the English-language Wikipedia.
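
The three steps named in the abstract (feature vectors, dimensionality reduction, classifier training) can be illustrated with a minimal single-machine sketch in plain Python. This is not the authors' implementation and it does not use Spark: the abstract does not name the specific methods, so feature hashing stands in here for dimensionality reduction and multinomial naive Bayes for the classifier. The bucket count, the function names, and the four toy documents are all hypothetical and chosen only for illustration.

```python
import math
import re
import zlib
from collections import Counter, defaultdict

NUM_BUCKETS = 1024  # assumed size of the reduced feature space (hypothetical)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def hashed_features(text, num_buckets=NUM_BUCKETS):
    """Steps 1 and 2: bag-of-words counts folded into a fixed number of buckets.

    crc32 is used instead of the built-in hash() so results are
    reproducible across Python processes.
    """
    vec = Counter()
    for tok in tokenize(text):
        vec[zlib.crc32(tok.encode("utf-8")) % num_buckets] += 1
    return vec


def train_naive_bayes(docs, labels, num_buckets=NUM_BUCKETS, alpha=1.0):
    """Step 3: fit multinomial naive Bayes with additive (Laplace) smoothing."""
    class_docs = Counter(labels)
    bucket_counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        bucket_counts[label].update(hashed_features(text, num_buckets))

    model = {}
    for label, n_docs in class_docs.items():
        total = sum(bucket_counts[label].values()) + alpha * num_buckets
        log_prior = math.log(n_docs / len(docs))
        log_lik = [
            math.log((bucket_counts[label][b] + alpha) / total)
            for b in range(num_buckets)
        ]
        model[label] = (log_prior, log_lik)
    return model


def predict(model, text, num_buckets=NUM_BUCKETS):
    """Return the label with the highest posterior log-score."""
    vec = hashed_features(text, num_buckets)

    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(count * log_lik[b] for b, count in vec.items())

    return max(model, key=score)


# Toy corpus, invented for this sketch (not data from the paper).
docs = [
    "the striker scored a goal in the football match",
    "the team won the league match",
    "the telescope observed a distant galaxy",
    "astronomers measured the orbit of the planet",
]
labels = ["sport", "sport", "science", "science"]

model = train_naive_bayes(docs, labels)
print(predict(model, "goal match football"))
```

In a distributed setting each of these stages would operate on partitions of the corpus rather than on an in-memory list; the sketch only shows the data flow from raw text to a predicted category.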

Citation

Semberecki, P., & Maciejewski, H. (2016). Distributed classification of text documents on Apache Spark platform. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9692, pp. 621–630). Springer Verlag. https://doi.org/10.1007/978-3-319-39378-0_53
