Comparative study of Apache Spark MLlib clustering algorithms

Sasan Harifi; Ebrahim Byagowi; Madjid Khalilian

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2017) 10387 LNCS 61-73

DOI: 10.1007/978-3-319-61845-6_7

Abstract

Clustering of big data has received much attention recently. Analytics algorithms on big datasets require tremendous computational capabilities. Apache Spark is a popular open-source platform for large-scale data processing that is well suited to iterative machine learning tasks. This paper presents an overview of the algorithms in the Apache Spark Machine Learning Library (Spark.MLlib). Four clustering methods are described in full: Gaussian Mixture Model (GMM), Power-Iteration Clustering, Latent Dirichlet Allocation (LDA), and k-means. Three benchmark datasets are used for the experiments: Forest Cover Type, KDD Cup 99, and Internet Advertisements. Algorithms that can be compared with one another are compared. For a better understanding of the experimental results, the algorithms are described with suitable tables and graphs.

Cite

Harifi, S., Byagowi, E., & Khalilian, M. (2017). Comparative study of Apache Spark MLlib clustering algorithms. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10387 LNCS, pp. 61–73). Springer Verlag. https://doi.org/10.1007/978-3-319-61845-6_7
