Comparative study of Apache Spark MLlib clustering algorithms

Abstract

Clustering of big data has received much attention recently. Analytics algorithms on big datasets require tremendous computational capabilities. Apache Spark is a popular open-source platform for large-scale data processing that is well suited to iterative machine learning tasks. This paper presents an overview of the clustering algorithms in the Apache Spark Machine Learning Library (Spark.MLlib). The clustering methods, namely Gaussian Mixture Model (GMM), Power Iteration Clustering, Latent Dirichlet Allocation (LDA), and k-means, are described in full. Three benchmark datasets, Forest Cover Type, KDD Cup 99, and Internet Advertisements, are used for the experiments. Algorithms that are comparable are compared with each other. For a better understanding of the experimental results, the algorithms are described with suitable tables and graphs.
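To make concrete why iterative algorithms such as k-means suit a platform built around repeated passes over the same data, the following is a minimal pure-Python sketch of Lloyd's k-means iteration. It is not Spark's implementation and does not come from the paper; the function name `kmeans`, its parameters, and the toy data are hypothetical and chosen only for illustration.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length numeric tuples.

    Illustrative only: a single-machine sketch, not Spark MLlib's algorithm.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # no center moved, so the loop has converged
            break
        centers = new_centers
    return centers


# Toy data (hypothetical): two well-separated groups of 2-D points.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (9.0, 9.0), (9.1, 9.2), (9.2, 9.1)]
centers = sorted(kmeans(data, k=2))
print(centers)
```

Every iteration rescans the full dataset, which is the access pattern that benefits from keeping data in memory between passes.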

Harifi, S., Byagowi, E., & Khalilian, M. (2017). Comparative study of Apache Spark MLlib clustering algorithms. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10387 LNCS, pp. 61–73). Springer Verlag. https://doi.org/10.1007/978-3-319-61845-6_7