Data clustering is indispensable in today’s era of data deluge. k-Means is a popular partition-based clustering technique. However, as data grow in size and complexity, it is no longer suitable, and there is an urgent need to shift towards parallel algorithms. We present a MapReduce-based k-Means clustering algorithm that is scalable and fault tolerant. Its major advantage is that it determines the number of clusters dynamically, unlike k-Means, where the number of clusters has to be specified in advance. MapReduce jobs are sensitive to iteration, because repeated reads from and writes to the file system increase both cost and computation time. The proposed algorithm is not iterative: it reads the data from the file system once and writes the output back once. We show that the proposed algorithm performs better than an Improved MapReduce-based k-Means clustering algorithm.
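To make the iteration cost concrete, the following is a minimal in-memory sketch of the conventional iterative MapReduce formulation of k-Means, which is the kind of baseline the abstract contrasts against. It is not the proposed single-pass algorithm, whose details are not given here. The function names (`map_phase`, `reduce_phase`, `iterative_kmeans`) and the `max_iters` parameter are hypothetical, and each loop pass merely stands in for one MapReduce job that would rescan the full data set from the file system.

```python
from collections import defaultdict
from math import dist


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every input point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: average the points grouped under each centroid index."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    # A centroid that received no points is left unchanged.
    new_centroids = list(centroids)
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids


def iterative_kmeans(points, centroids, max_iters=20):
    """Run map/reduce rounds until the centroids stop changing.

    Illustrative assumption: every round corresponds to one MapReduce job,
    i.e. one full read of the input and one write of the new centroids.
    """
    passes = 0
    for _ in range(max_iters):
        passes += 1
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, passes


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 0), (10, 2)]
    final, passes = iterative_kmeans(data, [(0, 0), (10, 0)])
    print(final, passes)
```

Even on this tiny input the loop needs more than one pass over the data, and note that the number of clusters is fixed by the length of the initial centroid list. Both points are what the abstract says the proposed algorithm avoids.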
Sinha, A., & Jana, P. K. (2017). A novel mapreduce based k-means clustering. In Advances in Intelligent Systems and Computing (Vol. 458, pp. 247–255). Springer Verlag. https://doi.org/10.1007/978-981-10-2035-3_26