Clustering partitions data into groups such that the items in a group are more similar to each other than to the items in any other group. Most clustering algorithms are designed under the assumption of unbounded main memory, but this assumption fails as the size of the data set grows. The data must then be stored in secondary memory, and the time spent on input/output (I/O) dominates the actual computation time. Reducing the number of I/Os therefore improves the efficiency of clustering techniques. In this paper, we redesign a shared near neighbor (SNN) based clustering algorithm to minimize its I/O complexity, making it suitable for Big Data in the external memory model proposed by Aggarwal and Vitter. Because the computational steps are unchanged, cluster quality remains the same. We implement the algorithm using the STXXL library to show its efficacy on Big Data sets.
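To illustrate the shared near neighbor idea the abstract refers to, the following is a minimal in-memory sketch, not the paper's external-memory algorithm and not its STXXL implementation. It assumes the common definition in which the similarity of two points is the number of neighbors their k-nearest-neighbor lists have in common. The function names `knn_lists` and `snn_similarity`, the brute-force neighbor search, and the toy data are all hypothetical choices made for illustration.

```python
import math

def knn_lists(points, k):
    """Return, for each point, the set of indices of its k nearest neighbours.

    Brute-force O(n^2 log n) search; assumes the whole data set fits in memory.
    """
    neighbours = []
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to point i.
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(set(order[:k]))
    return neighbours

def snn_similarity(neighbours, i, j):
    """Shared near neighbour similarity: size of the overlap of two k-NN lists."""
    return len(neighbours[i] & neighbours[j])

# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
nbrs = knn_lists(points, k=2)

same_group = snn_similarity(nbrs, 0, 1)   # points in the same group share a neighbour
cross_group = snn_similarity(nbrs, 0, 3)  # points in different groups share none
```

Some SNN variants additionally require that the two points appear in each other's neighbor lists before counting shared neighbors; that check is omitted here for brevity. In an external-memory setting, the neighbor lists would live on disk and the cost of interest would be the number of block transfers rather than the number of comparisons.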
Pandey, S., Samal, M., & Mohanty, S. K. (2020). An SNN-DBSCAN Based Clustering Algorithm for Big Data. In Advances in Intelligent Systems and Computing (Vol. 1082, pp. 127–137). Springer. https://doi.org/10.1007/978-981-15-1081-6_11