Parallel Implementation of Statistical DBSCAN Algorithm for Spark-based Clustering on Google Cloud Platform

1Citations
Citations of this article
8Readers
Mendeley users who have this article in their library.

Abstract

We present a new parallel density-based spatial clustering of applications with noise (DBSCAN) algorithm for spark on the google cloud platform (GCP). Statistical analysis is applied to determine DBSCAN's optimal parameters to enhance clustering performance. for scalability cost-based, R-tree partitioning is selected based on the distribution of the dataset into balanced workloads. Parallel DBSCAN consists of three parts: local DBSCAN, partitioning, and merging. Optimizing the partitioning of parallel DBSCAN is important to save time and space compared to serial DBSCAN. This approach can improve the performance and time cost of large datasets. the modified statistical cost-based (SCbs-DBSCAN) is applied to the UCI (university of california irvine) standard datasets, basic benchmark clustering and large different scales of data. For clustering performance and time cost, the experimental results show that the proposed algorithm achieve 10~15% more efficiently, and can run about 1.5x~3x faster than alternative Parallel DBSCAN method on Spark without sacrificing clustering quality

Cite

CITATION STYLE

APA

Awaad, A. M., & Hefny, H. (2023). Parallel Implementation of Statistical DBSCAN Algorithm for Spark-based Clustering on Google Cloud Platform. International Journal of Intelligent Engineering and Systems, 16(2), 279–290. https://doi.org/10.22266/ijies2023.0430.23

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free