Parallel Implementation of Statistical DBSCAN Algorithm for Spark-based Clustering on Google Cloud Platform

Ahmad M. Awaad; Hesham Hefny

Journal Article, Open Access

International Journal of Intelligent Engineering and Systems (2023) 16(2) 279-290

DOI: 10.22266/ijies2023.0430.23

Abstract

We present a new parallel density-based spatial clustering of applications with noise (DBSCAN) algorithm for Spark on the Google Cloud Platform (GCP). Statistical analysis is applied to determine DBSCAN's optimal parameters and thereby enhance clustering performance. For scalability, cost-based R-tree partitioning is selected, which uses the distribution of the dataset to split it into balanced workloads. Parallel DBSCAN consists of three parts: partitioning, local DBSCAN, and merging. Optimizing the partitioning step of parallel DBSCAN is important for saving time and space compared to serial DBSCAN, and this approach can improve performance and time cost on large datasets. The modified statistical cost-based DBSCAN (SCbs-DBSCAN) is applied to the UCI (University of California, Irvine) standard datasets, basic benchmark clustering datasets, and large datasets of different scales. In terms of clustering performance and time cost, the experimental results show that the proposed algorithm is 10-15% more efficient and runs about 1.5x to 3x faster than an alternative parallel DBSCAN method on Spark, without sacrificing clustering quality.
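
The three-part structure named in the abstract (partitioning, local DBSCAN, merging) can be illustrated with a minimal single-process Python sketch. It is not the authors' SCbs-DBSCAN: it does not use Spark, it swaps the cost-based R-tree partitioning for plain equal-width strips along x, and it takes eps and min_pts as given instead of deriving them statistically. All function names (local_dbscan, partition_strips, parallel_dbscan) and the 2*eps halo rule are assumptions made for illustration only.

```python
from collections import defaultdict
from math import dist


def local_dbscan(points, eps, min_pts):
    """Plain DBSCAN on one partition; points maps global id -> (x, y)."""
    ids = list(points)
    nbrs = {i: [j for j in ids if dist(points[i], points[j]) <= eps] for i in ids}
    core = {i for i in ids if len(nbrs[i]) >= min_pts}  # a point counts itself
    labels, cluster = {}, 0
    for seed in ids:
        if seed not in core or seed in labels:
            continue
        labels[seed] = cluster
        stack = [seed]
        while stack:  # only core points are pushed, so only they expand
            for q in nbrs[stack.pop()]:
                if q not in labels:
                    labels[q] = cluster
                    if q in core:
                        stack.append(q)
        cluster += 1
    return labels, core


def partition_strips(points, n_parts, eps):
    """Assumed stand-in partitioner: equal-width x-strips padded by 2*eps."""
    xs = [p[0] for p in points]
    lo = min(xs)
    width = (max(xs) - lo) / n_parts or 1.0
    # Each point is owned by exactly one strip; the halo copies it to neighbours.
    owner = [min(int((x - lo) / width), n_parts - 1) for x in xs]
    parts = []
    for k in range(n_parts):
        left = lo + k * width - 2 * eps
        right = lo + (k + 1) * width + 2 * eps
        parts.append({i: p for i, p in enumerate(points) if left <= p[0] <= right})
    return parts, owner


def parallel_dbscan(points, eps, min_pts, n_parts):
    """Partition, cluster each part, then merge local clusters; -1 is noise."""
    parts, owner = partition_strips(points, n_parts, eps)
    # Sequential here; a distributed job would run this step once per partition.
    results = [local_dbscan(part, eps, min_pts) for part in parts]

    parent = {}

    def find(a):  # union-find over (partition, local label) keys
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    seen = defaultdict(list)  # point id -> local clusters that contain it
    for k, (labels, _) in enumerate(results):
        for i, lab in labels.items():
            seen[i].append((k, lab))
    for i, keys in seen.items():
        # Merge only through points that are core in their owning partition,
        # where the halo guarantees the full eps-neighbourhood is visible.
        if i in results[owner[i]][1]:
            for key in keys[1:]:
                parent[find(key)] = find(keys[0])

    out, cluster_ids = [-1] * len(points), {}
    for i in sorted(seen):
        own = [key for key in seen[i] if key[0] == owner[i]]
        root = find(own[0] if own else seen[i][0])
        out[i] = cluster_ids.setdefault(root, len(cluster_ids))
    return out
```

Calling parallel_dbscan(points, eps, min_pts, n_parts) on a list of (x, y) tuples returns one label per point. The 2*eps halo is a design choice of this sketch: it makes the core status of every point within eps of a strip exact inside that strip, so two neighbouring core points that fall in different strips still end up in the same merged cluster. Border points shared by two clusters may be assigned to either one, as in serial DBSCAN.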

Cite

Awaad, A. M., & Hefny, H. (2023). Parallel Implementation of Statistical DBSCAN Algorithm for Spark-based Clustering on Google Cloud Platform. International Journal of Intelligent Engineering and Systems, 16(2), 279–290. https://doi.org/10.22266/ijies2023.0430.23
