An ensemble method for estimating the number of clusters in a big data set using multiple random samples

Mohammad Sultan Mahmud; Joshua Zhexue Huang; Rukhsana Ruby; Kaishun Wu

Journal Article, Open Access

Journal of Big Data (2023) 10(1)

DOI: 10.1186/s40537-023-00709-4

Abstract

Clustering a big dataset without knowing the number of clusters presents a big challenge to many existing clustering algorithms. In this paper, we propose a Random Sample Partition-based Centers Ensemble (RSPCE) algorithm to identify the number of clusters in a big dataset. In this algorithm, a set of disjoint random samples is selected from the big dataset, and the I-niceDP algorithm is used to identify the number of clusters and initial centers in each sample. Subsequently, a cluster ball model is proposed to merge two clusters in the random samples that are likely sampled from the same cluster in the big dataset. Finally, based on the ball model, the RSPCE ensemble method is used to ensemble the results of all samples into the final result as a set of initial cluster centers in the big dataset. Intensive experiments were conducted on both synthetic and real datasets to validate the feasibility and effectiveness of the proposed RSPCE algorithm. The experimental results show that the ensemble result from multiple random samples is a reliable approximation of the actual number of clusters, and the RSPCE algorithm is scalable to big data.
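The pipeline described in the abstract (disjoint random samples, per-sample centers, ball-based merging, ensemble) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define the cluster ball model or the I-niceDP algorithm, so the sketch makes two loudly labeled assumptions: a "ball" is taken to be a center plus a radius (the largest distance from the center to its members), and two balls are merged when their centers lie within the larger of the two radii. The per-sample clustering step is replaced by a stand-in that uses known synthetic labels. The function names `random_sample_partition` and `merge_balls` are hypothetical.

```python
import numpy as np


def random_sample_partition(n_rows, n_samples, rng):
    """Split row indices 0..n_rows-1 into disjoint random samples."""
    perm = rng.permutation(n_rows)
    return np.array_split(perm, n_samples)


def merge_balls(centers, radii):
    """Greedily merge balls that likely come from the same cluster.

    ASSUMPTION (not from the paper): a ball is (center, radius), and a
    ball joins an existing group when its center lies within the larger
    of the two radii from the group's running mean center.
    """
    merged = []  # each entry: [sum of centers, count, radius]
    for c, r in zip(centers, radii):
        c = np.asarray(c, dtype=float)
        for m in merged:
            mean = m[0] / m[1]
            if np.linalg.norm(c - mean) <= max(r, m[2]):
                m[0] = m[0] + c
                m[1] += 1
                m[2] = max(m[2], float(r))
                break
        else:
            merged.append([c.copy(), 1, float(r)])
    return np.array([m[0] / m[1] for m in merged])


# Synthetic data: three well-separated Gaussian clusters (illustrative only).
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
labels = rng.integers(0, 3, size=3000)
X = true_centers[labels] + rng.normal(scale=0.5, size=(3000, 2))

# Step 1: disjoint random samples of the full data set.
samples = random_sample_partition(len(X), n_samples=4, rng=rng)

# Step 2: per-sample centers and radii.
# STAND-IN for I-niceDP: the known synthetic labels are used here, which a
# real run would not have; the paper estimates these from each sample.
centers, radii = [], []
for idx in samples:
    for k in np.unique(labels[idx]):
        pts = X[idx][labels[idx] == k]
        c = pts.mean(axis=0)
        centers.append(c)
        radii.append(np.linalg.norm(pts - c, axis=1).max())

# Step 3: ensemble the per-sample balls into one set of initial centers.
final_centers = merge_balls(centers, radii)
print("estimated number of clusters:", len(final_centers))
```

The merged centers could then seed a clustering run on the full data set, which is how the abstract describes the final result being used. The greedy, order-dependent merge is chosen here only for brevity.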

Cite

Mahmud, M. S., Huang, J. Z., Ruby, R., & Wu, K. (2023). An ensemble method for estimating the number of clusters in a big data set using multiple random samples. Journal of Big Data, 10(1). https://doi.org/10.1186/s40537-023-00709-4
