Hierarchical Density-Based Clustering Using MapReduce

Joelson Antonio dos Santos; Talat Iqbal Syed; Murilo C. Naldi; Ricardo J. G. B. Campello; Joerg Sander

Journal Article (Open Access)

IEEE Transactions on Big Data (2019) 7(1) 102-114

DOI: 10.1109/tbdata.2019.2907624

Abstract

Hierarchical density-based clustering is a powerful tool for exploratory data analysis. However, its applicability to large datasets is limited because of its computational complexity. In the literature, there have been attempts to parallelize algorithms such as Single-Linkage, which in principle can also be extended to the broader scope of hierarchical density-based clustering, but hierarchical clustering algorithms are inherently difficult to parallelize with MapReduce. In this paper, we discuss why adapting previous approaches to parallelize Single-Linkage clustering using MapReduce leads to very inefficient solutions when one wants to compute density-based clustering hierarchies. We first discuss one such solution, which is based on an exact, yet very computationally demanding, random blocks parallelization scheme. To be able to efficiently apply hierarchical density-based clustering to large datasets using MapReduce, we then propose a different parallelization scheme that computes an approximate clustering hierarchy based on a much faster, recursive sampling approach. This approach is based on HDBSCAN*, the state-of-the-art hierarchical density-based clustering algorithm, combined with a data summarization technique called data bubbles. The proposed method is evaluated in terms of both runtime and quality of the approximation on a number of datasets, showing its effectiveness and scalability.

Cite

Santos, J. A. dos, Syed, T. I., Naldi, M. C., Campello, R. J. G. B., & Sander, J. (2019). Hierarchical Density-Based Clustering Using MapReduce. IEEE Transactions on Big Data, 7(1), 102–114. https://doi.org/10.1109/tbdata.2019.2907624
