TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs

Laxman Dhulipala; Jakub Łącki; Jason Lee; Vahab Mirrokni

Journal Article, Open Access

Proceedings of the ACM on Management of Data (2023) 1(3) 1-27

DOI: 10.1145/3617341

Abstract

We introduce TeraHAC, a (1+ε)-approximate hierarchical agglomerative clustering (HAC) algorithm which scales to trillion-edge graphs. Our algorithm is based on a new approach to computing (1+ε)-approximate HAC, which is a novel combination of the nearest-neighbor chain algorithm and the notion of (1+ε)-approximate HAC. Our approach allows us to partition the graph among multiple machines and make significant progress in computing the clustering within each partition before any communication with other partitions is needed.

We evaluate TeraHAC on a number of real-world and synthetic graphs of up to 8 trillion edges. We show that TeraHAC requires over 100x fewer rounds compared to previously known approaches for computing HAC. It is up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering, while achieving 1.16x higher quality. In fact, TeraHAC essentially retains the quality of the celebrated HAC algorithm while significantly improving the running time.
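To make the notion of (1+ε)-approximate HAC concrete, here is a minimal sequential sketch in Python. It is not the TeraHAC algorithm: it uses no partitioning, no distribution across machines, and no nearest-neighbor chains. It assumes the usual relaxation, in which a merge is allowed whenever its similarity is at least the current maximum similarity divided by (1+ε), and it assumes average linkage on a weighted graph with integer node ids. The function name `approx_hac` and the input layout are hypothetical, chosen only for illustration.

```python
def approx_hac(edges, eps=0.1):
    """Sequential (1+eps)-approximate average-linkage HAC on a weighted graph.

    edges: dict mapping (u, v) integer pairs (u != v) to a positive similarity.
    Returns the list of merges as (cluster_a, cluster_b, similarity) tuples.
    Illustrative only; assumes average linkage and integer node ids.
    """
    size = {}  # cluster id -> number of original nodes in the cluster
    adj = {}   # cluster id -> {neighbor id: total edge weight between them}
    for (u, v), w in edges.items():
        for x in (u, v):
            size.setdefault(x, 1)
            adj.setdefault(x, {})
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w

    def sim(a, b):
        # Average linkage: total cross weight over the number of node pairs.
        return adj[a][b] / (size[a] * size[b])

    merges = []
    next_id = max(size, default=-1) + 1
    while True:
        pairs = sorted((a, b) for a in adj for b in adj[a] if a < b)
        if not pairs:
            break  # no edges left between clusters
        w_max = max(sim(a, b) for a, b in pairs)
        threshold = w_max / (1 + eps)
        # Any pair at or above the threshold is a valid approximate merge;
        # we take the first one in sorted order, not necessarily the best.
        a, b = next(p for p in pairs if sim(*p) >= threshold)
        merges.append((a, b, sim(a, b)))

        # Replace clusters a and b by a new cluster c, summing edge weights.
        c = next_id
        next_id += 1
        size[c] = size[a] + size[b]
        adj[c] = {}
        for old in (a, b):
            for n, w in adj.pop(old).items():
                if n in (a, b):
                    continue  # the merged edge itself disappears
                adj[c][n] = adj[c].get(n, 0.0) + w
                adj[n].pop(old)
                adj[n][c] = adj[c][n]
    return merges
```

With `eps=0` the sketch only ever merges a maximum-similarity pair, which is exact HAC. With a larger `eps`, a slightly weaker edge may be merged before the strongest one. That freedom is what lets an approximate algorithm make progress on many merges without first identifying the single best merge in the whole graph.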

Citation (APA)

Dhulipala, L., Łącki, J., Lee, J., & Mirrokni, V. (2023). TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs. Proceedings of the ACM on Management of Data, 1(3), 1–27. https://doi.org/10.1145/3617341
