Approximate K-Means++ in sublinear time

Olivier Bachem; Mario Lucic; S. Hamed Hassani; Andreas Krause

Conference ProceedingsOPEN ACCESS

Approximate K-Means++ in sublinear time

30th AAAI Conference on Artificial Intelligence, AAAI 2016 (2016) 1459-1467

DOI: 10.1609/aaai.v30i1.10259

134Citations

73Readers

Abstract

The quality of K-Means clustering is extremely sensitive to proper initialization. The classic remedy is to apply k-means++ to obtain an initial set of centers that is provably competitive with the optimal solution. Unfortunately, k-means++ requires k full passes over the data which limits its applicability to massive datasets. We address this problem by proposing a simple and efficient seeding algorithm for K-Means clustering. The main idea is to replace the exact D2-sampling step in k-means++ with a substantially faster approximation based on Markov Chain Monte Carlo sampling. We prove that, under natural assumptions on the data, the proposed algorithm retains the full theoretical guarantees of k-means++ while its computational complexity is only sublinear in the number of data points. For such datasets, one can thus obtain a provably good clustering in sublinear time. Extensive experiments confirm that the proposed method is competitive with k-means++ on a variety of real-world, largescale datasets while offering a reduction in runtime of several orders of magnitude.

Cite

CITATION STYLE

APA

Bachem, O., Lucic, M., Hassani, S. H., & Krause, A. (2016). Approximate K-Means++ in sublinear time. In 30th AAAI Conference on Artificial Intelligence, AAAI 2016 (pp. 1459–1467). AAAI press. https://doi.org/10.1609/aaai.v30i1.10259

Approximate K-Means++ in sublinear time

Abstract

Cite

Register to see more suggestions