Approximate K-Means++ in sublinear time


Abstract

The quality of K-Means clustering is extremely sensitive to proper initialization. The classic remedy is to apply k-means++ to obtain an initial set of centers that is provably competitive with the optimal solution. Unfortunately, k-means++ requires k full passes over the data, which limits its applicability to massive datasets. We address this problem by proposing a simple and efficient seeding algorithm for K-Means clustering. The main idea is to replace the exact D²-sampling step in k-means++ with a substantially faster approximation based on Markov Chain Monte Carlo sampling. We prove that, under natural assumptions on the data, the proposed algorithm retains the full theoretical guarantees of k-means++ while its computational complexity is only sublinear in the number of data points. For such datasets, one can thus obtain a provably good clustering in sublinear time. Extensive experiments confirm that the proposed method is competitive with k-means++ on a variety of real-world, large-scale datasets while offering a reduction in runtime of several orders of magnitude.
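To illustrate the idea described above, the following is a minimal sketch (not the authors' exact implementation) of k-means++-style seeding where each exact D²-sampling step is replaced by a short Metropolis-Hastings chain with uniform proposals; the chain length `m` and the helper names are assumptions for illustration. A proposal y is accepted over the current state x with probability min(1, D²(y)/D²(x)), so the chain's stationary distribution is proportional to D², the distribution used by exact k-means++.

```python
import numpy as np

def d2(x, centers):
    # Squared Euclidean distance from point x to its nearest chosen center.
    return min(np.sum((x - c) ** 2) for c in centers)

def mcmc_seeding(X, k, m, seed=None):
    """Approximate k-means++ seeding via MCMC (illustrative sketch).

    X: (n, d) data array; k: number of centers; m: chain length.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = [X[rng.integers(n)]]            # first center: uniform sample
    for _ in range(k - 1):
        x = X[rng.integers(n)]                # chain start: uniform sample
        dx = d2(x, centers)
        for _ in range(m - 1):
            y = X[rng.integers(n)]            # uniform proposal
            dy = d2(y, centers)
            # Accept with probability min(1, dy/dx), so the chain's
            # stationary distribution is proportional to D².
            if dx == 0 or dy / dx > rng.random():
                x, dx = y, dy
        centers.append(x)                     # chain endpoint becomes a center
    return np.array(centers)
```

Note the cost structure this sketch makes visible: each of the k-1 centers requires only O(m·k) distance evaluations rather than a full O(n) pass, so for fixed chain length m the total seeding cost is independent of n, matching the sublinear-time claim.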

Citation (APA)

Bachem, O., Lucic, M., Hassani, S. H., & Krause, A. (2016). Approximate K-Means++ in sublinear time. In 30th AAAI Conference on Artificial Intelligence, AAAI 2016 (pp. 1459–1467). AAAI press. https://doi.org/10.1609/aaai.v30i1.10259
