Scalable Training of Mixture Models via Coresets

  • Dan Feldman
  • Matthew Faulkner
  • Andreas Krause

  • 102 Mendeley users who have this article in their library.
  • 26 citations of this article.


How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of $O(dk^3/\varepsilon^2)$ data points suffices for computing a $(1+\varepsilon)$-approximation for the optimal model on the original $n$ data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones.
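As a rough illustration of the coreset idea described above, the sketch below builds a weighted subset by sensitivity-style importance sampling: points far from a crude set of centers are more "important" and are sampled with higher probability, then reweighted so that weighted sums over the coreset are unbiased estimates of sums over the full data. This is a minimal sketch, not the paper's exact construction; the sensitivity bound and the random-center initialization here are simplified stand-ins.

```python
import numpy as np

def coreset(X, k, m, rng=None):
    """Illustrative sensitivity-style coreset sampler (simplified sketch,
    not the exact algorithm of Feldman, Faulkner, and Krause)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]

    # Crude bicriteria solution: k centers chosen uniformly at random.
    centers = X[rng.choice(n, size=k, replace=False)]

    # Squared distance from each point to its nearest center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)

    # Simple sensitivity-style upper bound: distance term plus a uniform term,
    # so every point has nonzero sampling probability.
    s = d2 / (d2.sum() + 1e-12) + 1.0 / n
    p = s / s.sum()

    # Sample m points with probability proportional to sensitivity, and
    # weight each by 1/(m * p) so weighted sums are unbiased.
    idx = rng.choice(n, size=m, p=p)
    w = 1.0 / (m * p[idx])
    return X[idx], w
```

A weighted mixture-fitting routine (e.g. weighted EM) run on `(X[idx], w)` then stands in for a fit on all $n$ points; the paper's contribution is proving how small $m$ can be while preserving a $(1+\varepsilon)$ guarantee.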

Author-supplied keywords

  • Clustering
  • Coresets
  • Streaming Algorithm
  • Streaming Clustering


Find this document

  • PUI: 364744145
  • SGR: 84889749678
  • SCOPUS: 2-s2.0-84889749678
  • ISBN: 9781618395993

