κ-means for streaming and distributed big sparse data

Abstract

We provide the first streaming algorithm for computing a provable approximation to the κ-means of sparse Big Data. Here, sparse Big Data is a stream of n vectors in ℝ^d, where each vector has O(1) non-zero entries and possibly d ≥ n. Examples include adjacency matrices of graphs and web-link, social-network, document-term, or image-feature matrices. Our streaming algorithm stores at most (log n) · κ^O(1) input points in memory. If the stream is distributed among M machines, the running time is reduced by a factor of M, while a total of M · κ^O(1) (sparse) input points are communicated between the machines. Our main contribution is a deterministic algorithm for computing a sparse (κ,ϵ)-coreset, which is a weighted subset of κ^O(1) input points that approximates the sum of squared distances from the n input points to every set of κ centers, up to a (1 ± ϵ) factor, for any given constant ϵ > 0. This is the first such coreset of size independent of both d and n. Our experimental results show how our algorithm can be used to boost the performance of any given κ-means heuristic, even in the off-line setting. Open access to our implementation is also provided.
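
To make the coreset guarantee concrete, the following is a minimal Python sketch of the quantity a (κ,ϵ)-coreset is meant to preserve: the weighted sum of squared distances from the points to their nearest center. It is not the authors' construction, which the abstract does not describe. The names `SparseVec`, `sq_dist`, `weighted_cost`, and `approximates` are hypothetical, and representing a sparse vector as a dictionary from coordinate index to non-zero value is an assumption made for illustration. Note also that the definition requires the bound to hold for every set of κ centers, whereas this sketch only checks one given set.

```python
from typing import Dict, Sequence

# Hypothetical representation: coordinate index -> non-zero value.
SparseVec = Dict[int, float]


def sq_dist(p: SparseVec, c: SparseVec) -> float:
    """Squared Euclidean distance, touching only coordinates that are non-zero in p or c."""
    keys = set(p) | set(c)
    return sum((p.get(i, 0.0) - c.get(i, 0.0)) ** 2 for i in keys)


def weighted_cost(points: Sequence[SparseVec],
                  weights: Sequence[float],
                  centers: Sequence[SparseVec]) -> float:
    """Weighted sum of squared distances from each point to its nearest center."""
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def approximates(full_cost: float, coreset_cost: float, eps: float) -> bool:
    """Check the (1 ± eps) bound for ONE set of centers (the definition needs all sets)."""
    return (1 - eps) * full_cost <= coreset_cost <= (1 + eps) * full_cost


# Toy input: two identical points and one distinct point, each with one non-zero entry.
points = [{0: 1.0}, {0: 1.0}, {5: 2.0}]
# A trivially exact weighted subset: the duplicate is merged into one point of weight 2.
coreset, coreset_weights = [{0: 1.0}, {5: 2.0}], [2.0, 1.0]

centers = [{0: 0.5}, {5: 1.0}]  # one arbitrary set of 2 centers
full = weighted_cost(points, [1.0] * len(points), centers)
small = weighted_cost(coreset, coreset_weights, centers)
print(full, small, approximates(full, small, eps=0.1))
```

In this toy case the weighted subset reproduces the cost exactly because it only merges duplicates; a coreset in the sense of the abstract would instead be a much smaller weighted subset whose cost stays within the (1 ± ϵ) factor for all center sets.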

Cite

Barger, A., & Feldman, D. (2016). κ-means for streaming and distributed big sparse data. In 16th SIAM International Conference on Data Mining 2016, SDM 2016 (pp. 342–350). Society for Industrial and Applied Mathematics Publications. https://doi.org/10.1137/1.9781611974348.39
