Consistent sampling is a technique for specifying, in small space, a subset S of a potentially large universe U such that the elements in S satisfy a suitably chosen sampling condition. Given a subset I ⊆ U it should be possible to quickly compute I ∩ S, i.e., the elements in I satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size-k subsets occurring in some set in a collection of sets of bounded size b, where k is a small integer. This can be done by applying standard consistent sampling to the k-subsets of each set, but that approach requires time Θ(bk). Using a carefully designed hash function, for a given sampling probability p ∈ (0,1], we show how to improve the time complexity to Θ(b⌈k/2⌉ log log b + pbk) in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is Θ(b⌈k/4⌉). We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent k-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as incidence stream. Further, building upon a recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining. © 2014 Springer International Publishing.
CITATION STYLE
Kutzkov, K., & Pagh, R. (2014). Consistent subset sampling. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8503 LNCS, pp. 294–305). Springer Verlag. https://doi.org/10.1007/978-3-319-08404-6_26
Mendeley helps you to discover research relevant for your work.