Cardinality estimation plays an important role in processing big data. We consider the challenging problem of computing millions or more distinct count aggregations in a single pass while allowing these aggregations to be further combined into coarser aggregations. Such aggregations arise naturally in many applications, including networking, databases, and real-time business reporting. We demonstrate that existing approaches to this problem are inherently flawed, exhibiting bias that can be arbitrarily large, and propose new methods with theoretical guarantees of correctness and tight, practical error estimates. This is achieved by carefully combining CountMin and HyperLogLog sketches and by a theoretical analysis using statistical estimation techniques. These methods also advance cardinality estimation for individual multisets, as they provide a provably consistent estimator and tight confidence intervals with exactly the correct asymptotic coverage.
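As background for the HyperLogLog component mentioned above, the sketch below is a minimal illustrative implementation in Python (not the paper's combined CountMin/HyperLogLog method). It shows the two properties the abstract relies on: single-pass distinct counting, and register-wise merging of sketches so per-dataset aggregations can be combined into coarser ones. The class name, parameters, and hash choice are all assumptions for illustration.

```python
import hashlib
import math

def _hash64(item):
    # Illustrative 64-bit hash: first 8 bytes of SHA-256 of the item's repr.
    return int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")

class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p          # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        h = _hash64(item)
        idx = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64-p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        # Standard bias-corrected harmonic-mean estimate.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        e = alpha * self.m * self.m / z
        # Small-range correction via linear counting.
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:
            e = self.m * math.log(self.m / zeros)
        return e

    def merge(self, other):
        # Sketches over different streams combine by register-wise max,
        # which is what makes coarser aggregations possible.
        for i, r in enumerate(other.registers):
            self.registers[i] = max(self.registers[i], r)
```

With p = 12 the sketch uses 4096 registers, giving roughly 1-2% relative standard error; merging two sketches yields the sketch of the union of their inputs without rescanning the data.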
Ting, D. (2019). Approximate distinct counts for billions of datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 69–86). Association for Computing Machinery. https://doi.org/10.1145/3299869.3319897