Cardinality estimation plays an important role in processing big data. We consider the challenging problem of computing millions or more distinct count aggregations in a single pass while allowing these aggregations to be further combined into coarser aggregations. Such aggregations arise naturally in many applications, including networking, databases, and real-time business reporting. We demonstrate that existing approaches to this problem are inherently flawed, exhibiting bias that can be arbitrarily large, and propose new methods with theoretical guarantees of correctness and tight, practical error estimates. This is achieved by carefully combining CountMin and HyperLogLog sketches and by a theoretical analysis using statistical estimation techniques. These methods also advance cardinality estimation for individual multisets, as they provide a provably consistent estimator and tight confidence intervals with exactly the correct asymptotic coverage.
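As background for the HyperLogLog component mentioned above, the sketch below is a minimal illustrative implementation in Python (not the paper's combined CountMin/HyperLogLog method). It shows the two properties the abstract relies on: single-pass distinct counting, and register-wise merging of sketches so per-dataset aggregations can be combined into coarser ones. The class name, parameters, and hash choice are all assumptions for illustration.

```python
import hashlib
import math

def _hash64(item):
    # Illustrative 64-bit hash: first 8 bytes of SHA-256 of the item's repr.
    return int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")

class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p          # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        h = _hash64(item)
        idx = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64-p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        # Standard bias-corrected harmonic-mean estimate.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        e = alpha * self.m * self.m / z
        # Small-range correction via linear counting.
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:
            e = self.m * math.log(self.m / zeros)
        return e

    def merge(self, other):
        # Sketches over different streams combine by register-wise max,
        # which is what makes coarser aggregations possible.
        for i, r in enumerate(other.registers):
            self.registers[i] = max(self.registers[i], r)
```

With p = 12 the sketch uses 4096 registers, giving roughly 1-2% relative standard error; merging two sketches yields the sketch of the union of their inputs without rescanning the data.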
Ting, D. (2019). Approximate distinct counts for billions of datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 69–86). Association for Computing Machinery. https://doi.org/10.1145/3299869.3319897