Data Sketching for Real Time Analytics: Theory and Practice

Abstract

Speed, cost, and scale: these are three of the biggest challenges in analyzing big data. While modern data systems continue to push the boundaries of scale, the problems of speed and cost are fundamentally tied to the size of the data being scanned or processed. Processing thousands of queries that each access terabytes of data with sub-second latency remains infeasible. Data sketching techniques provide a means to drastically reduce this size, allowing for real-time or interactive data analysis at reduced cost, in exchange for approximate answers. This tutorial covers a number of useful data sketching and sampling methods and demonstrates their use with the Apache DataSketches project. We focus particularly on common problems in analytics, such as counting distinct items, quantiles, histograms, heavy hitters, and aggregations with large group-bys. For these, we cover algorithms, techniques, and theory that can aid both practitioners and theorists in constructing sketches and designing systems that achieve desired error guarantees. For practitioners and implementers, we show how some of these sketches can be easily instantiated using the Apache DataSketches project.
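
To make the trade of size for approximation concrete, the following is a minimal sketch of one classic distinct-counting technique, K-Minimum-Values (KMV), written in plain Python. It is an illustration only, not the Apache DataSketches API and not necessarily the method presented in the tutorial. The class name KMVSketch, its methods, the default k, and the choice of SHA-256 as the hash function are all assumptions made for this example.

```python
import hashlib
import heapq

HASH_SPACE = 2 ** 64  # hashes are treated as uniform integers in [0, 2^64)


class KMVSketch:
    """Hypothetical K-Minimum-Values sketch for approximate distinct counts.

    Keeps only the k smallest hash values seen, so memory is O(k)
    regardless of how many items are processed.
    """

    def __init__(self, k=1024):
        self.k = k
        self._heap = []         # max-heap (negated values) of the k smallest hashes
        self._members = set()   # hashes currently retained, for duplicate checks

    @staticmethod
    def _hash(item):
        # Assumption: SHA-256 truncated to 64 bits behaves like a uniform hash.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def update(self, item):
        h = self._hash(item)
        if h in self._members:
            return  # duplicate of a retained item: nothing changes
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heapreplace(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            # Fewer than k distinct hashes seen: the count is exact.
            return float(len(self._heap))
        kth_smallest = -self._heap[0]
        # If n distinct uniform hashes fall in [0, 1), the k-th smallest
        # is near k / n, which gives the estimator (k - 1) / kth_smallest.
        return (self.k - 1) * HASH_SPACE / kth_smallest


sketch = KMVSketch(k=1024)
for i in range(100_000):
    sketch.update(f"user-{i}")
    sketch.update(f"user-{i}")  # duplicates do not inflate the estimate
print(round(sketch.estimate()))
```

With k retained hashes, the relative standard error of this estimator is roughly 1/sqrt(k), so a larger k buys accuracy at the cost of memory. Production libraries add refinements such as mergeable set operations and compact serialization that this sketch omits.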

Cite

Ting, D., Malkin, J., & Rhodes, L. (2020). Data Sketching for Real Time Analytics: Theory and Practice. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 3567–3568). Association for Computing Machinery. https://doi.org/10.1145/3394486.3406480
