Speed, cost, and scale. These are three of the biggest challenges in analyzing big data. While modern data systems continue to push the boundaries of scale, the problems of speed and cost are fundamentally tied to the size of the data being scanned or processed. Processing thousands of queries that each access terabytes of data with sub-second latency remains infeasible. Data sketching techniques provide a means to drastically reduce this size, allowing real-time or interactive data analysis at reduced cost in exchange for approximate answers. This tutorial covers a number of useful data sketching and sampling methods and demonstrates their use with the Apache DataSketches project. We focus particularly on common problems in analytics, such as counting distinct items, quantiles, histograms, heavy hitters, and aggregations with large group bys. For these, we cover algorithms, techniques, and theory that can aid both practitioners and theorists in constructing sketches and designing systems that achieve desired error guarantees. For practitioners and implementers, we show how some of these sketches can be easily instantiated using the Apache DataSketches project.
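To make the distinct-counting problem mentioned above concrete, here is a minimal sketch of one well-known technique, the K-Minimum-Values (KMV) sketch, written in plain Python. It is an illustration under stated assumptions, not the Apache DataSketches implementation or its API: the class name `KMVSketch`, the parameter `k`, and the choice of SHA-256 as the hash function are all hypothetical choices made for this example. The idea is to hash every item to a pseudo-uniform value in [0, 1), retain only the `k` smallest distinct hash values, and estimate the number of distinct items as `(k - 1) / (k-th smallest hash value)`.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-Minimum-Values sketch for approximate distinct counting.

    Hypothetical example class; not part of Apache DataSketches.
    """

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hash values, so heap[0] is the largest retained
        self._members = set()  # hash values currently retained, for duplicate checks

    @staticmethod
    def _hash(item):
        # Assumption: SHA-256 truncated to 64 bits behaves like a uniform hash.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def update(self, item):
        h = self._hash(item)
        if h in self._members:
            return  # duplicate of a retained value: no change
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New value is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            # Fewer than k distinct hashes seen: the count is exact.
            return float(len(self._heap))
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


if __name__ == "__main__":
    sketch = KMVSketch(k=1024)
    for i in range(300_000):
        sketch.update(i % 100_000)  # 100,000 distinct values, each seen three times
    print("estimated distinct count:", round(sketch.estimate()))
```

The sketch stores at most `k` hash values regardless of how many items are streamed through it, and the relative standard error of the estimate shrinks roughly as `1 / sqrt(k)`, which is the kind of size-versus-accuracy trade-off the tutorial describes.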
Ting, D., Malkin, J., & Rhodes, L. (2020). Data Sketching for Real Time Analytics: Theory and Practice. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 3567–3568). Association for Computing Machinery. https://doi.org/10.1145/3394486.3406480