Reptile: Aggregation-level Explanations for Hierarchical Data

Zezhou Huang; Eugene Wu

Conference Proceedings (Open Access)

Proceedings of the ACM SIGMOD International Conference on Management of Data (2022) 399-413

DOI: 10.1145/3514221.3517854

Abstract

Users can often see from overview-level statistics that some results look "off", but are rarely able to characterize even the type of error. Reptile is an iterative human-in-the-loop explanation and cleaning system for errors in hierarchical data. Users specify an anomalous distributive aggregation result (a complaint), and Reptile recommends drill-down operations to help the user "zoom in" on the underlying errors. Unlike prior explanation systems that intervene on raw records, Reptile intervenes by learning a group's expected statistics, and ranks drill-down sub-groups by how much the intervention fixes the complaint. This group-level formulation supports a wide range of error types (missing records, duplicates, value errors) and uniquely leverages the distributive properties of the user complaint. Further, the learning-based intervention lets users provide domain expertise that Reptile learns from. In each drill-down iteration, Reptile must train a large number of predictive models. We thus extend factorized learning from count-join queries to aggregation-join queries, and develop a suite of optimizations that leverage the data's hierarchical structure. These optimizations reduce runtimes by >6× compared to a LAPACK-based implementation. When applied to real-world Covid-19 and African farmer survey data, Reptile correctly identifies 21/30 errors (vs. 2 using existing explanation approaches) and 20/22 errors, respectively. Reptile has been deployed in Ethiopia and Zambia, and used to clean nation-wide farmer survey data; the clean data has been used to design national drought insurance policies.
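The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rank_drilldowns`, the region names, and all numbers are hypothetical, and the expected statistics are supplied by hand here, whereas Reptile learns them with predictive models. The sketch assumes a SUM complaint and uses the fact that SUM is distributive, so a single sub-group's contribution can be swapped for its expected value without rescanning raw records.

```python
def rank_drilldowns(observed, expected, target):
    """Rank sub-groups by how much repairing each one fixes the complaint.

    observed: dict mapping sub-group -> observed SUM contribution
    expected: dict mapping sub-group -> expected SUM contribution
              (hand-supplied here; an assumption standing in for a learned model)
    target:   the value the user believes the parent aggregate should have
    """
    total = sum(observed.values())
    baseline_error = abs(total - target)
    scores = {}
    for group, value in observed.items():
        # SUM is distributive: replace this group's contribution in O(1).
        repaired_total = total - value + expected[group]
        # Score = reduction in the complaint's error after the repair.
        scores[group] = baseline_error - abs(repaired_total - target)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical example: the parent total looks too high to the user.
observed = {"north": 100, "south": 95, "east": 310, "west": 105}
expected = {"north": 102, "south": 98, "east": 110, "west": 101}
ranking = rank_drilldowns(observed, expected, target=400)
print(ranking[0][0])  # the sub-group whose repair best resolves the complaint
```

In this toy input, repairing "east" removes almost all of the discrepancy, so it would be the first drill-down candidate. A negative score means that repairing that sub-group alone would move the aggregate further from the user's target.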

Citation (APA)

Huang, Z., & Wu, E. (2022). Reptile: Aggregation-level Explanations for Hierarchical Data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 399–413). Association for Computing Machinery. https://doi.org/10.1145/3514221.3517854
