Users often can see from overview-level statistics that some results look "off", but are rarely able to characterize even the type of error. Reptile is an iterative human-in-the-loop explanation and cleaning system for errors in hierarchical data. Users specify an anomalous distributive aggregation result (a complaint), and Reptile recommends drill-down operations to help the user "zoom-in"on the underlying errors. Unlike prior explanation systems that intervene on raw records, Reptile intervenes by learning a group's expected statistics, and ranks drill-down sub-groups by how much the intervention fixes the complaint. This group-level formulation supports a wide range of error types (missing, duplicates, value errors) and uniquely leverages the distributive properties of the user complaint. Further, the learning-based intervention lets users provide domain expertise that Reptile learns from. In each drill-down iteration, Reptile must train a large number of predictive models. We thus extend factorized learning from count-join queries to aggregation-join queries, and develop a suite of optimizations that leverage the data's hierarchical structure. These optimizations reduce runtimes by >6× compared to a Lapack-based implementation. When applied to real-world Covid-19 and African farmer survey data, Reptile correctly identifies 21/30 (vs 2 using existing explanation approaches) and 20/22 errors. Reptile has been deployed in Ethiopia and Zambia, and used to clean nation-wide farmer survey data; the clean data has been used to design national drought insurance policies.
CITATION STYLE
Huang, Z., & Wu, E. (2022). Reptile: Aggregation-level Explanations for Hierarchical Data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 399–413). Association for Computing Machinery. https://doi.org/10.1145/3514221.3517854
Mendeley helps you to discover research relevant for your work.