Capturing and querying structural provenance in Spark with Pebble

Ralf Diestelkämper; Melanie Herschel

Conference Proceedings

Proceedings of the ACM SIGMOD International Conference on Management of Data (2019) 1893-1896

DOI: 10.1145/3299869.3320225

2 Citations

7 Readers

Abstract

Analyzing and debugging Spark processing pipelines is a tedious task that typically involves considerable engineering effort. The task becomes even more complex when the pipelines process nested data. Provenance solutions that track the derivation process of individual data items assist data engineers in debugging these pipelines. However, state-of-the-art solutions do not precisely track nested data items. We demonstrate Pebble, a system for capturing and querying a new type of provenance on nested data in Spark called structural provenance. It captures access to and modification of top-level as well as nested data items, and allows querying the provenance of nested items based on tree-pattern matching. Implemented as a standalone library on top of Apache Spark, it seamlessly leverages the underlying infrastructure for scalability. Through a graphical user interface implemented in a Jupyter notebook, we showcase ten debugging scenarios of Spark programs on real-world datasets.
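To make the idea of querying nested items by tree-pattern matching more concrete, the following is a minimal sketch in plain Python. It is not Pebble's API and does not use Spark: the function name match, the pattern encoding (nested dicts with literal leaves), and the sample records are all assumptions made for illustration. It only shows how a tree pattern can pinpoint the exact nested items, by path, that a provenance question refers to.

```python
from typing import Any, List, Optional, Tuple

Path = Tuple[Any, ...]


def match(item: Any, pattern: Any, path: Path = ()) -> Optional[List[Path]]:
    """Return paths to the leaves of `item` matched by `pattern`, or None.

    Hypothetical pattern encoding (an assumption, not Pebble's syntax):
    - a dict pattern requires every key to exist in the item and each
      sub-pattern to match the corresponding value;
    - any other pattern value is a literal that must equal the item;
    - a list in the data matches if at least one element matches, and
      the paths of all matching elements are reported.
    """
    if isinstance(item, list):
        found = False
        hits: List[Path] = []
        for index, element in enumerate(item):
            sub = match(element, pattern, path + (index,))
            if sub is not None:
                found = True
                hits.extend(sub)
        return hits if found else None

    if isinstance(pattern, dict):
        if not isinstance(item, dict):
            return None
        hits = []
        for key, sub_pattern in pattern.items():
            if key not in item:
                return None
            sub = match(item[key], sub_pattern, path + (key,))
            if sub is None:
                return None
            hits.extend(sub)
        return hits

    # Literal leaf: the nested value itself must equal the pattern value.
    return [path] if item == pattern else None


# Illustrative nested records (made up for this sketch).
records = [
    {"id": 1, "user": {"name": "ann", "tags": [{"text": "spark"}, {"text": "sql"}]}},
    {"id": 2, "user": {"name": "bob", "tags": [{"text": "flink"}]}},
]

# Tree pattern: a record whose user has some tag with text "spark".
pattern = {"user": {"tags": {"text": "spark"}}}

result = {record["id"]: match(record, pattern) for record in records}
```

Here result maps each record id either to the paths of the matched nested items or to None when the record does not match. Reporting paths rather than whole records reflects the distinction the abstract draws between tracking top-level items and tracking nested ones.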

APA

Diestelkämper, R., & Herschel, M. (2019). Capturing and querying structural provenance in Spark with Pebble. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 1893–1896). Association for Computing Machinery. https://doi.org/10.1145/3299869.3320225
