Efficient provenance storage over nested data collections

Manish Kumar Anand; Shawn Bowers; Timothy McPhillips; Bertram Ludäscher

Conference Proceedings (Open Access)

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT'09 (2009) 958-969

DOI: 10.1145/1516360.1516470

Abstract

Scientific workflow systems are increasingly used to automate complex data analyses, largely due to their benefits over traditional approaches for workflow design, optimization, and provenance recording. Many workflow systems employ a simple dependency model to represent the provenance of data produced by workflow runs. Although commonly adopted, this model does not capture explicit data dependencies introduced by "provenance-aware" processes, and it can lead to inefficient storage when workflow data is complex or structured. We present a provenance model, extending the conventional approach, that supports (i) explicit data dependencies and (ii) nested data collections. Our model adopts techniques from reference-based XML versioning, adding annotations for process and data dependencies. We present strategies and reduction techniques to store immediate and transitive provenance information within our model, and examine trade-offs among update time, storage size, and query response time. We evaluate our approach on real-world and synthetic workflow execution traces, demonstrating significant reductions in storage size, while also reducing the time required to store and query provenance information. Copyright 2009 ACM.

Citation (APA)

Anand, M. K., Bowers, S., McPhillips, T., & Ludäscher, B. (2009). Efficient provenance storage over nested data collections. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT’09 (pp. 958–969). https://doi.org/10.1145/1516360.1516470
