The Origin of Data
Enabling the Determination of Provenance in Multi-institutional Scientific
Systems through the Documentation of Processes
by
Paul T. Groth
A thesis submitted in partial fulfillment for the
degree of Doctor of Philosophy
in the
Faculty of Engineering, Science and Mathematics
School of Electronics and Computer Science
September 2007
ABSTRACT
FACULTY of ENGINEERING, SCIENCE and MATHEMATICS
SCHOOL OF ELECTRONICS AND COMPUTER SCIENCE
Doctor of Philosophy
The Origin of Data
Enabling the Determination of Provenance in Multi-institutional Scientific Systems
through the Documentation of Processes
by Paul T. Groth
The Oxford English Dictionary defines provenance as (i) the fact of coming from some particular
source or quarter; origin, derivation. (ii) the history or pedigree of a work of art, manuscript,
rare book, etc.; concr., a record of the ultimate derivation and passage of an item through its
various owners. In art, knowing the provenance of an artwork lends weight and authority to it
while providing a context for curators and the public to understand and appreciate the work’s
value. Without such a documented history, the work may be misunderstood, unappreciated, or
undervalued.
In computer systems, knowing the provenance of digital objects would provide them with
greater weight, authority, and context just as it does for works of art. Specifically, if the
provenance of digital objects could be determined, then users could understand how documents were
produced, how simulation results were generated, and why decisions were made. Provenance
is of particular importance in science, where experimental results are reused, reproduced, and
verified. However, science is increasingly being done through large-scale collaborations that span
multiple institutions, which makes the problem of determining the provenance of scientific results
significantly harder.
Current approaches to this problem are not designed specifically for multi-institutional
scientific systems and their evolution towards more dynamic and peer-to-peer topologies. Therefore,
this thesis advocates a new approach, namely, that through the autonomous creation, scalable
recording, and principled organisation of documentation of systems’ processes, the
determination of the provenance of results produced by complex multi-institutional scientific systems is
enabled. The dissertation makes four contributions to the state of the art.
First is the idea that provenance is a query performed over documentation of a system’s
past process. Thus, the problem is one of how to collect and collate documentation from multiple
distributed sources and organise it in a manner that enables the provenance of a digital object
to be determined.
Second is an open, generic, shared, principled data model for documentation of processes,
which enables its collation so that it provides high-quality evidence that a system’s processes
occurred. Once documentation has been created, it is recorded into specialised repositories called
provenance stores using a formally specified protocol, which ensures documentation has
high-quality characteristics. Furthermore, patterns and techniques are given to permit the distributed
deployment of provenance stores. The protocol and patterns are the third contribution.
The fourth contribution is a characterisation of the use of documentation of process to
answer questions related to the provenance of digital objects and the impact recording has on
application performance. Specifically, in the context of a bioinformatics case study, it is shown
that six different provenance use cases are answered given an overhead of 13% on experiment
runtime. Beyond the case study, the solution has been applied to other applications including fault
tolerance in service-oriented systems, aerospace engineering, and organ transplant management.
Acknowledgements
1 Introduction
1.1 A Problem of Confidence
1.2 The Assurance of Provenance
1.3 The Role of Documentation
1.4 Thesis Statement and Contributions
1.5 Presentation Overview
1.6 Publications
2 A Critical Analysis of Provenance Systems
2.1 Multi-institutional Scientific Systems
2.1.1 Web Services and Service Oriented Architectures
2.1.2 The Use of Workflows
2.2 Provenance and Process
2.2.1 Provenance in Art
2.2.2 Two Perspectives on Process
2.3 Provenance Systems
2.3.1 Version Control Systems
2.3.2 Application-Specific Systems
2.3.3 Operating System Level Provenance Systems
2.3.4 Provenance in Database Systems
2.3.5 Distributed Debugging, Monitoring and Recovery
2.3.6 Workflow-centric Systems
2.3.7 Data Models for Provenance
2.4 Cross-Cutting Concerns
2.4.1 Levels of Abstraction
2.4.2 Answering Queries Related to Provenance
2.4.3 Causality
2.5 Analysis Conclusions
2.6 Summary
3 A Model of Process Documentation
3.1 Motivation for a Generic, Shared Process Documentation Data Model
3.2 Advantageous Characteristics for a Shared Data Model
3.3 The Data Model
3.3.1 Process
5 Case Study: The Amino Acid Compressibility Experiment
5.1 A Short Introduction to Biochemistry and Information Theory
5.1.1 Biochemistry
5.1.2 Information Theory
5.2 ACE: The Amino Acid Compressibility Experiment
5.3 ACE as a Multi-Institutional Scientific System
5.4 Six Provenance Use Cases
5.5 Summary
6 Evaluation
6.1 Implementation
6.1.1 Non-functional requirements
6.1.2 Design
6.1.2.1 The Provenance Service
6.1.2.2 The Provenance Store Client
6.1.3 Technologies Used by PReServ
6.2 Evaluation Environment
6.3 Provenance Store Performance
6.3.1 Storage Size Impact
6.3.2 Multiple Client Connections Impact
6.4 Case Study Performance
6.5 Use Case Satisfaction
6.5.1 Use Case 1
6.5.2 Use Case 2
6.5.3 Use Case 3
6.5.4 Use Case 4
6.5.5 Use Case 5
6.5.6 Use Case 6
6.6 Analysis
6.6.1 More detail vs. more time
6.6.2 Confidence and longevity vs. space and time
6.6.3 Throughput vs. contention
6.6.4 Space vs. time
6.7 Confidence Revisited
6.8 Related Work and Other Applications
6.9 Summary
7 Conclusion
7.1 Contributions
7.1.1 Process Documentation and Provenance
7.1.2 The P-Structure
7.1.3 Recording process documentation
7.1.4 Performance Impact
7.2 Support for High-Quality Documentation
7.3 Future Work
7.3.1 Integration
7.3.2 Usage
7.4 Concluding Remarks
Bibliography
1.1 Brochure from Starbucks discussing the origins of their coffee
1.2 The labels on these eggs show their provenance
2.1 Provenance of the painting Woman Holding a Balance by Johannes Vermeer [141]
3.1 A simple example application
3.2 Concept map describing process
3.3 Concept map describing process documentation
3.4 An example of documenting process at different levels of abstraction
3.5 Concept map describing tracers
3.6 An example of the contents of a p-structure that documents the interactions I2 and I3 from Figure 3.1
3.7 Concept map describing provenance
3.8 Causal graph describing the provenance of a numerical result
4.1 SeparateStore pattern diagram
4.2 ContextPassing pattern diagram
4.3 SharedStore pattern diagram
4.4 A simple example application
4.5 The SeparateStore pattern applied
4.6 The SharedStore pattern applied
4.7 The ContextPassing pattern applied
4.8 An example of linking
4.9 Contents of provenance stores
4.10 The messages of PReP
4.11 State Space
4.12 Provenance Store rules
4.13 The rules of the ASM used by sending and receiving actors
4.14 Measures for tables and messages defined in the ASM
4.15 Legend for Figures 4.16, 4.17, and 4.18
4.16 State transition diagram depicting Lemma 4.37
4.17 State transition diagram depicting the inductive hypothesis for proof of Lemma 4.37
4.18 State transition diagram depicting the inductive step for proof of Lemma 4.37
5.1 The 3D structure of the myoglobin protein.
5.2 The amino acids and their abbreviations
5.3 An example mutation matrix
5.4 The Taylor Categorisation of amino acids.
5.5 A basic communication system as defined by Shannon
5.6 The ACE workflow
6.1 The Provenance Service Architecture
6.2 Provenance store size impact on p-assertion record times.
6.3 Contention
6.4 Throughput as the number of jobs and threads per jobs increases
6.5 Colour map of throughput
6.6 ACE deployment workflow
6.7 Frequency distribution of job times
6.8 Distribution of job parallelism
6.9 Frequency distribution of p-assertion recording job times
6.10 Maximum, Minimum and Average job record times both with and without p-assertion recording
6.11 Graph of groupings sorted by their ACE information efficiency values
6.12 A collated sequence as the product of two sequences identified by their file paths
6.13 Example of the results produced for Use Case 1
6.14 Example of the results produced for Use Case 2
6.15 Example of the results produced for Use Case 3
6.16 Example of the results produced for Use Case 4
6.17 Example of the results produced for Use Case 5
6.18 Example of the results produced for Use Case 6
I, Paul Groth, declare that the thesis entitled The Origin of Data: Enabling the
Determination of Provenance in Multi-institutional Scientific Systems through the
Documentation of Processes and the work presented in the thesis are both my own, and have
been generated by me as the result of my own original research. I confirm that:
• this work was done wholly or mainly while in candidature for a research degree at
this University;
• where any part of this thesis has previously been submitted for a degree or any
other qualification at this University or any other institution, this has been clearly
stated;
• where I have consulted the published work of others, this is always clearly
attributed;
• where I have quoted from the work of others, the source is always given. With the
exception of such quotations, this thesis is entirely my own work;
• I have acknowledged all main sources of help;
• where the thesis is based on work done by myself jointly with others, I have made
clear exactly what was done by others and what I have contributed myself;
• parts of this work have been published as listed in Section 1.6.
Signed:
Date:
When I first started this process, I had images of a lone scholar in the midst of a library
somewhere surrounded by books writing away in isolation. For both my sanity and my
tremendous benefit, this was not the case. Without the support of the following people,
this dissertation would have never been done. More importantly, this process would not
have been so fun.
First, a big thank you goes to my parents: without your confidence in me this would not
have been possible. Your passion and belief in education are an inspiration. Thomas,
your advice on writing, creativity, and my serve is second to none. LaRue, your
unconditional encouragement, concern, and love have been a rock to stand on during this
process.
One piece of advice my parents gave me before starting this PhD is that picking a good
supervisor is absolutely critical in order to achieve success. Luckily for me, I picked
a great one, Professor Luc Moreau. Luc, thank you for teaching me the importance
of being systematic, how to actually do a proof, how to write like a scientist, and the
importance of fluid intake. The time you have spent having intense discussions with me,
reading my work, and putting up with my realistic schedules is very much appreciated.
In summary, good job.
Thanks to Michael Luck for giving me another perspective on academia and being a
nice guy. Klaus: thank you very much. Without your experiment and advice, this
doctoral thesis would never have existed. To the Iridis guys, Ivan Woolten and David Baker, without
your help (and responding to my strange requests), I would have never pulled this off.
A special shout out goes to Paul Townend for being the first and best user of my software.
Thanks to the people on the PASOA and EU Provenance projects for using the software
and your insightful comments on my ideas. Andrej and Tibo, your summer projects
were cool, thanks for coming on board to help.
Thanks to everyone in the IAM Lab, you make it a great place to work. Thanks to
Victor Tan and Weijian Fang for always taking the time to discuss technical stuff and
putting up with my Americanisms. Maria, it was great having another North American
around. The Lab and my time in Southampton would not have been as interesting or fun
without Roxana and Shiv. I hope your current endeavours take you where you want to
go. The last bit of crunch time before submission was made bearable by my chats with
Maira. Don’t worry, you’ll be finished soon. I would be remiss not to mention Danius,
who went from scary lab management guy, to a good friend. Thanks for the time at
Ceno looking at barmaids and giving me advice on my thesis.
For over two years in the lab, the desk next to mine was occupied by one Chris Bailey who
calmly put up with my commentary both verbal and via MSN as well as the occasional
Nerf gun attack. Dude, thanks for the coffee in Starbucks and listening to my grandiose
ideas.
Beyond sitting in Starbucks, playing tennis has kept me in good spirits. Thanks to Kat
& the Tennis Team for showing me what British student culture is really like, oh, and
the tennis was fun as well. Thanks to Eyre for letting Moss take hours on the weekend
to do battle with me on court. Moss, hopefully, we can have a hit again one day soon.
From Pensacola, I have to thank Joe for his extended emails that remind me what
the “real” world is like. Shamma, the once a year lessons on how to be both hip and
a computer scientist are good but they still haven’t rubbed off. Thanks to Niranjan,
whose recommendation got me here in the first place and whose New Year’s Eve parties
are always the place to give an update on life’s progress. Also thanks to David Eccles,
who introduced me to the cheeky pint before I even got to England and who showed me
that TGIs is a great place to do research.
My time in Southampton would not have been nearly as fun without the people of 25
High Road. Thanks to all of you for your epic support. Steve, women!?, what more
can I say. Thanks for your wisdom and for proofing thesis chapters. Seb, the French
food, the French wine, the bbqs, the server, and the unbelievable guitar have made it
an awesome trip. Mischa, thanks for the talks about life, academia, and Web 2.0 as
well as proofing this entire dissertation. It is very much appreciated. I’ll always fondly
remember sprawling out on your sofa. Martin, thanks for showing the way and letting me
move into High Road in the first place. Respect. Laura, talking to you about the wider
world has made life more colourful. In the end though, who is right? The dreads: Tony
stay metal. Simon, thanks for being proof that scientists actually use multi-institutional
scientific systems. Ben, thanks for showing me what life is really about. Sofie, thanks
for the living room chats.
Finally, the quality of this dissertation would be much lower without the help of Geraldine
and Simon. Geraldine, thanks for making Simon and me talk about things other than
provenance. Without your intervention, dinner and drinks would not have been as
exciting. Simon, working with you has been a stupendous experience. I will miss our
lunches together. Your ideas, your arguments, your questioning have taught me a lot
about how to do research, write code, and approach life. Thank you.
• The ability to identify what the inputs to an experiment were and where they
came from.
• The ability to know who performed an experiment and who is responsible for its
results.
The standard scientific practices of peer-review and publication take into account these
factors and provide the bedrock on which the confidence in scientific results is based.
However, in the context of multi-institutional scientific systems that may involve
hundreds of individuals, institutions, and components, it becomes difficult for scientists,
reviewers, and the public to obtain all the information they need to be confident in
the results these systems generate. Fundamentally, users need to understand how these
results were produced, their history, their origins, their provenance.
1.2 The Assurance of Provenance
The Oxford English Dictionary defines provenance as (i) the fact of coming from some
particular source or quarter; origin, derivation. (ii) the history or pedigree of a work
of art, manuscript, rare book, etc.; concretely, a record of the ultimate derivation and
passage of an item through its various owners.
In the field of art, knowing the provenance of an artwork provides collectors, curators,
and the public a context, which provides the means to understand, verify, and evaluate
that artwork. Provenance gives assurance that the artwork has value; that it is, for
example, truly painted by Johannes Vermeer and is not actually a forgery by Han van
Meegeren. Similarly, when Starbucks Coffee produces a brochure like the one in Figure
1.1, they are using a guarantee about the provenance of their coffee both to reassure
customers and to indicate its quality.
Just as knowing the provenance of a work of art provides it with greater weight, authority
and context, knowing the provenance of a digital object or data item offers similar
benefits. In particular, when detailed enough, the provenance of a digital object contains
all the information necessary to provide confidence to its users. Each of the various
factors used by scientists in their confidence judgements is addressed by having a
comprehensive record of a digital object’s derivation.
In different domains and environments, what constitutes a comprehensive record of
derivation may vary radically. For example, in art, the provenance of a painting usually
only details its chain of ownership. However, in some cases, it is not only necessary to
know the chain of ownership but also the various restorations the painting went through.
In food science, the provenance of food purchased at a grocery store would include where
the food was grown, how it was transported, packaged, and processed. An example of
the provenance of food is the label put on all eggs sold in Germany as shown in Figure
1.2. This label indicates how the hen that laid the egg was raised, what country the
egg is from, the farm where the egg was produced, and the cage where the egg was laid.
For digital objects, the provenance could include everything from the algorithms used
in processing to the user who started a computational simulation.
Figure 1.2: The labels on these eggs show their provenance
Thus, depending on what information gives a user confidence, the kind of information
returned from a query about an item’s provenance varies. The unifying theme between
the above records of derivation is that they document part of the process that led to an
item in a particular state. For example, the restorations of a painting are part of the
larger process that led to the painting in its current state. Thus, knowing the entirety of
the process that led to the painting as it is would also include the various restorations it
has undergone. Therefore, conceptually, we define the provenance of a result produced
by a system as follows:
Definition 1.1. The provenance of a result is the process that led to that result.
In computational systems, results are usually data items, and thus throughout this work
we focus primarily on the provenance of data, which would be the process that led to
the data item in question. By understanding the process that led to the result produced
by a multi-institutional scientific system, a scientist can have confidence in it.
1.3 The Role of Documentation
Processes, however, are ephemeral: they occur and then are gone. Thus, to show that
a process did in fact happen, some evidence is necessary. Revisiting the Oxford English
Dictionary’s definition of provenance, we note it concretely defines provenance as “a
record of the ultimate derivation and passage of an item...”. We view such a record
1. The provenance of a digital object can be answered by a query over documentation
of a system’s processes. Making this distinction explicit brings benefits in terms of
system design. It enables a separation of concerns between creators and queriers,
which allows queries to be performed by independent parties. Furthermore, it
caters for the specialisation in the design of creation, recording, and querying
components.
2. A data model for process documentation, which allows for the provenance of results
to be obtained and that is based on two key principles:
(a) Process documentation must represent causal relations between entities for
the provenance of results to be determined.
(b) To enable provenance queries to be answered accurately, documentation should
be high quality. It should have the characteristics of being factual, attributable,
autonomously creatable, process oriented, immutable and finalizable. These
characteristics are supported both by the data model and the recording of
process documentation into provenance stores.
These principles are necessary to enable provenance questions to be answered
accurately in distributed multi-institutional settings. This contribution is discussed
primarily in Chapter 3.
3. A protocol and patterns that enable the scalable recording of documentation into
provenance stores (Chapter 4). The protocol enforces the recording of process
documentation with high-quality characteristics. This is shown through a series
of proofs (Section 4.4.5). Scalability is shown through controlled experiments
conducted on an implementation of the repository that follows the protocol
specification (Section 6.3).
4. A characterisation of the use of documentation of process to answer questions
related to the provenance of digital objects and the impact recording has on
application performance. Specifically, the solution is evaluated in the context of a
real-world application from bioinformatics (Chapter 5). It is shown that six
different provenance use cases are answered given an overhead of 13% on experiment
runtime (Section 6.4). While these use cases are specific to the case study, they
reflect a range of provenance questions that scientists might pose.
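The first two contributions can be illustrated with a minimal sketch. It is given under stated assumptions and is not the data model developed in Chapter 3: documentation is reduced to immutable, attributable assertions that one item was caused by others, and provenance is a query that follows those causal relations backwards from a result. The names `Assertion`, `record`, and `provenance_of`, the example actors and items, and the in-memory list standing in for a provenance store are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)  # frozen: an assertion cannot be changed once created
class Assertion:
    asserter: str             # hypothetical: the actor who made the assertion
    effect: str               # the item that was produced
    causes: Tuple[str, ...]   # the items the effect was derived from


# Hypothetical stand-in for a provenance store: an append-only list.
store: List[Assertion] = []


def record(asserter: str, effect: str, *causes: str) -> None:
    """Creator side: document one step of a process as it happens."""
    store.append(Assertion(asserter, effect, tuple(causes)))


def provenance_of(item: str, documentation: List[Assertion]) -> List[Assertion]:
    """Querier side: collect every assertion causally upstream of `item`."""
    found: List[Assertion] = []
    seen = set()
    frontier = [item]
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        for assertion in documentation:
            if assertion.effect == current:
                found.append(assertion)
                frontier.extend(assertion.causes)
    return found


# Three independent actors document their own steps.
record("collate-service", "collated_sample", "sequence_a", "sequence_b")
record("compress-service", "compressed_size", "collated_sample")
record("unrelated-service", "other_result", "sequence_c")

# The query touches only the documentation, never the original actors.
trace = provenance_of("compressed_size", store)
for assertion in trace:
    print(assertion.asserter, assertion.effect, assertion.causes)
```

In this sketch the recording functions and the query function share nothing except the recorded assertions, which mirrors the separation of concerns between creators and queriers described in the first contribution. The unrelated assertion is left out of the result because no causal relation connects it to the queried item.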
1.5 Presentation Overview
This dissertation is organised as follows.
Chapter 2 discusses in greater detail the nature of provenance and its relationship to
processes. It also analyses the state of the art for determining provenance in computa-
• P. Groth, S. Miles, W. Fang, S. C. Wong, K.-P. Zauner, and L. Moreau. Recording
and Using Provenance in a Protein Compressibility Experiment. In Proceedings of
the 14th IEEE International Symposium on High Performance Distributed Com-
puting (HPDC-14), pages 201–208, July 2005
• P. Groth, S. Miles, and L. Moreau. PReServ: Provenance Recording for Services.
In Proceedings of the UK OST e-Science Fourth All Hands Meeting (AHM05),
September 2005
• P. Groth, S. Miles, and S. Munroe. Principles of High Quality Documentation
for Provenance: A Philosophical Discussion. In Moreau and Foster [135], pages
278–286
• P. Groth, S. Miles, and L. Moreau. A Shared Model for Documentation of Pro-
cesses Enabling the Determination of Provenance. ACM Transactions on Internet
Technology, 2007. Under Review
In addition, results of this dissertation were used as the basis of other applications, and
published as follows:
• P. Townend, P. Groth, and J. Xu. A Provenance-Aware Weighted Fault Tolerance
Scheme for Service-Based Applications. In Proceedings of the 8th IEEE Inter-
national Symposium on Object-oriented Real-time distributed Computing (ISORC
2005), pages 258–266. IEEE Computer Society, May 2005
• V. Tan, P. Groth, S. Miles, S. Jiang, S. Munroe, S. Tsasakou, and L. Moreau.
Security Issues in a SOA-Based Provenance System. In Moreau and Foster [135],
pages 203–211
• S. Miles, S. C. Wong, W. Fang, P. Groth, K.-P. Zauner, and L. Moreau.
Provenance-based Validation of e-Science Experiments. Journal of Web Semantics, 5(1):28–38,
2007
• S. Miles, P. Groth, S. Munroe, S. Jiang, T. Assandri, and L. Moreau. Extracting
Causal Graphs from an Open Provenance Data Model. Concurrency and Compu-
tation: Practice and Experience, 2007
• S. Miles, P. Groth, S. Munroe, M. Luck, and L. Moreau. AgentPrIMe: Adapting
MAS designs to build confidence. In Proceedings of 8th International Workshop on
Agent Oriented Software Engineering, 2007
• Heterogeneous - The kinds of resources available in a multi-institutional scientific
system often vary widely. Furthermore, the rules and policies that govern the
various resources vary dramatically across institutions.
• Dynamic - The resources available to a multi-institutional scientific system vary
over time: members may join or leave, facilities may be introduced, removed or
upgraded and the problem set may evolve. Furthermore, there may be a dynamic
range of participation in the system. For example, some institutions could be
core members of the system whereas others are on the periphery providing only a
small contribution. Likewise, there may be a core set of problems that the system
addresses and a series of smaller related problems.
Multi-institutional scientific systems exist in a variety of domains including earthquake
engineering [146], high energy physics [85], chemistry [75], climatology [20], weather fore-
casting [148], astronomy [177], and medicine [166]. These systems rely on software and
hardware infrastructure collectively known as the Grid [64]. This infrastructure enables
systems to share and cope with dynamic heterogeneous resources provided by multiple
parties. The software portion of the Grid is provided through common middleware (i.e.
software libraries and services used by applications). There are various Grid middleware
packages available including the Globus Toolkit [66], the UNiform Interface to Comput-
ing Resources (UNICORE) [173], gLite [79], and the Open Middleware Infrastructure
Institute stack [14]. These middleware packages provide services ranging from security to
resource discovery and thus ease the creation of multi-institutional scientific systems by
allowing them to take advantage of existing software that deals with the complications
of aggregating distributed resources.
2.1.1 Web Services and Service Oriented Architectures
One of the goals of these middleware packages is to facilitate interoperability between
disparate systems. To encourage and strengthen interoperability, the Grid community,
as represented by organisations like the Open Grid Forum and the Organization for the
Advancement of Structured Information Standards (OASIS), have embraced World Wide
Web technologies, particularly, Web Services [14]. By using standard technologies that
are widely deployed, such as the eXtensible Markup Language (XML) [33], Uniform Resource
Locators (URLs) [19] and the Hypertext Transfer Protocol (HTTP) [63], Web Services
provide cross-platform communication and data interoperability between applications.
The adoption of Web Services means that the plethora of Web-based resources can also
become part of the Grid infrastructure. For example, iSpecies.org combines data from
the Google Scholar and Yahoo Image search websites with processing power from the
National Center for Biotechnology Information (NCBI) to generate species specific web
pages.
The adoption of Web Services also reflects the move towards the use of the Service
Oriented Architecture (SOA) style of designing multi-institutional systems [69, 65]. SOA
is an architectural style that views applications as a set of loosely coupled services
communicating via a common transport. A service, in turn, is defined as a well-defined,
self-contained entity that takes input and produces output in accordance with a
well-defined interface¹. In the context of Web Services, a service’s interface can be expressed
in the Web Services Definition Language (WSDL) [42] and it can communicate using
the common transport protocol SOAP [130].
The SOA style provides three benefits when building multi-institutional scientific sys-
tems. First, it hides implementation behind an interface allowing a service’s implemen-
tation to change without impacting the user. For example, in the case of iSpecies.org,
NCBI could change the underlying hardware or programming language it uses to pro-
cess requests while not impacting the website. This is important in a multi-institutional
context because an institution may wish to change the implementation of its services
without having to involve collaborators. Loose coupling achieves the second benefit of
the SOA style, service reuse. The ability to reuse services is particularly important in
dynamic systems, where institutions are transient members and the landscape of re-
sources changes. In such an environment, it is critical for institutions to be able to
reuse and reallocate the services they provide to different collaborations. Finally, the
SOA style encourages platform independence. By only requiring a common transport,
the SOA style allows the use of the programming language, operating system, or im-
plementation technique that is best suited for a particular service. For example, in the
case of earthquake engineering, a shake table will use bespoke hardware and software
whereas computational simulators would use a parallel programming platform like MPI
(Message Passing Interface) [182].
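The first benefit, hiding implementation behind an interface, can be illustrated with a minimal Python sketch. All names here (`SequenceSearch`, `LegacySearch`, `ClusterSearch`, `build_species_page`) are hypothetical and are not drawn from iSpecies.org or NCBI; the sketch only shows a client that depends on an interface rather than on a particular implementation.

```python
from typing import Protocol


class SequenceSearch(Protocol):
    """Hypothetical service interface: the only thing a client depends on."""

    def search(self, species: str) -> list[str]: ...


class LegacySearch:
    """One implementation, e.g. running on the provider's original platform."""

    def search(self, species: str) -> list[str]:
        return [f"{species}:record-1", f"{species}:record-2"]


class ClusterSearch:
    """A replacement implementation: same interface, different internals."""

    def search(self, species: str) -> list[str]:
        # Results are computed differently but returned in the same shape.
        return [f"{species}:record-{i}" for i in (1, 2)]


def build_species_page(service: SequenceSearch, species: str) -> str:
    """Client code, unaware of which implementation sits behind the interface."""
    return f"{species}: " + ", ".join(service.search(species))


# Swapping the implementation leaves the client's output unchanged.
page_before = build_species_page(LegacySearch(), "Panthera leo")
page_after = build_species_page(ClusterSearch(), "Panthera leo")
```

Because the client is written against the interface only, the provider can replace one implementation with the other without involving its collaborators.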
Because of the benefits of SOA and Web Services, Grid middleware (such as Globus
Toolkit 4 [66], gLite [79] and UNICORE [124]) has transitioned to these technologies.
Furthermore, many multi-institutional systems from a variety of domains, including
climate modelling [20], weather forecasting [148], and astronomy [177], have adopted
these techniques and technologies. Bioinformatics is a particularly good example of a
domain that has used and benefited from service orientation [189, 169, 81]. Databases
containing a variety of genetic and biomedical data along with services to analyse that
data have been made available by a variety of institutions including the NCBI and the
European Bioinformatics Institute [170, 187]. These services have then been integrated
in order to investigate biological problems such as Williams-Beuren syndrome [171].
Because the bioinformatics community has widely adopted the service oriented approach
to building multi-institutional systems, we have chosen a case study from the field to
evaluate our provenance solution.
¹ This definition was derived from [186], [121] and [69].
Storage System [137] captures the execution of any program executing within the Linux
Operating System. In a single institution, it is possible to mandate the adoption of a
particular execution environment. However, in systems spanning multiple institutions,
it is difficult for institutions to impose such a mandate on each other. Furthermore,
in multi-institutional scientific systems, it is highly unlikely that all provenance-related
information pertaining to a particular experiment can be stored or aggregated into one
centralised location. Because these systems are dynamic and no one institution would
either want to be responsible for maintaining all the information after the experiment’s
end or give up its own information to some other institution, a unique challenge arises
that is not present in a single institution scenario. Thus, the inability to determine the
provenance of scientific results in multi-institutional scientific systems is a significant
problem and the one we aim to address.
Before detailing our solution to the provenance problem, we analyse the available systems
to see their deficiencies and strengths with respect to a multi-institutional environment.
To help in this analysis, we first discuss the concept of provenance and its relation to
process in more detail.
2.2 Provenance and Process
Provenance has a long history of usage in art. By understanding its usage in that
domain, some general characteristics of the concept can be derived, which are useful in
understanding the concept within computer systems, particularly the relation between
provenance and process. We now discuss provenance in art from which we explicate two
perspectives on process. When we discuss process throughout this dissertation, the word
should be understood by its common sense definition: a continuous and regular action
or succession of actions, taking place or carried on in a definite manner, and leading to
the accomplishment of some result (Oxford English Dictionary).
2.2.1 Provenance in Art
In art, the term provenance is used to describe the history of ownership of a work of
art (Oxford English Dictionary). For example, Figure 2.1² shows the provenance for a
painting by Johannes Vermeer. Notice that several statements in the provenance are of
the form “possibly...”; this exemplifies the uncertainty of the provenance of the painting.
This uncertainty about the provenance of an artwork is not unusual; on the contrary, it is
the norm especially with works created before the 17th century [101]. Thus, much of the
work in determining the provenance of an artwork is finding, analysing and judging the
² The image and provenance information in Figure 2.1 are used with permission of the National Gallery
of Art, Washington.
domain of satellite image processing.
The Collaborative Analysis Versioning Environment System (CAVES) and Collaborative
Development Shell (CODESH) are designed to provide a virtual logbook for distributed
collaborative groups [28]. Both CAVES and CODESH are interactive shells that users
log into to perform various data analysis tasks. Similar to S, the systems track the
user’s interactions with the shell and store them as session logs. These logs are then
published to a server, allowing other members of the collaborative group to investigate
and replay other users’ sessions. The systems are designed specifically for sharing users’
interactive analysis sessions in multi-institutional collaboratories; however, they are not a
complete solution because they do not capture what goes on outside the interactive shell.
In the context of distributed job execution on the Grid, work has concentrated on gath-
ering statistical information and re-running jobs. Both Quill++ [151] and gLite Job
Provenance [60] support these tasks. These systems are designed to be scalable and to
minimise the impact of provenance on job execution. Capturing data for provenance
in job execution environments is an important part of an overall solution for
multi-institutional scientific systems; however, both Quill++ and gLite are tied completely to
their execution engines (Condor and gLite respectively) and thus are not adequate as a
total solution for heterogeneous systems.
2.3.3 Operating System Level Provenance Systems
By capturing the execution of programs and the dependencies between them
through the operating system, operating system level provenance systems are more
generic than the systems discussed above because they are application and domain
independent.
One example of such a system is the Transparent Result Caching (TREC) prototype
[184]. TREC uses the Solaris UNIX proc system to intercept various UNIX system calls
in order to build a dependency map between those calls. Using this map, a trace of a
program’s execution can be transparently captured, which can be used, for example, to
automatically build a makefile from the user’s interaction with the operating system.
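The idea of turning intercepted calls into a dependency map, and the map into makefile rules, can be sketched as follows. This is an illustration under stated assumptions, not TREC's actual implementation: the trace format, the function names `dependency_map` and `to_makefile`, and the recipe lines are all hypothetical.

```python
from collections import defaultdict

# Hypothetical trace of intercepted calls: (program, operation, file).
trace = [
    ("cc", "read", "main.c"),
    ("cc", "read", "util.h"),
    ("cc", "write", "main.o"),
    ("ld", "read", "main.o"),
    ("ld", "write", "app"),
]


def dependency_map(trace):
    """Map each written file to its writer and the files that writer had read."""
    reads_so_far = defaultdict(list)  # program -> files read so far
    deps = {}
    for program, op, path in trace:
        if op == "read":
            reads_so_far[program].append(path)
        elif op == "write":
            deps[path] = (program, list(reads_so_far[program]))
    return deps


def to_makefile(deps):
    """Render the dependency map as makefile-style rules (recipes are illustrative)."""
    rules = []
    for target, (program, sources) in deps.items():
        joined = " ".join(sources)
        rules.append(f"{target}: {joined}\n\t{program} {joined}")
    return "\n\n".join(rules)


deps = dependency_map(trace)
makefile_text = to_makefile(deps)
```

The sketch treats every file a program read before a write as a dependency of the written file, which is a deliberately coarse rule chosen to keep the example short.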
Similar to TREC and ES3 [74], the Provenance-aware Storage System (PASS) integrates
with the Linux kernel to capture all operating system calls, storing arguments to those
calls as well as the dependencies between them. PASS stores this data in the Berkeley
DB database, which allows a variety of queries and processing to be performed on the
database [137]. Although PASS successfully completed the Provenance Challenge and
was able to run the scientific workflow described, that workflow had to be run on a single
computer [156]. PASS differs from TREC by integrating with the kernel directly. By not
executing in user-space as TREC does, PASS is able to gather more information than
TREC. For example, PASS captures read and write system calls whereas TREC does
workflow run is identified uniquely [162]. A variety of provenance related queries can be
answered by Karma using a combination of SQL queries and a Web Services API [164].
The Virtual Data System (VDS) is a workflow system that focuses on data-intensive
scientific applications [202]. The system takes a functional approach: executable
applications are described as transformations (i.e. functions) and the inputs to those
applications are described by derivations that bind particular data to a transformation
(i.e. function calls). The syntax to describe derivations and transformations is called the
Virtual Data Language (VDL) [71]. Workflows described in VDL can then be submitted
to workflow planners such as Pegasus [55] or converted to run in workflow enactment
engines such as Condor DAGMan [76]. When the concrete workflow is executed, the
parameters along with information about the runtime environment are stored in the VDS.
As in Karma, this parameter and runtime information is submitted back to the VDS by
the invoked services. Once stored in the VDS, parameter and runtime information can
then be combined with lineage information inferred from the VDL to answer a variety
of provenance queries [201]. This reliance on the existence of a workflow to infer prove-
nance is one of the major disadvantages to the VDS approach because it does not allow
provenance to be determined in cases where the workflow no longer exists.
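The functional approach, and the reliance of lineage inference on the workflow description, can be illustrated with a small Python sketch. It does not use VDL syntax or any VDS interface; the classes `Transformation` and `Derivation`, the function `lineage`, and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transformation:
    """An executable application described as a function signature."""
    name: str
    inputs: tuple
    outputs: tuple


@dataclass
class Derivation:
    """A 'function call': binds concrete data to a transformation's parameters."""
    transformation: Transformation
    bindings: dict  # parameter name -> data item name

    def used(self):
        return [self.bindings[p] for p in self.transformation.inputs]

    def produced(self):
        return [self.bindings[p] for p in self.transformation.outputs]


def lineage(item, derivations):
    """Infer the data items `item` was derived from (assumes an acyclic workflow)."""
    ancestors = []
    for d in derivations:
        if item in d.produced():
            for source in d.used():
                for a in [source] + lineage(source, derivations):
                    if a not in ancestors:
                        ancestors.append(a)
    return ancestors


align = Transformation("align", inputs=("seq",), outputs=("aligned",))
plot = Transformation("plot", inputs=("data",), outputs=("figure",))

workflow = [
    Derivation(align, {"seq": "raw.fasta", "aligned": "aligned.txt"}),
    Derivation(plot, {"data": "aligned.txt", "figure": "result.png"}),
]

ancestors = lineage("result.png", workflow)
# Without the workflow description, nothing can be inferred.
ancestors_without_workflow = lineage("result.png", [])
```

The second call makes the disadvantage concrete: once the list of derivations is gone, the same query returns nothing.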
Szomszor and Moreau argue for infrastructure support for provenance in Grid and Web
Service applications [178]. An architecture and implementation were developed around
a workflow enactment engine recording data into a separate repository. To cater for
reproducibility, all the inputs and outputs to Web Services are recorded along with the
workflow script and the interface definitions of services. The recording interface provided
by the implementation supports both the asynchronous and synchronous submission of
data. The implementation also has a validation capability that determines if a particular
result is current by re-executing the workflow and comparing the execution to the one
documented in the repository.
The workflow systems described here can all answer a variety of queries about the prove-
nance of data. However, as multi-institutional scientific systems become increasingly
decentralised, centralised workflow enactment engines do not have all the information
necessary to provide the complete provenance of various results. For example, if a
workflow enactment engine, A, called a service, B, which also executes a workflow, the
provenance-related information stored by A would not contain the information about
how B produced its results and thus the complete provenance of the output of A could
not be determined. Furthermore, with the exception of VDS and Karma, the systems
described only capture the workflow enactment engine’s view of a service invocation; this
leads to the possibility of manipulation of provenance-related information produced by
the workflow enactment engine. In a multi-institutional environment, both the
workflow enactment engine and the service in an interaction need to record documentation of
their involvement with each other. Finally, while all these systems have accessible,
well-specified data models, they are all tied to the particular notion of workflow implemented
service (i.e. a function, actor, black box). When a workflow is executed, a partial order of
Steps is created. Steps are instances of Step-classes with the input and output of the
Step attached. Thus, there is a direct link between specification and execution represen-
tation. Furthermore, Step-classes can inherit from other Step-classes allowing multiple
levels of abstraction to be expressed. However, the system relies on the presence of a
workflow definition (step-classes), which may not always be available, to understand the
functionality of a given service. Furthermore, when using the model, queries must infer
connections between steps by using time stamps and input/output matching. In a
distributed system, time stamps may not be correctly ordered, which can cause false
connections between steps to be inferred.
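To make the time stamp problem concrete, the following Python sketch infers the producer of a step's input from time stamps and input/output matching. The `Step` class, the inference rule in `infer_producer`, and the scenario are assumptions made for illustration; they are not taken from the model under review.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    inputs: tuple
    outputs: tuple
    start: float  # time stamps as reported by the local host clock
    end: float


def infer_producer(consumer, steps):
    """Guess which step produced the consumer's input.

    Rule (an assumption for this sketch): among steps whose output matches
    the consumer's input and that ended before the consumer started, pick
    the one that ended most recently.
    """
    candidates = [
        s for s in steps
        if s is not consumer
        and set(s.outputs) & set(consumer.inputs)
        and s.end <= consumer.start
    ]
    return max(candidates, key=lambda s: s.end, default=None)


old = Step("normalise_v1", ("raw.csv",), ("data.csv",), start=8.0, end=10.0)
# This step really ended at time 20.0, but its host clock runs 15 seconds
# behind, so the recorded time stamps are 3.0 and 5.0.
new = Step("normalise_v2", ("raw.csv",), ("data.csv",), start=3.0, end=5.0)
plot = Step("plot", ("data.csv",), ("figure.png",), start=25.0, end=27.0)

inferred = infer_producer(plot, [old, new, plot])
```

In this scenario the plot actually consumed the output of `normalise_v2`, yet the skewed clock leads the rule to select `normalise_v1`, a false connection of the kind described above.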
2.4 Cross-Cutting Concerns
Having reviewed a variety of provenance systems with respect to multi-institutional
scientific systems, we now analyse three cross-cutting concerns, namely, the level of
abstraction systems address, the nature of queries, and the fundamental role causality
plays in provenance.
2.4.1 Levels of Abstraction
The literature often discusses the “granularity” at which provenance-related information
is captured [161]. We find this term confusing: it is not clear what makes a system
fine-grained or coarse-grained. Instead of discussing granularity, we focus on the notion of various
levels of abstraction within applications. There are three axes in our notion of levels of
abstraction for provenance systems. They are the nesting of components, the nesting of
data, and the vocabulary used to describe processes. We address each in turn.
From a software engineering perspective, applications are often built using a hierarchy of
components. In object-oriented systems, objects contain other objects which contain
objects themselves. Likewise, in functional systems, functions call other functions which
in turn call other functions. This nesting of components is critical to the reusability of
software. It allows applications to be built by reusing and hooking together components
to create new functionality. With respect to provenance systems, it enables provenance
to be queried at different levels of component nesting. For example, a Web Service that
plots a graph contains components that perform various mathematical and drawing rou-
tines. A provenance system can capture the fact that a Web Service was called and a
graph was returned but it can also capture the use of the various drawing or mathemat-
ical components within the Web Service. This allows a user to view the provenance of a
result in terms of high-level components and then progressively break these components
down to view how the nested components contributed to the result’s generation.
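Viewing provenance at different levels of component nesting can be sketched in Python as follows. The `Invocation` structure, the `view` function, and the component names are hypothetical and serve only to illustrate the graph-plotting example.

```python
from dataclasses import dataclass, field


@dataclass
class Invocation:
    """One documented component invocation, with nested sub-invocations."""
    component: str
    children: list = field(default_factory=list)


def view(invocation, max_depth, depth=0):
    """List component names down to `max_depth` levels of nesting."""
    lines = ["  " * depth + invocation.component]
    if depth < max_depth:
        for child in invocation.children:
            lines.extend(view(child, max_depth, depth + 1))
    return lines


plot_service = Invocation("PlotGraphService", [
    Invocation("compute_statistics", [Invocation("mean"), Invocation("std_dev")]),
    Invocation("draw_axes"),
])

high_level = view(plot_service, max_depth=0)  # only the Web Service call
detailed = view(plot_service, max_depth=2)    # nested routines included
```

The same documentation supports both views; the user chooses how far to break the high-level component down.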
2.4.2 Answering Queries Related to Provenance
The fundamental goal of provenance systems is to enable users to answer questions
about the results produced by their systems. Expanding on Buneman’s categorisation
of why-provenance and where-provenance, the W7 model [150] identifies the broad range
of queries that fall under the heading of provenance. This conceptual model categorises
provenance into “what”, “when”, “where”, “how”, “who”, “which” and “why” questions
and provides Entity-Relationship diagrams defining data elements useful when answering
each question. To give an intuition as to the questions that would be asked by scientists
using a multi-institutional system, we now list example queries mapping to each category.
• What were the inputs to this experiment?
• When did the experiment run?
• Where did the experimental data come from?
• How fast did the experiment execute?
• Which data sources were accessed while running the experiment?
• Why did this part of the experiment fail?
To answer these questions, the reviewed provenance systems query the data they have
collected using a range of implementations from extensions to SQL [3] to their own query
APIs [199]. A common thread to all these systems is that, to answer questions related to
the provenance of data, they rely on the equivalent of a dependency graph between data
or events, which is then traversed to obtain an answer. In the case of workflow-based
systems, the dependencies expressed by the workflow are used to generate the graph.
In systems such as PASS and ES3, the dependencies between operating system calls are
explicitly captured and used to create a graph. Likewise, in database systems, weak
inversion functions express dependency information. These graphs may not contain all
the information necessary to answer a specific question; however, they provide a method
to connect the information required.
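The common pattern of traversing a dependency graph to answer a provenance question can be sketched in a few lines of Python. The graph contents and the function names `trace_back` and `original_sources` are hypothetical; none of the reviewed systems is claimed to work exactly this way.

```python
from collections import deque

# Hypothetical dependency graph: each item maps to the items it depends on.
depends_on = {
    "figure.png": ["stats.csv", "plot_script"],
    "stats.csv": ["raw_data.csv"],
    "raw_data.csv": ["sensor_db"],
}


def trace_back(item, graph):
    """Breadth-first traversal returning everything `item` depends on."""
    seen, order = {item}, []
    queue = deque([item])
    while queue:
        current = queue.popleft()
        for dep in graph.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


def original_sources(item, graph):
    """Answer a 'where' question: dependencies that have no dependencies."""
    return [d for d in trace_back(item, graph) if not graph.get(d)]


history = trace_back("figure.png", depends_on)
sources = original_sources("figure.png", depends_on)
```

A "where" question is answered here by the same traversal that would serve other question types; only the filter applied to the visited nodes changes.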
Thus, such dependency graphs are representations of provenance and share two
properties. First, the edges of the graph represent connections or relationships between data
or events. These relationships denote functions or operations applied to data. In gen-
eral, they represent the causal connection between data or events. For example, the
output of a function is caused by its inputs or the execution of a program is caused by
the user double clicking an icon. The notion of causality as a general way to represent
these connections stems from work in distributed systems [4, 108, 119]. Furthermore, in
rollback-recovering protocols, causal connections provide benefits over synchronisation
and transaction based approaches [174]. For example, a synchronisation-based approach
using a counterfactual definition [116], that is: if A had not occurred, then B would not
have occurred, all else being equal.
Much of the work on causation in computer science has focused on inferring causal
relationships from data sets [168]. Inferring causal relationships helps solve problems
in a variety of areas including artificial intelligence [144] and data mining [160]. Pearl
gives a systematic and mathematical treatment of causality [145]. He defines causality
in terms of probabilistic functions and directed acyclic graphs. Specifically, the following
definition for causal structure is given.
A causal structure of a set of variables V (defined as probability distribu-
tions) is a directed acyclic graph (DAG) in which each node corresponds to a
distinct element of V, and each link represents a direct functional relationship
among the corresponding variables.
From this definition, Pearl goes on to present tools for mathematically reasoning about
and inferring causality. A wide variety of techniques based on similar models are available
for the inference of causal relationships, particularly using Bayesian methods
[168]. Unlike this work, we are not trying to infer causality from some data set but
instead rely on observations to document causality within systems, specifically within
distributed systems. We use the notion of “observation by participation”; that is, a
component within a distributed system can observe data or events when it processes
such data or generates such events.
In distributed systems research, causality is discussed primarily with respect to asyn-
chronous distributed systems. Such systems are modelled by sets of automata that
perform three kinds of actions: sending a message (send event), receiving a message
(receive event), and internal events [119, 458]. For modelling reliable first in, first out
communication channels between automata, Lynch defines a cause function that maps
a receive event to a prior send event in the same channel, β [119, 460]. This function is
defined as follows:
1. For every receive event E1, E1 and cause(E1) contain the same message argument.
2. cause is surjective (onto).
3. cause is injective (one-to-one).
4. cause preserves order, that is, there do not exist receive events E1 and E2 with E1
preceding E2 in β and cause(E2) preceding cause(E1) in β.
Intuitively, the definition is saying that receipt of a message is caused by the sending of
that same message. Lynch expands the notion of causality to include the idea that
an occurrence can only be known by its executor; any other automaton or service
would only be able to infer the existence of the occurrence.
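The four conditions on the cause function can be checked mechanically for a finite sequence of events. The Python sketch below is an illustration only: representing β as a list of (kind, message) pairs, representing cause as a dictionary over list indices, and the function name `is_valid_cause` are all assumptions of the sketch.

```python
def is_valid_cause(beta, cause):
    """Check the four listed conditions for a candidate `cause` mapping.

    beta  : list of (kind, message) pairs, kind being "send" or "receive"
    cause : dict from the index of a receive event to the index of a send event
    """
    sends = [i for i, (kind, _) in enumerate(beta) if kind == "send"]
    receives = [i for i, (kind, _) in enumerate(beta) if kind == "receive"]

    # cause must be defined on every receive event and point to a prior send.
    if set(cause) != set(receives):
        return False
    if any(cause[r] not in sends or cause[r] >= r for r in receives):
        return False

    same_message = all(beta[r][1] == beta[cause[r]][1] for r in receives)  # 1
    surjective = set(cause.values()) == set(sends)                          # 2
    injective = len(set(cause.values())) == len(cause)                      # 3
    # 4: visiting receives in order, their causes must also appear in order.
    images = [cause[r] for r in receives]
    order_preserving = images == sorted(images)
    return same_message and surjective and injective and order_preserving


beta = [("send", "a"), ("send", "b"), ("receive", "a"), ("receive", "b")]
fifo_ok = is_valid_cause(beta, {2: 0, 3: 1})

# Two identical messages delivered in swapped order violate condition 4 only.
same = [("send", "m"), ("send", "m"), ("receive", "m"), ("receive", "m")]
crossed_ok = is_valid_cause(same, {2: 1, 3: 0})
```

The second example passes conditions 1 to 3 and fails only on order preservation, which is the condition that captures first in, first out delivery.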
Using the notion of occurrence, we define our own notion of causality as follows:
Definition 2.1 (Causation). An occurrence O2 is caused by an occurrence O1 if one
of the following holds:
1. O2 is functionally related to O1 (i.e. O2 has a direct functional relationship to O1
from Pearl).
2. O1 is the sending of a message and O2 is the corresponding receiving of the message
(as defined by the cause relationship from Lynch).
3. O1 and O2 are related by a chain of relationships of types 1 and 2.
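Definition 2.1 amounts to reachability over the union of the two direct relationships. A minimal Python sketch follows; the function name `causes`, the representation of relationships as sets of pairs, and the occurrence names are hypothetical.

```python
def causes(o1, o2, functional, messages):
    """Return True if occurrence o2 is caused by occurrence o1 (Definition 2.1).

    functional : set of (cause, effect) pairs, direct functional relationships
    messages   : set of (send, receive) pairs for matching message events
    """
    direct = functional | messages  # relationships of types 1 and 2
    frontier, seen = [o1], {o1}
    while frontier:  # type 3: follow chains of direct relationships
        current = frontier.pop()
        for cause, effect in direct:
            if cause == current and effect not in seen:
                if effect == o2:
                    return True
                seen.add(effect)
                frontier.append(effect)
    return False


functional = {("input", "send_request"), ("receive_request", "result")}
messages = {("send_request", "receive_request")}

chain_holds = causes("input", "result", functional, messages)
reverse_holds = causes("result", "input", functional, messages)
```

Here the result is caused by the input through a chain that crosses a message exchange, which neither type of direct relationship could express on its own.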
In summary, Definition 2.1 differs from the definitions provided by Lamport and Lynch
in two important aspects. First, the definition deals with both events and data. Second,
it provides a specific traceable causal connection between the reception of a message and
subsequent sending of another message. These two aspects make Definition 2.1 more
suitable for determining the provenance of results.
In this section, we have briefly discussed two views of causality from distributed systems
and causal inference research. Based on these views, we defined a particular notion of
causality suited to provenance and multi-institutional scientific systems.
2.5 Analysis Conclusions
We have presented a wide range of systems and models that address the problem of
provenance in computational systems. We have also investigated three cross-cutting
concerns: multiple levels of abstraction, queries and causality. From our analysis, we
have come to the following six key conclusions.
1. The Service Oriented Architecture style is the primary software engineering ap-
proach to designing multi-institutional applications.
In Section 2.1.1, we noted that the SOA style is suited to multi-institutional scientific
systems because it hides implementation, enables service reuse, and encourages
platform independence. Because of these benefits, the SOA style has been used in
a variety of scientific systems that consider a range of domains from bioinformatics
to weather forecasting. Furthermore, the SOA style is being adopted by a vari-
ety of Grid-middleware platforms, including the Globus Toolkit. These platforms
provide the basis for a large number of multi-institutional scientific systems [67].
the data they capture and store (process documentation) and the representation of
provenance that they retrieve from that data. Finally, open, generic, data models are
important in allowing provenance describing multi-site processes to be retrieved.
In the next Chapter, we describe an open data model that supports high-quality char-
acteristics and takes into account our analysis conclusions.
A Model of Process
Documentation
At the beginning of this dissertation, we outlined the need for the provenance of results
in order to establish confidence in those results especially when they are produced by
dynamic multi-institutional scientific systems. Furthermore, we introduced the notion
that provenance is a question answered by querying documentation of an application’s
process. This novel distinction between provenance and process documentation allows
provenance questions, unknown at the time of application execution, to be successfully
answered provided enough documentation has been produced. Additionally, this sepa-
ration of concerns allows components to be specialised for their specific role, either the
creation of process documentation or its querying. To enable the creation and query-
ing of process documentation by distributed software components, there must be some
shared understanding between all these components. In this chapter, we present a data
model for process documentation that provides this shared understanding.
But what should this data model look like? In Chapter 2, we arrived at six conclu-
sions about the state of the art for determining provenance in multi-institutional sys-
tems. Taking these conclusions into account, this chapter specifies a model of process
documentation that is compatible with SOAs, provides explicit veridical relationships
between occurrences, and is both technology and domain independent. Furthermore,
as discussed later, the data model is designed to support high-quality characteristics
derived from a use case analysis.
Therefore, the contributions of this Chapter are as follows:
• A more detailed description of the set of characteristics that define high quality
process documentation.
• A precise conceptual definition of a generic data model for process documentation
6. Platform independence: Internet applications are often developed using a va-
riety of platforms (i.e. operating systems, programming languages, architectures).
Such a model allows for provenance to be determined from process documentation
generated by application components running on any platform.
3.2 Advantageous Characteristics for a Shared Data Model
While any generic, shared data model could provide these benefits and allow the
provenance of results to be determined, the data model that we specify is designed to support
the creation of accurate process documentation. In Section 1.3, we outlined a number
of characteristics that help ensure accurate process documentation. We termed process
documentation that adheres to these characteristics high quality. These characteristics
were derived from an analysis of use cases from several domains [126]. We looked at both
the technical requirements enumerated in the analysis as well as the use cases themselves.
We found that in a majority of the use cases, process documentation provides evidence
that a process occurred. Thus, these characteristics are justified by philosophical ar-
guments that equate process documentation to evidence. Beyond these philosophical
arguments, a number of the characteristics also directly support a technical requirement
enumerated in the use case analysis. We now revisit these characteristics and justify
them in greater detail below.
Characteristic 1 (Factual). In the previous chapter, we noted that a number of prove-
nance systems produce documentation that contains both factual and inferred informa-
tion. With this combination, it is difficult to determine whether the process evidenced
by the documentation actually occurred as described. Thus, we introduce the notion
that process documentation should be factual: it should only be about what is known
to have occurred in an application. To support this characteristic, we adopt the notion
of observation by participation; that is, process documentation that evinces a particular
application operation should only be created by the component that performed the
operation.
Characteristic 2 (Attributable). In a court of law, evidence, particularly testimony,
is judged by the person or institution who provides it. Furthermore, if it is found
that the evidence given is false then remedial action can be taken against the provider.
Similarly, if a user deems that process documentation is somehow erroneous, the user
must know who is responsible for the creation of the documentation so that the party
can be held accountable. By ensuring users know the accountable party, they will have
greater confidence in process documentation.
Characteristic 3 (Autonomously Creatable). In both criminal and scientific in-
vestigations, evidence is gathered at the most appropriate time and by the most ap-
propriate person, device, or institution. By analogy, the distributed components of a
3.3 The Data Model
We now present the various concepts that underpin the p-structure using our simple
example shown in Figure 3.1. First, we define a specific notion of process represented by
the p-structure. Next, we detail the data model and its constituent parts. After which,
requirements on component behaviour are defined. Finally, we describe how provenance
can be determined from process documentation organised using the model.
For each set of concepts, we provide a concept map [143] that gives an overview of the
concepts and the relationships between them. Concept maps were chosen because they
are designed for human consumption. Computer parsable representations are available
in XML [138], OWL1 and Java [89]. The use of concept maps was inspired in part by the
Web Services Architecture specification [24], which also uses concept maps. The concept
maps, shown in Figures 3.2, 3.3, 3.5 and 3.7, contain concepts represented by shaded
rounded rectangles and relationships linked by lines between concepts. The words in
the middle of a line denote the kind of relationship between the linked concepts. Maps
are read downward or, if an arrow is present, in the direction in which the arrow points.
For example, the top portion of Figure 3.2 can be read as “Actors play a role” and “Actors
have points of communication”. In the text, we italicise the first occurrence of concepts
that appear in a concept map.
3.3.1 Process
The concepts discussed in this section are summarised by Figure 3.2. Applications are
developed to address a variety of problems using different programming languages, design
approaches, and execution environments. To represent this dynamic range of situations,
we take a particular perspective on all applications, which embraces the principles of
encapsulation and abstraction to enable process documentation to be created at varying
levels of detail while still preserving coherence across applications and their components.
The perspective we take is to view applications as composed of entities, called actors,
each of which represents a set of functionality within the application. Actors interact
with other actors by the sending and receiving of messages through well-defined points
of communication. Such a view naturally fits with service oriented architectures, one of
the primary software engineering approaches for complex multi-institutional applications
[69]. Our decomposition of applications is conceptual and is not restricted to applications
already based on message passing. For example, as we view messages as information
exchanged by actors, two threads communicating by a shared memory can also be viewed
as actors.
1 http://www.pasoa.org/schemas/ontologies/pstruct025.owl
Figure 3.2: Concept map describing process
To aid developers in the mapping of their applications to this conceptual perspective,
a software engineering methodology for the decomposition of applications into actors
has been created [139]. For example, using this methodology, the example application
was decomposed into actors that map to each step in the workflow, i.e. Client Initiator,
Mathematical Function, and Collate Sample. Each actor represents some functionality
at a level of abstraction. Through the addition of actors, application functionality can be
represented at a greater level of detail. For instance, some of the functionality encapsulated
by the Mathematical Function actor is represented in more detail by the Collate Sample
actor.
When decomposing an application, specific points of communication are pinpointed and
given an identifier called a message source or message sink, which respectively denote
where messages are sent and received. An actor may have any number of points of
communication; the only restriction is that they are clearly identified. For example, in
a Web Services context, a message sink would be the endpoint reference of an actor.
An actor’s functionality may only operate or work upon messages and by extension
the data within messages that it has received or that have been created by the actor’s
functionality. Therefore, to define the boundary of an actor, we define the scope of an
actor as the set of messages that have been received or been created by the actor (i.e.
message creation and message reception). Thus, the scope defined here is not a static
scope defined by software components but a dynamic scope that exists in the space of
execution. Concretely, an actor that represents a procedure would contain within its
scope not only all the inputs to the procedure and all the output data the procedure
returns, but also all the data it may read or store in memory.
From scope, we define the notions of inside and outside an actor. Data is said to be
inside an actor when it is part of a message that is an element of an actor’s scope.
Likewise, data is said to be outside an actor when it is not part of a message that is
an element of an actor’s scope. This definition also applies to the events within an
application. If an event happens to data inside an actor, then it is also said to be inside
the actor. Correspondingly, if an event happens to data outside an actor, then the event
is also said to be outside the actor. Thus, both receive and send events are inside an
actor when the message being sent or received is also inside that actor.
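The notions of scope, inside, and outside can be illustrated with a short Python sketch. It is illustrative only: the class names `Message` and `Actor`, the `parts` field, and the `is_inside` helper are assumptions made for this example and are not part of the model's specification.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    """A message exchanged between actors; `parts` holds the data items it carries."""
    message_id: str
    parts: frozenset


@dataclass
class Actor:
    """An actor whose scope is the set of messages it has received or created."""
    name: str
    scope: set = field(default_factory=set)

    def create(self, message: Message) -> None:
        # Message creation places the message in the actor's scope.
        self.scope.add(message)

    def receive(self, message: Message) -> None:
        # Message reception also places the message in the actor's scope.
        self.scope.add(message)

    def is_inside(self, data_item: str) -> bool:
        # Data is inside the actor when it is part of a message in its scope.
        return any(data_item in message.parts for message in self.scope)


request = Message("m1", frozenset({"sample size"}))
initiator = Actor("Client Initiator")
function = Actor("Mathematical Function")

initiator.create(request)
print(function.is_inside("sample size"))  # False: the message is not yet in scope
function.receive(request)
print(function.is_inside("sample size"))  # True: reception extends the scope
```

Because the scope grows as messages are created and received, it is a property of an execution rather than of the static software component, which matches the dynamic reading of scope given above.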
Scope provides a mechanism that allows data or events to be located within an appli-
cation. However, a mechanism is also needed that expresses how events and data are
connected together to form a process within an application execution. Given our mes-
sage passing perspective, all data is contained within messages and the basic events that
we consider are the sending and receiving of messages. Therefore, we define how these
primitives are connected.
From Part 2 of Definition 2.1, we maintain that the receipt of a message is caused
by the sending of that same message. The combination of a send and receive event
along with the message exchanged is termed an interaction. Therefore, an interaction
expresses both a causal connection between a sending event and a receiving event as well
as the contents of the message being exchanged. Specifically, an interaction describes
an external causal connection, a logical connection formed where, for a given actor, the
event inside the actor is caused by an event outside the actor. An interaction matches
this definition because the internal receive event is caused by a send event outside the
receiving actor. An example of an interaction is the sending of a response from the
Collate Sample actor to the Mathematical Function actor.
With respect to interactions, actors can also play different roles. An actor may have the
role of a message sender in one interaction and may play the role of a message receiver in
another. An interaction can have metadata associated with it. This metadata is usually
embedded within the message being exchanged. However, it may be associated in some
other manner. Using this metadata, a sender can share information with a receiver
that enables process documentation produced by these separate actors to be collated
together. Specifically, a sender can generate a unique key for an interaction, termed an
interaction key, and share this with the receiver, thus allowing the two actors to refer
to a specific interaction using the same identifier. Based on the interaction key, both the
sending and receiving events can be identified by event identifiers. As we later show,
these identifiers are crucial to organising process documentation.
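The sharing of an interaction key can be sketched as follows. This is a minimal illustration under stated assumptions: the key is modelled as a random UUID string, it travels in a hypothetical `metadata` field of the message, and an event identifier is modelled as the key plus the kind of event. None of these representational choices is prescribed by the text.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class EventId:
    """Identifies one event of an interaction (assumed layout: key plus event kind)."""
    interaction_key: str
    kind: str  # "send" or "receive"


def new_interaction_key() -> str:
    # The sender generates a key that is unique to this interaction.
    return uuid.uuid4().hex


def attach_key(message: dict, interaction_key: str) -> dict:
    # The key is shared with the receiver by embedding it in the message metadata.
    return {**message, "metadata": {"interaction_key": interaction_key}}


# Sender side: create the key, embed it, and identify the send event.
key = new_interaction_key()
wire_message = attach_key({"body": "collated sample"}, key)
send_event = EventId(key, "send")

# Receiver side: read the same key back and identify the receive event.
received_key = wire_message["metadata"]["interaction_key"]
receive_event = EventId(received_key, "receive")

# Both actors now refer to the same interaction, but to distinct events.
print(send_event.interaction_key == receive_event.interaction_key)
```

With a shared key, documentation produced independently by the two actors can later be collated without either actor knowing how the other records its documentation.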
Interactions express the causal connection between the sending and receiving of a mes-
sage. However, they do not express the causal connection between the receiving of a
message and the sending of another message within the scope of an actor. The sending
of a message by an actor is caused by the execution of some functionality within the
actor and this, in turn, is caused by the receipt of a set of messages. This causal connec-
tion follows from Part 1 of Definition 2.1. We term the causal connection between the
receiving of messages and the sending of messages caused by an actor’s functionality a
transformation.2 This causal connection is termed an internal causal connection because
the events being connected are both inside the actor.
Consequently, in an application, the receiving of data by an actor may cause a transfor-
mation to occur. This transformation may cause the sending of data, which itself causes
the receipt of data in another actor, which in turn may cause a transformation and so
on. Thus, a process can be defined as follows:
Definition 3.1. (Interaction-based Process Definition) A process is a causally connected
set of interactions and transformations.
Using this definition, precise organised process documentation can be created. We now
describe a model, the p-structure, tied to this definition of process.
2 Our model of process does not allow sequences of connected transformations inside an actor. This
restriction ensures a simple decomposition rule: if more detail is needed about a transformation, the
actor must be decomposed into more actors. Furthermore, the introduction of transformation sequences
within an actor would necessitate the introduction of more primitives, thus, adding complexity to the
model.
Figure 3.3: Concept map describing process documentation
The representation of the message contained in the interaction p-assertion often contains
an exact duplicate of the message, but, in some instances it may not be feasible to have
such a representation, for example, when the data being transferred needs to remain
anonymous to users of process documentation or is of a large size. In the example appli-
cation, this occurs when the sample generated by the Collate Sample actor is replaced
with a reference to save storage space. To allow for these cases while still preserving an
accurate representation, we allow a message to be transformed in a well-defined manner
during the generation of a p-assertion, which is termed styling the p-assertion. The
styling that is performed is defined explicitly by a documentation style. Causal depen-
dencies are not tracked for these styling transformations because they pertain to the
creation of process documentation as opposed to the production of application results.
Likewise, the created p-assertions are not seen as application data and are not in the
scope of an actor. For example, when the Collate Sample actor receives a large mes-
sage, it may store that to a local database. To document the reception of the message, it
generates a p-assertion with a reference to the data within it. While documentation pro-
duced after styling may not be as detailed as a copy of a message, it still provides critical
evidence that a process occurred while allowing practical issues such as anonymization
and scalability to be catered for. Therefore, like all other process documentation, it
should be immutable. We discuss the particular case of references in more detail
in Section 6.6.2.
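Styling can be sketched in Python as a named transformation applied to a message before it is placed in a p-assertion. The sketch is an assumption-laden illustration: the field layout of `InteractionPAssertion`, the style names `verbatim` and `reference`, the use of a SHA-256 digest as the reference, the event identifier string, and the in-memory `LOCAL_DATABASE` are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass

LOCAL_DATABASE = {}  # hypothetical stand-in for the actor's local database


@dataclass(frozen=True)  # frozen: a p-assertion cannot be modified once created
class InteractionPAssertion:
    asserter: str
    event_id: str
    documentation_style: str  # names the styling that was applied
    content: str


def style_verbatim(message: dict) -> str:
    # An exact copy of the message.
    return json.dumps(message, sort_keys=True)


def style_reference(message: dict) -> str:
    # Store the message locally and document only a reference to it.
    serialised = json.dumps(message, sort_keys=True)
    reference = "ref:" + hashlib.sha256(serialised.encode("utf-8")).hexdigest()
    LOCAL_DATABASE[reference] = serialised
    return reference


STYLES = {"verbatim": style_verbatim, "reference": style_reference}


def document(asserter: str, event_id: str, message: dict, style: str) -> InteractionPAssertion:
    # The documentation style is recorded explicitly alongside the styled content.
    return InteractionPAssertion(asserter, event_id, style, STYLES[style](message))


sample = {"sample": list(range(1000))}
copied = document("Collate Sample", "I3/receive", sample, "verbatim")
referenced = document("Collate Sample", "I3/receive", sample, "reference")
print(len(copied.content), len(referenced.content))
```

Recording the style name with the content lets a querier tell whether it holds a copy of the message or a reference that must be resolved elsewhere.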
Interaction p-assertions document both the data within applications as well as the ex-
ternal causal connections between the actors within those applications.
3.3.2.2 Relationship P-assertions
Unlike interaction p-assertions, relationship p-assertions represent internal causal con-
nections between occurrences, which are defined as events or data items involved in
events. For example, in our simple example application, an occurrence is the reception
of an initiator request by the Mathematical Function actor. The data items in question
can be entire messages or parts of messages. To locate a part of a message within
process documentation, data accessors are introduced, which are descriptions of how to
find parts within p-assertions that document messages. Therefore, an occurrence within
a relationship p-assertion is identified by locating the p-assertion where the event is
documented and, if necessary, a data accessor.
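The identification of an occurrence can be sketched as a pair of a p-assertion identifier and an optional data accessor. In this illustration the accessor is assumed to be a simple path of dictionary keys; the text does not fix an accessor language, and the names `Occurrence` and `resolve` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class Occurrence:
    """Locates an event, or a data item within the message it documents."""
    p_assertion_id: str
    data_accessor: Optional[Tuple[str, ...]] = None  # assumed form: a path of keys


def resolve(occurrence: Occurrence, documentation: dict) -> Any:
    # Find the p-assertion that documents the event.
    content = documentation[occurrence.p_assertion_id]
    # If an accessor is given, follow it to the part of the message it describes.
    for step in occurrence.data_accessor or ():
        content = content[step]
    return content


documentation = {"pa-1": {"request": {"sample_size": 50}}}

whole_message = Occurrence("pa-1")
one_part = Occurrence("pa-1", ("request", "sample_size"))

print(resolve(whole_message, documentation))
print(resolve(one_part, documentation))
```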
A relationship p-assertion identifies one or more occurrences that are causes and one
occurrence that is the effect of those causes. We limit a relationship p-assertion to
one effect to make it easier to find the provenance of a particular occurrence: with this
approach, there is no need to disambiguate which causes are associated with a particular
effect. The specific relationship between these causes and the effect is described by a
relation. The two types of causal relationships that are allowed between occurrences are
abstraction is required, process documentation for each level can coexist and complement
one another to facilitate analyses.
3.3.2.4 Internal Information P-assertions
We now discuss one final type of p-assertion that facilitates abstraction and is introduced
for convenience when using the model. It is often the case that a piece of data plays an
important role in a process but the manner of its generation is not of interest. Examples
of this include the time, the memory usage of an actor, and the configuration of an
actor. All of these data items can be represented using relationship p-assertions and
interaction p-assertions. However, using internal information p-assertions, the detail of
how these data items were obtained can be abstracted away and one is left with just
the data item and its basic causal connection to the process. We make the restriction
that the data item is obtained by the actor either just before the sending of a message
or just after the receipt of a message. This ties the data item explicitly to a particular
occurrence in the process. Essentially, it allows an actor to assert, for example, that it
sent a message at a particular time or that its memory usage was 20% after receiving
a message. Thus, an internal information p-assertion represents the receipt of data by
an actor from some other unidentified actor and is causally connected to the sending
or receiving of a message by the former actor. The causal connections represented by
internal information p-assertions are different for sending and receiving events and are
as follows:
1. Sending: The sending of a message is caused by the receipt of the data within the
internal information p-assertion, obtained just before the sending of the message.
2. Receiving: The receiving of a message causes the receipt of the data within the
internal information p-assertion, obtained just after the receipt of the message.
We note that because internal information p-assertions represent the receipt of mes-
sages they can be used as causes within relationship p-assertions. Also, an actor can
style the data during the creation of the internal information p-assertion. Thus, an in-
ternal information p-assertion consists of four parts: an asserter identity, the data, the
documentation style of the data, and the event identifier of the event to which the data
is causally connected. Internal information p-assertions allow data items to be made
explicit without the sometimes unnecessary overhead of creating documentation for
their generation.
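The four parts of an internal information p-assertion, and the two causal rules listed above, can be sketched as follows. The class and function names, the event identifier layout, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class EventId:
    interaction_key: str
    kind: str  # "send" or "receive"


@dataclass(frozen=True)
class InternalInformationPAssertion:
    asserter: str             # asserter identity
    data: Any                 # the data item
    documentation_style: str  # how the data was styled
    event_id: EventId         # event to which the data is causally connected


def causal_connection(p: InternalInformationPAssertion) -> Tuple[str, str]:
    """Return (cause, effect) according to the sending and receiving rules."""
    info = f"receipt of {p.data!r}"
    event = f"{p.event_id.kind} event of interaction {p.event_id.interaction_key}"
    if p.event_id.kind == "send":
        # Data obtained just before sending is a cause of the send event.
        return (info, event)
    if p.event_id.kind == "receive":
        # Data obtained just after receiving is an effect of the receive event.
        return (event, info)
    raise ValueError(f"unknown event kind: {p.event_id.kind}")


sent_at = InternalInformationPAssertion(
    "Mathematical Function", {"time": "12:00:00"}, "verbatim", EventId("I2", "send"))
memory = InternalInformationPAssertion(
    "Collate Sample", {"memory_usage": "20%"}, "verbatim", EventId("I2", "receive"))

print(causal_connection(sent_at))
print(causal_connection(memory))
```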
3.3.2.5 The P-Structure
P-assertions contain the elements necessary to represent a process. However, without
some organisation it would be difficult to discover distinct processes within process
of p-assertions to find a piece of metadata, queriers can look in a well defined location
that is independent of any particular message format.
We now look briefly at one mechanism, tracers, which are metadata used to demarcate
processes. See Figure 3.5 for an overview.
Figure 3.5: Concept map describing tracers
Tracers are tokens associated with interactions that identify the larger process that a
particular interaction belongs to. Tracers are similar to transaction contexts within
transaction processing systems. Just as a transaction context distinguishes a particular
transaction, tracers distinguish or demarcate processes from one another in process
documentation by identifying a set of interactions, typically involving several actors, that
belong to a particular process. Actors can inject, i.e. add, tracers into an interaction’s
metadata. When an actor receives tracer metadata, it can propagate or not propagate
the tracer to subsequent messages that it sends. This is similar to the passing of a
transaction context through the operations involved within a transaction. Injection and
propagation are determined via tracer semantics, which are identified in the token. An
actor can choose whether to make use of tracers. When exposed
interaction metadata contains a tracer, it is known that the documented interaction was
part of the process identified by the tracer. Thus, tracers assist in identifying particular
processes within process documentation.
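Injection and propagation of tracers can be sketched as operations on interaction metadata. This is a hedged illustration: the `Tracer` layout, the two semantics values `propagate` and `local`, and the dictionary representation of metadata and records are assumptions, not the tracer semantics defined by the model.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracer:
    token: str
    semantics: str  # hypothetical values: "propagate" or "local"


def inject(metadata: dict, semantics: str = "propagate") -> dict:
    # An actor adds a fresh tracer to the metadata of an interaction.
    tracer = Tracer(uuid.uuid4().hex, semantics)
    return {**metadata, "tracers": [*metadata.get("tracers", []), tracer]}


def propagate(received: dict, outgoing: dict) -> dict:
    # Carry over only those received tracers whose semantics ask for propagation.
    carried = [t for t in received.get("tracers", []) if t.semantics == "propagate"]
    return {**outgoing, "tracers": [*outgoing.get("tracers", []), *carried]}


def interactions_in_process(records: list, token: str) -> list:
    # Interactions whose exposed metadata contains the tracer belong to its process.
    return [r["interaction"] for r in records
            if any(t.token == token for t in r["metadata"].get("tracers", []))]


first = inject({})               # the initiating actor starts a process
second = propagate(first, {})    # a later message sent by the receiving actor
records = [
    {"interaction": "I1", "metadata": first},
    {"interaction": "I2", "metadata": second},
    {"interaction": "I9", "metadata": {}},  # unrelated interaction
]
process_token = first["tracers"][0].token
print(interactions_in_process(records, process_token))  # ['I1', 'I2']
```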
Given our interaction-centric perspective, the p-structure logically groups together the
two views that document an interaction into what we term an interaction record. The
a causality graph, which represents the provenance of a particular occurrence can be
extracted from the p-structure. Different forms of causality graphs (i.e. where the
vertices and edges of the graph represent different entities) can be extracted from the
p-structure depending upon usage. For example, one form of causality graph could have
all edges representing a causal connection and all vertices being either a cause or effect,
which could be useful in studying the purely causal relationship between occurrences
whereas another form may be useful when looking at the transformations applied. Here,
we use a simple form for illustration purposes.
Figure 3.7: Concept map describing provenance
Figure 3.8 shows the provenance of a numerical result. The nodes in the graph are
occurrences, in the role of causes, effects or both. The edges in the graph are hyperedges3
and represent the causal connections extracted from relationship p-assertions. All arrows
on the edges point from effect to cause. The external causal connections represented by
interaction p-assertions are collapsed into the numbers shown to the bottom right of each
node. These numbers map to the interactions shown in Figure 3.1. Internal information
p-assertions are shown as annotations connected to the interactions by double-arrow
headed lines. To save space, not all of these p-assertions are shown. The relationship
p-assertions shown in Figure 3.6 are found in this figure.
Figure 3.8 evinces the claim that the provenance of an occurrence can be explained at
different levels of abstraction. For example, the edge labelled “is result of mathematical
function caused by” between numerical result and initiator request abstracts the three
nodes and four hyperedges to its left.
3 A hyperedge is an edge that can have any number of vertices.
Figure 3.8: Causal graph describing the provenance of a numerical result (nodes: numerical result, collated sample, sources, initiator request, sample size; interactions I1 to I4 across Institutions 1 to 3)
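The extraction of a causality graph can be sketched as a traversal over relationship p-assertions that starts at an occurrence and follows edges from effect to cause. The sketch below is illustrative only: the `Relationship` layout and the `provenance` function are assumptions, and the example edges are loosely modelled on the labels of Figure 3.8 rather than transcribed from it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Relationship:
    """One hyperedge: a single effect connected to one or more causes."""
    effect: str
    relation: str
    causes: Tuple[str, ...]


def provenance(occurrence: str, relationships: List[Relationship]) -> List[Relationship]:
    """Collect every hyperedge reachable from `occurrence`, moving from effect to cause."""
    by_effect = {}
    for relationship in relationships:
        by_effect.setdefault(relationship.effect, []).append(relationship)

    visited, edges, pending = set(), [], [occurrence]
    while pending:
        current = pending.pop()
        if current in visited:
            continue
        visited.add(current)
        for relationship in by_effect.get(current, []):
            edges.append(relationship)
            pending.extend(relationship.causes)
    return edges


# Hypothetical relationships; the last one is the abstract, higher-level description.
relationships = [
    Relationship("numerical result", "calculated from", ("collated sample",)),
    Relationship("collated sample", "generated from", ("sources",)),
    Relationship("sources", "is caused by", ("initiator request",)),
    Relationship("numerical result", "is result of mathematical function caused by",
                 ("initiator request",)),
]

for edge in provenance("numerical result", relationships):
    print(edge.effect, "--", edge.relation, "->", ", ".join(edge.causes))
```

Because each relationship has exactly one effect, indexing by effect is sufficient to find all causes of an occurrence, and descriptions at different levels of abstraction simply appear as additional edges with the same effect.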
3.4 High-Quality Characteristics Revisited
We now revisit each high-quality characteristic and discuss how each is addressed by the
data model.
Characteristic 1: Factual
This characteristic is the basis of our notion of actors and thus is at the core of our data
model. If application developers follow the data model specification, actors will only
assert what is in their scope and thus process documentation will be factual. Essentially,
the data model is a contract between the creator and the querier. In the case of factuality,
it places an obligation on a creator that it only records data about what it knows to have
occurred. Therefore, queriers can assume that process documentation will be factual and
take action if they find that the contract is broken. The data model helps to enforce
this contract by ensuring that the effect of a relationship p-assertion is in the view with the
relationship p-assertion, hence, the occurrence that is the effect must be documented as
being part of the actor’s scope. If the effect is not found within the same view, then it
is a sign that factuality was not adhered to. Furthermore, a querier can check to see
if the causes in a relationship p-assertion share the same assertor as the relationship
needs to be provided when in process documentation. Furthermore, the p-structure
organises process documentation so that questions pertaining to past processes, such as
determining the provenance of a data item, can be answered easily.
3. A data model for provenance should be well-defined and independent from any one
domain or technology to cater for multiple platforms and programs.
The p-structure is defined at a precise conceptual level supported by concept maps
designed for effective human consumption. We do not define the p-structure in terms of
any one platform or technology and the conceptual model has been instantiated using
three different technologies. Finally, the conceptual model is not tied to any particular
scientific or business domain.
4. Multiple levels of abstraction must be supported to satisfy a range of queries.
The p-structure was specifically designed to cater for multiple levels of abstraction as dis-
cussed in Section 3.3.2.3. One of the purposes of introducing the mapping of applications
into actors is to allow for the nesting of components to be expressed. When defining
relationship p-assertions, structural relations were introduced specifically to cater for
the nesting of data. Likewise, the decision to allow multiple relationship p-assertions to
be created between the same occurrences was introduced so that multiple vocabularies
could be used when describing a transformation that occurred within an actor. Finally,
internal information p-assertions allow details of how internal information was produced
to be abstracted away when it is convenient to do so.
5. The storage of provenance-related information should be separated from its collection
point to ease management and query processing.
The next chapter focusses on the storage of process documentation. However, the def-
inition of the p-structure is key to supporting the separation of the creation of process
documentation and the querying of it. It provides a shared understanding between
asserters of process documentation and queriers of it. Because of the p-structure, a
querier can understand process documentation without having direct knowledge of the
actors who created it. The p-structure’s organisation of p-assertions into views, the pair-
ing of views into interaction records, the introduction of exposed interaction metadata,
the ability to identify each p-assertion uniquely, and the independent representation of
transformations in relationship p-assertions were all designed to allow users of process
documentation to traverse and understand it independently from its creators.
6. Causal dependency tracking is critical for understanding the provenance of data.
Causal dependencies or causal connections are at the heart of the p-structure. Two
types of p-assertions, relationship p-assertions and interaction p-assertions, are designed
to capture causal connections. They follow directly from the definition of causality
given in Chapter 2. Relationship p-assertions allow the functional relationships from
Part 1 of Definition 2.1 to be represented. Likewise, interaction p-assertions allow the
causality of sending a message and another party receiving the message from Part 2 of
Definition 2.1 to be represented. Thus, two of the core components of our data model
are designed to express causality.
From this review, we have shown that the design decisions leading to the p-structure
can be traced back to either the conclusions we obtained from our analysis of related
work or the high-quality characteristics enumerated in Section 3.2.
3.6 Related Work
In Chapter 2, we reviewed a number of provenance systems. Here, we briefly revisit
some of those systems’ data models and distinguish them from the p-structure. A major
difference between the p-structure and other models is that it is defined in terms of a
conceptual model designed for human consumption instead of relying on a computer
parsable syntax or formalism. This allows the model to be instantiated in a variety of
languages and enables both syntactic and conceptual compatibility. For example, process
documentation from two institutions could both be represented by the p-structure in
XML and thus be conceptually and syntactically compatible. However, two institutions
could adopt the p-structure but use different languages (i.e. one uses XML, the other uses
serialized Java objects) and then their process documentation would only be conceptually
compatible.
Unlike the workflow-centric models from MyGrid [200], Kepler [29], REDUX [15], and
Szomszor [178], the p-structure is specifically designed to handle both the workflow
enactment engine’s and service’s view of an interaction. Furthermore, even workflow-
based systems that support the modelling of both views of an interaction, such as Karma
[165], still require the centralised coordination of the workflow enactment engine to
properly demarcate process documentation from different workflow runs. In essence, the
p-structure supports a peer-to-peer topology whereas these other models are designed
for a centralised layout.
Additionally, these workflow-centric models often rely on the workflow definition to
provide the causal connections between service inputs and outputs. Relying on the
workflow definition is not always possible because services are implemented using a
variety of implementation languages. The p-structure, on the other hand, provides
a generic way to express causal connections independent of any particular workflow
definition.
As discussed in Section 2.4.1, support for multiple levels of abstraction is key for scientists
to easily use the process documentation. Many of the provenance systems discussed do
not innately support multiple abstraction levels. The p-structure, however, is designed
from the ground up to support different levels of abstraction through an approach that
enables both high level and low level descriptions of actor functionality to be expressed
at the same time through the decomposition of actors and multiple vocabularies for
relationship p-assertions. While ZOOM [44] supports abstraction of actor functionality,
it does not support the documentation of causal connections in a robust manner because
it relies on time stamps. Furthermore, it does not support a direct mechanism like tracers
to distinguish between processes nor does it support attribution inherently.
The p-structure strikes a balance between being open enough to support a variety of
applications while being structured enough such that generic tools can be reliably built
upon it. This is in contrast to Myers et al.’s approach [140], which advocates a completely
open model with the only structure coming from the underlying syntax of RDF. Because
of this totally open approach, different institutions could have completely different ways
of organising process documentation. We believe that this approach prevents generic
provenance specific tools from operating over a range of process documentation generated
by different institutions. Likewise, process documentation tailored too specifically to one
platform like PASS [156], S [18], CODESH [28] or Trio [188], prevents provenance from
being determined across multiple institutions that use a variety of platforms.
Therefore, the p-structure is a novel data model for use in determining the provenance
of results produced by systems that span multiple institutions.
3.7 Summary
In this chapter, we discussed how a generic, shared data model of process documentation
facilitates the sharing of process documentation between institutions; it allows for the
development of tools that work across domains and applications and for the creation of
future-proof process documentation by independent application components running on
a variety of platforms. We presented a detailed conceptual definition of just such a data
model, the p-structure. The presentation was facilitated by the use of concept maps.
The specification of the p-structure began with a definition of process that embraces the
Service-Oriented Architectural style. Using a simple example, we showed how the p-
structure enables multiple levels of abstraction to be supported and how the provenance
of a data item could be extracted from process documentation following our data model.
After fully specifying the p-structure, we demonstrated that it supports the creation
of high-quality documentation of process as well as the analysis conclusions drawn in
Section 2.5.
The contributions of this chapter were twofold: a detailed description of high-quality
characteristics and the precise conceptual definition of a data model that supports these
characteristics. While other systems may allow the provenance of digital objects to
it has stored may also disappear. Furthermore, actors may not have persistent storage
and thus may not be able to store p-assertions in a permanent manner. Even if an actor
has access to some persistent storage, it may not have enough capacity to keep all the
process documentation for every process that it contributed to.
Finally, if each actor maintains its own p-assertions, then enforcement of access control
across an application’s process documentation becomes challenging since it would require
each actor to track the access control privileges of a myriad of queriers.
Given these difficulties, the approach we adopt utilises provenance stores. First, they allow
process documentation to be stored during execution for multiple physical or digital
objects; second, because they are specialised for the storage of p-assertions, provenance
stores can be built to persistently store large amounts of process documentation and
to deal appropriately with problems of security and access control. We note that a
provenance store is a role. Any actor, as long as it supports the provenance store
interface and designated qualities of service, can be a provenance store. Therefore, in
actual deployments a provenance store can be integrated with other actors.
We now discuss how developers can deploy provenance stores within their application
architectures such that process documentation can be recorded effectively.
4.2 Deployment Patterns
To be able to cope with documentation from a multi-institutional application, prove-
nance stores may need to be distributed since there can be a large quantity of data, in
a large number of p-assertions, recorded by a large number of actors, each with their
own security domain, privacy requirements, etc. The requirement for creating process
documentation in distributed applications, such that all documentation related to their
execution can be retrieved again, presents a developer with several deployment problems.
These include on which computer the provenance store should be located, how many
provenance stores to deploy, and where in the network topology to deploy provenance
stores. To address these problems, a set of deployment patterns is now introduced.
A pattern [7, 6, 80] describes a solution to a common design problem; the solution
described must strike a balance between being concrete enough to be applicable and
abstract enough so that it can be applied to a range of similar problem situations. The
patterns presented here provide reusable solutions that developers can use to integrate
p-assertion recording into their application architectures. The format of these patterns
is as follows:
Context Actors record p-assertions in provenance stores following the SeparateStore
and ContextPassing patterns.
Problem The SeparateStore and ContextPassing patterns may lead developers to believe
that for every application actor, there is a corresponding provenance store. However,
developers may not want to deploy a provenance store for every application actor,
especially when the number of application actors is large. Also, in order to retrieve
the provenance of a result, each provenance store must be contacted, resulting in slower
query performance.
Solution Application actors are allowed to record p-assertions into a shared provenance
store. The SharedStore pattern clarifies the way in which SeparateStore and
ContextPassing can be applied. Both SeparateStore and ContextPassing are agnostic
as to which provenance store an actor may use to record its p-assertions. SharedStore
emphasises that actors can record their p-assertions in any store they choose and that
provenance stores may hold p-assertions from multiple actors. It does not prescribe how
many stores there should be or which provenance stores should be shared; these decisions
are left to the developer applying the pattern. SharedStore thus allows developers to
determine the distribution of provenance stores that fits their application.
4.2.4 Pattern Application
The patterns that we have introduced show how p-assertions can be recorded in prove-
nance stores by actors. The documentation of process can be recorded for an entire
system by applying a selection of these patterns to every actor and every interaction in
a system. We now show how these patterns can be applied using the simple example
introduced on page 43 and shown again in Figure 4.4.
First, the SeparateStore pattern is applied so that each actor can record p-assertions
into a provenance store. The application of this pattern is depicted in Figure 4.5.
Second, the SharedStore pattern is applied. Using this pattern, it is decided that the
best deployment of provenance stores, in this case, is to have the Mathematical Function
and Collate Sample actors share a common provenance store. Figure 4.6 shows the
application of the SharedStore pattern to our simple example application.
Finally, to ensure that interaction keys are passed between actors and process documen-
tation is connected, the ContextPassing pattern is applied as shown in Figure 4.7. Thus,
all three patterns have been applied in order to appropriately record p-assertions.
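The combined effect of the three patterns can be illustrated with a minimal sketch. It is not the thesis's implementation: the class `ProvenanceStore`, the functions `deploy` and `interact`, and the assignment of actors to particular stores are all assumptions made for illustration, as is the choice of which actor sends the initiator request.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceStore:
    """Hypothetical in-memory stand-in for a provenance store."""
    name: str
    p_assertions: list = field(default_factory=list)

    def record(self, asserter, interaction_key, content):
        # Each p-assertion keeps its asserter and its interaction key.
        self.p_assertions.append((asserter, interaction_key, content))


def deploy():
    """SeparateStore plus SharedStore: map every actor to the store it uses.

    Which store each actor uses is an assumption; the text only states that
    Mathematical Function and Collate Sample share one store.
    """
    s1, s2, s3 = (ProvenanceStore(f"Provenance Store {i}") for i in (1, 2, 3))
    return {
        "Client": s1,
        "Initiator": s2,
        "Mathematical Function": s3,
        "Collate Sample": s3,
    }


def interact(deployment, sender, receiver, interaction_key, payload):
    """ContextPassing: the interaction key travels inside the application message."""
    message = {"interaction_key": interaction_key, "payload": payload}
    key = message["interaction_key"]
    # Sender and receiver each record their own view under the same key.
    deployment[sender].record(sender, key, f"sent {payload}")
    deployment[receiver].record(receiver, key, f"received {payload}")
    return message
```

Because both parties record under the key carried by the message, a querier can later match the two views of an interaction even though they sit in different stores.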
These recording patterns allow for the flexible deployment of provenance stores. We
also conjecture but do not show that these patterns aid scalability by allowing multi-
ple provenance stores to be deployed for an application. The patterns can be applied
to any number of interacting actors using any number of provenance stores to record
p-assertions. These distribution patterns, however, do not mandate the number of
provenance stores that must be used in a given application, nor the way they must be shared;
it is left to the application developer to make those decisions.
[Figure 4.4: A simple example application. Actors: Client, Initiator, Mathematical Function, Collate Sample. Interactions: I1: initiator request; I2: collate sample request; I3: collate sample response; I4: numerical result.]
[Figure 4.5: The SeparateStore pattern applied. Actors as in Figure 4.4, with Provenance Stores 1 to 4.]
[Figure 4.6: The SharedStore pattern applied. Actors as in Figure 4.4, with Provenance Stores 1 to 3.]
[Figure 4.7: The ContextPassing pattern applied. Actors as in Figure 4.4, with Provenance Stores 1 to 3 and labels 1 to 4.]
4.3 Connecting Distributed Documentation
One of the consequences of applications that span multiple institutions and the use
of the above deployment patterns is that process documentation need not be centrally
located and can reside across various locations. There are several benefits to this: the
elimination of a central point of failure, the spreading of demand across multiple services
and the ability for provenance stores to exist in different network areas (for example,
one provenance store may be behind a firewall whereas another is not). In general,
allowing p-assertions to be recorded across multiple locations increases the flexibility
and scalability of systems recording p-assertions.
However, to retrieve the provenance of a result, distributed process documentation must
be connected. The technique of linking, discussed below, makes it possible to connect
documentation that is distributed in this way.
Given that the p-assertions documenting a given execution may be spread across multiple
stores, there must be some mechanism to retrieve these p-assertions in order to validate,
visualise or replay the represented process. To facilitate such a retrieval mechanism, we
introduce the notion of a link defined as follows.
Definition 4.1. A link is a pointer to a provenance store.
We note that links are necessarily unidirectional: a link always points to a remote
provenance store location. Links are used in two instances, which we now describe.
4.3.1 View Links
The first use of a link deals with the situation where a sender’s view of an interaction
and a receiver’s view of the same interaction as identified by a shared interaction key
are stored in two different provenance stores. It is necessary for each actor to record a
link, which we refer to as a View Link, that points to the provenance store where the
opposite party recorded their p-assertions. Thus, the sender in an interaction records a
link to the provenance store that the receiver used to record p-assertions for the given
interaction, and vice-versa. This allows querying actors to navigate from one provenance
store to the other in order to retrieve both views of an interaction. We note that View
Links point to provenance stores only, not to particular pieces of data in a provenance
store; the actual data of interest can be found by a local search of the provenance store.
Contents of PA
interaction key | p-assertion type     | p-assertion content
1               | interaction          | M1
2               | interaction          | M2
2               | internal information | View Link to PB
2               | relationship         | 2 is related to 1, Cause Link to PR

Contents of PB
interaction key | p-assertion type     | p-assertion content
2               | interaction          | M2, with View Link to PA

Figure 4.9: Contents of provenance stores
Links provide a solution to the problem of connecting distributed process documentation.
Similar to the Web, the unidirectional nature of links avoids the problem of having to
synchronise between provenance stores when recording a link. Instead, each actor is
responsible for recording a link just as each web page author is responsible for adding
links to other pages as appropriate. Creating links is lightweight; the information needed
to establish a link is minimal. Furthermore, the link structure provides a structured and
simple mechanism for querying actors to traverse provenance stores hosted by multiple
institutions.
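As an illustration of how a querying actor might traverse View Links, the following sketch gathers both views of an interaction starting from one store. The `Store` class, the tuple layout of entries, and the function `collect_views` are hypothetical; the example data loosely mirrors the View Link rows of Figure 4.9 and leaves out the relationship row.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical provenance store holding simplified p-assertion rows."""
    name: str
    # Each entry: (interaction_key, p_assertion_type, content, view_link_or_None)
    entries: list = field(default_factory=list)


def collect_views(stores, start, interaction_key):
    """Follow View Links from `start`, searching each store locally by key."""
    found, visited, pending = [], set(), [start]
    while pending:
        name = pending.pop()
        if name in visited:
            continue  # links are unidirectional, but two stores may point at each other
        visited.add(name)
        for key, kind, content, link in stores[name].entries:
            if key != interaction_key:
                continue
            found.append((name, kind, content))
            if link is not None:
                # A link names a store only; the data is found by local search there.
                pending.append(link)
    return found


stores = {
    "PA": Store("PA", [
        (1, "interaction", "M1", None),
        (2, "interaction", "M2", None),
        (2, "internal information", "View Link to PB", "PB"),
    ]),
    "PB": Store("PB", [
        (2, "interaction", "M2", "PA"),
    ]),
}
views = collect_views(stores, "PA", 2)
```

The `visited` set is a design choice of this sketch: since each party records a link to the other's store, a naive traversal would otherwise bounce between the two stores indefinitely.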
So far, we have discussed the high level concept of recording process documentation
into provenance stores and how this documentation can be connected so that it can be
queried in a distributed environment. We now present a protocol by which actors can
record their p-assertions into provenance stores.
4.4 PReP: The P-assertion Recording Protocol
The P-assertion Recording Protocol (PReP) defines the communication between actors
and the expected behaviour of those actors when recording p-assertions. In this case,
PReP has the following benefits:
1. It ensures that the data residing in the provenance store is compatible with the
p-structure.
2. It provides a well-defined interface for actors to record p-assertions.
3. It guarantees certain beneficial properties in terms of both the data being recorded
and its operation.
In the previous chapter, we enumerated the following characteristics that high quality
process documentation should possess: factual, attributable, autonomously creatable,
process oriented, immutable, finalizable. PReP supports these characteristics by enforc-
ing several associated properties, which are discussed below.
4.4.1 Properties of PReP
Below are six properties that PReP enforces to help ensure that the process documentation
within provenance stores is of high quality. Essentially, each characteristic maps
to a property, which can then be enforced.
1. As discussed in Chapter 3, a p-assertion’s semantics dictates that it is always
interpreted as being factual (i.e. about an occurrence within the scope of the actor
creating the p-assertion). To ensure factual process documentation is contained
within the provenance store, the datatype safety property is introduced, which
guarantees that the protocol only allows p-assertions to be recorded. This property
prevents untyped information from entering the provenance store, hence, factual
process documentation can be more readily enforced.
2. Support for attribution is a fundamental part of the p-structure data model; an
asserter identity is included in each p-assertion. It can also be used for access
control on the provenance store. Thus, it is important to ensure that this identity
is preserved during the recording of the p-assertion into the provenance store.
To provide this assurance, PReP enforces the identity preserving property, which
states that the asserter identity will not be changed during or after recording the
p-assertion.
3. Within the p-structure, interaction keys are vital for supporting the autonomous
creation of process documentation. To ensure that interaction keys are created and
passed between actors correctly, Section 3.3.3 introduced three actor behaviour
rules. Here, we show that the protocol has the property of being actor behaviour
compliant, i.e. that the protocol enforces these three rules.
4. We introduce the property of process reflection that supports the characteristic of
process orientation. It is defined as: eventually an application’s execution will be
described by process documentation recorded in provenance stores. This means
that after the application has completely recorded the documentation of its execu-
tion in a set of provenance stores, a querier will be able to retrieve the provenance
of any result produced by the distributed system.
5. In distributed systems, a safety property is one that states something will not
happen [110]. In the context of PReP, we introduce the following safety property:
no p-assertion is erased, overwritten or modified once recorded in a provenance
store. This particular property is important because it supports the immutable
characteristic, which as discussed earlier, is vital for queriers to have confidence in
process documentation.
6. To support finalizable process documentation, we introduce the concept of com-
pleteness, which is that an object has all its constituents. In the case of PReP, we
and is responsible for it.
The submission finished message is similar to the record p-assertion message, except
that the p-assertion parameter is replaced with an integer representing how many p-
assertions a provenance store should receive in total from an actor for an interaction.
By knowing how many p-assertions should be recorded, a provenance store can determine
when an actor has finished recording p-assertions in the context of a particular View,
which is used to determine when it is complete. Because of the asynchronous nature of
the protocol, the submission finished message, like any other message, can be sent at any
time. We note that in most cases an actor will send this message after it has recorded
all its p-assertions. Thus, the submission finished message does not imply that an actor
guesses how many p-assertions it will create and record for a particular interaction. Instead,
the ability to send the message at any time preserves the asynchronicity of the protocol.
For example, consider an actor that has already created all its p-assertions: because of
PReP’s design, the actor can record all the p-assertions and declare that its submission
is finished in parallel.
The last kind of message exchanged by recorders and provenance stores is the acknowl-
edgement message. Each message received by a provenance store is acknowledged by
an acknowledgement message, which contains the global p-assertion key contained in
the message being acknowledged. There is some computation time in processing record
messages and storing their contents. Therefore, acknowledgement messages allow actors
to track whether their p-assertions have been stored within the provenance store. This
is useful when the actor wishes to notify other actors that it has completed recording.
Furthermore, the provenance store can use the acknowledgement message to return error
messages or other implementation-specific information. Thus, the acknowledgement
message is not used for guaranteeing message delivery or flow control but instead for
the recorder to track the state of the provenance store.
We now describe the dependencies between the messages defined above. Due to the
asynchronous nature of the protocol, the dependencies are minimal. They are as follows:
• For any application message in a given interaction, a record p-assertion or submis-
sion finished message about that interaction must contain the same interaction
key as the application message.
• Acknowledgement messages must be sent after the receipt of the message that is
being acknowledged.
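The message kinds and the first dependency can be sketched as plain data types. The field layout follows the signatures given for app, rec, sf and ack in Figure 4.11; the class names, the helper functions `documents` and `acknowledge`, and the use of strings for identities are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import NamedTuple


class InteractionKey(NamedTuple):
    """IK = Senders x Receivers x N."""
    sender: str
    receiver: str
    counter: int


@dataclass(frozen=True)
class App:
    key: InteractionKey
    data: str


@dataclass(frozen=True)
class Rec:
    key: InteractionKey
    role: str          # "S" (sender view) or "R" (receiver view)
    asserter: str
    lpid: str
    p_assertion: str


@dataclass(frozen=True)
class SubmissionFinished:
    key: InteractionKey
    role: str
    asserter: str
    lpid: str
    total: int         # number of p-assertions the store should receive in total

    def __post_init__(self):
        if self.total < 1:
            raise ValueError("total must be a positive integer")


@dataclass(frozen=True)
class Ack:
    key: InteractionKey
    role: str
    lpid: str


def documents(app, message):
    """First dependency: the message carries the application message's key."""
    return message.key == app.key


def acknowledge(message):
    """Second dependency: an acknowledgement is built from a received message."""
    return Ack(message.key, message.role, message.lpid)
```

Making the types immutable is a choice of the sketch, not a requirement stated in the text; it simply keeps a message from changing between being sent and being acknowledged.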
4.4.3 PReP’s Behavioural Constraints
The set of messages and their dependencies impose some behavioural constraints on the
roles of sender, receiver, recorder, and provenance store. We now make such behaviour
explicit. Our intent here is to give an intuitive description of the required behaviour of
these actors and then use the formalisation that follows to give a precise definition of
that behaviour. We enumerate these behaviour rules below. The first three of these
rules are the behaviour rules defined in Section 3.3.3.
1. (Unique Interaction Key Rule) A sender must generate a globally unique interac-
tion key for every new interaction and assign it to that interaction.
2. (Interaction Key Transmission Rule) A sender must send the interaction key to
the receiver by including it within the application message being sent.
3. (Appropriate Interaction Rule) Both senders and receivers must use the interaction
key associated with an interaction, I, when asserting p-assertions about I.
4. A recorder must keep track of the messages it has sent to a provenance store for a
particular interaction until the acknowledgements are received for them.
5. The provenance store must be waiting to receive messages and when it receives
a message, it must process its content, store it and return the appropriate ac-
knowledgement message. It is necessary that the provenance store determines how
many p-assertions it has received from a particular actor for a given interaction
and compare that to the number of p-assertions the recorder has declared so that
it can determine if process documentation in a given View is complete. If doc-
umentation is marked as complete for an interaction, the provenance store must
prevent any additional p-assertions from being recorded. Likewise, it has to detect
attempts to overwrite previously stored p-assertions and respond with an acknowledgement
message. If a p-assertion is received by a provenance store and its LPID (local
p-assertion id) has already been used, then the provenance store discards the p-assertion and the
same acknowledgement message is returned. This means that once an actor has
recorded a p-assertion, it cannot override that assertion.
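The provenance store behaviour in rule 5 can be sketched as follows. This is an informal illustration under simplifying assumptions, not the formal model: the class and method names are hypothetical, acknowledgements are plain tuples, and a View is reduced to a declared total plus a map from LPID to p-assertion.

```python
class ProvenanceStore:
    """Hypothetical store that enforces immutability and tracks completeness."""

    def __init__(self):
        # (interaction_key, role) -> {"declared": int or None, "recs": {lpid: p_assertion}}
        self.views = {}

    def _view(self, key, role):
        return self.views.setdefault((key, role), {"declared": None, "recs": {}})

    def is_complete(self, key, role):
        view = self._view(key, role)
        return view["declared"] is not None and len(view["recs"]) == view["declared"]

    def on_record(self, key, role, lpid, p_assertion):
        view = self._view(key, role)
        # Discard if the LPID was already used or the View is already complete.
        if lpid not in view["recs"] and not self.is_complete(key, role):
            view["recs"][lpid] = p_assertion
        # The same acknowledgement is returned whether or not anything was stored.
        return ("ack", key, role, lpid)

    def on_submission_finished(self, key, role, lpid, total):
        view = self._view(key, role)
        if view["declared"] is None:
            view["declared"] = total
        return ("ack", key, role, lpid)
```

Since the submission finished message may arrive before, between, or after the record messages, completeness is computed from the current counts rather than triggered by message order.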
We now present a formal model of PReP.
4.4.4 A Formal Model
To show that PReP satisfies the properties listed in Section 4.4.1, we now present a
formalisation of PReP in terms of the behaviour of the actors involved in the protocol
and the messages used. We have chosen to model PReP as an abstract state machine
(ASM) because it provides a precise, implementation-independent means of describing
the protocol. The ASM notation we adopt has been used previously to describe a dis-
tributed reference counting algorithm [133] and a fault-tolerant directory service for
mobile agents [131]. The abstract machine characterizes the behaviour of actors with
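\[
\begin{array}{lcll}
A & = & \{a_1, a_2, \ldots, a_n\} & \text{(Set of Actor Identities)} \\
\mathit{Senders} & \subseteq & A & \text{(Set of Sender Identities)} \\
\mathit{Receivers} & \subseteq & A & \text{(Set of Receiver Identities)} \\
\mathit{PS} & \subseteq & A & \text{(Set of Provenance Store Identities)} \\
\mathit{REL} & = & \{r_1, r_2, \ldots, r_n\} & \text{(Set of Business Logic Descriptions)} \\
\text{P-Assertion} & = & \{\alpha_1, \alpha_2, \ldots\} & \text{(Set of P-Assertions)} \\
M & = & \mathit{app} : \mathit{IK} \times \mathit{Data} \to M & \text{(Set of Messages)} \\
  & \mid & \mathit{rec} : \mathit{IK} \times \mathit{RI} \times A \times \mathit{LPID} \times \text{P-Assertion} \to M & \\
  & \mid & \mathit{sf} : \mathit{IK} \times \mathit{RI} \times A \times \mathit{LPID} \times \mathbb{N}^{+} \to M & \\
  & \mid & \mathit{ack} : \mathit{IK} \times \mathit{RI} \times \mathit{LPID} \to M & \\
\mathit{SF} & = & \{m \in M \mid m = \mathit{sf}(\kappa, v, \iota, \mathit{na})\} & \text{(Set of Submission Finished Messages)} \\
R & = & \{m \in M \mid m = \mathit{rec}(\kappa, v, \iota, \mathit{lpid}, \alpha)\} & \text{(Set of Record Messages)} \\
\mathit{IK} & = & \mathit{Senders} \times \mathit{Receivers} \times \mathbb{N} & \text{(Set of Interaction Keys)} \\
\mathit{RI} & = & \{S, R\} & \text{(Set of Role Identifiers)} \\
V & = & \mathit{IK} \times \mathit{RI} \to \mathit{SF}_{\bot} \times \mathbb{P}(R) \times \mathbb{P}(\mathit{LPID}) & \text{(Set of Views)} \\
\mathit{PSS} & = & A \to V & \text{(Set of Provenance Stores)} \\
\mathit{TO\_SEND} & = & A \to \mathit{IK} \to \mathit{Bag}(M) & \text{(Set of Messages To Send Tables)} \\
\mathit{SENT} & = & A \to \mathit{IK} \to \mathit{Bag}(M) & \text{(Set of Sent Messages Tables)} \\
\mathit{ACK} & = & A \to \mathit{IK} \to \mathit{Bag}(M) & \text{(Set of Acknowledged Messages Tables)} \\
\mathit{ASSERT} & = & A \to \mathit{IK} \times \mathit{RI} \to \mathit{Bag}(\text{P-Assertion}) & \text{(Set of p-assertions to be recorded)} \\
\mathit{LPID\_MAP} & = & A \to \mathit{IK} \times \mathit{RI} \to \mathbb{P}(\mathit{LPID}) & \text{(Map from actor to local p-assertion ids)} \\
\mathit{LC} & = & \mathit{Senders} \to \mathbb{N} & \text{(Set of Local Counters)} \\
K & = & A \times A \to \mathit{Bag}(M) & \text{(Set of Channels)} \\
C & = & \mathit{PSS} \times K \times \mathit{TO\_SEND} \times \mathit{SENT} \times \mathit{ACK} \times \mathit{ASSERT} \times \mathit{LPID\_MAP} \times \mathit{LC} & \text{(Set of Configurations)}
\end{array}
\]
Characteristic Variables:
\[
\begin{array}{lll}
a \in A & a_s \in \mathit{Senders} & a_r \in \mathit{Receivers} \\
a_{ps} \in \mathit{PS} & r \in \mathit{REL} & m \in M \\
d \in \mathit{Data} & \alpha \in \text{P-Assertion} & \kappa \in \mathit{IK} \\
v \in \mathit{RI} & \mathit{na} \in \mathbb{N}^{+} & \mathit{lpid} \in \mathit{LPID} \\
k \in K & \mathit{lpids} \in \mathbb{P}(\mathit{LPID}) & \mathit{recs} \in \mathbb{P}(R) \\
\mathit{store\_T} \in \mathit{PSS} & \mathit{to\_send\_T} \in \mathit{TO\_SEND} & \mathit{sent\_T} \in \mathit{SENT} \\
\mathit{ack\_T} \in \mathit{ACK} & \mathit{assert\_T} \in \mathit{ASSERT} & \mathit{lpid\_T} \in \mathit{LPID\_MAP} \\
\mathit{lc} \in \mathit{LC} & c \in C &
\end{array}
\]
Initial State / Configuration:
\[
c_i = \langle \mathit{store\_T}_i, k_i, \mathit{to\_send\_T}_i, \mathit{sent\_T}_i, \mathit{ack\_T}_i, \mathit{assert\_T}_i, \mathit{lpid\_T}_i, \mathit{lc}_i \rangle
\]
where:
\[
\begin{array}{ll}
\mathit{store\_T}_i = \lambda a \lambda \kappa v \cdot \langle \bot, \emptyset, \emptyset \rangle, & k_i = \lambda a_i a_j \cdot \emptyset, \\
\mathit{to\_send\_T}_i = \lambda a_i \kappa_i \cdot \emptyset, & \mathit{sent\_T}_i = \lambda a_i \kappa_i \cdot \emptyset, \\
\mathit{ack\_T}_i = \lambda a_i \kappa_i \cdot \emptyset, & \mathit{assert\_T}_i = \lambda a_i \kappa_i v_i \cdot \emptyset, \\
\mathit{lpid\_T}_i = \lambda a_i \kappa_i v_i \cdot \emptyset, & \mathit{lc}_i = \lambda a_i \cdot 0
\end{array}
\]
Figure 4.11: State Space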
The machine proceeds from this initial state through its execution by going through
transitions that lead to new states. These transitions are defined by the rules of the
state machine discussed in the next section.
When describing the execution of a state machine, we use the following notation and
definitions.
• A transition is the application of a rule to one configuration to achieve another
configuration.
• A reachable configuration is a configuration of the ASM that can be reached by
transitions from the initial configuration.
• ⟼ denotes a transition.
• c ⟼∗ c′ denotes any number of transitions from a configuration c to another
configuration c′.
We now discuss the specific rules of the ASM.
4.4.4.2 State Machine Rules
The permissible transitions in the ASM are described through rules, which are repre-
sented using the following notation.
rule_name(v1, v2, ...) :
    condition1(v1, v2, ...)
    ∧ condition2(v1, v2, ...) ∧ ...
→ {
    pseudo_statement1;
    ...
    pseudo_statementn;
}
Rules are identified by their name and a number of parameters that the rule operates
over. A rule can fire only when all of its conditions are met.
The execution of a rule is a transition of the state machine
and is atomic in order to maintain its consistency. A new state is achieved after applying
all the rule’s pseudo-statements to the state that met the rule’s conditions.
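The guarded, atomic firing of a rule can be illustrated with a small sketch. The rule shown, `send_pending`, is hypothetical and is not one of PReP's rules; the configuration is reduced to two bags (Python `Counter` objects) standing in for a to-send table and the channels, and all function names are assumptions.

```python
import copy
from collections import Counter


def fire(rule, config, *args):
    """Apply a rule atomically: all pseudo-statements take effect, or none do."""
    condition, statements = rule
    if not condition(config, *args):
        return config, False          # conditions not met: no transition
    new_config = copy.deepcopy(config)
    for statement in statements:
        statement(new_config, *args)  # statements transform the new state only
    return new_config, True


def can_send(config, a1, a2, m):
    # Condition: actor a1 has message m waiting in its to-send table.
    return config["to_send"][(a1, m)] > 0


def take_from_table(config, a1, a2, m):
    config["to_send"][(a1, m)] -= 1


def put_on_channel(config, a1, a2, m):
    config["channels"][(a1, a2, m)] += 1


# A rule is a condition paired with an ordered list of pseudo-statements.
send_pending = (can_send, [take_from_table, put_on_channel])

initial = {"to_send": Counter({("a1", "m"): 1}), "channels": Counter()}
after, fired = fire(send_pending, initial, "a1", "a2", "m")
```

Working on a copy of the configuration means the configuration that met the conditions is left untouched, which mirrors the idea that a transition leads from one configuration to a new one.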
We use send, receive and table update pseudo-statements. Informally, send(a1, a2, m)
inserts a message m into the channel from actor a1 to actor a2, and receive(a1, a2, m)
removes the message. Likewise, the table update operation puts a message into a table.
The notation table T is used to refer to any table in the state space. Formally, these
pseudo-statements act as state transformers and are defined as follows.