An Architecture for Provenance Systems
Contract (2006)
Available from eprints.soton.ac.uk
Abstract
This document covers the logical and process architectures of provenance systems. The logical architecture identifies key roles and their interactions, whereas the process architecture discusses distribution and security. A fundamental aspect of our presentation is its technology-independent nature, which makes it reusable: the principles that are exposed in this document may be applied to different technologies.
PROVENANCE
Enabling and Supporting Provenance in Grids for Complex Problems
Contract Number: 511085
An Architecture for Provenance Systems
Authors: Paul Groth
Sheng Jiang
Simon Miles
Steve Munroe
Victor Tan
Sofia Tsasakou
Luc Moreau
Reviewers: All project partners
Identifier: D3.1.1 (Final Architecture)
Type: Deliverable
Version: 0.6
Date: November 29, 2006
Status: public
Copyright © 2005, 2006 by the PROVENANCE consortium
The PROVENANCE project receives research funding from the European Commission’s Sixth Framework Programme
Executive Summary
Provenance Definition
According to the Oxford English Dictionary, provenance is defined as (i) the fact of
coming from some particular source or quarter; origin, derivation. (ii) the history or
pedigree of a work of art, manuscript, rare book, etc.; concr., a record of the ultimate
derivation and passage of an item through its various owners.
Provenance is already well understood in the study of fine art where it refers to
the trusted, documented history of some art object. Given that documented history,
the object attains an authority that allows scholars to understand and appreciate its
importance and context relative to other works. Art objects that lack a trusted,
proven history may be treated with some scepticism by those who study and view
them. The same concept of provenance may also be applied to data and information
generated within computer systems. One of our primary objectives is therefore
to define a representation of provenance that is suitable for computer systems, and the
necessary architecture to make use of such a representation. Hence, in this context, we
define the provenance of a piece of data as the process that led to that piece of data.
Computational Provenance
Generally, in computer systems, applications produce data. Our vision is to
transform applications into so-called provenance-aware applications, so that when they
run, they produce a description of their execution. Such descriptions, which we refer to
as process documentation, are stored in a provenance store, which is a repository for the
storage and management of process documentation. The provenance store also provides
querying facilities to enable services to retrieve the provenance of data items. In
support of this vision we have designed a provenance architecture, including suitable
data models and the necessary underpinning functionality, with concerns for
scalability and security.
The development of the architecture has been strongly influenced by the
service-oriented architectural style, according to which services or actors interact with
each other by exchanging messages. By enabling actors to make execution-related
assertions, or p-assertions, we ensure that necessary and sufficient forms of process
documentation are captured to be able to give a complete account of any data item’s
provenance. For example, the p-assertion model allows us to document various aspects of
execution, and thus provide descriptions of those parts of an execution that relate to,
or impact upon, a given data item. This allows a user to determine the data item’s
relationships to other data items and processes, such as its dependencies or causal
effects and, at the same time, provides a description of the data flow through an
application.
The p-assertions within a provenance store are organised in a conceptual structure,
called the p-structure, based around interaction records, each of which is a collection
of p-assertions that relate to a single interaction (i.e. an individual message exchange).
The p-structure provides a hierarchical view of process documentation that facilitates
the retrieval of p-assertions, independently of the actual technology used in a given
application.
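The grouping of p-assertions into interaction records can be pictured with a minimal Python sketch. The class and field names (`PAssertion`, `InteractionRecord`, `PStructure`, `interaction_key`, `asserter`, `kind`) are illustrative assumptions and not the data model defined by the architecture. The only properties taken from the text are that each p-assertion relates to a single interaction and that an interaction record collects all p-assertions about that interaction.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PAssertion:
    """One assertion made by an actor about its execution (hypothetical fields)."""
    interaction_key: str   # identifies a single message exchange
    asserter: str          # actor that made the assertion
    kind: str              # e.g. "interaction", "actor-state", "relationship"
    content: str


@dataclass
class InteractionRecord:
    """All p-assertions that relate to one interaction."""
    interaction_key: str
    assertions: list = field(default_factory=list)


class PStructure:
    """Hierarchical view: interaction key -> interaction record -> p-assertions."""

    def __init__(self):
        self._records = {}

    def add(self, assertion: PAssertion) -> None:
        # Create the interaction record on first use, then append to it.
        record = self._records.setdefault(
            assertion.interaction_key,
            InteractionRecord(assertion.interaction_key),
        )
        record.assertions.append(assertion)

    def record_for(self, interaction_key: str) -> InteractionRecord:
        return self._records[interaction_key]

    def by_asserter(self, interaction_key: str) -> dict:
        """Group one interaction's p-assertions by the actor that asserted them."""
        grouped = defaultdict(list)
        for assertion in self.record_for(interaction_key).assertions:
            grouped[assertion.asserter].append(assertion)
        return dict(grouped)


structure = PStructure()
structure.add(PAssertion("msg-1", "client", "interaction", "sent request"))
structure.add(PAssertion("msg-1", "service", "interaction", "received request"))
structure.add(PAssertion("msg-2", "service", "interaction", "sent response"))
views = structure.by_asserter("msg-1")
```

Because retrieval goes through the interaction key rather than through any storage detail, the same navigation works whatever technology holds the records, which is the point the paragraph above makes.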
Provenance Functionality
From a functional perspective, the provenance store supports two operations:
recording p-assertions and querying p-assertions.
In order to record p-assertions, the architecture offers a recording interface based
on the p-assertion recording protocol (PReP). PReP is designed to be stateless to allow
for asynchronous and out-of-order recording by actors. Furthermore, the provenance
store’s behaviour is specified to ensure that p-assertions are never modified or
deleted once recorded, preserving documentation in its original form, thus reflecting
execution as it was originally documented.
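The store behaviour described above can be sketched as follows. This is not the PReP message format or the project's interface: the class `ProvenanceStore`, the method names and the `(interaction key, asserter, local id)` identifier are assumptions made for illustration. The sketch only shows the three properties named in the text: each recording call is self-contained (stateless), calls may arrive in any order, and a recorded p-assertion cannot be changed or deleted.

```python
class ProvenanceStore:
    """Append-only store keyed by (interaction key, asserter, local id)."""

    def __init__(self):
        self._assertions = {}

    def record(self, interaction_key, asserter, local_id, content):
        """Store one p-assertion.

        Every call carries its full identifier, so the store keeps no
        per-actor session state and arrival order does not matter.
        """
        key = (interaction_key, asserter, local_id)
        existing = self._assertions.get(key)
        if existing is not None and existing != content:
            # Assumption: re-sending identical content is harmless,
            # but changing recorded content is refused.
            raise ValueError(f"p-assertion {key} already recorded; it cannot be changed")
        self._assertions[key] = content

    def assertions_for(self, interaction_key):
        """Return the p-assertions of one interaction, ordered by identifier."""
        return sorted(
            (key, content)
            for key, content in self._assertions.items()
            if key[0] == interaction_key
        )


store = ProvenanceStore()
# Recorded out of order: local id 2 arrives before local id 1.
store.record("msg-1", "service", 2, "returned result")
store.record("msg-1", "service", 1, "received request")

try:
    store.record("msg-1", "service", 1, "tampered")
    modified = True
except ValueError:
    modified = False
```

The class deliberately offers no update or delete method, so the only way to change the store's contents is to add a new p-assertion.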
Once recorded, documentation is available for third parties to obtain the
provenance of data items. This is achieved via a process documentation query interface
for the retrieval of p-assertions and their contents, and a provenance query interface
for the retrieval of a data item’s provenance. Querying the provenance of a given data
item involves identifying the data item at a specific point during execution, and
scoping the process of interest to filter causal and functional relationships. The output
of a query is a collection of p-assertions representing a portion of the data flow graph,
which allows a user to understand the provenance of the data item in question up to the
specified point in execution.
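A provenance query of this kind can be sketched as a backward walk over relationship p-assertions. The tuple layout `(effect, relation, cause)`, the relation names and the function `provenance_query` are hypothetical and are not the query interface defined by the architecture. For simplicity the data item is identified by a plain name rather than by its position in an execution.

```python
from collections import deque

# Each tuple is a hypothetical relationship p-assertion: (effect, relation, cause).
RELATIONSHIPS = [
    ("report", "derived-from", "statistics"),
    ("statistics", "derived-from", "raw-data"),
    ("statistics", "configured-by", "settings"),
    ("raw-data", "derived-from", "sensor-reading"),
    ("unrelated-output", "derived-from", "raw-data"),
]


def provenance_query(start, relationships, accept=lambda relation: True):
    """Walk backwards from `start`, following only accepted relations.

    Returns the relationship p-assertions that were traversed, i.e. the
    portion of the data flow graph that led to `start`.
    """
    result, seen, pending = [], {start}, deque([start])
    while pending:
        item = pending.popleft()
        for effect, relation, cause in relationships:
            if effect == item and accept(relation):
                result.append((effect, relation, cause))
                if cause not in seen:
                    seen.add(cause)
                    pending.append(cause)
    return result


# Unscoped query: every relationship that led to "report".
full = provenance_query("report", RELATIONSHIPS)

# Scoped query: only follow derivations, ignoring configuration.
scoped = provenance_query(
    "report", RELATIONSHIPS, accept=lambda relation: relation == "derived-from"
)
```

The `accept` filter plays the role of scoping: it decides which relationships are of interest, so the same documentation can answer both a broad and a narrow question.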
Non-Functional Considerations
In terms of non-functional requirements, a provenance architecture must address three
important considerations: scalability, security and management.
For many applications, extremely large amounts of process documentation can
potentially be captured. This presents problems for recording, querying, management
and storage of such information. Consequently, there is a need to deal explicitly with
such scalability issues and, since the applications that record provenance may be
distributed and large scale, the sheer quantity of recorded p-assertions requires a
scalable means of storing them. To achieve this, the architecture enables several
recording patterns that provide flexible ways for recording actors to record
p-assertions. For example, one pattern allows different actors to record p-assertions in
different stores, even if they refer to the same interaction. Because the documentation
of a single process may end up being recorded in several provenance stores, in order to
collect all the p-assertions about a process, it is necessary to provide directional view
links to these provenance stores, where other parts of the documentation may be found.
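The role of view links can be illustrated with a small sketch in which two actors record their views of the same interaction in different stores. The `Store` class, its attributes and the `collect` function are assumptions made for this example. In particular, holding a direct object reference to the other store stands in for whatever addressing scheme a deployment would use.

```python
class Store:
    """A provenance store holding p-assertions and view links (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.assertions = {}   # interaction key -> list of p-assertion contents
        self.view_links = {}   # interaction key -> store holding the other view

    def record(self, interaction_key, content, view_link=None):
        self.assertions.setdefault(interaction_key, []).append(content)
        if view_link is not None:
            self.view_links[interaction_key] = view_link


def collect(interaction_key, start):
    """Gather every p-assertion about one interaction by following view links."""
    found, visited, pending = [], set(), [start]
    while pending:
        store = pending.pop()
        if store.name in visited:
            continue  # links may point back to a store already visited
        visited.add(store.name)
        found.extend(store.assertions.get(interaction_key, []))
        linked = store.view_links.get(interaction_key)
        if linked is not None:
            pending.append(linked)
    return found


sender_store = Store("store-A")
receiver_store = Store("store-B")

# Each actor records its own view and a link to where the other view is kept.
sender_store.record("msg-1", "client: sent request", view_link=receiver_store)
receiver_store.record("msg-1", "service: received request", view_link=sender_store)

everything = collect("msg-1", sender_store)
```

Starting from either store, a querier reaches both views of the interaction, which is what allows documentation to be spread over several stores without becoming unreachable.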
For some applications, p-assertions may relate to large data sets, such as an actor’s
state. In such cases, storage capacity problems can arise; these are dealt with by
allowing p-assertions to reference data that may be stored externally. The
replacement of a data item with a reference can be seen as the result of a transformation,
and constitutes just one of the possible ways that messages can be transformed using
several documentation styles, which provide for more flexible ways to make assertions
about data, and enable requirements on scalability and security to be met.
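Replacing a data item with a reference can be sketched as a transformation applied to a message before it is documented. Everything named here is an assumption for illustration: the `to_reference` and `dereference` functions, the `EXTERNAL_STORAGE` dictionary standing in for external storage, the size threshold, and the use of a SHA-256 digest as the reference. The architecture does not prescribe any of these choices.

```python
import hashlib

EXTERNAL_STORAGE = {}   # stands in for storage outside the provenance store


def to_reference(message, threshold=64):
    """Return a copy of `message` where large byte values are replaced by references."""
    transformed = {}
    for name, value in message.items():
        if isinstance(value, bytes) and len(value) > threshold:
            digest = hashlib.sha256(value).hexdigest()
            EXTERNAL_STORAGE[digest] = value   # keep the data outside the store
            # Note which style was applied so a querier can interpret the entry.
            transformed[name] = {"reference": digest, "style": "by-reference"}
        else:
            transformed[name] = value
    return transformed


def dereference(entry):
    """Fetch the original data for a by-reference entry."""
    return EXTERNAL_STORAGE[entry["reference"]]


message = {"operation": "analyse", "dataset": b"\x00" * 1_000_000}
documented = to_reference(message)
```

The documented message stays small regardless of the size of the data set, while the original data remains retrievable through the reference.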
Security represents a central concern in many application domains, and it is
standard software engineering methodology to integrate security features at the earliest
time possible in the development life-cycle. Security concerns are addressed both in
relation to the interactions of the internal components of provenance systems and in
relation to the actors using such systems, to ensure that appropriate access control for
provenance stores is maintained. In addition, it is important that p-assertions can be
attributed to the actor responsible for creating them, which is achieved by the inclusion
of assertion signatures.
Management is not specific to provenance, but a provenance architecture should offer
functionality that is common to most data management systems, such as notifying users
of changes to a provenance store (e.g. the addition or removal of p-assertions) and
indexing a provenance store’s contents.
By developing an industrial-strength provenance architecture, the EU Provenance
project has made possible the capture and exploitation of provenance, and thus greatly
facilitates the growth and utility of Grid-based applications by explicitly tackling the
problems of trust, accountability, compliance and validation in such open, distributed
systems.
Members of the PROVENANCE consortium:
IBM United Kingdom Limited                          United Kingdom
University of Southampton                           United Kingdom
University of Wales, Cardiff                        United Kingdom
Deutsches Zentrum für Luft- und Raumfahrt e.V.      Germany
Universitat Politècnica de Catalunya                Spain
Magyar Tudományos Akadémia Számítástechnikai és
Automatizálási Kutató Intézet                       Hungary
Contents
1 Introduction
1.1 Motivation
1.2 Structure of Document
1.3 Status of this Document
1.4 Acknowledgements
2 Provenance Definition
2.1 Common Sense Definition
2.2 Context: Service Oriented Architectures
2.3 Definition of Provenance
2.4 Representation of Provenance
2.5 Provenance Lifecycle and Three Provenance Views
2.6 Beyond Computer Data
2.7 The Nature of Queries
2.8 Conclusion
3 Logical Architecture
3.1 Architecture vs System
3.2 Role Definition
3.3 Logical Architecture
3.4 The P-Header
3.5 Conclusion
4 Security Architecture
4.1 Background
4.2 Provenance Related Security Issues
4.3 Provenance Store Security Architecture
4.3.1 Components of Security Architecture
4.3.2 Interaction Between Components
4.4 Security in Other Architecture Components
4.4.1 Between other components and the provenance store
4.4.2 Intermediate components
4.4.3 Delegation of identity or access control
4.5 Additional security issues
4.6 Conclusion
5 Scalability Architecture
5.1 Recording Patterns
5.1.1 SeparateStore Pattern
5.1.2 ContextPassing Pattern
5.1.3 SharedStore Pattern
5.1.4 Pattern Application
5.2 Linking
5.2.1 View Links
5.2.2 Object Links
5.2.3 Linking Summary
5.3 Data Staging
5.4 References
5.4.1 By-Value versus By-Reference recording
5.4.2 Record-Once versus Record-Many
5.5 P-Assertion Templates
5.6 Large Query Results
5.7 Security
5.8 Conclusion
6 Provenance Modelling
6.1 Identifying Interactions
6.2 Identifying P-Assertions and Data
6.3 Interaction Contexts and the P-Header
6.4 Interaction P-Assertion Modelling
6.5 Documentation Style Modelling
6.6 Actor State P-Assertion Modelling
6.7 Relationship P-Assertion Modelling
6.8 The P-Structure
6.9 Security
6.10 Conclusion
7 Functionality
7.1 Recording Interface
7.2 Provenance Query Interface
7.2.1 Query Data Handles
7.2.2 Relationship Target Filters
7.2.3 Provenance Query Results
7.3 Process Documentation Query Interface
7.4 Management Interface
7.4.1 Notification of Provenance Store Use
7.4.2 Provenance Store Utility
7.5 Policies
7.5.1 Provenance Store Capability Policies
7.6 Security
7.7 Conclusion
8 Actor Behaviour
8.1 Introduction
8.2 Architectural Rules
8.3 Tracers
8.3.1 Session Tracer
8.3.2 Other Tracers
8.4 Security
8.5 Documentation Style Driven Message Transformation
8.6 Actor Capability Policies
8.6.1 Recording
8.6.2 Querying
8.6.3 Service Requirement Policies
8.7 Actor Side Library
8.8 Conclusion
9 Justification
9.1 Software Requirements Document
9.1.1 Functional Requirements
9.1.2 Performance Requirements
9.1.3 Interface Requirements
9.1.4 Operational Requirements
9.1.5 Documentation Requirements
9.1.6 Security Requirements
9.1.7 Other Requirements
9.2 Tools Requirements
9.3 Scalability Requirements
9.4 Requirements from the OTM/EHCR Application . . . . . . . . . . . 128
9.5 Requirements from the Aerospace Engineering Application . . . . . . 132
9.6 Implementation Recommendations . . . . . . . . . . . . . . . . . . . 134
9.6.1 Provenance Store . . . . . . . . . . . . . . . . . . . . . . . . 134
9.6.2 Processing and UI Services . . . . . . . . . . . . . . . . . . . 135
9.6.3 Actor-Side Libraries . . . . . . . . . . . . . . . . . . . . . . 135
9.6.4 Application Use of Provenance Architecture . . . . . . . . . . 136
9.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
10 Related Work 138
10.1 Fine Granularity Provenance Systems . . . . . . . . . . . . . . . . . 138
10.2 Domain Specific Provenance Systems . . . . . . . . . . . . . . . . . 139
10.2.1 Current Practices of Document Management Systems . . . . . 140
10.3 Provenance in Database Systems . . . . . . . . . . . . . . . . . . . . 141
10.4 Middleware Provenance Systems . . . . . . . . . . . . . . . . . . . . 142
10.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
11 Conclusion 144
11.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
A Notes 146
B Abbreviations 148
C XML Schema Diagrams 149
Index 151
This document covers the logical and process architectures of provenance systems.
Specifically, the logical architecture identifies key roles and their interactions, whereas
the process architecture discusses distribution, scalability and security. A fundamen-
tal aspect of our presentation is its technology-independent nature, which makes it
reusable: the principles that are exposed in this document may be applied to different
technologies. Despite this technology-independent view, where appropriate we high-
light how the architecture can be considered within Service Oriented Architecture and
workflow enactment engine scenarios to address the emphasis on these areas expressed
in the Technical Annex of the original project proposal.
The development and physical architectures are presented in separate documents,
explaining how the architectural design is mapped onto the Web Services stack of
standards, and how each individual architecture component is implemented [Ran05,
HI05].
1.2 Structure of Document
This document is structured as follows.
Chapter 2: Provenance Definition Based on the common sense definition of prove-
nance, we propose a new definition of provenance that is suited to the compu-
tational model underpinning service oriented architectures. Since our aim is to
conceive a computer-based representation of provenance that allows us to per-
form useful reasoning about the origin of results, we examine the nature of such
representation, which is articulated around the documentation of execution.
Chapter 3: Logical Architecture We then examine the architecture of a provenance
system, centred around the notion of a provenance store. We also examine mod-
els of execution documentation.
Chapter 4: Security Architecture Although security is a non-functional requirement,
software engineering methodology strongly recommends that security consider-
ations be integrated into the development life-cycle as early as possible. Many
of the application domains in which a provenance architecture could potentially
be deployed have stringent requirements on access to data manipulated within
the system. A security architecture that helps address these issues is discussed
in this chapter.
Chapter 5: Scalability Architecture This chapter discusses scalability in the prove-
nance architecture. Architectural scalability addresses how architectural com-
ponents can be organised and used by implementations to cater for increasingly
large loads in terms of such measures as computation, bandwidth and storage.
The chapter first presents a set of recording patterns that identify communica-
tions between key architecture roles. Second, it explains how the data organ-
isation adopted by the provenance store allows for data that is geographically
scale systems are typically designed using a service-oriented approach [SH05], usually
referred to as service-oriented architectural style [Bur00]. As far as services are
concerned, we do not intend to restrict ourselves to a specific technology; instead, we
take services to be components that take inputs and produce outputs. (1) Such services
are brought together to solve a given problem, typically via a workflow that specifies
their composition. (2) In this abstract view, invocations of services take place using
messages that are constructed in accordance with service interface specifications. (3)
In a service-oriented architecture (SOA), clients typically invoke services, which may
themselves act as clients for other services; hence, we use the term actor to denote
either a client or a service in a SOA. An actor that sends a message is referred to as a
sender, whereas an actor that receives a message is known as a receiver. One message
exchanged between a sender and a receiver is termed an interaction. Hence, a given
interaction comprises two views: the sending of the message and its receipt. The running
of an application programmed in a SOA style requires the execution of the workflow, which
characterises the composition of the services that belong to the application. Hence, the
execution of a workflow is referred to as a process. (4) (We note that this use of the
term ‘process’ differs from the one in ‘process architecture’.)
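As an illustration only, the terminology introduced above can be sketched in Python.
The type names Message and View, the helper views_of and the actor names are assumptions
made for this sketch; they are not part of the architecture itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Message:
    """A message exchanged between two actors (hypothetical type)."""
    sender: str    # name of the actor that sends the message
    receiver: str  # name of the actor that receives the message
    content: str


@dataclass(frozen=True)
class View:
    """One actor's view of an interaction: as 'sender' or as 'receiver'."""
    actor: str
    role: str


def views_of(message: Message) -> Tuple[View, View]:
    """Return the two views that make up the interaction for one message."""
    return (View(message.sender, "sender"), View(message.receiver, "receiver"))


# A client invokes a service, which in turn acts as a client of another
# service; every party is simply an actor. The names are invented.
process = [
    Message("client", "service_a", "request"),
    Message("service_a", "service_b", "sub-request"),
    Message("service_b", "service_a", "sub-result"),
    Message("service_a", "client", "result"),
]

actors = {m.sender for m in process} | {m.receiver for m in process}
for m in process:
    sending, receiving = views_of(m)
    print(f"{sending.actor} -> {receiving.actor}: {m.content}")
```

In this sketch, each message yields exactly one interaction with two views, and
service_a appears both as a receiver and as a sender, which is why a single term, actor,
is convenient.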
Actors may have internal states that change during the course of execution. An
actor’s state is not directly observable by other actors; to be seen by another actor, the
state (or part of it) has to be communicated within a message sent by the actor owning
the state. (5)
Our broad, technology-independent approach to SOAs has formal foundations in
the pi-calculus [Mil99] and asynchronous distributed systems [Lyn95, Tel94]. Accord-
ing to such a view of the world, messages are the only mechanism used to transfer
information between actors. The pi-calculus is of interest in this context because of
its approach to defining events that are internal to actors as hidden communications;
an asynchronous view of distributed systems is, however, a better match to service-
oriented architectures.
2.3 Definition of Provenance
In this section, we focus on data produced by computer systems, and we define the
provenance of a piece of data (or data item). Specifically, we consider service-oriented
architectures, as discussed in Section 2.2, since they constitute the architectural style
generally adopted to build large scale open systems. (In Section 2.6, we examine
how our definition of provenance can be extended to cater for objects or events of the
physical world.)
The two common sense definitions consider provenance to be the derivation from
a particular source to a specific state of an item. We have identified a process in a
SOA as the execution of a workflow, which we broadly see as a specification of a given
service composition. Hence, by having a description of the process that resulted in
a data item, we can explain how such a data item has been obtained. Inspired by
previous work [GLM04a, GLM04c, GLM04b, TGX05, MGBM06, SM03b], the EU
Provenance project pre-prototype [XBC+05], its requirements documents [And05a,
And05b], and an architecture strawman [MCG+05], we propose the following defini-
tion of provenance, which makes explicit the notion of process.
Definition 2.3 (Provenance of a piece of data) The provenance of a piece of data is
the process that led to that piece of data. □
In relation to the two common sense definitions of provenance, we note that Definition
2.3 is concerned with provenance as a concept. Ultimately, our aim is to conceive a
computer-based representation of provenance that allows us to perform useful analysis
and reasoning to support our use cases. Consequently, the provenance of a piece of
data is to be represented in a computer system by some suitable documentation of the
process that led to the data.
While specific applications determine the actual form that such documentation
should take, we can identify several of its general properties. Documentation can be
complete or partial (for instance, when the computation has not yet terminated); it can
be accurate or inaccurate; it can present conflicting or consensual views of the actors
involved; it can be descriptive or conceptual; and it can abstract more or less from
reality.
2.4 Representation of Provenance
In this section, we introduce the key elements that form the representation of prove-
nance in a SOA; further refinement will ultimately lead to data types for provenance
representation (cf. Chapter 6).
In the previous section, we stated that provenance of a data item is to be represented
in a computer system by some suitable documentation of the process that led to it. To
this end, we distinguish a specific piece of information documenting some step of a
process from the whole documentation of the process. The former shall be referred to
as a p-assertion, which we define as follows.
Definition 2.4 (p-assertion) A p-assertion is an assertion that is made by an actor
and pertains to a process. □
From this definition, we derive the notion of process documentation.
Definition 2.5 (Process Documentation) The documentation of a process consists of
a set of p-assertions made by the actors involved in the process. □ (6)
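Definitions 2.4 and 2.5 can be read as a simple data model. The following Python sketch
is a minimal illustration under stated assumptions: the class PAssertion, its fields, the
function process_documentation and all actor names are hypothetical and do not prescribe
any data type of the architecture.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class PAssertion:
    asserter: str         # the actor making the assertion
    interaction_key: str  # hypothetical identifier of the interaction concerned
    content: str          # what is asserted about the process


def process_documentation(assertions: Iterable[PAssertion],
                          actors: Set[str]) -> FrozenSet[PAssertion]:
    """Collect the p-assertions made by the actors involved in a process."""
    return frozenset(a for a in assertions if a.asserter in actors)


# Invented example: three assertions by involved actors, one by an outsider.
recorded = [
    PAssertion("client", "i1", "sent request"),
    PAssertion("service", "i1", "received request"),
    PAssertion("service", "i2", "sent result"),
    PAssertion("auditor", "i9", "unrelated assertion"),
]
documentation = process_documentation(recorded, actors={"client", "service"})
print(len(documentation))
```

Modelling the documentation as a set reflects the wording of Definition 2.5; it carries
no ordering, so any ordering of steps would have to come from the p-assertions themselves.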
We note that a given p-assertion may belong to the provenance representation of
multiple pieces of data. When a p-assertion is created (and later recorded), it docu-
ments a step of a process in progress, which ultimately will lead to a piece of data.
At the time of the p-assertion creation, we may not know the piece of data that will
“volunteer” some information that is only available to it. An actor may provide re-
lationship p-assertions that identify the relationship between its outputs (whether as
returned result or invocation message to other actors) and its inputs (or intermediary
results received from invoked actors).
Definition 2.7 (Relationship p-assertion) A relationship p-assertion is an assertion
by an actor that the sending of a message would not be occurring or a data item it is
sending would not be as it is (the effect), if it had not received other messages or data
items had not been as they are (the causes), and that this relationship is due to its own
action, expressible as the function applied to the causes to produce the effect. □
While matching interaction p-assertions denote a flow of data between actors, relation-
ships explain how data flows inside actors. Relationship p-assertions are directional
since they explain how some data was computed from other data.
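To make the roles of effect, causes and function in Definition 2.7 concrete, a
relationship p-assertion might be sketched as follows. The class name, field names and
the describe method are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RelationshipPAssertion:
    asserter: str            # the actor whose own action relates causes to effect
    effect: str              # data item (or message) being sent
    causes: Tuple[str, ...]  # data items (or messages) previously received
    function: str            # name of the function applied to the causes

    def describe(self) -> str:
        """Render the relationship in the form effect=function(causes)."""
        return f"{self.effect}={self.function}({','.join(self.causes)})"


r = RelationshipPAssertion("actor", effect="d2", causes=("d1",), function="f")
print(r.describe())  # d2=f(d1)
```

The separation of effect from causes encodes the directionality noted above: the
assertion says that d2 was computed from d1, not the converse.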
Figure 2.1 illustrates two actors. The first is a primitive actor, i.e., one that receives
a message and produces a result, but does not invoke subsequent actors, or alterna-
tively, an actor that does not make assertions of the invocations it makes of subsequent
actors (say, for privacy reasons). In order to contribute some information about its
internal flow of information, it can indicate that its output data (in the output message)
is a function of the input data (contained in the input message). The second actor of
Figure 2.1 is not primitive, and makes assertions of the contents of the messages it
sends to and receives from another actor that it invokes. Like the first actor, it may
indicate that its output is a function of its input; alternatively, it may explain how the
data contained in the secondary invocation message and its result relate to the input
and output.
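[Figure 2.1: diagram omitted. Labels for the first actor: messages M1, M2; function f;
data items d1, d2. Labels for the second actor: messages M1 to M4; functions f, f1, f2;
data items d1 to d4.]

P-assertions of the first actor:

interaction key   p-assertion type   p-assertion content
1                 interaction        M1
2                 interaction        M2
2                 relationship       d2=f(d1)

P-assertions of the second actor:

interaction key   p-assertion type   p-assertion content
1                 interaction        M1
2                 interaction        M2
3                 interaction        M3
4                 interaction        M4
2                 relationship       d2=f(d1)
3                 relationship       d3=f1(d1)
2                 relationship       d2=f2(d4,d1)

Figure 2.1: Data flow assertions by opaque and transparent actors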
Figure 2.1 displays the ideal case of purely functional actors, which do not maintain
a persistent state across invocations. The same approach generalises to stateful actors:
the data in an output message can be a function of the data received during a previous
interaction and kept in a persistent store. On the right-hand side of Figure 2.1, we see (8)
a symbolic representation of the p-assertions generated by the actors. Each p-assertion
has a type and a content, and is asserted in the context of an interaction identified by a
key.
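As a minimal sketch, the symbolic p-assertions of the first actor of Figure 2.1 could be written down as keyed records. The class and field names (PAssertion, interaction_key, kind, content) and the helper in_context are illustrative assumptions, not part of the architecture.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    INTERACTION = "interaction"
    RELATIONSHIP = "relationship"


@dataclass(frozen=True)
class PAssertion:
    interaction_key: int  # key of the interaction in whose context it is asserted
    kind: Kind            # the p-assertion type
    content: str          # symbolic content: a message name or a data relationship


# P-assertions of the first (opaque) actor of Figure 2.1.
opaque_actor = [
    PAssertion(1, Kind.INTERACTION, "M1"),
    PAssertion(2, Kind.INTERACTION, "M2"),
    PAssertion(2, Kind.RELATIONSHIP, "d2=f(d1)"),
]


def in_context(assertions, key):
    """Return the p-assertions asserted in the context of one interaction."""
    return [a for a in assertions if a.interaction_key == key]


for a in in_context(opaque_actor, 2):
    print(a.interaction_key, a.kind.value, a.content)
```

Grouping by interaction key reflects the idea that every p-assertion is made in the context of exactly one interaction.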
Hence, interaction p-assertions denote data flows between actors, whereas relation-
ship p-assertions denote data flows within actors. Such data flows are core elements to
reconstitute functional data dependencies in execution. In the most general case, such
data flows constitute a directed acyclic graph (DAG). From a specific data item, the
data flow DAG indicates where and how the data item is used; vice versa, following
relationships in reverse helps us identify how a data item was produced. The data flow
DAG is thus a core element of provenance representation, but it is not the only one;
other p-assertions can provide further information about internal states of actors during
execution, as we now explain.
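The two directions of traversal can be illustrated with a small sketch. It reduces relationship p-assertions to a mapping from a data item to the items it was derived from, using the relationships of the second actor of Figure 2.1. The entry for d4 is an assumption added for the example (it would be asserted by the invoked actor), and the function names are hypothetical.

```python
# Each relationship p-assertion is reduced to: output item -> items it was derived from.
derived_from = {
    "d3": ["d1"],        # d3 = f1(d1)
    "d2": ["d4", "d1"],  # d2 = f2(d4, d1)
    "d4": ["d3"],        # assumed: asserted by the invoked actor, not shown in Figure 2.1
}


def ancestors(item, derived_from):
    """Follow relationships in reverse to collect every item that `item` depends on."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for source in derived_from.get(current, []):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


def uses(item, derived_from):
    """Forward direction: items directly computed from `item`."""
    return {out for out, sources in derived_from.items() if item in sources}


print(sorted(ancestors("d2", derived_from)))  # ['d1', 'd3', 'd4']
print(sorted(uses("d1", derived_from)))       # ['d2', 'd3']
```

Because the graph is acyclic, the reverse traversal terminates without further safeguards; the `seen` set only avoids revisiting shared ancestors such as d1.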
Interaction and relationship p-assertions capture the flow of data in a process. In
some circumstances, however, actors’ internal states may also be necessary to under-
stand the functionality, performance or accuracy of actors, and therefore the nature of
the result they compute. Hence, we introduce the notion of an actor state p-assertion
[SR-1-6, p. 116] as the documentation provided by an actor about its internal state in the context of
a specific interaction.
Definition 2.8 (Actor State p-assertion) An actor state p-assertion is an assertion,
by an actor, of data received from an (unspecified) internal component of the actor just
before, during or just after a message is sent or received. It can, therefore, be viewed
as documenting part of the state of the actor at an instant, and may be the cause, but
not effect, of other events in a process.
Actor state p-assertions can be extremely varied: they may include the function the
actor performs, the workflow that is being executed, the amount of disk and CPU a
service used in a computation, the floating point precision of the results it produced,
or application-specific state descriptions.
In summary, p-assertions can be of three disjoint kinds [SR-1-12, p. 118]: interaction p-assertions,
relationship p-assertions and actor state p-assertions. We note that p-assertions are
independent of the actual service technology used to implement applications.
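To make the third kind concrete, the following sketch shows how an actor might document part of its state in the context of an interaction. The record layout, the helper document_state and all values are invented for illustration; the architecture does not prescribe them.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ActorStatePAssertion:
    interaction_key: int    # interaction in whose context the state is documented
    state: Dict[str, Any]   # application-specific description of the actor's state


def document_state(interaction_key, **state):
    """Build an actor state p-assertion from keyword arguments."""
    return ActorStatePAssertion(interaction_key, dict(state))


# Illustrative values only: function performed, resources used, numeric precision.
p = document_state(2, function="f", cpu_seconds=1.7, disk_mb=12, float_bits=64)
print(p.interaction_key, sorted(p.state))
```

The free-form dictionary mirrors the observation that actor state p-assertions can be extremely varied and application-specific.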
2.5 Provenance Lifecycle and Three Provenance Views
In the previous section, we characterised the syntactic nature of p-assertions, in the
form of a broad classification in three different categories, according to whether they
document interactions, relationships or actor states. We now focus on a dynamic char-
acterisation of p-assertions and, in particular, when they are created, recorded, queried
and managed, with respect to process execution. These different phases identify a
provenance lifecycle, which we now describe. (We note that such a lifecycle is to be
understood in the context of application execution and should be distinguished from
a methodology that identifies design steps in order to conceive an application that is
provenance aware.)
Before discussing the provenance lifecycle, it is necessary to introduce an archi-
tectural element, which we expand upon in Chapter 3. Since we aim to provide a
long-term facility for storing the provenance representation of data items, we delegate
to a specific element, which we refer to as a provenance store, the role of making per-
sistent, managing and providing controlled access to such provenance representation.
The choice of an explicit architectural element to embody this role in no way implies
any form of physical deployment; instead, it helps us identify the kind of functionality
that is necessary in order to offer support for provenance.
The provenance lifecycle is composed of four different phases. As execution pro-
ceeds, actors create p-assertions that are aimed at representing their involvement in a
computation. After their creation, p-assertions are stored in a provenance store, with
the intent that they can be used to reconstitute the provenance of some data. The
provenance store therefore acts as storage of p-assertions. After a data item has been
computed, users (or applications) may need to obtain the provenance of this data item: they
can do so by querying the provenance store. At the most basic level, the result of the
query is the set of p-assertions pertaining to the process that produced the data. More
advanced query facilities may return a representation derived from p-assertions that is
of interest to the user. We will come back to this aspect in Section 2.7. Finally, as time
progresses, the provenance store and its contents may need to be managed to handle
distribution, change management, curation etc. In summary, the provenance lifecycle
is composed of four different phases: (i) creating, (ii) recording, (iii) querying and
(iv) managing. A provenance system should provide support for all these phases.
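The four phases can be walked through with a toy, in-memory stand-in for a provenance store. This is a sketch under strong assumptions: the class ProvenanceStore and its methods record, query and discard are hypothetical names, persistence and access control are omitted, and discarding p-assertions is only one conceivable management operation.

```python
class ProvenanceStore:
    """In-memory stand-in for a provenance store (no persistence, no access control)."""

    def __init__(self):
        self._assertions = []

    def record(self, p_assertion):
        """Recording phase: accept a p-assertion created by an actor."""
        self._assertions.append(p_assertion)

    def query(self, predicate):
        """Querying phase: return the recorded p-assertions that satisfy `predicate`."""
        return [p for p in self._assertions if predicate(p)]

    def discard(self, predicate):
        """Managing phase (one possible operation): remove matching p-assertions."""
        before = len(self._assertions)
        self._assertions = [p for p in self._assertions if not predicate(p)]
        return before - len(self._assertions)


store = ProvenanceStore()

# (i) Creating: an actor builds p-assertions as execution proceeds.
created = [
    {"key": 1, "type": "interaction", "content": "M1"},
    {"key": 2, "type": "interaction", "content": "M2"},
    {"key": 2, "type": "relationship", "content": "d2=f(d1)"},
]

# (ii) Recording: the p-assertions are submitted to the store.
for p in created:
    store.record(p)

# (iii) Querying: everything asserted in the context of interaction 2.
result = store.query(lambda p: p["key"] == 2)
print(len(result))  # 2

# (iv) Managing: here, curation by removing what was asserted for interaction 1.
removed = store.discard(lambda p: p["key"] == 1)
print(removed)  # 1
```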
We previously discussed the two understandings of provenance that Definitions 2.1
and 2.2 imply: conceptual and representational (in a computer system). In light of the
provenance lifecycle, we can refine this view and distinguish three understandings of
provenance. (i) As before, provenance can be seen as a concept from which we
can explain how a result has been achieved. (ii) The recording phase of the prove-
nance lifecycle results in a set of p-assertions accumulated in the provenance store.
These p-assertions constitute a documentation of execution, which includes informa-
tion from which a representation of the provenance of the data we are interested in
can be derived. (iii) Alternatively, the lifecycle querying phase suggests that prove-
nance queries filter out p-assertions and make them available in some representation
(whether as a set of p-assertions or in some other form), which constitutes a query-time
representation of provenance.
When designing a generic provenance system, we cannot anticipate all forms of
queries that users may wish to issue. Hence, to be able to support complex querying
functionality, it is important to provide a complete and detailed set of p-assertions
about the aspect of execution we are permitted to document. This inevitably may
raise scalability concerns that have to be addressed by the architectural design for
the lifecycle recording phase. Symmetrically, the challenge for a query facility is to
identify a subset of useful p-assertions, by selecting, scoping and filtering p-assertions.
(These aspects are discussed further in Section 2.7.)
2.6 Beyond Computer Data
We specifically restricted Definition 2.3 to the provenance of electronic data contained
in a computer system. Our rationale was that our primary focus is on service oriented
architectures, used in building open, large scale systems. However, objects in the real
world also have a provenance. The purpose of this section is to examine how the
approach we propose to track provenance of data can be extended to track provenance
of physical world entities.
Initially, we consider a restrictive deployment, as illustrated in Figure 2.2. On the
left hand side, we see a computer application, in a SOA style, composed of a set of
actors and producing some result. With the approach presented in this chapter, p-
assertions describing execution are stored in a provenance store. The actors however
are not traditional processing actors that take inputs and produce outputs as a result of
their internal behaviour. Instead, such actors are directly wired to “actuator/sensor”
pairs that operate on objects in the physical world and sense their environment, all
represented on the right hand side of the picture. (The actual “wiring” is represented by
dashed lines.) Such actuators can be robots, taking objects as input and assembling
them, painting them, wrapping them, or even shipping them. Sensors, such as movement
sensors, cameras or radar, perceive events in the physical world. Information can
transit from an actor to an actuator: it can be seen as a control order for the actuator;
vice versa, sensors can feed back information to the computer system. We assume
here that the mapping is one to one, i.e., for one actor there exists one and only one
actuator/sensor, that an actuator is directly driven by an actor, and that an actor reacts
to information provided by a sensor. The outcome of the chain of actuators/sensors is
a physical artifact. We note that either the actuator or the sensor functionality in an
actuator/sensor pair may be void.
Given this mapping assumption, the computer system’s workflow mirrors a physi-
cal process in the physical world. The ultimate electronic data produced by the com-
puter system is thus an electronic proxy for the physical world artifact. By querying the
provenance of the electronic data, we can therefore obtain an accurate representation
of the provenance of the physical artifact, due to the one to one mapping assumption.
This requires some explicit actor state p-assertions to be recorded by actors in the
computer application, which describe the activated actuators and the sensed data they
return.
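A sketch of the one to one case follows. An actor wired to a single actuator/sensor pair records an interaction p-assertion for the order it receives and an actor state p-assertion describing the activated actuator and the sensed data. The classes ActuatorSensor and WiredActor, the device name and the sensed values are all hypothetical; a plain list stands in for the provenance store.

```python
class ActuatorSensor:
    """Hypothetical actuator/sensor pair; a real one would drive hardware."""

    def __init__(self, name):
        self.name = name

    def actuate(self, order):
        # Pretend to act on a physical object and report what the sensor perceives.
        return {"order_done": order, "sensed_weight_kg": 2.5}


class WiredActor:
    """Actor wired one to one to a single actuator/sensor pair."""

    def __init__(self, device, store):
        self.device = device
        self.store = store  # a plain list stands in for the provenance store

    def handle(self, key, order):
        # Interaction p-assertion: the control order received by the actor.
        self.store.append({"key": key, "type": "interaction", "content": order})
        sensed = self.device.actuate(order)
        # Actor state p-assertion: which actuator was activated and what it sensed.
        self.store.append({
            "key": key,
            "type": "actor state",
            "content": {"actuator": self.device.name, "sensed": sensed},
        })
        return sensed


store = []
actor = WiredActor(ActuatorSensor("paint-robot-1"), store)
actor.handle(1, "paint red")
for p in store:
    print(p["type"], p["content"])
```

Under the mapping assumption, querying these p-assertions yields a description of what happened to the physical object, even though only electronic data was documented.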
[Figure 2.2 (diagram): on the left, a computer application composed of actors that
produce data; on the right, the actuators/sensors to which the actors are wired,
producing a physical artefact.]
Figure 2.2: Mapping to the Physical World
In practice, the one to one assumption may not necessarily hold, which means that
the physical process may not directly be mirrored in the computer system. Specifically,
we consider the case in which there may be actuators or sensors that are not directly
under the control of the computer application, e.g. in a system where machines are
controlled by humans. In such circumstances, the provenance of the electronic data
only helps us to derive a partial representation of the provenance of the physical arti-
fact. Such a limitation may be alleviated if an actor is capable of recording p-assertions
about the part of the physical process that is not directly mirrored in the computer sys-
tem, as if a one-to-one mapping existed. (We note that this also applies to any process
where actors are not able to record documentation of process themselves.)
The discussion in this section has focused on physical artifacts. However, the
principles just exposed remain applicable to other “things” in the real world, such
as choices made by users, outcomes of a decision making process, or events observed
by sensors or users. What the provenance system requires is either a user interface or
sensor to act as an actor, recording p-assertions about the actions that occurred in the
physical world, or another actor to relate such actions on behalf of the physical process
that is not observed by sensors or users.
Consequently, we can now extend our definition of provenance to encompass the
physical world.
Definition 2.9 (Provenance of an entity) The provenance of an entity (whether com-
puter based or in the physical world) at a given point in execution is the process that
led to that entity at that point.
In the rest of the document, we continue to refer to the provenance of “data items”
unless we specifically wish to refer to the provenance of physical world entities.
Additionally, we note that earlier we used the term actor to denote either a client
or a service in a SOA. As the physical world is not so clearly describable in terms of
clients and services, we broaden the definition of actor to mean any entity that acts.
2.7 The Nature of Queries
The purpose of a provenance query about a given data item is to identify a set of
p-assertions that were submitted to the provenance store during some execution that
resulted in the data item. The intent of such a query is that the selected p-assertions,
which we refer to as the query result, provide a description of the process that led to the
data, i.e., the provenance of the data, expressed at a level of abstraction that is suitable
for the requester.
Hence, given a query, the purpose of a query engine is to decide which p-assertions
belong to a query result. Several factors can be taken into account in order to decide
if a p-assertion belongs to a query result. It is the purpose of the query to specify such
factors. In the rest of the section, we discuss some of the factors that a provenance
system needs to support.
Open systems may introduce an understanding of a process’s scope that differs
from one in closed systems. Indeed, in a traditional batch system, the beginning of a
process is marked by its submission to the batch system (or by its scheduling) and its
end is defined by the termination of execution and deallocation of resources. While
such a clearly defined beginning and end of process can still be achieved in a well-
structured and controlled closed computation performed in an open environment, it
no longer applies when previous results are opportunistically and serendipitously dis-
covered and reused to produce some data. As an illustration, consider a process p1
producing a result r1, which is itself later discovered and used by a distinct process p2
producing r2. In this example, the end of process p1 is marked by the production of re-
sult r1, while process p2 begins after the production and discovery of r1 and terminates
with result r2. Another design could have conceived a process p3 producing a similar
final result r′2, where p3 is the composition of p1 and p2. If we are not interested in tem-
poral details, and the fact that intermediary result r1 was stored and discovered, both
results r2 and r′2 have similar provenance, but were produced by apparently different
processes, p2 and p3, respectively. The reason for this difference is that p3 is conceived
as a closed experiment, producing r′2, whereas p2 opportunistically reused an existing
result. There is no right or wrong interpretation in this example: whether p2 or p3 is
the process of interest is to be decided at query time, by the querier.
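The p1/p2/p3 illustration can be sketched as a query whose scope is supplied by the querier. The data layout (produced_by), the item name "input" and the function provenance are assumptions made for the example; only the processes p1 and p2 and the results r1 and r2 come from the discussion above.

```python
# Each entry: data item -> (process that produced it, items it was derived from).
produced_by = {
    "r1": ("p1", ["input"]),
    "r2": ("p2", ["r1"]),
}


def provenance(item, in_scope):
    """Collect the (process, item) steps that led to `item`,
    restricted to the processes accepted by the querier's `in_scope` test."""
    steps = []
    pending = [item]
    while pending:
        current = pending.pop()
        if current not in produced_by:
            continue  # no documentation recorded for this item
        process, sources = produced_by[current]
        if not in_scope(process):
            continue  # the querier treats this item as a given input
        steps.append((process, current))
        pending.extend(sources)
    return steps


# Querier interested in p2 only: r1 is taken as an existing, reused result.
only_p2 = provenance("r2", lambda proc: proc == "p2")
# Querier interested in the composition of p1 and p2 (the view of p3).
composed = provenance("r2", lambda proc: proc in {"p1", "p2"})

print(only_p2)   # [('p2', 'r2')]
print(composed)  # [('p2', 'r2'), ('p1', 'r1')]
```

The recorded documentation is identical in both cases; only the scope chosen at query time differs.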
Let us now assume that the provenance representation we discuss here is made
available for all data or objects. Given that the state of our universe, including all elec-
tronic data, is derived from the “Big Bang”, we do not expect provenance queries to
return all p-assertions back to such a point. Hence, we need mechanisms to specify
how far back in the execution we include p-assertions in the query result. Such mech-
anisms can be varied: we introduce them briefly here and discuss them later in Section
7.2. (i) A limit can be set on the length of the relationship chains. (ii) Relationship
chains can be traversed until the data being transferred satisfies some property, such
as being of a given type. (iii) Given that actors can describe themselves by the func-
tionality they perform on their inputs, functionalities of interest identify p-assertions
that belong to the query result or that are to be rejected. (iv) Actors may record
Copyright @ 2005, 2006 by the PROVENANCE consortium
The PROVENANCE project receives research funding from the European Commission’s Sixth Framework Programme
24
Enabling and Supporting Provenance in Grids for Complex Problems
Contract Number: 511085
2.7 The Nature of Queries
The purpose of a provenance query about a given data item is to identify a set of
p-assertions that were submitted to the provenance store during some execution that
resulted in the data item. The intent of such a query is that the selected p-assertions,
which we refer to as the query result, provide a description of the process that led to the
data, i.e., the provenance of the data, expressed at a level of abstraction that is suitable
for the requester.
Hence, given a query, the purpose of a query engine is to decide which p-assertions
belong to a query result. Several factors can be taken into account in order to decide
if a p-assertion belongs to a query result. It is the purpose of the query to specify such
factors. In the rest of the section, we discuss some of the factors that a provenance
system needs to support.
Open systems may introduce an understanding of a process’s scope that differs
from one in closed systems. Indeed, in a traditional batch system, the beginning of a
process is marked by its submission to the batch system (or by its scheduling) and its
end is defined by the termination of execution and deallocation of resources. While
such a clearly defined beginning and end of process can still be achieved in a well-
structured and controlled closed computation performed in an open environment, it
no longer applies when previous results are opportunistically and serendipitously dis-
covered and reused to produce some data. As an illustration, consider a process p1
producing a result r1, which is itself later discovered and used by a distinct process p2
producing r2. In this example, the end of process p1 is marked by the production of re-
sult r1, while process p2 begins after the production and discovery of r1 and terminates
with result r2. Another design could have conceived a process p3 producing a similar
final result r′2, where p3 is the composition of p1 and p2. If we are not interested in tem-
poral details, and the fact that intermediary result r1 was stored and discovered, both
results r2 and r′2 have similar provenance, but were produced by apparently different
processes, p2 and p3, respectively. The reason for this difference is that p3 is conceived
as a closed experiment, producing r′2, whereas p2 opportunistically reused an existing
result. There is no right or wrong interpretation in this example: whether p2 or p3 is
the process of interest is to be decided at query time, by the querier.
Let us now assume that the provenance representation we discuss here is made
available for all data or objects. Given that the state of our universe, including all elec-
tronic data, is derived from the “Big Bang”, we do not expect provenance queries to
return all p-assertions back to such a point. Hence, we need mechanisms to specify
how far back in the execution we include p-assertions in the query result. Such mech-
anisms can be varied: we introduce them briefly here and discuss them later in Section
7.2. (i) A limit can be set on the length of the relationship chains. (ii) Relationship
chains can be traversed until the data being transferred satisfies some property, such
as being of a given type. (iii) Given that actors can describe themselves by the func-
tionality they perform on their inputs, functionalities of interest identify p-assertions
that belong to the query result or that are to be rejected. (iv) Actors may record
Figure 3.1: Architecture of a Provenance-Aware Application
[Diagram not reproduced. Labelled elements: User; Application UI; Application Services (examples: Workflow Enactment Engine, Domain-Specific Services); Actor-Side Recording Library; Actor-Side Query Library; Actor-Side Management Library; Provenance Stores with their Recording Interface, Query Interface and Management Interface; Processing Services (examples: Auditor, Service Quality Analyser, Trace Comparator, Trace to Workflow Converter, Re-enactor, Semantic Validity Analyser, Publication Generator); Presentation UIs (examples: Trace Visualiser / Browser, Trace Difference Visualiser, Trace Validity Visualiser, Service Quality Visualiser, Workflow Constructer); Management UIs; Policy-Based Matchmaking, Discovery and Negotiation; Service Requirement & Capability Policy; Provenance Store Policy; User Requirement Policy.]
documenting (such as their need for high throughput or highly persistent provenance
stores); (iii) policies define configurations of provenance stores, from a deployment
and security viewpoint (e.g., resources they use, their access control list, or registry
where they should be advertised). Policies are further specified in Section 7.5. By
making explicit all these policies, it becomes possible to discover services that match
user or other service needs. When requested policies conflict with discovered policies,
negotiation can be initiated to find a compromise between the offer and demand.
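As an illustration of policy-based matchmaking, the following minimal sketch compares a requirement policy with the policies advertised by provenance stores. The class StorePolicy, the capability labels and the function names are hypothetical; the architecture does not prescribe a policy format or a negotiation protocol, and negotiation is reduced here to selecting the closest offer as a starting point for a compromise.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set, Tuple


@dataclass(frozen=True)
class StorePolicy:
    """Hypothetical advertisement of what a provenance store offers."""
    name: str
    capabilities: FrozenSet[str]


def unmet(required: Set[str], store: StorePolicy) -> Set[str]:
    """Requirements that the store's advertised policy does not cover."""
    return set(required) - store.capabilities


def discover(required: Set[str], stores: List[StorePolicy]) -> List[StorePolicy]:
    """Stores whose advertised policy satisfies every requirement."""
    return [s for s in stores if not unmet(required, s)]


def closest_offer(
    required: Set[str], stores: List[StorePolicy]
) -> Optional[Tuple[StorePolicy, Set[str]]]:
    """Store with the fewest unmet requirements, plus what remains to negotiate."""
    if not stores:
        return None
    best = min(stores, key=lambda s: len(unmet(required, s)))
    return best, unmet(required, best)


stores = [
    StorePolicy("store-a", frozenset({"high-throughput"})),
    StorePolicy("store-b", frozenset({"high-throughput", "highly-persistent"})),
]
```

When discover returns no store, the set returned by closest_offer identifies the requirements on which offer and demand still conflict.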
Figure 3.1 displays how applications can integrate with the provenance system. It
is, however, important to clarify the scope of the architecture that we are addressing
in this document. This is the purpose of Figure 3.2, which draws a
circle around the architectural elements that are discussed in this document. Other
components are excluded from further discussion because their behaviour is entirely
application-dependent, apart from what is specified in provenance-specific policies.
3.4 The P-Header
In Section 3.2, we introduced the roles of actors involved in the provenance life-cycle
and their general responsibilities. Roles place more specific obligations on actors with
respect to supporting actors in other roles. Largely, this is a matter of providing ad-
equate information in the correct format: for example, an asserting actor must create
p-assertions in a format that a provenance store can make persistent and a provenance
store must provide p-assertions in a format that querying actors can interpret. We spec-
ify how p-assertions and other data should be modelled to provide such consistency in
Chapter 6.
In order for p-assertions to be created, asserting actors need to identify which pro-
cess they are making an assertion about, which requires some shared context between
asserting actors. As it is application actors that make assertions, we place a further obli-
gation on them to pass context information between each other regarding the process
being executed. As this would often be achieved by putting the context information
in the header of an application message (although it could be exchanged by other,
application-specific means), we call this information the p-header, defined as follows.
Definition 3.1 (p-header) The p-header of an interaction is provenance-related
contextual information, sent along with the interaction’s message.
In practice, the p-header can contain an identifier for the interaction to which the
context information applies and the locations of provenance stores where p-assertions
documenting the same process are stored. Additionally, the p-header can contain a set
of tracers, which are used to demarcate where one process starts and ends. A tracer
is a token added to a p-header by an application actor, where the same tracer is added
to the p-headers of all interactions in the same process by the same application actor.
Additionally, where a tracer is included in the p-header of a message received by an
Figure 3.2: Provenance Logical Architecture and its Scope
[Diagram not reproduced. Labelled elements: User; Application Services; Client-Side Recording Library; Actor-Side Query Library; Client-Side Management Library; Provenance Store with its Recording Interface, Query Interface and Management Interface; Processing Services; Presentation UIs; Management UIs. A boundary labelled "Scope of a standardised Provenance system" encloses the elements discussed in this document.]
application actor, that actor is obliged to copy the tracer into the p-header of all inter-
actions within the same process. Using tracers, a querying actor can determine which
interactions were part of a single process, because their p-headers will all contain the
same tracer, and whether one process is contained within another, because the tracers
of the former’s interactions will be a subset of the tracers of the latter’s interactions.
The structure of p-headers and tracers is discussed in more detail in Chapter 6.
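The tracer obligations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the structure defined in Chapter 6: the class PHeader, its field names and the helper functions are hypothetical. Containment is checked here by comparing the sets of interactions that carry each tracer, which is one possible reading of the rule above.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class PHeader:
    """Hypothetical p-header: interaction identifier, store locations, tracers."""
    interaction_id: str
    store_locations: FrozenSet[str] = frozenset()
    tracers: FrozenSet[str] = frozenset()


def outgoing_header(received: PHeader, interaction_id: str, own_tracer: str) -> PHeader:
    """P-header for an interaction sent while handling `received`.

    Received tracers are copied, and the actor adds its own tracer.
    """
    return PHeader(interaction_id, received.store_locations,
                   received.tracers | {own_tracer})


def interactions_by_tracer(headers: List[PHeader]) -> Dict[str, Set[str]]:
    """Group interaction identifiers by the tracers found in their p-headers."""
    groups: Dict[str, Set[str]] = {}
    for header in headers:
        for tracer in header.tracers:
            groups.setdefault(tracer, set()).add(header.interaction_id)
    return groups


def is_contained(inner: str, outer: str, groups: Dict[str, Set[str]]) -> bool:
    """True if every interaction carrying `inner` also carries `outer`."""
    return groups[inner] <= groups[outer]


# A client starts a process (tracer tA); the invoked service makes two
# further interactions as part of its own process (tracer tB).
i1 = PHeader("i1", frozenset({"store-1"}), frozenset({"tA"}))
i2 = outgoing_header(i1, "i2", "tB")
i3 = outgoing_header(i1, "i3", "tB")
groups = interactions_by_tracer([i1, i2, i3])
```

A querying actor could use such groups to select the interactions of a single process before retrieving the corresponding p-assertions.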
3.5 Conclusion
In this chapter, we have presented the logical architecture that underlies our prove-
nance system and the roles of the actors that interact within that architecture. During
the provenance lifecycle, the actors perform several roles: application actors execute
processes; asserting actors create p-assertions about these processes; and recording
actors record p-assertions in provenance stores, which allow querying actors to re-
trieve p-assertions and managing actors to maintain them. The recording, query and
management functions of the provenance stores are made available through fixed, pre-
specified interfaces, making it possible to program an application to take advantage of
the architecture. Policies control the run-time behaviour of architectural components
deployed in different contexts, and each role places obligations on the actors playing
it.
The remaining chapters of this document examine the issues that affect the funda-
mental parts of the architecture or that cut across a provenance system as a whole.
Chapter 4
Security Architecture
One of the key features for a provenance architecture within the context of this project
is security. Many of the application domains in which a provenance architecture could
be deployed have stringent requirements on access to data manipulated
within the system. Correspondingly, p-assertions that incorporate or are derived from
these data are likely to have similar security restrictions on them as well. Although
security is a non-functional requirement, software engineering methodology strongly
recommends that security considerations be integrated into the development life-cycle
as early as possible. With this as a motivating factor, we proceed in this chapter to
outline a security architecture for the logical architecture that we described in Chapter
3. In addition, the remaining chapters of this document will contain a security section
(if relevant) that may make reference to the material presented in this chapter.
In Section 4.1, we briefly define some of the common security concepts that we
use in this document. In Section 4.2, we survey the security issues relevant to the
conception of provenance. Following that, we present the security architecture for the
provenance store and describe the functionality and interaction between its constituent
components in Section 4.3. In Section 4.4, we discuss the security issues pertaining to other
components in the logical architecture. We then outline the security issues that remain
unaddressed in Section 4.5, and conclude in Section 4.6.
4.1 Background
This section briefly introduces some of the more common terms
encountered in the field of electronic security. It is not intended to be
a comprehensive treatise of the area, and merely seeks to provide a conceptual
background for the security discussion in the remaining sections of this document.
We consider a system that offers some functionality through a set of resources that
can be accessed and manipulated. Usually, these resources may be accessed or
manipulated only in specific ways, in order to ensure that the functionality
offered by the entire system is unaffected. The integrity of a resource is a property of
to invoke another service in order to fulfil the requested functionality. If these services
exist in different security domains, then the individual responsible for initiating the
workflow would need to authenticate twice: once to each of them. Once again, a sin-
gle sign-on capability can be provided if a mechanism is implemented in the security
infrastructure that empowers the first service to invoke the second service based on the
access control rights transferred to it from the individual concerned. Note that,
while the concept of single sign-on is the same as in identity federation,
the motivating situations are slightly different. Delegation of access control generally
also implies that the delegated access rights are valid only within a
certain context: for example, for the duration of a workflow or for access to specific
resources only. There must be a way to ensure that a service that has been delegated
some rights by an individual cannot keep using these rights indefinitely
outside of the given context, nor delegate them onwards to other entities
unless permitted to do so.
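The contextual limits on delegated rights can be sketched as follows. The class Delegation, its fields and the two functions are hypothetical and serve only to illustrate the constraints named above (limited duration, specific resources, controlled onward delegation); an actual security infrastructure would rely on its own credential format.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Delegation:
    """Hypothetical record of rights delegated by an individual to a service."""
    delegator: str
    delegate: str
    resources: FrozenSet[str]
    expires_at: float            # e.g. the time at which the workflow ends
    may_redelegate: bool = False


def may_access(grant: Delegation, service: str, resource: str, now: float) -> bool:
    """A delegated right holds only for the named service, resources and period."""
    return (service == grant.delegate
            and resource in grant.resources
            and now < grant.expires_at)


def redelegate(grant: Delegation, new_delegate: str, now: float) -> Delegation:
    """Pass the rights onwards, but only if the original grant permits it."""
    if not grant.may_redelegate or now >= grant.expires_at:
        raise PermissionError("further delegation is not permitted")
    # The new grant is never broader than the one it is derived from.
    return Delegation(grant.delegator, new_delegate, grant.resources,
                      grant.expires_at, may_redelegate=False)


grant = Delegation("user", "service-1", frozenset({"service-2"}), expires_at=100.0)
```

In this sketch, a grant that has expired or that names other resources simply fails the check, so the first service cannot retain the individual's rights outside the given context.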
It needs to be borne in mind that delegation of access control and federation of iden-
tity are not novel security methodologies nor do they enhance the security capabilities
of a system. They merely provide a way to maintain the existing level of security in in-
dividual security domains while attempting to simplify the security requirements that
arise when complex interactions between these different domains occur.
4.2 Provenance Related Security Issues
In this section, we outline the security issues that we believe pertain
to our notion of provenance. We note, however, that not all of these issues are relevant
in the context of the software requirements (see Chapter 9), and the eventual security
architecture will only address those that are.
1. Access control to the provenance store. This is the primary security issue as the
provenance store is considered to be central to the logical architecture. While
the access control mechanisms utilised are situated in the context of the specific
requirements of the project, this notion of security is conceptually identical
to the general case of securing a database with multiple users.
2. Integrity and non-repudiation of p-assertions. Recording actors store p-assertions
created by asserting actors in the provenance store. In the event that the asserting
actor is not the recording actor, there is a need to ensure that information within
the p-assertion is not altered unintentionally or maliciously by either the record-
ing actor or provenance store. This can be achieved by having the asserting actor
sign the p-assertion it creates. The signature also serves the additional purpose of
ensuring that the asserting actor cannot deny responsibility for the creation of the
p-assertion in question. This can be necessary when legal or other requirements
mandate establishment of liability for the consequences arising from utilising
the information in a p-assertion. This issue is discussed further in Section 6.9.
3. Ascertaining asserter identity in a p-assertion. The structure for holding p-
assertions created by an asserting actor will also hold the identity of this actor
(Figure 6.10). By implication of the previous point we discussed, the asserter
identity should correlate with the identity associated with the signature on the
p-assertion, since only the asserting actor should sign the p-assertion. A check
can be done to ascertain whether this is true, and can be undertaken by either
the provenance store to which the p-assertion is recorded, or by the querying
actor retrieving the p-assertion in question.
4. Derivation of authorisation information relating to p-assertions. It is likely that
p-assertions will contain or be derived in some fashion from an existing piece of
data in the system. For example, an application actor with access to a database
may send a message containing an item from that database to another actor. This
item is likely to have certain access control restrictions enforced upon it within
the security domain of the database in question. When a p-assertion is created
for the transmitted message and recorded to the provenance store, appropriate
access control restrictions (or authorisations) must now be established for this
new entry to ensure that any future access to it is in accordance with the security
policies of the provenance store.
In some situations, it may be useful to relate the authorisation for the newly
recorded p-assertion in some way to the access control restrictions on the orig-
inal database item that the p-assertion is based upon. This effectively allows
for a more flexible specification of authorisations on p-assertions by taking into
account information other than that found in statically predefined security poli-
cies on the provenance store. A possible approach towards this end is for the
recording actor to submit additional information along with the p-assertion to be
stored. This additional information would be created by the asserting actor and
can then be utilised in an automated manner by the provenance store to generate
appropriate authorisations for the new p-assertion.
5. Context-based authorisation specifications. As we have seen in Chapter 3,
processing services add value to the query interfaces by further searching,
analysing and reasoning over recorded p-assertions. Some of the operations
that can be performed by a processing service have a well-defined functionality;
for example, comparing processes used to produce several data items. In order
to perform this operation, a certain set of p-assertions identified by certain crite-
ria will need to be retrieved from the provenance store. Another operation, for
example, verifying that a given execution was semantically valid, will require
the retrieval of another set of different p-assertions. Situations may arise where
it is useful to ensure that certain actors are authorised to access only the relevant
p-assertion subset necessary for a specific operation (or, more generally, any type
of context in which provenance representations can be used). This would require
an ability to express authorisations at this level, as well as some way to
Copyright @ 2005, 2006 by the PROVENANCE consortium
The PROVENANCE project receives research funding from the European Commission’s Sixth Framework Programme
37
Enabling and Supporting Provenance in Grids for Complex Problems
Contract Number: 511085
3. Ascertaining asserter identity in a p-assertion. The structure for holding p-
assertions created by an asserting actor will also hold the identity of this actor
(Figure 6.10). By implication of the previous point we discussed, the asserter
identity should correlate with the identity associated with the signature on the
p-assertion, since only the asserting actor should sign the p-assertion. A check
can be done to ascertain whether this is true, and can be undertaken by either
the provenance store to which the p-assertion is recorded to, or by the querying
actor retrieving the p-assertion in question.
4. Derivation of authorisation information relating to p-assertions. It is likely that
p-assertions will contain or be derived in some fashion from an existing piece of
data in the system. For example, an application actor with access to a database
may send a message containing an item from that database to another actor. This
item is likely to have certain access control restrictions enforced upon it within
the security domain of the database in question. When a p-assertion is created
for the transmitted message and recorded to the provenance store, appropriate
access control restrictions (or authorisations) must now be established for this
new entry to ensure that any future access to it is in accordance with the security
policies of the provenance store.
In some situations, it may be useful to relate the authorisation for the newly
recorded p-assertion in some way to the access control restrictions on the orig-
inal database item that the p-assertion is based upon. This effectively allows
for a more flexible specification of authorisations on p-assertions by taking into
account information other than that found in statically predefined security poli-
cies on the provenance store. A possible approach towards this end is for the
recording actor to submit additional information along with the p-assertion to be
stored. This additional information would be created by the asserting actor and
can then be utilised in an automated manner by the provenance store to generate
appropriate authorisations for the new p-assertion.
5. Context-based authorisation specifications. As we have seen in Chapter 3,
processing services add value to the query interfaces by further searching,
analysing and reasoning over recorded p-assertions. Some of the operations that
a processing service can perform have a well-defined functionality; for example,
comparing the processes used to produce several data items. In order to perform
this operation, a certain set of p-assertions identified by certain criteria
will need to be retrieved from the provenance store. Another operation, for
example verifying that a given execution was semantically valid, will require
the retrieval of a different set of p-assertions. Situations may arise where
it is useful to ensure that certain actors are authorised to access only the
subset of p-assertions necessary for a specific operation (or, more generally,
for any context in which provenance representations can be used). This would
require an ability to express authorisations at this level, as well as some way to
[Figure 4.1: Provenance store security architecture. Components shown: actor,
provenance store interfaces, identity validator, internal representation list,
credential server, authorisation engine, authorisation policy, access control
policy, derivation engine, trust mediator, remote interactor, database backend,
and the authorisation / access control component of the host system. Interactions
are labelled a to v; the legend marks components in a different security domain
and interactions with another security domain.]
is optional, and correlates with the third security issue in Section 4.2. If the
asserter identity is to be utilised in the access control decision, then it needs to
be mapped to a corresponding IR as well.
4. Formats the request into an appropriate representation for access control pur-
poses.
The first two functions are performed with help from an internal representation list
that specifies the appropriate mapping relationships, including roles.
The credential server (†) [GR-OTM.5, p. 129] fulfils the role of a trusted third
party holding identity-related information for all potential users of the
provenance system within a given security domain, as well as providing them with
suitable credentials and other related security tokens for authentication
purposes. The authorisation engine (†) [SR-6-1, p. 122] essentially performs the
access control functionality in two main ways, based on the authorisations
specified in the authorisation policy (†) [SR-6-1, p. 122] and the IR produced by
the identity validator:
• The request is granted or denied solely on the basis of the information from the
authorisation policy and the IR related to the identity of the requesting actor.
It is also possible that the IR of the asserting actor is taken into account in the
access control decision as well; we assume that this possibility exists, but use
the term IR to refer to the IR of the recording actor for the sake of brevity in the
remaining discussion. If granted, the requested operation is performed and the
appropriate acknowledgement or data item is returned directly to the requestor
without further intervention from the authorisation engine.
• The granting of the request may additionally be dependent on information con-
tained within the data item that the request is related to (such a condition would
be specified accordingly in the authorisation policy). For example, a read oper-
ation associated with an IR on a given p-assertion might be permitted only if the
p-assertion contained relevant information pertaining to that IR. In this case, the
p-assertion in question would have to be retrieved first and assessed accordingly
by the authorisation engine before a final decision can be made on granting or
denying the request.
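The two decision modes above can be illustrated with a minimal sketch. It is not part of the architecture specification: the class names, the shape of a policy entry and the `mentions_requester` condition are all hypothetical, and the sketch assumes that a request with no matching authorisation is denied.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PAssertion:
    """Hypothetical stand-in for a stored p-assertion."""
    key: str
    content: dict


@dataclass
class Authorisation:
    """One entry of a hypothetical authorisation policy."""
    ir: str          # internal representation (IR) of the requester
    operation: str   # e.g. "read" or "record"
    # Optional content condition; None means no data access is needed to decide.
    condition: Optional[Callable[[str, PAssertion], bool]] = None


class AuthorisationEngine:
    def __init__(self, policy, store):
        self.policy = policy  # list of Authorisation entries
        self.store = store    # dict: key -> PAssertion, standing in for the backend

    def decide(self, ir, operation, key):
        for rule in self.policy:
            if rule.ir != ir or rule.operation != operation:
                continue
            if rule.condition is None:
                # First way: the policy and the IR alone settle the request.
                return True
            # Second way: retrieve the p-assertion and assess its content first.
            item = self.store.get(key)
            if item is not None and rule.condition(ir, item):
                return True
        return False  # assumption: no matching authorisation means denial


def mentions_requester(ir, item):
    """Example condition: the p-assertion must refer to the requesting IR."""
    return ir in item.content.get("participants", [])


store = {"pa-1": PAssertion("pa-1", {"participants": ["ir:alice"]})}
policy = [
    Authorisation("ir:admin", "read"),
    Authorisation("ir:alice", "read", mentions_requester),
    Authorisation("ir:bob", "read", mentions_requester),
]
engine = AuthorisationEngine(policy, store)
```

In this sketch `ir:admin` is granted without the store being consulted, whereas the requests of `ir:alice` and `ir:bob` are only decided once `pa-1` has been retrieved and inspected.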
Depending on the nature of the authorisation engine, it may be necessary for the
assignment of a role to an IR, in the case of RBAC, to be performed by the
authorisation engine instead of the identity validator. In addition, it is possible
to employ either one or both of these two approaches to specifying authorisation:
• an identity / role is assumed to have no authorisations in the initial case, and
explicit authorisations have to be granted;
• an identity / role is assumed to have complete authorisation in the initial case,
and explicit restrictions have to be placed.
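As an illustration only, the two approaches can be sketched as a default combined with a set of explicit exceptions. The names `Default` and `is_permitted`, and the representation of exceptions as (identity or role, operation) pairs, are assumptions made for this example.

```python
from enum import Enum


class Default(Enum):
    DENY = "deny"    # no authorisations initially; grants are explicit
    ALLOW = "allow"  # complete authorisation initially; restrictions are explicit


def is_permitted(default, exceptions, ir, operation):
    """exceptions: set of (ir, operation) pairs that invert the default."""
    listed = (ir, operation) in exceptions
    if default is Default.DENY:
        return listed      # permitted only if explicitly granted
    return not listed      # permitted unless explicitly restricted


grants = {("role:curator", "read")}
restrictions = {("role:guest", "record")}
```

With `Default.DENY` and `grants`, only a curator may read; with `Default.ALLOW` and `restrictions`, a guest may do anything except record.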
The semantics and the granularity of the authorisation assertions within the autho-
risation policy will determine how fine grained and flexible access control can be in
the infrastructure. Ideally, authorisations should be specifiable at the level of individual
p-assertions, and could be refined to individual elements within a p-assertion if such
need arises. Current authorisation systems provide differing levels of semantic
expressibility and granularity for the authorisation policies they employ. In terms
of semantic expressibility, some authorisation systems allow access to be
restricted on the basis of additional characteristics such as user attributes,
time of day or processing time of a job, in contrast to simpler systems that only
permit user identity in the authorisation expression. If an existing system is to
be used as a building block for the provenance store security infrastructure, it
should therefore incorporate sufficient expressibility and granularity to satisfy
the security requirements of the project (Section 9.1.6).
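The difference in expressibility can be sketched as follows. This is a hypothetical illustration rather than the policy language of any particular authorisation system; the `Request` fields and both rule constructors are assumptions.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Request:
    user: str
    attributes: frozenset  # e.g. frozenset({"clinician"})
    at: time               # time of day at which the request is made


def identity_only_rule(allowed_users):
    # Simpler system: only the user identity appears in the expression.
    return lambda req: req.user in allowed_users


def attribute_time_rule(required_attribute, start, end):
    # More expressive system: a user attribute and the time of day are checked.
    return lambda req: (required_attribute in req.attributes
                        and start <= req.at <= end)


simple = identity_only_rule({"alice"})
rich = attribute_time_rule("clinician", time(8, 0), time(18, 0))

day = Request("bob", frozenset({"clinician"}), time(9, 30))
night = Request("bob", frozenset({"clinician"}), time(23, 0))
```

The identity-only rule cannot distinguish `day` from `night`, while the attribute and time rule grants the first and denies the second.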
Additionally, we note that the content-based authorisation described in Section 4.2
would require a higher level policy language to describe it, which would then have
to be translated into the lower level assertions of the authorisation policy (as
access control can only be meaningfully performed at this level). This feature is
not specified in the software requirements (Section 9.1.6), and merely represents
an optional enhancement to the actual security architecture to be implemented.
The access control policy (†) [SR-6-1, p. 122] is a higher level security policy that specifies the ways
in which the authorisation policy and/or internal representation list can be modified
by the components which access them. It also describes in a high level manner the
configuration of the provenance store from a security viewpoint and the protocols that
external entities (such as actors or other provenance stores) need to adhere to in order
to communicate with it. The access control policy, authorisation policy and internal
representation list would constitute the security policy of the provenance store as a
whole (Section 7.6). The database backend provides actual physical storage for the
p-assertions.
The trust mediator is used to support federation of authentication and/or
authorisation for the case where distributed provenance stores (Section 5.2) exist
in different security domains. Its role is to obtain security assertions or
credentials from other relevant entities in order to support the specific
federation methodology employed. Further details on the need for federation,
particularly with regard to distributed provenance stores, can be found in
Section 5.7.
These entities could be a trusted third party (such as the credential server) in the
local or remote security domain, or the trust mediator of another provenance store.
The gathered assertions or credentials can then be used locally (for example, to verify
other credentials received by the identity validator) or can be passed on to the remote
interactor. Like the trust mediator, the remote interactor is intended to interact with
external entities, but within a more general context rather than a security specific one.
The remote interactor uses the credentials provided by the trust mediator for secure
communication towards this end. The operational parameters for both the trust medi-
ator and remote interactor can be configured via the access control policy.
quently, a query to retrieve a group of related p-assertions may potentially require a
series of queries to the various provenance stores holding the desired p-assertions. We
discuss the security implications of this requirement in Section 5.7.
4.4 Security in Other Architecture Components
In the previous subsection, we presented and described the functioning of a security
architecture to protect the provenance store, a key component of the logical archi-
tecture. Here, we study the security considerations underlying interactions involving
other components of the logical architecture.
4.4.1 Between other components and the provenance store
The other components in the logical architecture that interact directly with the prove-
nance store will now require corresponding security functionality as well in order to
ensure their interactions are secured properly. We describe the nature of the required
functionality below for application services, management UIs and processing services.
1. A facility is required for accessing credentials that are to be submitted to the
identity validator in the provenance store. This can be provided as additional
libraries in the corresponding actor side libraries (Section 8.7) or as interfaces
that permit interoperation with external third party applications that provide
credential generating functionality. A straightforward example would be a keystore
manager application that generates and archives keys and certificates, and obtains
approval for these certificates from a certification authority (CA).
2. If a keystore or some other facility for storing cryptographically generated ma-
terial is to be used by the actor side libraries, it has to be secured appropriately
(e.g. located in a secure account, encrypted and contents accessible only by the
provision of a username/password combination).
3. A facility is required for accessing specific security mechanisms such as signing
or time stamping. This is necessary, for example, when the asserting actor needs
to sign the p-assertion it created (see related security issue 2).
4. For the case where authorisation information is to be submitted alongside
p-assertions, an interface must be provided as part of the domain specific
services that allows the retrieval of this information from the appropriate
locations (such as a local database). This interface should be congruent with the
specific format in which the authorisation information can be expressed.
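The signing facility of item 3 can be sketched as below, under a loudly stated substitution: a real deployment would more plausibly use public-key signatures backed by certificates approved by a CA, whereas this sketch uses an HMAC over a shared secret so that it runs with the Python standard library alone. The function names, the dictionary layout of the p-assertion and the key are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(p_assertion: dict) -> bytes:
    # Serialise deterministically so signer and verifier hash identical bytes.
    return json.dumps(p_assertion, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")


def sign_p_assertion(p_assertion: dict, key: bytes) -> dict:
    # HMAC-SHA256 stands in for the signature mechanism (an assumption).
    tag = hmac.new(key, canonical_bytes(p_assertion), hashlib.sha256).hexdigest()
    return {"p_assertion": p_assertion, "signature": tag}


def verify_p_assertion(signed: dict, key: bytes) -> bool:
    expected = hmac.new(key, canonical_bytes(signed["p_assertion"]),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signed["signature"])


# Hypothetical key; per item 2 it would be held in a secured keystore.
key = b"demo-key-held-in-the-actor-keystore"
signed = sign_p_assertion({"asserter": "actor-A", "content": "sent item 17"}, key)
```

Verification fails if either the p-assertion or the key differs from those used at signing time, which is the property the asserting actor relies on when it signs what it created.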
4.4.2 Intermediate components
By intermediate components, we refer to components that are not directly accessible
by the user. Such components may themselves be invoked or accessed by other com-
ponents rather than by the user, and may interact directly with the provenance store.
For example, a user may use a presentation UI to access a presentation service which
in turn accesses the provenance store. In the application domain, a user may access an
application UI that in turn invokes a chain of other application services before a final
invocation is made to the provenance store. In such cases, the intermediate component
may require authentication of incoming requests to it. It is possible to reuse the
security architecture developed for the provenance store for this particular
component as well. The primary differences, with reference to Fig. 4.1, are:
1. As the incoming request is to the intermediate component, it is unlikely to be a
p-assertion, but rather a generic data item (which may contain a p-assertion)
submitted in accordance with the schema of the interface to this intermediate
component.
2. The derivation engine will not be used to create new authorisation information
as the submitted data item is not intended to be stored. However, it may be used
in performing some security-related functionality on the data item, for example
encrypting or filtering out a certain portion of it. This will be accomplished
in conjunction with the security policy dictating the operation of this
intermediate component.
3. Once the request is approved by the authorisation engine, it is sent off (l) to
some internal function of the intermediate component for further processing,
rather than to a database backend (as is the case for the provenance store). Once
this processing is complete, a result is returned to the invoking actor (m) and /
or a further invocation is made to another component.
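The three differences listed above can be sketched as a single request handler. Everything here is a hypothetical simplification: credentials are plain tokens, the authorisation policy is a set of (IR, operation) pairs, and the derivation engine is reduced to filtering named fields out of the data item.

```python
class IntermediateComponent:
    def __init__(self, known_credentials, permitted, hidden_fields, process):
        self.known_credentials = known_credentials  # credential -> IR
        self.permitted = permitted                  # set of (IR, operation)
        self.hidden_fields = hidden_fields          # fields filtered out by policy
        self.process = process                      # internal function, cf. (l)

    def handle(self, credential, operation, data_item):
        # Identity validation: map the credential to an internal representation.
        ir = self.known_credentials.get(credential)
        if ir is None:
            raise PermissionError("authentication failed")
        # Authorisation engine: check the request against the policy.
        if (ir, operation) not in self.permitted:
            raise PermissionError("not authorised")
        # Derivation engine: filter the data item rather than derive
        # new authorisation information, since nothing is stored.
        filtered = {k: v for k, v in data_item.items()
                    if k not in self.hidden_fields}
        # Hand over to internal processing; the result goes back to the
        # invoking actor, cf. (m).
        return self.process(filtered)


component = IntermediateComponent(
    known_credentials={"tok-1": "ir:alice"},
    permitted={("ir:alice", "compare")},
    hidden_fields={"patient_name"},
    process=lambda item: sorted(item),
)
```

A call such as `component.handle("tok-1", "compare", {"patient_name": "x", "run": "r1"})` reaches the internal function with only the `run` field, while an unknown credential or an operation outside the policy raises `PermissionError`.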
4.4.3 Delegation of identity or access control
The need to delegate access control may arise if the intermediate component described
previously exists in a separate security domain from both the user and the provenance
store. Consider again the logical architecture in Fig. 4.1 and assume that a user is
performing a query on the provenance store through the presentation UI and a pro-
cessing service. Assume now three separate security domains: one containing the user
and the presentation UI, another the processing service, and the third encapsulating the
provenance store.
When the presentation UI under the user's control sends a request to the
presentation service, an appropriate credential is submitted by the user for
purposes of authentication. If the request is authorised, the presentation service
will then decide the type and number of provenance store queries that need to be
made in order to satisfy the request. When making these queries, the presentation
service needs to present suitable
authentication credentials to the provenance store. There are essentially two ways to
proceed here:
• Authenticate to the provenance store using the credentials of the presentation ser-
vice, whereupon subsequent authorisation decisions will be based on the identity
or associated role of the presentation service. This approach requires the presen-
tation service to be trusted and known to the provenance store security admin-
istrators, and that it has the appropriate authorisation to access a wide enough
pool of p-assertions to satisfy requests from all potential users (or at least users
that are known within the security domain of the presentation service).
• Authenticate to the provenance store on behalf of the original user. This
approach requires that a form of delegated identity or access control credential be
created by the presentation service, possibly in negotiation with the presentation
UI. The identity validator of the provenance store must then be able to recognise
and process this delegated credential accordingly, and infer the identity or
associated role of the original user. Subsequent authorisation decisions are then
made on the basis of the user's identity, and may also need to take into account
additional constraints specified in the delegated credential itself.
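The two alternatives can be contrasted in a short sketch. It is illustrative only: the `Credential` fields are assumptions, and the cryptographic verification that a real delegated credential would need is deliberately omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Credential:
    subject: str                                 # who presents the credential
    on_behalf_of: Optional[str] = None           # original user, if delegated
    allowed_operations: frozenset = frozenset()  # constraints set at delegation


def effective_identity(cred: Credential) -> str:
    """Identity on which the store bases its authorisation decision."""
    return cred.on_behalf_of if cred.on_behalf_of is not None else cred.subject


def authorise(cred, operation, policy):
    """policy: dict mapping an identity to the set of operations it may perform."""
    identity = effective_identity(cred)
    if operation not in policy.get(identity, set()):
        return False
    # A delegated credential may further narrow what the delegate can do.
    if cred.on_behalf_of is not None and operation not in cred.allowed_operations:
        return False
    return True


policy = {"presentation-service": {"query"}, "alice": {"query", "record"}}

# First way: the presentation service authenticates as itself.
as_service = Credential("presentation-service")
# Second way: the service presents a credential delegated by the user.
delegated = Credential("presentation-service", on_behalf_of="alice",
                       allowed_operations=frozenset({"query"}))
```

Here `delegated` is evaluated against the authorisations of `alice`, but may only query, because the delegation itself excludes recording even though `alice` holds that authorisation.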
The first approach is suitable if all potential users can only ever make queries
through the medium of a presentation service. Here, the responsibility of checking
authorisations for the actual users is effectively offloaded from the provenance store to
the various presentation services in the system. If the number of presentation services
known within the provenance store security domain is significantly smaller than the
potential number of users, then the overhead of authorisation is correspondingly
reduced, as there is now only a need to check these presentation services.
There are, however, some drawbacks with this approach. Firstly, authorisation
policies are likely to be duplicated between many presentation services, as it
is unlikely that authorisation for a specific user will differ between different services.
Accordingly, changes or additions to these authorisation policies must then also be
propagated between the different copies on all services. Lastly, application services
storing p-assertions through the recording interface must now provide authorisation
information pertaining to presentation services rather than specific users. This may
necessitate additional overhead in communication between application services and
presentation services.
The second approach therefore appears to be a more feasible one. There will,
however, be an overhead associated with communication between the presentation UI
and the presentation service in order to create an appropriate delegation
credential. Depending on the delegation act itself, there may also be a need for
further communication between the security architecture of the provenance store
and the user / presentation UI during the authentication or authorisation process.
This might happen, for example, when delegating access control
Copyright @ 2005, 2006 by the PROVENANCE consortium
The PROVENANCE project receives research funding from the European Commission’s Sixth Framework Programme
47
1. Mutual authentication and secure transport of p-assertions between two appli-
cation actors. Both activities have to be handled or negotiated between the two
actors involved in the production of interaction p-assertions. This issue is en-
tirely within the responsibility of the application domain.
2. Anonymisation of data. Some applications (for example, medical applications
using sensitive patient data) require the exchange of patient-related information
during the interaction of services. Legal restrictions mandate that data of this
nature be anonymised (i.e. patient identity is removed) and depersonalised (i.e. the
identity of the patient cannot be traced based on other information in the record).
This requirement is also outside the scope of the security architecture.
3. Support for multiple authentication schemes. To enhance security in some
application scenarios, authentication requires a combination of security credentials
in order to be successful. The identity validator of the component in question
must then be able to support the use of multiple security credentials. There is no
explicit requirement for multiple authentication to be handled by the provenance
store, although it could be employed in the application domain itself, which
then assumes full responsibility for it.
4. Specifying policies in an RBAC fashion. It may be useful for authorisation to
be performed in an RBAC (role-based access control) fashion in the provenance
store; the authorisation engine, policy and internal representation list would
then need to incorporate the necessary semantics to express RBAC-type assertions.
5. Long-term storage of process documentation. If a third-party database provider
is used, then process documentation may need to be encrypted or signed by
the remote interactor prior to sending it off for storage. If this documentation
is intended to be stored for a relatively long period (e.g. 100 years), the
original cryptographic keys and/or algorithms are likely to become outdated or
to expire. Such issues must be catered for in some way, for example, by having
a key archival facility and by re-signing or re-encrypting provenance information
periodically over the intended storage duration.
6. Expiry of certificates. For workflows that run over a relatively long period, it is
possible that certificates could expire in the middle of a workflow run. If an actor
uses a certificate as part of the authentication process to the provenance store,
then expiry of this certificate would mean that submission invocations that were
once accepted within the context of this workflow have now become invalid. To
avoid situations like this, proper management of certificates and keys at the actor
end is called for (for instance, by estimating the workflow duration against the
certificate lifetime prior to commencing a workflow). Alternatively, the provenance
store security policy could be articulated appropriately to avoid this situation.
For example,
Figure 5.1: SeparateStore Pattern Diagram (figure labels: Actor, Provenance Store, Record P-Assertions)
Solution A separately deployed store, referred to as a provenance store in Chapter 2,
is introduced to retain information about an application actor’s interactions and
states. An actor records p-assertions in a provenance store so that it does not
have to retain this information itself. A provenance store should have the following
properties:
1. It should remain available for longer than the application actors that submit
p-assertions to it. This property allows p-assertions recorded by an application
actor to be accessed after the application actor has become unavailable.
2. It should provide a well-defined interface for the recording of p-assertions by an
application actor.
3. It should provide a query capability to retrieve p-assertions, which makes the
p-assertions available to querying actors.
4. It should provide a mechanism for managing the stored p-assertions.
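The four properties above can be illustrated with a minimal in-memory sketch. The names used here (PAssertion, ProvenanceStore, record, query, count) are hypothetical and are not prescribed by this architecture; a real store would be a separately deployed, persistent service.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class PAssertion:
    """Hypothetical minimal p-assertion: asserter, interaction and content."""
    asserter: str
    interaction_id: str
    content: str


@dataclass
class ProvenanceStore:
    """In-memory stand-in for a separately deployed provenance store."""
    _assertions: List[PAssertion] = field(default_factory=list)

    def record(self, assertion: PAssertion) -> None:
        # Property 2: a well-defined recording interface.
        self._assertions.append(assertion)

    def query(self, predicate: Callable[[PAssertion], bool]) -> List[PAssertion]:
        # Property 3: a query capability for querying actors.
        return [a for a in self._assertions if predicate(a)]

    def count(self) -> int:
        # Property 4: a (deliberately tiny) management operation.
        return len(self._assertions)


store = ProvenanceStore()
store.record(PAssertion("actor-1", "interaction-42", "sent request"))

# Property 1: the store object outlives the actor that recorded into it, so
# the p-assertion can still be retrieved once "actor-1" is unavailable.
found = store.query(lambda a: a.interaction_id == "interaction-42")
```

Because the actor only calls record, it does not need to retain the p-assertion itself once the call has returned.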
5.1.2 ContextPassing Pattern
Diagram See Figure 5.2.
policies. Furthermore, policies are necessary to determine whether provenance stores
should push or pull staged data. Policies for data staging are defined in Definition 7.9,
page 102.
Data staging implies the need to copy or move data between various provenance
stores. A provenance store can copy data to a target provenance store by first
querying itself (using the query interface) for the particular p-assertions to be
copied and then recording those p-assertions in the target provenance store (using
the recording interface). In this case, the provenance store is acting as a recording
actor and not as an asserting actor. If data must be moved rather than copied, the
provenance store may use implementation-specific functionality for deleting
p-assertions. Likewise, implementation-specific functionality may be implemented
to update linking information (see Section 7.4.2).
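The copy step can be sketched as follows, as an illustration only. The class and method names (ProvenanceStore, record, query, stage_to) and the dictionary representation of a p-assertion are assumptions made for this example, not part of the architecture.

```python
from typing import Callable, Dict, List

PAssertion = Dict[str, str]  # hypothetical representation of a p-assertion


class ProvenanceStore:
    def __init__(self, name: str) -> None:
        self.name = name
        self._assertions: List[PAssertion] = []

    def record(self, assertion: PAssertion) -> None:
        """Recording interface."""
        self._assertions.append(dict(assertion))

    def query(self, predicate: Callable[[PAssertion], bool]) -> List[PAssertion]:
        """Query interface."""
        return [dict(a) for a in self._assertions if predicate(a)]

    def stage_to(self, target: "ProvenanceStore",
                 predicate: Callable[[PAssertion], bool]) -> int:
        """Copy matching p-assertions to target and return how many were copied."""
        selected = self.query(predicate)  # the store queries itself
        for assertion in selected:
            # The content, including the original asserter, is passed on
            # unchanged: this store acts as a recording actor only.
            target.record(assertion)
        return len(selected)


local = ProvenanceStore("local")
central = ProvenanceStore("central")
local.record({"asserter": "actor-1", "interaction": "i-1", "content": "request"})
local.record({"asserter": "actor-2", "interaction": "i-2", "content": "response"})

copied = local.stage_to(central, lambda a: a["interaction"] == "i-1")
```

Deletion from the source store is left out of the sketch on purpose, since the text treats it as implementation-specific functionality.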
The copying and deletion of p-assertions between provenance stores affect the
querying of those p-assertions: they become available for search and retrieval from a
different set of stores. For a querying actor expecting to find process documentation
in a store from which it has been deleted, there should be a mechanism by which it
can find the new location of that documentation. This is a data management issue
outside the scope of this architecture (we assume that, in general, p-assertions are
not deleted from stores), but suggested mechanisms are given in Section 7.4, such as
the sending of notifications when p-assertions are moved to another store.
Querying over staged data is challenging. For example, data may be in transit
when a query is issued to a provenance store. If the data necessary to answer a query
is in transit to another provenance store, it is not readily available and the query
cannot be answered. One solution to the problem is to use notifications (Section 7.4)
to inform querying actors when the transfer has finished and where the data now
resides. The remote interactor of the security architecture (Section 4.3.1)
can be used in the sending of these notifications, as well as supporting any security
requirements implicit in such notifications.
5.4 References
In many applications, the size of the messages being transferred between actors may be
so large or bandwidth costs so high that the actor cannot record a copy of the message
in the provenance store. In other instances, actors may wish to record multiple
p-assertions that contain the same data but record that data only once. Both these
problems can be solved through the use of references. References differ from links:
links point to provenance stores, whereas references point to data.
5.4.1 By-Value versus By-Reference recording
To address the situation where an actor does not wish to record data by-value in a p-
assertion, we introduce the notion of recording a reference to a message outside the
The security policy in such an instance should therefore dictate that the client actor
sign the IDs it generates (or the context containing such IDs) that it sends to the
service actor (see Rule 8.11). This signature activity could be encompassed within the
mandate of the protocol governing a secure interaction between the client and service
actors. On a similar note, we observe that since the view link is crucial in locating a
related p-assertion for a given interaction, its non-repudiation can also be achieved
by ensuring that the recorded link is signed as well. If the recorded link is part of the
contents of a recorded p-assertion, then a signature on the entire content will suffice.
Signatures on p-assertions are discussed in Section 6.9.
Furthermore, if p-assertions are copied or moved between stores that are located
in different security domains (i.e. for the staging of data), the access control
restrictions on them in their new destinations need to be defined. In the simplest
case, the newly moved or copied p-assertions retain the same access control
restrictions that were associated with them in their original domain. These
restrictions can be provided as authorisation information along with the p-assertions
as they are recorded to their new destinations, where they can be processed by the
derivation engine in the manner described in Section 4.3.1.
If the authorisation information involves identities from the originating domain that
are currently unknown in the destination domain, then this identity information needs
to be communicated between the trust mediators of both domains. The communication
can be performed when the p-assertions are initially recorded, or at a later time when
a request is made to the provenance store from an entity that is not recognisable in the
new provenance store domain. The process of moving p-assertions between different
stores also needs to ensure that the transfer medium is secure (if such a requirement
is present), and that both stores are properly authenticated to each other prior to the
movement.
Security considerations can also arise in the use of references, where a p-assertion
contains a unique identifier for data stored elsewhere rather than the actual data itself.
In this case, both the asserting and querying actors may require some sort of assurance
that any data eventually retrieved by resolving the identifier in the p-assertion
is actually the same piece of data that was referred to by the identifier at the time of
p-assertion creation by the asserting actor. This can be accomplished by having the
asserting actor include a digest of the data being referred to along with the identifier
of that data in the p-assertion. The signature on the p-assertion assures that the digest
and the identifier will not be changed. Subsequently, if the referenced data is retrieved
later, its digest can be computed and compared against the digest within the p-assertion
as a check of its integrity. Digests can be provided through the reference-digest
documentation style (Definition 6.14 page 74, Section 6.5).
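The integrity check described above can be sketched as follows. This is an illustration under stated assumptions: SHA-256 is chosen arbitrarily as the digest algorithm, the identifier is a made-up example value, and the function names are hypothetical. The signature on the p-assertion, which protects the identifier and digest themselves, is not shown.

```python
import hashlib


def digest_of(data: bytes) -> str:
    """Digest of the referenced data (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


def make_reference(identifier: str, data: bytes) -> dict:
    """Built at p-assertion creation time: identifier plus digest of the data."""
    return {"identifier": identifier, "digest": digest_of(data)}


def verify_retrieved(reference: dict, retrieved: bytes) -> bool:
    """Run after resolving the identifier: recompute and compare the digest."""
    return digest_of(retrieved) == reference["digest"]


original = b"large message payload"
reference = make_reference("urn:example:data:1", original)  # hypothetical identifier

unchanged_ok = verify_retrieved(reference, original)
tampered_ok = verify_retrieved(reference, b"altered payload")
```

Here unchanged_ok is True and tampered_ok is False: data that differs from what was referred to at creation time fails the comparison.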
When references are being used, it may also be possible for the provenance store
to resolve the references itself (Section 5.4). Retrieval of the remotely stored data will
be achieved by the remote interactor of the security architecture, with the required
credentials being obtained by the trust mediator (Section 4.3.1).
5.8 Conclusion
This chapter presented the architecture facets that address problems of scalability in
provenance systems. It presented three deployment patterns, which identify
communication between key architecture roles and which developers can apply
to deploy their provenance system in a distributed manner to promote scalability. It
then discussed how provenance retrieval could be enabled across a set of distributed
provenance stores via linking, which supports the deployment of multiple distributed
provenance stores. Data staging was presented as a mechanism for minimising the use
of network bandwidth and for allowing p-assertions to be recorded in a timely manner
by recording actors. The chapter then presented two notions of recording by reference:
referring to p-assertions already recorded, or referring to data stored at
some other location from within a p-assertion. References support scalability by
avoiding the duplication of data and allowing data to be kept in a single location.
P-Assertion Templates were then described as a mechanism to offload the generation
of p-assertions to provenance stores, reducing the computational load on asserting
actors. We then discussed how large query results could be handled. Finally, we
addressed relevant security issues that pertain to these scalability solutions.
This chapter has presented architectural solutions to scalability for provenance
systems. However, implementations of the architecture can also address the above
issues in technology-specific ways. For example, provenance stores can use clustering
[IGM05] and database management technologies to handle large loads. Likewise, in
Web Service implementations, where XML is the common format, binary/compressed
data representation should also be supported. These implementation-specific
scalability solutions are supported by the flexibility and design decisions prescribed
in this architecture.
Such a style is the simplest documentation style as no transformation actually hap-
pens and the data is copied to the p-assertion in a “verbatim” way.
Section 5.4 introduces the notion of reference in order to provide scalable handling
of large p-assertions. Such references are handled by the reference documentation
style, which we now define.
Definition 6.13 (Reference Documentation Style) The reference documentation style
denotes a transformation of a message by which a part of (or the whole of) its contents
has been replaced by a reference to the location where the actual contents can be
found. □
Section 5.7 discussed the security implications of the reference documentation style:
to ensure that the querier of a p-assertion who dereferences a reference accesses
the same data as intended by the recorder of the p-assertion, a digest may be associated
with the reference.
Definition 6.14 (Reference-Digest Documentation Style) The reference-digest doc-
umentation style denotes a transformation of a message by which a part of (or the
whole of) its contents has been replaced by a reference to the actual location where it
can be found and a digest of the substituted data. □
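As an illustration of Definitions 6.13 and 6.14, the following sketch applies both styles to a message held as a dictionary. The message layout, the function name reference_style, the example location and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
from typing import Any, Dict


def reference_style(message: Dict[str, str], part: str, location: str,
                    with_digest: bool = False) -> Dict[str, Any]:
    """Return a copy of message in which `part` is replaced by a reference.

    with_digest=False sketches the reference documentation style;
    with_digest=True sketches the reference-digest documentation style.
    """
    replacement: Dict[str, str] = {"reference": location}
    if with_digest:
        # Digest of the substituted data, kept next to the reference.
        substituted = message[part].encode("utf-8")
        replacement["digest"] = hashlib.sha256(substituted).hexdigest()
    documented: Dict[str, Any] = dict(message)  # the original is left untouched
    documented[part] = replacement
    return documented


message = {"header": "invoke-service", "body": "x" * 1000}
location = "http://example.org/data/1"  # hypothetical location of the contents

by_reference = reference_style(message, "body", location)
by_reference_digest = reference_style(message, "body", location, with_digest=True)
```

Only the selected part is transformed; the remaining parts of the message are copied as they are.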
Section 5.4.2 introduced the notion of allowing actors to record references to p-
assertions that have already been recorded in a provenance store. This allows actors to
avoid recording multiple duplicate p-assertions. To support these references that point
internally to a provenance store, we introduce the following documentation style.
Definition 6.15 (Internal Reference Documentation Style) The internal reference doc-
umentation style denotes a transformation of a message by which a part of (or the
whole of) its contents has been replaced by a global p-assertion key, which refers to
another p-assertion that contains the actual data. □
This documentation style allows actors to avoid duplicating data unnecessarily.
The style is a special case of the reference documentation style.
Section 9.4 introduces the requirement for anonymisation of patient identifiers
from the OTM/EHCR application (GR-OTM.1, GR-EHCR.4). These application
requirements are handled by the anonymous documentation style, which we now define.
Definition 6.16 (Anonymous Documentation Style) The anonymous documentation
style denotes a transformation of a message by which a part of (or the whole of) its
contents has been replaced by an “anonymous” identifier. This identifier hides the
actual data without losing the link to them. □
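One way to picture an anonymous identifier that hides data while preserving the link is a private lookup table kept by the asserting actor. The sketch below is only an illustration under that assumption: the Anonymiser class, its method names and the "anon-" prefix are hypothetical and are not part of the architecture.

```python
import secrets

class Anonymiser:
    """Hypothetical helper held by the asserting actor, not by the provenance store."""

    def __init__(self):
        self._by_value = {}  # actual data -> anonymous identifier
        self._by_id = {}     # anonymous identifier -> actual data

    def anonymise(self, value: str) -> str:
        # Reuse the identifier for a value already seen, so the link is kept.
        if value not in self._by_value:
            anon_id = "anon-" + secrets.token_hex(8)
            self._by_value[value] = anon_id
            self._by_id[anon_id] = value
        return self._by_value[value]

    def resolve(self, anon_id: str) -> str:
        # Only a party with access to the table can recover the actual data.
        return self._by_id[anon_id]
```

The provenance store would only ever see the anonymous identifier, while an authorised party holding the table could still resolve it.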
Consider the case of medical data for example, where information for each patient
is kept. It may be necessary to hide the patient name from the provenance store due
to medical data privacy policies. In order to support this requirement, the anonymous
that this encryption documentation style is unrelated to the encryption discussed in
Section 4.3.2. In that case, encryption is applied to secure the communication
channel between the recording actor and the provenance store.
All previous documentation styles denote an atomic transformation (which is not
described in terms of internal steps). We now define the means by which several
transformations can be applied to a single message.
Definition 6.19 (Composite Documentation Style) A composite documentation style
denotes that more than one atomic documentation style has been applied to a message. □
The atomic documentation styles described above can be used in any combination
for the same message: several transformations may be applied either to different
parts of the message or to the same part of it. The resulting message consists of
the composition of all the transformations.
Documentation styles allow actors to express the transformations they have performed
on messages when asserting interaction p-assertions for those messages. Documentation
styles are mandatory for interaction p-assertions because querying actors need to
be able to determine whether two interaction p-assertions are matching (see 2.6).
Using the documentation styles of the two interaction p-assertions, a querier can
regenerate the original message from each p-assertion and determine whether they match,
irrespective of how the message was transformed when being asserted. In the case of
actor state p-assertions, there is no application-independent notion of “matching” actor
state p-assertions. However, we allow the documentation style that has been applied in
asserting an actor state p-assertion to be declared along with the state (see below).
Relationship p-assertions do not have documentation styles: all the fields in a relationship
p-assertion are necessary so that the provenance query functionality can be performed
in an application-independent manner.
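The role of documentation styles in matching can be sketched as follows. This is a simplified illustration, not a definitive implementation: it assumes that each style is applied to the whole message rather than to parts of it, that every declared style is invertible by the querier, and that a p-assertion is a dictionary with hypothetical "styles" and "content" fields. Only the verbatim and reference styles are modelled, and LOCATIONS is a hypothetical stand-in for wherever references point.

```python
# Hypothetical location store used to dereference the reference style.
LOCATIONS = {}

def verbatim_invert(content):
    return content  # no transformation was applied

def reference_invert(content):
    return LOCATIONS[content]  # content is a reference; fetch the actual data

INVERTERS = {"verbatim": verbatim_invert, "reference": reference_invert}

def regenerate(p_assertion: dict) -> str:
    """Undo the declared styles, last applied first, to recover the message."""
    content = p_assertion["content"]
    for style in reversed(p_assertion["styles"]):
        content = INVERTERS[style](content)
    return content

def matching(sender_pa: dict, receiver_pa: dict) -> bool:
    """Two interaction p-assertions match if they regenerate the same message."""
    return regenerate(sender_pa) == regenerate(receiver_pa)
```

A composite documentation style corresponds here to a "styles" list with more than one entry, which regenerate unwinds in reverse order.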
6.6 Actor State P-Assertion Modelling
Actor state p-assertions, as defined in Definition 2.8, are assertions made by an actor
about its internal state in the context of a specific interaction. Each actor in an
interaction sends or receives a message, so an actor state p-assertion asserts something
about the state of the actor just before or just after it sent or received the message. For
example, a service with an incoming message buffer may assert the state of its buffer
just before and after receiving a message. Often, after an actor receives a message, it
performs an execution that the message has triggered and, similarly, before sending a
message, it performs an execution that resulted in that message. Therefore, a common
subset of actor state p-assertions gives details of the execution that took place just
after receiving or just before sending a message. For example, a service may assert the
computational resources allocated to an execution. As another example, the actor state
may name the workflow of which the interaction was a part.
Figure 6.9: The P-Structure
The p-structure is organised as a hierarchy. At the top level of the hierarchy are
InteractionRecords. Each record encapsulates all the p-assertions and identifiers re-
lated to one interaction. The choice of interaction records as the chief items in the
p-structure comes from the idea that interactions are the core actions of a process.
Each InteractionRecord is identified by an interaction key, as shown in Figure 6.1. The
interaction key distinguishes one InteractionRecord from all others and is provided
by the asserting actor and not the provenance store. Therefore, no contact with the
provenance store is required in order to create p-assertions.
In the p-structure hierarchy, we find two Views under the InteractionRecord. One
View contains the p-assertions from the sender in the interaction, while the other View
contains those from the receiver. A View has the following structure, as shown in
Figure 6.10. Every View has an asserter (†) [GR-OTM.9, p. 130], which is the identity of
the actor asserting a set of p-assertions. A View can contain several interaction
p-assertions (where there is more than one, we would expect different documentation
styles to be used for the same message), several actor state p-assertions, several
relationship p-assertions, and a set of ExposedInteractionMetaData. These components
make explicit, at a high level, the information held in the InteractionMetaData of the
p-assertions stored within the InteractionRecord, and can be used to facilitate the
location of p-assertions within the p-structure. A View also contains a set of AnyElements
that can be instantiated for specific purposes, such as ViewLinks for linking different
views of an interaction when they are stored in different provenance stores (as
described in Chapter 5).
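The hierarchy described above can be sketched as a small data model. The class and field names below follow the text where possible (InteractionRecord, View, asserter), but the Python layout, the use of plain dictionaries for p-assertions, and the record_for helper are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class View:
    asserter: str  # identity of the actor asserting this set of p-assertions
    interaction_p_assertions: List[dict] = field(default_factory=list)
    actor_state_p_assertions: List[dict] = field(default_factory=list)
    relationship_p_assertions: List[dict] = field(default_factory=list)
    exposed_interaction_metadata: List[dict] = field(default_factory=list)
    any_elements: List[dict] = field(default_factory=list)  # e.g. ViewLinks

@dataclass
class InteractionRecord:
    interaction_key: str  # provided by the asserting actor, not by the store
    sender_view: Optional[View] = None
    receiver_view: Optional[View] = None

@dataclass
class PStructure:
    records: Dict[str, InteractionRecord] = field(default_factory=dict)

    def record_for(self, interaction_key: str) -> InteractionRecord:
        """Return the record for a key, creating an empty one on first use."""
        return self.records.setdefault(
            interaction_key, InteractionRecord(interaction_key)
        )
```

Because the interaction key comes from the asserting actor, record_for never has to allocate an identifier; it only indexes by the key it is given.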
Figure 6.11: Model for a secured interaction p-assertion
Figure 6.12: Model for a secured actor state p-assertion
Figure 6.13: Model for a secured relationship p-assertion
Figure 7.2: Record acknowledgement message
tuple in the acknowledgement message because the message layer can identify what
request a response relates to. Finally, both synch ack and ack also contain ERROR to
allow application-specific error messages to be passed.
The recording functionality of the provenance store lets actors record p-assertions
about their interactions, state and relationships between them. These p-assertions are
uniquely identifiable.
7.2 Provenance Query Interface
The provenance store supports two query interfaces: a provenance query interface
which allows querying actors to retrieve the provenance of application entities, and a
process documentation query interface, by which the content of identified p-assertions
can be retrieved. In this section, we specify the functional requirements of the prove-
nance query interface, while the process documentation query interface is specified in
the next section.
The provenance query interface (†) [SR-1-2, p. 115] accepts a provenance query request
and responds with provenance query results. A provenance query request defines a search
for the provenance of an entity at a given instant, and provenance query results
represent the provenance of that entity at that instant. The reason that the provenance of
an entity must begin at a specified instant in time is that the entity may be different at
different instants in its lifetime, and so the processes by which it came to be in those
states (its provenance) are also different. Because p-assertions may be recorded any
time after the interaction that the assertion is about, there is no realistic way to define
the instant “now”: a provenance store or stores can never know whether it has docu-
mentation on the latest version of an entity, nor does it necessarily know what the latest
version of an entity is.
A provenance query request is made up of a query data handle and a relationship
target filter, defined below and depicted in Figure 7.3.
Definition 7.1 (Query Data Handle) A query data handle is a search over the contents
of a provenance store in order to find the record of the entity, at a given instant,
whose provenance the querying actor wishes to find. □
Definition 7.2 (Relationship Target Filter) A relationship target filter is a set of
criteria by which the querying actor specifies whether any given entity in the process
documentation should be included in the query results. By this mechanism, the
querying actor can scope the provenance query results. □
Figure 7.3: Provenance Query Request model
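The two parts of a provenance query request can be sketched as a pair of simple types. This is a minimal sketch under stated assumptions: the field names, the representation of the relationship target filter as a Python predicate, and the scope helper are hypothetical, and the way a store actually evaluates the filter while building results is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class QueryDataHandle:
    search: str           # search expression locating the entity at an instant
    search_language: str  # e.g. "XPath"; supported languages vary by store

@dataclass
class ProvenanceQueryRequest:
    query_data_handle: QueryDataHandle
    # Criteria deciding whether a given entity belongs in the query results.
    relationship_target_filter: Callable[[dict], bool]

def scope(candidates: Iterable[dict], request: ProvenanceQueryRequest) -> List[dict]:
    """Keep only the entities accepted by the relationship target filter."""
    return [e for e in candidates if request.relationship_target_filter(e)]
```

For instance, a querying actor uninterested in logging activity could pass a filter that rejects every entity asserted by a (hypothetical) "logger" actor.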
7.2.1 Query Data Handles
In order for a querying actor to ask provenance stores the question “What is the prove-
nance of entity E at instant T?”, the actor must identify the entity in a way that the
provenance stores can interpret. The identification used is called a query data handle.
From the application perspective, a query data handle identifies an application entity at
a given instant. From the provenance store perspective, a query data handle identifies
a search for p-assertion data items within the process documentation.
A query data handle is made up of the following two parts:
• A search over the p-structure for instants at which the entity may occur.
• A search over the contents of interaction or actor state p-assertions to retrieve
the data items which are documentation of the entity.
Because interaction and actor state p-assertions are also part of the p-structure,
these parts can be combined into a single search over the contents of a provenance
store, represented by a p-structure. We will look at how the querying actor specifies
each part of the search below.
Identifying Instants
Following the formulation of a system as being a collection of distributed interacting
actors, described in Chapter 2, there are three ways in which to identify a documented
instant in the past:
• The instant at which an actor sent a message.
• The instant at which an actor received a message.
• The instant at which an actor accessed an asserted part of its state, i.e. the instant
to which an actor state refers.
These instants appear in the p-structure as follows:
• An interaction p-assertion in the sender’s view of an interaction.
• An interaction p-assertion in the receiver’s view of an interaction.
• An actor state p-assertion whose content specifies to which instant it applies, e.g.
just before or just after sending or receiving a message.
The identities of the actors involved in an interaction are apparent in the message
source, message sink and asserter elements of the p-structure.
Identifying Data Items
The entity of which the querying actor wishes to find the provenance must be apparent
in the interaction or actor state p-assertions in order for the provenance store to find
it. The entity may not always be present as an exact copy of data: it may appear as a
reference in a p-assertion or be otherwise implied by application-specific structures. For
instance, application messages may refer to a data item by the name of the file in which
it is contained, but the querying actor wishes to find the provenance of the data, rather
than the file.
The query data handle includes a search over the contents of p-assertions to re-
trieve data items which are documentation of the entity. This search is expressed in a
particular search language, and the range of search languages supported by any one
provenance store may vary.
A search language must assume some format and structure of the documents over
which it searches, which we call the document language, e.g. the XPath search lan-
guage assumes the XML document language. However, a provenance store is agnostic
to and unaware of the structure of application messages and assertions of state it con-
tains. In fact, application messages may have used different formats, so p-assertions
within one provenance store may use different document languages. Therefore, a query
data handle may also specify document language mappings [ZDF+05] between the
document language used for a p-assertion and the document language required for the
search.
Definition 7.3 (Document Language Mapping) A document language mapping is a
definition of how to transform documents formatted in one document language into
another document language. □
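To illustrate how a document language mapping lets one search language be used over p-assertions in different document languages, the sketch below maps a flat JSON object to XML before running an XPath search. JSON as the second document language, the json_to_xml mapping and the search function are all assumptions made for this example. Python's xml.etree.ElementTree supports only a subset of XPath, which is sufficient for the simple path used here.

```python
import json
import xml.etree.ElementTree as ET

def json_to_xml(document: str, root_tag: str = "message") -> ET.Element:
    """Hypothetical mapping from a flat JSON object to an XML element."""
    root = ET.Element(root_tag)
    for key, value in json.loads(document).items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return root

def search(content: str, document_language: str, xpath: str) -> list:
    """Run an XPath search, mapping the content to XML first if needed."""
    if document_language == "json":
        root = json_to_xml(content)
    else:
        root = ET.fromstring(content)
    return [element.text for element in root.findall(xpath)]
```

The same path, "./patient", then retrieves the data item whether the p-assertion content was originally XML or JSON.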
Composition of Provenance Queries
A query data handle is a search for an entity at an instant within a p-structure. Primar-
ily, this p-structure will be the full contents of a provenance store. However, in some
cases a query data handle may be better expressed as a composition of provenance
queries, i.e. one provenance query is performed and the results are searched over by
another provenance query. For example, to find the provenance of the second item
added to a set, we can first determine the provenance of the set, and then identify from
that the item added second, so the provenance of the item can then be found. In this
case, the search that the query data handle specifies is over a p-structure formed from
the results of another provenance query.
In general terms, we divide the query data handle into the search to be performed
and the p-structure over which it will search. That p-structure, referred to by a p-
structure reference, can be one of three possibilities (also depicted in Figure 7.4).
• The contents of the provenance store.
• The results of another provenance query, in the form of a p-structure.
• A p-structure given by another p-structure reference but, in addition, includ-
ing the transitive closure of relationships with a given relation name in that p-
structure, i.e. wherever there is a relationship of the specified type from A to
B and one from B to C, the transitive closure will also contain a relationship
p-assertion from A to C.
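The transitive closure in the third possibility can be sketched as a simple fixed-point computation. The representation of relationship p-assertions as (subject, relation name, object) triples and the relation name "derivedFrom" used in the usage note are assumptions for illustration only.

```python
def transitive_closure(relationships: set, relation_name: str) -> set:
    """Add A->C for every A->B and B->C of the given relation name.

    Relationships are (subject, relation_name, object) triples; triples
    with other relation names are kept but never extended.
    """
    closure = set(relationships)
    changed = True
    while changed:
        changed = False
        # Snapshot the pairs of the chosen relation before extending the set.
        named = [(a, b) for (a, name, b) in closure if name == relation_name]
        for a, b in named:
            for b2, c in named:
                if b == b2 and (a, relation_name, c) not in closure:
                    closure.add((a, relation_name, c))
                    changed = True
    return closure
```

Given relationships A derivedFrom B and B derivedFrom C, the result additionally contains A derivedFrom C, while relationships of other types are left untouched.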
Definition 7.4 (P-Structure Reference) A p-structure reference is a declaration of
the p-structure over which a provenance query’s entity search will be executed. □
Model
A model of the query data handle is shown in Figure 7.5. The search element specifies
a search, in the chosen search language, over the p-structure for data items within p-
assertions asserted about sending or receiving messages at given instants. The pStruc-
tureReference refers to the set of p-assertions that the search will be conducted over,
and is one of the options discussed in the section above. The documentLanguageMappings
specify how p-assertion contents are mapped to the document language required
by the search.
Figure 7.4: P-Structure Reference model
Figure 7.5: Query Data Handle model
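The query data handle model can be pictured with a small sketch. Everything below is hypothetical: the architecture does not fix a search language, so the search is represented as a Python predicate, documents as dictionaries, and the p-structure reference as an already-resolved list of p-assertions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Assumptions for illustration: a document is a flat dictionary and a
# document language mapping is a function from one document to another.
Document = Dict[str, str]
Mapping = Callable[[Document], Document]


@dataclass
class PAssertion:
    language: str      # document language of the content
    content: Document


@dataclass
class QueryDataHandle:
    search: Callable[[Document], bool]        # the search element
    search_language: str                      # language the search expects
    p_structure_reference: List[PAssertion]   # resolved p-structure
    document_language_mappings: Dict[Tuple[str, str], Mapping] = field(
        default_factory=dict
    )

    def resolve(self) -> List[PAssertion]:
        """Return the p-assertions whose (mapped) content matches the search."""
        found = []
        for p_assertion in self.p_structure_reference:
            content = p_assertion.content
            if p_assertion.language != self.search_language:
                key = (p_assertion.language, self.search_language)
                mapping = self.document_language_mappings.get(key)
                if mapping is None:
                    continue  # no mapping supplied, so it cannot be searched
                content = mapping(content)
            if self.search(content):
                found.append(p_assertion)
        return found
```

In this sketch a p-assertion written in a document language for which no mapping was supplied is skipped; a real provenance store might instead report an error.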
7.2.2 Relationship Target Filters
The set of process documentation about entities that ultimately have some causal in-
fluence on the entity identified by a query data handle could be vast, and most of it
irrelevant to a querying actor for any one purpose. Therefore, we need to allow the
querying actor to specify the scope of the provenance query, i.e. a definition of what
Figure 7.6: Relationship Target model
7.2.3 Provenance Query Results
The response to a provenance query is a representation of the provenance of an entity
at a given instant, i.e. the one specified by the query data handle. Provenance query
results comprise start p-assertion data keys and a set of full relationships,
defined below and depicted in Figure 7.8:
Definition 7.6 (Provenance Query Result Start) A provenance query result start is
the p-assertion data key(s) to the process documentation of the entity for which the
provenance was found, i.e. the key(s) for the p-assertion data item(s) found by
resolving the query data handle. □
Definition 7.7 (Provenance Query Result Full Relationships) A provenance query
result full relationship is a relationship between two p-assertion data items in the
provenance of the entity found by the query. □
Figure 7.8: Provenance Query Result model
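Definitions 7.6 and 7.7 can be sketched as simple data types. The field names, the use of strings as p-assertion data keys and the representation of scope as a set of keys are assumptions made for illustration. The helper shows how a stored relationship with several objects yields one full relationship per object that is within scope.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class RelationshipPAssertion:
    """A stored relationship: one subject key, possibly many object keys."""
    subject_key: str
    relation: str
    object_keys: Tuple[str, ...]


@dataclass(frozen=True)
class FullRelationship:
    """A result relationship: exactly one subject and one object."""
    subject_key: str
    relation: str
    object_key: str


@dataclass
class ProvenanceQueryResult:
    start: List[str]                            # p-assertion data key(s)
    full_relationships: List[FullRelationship]


def to_full_relationships(
    p_assertion: RelationshipPAssertion, in_scope: Set[str]
) -> List[FullRelationship]:
    """Emit one full relationship per object that is within scope."""
    return [
        FullRelationship(p_assertion.subject_key, p_assertion.relation, key)
        for key in p_assertion.object_keys
        if key in in_scope
    ]
```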
Full relationships are adapted versions of relationship p-assertions in the process
documentation, differing in two regards. First, because a relationship p-assertion can
have multiple objects, and not every object may be within scope of the provenance
query results, a full relationship is between strictly one subject and one object, to
indicate that that exact relationship is within scope. If multiple objects of a relationship
At minimum, the process documentation query interface must allow the querying
actor to perform the following operations (†). [SR-1-2, p. 115]
• Retrieve the contents of a p-assertion with a given global p-assertion key.
• Retrieve all p-assertions asserted about one interaction, identified by an interac-
tion key, by one actor, identified as the sender or receiver or by its identity.
• Retrieve all p-assertions asserted about one interaction, identified by an interac-
tion key.
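The three minimum operations can be sketched against an in-memory store. The class, method and field names are hypothetical; the architecture only requires that the operations exist, not how they are exposed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class StoredPAssertion:
    global_key: str       # global p-assertion key
    interaction_key: str  # interaction the p-assertion is about
    view: str             # "sender" or "receiver"
    asserter: str         # identity of the asserting actor
    content: str


class InMemoryProvenanceStore:
    """Hypothetical store exposing only the three minimum operations."""

    def __init__(self, p_assertions: List[StoredPAssertion]) -> None:
        self._all = list(p_assertions)
        self._by_key: Dict[str, StoredPAssertion] = {
            p.global_key: p for p in self._all
        }

    def by_global_key(self, global_key: str) -> Optional[StoredPAssertion]:
        """Operation 1: one p-assertion, by global p-assertion key."""
        return self._by_key.get(global_key)

    def by_interaction_and_actor(
        self,
        interaction_key: str,
        view: Optional[str] = None,
        asserter: Optional[str] = None,
    ) -> List[StoredPAssertion]:
        """Operation 2: p-assertions about one interaction by one actor,
        identified by its view (sender/receiver) or by its identity."""
        if view is None and asserter is None:
            raise ValueError("identify the actor by view or by identity")
        return [
            p for p in self.by_interaction(interaction_key)
            if (view is None or p.view == view)
            and (asserter is None or p.asserter == asserter)
        ]

    def by_interaction(self, interaction_key: str) -> List[StoredPAssertion]:
        """Operation 3: all p-assertions about one interaction."""
        return [p for p in self._all if p.interaction_key == interaction_key]
```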
The actual results of a query depend both on the contents of the provenance store
to which the query is sent, because the query only returns data contained in that
provenance store, and on the access control restrictions that the store places on the
querying actor.
As the volume of data returned may be large, the process documentation
query interface should allow for the iterative retrieval of query results. By this
mechanism, a querying actor should be able to process the results in manageable chunks.
(†) [TSR-1-2, p. 125]
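One way to picture iterative retrieval is a generator that hands results to the querying actor in fixed-size chunks. This is a sketch only; the function name and the chunk_size parameter are assumptions, and a distributed implementation would more likely use a cursor or continuation token held by the store.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iterate_in_chunks(results: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most chunk_size query results."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(results)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

The querying actor can then process one chunk at a time, for example with `for chunk in iterate_in_chunks(results, 100): ...`, without holding the full result set in memory.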
Ideally, the process documentation query interface should allow more than the
above minimum operations, so that queries can be used to search for and retrieve more
p-assertion data meeting different criteria, e.g. to retrieve all interaction p-assertions
of a particular type, and possibly to perform transformations on the results before re-
turning, so that the querying actor receives the results in the form they can most easily
process.
The process documentation query interface is not more fully specified here because
a range of query languages that can be used to query a set of stored data is already
available, and the ideal one will depend on the structure of a particular application’s
p-assertion content.
7.4 Management Interface
Within a provenance architecture, a management interface is necessary in order to
facilitate the administration, reuse and maintenance of provenance stores. Such an
interface may provide generic data storage administration capabilities and will not,
in itself, be provenance specific. This being so, this section merely suggests some
useful functionality that a management interface might have, and we do not make any
commitments to a formal specification of the suggested functionality described below.
7.4.1 Notification of Provenance Store Use
Managing actors might like to be informed when operations are performed on a prove-
nance store. For example, they might like to know when a p-assertion has been
recorded. A management interface should provide the following functionality regard-
ing notification.
• Notification
A management interface should be able to notify subscribed managing actors of
record and query operations. (†) [SR-4-2, p. 122]
• Subscription management
A management interface should allow actors to manage their subscription
information, e.g. where notifications are sent.
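The notification and subscription functionality might look like the sketch below. The class and method names are hypothetical, and an endpoint is modelled as a Python callable rather than a network address.

```python
from typing import Callable, Dict

Notification = Dict[str, str]
Endpoint = Callable[[Notification], None]


class NotifyingManagementInterface:
    """Hypothetical sketch of notification and subscription management."""

    OPERATIONS = ("record", "query")

    def __init__(self) -> None:
        self._subscriptions: Dict[str, Endpoint] = {}

    def subscribe(self, actor_id: str, endpoint: Endpoint) -> None:
        """Register an actor, or change where its notifications are sent."""
        self._subscriptions[actor_id] = endpoint

    def unsubscribe(self, actor_id: str) -> None:
        self._subscriptions.pop(actor_id, None)

    def notify(self, operation: str, detail: str) -> None:
        """Tell every subscribed managing actor about a store operation."""
        if operation not in self.OPERATIONS:
            raise ValueError("unknown operation: " + operation)
        for endpoint in list(self._subscriptions.values()):
            endpoint({"operation": operation, "detail": detail})
```

A provenance store using this sketch would call notify after each record or query operation it performs.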
7.4.2 Provenance Store Utility
• Link Modification
A management interface could provide functionality to update links. This is
useful when process documentation has been moved from one provenance store
to another and contains links.
• Deletion
Ideally, p-assertions are never deleted, but in some circumstances, such as data
staging, it may be necessary for an application to delete particular p-assertions
that have been moved to a new provenance store. A management interface may
provide this deletion capability.
• Setup and management of indexes
Provenance stores hold a large number of p-assertions. However these
p-assertions are organised, a given storage structure may suit some query
operations and not others. A management interface should provide a
mechanism to set up and manage indexes in terms of time, tracer or other criteria,
so that p-assertions can be organised into multiple views and structures, thus
facilitating querying.
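As an illustration of index setup and management, the sketch below builds named indexes over a fixed set of p-assertions using a caller-supplied key function. The record layout and all names are assumptions, and the indexes are not kept up to date as new p-assertions are recorded.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List

# Assumption: a p-assertion is a flat record such as
# {"key": "g1", "time": "2006-11-29", "tracer": "t1"}.
PAssertion = Dict[str, str]
KeyFunction = Callable[[PAssertion], Hashable]


class IndexManager:
    """Hypothetical sketch of index setup and management."""

    def __init__(self, p_assertions: List[PAssertion]) -> None:
        self._p_assertions = list(p_assertions)
        self._indexes: Dict[str, Dict[Hashable, List[PAssertion]]] = {}

    def create_index(self, name: str, key_of: KeyFunction) -> None:
        """Group the stored p-assertions by the value key_of extracts."""
        index = defaultdict(list)
        for p_assertion in self._p_assertions:
            index[key_of(p_assertion)].append(p_assertion)
        self._indexes[name] = dict(index)

    def drop_index(self, name: str) -> None:
        self._indexes.pop(name, None)

    def index_names(self) -> List[str]:
        return sorted(self._indexes)

    def lookup(self, name: str, value: Hashable) -> List[PAssertion]:
        """Return the p-assertions filed under value in the named index."""
        return list(self._indexes[name].get(value, []))
```

Each index gives a different view over the same p-assertions, so a query by tracer and a query by time can each use the structure that suits it.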
7.5 Policies
Policies describe the capabilities, requirements and general characteristics of
components in service-oriented systems. In the Provenance architecture, they are important
for providing interoperability, enabling users to identify services and provenance stores
that provide the required functionality. In Chapter 3, policies identified for the
Provenance architecture were classified into the three distinct areas shown below.
• Service requirement and capability policies.