Recording Process Documentation for Provenance
IEEE Transactions on Parallel and Distributed Systems (2009)
- ISSN: 1045-9219
- DOI: 10.1109/TPDS.2008.215
Abstract
Scientific and business communities are adopting large-scale distributed systems as a means to solve a wide range of resource-intensive tasks. These communities also have requirements in terms of provenance. We define the provenance of a result produced by a distributed system as the process that led to that result. This paper describes a protocol for recording documentation of a distributed system's execution. The distributed protocol guarantees that documentation with characteristics suitable for accurately determining the provenance of results is recorded. These characteristics are confirmed through a number of proofs based on an abstract state machine formalization.
Paul Groth, Member, IEEE, and Luc Moreau
Index Terms—Provenance, lineage, grids, distributed systems, data protocols.
1 INTRODUCTION
Scientific and business communities are adopting large-scale distributed systems (Grids, Web Services) as a
means to solve a wide range of resource-intensive tasks. For
example, bioinformaticians are using such systems to aid in
drug discovery by modeling the folding of proteins [20].
Likewise, in aerospace engineering, practitioners are running
simulations of aircraft using networked supercomputers to
improve design and safety, while reducing cost [19]. Lastly,
financial service firms are using the idle cycles of desktop
computers to perform financial analytics with faster
turnaround times [22]. Beyond their use of large amounts of
computational resources in distributed networks, these
example applications share another common concern. In
each application, the process by which the result was
generated is as important as the result itself. For instance, in
the aerospace example, if a plane has a malfunction, it is
necessary to find where in the design and building process
the failure could have arisen. If during the design process the
simulation was at fault, it could be improved to prevent
future malfunctions in planes that use similar simulation
techniques. The process that led to the plane (including its
design, construction, and operation) is termed the provenance
of the plane. Thus, conceptually, we term the process that led
to a result the provenance of that result.
The necessity for provenance is apparent in a wide range of
fields and mandated by a number of regulatory authorities.
For example, the American Food and Drug Administration
requires that the provenance of a drug’s discovery be kept as
long as the drug is in use (up to 50 years in some cases).
Likewise, the Federal Aviation Administration requires that
simulation records, as well as other provenance data, be kept
up to 99 years after the design of an aircraft. In financial
auditing, the American Sarbanes-Oxley Act requires public
accounting firms to maintain the provenance of an audit
report for at least seven years after the issue of that report (US
Public Law No. 107-204). Beyond regulatory requirements,
provenance is particularly important when there is no
physical record as in the case of purely in silico distributed
scientific processes since provenance provides the only
means to validate a result.
Therefore, the ability to determine the provenance of
results produced by computational distributed systems is
necessary. However, in most instances, the existence of a
computer-derived result itself is not sufficient to determine
its provenance. We need to know details of the actual
execution (e.g., process) responsible for the result’s generation.
During the execution of a distributed system, it is
possible to automatically create a description of such an
execution, which we call process documentation, and record it
in a repository called a provenance store. A provenance store
can then be queried to retrieve a concrete representation of the
provenance of the result of interest. To reiterate, process
documentation contains concrete representations of the
provenance of results (e.g., data) produced by a distributed
system. Creating, recording, and querying process documentation
are core steps of the provenance lifecycle [18].
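
As a rough illustration of the create, record, and query steps described above, the following Python sketch models a provenance store as an in-memory list. The names ProcessRecord, ProvenanceStore, record, and provenance_of are hypothetical and do not come from the paper. The sketch also assumes that each data item is produced by exactly one step, and it is not the recording protocol that the paper goes on to define.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProcessRecord:
    """One piece of process documentation: a step that turned inputs into an output."""
    actor: str        # hypothetical name of the component that ran the step
    operation: str    # what the step did
    inputs: tuple     # identifiers of the data items consumed
    output: str       # identifier of the data item produced


@dataclass
class ProvenanceStore:
    """Toy in-memory repository for process documentation (an assumption, not the paper's design)."""
    records: list = field(default_factory=list)

    def record(self, rec: ProcessRecord) -> None:
        # Recording phase: a component submits documentation of a step it performed.
        self.records.append(rec)

    def provenance_of(self, result_id: str) -> list:
        # Query phase: walk backwards from the result through the steps that fed it.
        # Assumes each data item is the output of at most one recorded step.
        by_output = {r.output: r for r in self.records}
        found, pending, seen = [], [result_id], set()
        while pending:
            current = pending.pop(0)
            if current in seen or current not in by_output:
                continue
            seen.add(current)
            rec = by_output[current]
            found.append(rec)
            pending.extend(rec.inputs)
        return found


# Creation and recording: each (hypothetical) component documents its own step.
store = ProvenanceStore()
store.record(ProcessRecord("mesher", "build_mesh", ("cad_model",), "mesh"))
store.record(ProcessRecord("solver", "simulate_airflow",
                           ("mesh", "flight_conditions"), "lift_estimate"))

# Querying: retrieve the steps that led to the result of interest.
steps = store.provenance_of("lift_estimate")
print([s.operation for s in steps])
```

Keeping the store separate from the components that produce results mirrors the text's split between creating documentation during execution and querying it afterwards. A real store would also have to cope with distribution, failures, and concurrent writers, which this sketch ignores.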
The focus of this paper is on the creation and recording
phases of such a lifecycle; we identify a protocol by which a
distributed system can record process documentation in a
provenance store. Furthermore, we define the expected
behavior of the distributed system, in the creation and
recording phases of the provenance lifecycle, so that the
recorded process documentation can help accurately
determine the provenance of a result. Consequently, the
contributions of this paper are threefold:
1. The definition of five characteristics that make
process documentation high quality.
2. A description and a formal definition of a distributed
protocol that records process documentation.
3. Proofs that establish that the protocol records high-quality
process documentation.
The rest of this paper is organized as follows: Section 2
presents a set of characteristics to ensure that process
documentation is high quality. In Section 3, we describe a
1246 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 9, SEPTEMBER 2009
- P. Groth is with the Information Sciences Institute, University of Southern
California, 4676 Admiralty Way #1001, Marina del Rey, CA 90292.
E-mail: pgroth@isi.edu.
- L. Moreau is with the School of Electronics and Computer Science,
University of Southampton, SO17 1BJ Southampton, UK.
E-mail: l.moreau@ecs.soton.ac.uk.
Manuscript received 25 Apr. 2008; revised 8 Sept. 2008; accepted 11 Sept.
2008; published online 24 Sept. 2008.
Recommended for acceptance by M. Raynal.
For information on obtaining reprints of this article, please send e-mail to:
tpds@computer.org, and reference IEEECS Log Number TPDS-2008-04-0150.
Digital Object Identifier no. 10.1109/TPDS.2008.215.
1045-9219/09/$25.00 © 2009 IEEE. Published by the IEEE Computer Society