
Linking multiple workflow provenance traces for interoperable collaborative science

by Paolo Missier, Bertram Ludäscher, Shawn Bowers, Saumen Dey, Anandarup Sarkar, Biva Shrestha, Ilkay Altintas, Manish Kumar Anand, Carole Goble
5th Workshop on Workflows in Support of Large-Scale Science (WORKS), 2010

Abstract

Scientific collaboration increasingly involves data sharing between separate groups. We consider a scenario where data products of scientific workflows are published and then used by other researchers as inputs to their workflows. For proper interpretation, shared data must be complemented by descriptive metadata. We focus on provenance traces, a prime example of such metadata which describes the genesis and processing history of data products in terms of the computational workflow steps. Through the reuse of published data, virtual, implicitly collaborative experiments emerge, making it desirable to compose the independently generated traces into global ones that describe the combined executions as single, seamless experiments. We present a model for provenance sharing that realizes this holistic view by overcoming the various interoperability problems that emerge from the heterogeneity of workflow systems, data formats, and provenance models. At the heart lie (i) an abstract workflow and provenance model in which (ii) data sharing becomes itself part of the combined workflow. We then describe an implementation of our model that we developed in the context of the Data Observation Network for Earth (DataONE) project and that can stitch together traces from different Kepler and Taverna workflow runs. It provides a prototypical framework for seamless cross-system, collaborative provenance management and can be easily extended to include other systems. Our approach also opens the door to new ways of workflow interoperability not only through often elusive workflow standards but through shared provenance information from public repositories.


Paolo Missier, Carole Goble
School of Computer Science
University of Manchester, Manchester, UK
{pmissier,caroleg}@cs.man.ac.uk
Saumen Dey, Anandarup Sarkar
Dept. of Computer Science
University of California, Davis
{scdey,asarkar}@ucdavis.edu
Biva Shrestha
Dept. of Computer Science
Appalachian State University, Boone, NC
ivashrestha@gmail.com
Bertram Ludäscher
Dept. of Computer Science & Genome Center
University of California, Davis
ludaesch@ucdavis.edu
Shawn Bowers
Dept. of Computer Science
Gonzaga University
bowers@gonzaga.edu
Ilkay Altintas, Manish Kumar Anand
San Diego Supercomputer Center
University of California, San Diego
{altintas, mkanand}@sdsc.edu
I. INTRODUCTION
One of the tenets of the emerging paradigm of “open” experimental, data-intensive science [14], in which information is the main product, is that scientists should have both the incentive and the ability to share some of their findings with other members of their community, as well as to reuse their peers’ data products. Indeed, the scientists’ natural resistance to sharing their data and methods is increasingly being replaced by the realization that the benefits of data sharing may outweigh the risks of losing exclusive ownership of data. This phenomenon is amplified by new requirements to make data available prior to publication, along with the definition of standard formats for data exchange in many domains of science [1].
We concentrate on the particularly common setting where the structure and the steps of the transformation process are formally encoded as a scientific workflow [26], [18], and where provenance traces result from observing the workflow execution. In this setting, implicit collaboration between two or more parties involves the execution of a workflow that uses some of the results of another workflow’s execution as part of its inputs. The following scenario, used as a running example throughout the paper, clarifies this setting (see Fig. 1).
A. Workflow Collaboration Scenario: Alice, Bob, and Charlie
Alice and Bob are two experts in image analysis for medical applications who occasionally collaborate on joint projects. Alice has developed a workflow W_A using her favorite workflow system. W_A consists of a data pipeline that performs various transformations of input image(s) X to produce a set Z of new images.¹ Alice decides to publish the results of some of her workflow runs via a shared data space so that her collaborators (or any other users) can use them. Bob retrieves a copy of one of those images, z ∈ Z, as he would like to use it as input to his own workflow W_B, which he developed (incidentally, using a different workflow system than Alice). He first applies some format transformation u = f(z), then runs W_B with u (and possibly some additional local input data), obtaining a new result set v of data products. He then in turn publishes v in a shared data space, together with a trace T_B of how v was derived. Along comes Charlie, our third expert, who is interested in the results v and wants to understand how they have been derived. The commonly accepted approach to answering Charlie’s question is to collect a provenance trace T_B during the execution of Bob’s workflow, which describes in detail the data transformation and generation process, as well as the dependencies amongst data products involved in the process. The trace T_B is a directed graph whose nodes represent either data or computations, and where arcs represent dataflow or
¹ E.g., we use a workflow for image analysis of brain scans from the First Provenance Challenge [24], [23] to demonstrate our system [12].
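
To make the scenario concrete, the following is a minimal Python sketch of how Bob's trace alone stops at u, while a trace stitched together with Alice's trace and the sharing step reaches back to X. It is an illustration under stated assumptions, not the model or implementation described in this paper: the names `Trace`, `add_step`, `lineage`, and `stitch` are hypothetical, and the sketch assumes that a shared data item carries the same identifier in every trace, which in practice is exactly one of the interoperability problems to be solved.

```python
from collections import defaultdict


class Trace:
    """Hypothetical provenance trace: a directed graph whose nodes are
    either data items or computation steps."""

    def __init__(self):
        self.kind = {}                    # node -> "data" or "step"
        self.parents = defaultdict(set)   # node -> nodes it directly depends on

    def add_step(self, step, inputs, outputs):
        """Record that `step` read `inputs` and produced `outputs`."""
        self.kind[step] = "step"
        for d in inputs:
            self.kind[d] = "data"
            self.parents[step].add(d)
        for d in outputs:
            self.kind[d] = "data"
            self.parents[d].add(step)

    def lineage(self, node):
        """Return every node that `node` transitively depends on."""
        seen, stack = set(), [node]
        while stack:
            for p in self.parents.get(stack.pop(), ()):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen


def stitch(*traces):
    """Union several traces into one. ASSUMPTION: identical identifiers in
    different traces denote the same data item."""
    merged = Trace()
    for t in traces:
        merged.kind.update(t.kind)
        for node, ps in t.parents.items():
            merged.parents[node] |= ps
    return merged


# Alice's run: W_A transforms input x into image z.
t_a = Trace()
t_a.add_step("W_A", ["x"], ["z"])

# The sharing step, modeled as a step of its own: u = f(z).
t_share = Trace()
t_share.add_step("f", ["z"], ["u"])

# Bob's run: W_B transforms u into result v.
t_b = Trace()
t_b.add_step("W_B", ["u"], ["v"])

combined = stitch(t_a, t_share, t_b)
print(sorted(t_b.lineage("v")))       # ['W_B', 'u']
print(sorted(combined.lineage("v")))  # ['W_A', 'W_B', 'f', 'u', 'x', 'z']
```

Treating the copy and format transformation f as an ordinary step is what lets Charlie's question about v be answered across both workflows in this sketch; without it, the two traces share no node and the union stays disconnected.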
