Expressive Reusable Workflow Templates

by Yolanda Gil, Paul Groth, Varun Ratnakar, Christian Fritz
Proceedings of the Fifth IEEE International Conference on eScience (2009)

Abstract

Workflow systems can manage complex scientific applications with distributed data processing. Although some workflow systems can represent collections of data with very compact abstractions and manage their execution efficiently, there are no approaches to date to manage collections of application components required to express some scientific applications. We present an approach to handle collections of components and data alike in expressive workflow templates whose basic structure is reusable. We also present an algorithm that can elaborate abstract compact workflow templates into execution-ready workflows that enumerate all computations to be carried out. We implemented the proposed approach in the Wings workflow system. Our work is motivated by real-world complex scientific applications that require handling of nested collections of both components and data.

Available from www.isi.edu

USC/Information Sciences Institute, 4676 Admiralty Way, Marina del Rey, CA, 90292, USA
{gil, pgroth, varunr, fritz}@isi.edu

1. Introduction

Scientists often deal with collections of data, whether it is multiple overlaying images produced by fMRI scanners or thousands of gene sequences generated using high-throughput sequencing. To deal with such large and complex collections of data, scientists have turned to workflow technology. Computational experiments can be modeled as workflows, which are declarative representations of the dataflow between software components. Thus, sophisticated software packages can be woven together to express a computational experiment. Once an experiment is represented as a workflow, workflow systems can be used to execute computational experiments on a large scale [3], optimize performance [15], and track the provenance of experimental outputs [7].
One important outcome of representing experiments as workflows is the ability for scientists to easily share and reuse experiments [11].
However, the software components within a workflow are in many cases not designed to process more than one data set at a time. Consider, for example, a bioinformatician testing whether a particular gene expression predicts a given phenotype in an organism using a k-nearest neighbor classifier. This classifier typically classifies only one data set at a time, but imagine that the bioinformatician wants to test a newly trained classifier on many test data sets in order to gather evidence of the classifier’s efficacy. Or, to find the classifier that produces the best results, the bioinformatician may want to train and test a collection of alternative algorithms simultaneously. To support this sort of application using available software components, a workflow system needs to be able to represent, reason about, and process not only collections of data but also collections of components.

To further enable the sharing and reuse of computational experiments, workflows must be easy to adapt both to new or similar data sets and to the availability of new analysis components. For example, if new test data sets or classifiers become available to the aforementioned bioinformatician, the workflow should be easily (and perhaps automatically) adapted to them. Thus, the workflow system should make it easy to reuse the basic dataflow structure of an experiment at an abstract level, and dynamically incorporate new data sets and components.

In this paper, we present a new approach to workflow representation and generation that addresses collections of data and components. While some workflow systems have treated collections [5;14;9], our approach differs in that it 1) handles collections of components in addition to collections of data, and 2) automatically adapts the initial workflow template to new collections of data sets and components. Our approach is implemented as an extension to the Wings workflow system [4;7].
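As a rough illustration of what handling collections of both components and data can mean in practice, the following Python sketch expands one compact "test" step into one job per pairing of classifier and test data set. It is not the Wings representation or the elaboration algorithm presented in the paper; the names `Job` and `elaborate`, and the classifier and data set labels, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Job:
    """One concrete computation: a single component applied to a single data set."""
    component: str
    dataset: str


def elaborate(components, datasets):
    """Enumerate every (component, dataset) pairing as an explicit job.

    Assumption for this sketch: each component processes exactly one
    data set at a time, so a compact step over two collections expands
    into their cross product.
    """
    return [Job(c, d) for c, d in product(components, datasets)]


# Hypothetical collections for the bioinformatics scenario described above.
classifiers = ["knn", "decision_tree"]
test_sets = ["test_a", "test_b", "test_c"]

jobs = elaborate(classifiers, test_sets)
for job in jobs:
    print(f"{job.component} <- {job.dataset}")
```

Adding a classifier or a test data set to either list changes the set of enumerated jobs without touching the step definition, which is the kind of reuse of the basic dataflow structure the text argues for.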
The paper starts with motivating examples that lead to requirements to handle collections. We then describe the representations of workflow templates that we have developed to support those requirements. We also present the algorithm that uses those
Proceedings of the Fifth IEEE International Conference on e-Science (e-Science 2009), Oxford, UK, December 9-11, 2009.

