Workflows for Information Integration in the Life Sciences
In: Search Computing
- ISBN: 9783642196676
- DOI: 10.1007/978-3-642-19668-3_20
Abstract
The increasingly computationally- and data-intensive nature of experimental science motivates recent interest in workflows, as a way to specify complex data processing and integration pipelines in a fairly intuitive way. Such workflows orchestrate the invocation of data retrieval services in a way that resembles, to some extent, Search Computing query plans. While the former are manually specified, however, the latter are the result of an automated translation process. Using lessons learnt from experience in workflow design, in this chapter we discuss some of the requirements on service curation that make automated, on-demand data integration processes possible and realistic.
Paolo Missier, Norman Paton, and Peter Li
School of Computer Science, University of Manchester
Oxford rd., Manchester, UK
{firstname.lastname}@cs.manchester.ac.uk
1 Workflows for Computational Science and Information
Integration
In many disciplines of natural science, research advances increasingly rely upon
the automated acquisition, transformation and analysis of large-scale data. In this
chapter, we exemplify and discuss the use of workflow technology as a way to
address the needs of data analysis automation in science [1]. Our examples refer to
two emerging areas in the life sciences which have been the focus of data-intensive
research, namely next generation DNA sequencing (NGS) and systems biology.
NGS is having a profound impact on the expectations and the methods of
genomics research. First introduced around 2005, NGS makes it possible to
sequence entire genomes in weeks, spurring ambitious new efforts like the 1000
Genomes Project [2]. While these projects underpin the study of the genetic
causes of human diseases, they come with new challenges at multiple levels.
Firstly, they push the limits of current data repositories. For example, the Short
Read Archive, the European repository that accepts data submissions from NGS
machines at the EMBL¹, received 30 TB of data in the first six months of
operation, making data submission rate the new bottleneck for advances in
genomics² [3]. At the same time, a secondary effect of these new whole-genome
sequencing studies is the exponential growth in the number of submissions to
SNP databases³ [4]. In turn, advances in data production drive the need for the
development of highly automated pipelines for the analysis of NGS data, both
primary (sequence) and “downstream” (SNP analysis, for example).
1 European Molecular Biology Lab: http://www.ebi.ac.uk/ena/
2 The EMBL-Bank grows in size at the rate of 200% per annum.
S. Ceri and M. Brambilla (Eds.): Search Computing II, LNCS 6585, pp. 215–225, 2011.
© Springer-Verlag Berlin Heidelberg 2011
While such experimental processes predictably involve a combination of
data-centric (data retrieval, format mappings) and compute-intensive tasks, their
exact nature and composition into a complete process tend to change rapidly,
following data availability and other technological advances. In practice, the
experimental nature of the projects extends from data generation technology to the
development of novel techniques for data analysis. In this setting, workflow
technology addresses the scientists’ needs for rapid prototyping of innovative
applications. Workflows embody high-level programming models that let users specify
the coordinated execution, known as orchestration, of various types of executable
software components, or tasks, often implemented as Web services. Workflow
languages tend to be higher level than traditional scripting languages, such as Perl,
resulting in more manageable specifications of complex data processing pipelines.
At the same time, their computational models are more understandable to
domain experts with a limited knowledge of general-purpose programming. For these
experts, workflows are a way to maintain control over all phases of their
computational experiment, from design, to execution, to analysis of the results. Workflow
systems offer additional advantages over general scripting environments, including
managing the scheduling of tasks and their deployment on HPC infrastructures,
such as clouds.
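To make the notion of orchestration concrete, the following is a minimal sketch, not taken from the chapter or from any particular workflow system. It assumes that tasks are plain Python functions standing in for Web service calls, and that the workflow is a dependency graph executed in topological order. The task names (fetch_sequence, reverse_complement, gc_content, report) and the input sequence are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks standing in for Web service calls; the names and data
# are illustrative assumptions, not part of the chapter.
def fetch_sequence(inputs):
    return "ACGTTGCA"

def reverse_complement(inputs):
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(inputs["fetch_sequence"]))

def gc_content(inputs):
    seq = inputs["fetch_sequence"]
    return (seq.count("G") + seq.count("C")) / len(seq)

def report(inputs):
    return {"revcomp": inputs["reverse_complement"], "gc": inputs["gc_content"]}

# The workflow specification: task name -> (callable, names of upstream tasks).
WORKFLOW = {
    "fetch_sequence": (fetch_sequence, []),
    "reverse_complement": (reverse_complement, ["fetch_sequence"]),
    "gc_content": (gc_content, ["fetch_sequence"]),
    "report": (report, ["reverse_complement", "gc_content"]),
}

def run(workflow):
    """Execute tasks in dependency order, passing upstream outputs downstream."""
    graph = {name: deps for name, (_, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func({dep: results[dep] for dep in deps})
    return results

if __name__ == "__main__":
    print(run(WORKFLOW)["report"])
```

The specification (WORKFLOW) is kept separate from the engine (run), which mirrors the division of labour described above: the domain expert edits the graph, while scheduling is left to the system.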
A variety of workflow systems for science have emerged over the past few years,
in response to these scientists’ needs. Their commonalities and differences have
been described at length in the literature [5,6]. In most cases, however, the
focus is on workflows that accomplish compute-intensive tasks, such as large-scale
simulations [7]. Less emphasis is placed on a class of workflows whose main
purpose is to retrieve and integrate data from multiple sources, usually in order to
enable some more complex processing downstream. The importance and impact
of these resource-oriented workflows on the e-science data infrastructure are
growing with the number and size of the available databases, as mentioned earlier. A
recent EBI statistic [3], for example, compares interactive Web page accesses to
programmatic (i.e., Web service-based) access to its 63 databases, reporting about
1 million service-based automated data retrieval jobs per month in 2009.
Resource-oriented workflows resemble less a scientific experiment, and more a
distributed query plan in which the nodes are service invocations, a
characterisation that makes them particularly interesting in the context of Search
Computing. In the rest of the chapter we use an example from the bioinformatics area
of systems biology, implemented using the Taverna workflow system [8], to
discuss opportunities and limitations of using workflows as a form of on-the-fly data
integration.
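The analogy with a query plan can be illustrated with a small sketch. It assumes two made-up lookup services, mocked here as in-memory dictionaries. It is not the Taverna example used in the chapter, and every identifier and record below is hypothetical. Piping the output of one service into the next behaves like a nested-loop join over the two sources.

```python
# Mock "services": in a real setting these would be remote Web service calls.
# All identifiers and records are invented for illustration.
GENE_TO_PROTEINS = {"geneA": ["P1", "P2"], "geneB": ["P3"]}
PROTEIN_TO_PATHWAYS = {
    "P1": ["glycolysis"],
    "P2": [],
    "P3": ["glycolysis", "tca_cycle"],
}

def proteins_for_gene(gene_id):
    """Hypothetical service: gene identifier -> protein identifiers."""
    return GENE_TO_PROTEINS.get(gene_id, [])

def pathways_for_protein(protein_id):
    """Hypothetical service: protein identifier -> pathway names."""
    return PROTEIN_TO_PATHWAYS.get(protein_id, [])

def gene_pathways(gene_ids):
    """Pipe one service into the next, yielding joined (gene, protein, pathway) rows."""
    for gene in gene_ids:
        for protein in proteins_for_gene(gene):
            for pathway in pathways_for_protein(protein):
                yield (gene, protein, pathway)

if __name__ == "__main__":
    for row in gene_pathways(["geneA", "geneB"]):
        print(row)
```

Proteins with no associated pathway (P2 here) drop out of the result, as they would in an inner join.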
3 Single Nucleotide Polymorphisms, or SNPs, are single-base mutations on a
chromosome. About .5k new SNPs are detected for each genome that is sequenced, leading
to over 100 million submissions, by early 2010, to dbSNP, the SNP database at the
NCBI: http://www.ncbi.nlm.nih.gov/snp/.


