Taverna/myGrid: Aligning a Workflow System with the Life Sciences Community
Workflows for eScience (2007)
- ISBN: 9781846285196
Available from www.springerlink.com
Abstract
Bioinformatics is a discipline that uses computational and mathematical techniques to store, manage, and analyze biological data in order to answer biological questions. Bioinformatics has over 850 databases [181] and numerous tools that work over those databases and local data to produce even more data themselves. In order to perform an analysis, a bioinformatician uses one or more of these resources to gather, filter, and transform data to answer a question. Thus, bioinformatics is an in silico science.
20 Taverna/myGrid: aligning a workflow system with the life sciences community
Tom Oinn (1), Peter Li (2), Douglas B. Kell (2), Carole Goble (3), Antoon Goderis (3), Mark Greenwood (3), Duncan Hull (3), Robert Stevens (3), Daniele Turi (3) and Jun Zhao (3)
(1) EMBL European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK, tmo@ebi.ac.uk
(2) Bioanalytical Sciences, School of Chemistry, University of Manchester, M13 9PL, UK, Peter.Li@manchester.ac.uk
(3) School of Computer Science, University of Manchester, M13 9PL, UK, carole@cs.man.ac.uk
20.1 Introduction
Bioinformatics is a discipline that uses computational and mathematical techniques to store, manage and analyse biological data in order to answer biological questions. It has over 850 databases [181] and numerous tools that work over those databases and over local data, themselves producing even more data. To perform an analysis, a bioinformatician uses one or more of these resources to gather, filter and transform data to answer a question. Thus, bioinformatics is an in silico science.
The traditional bioinformatics technique of cutting and pasting between Web pages can be effective, but it neither scales nor supports scientific best practice, such as record keeping. In addition, as such methods are scaled up, slips and omissions become more likely. A final human factor is the tedium of such repetitive tasks [371].
Performing these tasks programmatically is an obvious solution, especially given their repetitive nature. Some bioinformaticians have the programming skills to wrap these distributed resources. Such solutions are, however, not easy to disseminate, adapt and verify. Moreover, one consequence of the autonomy of bioinformatics service providers is massive heterogeneity within those resources. The advent of Web services has brought about a major change in the availability of bioinformatics resources, from Web pages and command-line programmes to Web services [369], though much of the structural, value-based and syntactic heterogeneity remains. The consequent lack of a common type system means that services are difficult to join together programmatically, and any technical solution to in silico experiments in biology has to address this issue.
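To make the heterogeneity problem concrete, the following is a minimal sketch using two hypothetical services (the function names, the accession and the canned record are assumptions made for illustration, not real bioinformatics services). One returns a FASTA-formatted record while the other expects a bare nucleotide string. Both values are plain strings, so nothing in the types warns that they cannot be joined directly, and a hand-written adapter is needed.

```python
def fetch_sequence(accession: str) -> str:
    """Hypothetical retrieval service: returns a FASTA-formatted record."""
    # Canned data stands in for a remote call; the record is illustrative only.
    return f">{accession} example record\nATGGCC\nTTAGGA\n"


def gc_content(sequence: str) -> float:
    """Hypothetical analysis service: expects a bare nucleotide string."""
    if not sequence or any(base not in "ACGT" for base in sequence):
        raise ValueError("expected a bare nucleotide sequence")
    return sum(base in "GC" for base in sequence) / len(sequence)


def fasta_to_plain(fasta: str) -> str:
    """Adapter: drop the FASTA header line and join the sequence lines."""
    lines = fasta.strip().splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


record = fetch_sequence("X00001")
# Calling gc_content(record) directly raises ValueError: both values are
# strings, but they do not carry the same kind of data.
result = gc_content(fasta_to_plain(record))
print(result)
```

With only two services the adapter is trivial. The difficulty the text describes arises when every pair of independently provided services may need its own such conversion.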
Many scientific computing projects within the academic community have turned to workflows as a means of orchestrating complex tasks (in silico experiments) over a distributed set of resources. Examples include DiscoveryNet [352] for molecular biology and environmental data analysis, SEEK for ecology [67, 68], GriPhyN for particle physics [144], and SCEC/IT for earthquake analysis and prediction [242].
Workflows offer a high-level alternative for encoding bioinformatics in silico experiments. The high-level nature of the encoding means that a broader community can create templates for in silico experiments. Workflows are also easier to adapt or re-purpose by substitution or extension. Finally, workflows are less of a black box than a script or traditional programme: the experimental protocol captured in the workflow is displayed in such a way that a user can see the components, their order, and their inputs and outputs. Such a workflow can be seen in Figure 20.1.
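As a rough illustration of why such an encoding is less of a black box than a script, the sketch below represents a workflow as a list of components with declared inputs and outputs. It is not Taverna's workflow language or data model. The `Step` class, the `describe` and `enact` helpers, and the toy gather, filter and transform steps are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One workflow component with named inputs and a single named output."""
    name: str
    inputs: List[str]
    output: str
    run: Callable[..., object]


def describe(steps: List[Step]) -> List[str]:
    """Render the protocol so a reader can see components, order and data."""
    return [f"{s.name}: ({', '.join(s.inputs)}) -> {s.output}" for s in steps]


def enact(steps: List[Step], data: Dict[str, object]) -> Dict[str, object]:
    """Run the steps in listed order, passing values between them by name."""
    values = dict(data)
    for step in steps:
        args = [values[name] for name in step.inputs]
        values[step.output] = step.run(*args)
    return values


# Toy gather / filter / transform pipeline over made-up sequences.
workflow = [
    Step("gather", ["ids"], "records",
         lambda ids: {i: "ACGT" * len(i) for i in ids}),
    Step("filter", ["records"], "long_records",
         lambda recs: {k: v for k, v in recs.items() if len(v) > 8}),
    Step("transform", ["long_records"], "lengths",
         lambda recs: {k: len(v) for k, v in recs.items()}),
]

print("\n".join(describe(workflow)))
results = enact(workflow, {"ids": ["a", "bb", "ccc"]})
print(results["lengths"])
```

Because the protocol is data rather than control flow buried in a script, re-purposing by substitution amounts to replacing one entry in the list, and extension amounts to appending another.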
myGrid is a project to build middleware to support workflow-based in silico experiments in biology. Funded by the UK’s e-Science Programme from 2001, it has developed a set of open source components that can be used independently or together. These include a service directory [268]; ontology-driven search tools over semantic descriptions of external resources and data [268]; data repositories and semantically driven metadata stores for recording the provenance of a workflow and the experimental lifecycle [457]; and other components such as distributed query processing [63] and event notification (footnote 1).
myGrid’s workflow execution and development environment, Taverna, links together and executes external, heterogeneous open services (applications, databases, etc.), whether remote or local, private or public, third-party or home-grown. The Freefluo workflow enactment engine (footnote 2) enacts the workflows. The Taverna workbench is a GUI-based application with which bioinformaticians assemble, adapt and run workflows, and manage the generated data and metadata.
myGrid components are Taverna plug-ins (for results collection and browsing, provenance capture, and service publication and discovery) and services (such as specialist text mining). Thus the workbench is the user-facing application for the myGrid middleware services. At the time of writing, Taverna 1.3 has been downloaded over 14 000 times (footnote 3) and has an estimated user base of around 1 500 installations. Taverna has been used in many different areas of research throughout Europe and the U.S.A. for functional genomics, systems biology, protein structure analysis, image processing, chemoinformatics and simulation co-ordination. From 2006, myGrid has been incorporated into the UK’s Open Middleware Infrastructure Institute to be “hardened” and developed to continue to support life scientists.
1 http://www.mygrid.org.uk
2 http://freefluo.sourceforge.net
3 see http://taverna.sourceforge.net/index.php?doc=stats.php


