Data curation + process curation = data integration + science.
- PubMed: 19060304
Abstract
In bioinformatics, we are familiar with the idea of curated data as a prerequisite for data integration. We neglect, often to our cost, the curation and cataloguing of the processes that we use to integrate and analyse our data. Programmatic access to services, for data and processes, means that compositions of services can be made that represent the in silico experiments or processes that bioinformaticians perform. Data integration through workflows depends on being able to know what services exist and where to find those services. The large number of services and the operations they perform, their arbitrary naming and lack of documentation, however, mean that they can be difficult to use. The workflows themselves are composite processes that could be pooled and reused but only if they too can be found and understood. Thus appropriate curation, including semantic mark-up, would enable processes to be found, maintained and consequently used more easily. This broader view on semantic annotation is vital for full data integration that is necessary for the modern scientific analyses in biology. This article will brief the community on the current state of the art and the current challenges for process curation, both within and without the Life Sciences.
Carole Goble, Robert Stevens, Duncan Hull, Katy Wolstencroft and Rodrigo Lopez
Submitted: 16th May 2008; Received (in revised form): 25th July 2008
Keywords: curation; semantic annotation; processes; services; workflow; ontology; metadata
INTRODUCTION: WHY
This briefing presents the need for the curation,
including the semantic annotation, of the processes
that filter or transform data as part of a bioinformatics
analysis and the vital part this will play in data
integration. Integration is a central activity in
bioinformatics; it is a perennial problem that has had many
proposed solutions [1]. The bioinformatics landscape
is one of distributed and heterogeneous data and
tools, a landscape that a bioinformatician needs to
navigate in order to perform the analyses so necessary to
modern biology. Today there is a bewildering array
of resources available to the modern bioinformatician
or molecular biologist, what Stein calls a ‘Bioinformatics
Nation’ [2]. For example, Nucleic Acids Research
describes 1037 databases [3] and 166 web servers [4];
these numbers are far beyond what ad hoc reliance on
human memory can manage.
Bioinformatics analyses are a mixture of data and
processes. These combinations are often complex.
Whilst such analyses are data orientated, it is the
services representing the tools that provide, filter or
transform these data [5] and form the data pipelines
that are common throughout bioinformatics.
The days of a scientist having to cut and paste
between different web interfaces are gone—this is
simply not a scalable or reproducible process in an
era of high-throughput analysis and large-scale data
generation [6]. Manual queries through web forms
are increasingly being replaced by automated queries
through web services [7].
Web services provide a well-defined programming
interface to integrate tools into applications
over the internet or other network
connections. Software applications written in various
programming languages and running on various
Corresponding author. Carole Goble, School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL,
UK. Tel: +44 161 275 6195; Fax: +44 161 275 6236; E-mail: robert.stevens@manchester.ac.uk
Carole Goble is a Professor of Computer Science at the University of Manchester.
Robert Stevens is a senior lecturer in Computer Science at the University of Manchester.
Duncan Hull is a Postdoctoral Research Associate in the School of Chemistry at the University of Manchester.
Katy Wolstencroft is a Postdoctoral Research Associate in the School of Computer Science at the University of Manchester.
Rodrigo Lopez is head of the external services group at the European Bioinformatics Institute (EBI), Hinxton, Cambridgeshire.
BRIEFINGS IN BIOINFORMATICS. VOL 9. NO 6. 506-517 doi:10.1093/bib/bbn034
Advance Access publication December 6, 2008
© The Author 2008. Published by Oxford University Press. For Permissions, please email: journals.permissions@oxfordjournals.org
the Internet.
Using web services to build complex networked
tool chains is now a widely accepted solution in
bioinformatics, both for the everyday work of the
biologist and for tools and applications [7, 8]. Systems
such as Life Science Grid [9], ONDEX [10], GMOD [11],
UTOPIA [12] and VL-e [13] use web services
behind the scenes to plug services (data sets and
tools) into their integration systems, as do
warehouses like ATLAS [14] and integration frameworks
like GAGGLE [15] and DAS [16]. Commercial
systems such as Medicel Integrator [17] do the same.
Alternatively, tools can expose web service interfaces
to enable scientists to build pipelines (or workflows)
of data sources and analyses. For example, scientific
workflow management systems automatically
orchestrate the execution of services, coordinating
processes (process flow) and managing the flow of data
between them (dataflow). Workflow management
tools such as Taverna, Triana, Kepler, Wildfire,
InforSense, Pipeline Pilot and Pegasys [8] provide a
mechanism to orchestrate third-party and in-house
Life Science services. The workflows themselves
(Figure 1) are explicit and precise descriptions of
a scientific process and, in turn, these workflows
can become services within other workflows and
applications.
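The idea that a workflow orchestrates services, and can itself be reused as a service inside another workflow, can be illustrated with a minimal Python sketch. It is not how Taverna or any of the systems named above are implemented. It assumes that a service can be modelled as a local function from named inputs to named outputs, and every name in it (`fetch_sequence`, `sequence_length`, `summarise`, `make_workflow`, the accession `ACC0001` and its sequence) is hypothetical.

```python
from typing import Any, Callable, Dict, List

# Assumption: a "service" is a plain function from named inputs to named
# outputs. Real web services would be remote calls; these are local stand-ins.
Service = Callable[[Dict[str, Any]], Dict[str, Any]]


def fetch_sequence(data: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical data service: look up a sequence by accession."""
    toy_database = {"ACC0001": "MKTAYIAK"}  # made-up record, not real data
    return {"sequence": toy_database[data["accession"]]}


def sequence_length(data: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical analysis service: compute the length of a sequence."""
    return {"length": len(data["sequence"])}


def summarise(data: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical reporting service: format a one-line summary."""
    return {"summary": f"{data['accession']}: {data['length']} residues"}


def make_workflow(steps: List[Service]) -> Service:
    """Compose services into a workflow that is itself a service."""
    def run(inputs: Dict[str, Any]) -> Dict[str, Any]:
        data = dict(inputs)          # do not mutate the caller's inputs
        for step in steps:           # process flow: run the steps in order
            data.update(step(data))  # dataflow: outputs feed later steps
        return data
    return run


# A small workflow, then a larger one that reuses it as a single step.
measure = make_workflow([fetch_sequence, sequence_length])
report = make_workflow([measure, summarise])

result = report({"accession": "ACC0001"})
print(result["summary"])
```

Because `make_workflow` returns something with the same shape as a service, `measure` can be dropped into `report` without `report` knowing that it is a composite. This is the property that makes workflows candidates for pooling and reuse.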
In an effort to manage, analyse and integrate the
data deluge we have now created a service deluge.
For example, the Taverna Workflow Workbench has
access to over 3500 different tools and data resources,
spread across hundreds of third-party services. The data that
these workflows and services process are often
curated, but the processes themselves are poorly
curated if they are curated at all. By process curation
we mean the cataloguing and annotation of services
and workflows and not the content they deliver.
Web services, as a prime example, tend to be
poorly described, often with documentation that
is insufficient or inappropriate. Their interfaces are
commonly (but not always) accompanied by a file
that gives the names of the operations performed by
the web service, as well as their inputs and outputs;
for most this is described in the Web Services
Description Language (WSDL) [18]. Unfortunately,
Figure 1: A workflow from the Taverna workflow management system [35], highlighting a KEGG [66] web service
operation.


