Data curation + process curation = data integration + science

Carole Goble, Robert Stevens, Duncan Hull, Katy Wolstencroft and Rodrigo Lopez

Submitted: 16th May 2008
Received (in revised form): 25th July 2008

Abstract
In bioinformatics, we are familiar with the idea of curated data as a prerequisite for data integration. We neglect, often to our cost, the curation and cataloguing of the processes that we use to integrate and analyse our data. Programmatic access to services, for data and processes, means that compositions of services can be made that represent the in silico experiments or processes that bioinformaticians perform. Data integration through workflows depends on being able to know what services exist and where to find those services. The large number of services and the operations they perform, their arbitrary naming and lack of documentation, however, mean that they can be difficult to use. The workflows themselves are composite processes that could be pooled and reused but only if they too can be found and understood. Thus appropriate curation, including semantic mark-up, would enable processes to be found, maintained and consequently used more easily. This broader view on semantic annotation is vital for full data integration that is necessary for the modern scientific analyses in biology. This article will brief the community on the current state of the art and the current challenges for process curation, both within and without the Life Sciences.

Keywords: curation; semantic annotation; processes; services; workflow; ontology; metadata

INTRODUCTION: WHY
This briefing presents the need for the curation, including the semantic annotation, of the processes that filter or transform data as part of a bioinformatics analysis and the vital part this will play in data integration. Integration is a central activity in bioinformatics; it is a perennial problem that has had many proposed solutions [1].
The bioinformatics landscape is one of distributed and heterogeneous data and tools, a landscape a bioinformatician needs to navigate in order to perform the analyses so necessary to modern biology. Today there is a bewildering array of resources available to the modern bioinformatician or molecular biologist, what Stein calls a 'Bioinformatics Nation' [2]. For example, Nucleic Acids Research describes 1037 databases [3] and 166 web servers [4], numbers beyond ad hoc reliance on human memory for management and use.

Bioinformatics analyses are a mixture of data and processes. These combinations are often complex. Whilst such analyses are data orientated, it is the services representing the tools that provide, filter or transform these data [5] and form the data pipelines that are common throughout bioinformatics. The days of a scientist having to cut and paste between different web interfaces are gone: this is simply not a scalable or reproducible process in an era of high-throughput analysis and large-scale data generation [6]. Manual queries through web forms are increasingly being replaced by automated queries through web services [7].

Web services provide a well-defined programming interface to integrate tools into applications over the internet or other network connections. Software applications written in various programming languages and running on various platforms can use web services to exchange data over the Internet.

Corresponding author. Carole Goble, School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL, UK. Tel: +44 161 275 6195; Fax: +44 161 275 6236; E-mail: robert.stevens@manchester.ac.uk
Carole Goble is a Professor of Computer Science at the University of Manchester.
Robert Stevens is a senior lecturer in Computer Science at the University of Manchester.
Duncan Hull is a Postdoctoral Research Associate in the School of Chemistry at the University of Manchester.
Katy Wolstencroft is a Postdoctoral Research Associate in the School of Computer Science at the University of Manchester.
Rodrigo Lopez is head of the external services group at the European Bioinformatics Institute (EBI), Hinxton, Cambridgeshire.
BRIEFINGS IN BIOINFORMATICS. VOL 9. NO 6. 506-517
doi:10.1093/bib/bbn034
Advance Access publication December 6, 2008
© The Author 2008. Published by Oxford University Press. For Permissions, please email: journals.permissions@oxfordjournals.org

Building complex networked chains of tools and applications from web services is now a widely accepted solution in bioinformatics for the everyday work of the biologist [7, 8]. Systems such as Life Science Grid [9], ONDEX [10], GMOD [11], UTOPIA [12] and VL-e [13] use web services behind the scenes to plug services (data sets and tools) into their integration systems, as do warehouses like ATLAS [14] and integration frameworks like GAGGLE [15] and DAS [16]. Commercial systems such as Medicel Integrator [17] do the same.

Alternatively, tools can expose web service interfaces to enable scientists to build pipelines (or workflows) of data sources and analyses. For example, scientific workflow management systems automatically orchestrate the execution of services, coordinating processes (process flow) and managing the flow of data between them (dataflow). Workflow management tools such as Taverna, Triana, Kepler, Wildfire, InforSense, Pipeline Pilot and Pegasys [8] provide a mechanism to orchestrate third-party and in-house Life Science services. The workflows themselves (Figure 1) are explicit and precise descriptions of a scientific process and, in turn, these workflows can become services within other workflows and applications.

In an effort to manage, analyse and integrate the data deluge we have now created a service deluge. For example, the Taverna Workflow Workbench has access to over 3500 different tools and data resources, spread across hundreds of third-party services. The data that these workflows and services process are often curated, but the processes themselves are poorly curated, if they are curated at all. By process curation we mean the cataloguing and annotation of services and workflows, not of the content they deliver. Web services, as a prime example, tend to be poorly described, often with documentation that is insufficient or inappropriate.
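As a rough illustration of what process curation could make possible, the following Python sketch assumes a tiny in-memory catalogue in which each service is annotated with a task and with the semantic types of its input and output. Given such annotations, a service can be found by what it does, and a candidate pipeline can be assembled by matching the output type of one service to the input type of the next. All service names, task labels and type labels here are hypothetical; they are not taken from any real registry, ontology or workflow system.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ServiceEntry:
    """One catalogue record for a service (every field is illustrative)."""
    name: str         # hypothetical service name
    task: str         # what the operation does
    input_type: str   # semantic type of the data it consumes
    output_type: str  # semantic type of the data it produces


# A hypothetical curated catalogue with three annotated services.
CATALOGUE = [
    ServiceEntry("fetch_protein", "retrieval",
                 "protein_accession", "protein_sequence"),
    ServiceEntry("run_similarity_search", "similarity_search",
                 "protein_sequence", "search_report"),
    ServiceEntry("parse_hits", "parsing",
                 "search_report", "protein_accession_list"),
]


def find_by_task(catalogue: List[ServiceEntry], task: str) -> List[str]:
    """Return the names of all services annotated with the given task."""
    return [entry.name for entry in catalogue if entry.task == task]


def find_chain(catalogue: List[ServiceEntry],
               start_type: str, goal_type: str) -> Optional[List[str]]:
    """Breadth-first search for a shortest chain of services that turns
    start_type into goal_type by matching output types to input types.
    Returns the service names in order, or None if no chain exists."""
    queue = deque([(start_type, [])])
    seen = {start_type}
    while queue:
        current_type, path = queue.popleft()
        if current_type == goal_type:
            return path
        for entry in catalogue:
            if entry.input_type == current_type and entry.output_type not in seen:
                seen.add(entry.output_type)
                queue.append((entry.output_type, path + [entry.name]))
    return None


if __name__ == "__main__":
    print(find_by_task(CATALOGUE, "parsing"))
    print(find_chain(CATALOGUE, "protein_accession", "protein_accession_list"))
```

Without the task and type annotations, neither lookup is possible: the services would be distinguishable only by their names, which is the situation the text describes for poorly curated services.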
Web service interfaces are commonly (but not always) accompanied by a file that gives the names of the operations performed by the web service, as well as their inputs and outputs; for most, this is described in the Web Services Description Language (WSDL) [18].

Figure 1: A workflow from the Taverna workflow management system [35], highlighting a KEGG [66] web service operation.

Unfortunately,