CCPE09v8
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2000 00:1–6

Towards Open Science: The myExperiment approach

David De Roure1,*, Carole Goble2, Sergejs Aleksejevs2, Sean Bechhofer2, Jiten Bhagat2, Don Cruickshank1, Paul Fisher2, Duncan Hull3, Danius Michaelides1, David Newman1, Rob Procter4, Yuwei Lin4, Meik Poschen4

1 School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, U.K.
2 School of Computer Science, The University of Manchester, Manchester M13 9PL, U.K.
3 The Manchester Interdisciplinary Biocentre, The University of Manchester, Manchester M13 9PL, U.K.
4 National Centre for e-Social Science, The University of Manchester, Manchester M13 9PL, U.K.

* Corresponding author. Email dder@ecs.soton.ac.uk

SUMMARY

By making research content more reusable, and providing a social infrastructure which facilitates sharing, the human aspects of the scholarly knowledge cycle may be accelerated and 'time-to-discovery' reduced. We propose that the key to this is the sharing of methods and processes. We present myExperiment, a social web site for discovering, sharing and curating Scientific Workflows and experiment plans, and describe how myExperiment facilitates the management and sharing of research workflows, supports a social model for content curation tailored to the researcher and community, and supports Open Science by exposing content and functionality to the users' tools and applications. Based on this we introduce the notion of the Research Object (the work objects that are built, transformed and published in the course of scientific experiments) and suggest that by encapsulating methods with results we can achieve research that is more reusable and repeatable and hence rapid and robust.

KEY WORDS: Scientific Workflow, Web 2.0, Data Curation, Research Object, Semantic Web, e-Laboratory

1. INTRODUCTION

1.1 Motivation

To accelerate the time to discovery of new research results we must look at the human component of the discovery cycle. Scientific advance relies on social processes in which scientists share hypotheses, insights and results, and the data and methods that support these. Traditionally, scholarly discourse and dissemination have focused on peer reviewed journal articles, mediated by the scholarly publishing process and gatherings such as conferences where researchers exchange knowledge in more informal ways.

The Web is now widely used as a distributed platform for the dissemination of an increasingly diverse range of digital research materials: we are witnessing evolving practice in scholarly publishing [1] and communities supported by research portals and repositories. Significantly, there are also now tens of thousands of publicly available web services across business and science [2].

In this evolving landscape we observe an expansion in the kinds of scientific commodities being published, for example:

• Primary and secondary data sets, along with standard metadata sufficient to support their interpretation and re-use, although tying together published results with the 'supplementary data' upon which they are based has unsolved issues to do with persistence [3].
• Algorithms, software tools, scripts and procedures, through community services like OpenWetWare [4], which provides an exchange for techniques in biological sciences, and the nanoHUB gateway [5], which hosts user-contributed resources in the nanotechnology domain.
This latter point is the focus of our work. Researchers need to share (and find) not just the digital materials of research but also the methods and processes: the protocols, plans, and standard operating procedures of bench science and the scripts, workflows and provenance records of e-Science. Methods are scientific commodities in their own right, with associated intellectual property, metadata, life cycles and hence curation needs [6]; as with data and articles, they are subject to their own forms of authorship, credit and reuse criteria. We propose that:

• By pooling and sharing methods we have the potential to accelerate science through exchanging know-how and best practice, avoiding reinvention and hence reducing time-to-experiment. Moreover, participating researchers are not always organised into predetermined Virtual Organisations but form fluid, opportunistic groupings amongst decoupled strangers.
• By combining methods with results we can accelerate discovery by enabling transparent, comparable and reproducible research [7] and maintain the robustness of the accelerated process. By packaging and aggregating methods with data, results, publications, tutorials, simulations, logs, tags and people (experts, members, groups) and sharing these across applications as publication units we can work towards an open e-Laboratory that is outside any specific application.

1.2 Workflows

A case in point is the Scientific Workflow. The Web provides a platform for delivering not just documents and data but also services which support the research process: scientific workflows are the means to compose these, providing descriptions of processes that specify the co-ordinated execution of multiple tasks so that, for example, data analysis and simulations can be repeated and accurately reported.
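To make the notion of a workflow as a description of co-ordinated task execution concrete, the following is a minimal sketch in Python (3.9 or later, for the standard `graphlib` module). It is an illustration only and does not reflect any particular workflow system: the function `run_workflow`, the task names and the sample data are all hypothetical. Tasks are run in dependency order and the order of execution is logged, so that a run can be repeated and reported.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies, inputs):
    """Run tasks in dependency order and log what was executed.

    tasks:        task name -> function taking a dict of upstream results
    dependencies: task name -> set of names it depends on
    inputs:       name -> value supplied by the user (not computed)
    """
    results = dict(inputs)
    log = []
    for name in TopologicalSorter(dependencies).static_order():
        if name in results:
            # Supplied as an input, so there is nothing to execute.
            continue
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
        log.append(name)
    return results, log


# Hypothetical three-step analysis: clean the data, summarise it, report it.
TASKS = {
    "clean": lambda up: [x for x in up["raw"] if x is not None],
    "mean": lambda up: sum(up["clean"]) / len(up["clean"]),
    "report": lambda up: f"mean = {up['mean']:.2f}",
}
DEPENDENCIES = {
    "clean": {"raw"},
    "mean": {"clean"},
    "report": {"mean"},
}

results, log = run_workflow(TASKS, DEPENDENCIES, {"raw": [1.0, None, 2.0, 3.0]})
print(log)
print(results["report"])
```

Because the description (tasks plus dependencies) is kept separate from the input data, the same workflow can be re-run on new inputs, and the log records which steps produced a given result.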
Alongside experiment plans, Standard Operating Procedures and laboratory protocols, automated workflows are one of the most recent forms of digital research methods, and one that has gained popularity and adoption in a short time [8].

Figure 1: Workflows and associated items used in the production of a research article

Workflows can require specialist expertise that is hard-won and may be outside the skill-set of the author, and they are often complex and challenging to build [9]. Figure 1 illustrates a piece of research which involves two workflows developed for a particular bioinformatics investigation (investigating the Trypanosomiasis resistance phenotype in the mouse model) which led to publication of an article in Nucleic Acids Research [10]. The suite of scientific workflows in this work took a bioinformatics expert six months and over 40 versions to develop; however, once developed, they were immediately reusable by other, perhaps less experienced, researchers, in turn accelerating their research.

In addition to the workflows and the PDF we see all the supplementary information relating to the published paper, including all workflow outputs, Word documents on result interpretation, spreadsheets detailing the re-sequencing of one candidate gene and a table from the paper itself, a PowerPoint presentation outlining the project's background, and descriptions of the work carried out so that the provenance of the results can be established. In combination these items enable the research to be repeated, the research outcomes to be properly interpreted and trusted, and the components to be better repurposed.
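The aggregation of items described above can be sketched as a simple data structure. The following Python fragment is a minimal illustration under stated assumptions, not myExperiment's actual data model: the class names `Item` and `Bundle`, the field names, the item titles and the placeholder locations are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Item:
    """One research item; all field names here are hypothetical."""
    title: str
    kind: str       # e.g. "workflow", "article", "output"
    location: str   # placeholder identifier, not a real URL


@dataclass
class Bundle:
    """A named aggregation of items that can be shared as one unit."""
    title: str
    items: list = field(default_factory=list)

    def add(self, item: Item) -> None:
        self.items.append(item)

    def count_by_kind(self) -> Counter:
        return Counter(item.kind for item in self.items)

    def has_kinds(self, *kinds: str) -> bool:
        """True if the bundle holds at least one item of every given kind."""
        present = {item.kind for item in self.items}
        return all(kind in present for kind in kinds)


# Hypothetical contents, loosely following the kinds of item in Figure 1.
bundle = Bundle("Trypanosomiasis resistance investigation")
bundle.add(Item("Workflow A", "workflow", "items/workflow-a"))
bundle.add(Item("Workflow B", "workflow", "items/workflow-b"))
bundle.add(Item("Published article (PDF)", "article", "items/article.pdf"))
bundle.add(Item("Workflow outputs", "output", "items/outputs"))
bundle.add(Item("Notes on result interpretation", "document", "items/interpretation"))
bundle.add(Item("Project background slides", "presentation", "items/background"))

print(bundle.count_by_kind())
print(bundle.has_kinds("workflow", "output", "article"))
```

A check such as `has_kinds` suggests how a consumer of the bundle might confirm that the method, its outputs and the resulting publication are all present before attempting to repeat or reinterpret the work.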