
Towards a Generic Infrastructure for Sustainable Management of Quality Controlled Primary Data

by Thomas Buchmann, Stefan Jablonski, Bernhard Volz, Bernhard Westfechtel

Abstract

Collecting primary data in scientific research is currently being performed in numerous repositories. Frequently, these repositories have not been designed to support long-term evolution of data, processes, and tools. Furthermore, in many cases repositories have been set up for the specific needs of some research project, and are not maintained any longer when the project is terminated. Finally, quality control and data provenance issues are not addressed to a sufficient extent. Based on the experiences gained in a joint project with biologists in the domain of biodiversity informatics, we propose a generic infrastructure for sustainable management of quality controlled primary data. The infrastructure encompasses both project and institutional repositories and provides a process for migrating project data into institutional repositories. Evolution and adaptability are supported through a generic approach with respect to underlying data schemas, processes, and tools. Specific emphasis is placed on quality assurance and data provenance.


R. Meersman et al. (Eds.): OTM 2010 Workshops, LNCS 6428, pp. 130–138, 2010. © Springer-Verlag Berlin Heidelberg 2010

Institute for Computer Science, University of Bayreuth, Bayreuth, Germany
{thomas.buchmann,stefan.jablonski,bernhard.volz,bernhard.westfechtel}@uni-bayreuth.de

1 Introduction

The importance of the management of primary data in scientific research is increasingly being recognized. Primary data are collected in an abundant set of research projects. However, severe problems are still faced concerning the long-term management of primary data and provision of these data to research communities. For example, let us consider biodiversity data.
Biodiversity is the variation of life forms, and it is often used as a measure for the health of biological systems. By collecting biodiversity data over long time spans, the evolution of biological systems may be traced. Therefore, it is crucial that biodiversity data are managed in a sustainable way.

Biodiversity informatics [1] is concerned with the development of methods, infrastructures, and tools for managing biodiversity data. Biodiversity data are managed in numerous repositories at different levels of scale, including personal, project, institutional, and global repositories. At a global level, portals such as GBIF (Global Biodiversity Information Facility [2]) and BioCASE (Biological Collection Access Service for Europe [3]) provide access to biodiversity data which are exported from a huge number of repositories. To facilitate data exchange, various standards such as ABCD (Access to Biological Collection Data [4]) or Darwin Core [5] have been developed under the umbrella of TDWG (Biodiversity Information Standards, previously called Taxonomic Database Working Group [6]). Furthermore, several domain-specific frameworks for developing biodiversity data management systems are available commercially or in the public domain, e.g., BRAHMS (Botanical Research and Herbarium Management Systems [7]) and BioOffice [8]. Finally, institutions are developing frameworks for in-house use, e.g., the DiversityWorkbench [9] hosted by the SNSB (Staatliche Naturwissenschaftliche Sammlungen Bayerns).

Thus, biodiversity informatics is nowadays a very active field, in which numerous activities are being performed at different levels of scale. However, this by no means implies that the challenges of biodiversity informatics have already been solved. On the contrary, researchers working in this field are increasingly recognizing that they are facing difficult problems of data management. In particular, current solutions suffer from the following drawbacks:

• Specific solutions: Usually, the systems for managing biodiversity data have been written to solve a specific problem of data management in a defined context. These systems cannot be easily adapted to other problem domains.

• No sustainable management: Often, project repositories are not maintained after the funding of the project has terminated. Thus, valuable data are lost.

• Data losses: Since biodiversity data are spread over numerous repositories, global portals such as GBIF were founded to provide world-wide access. However, data are exported into such portals with massive loss of data, and the data are not harmonized.

• Lack of quality control and data provenance: While large amounts of data may be accessed via global portals, the portals cannot guarantee a defined level of data quality, nor do they support data provenance (i.e., it cannot be traced where the data came from, in which ways they were produced, etc.).

• No migration path from project to institutional repositories: Institutional repositories have been set up to bridge the gap between project repositories and global portals by managing biodiversity data which are consolidated and subject to quality control. However, migration of project data into institutional repositories is a laborious process which is not supported by adequate tools.

• No evolution support: Biodiversity data are the results of biological research, and as such they are subject to constant change, not only on the level of instances, but also on the level of data schemas. Furthermore, the scientific work processes are changing as well. Finally, maintaining repositories over a long period requires dealing with technological evolution. Current systems are hardly designed for evolution with respect to any of these dimensions.

In this paper, we propose an infrastructure for sustainable management of primary data. The proposal is based on experiences which we have gained in a joint research project carried out with biologists in the domain of biodiversity informatics [10], as well as on the analysis of other projects and systems in related domains. In contrast to existing approaches, which mostly stem from the application domains and

