RESEARCH Open Access

The data paper: a mechanism to incentivize data publishing in biodiversity science

Vishwas Chavan1,2*†, Lyubomir Penev1,2†

Background: Free and open access to primary biodiversity data is essential for informed decision-making to achieve conservation of biodiversity and sustainable development. However, primary biodiversity data are neither easily accessible nor discoverable. Among several impediments, one is a lack of incentives for data publishers to publish their data resources. One such mechanism currently lacking is recognition through conventional scholarly publication of enriched metadata, which should ensure rapid discovery of "fit-for-use" biodiversity data resources.

Discussion: We review the state of the art of data discovery options and the mechanisms in place for incentivizing data publishers' efforts towards easy, efficient and enhanced publishing, dissemination, sharing and re-use of biodiversity data. We propose the establishment of the "biodiversity data paper" as one possible mechanism to offer scholarly recognition for efforts and investment by data publishers in authoring rich metadata and publishing them as citable academic papers. While detailing the benefits to data publishers, we describe the objectives, workflow and outcomes of the pilot project commissioned by the Global Biodiversity Information Facility in collaboration with scholarly publishers and pioneered by Pensoft Publishers through its journals ZooKeys, PhytoKeys, MycoKeys, BioRisk, NeoBiota, Nature Conservation and the forthcoming Biodiversity Data Journal. We then debate further enhancements of the data paper beyond the pilot project and attempt to forecast the future uptake of data papers as an incentivization mechanism by the stakeholder communities.

Conclusions: We believe that in addition to recognition for those involved in the data publishing enterprise, data papers will also expedite publishing of fit-for-use biodiversity data resources. However, uptake and establishment of the data paper as a potential mechanism of scholarly recognition requires a high degree of commitment and investment by the cross-sectional stakeholder communities.

Background

One of the effective strategies for addressing the growing biodiversity crisis is access to a range of biodiversity- and ecosystems-related data and information in a useful form. Furthermore, discovery of existing and prospective unpublished data needs to be encouraged if our goal is to fill the extensive biodiversity knowledge gap that exists today. This emphasis on free and open access to biodiversity data is in tune with the call for open access to primary scientific data, which has been growing since 1991, beginning with the Bromley Principles [1]. Since then, many statements, policies, and guidelines for open access to scientific data have appeared [2-23]. The Berlin Declaration of 2003 has been signed by 302 scientific bodies worldwide [18]. In 2004, the Organization for Economic Co-operation and Development (OECD) also recognized the importance of open access to primary scientific data [23]. Recently established initiatives such as Conservation Commons [24], the Global Earth Observation System of Systems (GEOSS) 10 year implementation plan [25], and the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES) [26] have recognized the importance of open access to primary scientific knowledge.

* Correspondence: vchavan@gbif.org
† Contributed equally
1 Global Biodiversity Information Facility Secretariat, Universitetsparken 15, DK 2100, Copenhagen, Denmark
Full list of author information is available at the end of the article

Chavan and Penev BMC Bioinformatics 2011, 12(Suppl 15):S2
http://www.biomedcentral.com/1471-2105/12/S15/S2

© 2011 Chavan and Penev; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Many scholarly publishers have joined in implementing the common principle that scientists must make their data available for independent use, without restrictions, once the data have been used in publications [27-34]. Recently, several of them have emphasized the need for simultaneous publication of primary biodiversity data with scholarly publications and described some approaches to incorporate this practice in the routine publication process [35-38]. Editors of scientific journals can have an important role in promoting public deposition of scientific data [39]. However, these efforts are yet to yield any significant results because existing data remain unpublished, undiscovered and thus underused [40].

The majority of initiatives to make data accessible have focused on "big science" rather than "small science" [41]. We do not have a model for publication and discovery of data from small-scale data authors, who collectively produce huge quantities of primary data, forming the so-called "long tail" of science data [41,42].

Biodiversity research, as well as biodiversity conservation and sustainable use, cannot be achieved if data are not preserved, discovered and made accessible [43]. Thus, discovery is a first step towards increased access to primary biodiversity data. However, our current progress in discovering biodiversity data resources emphasizes the need for innovative mechanisms to speed up progress. We propose the establishment of the "biodiversity data paper" as one possible mechanism to offer scholarly recognition through registration of priority, citability and dissemination of the efforts and investment by data publishers in authoring rich metadata.

In the context of this article, the term "data publisher" is used in its widest sense. Data publishers include all data creators, data curators, data managers and data publishing networks/systems who form an integral part of the data life cycle.
Thus, data publishers are individuals, institutions or networks that facilitate discovery of and access to primary biodiversity data through national, regional, thematic or global networks such as the Global Biodiversity Information Facility (GBIF). These are often also referred to as "data providers" [44].

Publishing and discovery of biodiversity data: the state of the art

Primary biodiversity data are the digital text or multimedia data records that detail the instance of an organism: the what, where, when, how and by whom of the organism's occurrence and recording [44,45]. Many of the biodiversity data are neither accessible nor discoverable [46]. Currently, GBIF facilitates discovery of over 10,000 data resources, providing access to over 267 million primary biodiversity data records. However, this progress barely scratches the surface. For instance, 6,500 natural history collections across the world are believed to hold approximately 3 billion data records spanning the past 250 years of biodiversity research [47,48]. Ariño (2010) very conservatively estimated the figure at 1.2 to 2.1 billion records, of which only 3% are discoverable at the moment [49]. Although data from "data-rich" nations are being discovered at a snail's pace, no definite efforts are being made to ensure discovery of data resources from megabiodiverse, developing and under-developed regions of the world.

Most of the existing data discovery efforts are geared towards big projects or initiatives that constitute less than 20% of the estimated universe of biodiversity data; the remaining 80% of the data, not easily found by potential users, is called "dark data" [50]. These include investigator-focused "small data", locally generated "invisible data" and "incidental data", which are less well planned, poorly curated and unlikely to be visible to others.
These dark data are in danger of being lost for want of an appropriate discovery mechanism [51]. According to Heidorn (2008), these dark data may be more important, because of their huge volume, than the data that can be easily discovered and used [50].

In summary, there is a lack of up-to-date, easy, fast, reliable and affordable discovery of and access to a wide spectrum of primary biodiversity data. This leads to an unnecessary duplication of effort. Furthermore, verification of results becomes difficult, and investment in research, data creation and collection remains under-realized because these data are currently trapped, invisible, in institutional and individual cupboards, computers and disks. This is an obstacle to interdisciplinary and international research [46], as huge investment in data collection does not in any way ensure that the data are accessible now or that they will be accessible in future. Thus, discovery of both digital and non-digital data resources is essential for ensuring access to and enhanced use of biodiversity data.

Publishing and discovery of biodiversity data: the constraints and challenges

The major reasons for this grim state of affairs are: (a) the lack of sustainable practices for data publishing; (b) the lack of easy-to-use tools and related guidelines for authoring metadata documents; (c) the difficulty of dealing with heterogeneity and diversity of standards, tools and numerous metadata extensions; (d) the cost of creation and maintenance of infrastructure by small- and medium-scale data publishers; and (e) the lack of professional reward structures or incentives. The first four of these causes are being addressed by various initiatives, led by GBIF and its participants and by standards bodies such as Biodiversity Information Standards (also known as the Taxonomic Database Working Group, TDWG), that are at various stages of development.
However, the last cause,