
Sharing Detailed Research Data Is Associated with Increased Citation Rate

by Heather A Piwowar, Roger S Day, Douglas B Fridsma
PLoS ONE (2007)

Abstract

Background: Sharing research data provides benefit to the general scientific community, but the benefit is less obvious for the investigator who makes his or her data available. Principal Findings: We examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data. The 48% of trials with publicly available microarray data received 85% of the aggregate citations. Publicly available data was significantly (p=0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin using linear regression. Significance: This correlation between publicly available data and increased literature impact may further motivate investigators to share their detailed research data.


Available from www.pubmedcentral.nih.gov

Heather A. Piwowar*, Roger S. Day, Douglas B. Fridsma
Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America

Citation: Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308

INTRODUCTION

Sharing information facilitates science. Publicly sharing detailed research data (sample attributes, clinical factors, patient outcomes, DNA sequences, raw mRNA microarray measurements) with other researchers allows these valuable resources to contribute far beyond their original analysis[1]. In addition to being used to confirm original results, raw data can be used to explore related or new hypotheses, particularly when combined with other publicly available data sets. Real data is indispensable when investigating and developing study methods, analysis techniques, and software implementations.
The larger scientific community also benefits: sharing data encourages multiple perspectives, helps to identify errors, discourages fraud, is useful for training new researchers, and increases efficient use of funding and patient population resources by avoiding duplicate data collection.

Believing that these benefits outweigh the costs of sharing research data, many initiatives actively encourage investigators to make their data available. Some journals, including the PLoS family, require the submission of detailed biomedical data to publicly available databases as a condition of publication[2–4]. Since 2003, the NIH has required a data sharing plan for all large funding grants. The growing open-access publishing movement will perhaps increase peer pressure to share data.

However, while the general research community benefits from shared data, much of the burden for sharing the data falls to the study investigator. Are there benefits for the investigators themselves?

A currency of value to many investigators is the number of times their publications are cited. Although limited as a proxy for the scientific contribution of a paper[5], citation counts are often used in research funding and promotion decisions and have even been assigned a salary-increase dollar value[6]. Boosting citation rate is thus a potentially important motivator for publication authors.

In this study, we explored the relationship between the citation rate of a publication and whether its data was made publicly available. Using cancer microarray clinical trials, we addressed the following questions: Do trials which share their microarray data receive more citations? Is this true even within lower-profile trials? What other data-sharing variables are associated with an increased citation rate? While this study is not able to investigate causation, quantifying associations is a valuable first step in understanding these relationships.
Clinical microarray data provides a useful environment for the investigation: despite being valuable for reuse and extremely costly to collect, it is not yet universally shared.

RESULTS

We studied the citations of 85 cancer microarray clinical trials published between January 1999 and April 2003, as identified in a systematic review by Ntzani and Ioannidis[7] and listed in Supplementary Text S1. We found that 41 of the 85 clinical trials (48%) made their microarray data publicly available on the internet. Most data sets were located on lab websites (28), with a few found on publisher websites (4) or within public databases (6 in the Stanford Microarray Database (SMD)[8], 6 in Gene Expression Omnibus (GEO)[9], 2 in ArrayExpress[10], and 2 in the NCI GeneExpression Data Portal (GEDP) (gedp.nci.nih.gov); some datasets were in more than one location). The internet locations of the datasets are listed in Supplementary Text S2. The majority of datasets were made available concurrently with the trial publication, as shown by the WayBackMachine internet archives (www.archive.org/web/web.php) for 25 of the datasets and by mention of supplementary data within the trial publication itself for 10 of the remaining 16 datasets. As seen in Table 1, trials published in high impact journals, prior to 2001, or with US authors were more likely to share their data.

The cohort of 85 trials was cited an aggregate of 6239 times in 2004–2005 by 3133 distinct articles (median of 1.0 cohort citation per article, range 1–23). The 48% of trials which shared their data received a total of 5334 citations (85% of aggregate), distributed as shown in Figure 1.

Academic Editor: John Ioannidis, University of Ioannina School of Medicine, Greece
Received December 13, 2006; Accepted February 26, 2007; Published March 21, 2007
Copyright: © 2007 Piwowar et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: HAP was supported by NLM Training Grant Number 5T15-LM007059-19. The NIH had no role in study design, data collection or analysis, writing the paper, or the decision to submit it for publication. The publication contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.
Competing Interests: The authors have declared that no competing interests exist.
* To whom correspondence should be addressed. E-mail: hpiwowar@cbmi.pitt.edu

PLoS ONE | www.plosone.org | March 2007 | Issue 3 | e308
Whether a trial's dataset was made publicly available was significantly associated with the log of its 2004–2005 citation rate (69% increase in citation count; 95% confidence interval: 18 to 143%; p = 0.006), independent of journal impact factor, date of publication, and US authorship. Detailed results of this multivariate linear regression are given in Table 2. A similar result was found when we regressed on the number of citations each trial received during the 24 months after its publication (45% increase in citation count; 95% confidence interval: 1 to 109%; p = 0.050).

To confirm that these findings were not dependent on a few extremely high-profile papers, we repeated our analysis on a subset of the cohort. We define papers published after the year 2000 in journals with an impact factor less than 25 as lower-profile publications. Of the 70 trials in this subset, only 27 (39%) made their data available, although they received 1875 of 2761 (68%) aggregate citations. The distribution of the citations by data availability in this subset is shown in Figure 2. The association between data sharing and citation rate remained significant in this lower-profile subset, independent of other covariates within a multivariate linear regression (71% increase in citation count; 95% confidence interval: 19 to 146%; p = 0.005).

Lastly, we performed exploratory analysis on citation rate within the subset of trials which shared their microarray data; results are given in Table 3 and raw covariate data in Supplementary Data S1. The number of patients in a trial and a clinical endpoint correlated with increased citation rate. Assuming shared data is actually re-analyzed, one might expect an increase in citations for those trials which generated data on a standard platform (Affymetrix), or released it in a central location or format (SMD, GEO, GEDP)[11].
However, the choice of platform was not significant and only those trials located in SMD showed a weak trend of increased citations. In fact, the 6 trials with data in GEO (in addition to other locations for 4 of the 6) actually showed an inverse relationship to citation rate, though we hesitate to read much into this due to the small number of trials in this set. The few trials in this cohort which, in addition to gene expression fold-change or other preprocessed information, shared their raw probe data or actual microarray images did not receive additional citations. Finally, although finding diverse microarray datasets online is non-trivial, an additional increase in citations was not noted for trials which mentioned their Supplementary Material within their paper, nor for those trials with datasets identified by a centralized, established data mining website.

In summary, only trial design features such as size and clinical endpoint showed a significant association with citation rate; covariates relating to the data collection and how the data was made available showed only very weak trends. Perhaps with a larger and more balanced sample of trials with shared data these trends would be clearer.

Table 1. Characteristics of Eligible Trials by Data Sharing.

                         Number of Articles                       Odds Ratio
                         Total   Data Shared   Data Not Shared    (95% confidence interval)
TOTAL                    85      41 (48%)      44 (52%)
High Impact (≥25)        12      12 (100%)     0 (0%)             ∞ (3.8 to ∞)
Low Impact Journal       73      29 (40%)      44 (60%)
Published 1999–2000      6       5 (83%)       1 (17%)            6.0 (0.6 to 288.5)
Published 2001–2003      79      36 (46%)      43 (54%)
Include a US Author      56      35 (63%)      21 (38%)           6.4 (2.0 to 21.9)
No US Authors            29      6 (21%)       23 (79%)

doi:10.1371/journal.pone.0000308.t001

Figure 1. Distribution of 2004–2005 citation counts of 85 trials by data availability. The 41 clinical trial publications which publicly shared their microarray data received more citations, in general, than the 44 publications which did not share their microarray data. In this plot of the distribution of citation counts received by each publication, the extent of the box encompasses the interquartile range of the citation counts, whiskers extend to 1.5 times the interquartile range, and lines within the boxes represent medians.
doi:10.1371/journal.pone.0000308.g001

Table 2. Multivariate regression on citation count for 85 publications

                                                     Percent increase in citation count
                                                     (95% confidence interval)            p-value
Publish in a journal with twice the impact factor    84% (59 to 109%)                     <0.001
Increase the publication date by a month             -3% (-5 to -2%)                      <0.001
Include a US author                                  38% (1 to 89%)                       0.049
Make data publicly available                         69% (18 to 143%)                     0.006

We calculated a multivariate linear regression over the citation counts, including covariates for journal impact factor, date of publication, US authorship, and data availability. The coefficients and p-values for each of the covariates are shown here, representing the contribution of each covariate to the citation count, independent of other covariates.
doi:10.1371/journal.pone.0000308.t002
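Because the regression is fitted to the log of the citation count, each coefficient corresponds to a multiplicative change, which is why Table 2 reports percent increases. The sketch below illustrates that conversion under the assumption that natural logarithms were used (the base is not stated in this section). The fitted coefficients themselves are not given here, so the code works backward from the reported 69% (18 to 143%) figure; the function names are illustrative only.

```python
import math


def percent_change(beta: float) -> float:
    """Percent change in citations implied by a coefficient on ln(citations)."""
    return (math.exp(beta) - 1.0) * 100.0


def log_coefficient(percent: float) -> float:
    """Inverse mapping: ln-scale coefficient implied by a reported percent change."""
    return math.log(1.0 + percent / 100.0)


# Reported in Table 2 for "Make data publicly available": 69% (18 to 143%).
beta = log_coefficient(69.0)
ci_low = log_coefficient(18.0)
ci_high = log_coefficient(143.0)

# On the log scale the interval is roughly symmetric around the estimate,
# which is why it looks lopsided once converted back to percentages.
midpoint = (ci_low + ci_high) / 2.0

print(f"beta = {beta:.3f}, CI = ({ci_low:.3f}, {ci_high:.3f}), midpoint = {midpoint:.3f}")
print(f"round trip: {percent_change(beta):.0f}%")
```

The near-symmetry of the back-calculated interval on the log scale is consistent with a confidence interval computed for the log-scale coefficient and then exponentiated, although the section does not state this explicitly.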
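The odds ratio point estimates in Table 1 can be checked from the cell counts with the ordinary cross-product ratio. This is a minimal sketch, not the authors' procedure: the section does not say how the estimates or their confidence intervals were computed, and the intervals are not reproduced here. The function name and the dictionary labels are illustrative.

```python
import math


def odds_ratio(shared_a: int, not_shared_a: int, shared_b: int, not_shared_b: int) -> float:
    """Cross-product odds ratio of data sharing for group A relative to group B.

    Returns infinity when the denominator is zero (an empty cell).
    """
    numerator = shared_a * not_shared_b
    denominator = not_shared_a * shared_b
    return math.inf if denominator == 0 else numerator / denominator


# Counts copied from Table 1: (shared A, not shared A, shared B, not shared B).
table1 = {
    "high vs. low impact journal": (12, 0, 29, 44),
    "published 1999-2000 vs. 2001-2003": (5, 1, 36, 43),
    "US author vs. no US author": (35, 21, 6, 23),
}

for label, counts in table1.items():
    print(f"{label}: {odds_ratio(*counts):.1f}")
```

Rounded to one decimal place, the cross-product ratios agree with the point estimates reported in Table 1 (infinite, 6.0, and 6.4).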
