PhD Thesis: Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data

by Heather Alyce Piwowar
Database (2010)

Abstract

Many initiatives encourage research investigators to share their raw research datasets in hopes of increasing research efficiency and quality. Despite these investments of time and money, we do not have a firm grasp on the prevalence or patterns of data sharing and reuse. Previous survey methods for understanding data sharing patterns provide insight into investigator attitudes, but do not facilitate direct measurement of data sharing behaviour or its correlates. In this study, we evaluate and use bibliometric methods to understand the impact, prevalence, and patterns with which investigators publicly share their raw gene expression microarray datasets after study publication. To begin, we analyzed the citation history of 85 clinical trials published between 1999 and 2003. Almost half of the trials had shared their microarray data publicly on the internet. Publicly available data was significantly (p=0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin. Digging deeper into data sharing patterns required methods for automatically identifying data creation and data sharing. We derived a full-text query to identify studies that generated gene expression microarray data. Issuing the query in PubMed Central, Highwire Press, and Google Scholar found 56% of the data-creation studies in our gold standard, with 90% precision. Next, we established that searching ArrayExpress and the Gene Expression Omnibus databases for PubMed article identifiers retrieved 77% of associated publicly-accessible datasets. We used these methods to identify 11603 publications that created gene expression microarray data. Authors of at least 25% of these publications deposited their data in the predominant public databases. We collected a wide set of variables about these studies and derived 15 factors that describe their authorship, funding, institution, publication, and domain environments. 
In second-order analysis, authors with a history of sharing and reusing shared gene expression microarray data were most likely to share their data, and those studying human subjects and cancer were least likely to share. We hope these methods and results will contribute to a deeper understanding of data sharing behavior and eventually more effective data sharing initiatives.

Available from Heather Piwowar's profile on Mendeley.

FOUNDATIONAL STUDIES FOR MEASURING
THE IMPACT, PREVALENCE, AND PATTERNS
OF PUBLICLY SHARING BIOMEDICAL RESEARCH DATA








by
Heather Alyce Piwowar
Bachelor of Science in Electrical Engineering and Computer Science, MIT, 1995
Master of Engineering in Electrical Engineering and Computer Science, MIT, 1996
Master of Science in Biomedical Informatics, University of Pittsburgh, 2006









Submitted to the Graduate Faculty of
the School of Medicine in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy









University of Pittsburgh
2010


UNIVERSITY OF PITTSBURGH
SCHOOL OF MEDICINE




This dissertation was presented

by


Heather Alyce Piwowar



It was defended on
March 24, 2010
and approved by
Brian B. Butler, PhD, Associate Professor,
Katz Graduate School of Business, University of Pittsburgh

Ellen G. Detlefsen, PhD, Associate Professor,
School of Information Sciences, University of Pittsburgh

Gunther Eysenbach, MD, MPH, Associate Professor,
Department of Health Policy, Management and Evaluation, University of Toronto

Madhavi Ganapathiraju, PhD, Assistant Professor,
Department of Biomedical Informatics, University of Pittsburgh

Dissertation Advisor: Wendy W. Chapman, PhD, Assistant Professor,
Department of Biomedical Informatics, University of Pittsburgh



FOUNDATIONAL STUDIES FOR MEASURING
THE IMPACT, PREVALENCE, AND PATTERNS
OF PUBLICLY SHARING BIOMEDICAL RESEARCH DATA
Heather A. Piwowar, PhD
University of Pittsburgh, 2010

Many initiatives encourage research investigators to share their raw research datasets
in hopes of increasing research efficiency and quality. Despite these investments of
time and money, we do not have a firm grasp on the prevalence or patterns of data
sharing and reuse. Previous survey methods for understanding data sharing patterns
provide insight into investigator attitudes, but do not facilitate direct measurement of
data sharing behaviour or its correlates. In this study, we evaluate and use bibliometric
methods to understand the impact, prevalence, and patterns with which investigators
publicly share their raw gene expression microarray datasets after study publication.
To begin, we analyzed the citation history of 85 clinical trials published between
1999 and 2003. Almost half of the trials had shared their microarray data publicly on
the internet. Publicly available data was significantly (p=0.006) associated with a 69%
increase in citations, independently of journal impact factor, date of publication, and
author country of origin.
Digging deeper into data sharing patterns required methods for automatically
identifying data creation and data sharing. We derived a full-text query to identify
studies that generated gene expression microarray data. Issuing the query in PubMed
Central®, Highwire Press, and Google Scholar found 56% of the data-creation studies
in our gold standard, with 90% precision. Next, we established that searching
ArrayExpress and the Gene Expression Omnibus databases for PubMed® article
identifiers retrieved 77% of associated publicly-accessible datasets.
We used these methods to identify 11603 publications that created gene
expression microarray data. Authors of at least 25% of these publications deposited
their data in the predominant public databases. We collected a wide set of variables
about these studies and derived 15 factors that describe their authorship, funding,
institution, publication, and domain environments. In second-order analysis, authors
with a history of sharing and reusing shared gene expression microarray data were
most likely to share their data, and those studying human subjects and cancer were
least likely to share.
We hope these methods and results will contribute to a deeper understanding of
data sharing behavior and eventually more effective data sharing initiatives.

TABLE OF CONTENTS
PREFACE
1.0 INTRODUCTION
1.1 BACKGROUND
1.1.1 The potential benefits of data sharing
1.1.2 Current data sharing practice: forces in support
1.1.3 Current data sharing practice: forces in opposition
1.2 PREVIOUS RESEARCH ON DATA SHARING BEHAVIOR
1.2.1 Measuring and modeling data sharing behavior
1.2.2 Measuring and modeling data sharing attitudes and intentions
1.2.3 Identifying instances of data sharing
1.2.4 Evaluating the impact of data sharing policies
1.2.5 Estimating the costs and benefits of data sharing
1.2.6 Related research fields
1.3 RESEARCH DESIGN AND METHODS
1.3.1 Aim 1: Does sharing have benefit for those who share?
1.3.2 Aim 2: Can sharing and withholding be systematically measured?
1.3.3 Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?
1.4 RELATED RESEARCH APPLICATIONS OF METHODS
1.4.1 Citation analysis for adoption and impact of open science
1.4.2 Natural language processing of the biomedical literature
1.4.3 Regression and factor analysis for deriving and evaluating models of sharing behavior
1.5 OUTLINE OF THE DISSERTATION
2.0 AIM 1: SHARING DETAILED RESEARCH DATA IS ASSOCIATED WITH INCREASED CITATION RATE
2.1 INTRODUCTION
2.2 MATERIALS AND METHODS
2.2.1 Identification and Eligibility of Relevant Studies
2.2.2 Data Extraction
2.2.3 Analysis
2.3 RESULTS
2.4 DISCUSSION
3.0 AIM 2A: USING OPEN ACCESS LITERATURE TO GUIDE FULL-TEXT QUERY FORMULATION
3.1 BACKGROUND
3.2 METHOD
3.2.1 Query development corpus
3.2.2 Query development features
3.2.3 Query development algorithm
3.2.4 Query syntax
3.2.5 Query evaluation corpus
3.2.6 Query execution
3.2.7 Query evaluation statistics
3.3 RESULTS
3.3.1 Queries
3.3.2 Evaluation portal coverage
3.3.3 Query performance
3.4 DISCUSSION
4.0 AIM 2B: RECALL AND BIAS OF RETRIEVING GENE EXPRESSION MICROARRAY DATASETS THROUGH PUBMED IDENTIFIERS
4.1 BACKGROUND
4.2 METHODS
4.2.1 Reference standard
4.2.2 Database search for PubMed identifiers
4.2.3 Data extraction
4.2.4 Statistical analysis
4.3 RESULTS
4.4 DISCUSSION
4.5 CONCLUSIONS
5.0 AIM 3: WHO SHARES? WHO DOESN’T? FACTORS ASSOCIATED WITH SHARING GENE EXPRESSION MICROARRAY DATA
5.1 INTRODUCTION
5.2 METHODS
5.2.1 Studies for analysis
5.2.2 Study attributes
5.2.3 Statistical methods
5.3 RESULTS
5.3.1 First-order factors
5.3.2 Second-order factors
5.4 DISCUSSION
6.0 CONCLUSIONS
6.1 SUMMARY
6.2 CONTRIBUTIONS, IMPLICATIONS, AND FUTURE WORK
6.2.1 Contributions
6.2.2 Findings
6.2.2.1 Data sharing is associated with an increased citation rate
6.2.2.2 Data creation studies can be identified through full-text queries
6.2.2.3 Datasets can be identified by their PubMed identifiers
6.2.2.4 Many attributes are correlated with data sharing behaviour
6.2.3 The next frontier
6.3 CODE AND DATA AVAILABILITY
6.4 HOPE
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 1: Characteristics of eligible publications
Table 2: Multivariate regression on citation count of 85 publications
Table 3: Exploratory regression on citation count for 41 publications with shared data
Table 4: Derived microarray data creation queries for full-text portals
Table 5: Full-text portal coverage of reference journals, in order of preference
Table 6: Query accuracy by portal source
Table 7: Query accuracy compared to baseline MeSH queries
Table 8: Comparison of dataset retrieval by two retrieval strategies
Table 9: First-order factor loadings
Table 10: Second-order factor loadings, by first-order factors
Table 11: Second-order factor loadings, by original variables
Table 12: Data sharing prevalence by two second-order factors

LIST OF FIGURES

Figure 1: Distribution of 2004-2005 citation counts of 85 publications
Figure 2: Distribution of 2004-2005 citation counts of 70 lower-profile publications
Figure 3: Method for building boolean queries from text features
Figure 4: Datasets found or missed by PubMed ID queries, by database
Figure 5: Datasets found or missed by PubMed ID queries, by impact and size
Figure 6: Datasets found or missed by PubMed ID queries, by journal
Figure 7: Covariance matrix of independent variables
Figure 8: Proportion of articles with shared datasets, by year
Figure 9: Proportion of articles with shared datasets, by journal
Figure 10: Association between shared data and first-order factors
Figure 11: Odds ratios of data sharing for first-order factor, multivariate model
Figure 12: Association between shared data and second-order factors
Figure 13: Odds ratios of data sharing for second-order factor, multivariate model
Figure 14: Association between shared data and original independent variables

source code (including blog snippets!), and open access articles. I thank scientists who share
their data. It is hard to do. Thank you.
I am grateful to everyone who organized workshops, conferences, and symposia where I
presented preliminary and tangential work: the NLM training conference, AMIA, ISMB, ELPUB,
JCDL, PSB, ASIS&T. These opportunities gave me experience, exposure, confidence, and
valuable feedback. In particular I thank those who put extra effort into organizing doctoral
consortiums, student awards, and special tracks in Open Science.
Thanks to the open science community itself. You are inspirational, affirming, helpful,
and make me want to be my best self.
I send a shout-out to all of the caffeine and wifi-fueled “third spaces” and their friendly
faces that facilitated my flextime life in Pittsburgh and Vancouver, and to all of my friends and
relations who helped keep work in perspective and life fun.
Thanks to my Maple Ridge family. Mom, you are always interested, always make time,
and demonstrate a can-do and must-do attitude. Dad, I enjoy coming to you for insightful
advice, and relish your example of unabashed joy in single-minded focus. Robyn: your passion
is matched only by your intellect, and I admire both more than I can say. Callum and Kris, you
make a rich life look easy: I draw strength from your example.
My Scottdale family, you put a human face on the medical and teaching professions.
You go after your dreams, and offer unwavering support and love to those around you. Thank
you.
I save a place of honour for all of the caregivers in our lives. Grandparents! Also, the
staff at UCDC and Escuelita, and particularly Niki, B, Christa, Katie, Rosi, Lorenza, Lisa: thank
you so much for your warmth and care. Niki: even more thanks on top, because you helped
our family navigate the early days, and made me feel good about being a brand new mom and a
PhD student and both at the same time.
Finally, first, last, and always: John, for doing everything you did to make this happen.





I dedicate this work to two people:

Kira
without whom I’d never have started, and
John
without whom I’d never have finished.

withholding will facilitate effective refinement of data sharing initiatives to better address
real-world needs.
1.1 BACKGROUND
Widespread adoption of the Internet now allows research results to be shared more
readily than ever before. This is true not only for published research reports, but also
for the raw research data points that underlie the reports. Investigators who collect and
analyze data can submit their datasets to online databases, post them on websites, and
include them as electronic supplemental information – thereby making the data easy to
examine and reuse by other researchers.
Reusing research data has many benefits for the scientific community. New
research hypotheses can be tested more quickly and inexpensively when duplicate data
collection is reduced. Data can be aggregated to study otherwise-intractable issues,
and a more diverse set of scientists can become involved when analysis is opened
beyond those who collected the original data. Ethically, it has long been considered a
tenet of scientific behavior to share results [1], thereby allowing close examination of
research conclusions and enabling others to build directly on previous work. The
ethical position is even stronger when the research has been funded by public money
[2], or the data are donated by patients and so should be used to advance science to
the greatest extent permitted by the donors [3].
However, whereas the general research community benefits from shared data,
much of the burden for sharing the data falls to the study investigator. A major cost is
time: the data have to be formatted, documented, and released. Further, it is sometimes
complicated to decide where to best publish data, since supplementary information and
laboratory sites are transient [4-6]. Beyond a time investment, releasing data can induce
fear. There is a possibility that the original conclusions may be challenged by a re-
analysis, whether due to possible errors in the original study [7], a misunderstanding or
misinterpretation of the data [8], or simply more refined analysis methods. Future data
miners might discover additional relationships in the data, some of which could disrupt
1.1.1 The potential benefits of data sharing
Sharing information facilitates science. Reusing previously-collected data in new studies
allows these valuable resources to contribute far beyond their original analysis [14]. In
addition to being used to confirm original results, raw data can be used to explore
related or new hypotheses, particularly when combined with other publicly available
data sets. Real data is indispensable when investigating and developing study methods,
analysis techniques, and software implementations. The larger scientific community
also benefits: sharing data encourages multiple perspectives, helps to identify errors,
discourages fraud, is useful for training new researchers, and increases efficient use of
funding and patient population resources by avoiding duplicate data collection.
Believing that these benefits outweigh the costs of sharing research data,
many initiatives actively encourage investigators to make their data available. Some
journals require the submission of detailed biomedical data to publicly available
databases as a condition of publication [15, 16]. Since 2003, the NIH has required a
data sharing plan for all large funding grants and has more recently introduced stronger
requirements for genome-wide association studies [17, 18]; other funders have similar
policies. Several government whitepapers [14, 19] and high-profile editorials [19-25]
call for responsible data sharing and reuse; large-scale collaborative science is
providing the opportunity to share datasets within and outside of the original research
projects [20, 21]; and tools, standards, and databases are developed and maintained to
facilitate data sharing and reuse.
1.1.2 Current data sharing practice: forces in support
As highlighted above, sharing research data has many potential benefits to society.
Although sharing of data has always been an aspiration of the scientific enterprise, it
has only been common in a few subdisciplines. Forces are now converging to make it
an achievable and everyday practice.
Datasets are larger than they have ever been – and larger than any single team
of scientists can analyze exhaustively. The ubiquitous sharing and reuse of DNA

Biomedical Ontology’s Bioportal framework [37], and caBIG [29, 38, 39] provide visions
for the future of research when data is more universally available and interoperable.
Data sharing and integration are being actively pursued outside of biomedical
research, in other scientific fields (physics, astronomy, environmental science) and also
by the general public [40]. Several websites encourage uploading and visualizing all
sorts of data: the “Tasty Data Goodies” at Swivel (http://www.swivel.com) and IBM’s
Many Eyes (http://www.many-eyes.com) are popular examples. Widespread adoption of
Web 2.0 technologies, including blogging, tagging, wikis, and mashups, suggests that
our next generation of scientists will expect and embrace a world of research remixes
[40].
Finally, I note the complementary forces of open access and pre-print
publications, open notebook science projects [41], open source code [42], Creative
Commons copyright licenses (http://creativecommons.org/) for many kinds of original
content (including data), and two recent public access policies. The NIH Public Access
Policy requires all NIH-funded investigators to submit their peer-reviewed manuscripts
to PubMed Central to ensure public access, as of April 2008 [43]. In February 2008, the
faculty of Harvard University voted to make all faculty scholarly publications freely
available in an online open-access repository [44], the first such resolution by a
university in the United States. While these policies do not apply to data beyond that
provided within the manuscripts, they clearly demonstrate a political will to support
sharing research results “to help advance science and improve human health”
(http://publicaccess.nih.gov) and “promote free and open access to significant, ongoing
research” [44].
1.1.3 Current data sharing practice: forces in opposition
While many forces are converging to enhance our ability to share data, there are
significant social, organizational, technical and legislative factors that may impede them.
Investigators may restrict access to data to maximize the professional and
economic benefit that they accrue from data they generate, even though they also gain
advantage by accessing data produced by others.
1.2.1 Measuring and modeling data sharing behavior
Most measurements of data sharing prevalence have manually searched for shared
datasets across a subset of journals [10, 11, 49], or systematically contacted authors to
ask for shared datasets [51]. These studies have found that data sharing levels are
high (but less than 100%) in a few cases, but overall prevalence is low. For example,
Ochsner et al. [10] found that despite the maturity of gene expression microarray data
sharing infrastructure and the multitude of funder and journal mandates, the overall rate of
sharing gene expression microarray data online is about 50%.
These analyses have not correlated their prevalence findings with other variables
to detect patterns. Multivariate analyses have relied upon surveyed attitudes and
intentions (described below), rather than measured characteristics.
1.2.2 Measuring and modeling data sharing attitudes and intentions
The largest body of knowledge about motivations and predictors for biomedical data
sharing and withholding comes from Campbell and co-authors. They surveyed
researchers, asking whether they have ever requested data and been denied, or
themselves denied other researchers access to data. Results indicated that
participation in relationships with industry, mentors’ discouragement of data sharing,
negative past experience with data sharing, and male gender were associated with data
withholding [13]. In another survey, among geneticists who said they intentionally
withheld data related to their published work, 80% said it was too much effort to share
the data, 64% said they withheld data to protect the ability of a junior team member to
publish, and 53% withheld data to protect their own publishing opportunities [52].
Occasionally, the administrators of centralized data servers publish feedback
surveys of their users. As an example, Ventura reports a survey of researchers who
submitted and reviewed microarray studies in the Physiological Genomics journal after
its mandatory data submission policy had been in place for two years. Almost all (92%)
authors said that they believed depositing microarray data was of value to the scientific
community and about half (55%) were aware of other researchers reusing data from the
database [53].
In related research, the information science and management of information
systems communities have developed several models of knowledge sharing. These
models often use either case studies [54] or opinions and attitudes gathered through
validated survey instruments ([44, 55-57], and many more). Studied domains include
knowledge sharing within an organization, volunteering knowledge in open social
networks, physician knowledge sharing in hospitals, participation in open source
projects, academic contributions to institutional archives, and other related activities.
1.2.3 Identifying instances of data sharing
While surveys have provided insight into sharing and reuse behavior, other issues are
best examined by studying the demonstrated behavior of scientists. Unfortunately,
observed measurement of data behavior is difficult because of the complexity in
identifying all episodes of data sharing and reuse. Although indications of sharing and
reuse usually exist within a published research report, the descriptions are in
unstructured free text and thus complex to extract.
Most studies of data sharing to date have used a manual review to identify
shared datasets (e.g. [10, 11, 49]).
One automated approach for identifying data sharing behavior is to follow the
“primary citation” field of database submission entries. Unfortunately, this is imperfect,
since these references are often missing when data is submitted prior to study publication.
Populating the submission citation fields retrospectively requires intensive manual effort,
as demonstrated by the recent Protein Data Bank remediation project [57, 58], and thus
is not usually performed. No effective way exists to automatically retrieve and index
data housed on personal or lab websites or journal supplementary information.
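As a minimal sketch of the primary-citation approach described above, the following Python fragment indexes database submission records by their citation identifier and reports how many known data-sharing articles that index can recover. The record layout (`accession`, `primary_citation_pmid`), the accession strings, and the article identifiers are all hypothetical illustrations, not the schema or contents of any actual database. Submissions whose citation field was never populated are invisible to this lookup, which is the limitation noted in the text.

```python
from collections import defaultdict

# Hypothetical submission records; the field names are assumptions made
# for illustration, not the schema of any real database.
submissions = [
    {"accession": "DS-001", "primary_citation_pmid": "1001"},
    {"accession": "DS-002", "primary_citation_pmid": None},    # deposited before publication
    {"accession": "DS-003", "primary_citation_pmid": "1003"},
    {"accession": "DS-004", "primary_citation_pmid": "1003"},  # two datasets, one article
]


def index_by_citation(records):
    """Map each cited article identifier to the accessions that cite it."""
    index = defaultdict(list)
    for record in records:
        pmid = record.get("primary_citation_pmid")
        if pmid:  # skip records whose citation field was never populated
            index[pmid].append(record["accession"])
    return dict(index)


def recall(found, reference):
    """Fraction of reference articles recovered by the automated lookup."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(reference & set(found)) / len(reference)


citation_index = index_by_citation(submissions)

# Hypothetical manually curated list of articles known to have shared data.
known_sharing_articles = ["1001", "1002", "1003"]
lookup_recall = recall(citation_index.keys(), known_sharing_articles)

print(citation_index)
print(f"recall = {lookup_recall:.2f}")
```

In this toy example the article whose dataset lacks a citation entry is missed, so the lookup recovers two of the three known articles.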
Related research has examined the degree to which data remains available after
it has been shared. Multiple studies underscore the transience of supplementary
information [5], website URLs [6], and corresponding author email addresses [44].
1.2.4 Evaluating the impact of data sharing policies
Despite many funder and journal policies requesting and requiring data sharing, the
impact of these policies has only been measured in small and disparate studies.
McCain manually categorized the journal “Instruction to Author” statements in 1995 [15].
A more recent manual review of gene sequence papers found that, despite
requirements, up to 15% of articles did not submit their datasets to Genbank [11].
Analyses of reproducibility in the political science literature suggest that only actively
enforced journal policies are effective [49].
Studying the impact of data sharing policies is difficult because policies are often
confounded with other variables. If, for example, impact factor is positively correlated
with a strong journal data sharing policy as well as a large research impact, it is difficult
to distinguish the direction of causation. Evaluating data sharing policies would ideally
involve a randomized controlled trial, but unfortunately this is impractical.
In related work, evaluations have been done to estimate the impact of reporting
guidelines [59].
1.2.5 Estimating the costs and benefits of data sharing
Estimating the costs and benefits of data sharing would be challenging even with a
comprehensive dataset of occurrences. A complete evaluation would require comparing
projects that shared with other similar projects that did not, across a wide variety of
variables including person-hours-till-completion, total project cost, received citations and
their impact, the number and impact of future publications, promotion, success in future
grant proposals, and general recognition and respect in the field.
Pienta [60] is currently investigating these questions with respect to social
science research data and publications. Zimmerman [61] has studied the ways in which
ecologists find and validate datasets to overcome the personal costs and risks of data
reuse.
Examining variables for their benefits on research impact is a common theme
within the field of bibliometrics. Research impact is usually approximated by citation
metrics, despite their recognized limitations.
1.2.6 Related research fields
Evaluation of data sharing and reuse behavior is related to a number of other active
research fields: code reusability in software engineering, motivation in open source
projects, online knowledge sharing communities, corporate knowledge sharing,
tools for collaboration, evaluating research output, the sociological study of altruism,
information retrieval, usage metrics, data standards, the semantic web, open access,
and open notebook science.

1.3 RESEARCH DESIGN AND METHODS
The long-term goal of this research is to accelerate research progress by increasing
effective data reuse through informed improvement of data sharing and reuse tools and
policies. The objective of this research project is to examine the feasibility of evaluating
data sharing behavior based on examination of the biomedical literature.
This research addressed the following specific aims:
1.3.1 Aim 1: Does sharing have benefit for those who share?
I investigated the association between sharing raw microarray data and subsequent
citation rate of published studies. I used datasets generated by a small, relatively
homogeneous set of cancer gene expression microarray clinical trials. Multivariate
analysis was used to statistically control for potential confounders. The results of
Aim 1 provided motivation for Aim 2 and preliminary work for Aim 3.
1.3.2 Aim 2: Can sharing and withholding be systematically measured?
Because the manual methods used to conduct Aim 1 did not scale to larger analyses, I
investigated automatic methods for measuring data sharing and withholding behavior.
First, articles that generated gene expression microarray data were identified using NLP
on full-text research articles. Second, to assess whether the authors of these data-generating
studies shared or withheld their data, I investigated using database submission citation
links as evidence of data sharing. The results of Aim 2 were used to generate a dataset
for Aim 3.
1.3.3 Aim 3: How often is data shared? What predicts sharing? How can we
model sharing behavior?
First, I applied the classification systems described in Aim 2 to a wide spectrum of the
biomedical literature to identify articles that generated gene expression microarray data
and, subsequently, which of the articles that generated data also shared it. Then, for
each of the articles, I collected and analyzed features related to the authors, their
institutional and funding environment, the study itself, and the publishing mechanism. I
used univariate and multivariate statistics to investigate which of these features are
associated with dataset sharing. Finally, I used exploratory factor analysis to derive a
model that could be used to explain data sharing decisions based on my measured
variables.
1.4 RELATED RESEARCH APPLICATIONS OF METHODS
1.4.1 Citation analysis for adoption and impact of open science
Citation analysis has been used to assess several aspects of the adoption and impact
of open science, particularly literature open access. Eysenbach [62] found that authors
Techniques vary depending on the task, but stemming, synonyms, and n-grams
are a mainstay [86]. Query expansion to include all query aspects has also been
shown to help [87]. The availability of full text articles in PMC, Google Scholar, and
other portals is spurring new approaches [88].
Finally, NLP techniques applied to clinical text might be informative. For
example, Melton et al. [89] also face the issue of identifying records based on snippets
of full text, though in their case it is adverse reactions in clinical discharge summaries.
1.4.3 Regression and factor analysis for deriving and evaluating models of
sharing behavior
Most models of sharing behavior are based on established surveys, and thus evaluate
their models using confirmatory analysis [101-105]. However, a few research projects
instead use linear regression, such as [13, 56, 90-92]. Siemsen et al. [93] compare a
regression model to that derived from constraining factor analysis. Finally, several
studies involve exploratory factor analysis [71, 94, 95].
1.5 OUTLINE OF THE DISSERTATION
This chapter has provided an introduction to the dissertation and its topic. Each aim is
described separately as a self-contained research report including an introduction,
methods, results, and discussion. Aim 1 is covered in Chapter 2, Aim 2 in Chapters 3
and 4, and Aim 3 in Chapter 5. An overall discussion of contributions, implications, and
future work is provided in the final chapter.

2.0 AIM 1: SHARING DETAILED RESEARCH DATA IS ASSOCIATED WITH
INCREASED CITATION RATE
Background
Sharing research data provides benefit to the general scientific community, but the
benefit is less obvious for the investigator who makes his or her data available.
Principal Findings
We examined the citation history of 85 cancer microarray clinical trial publications with
respect to the availability of their data. The 48% of trials with publicly available
microarray data received 85% of the aggregate citations. Publicly available data was
significantly (p = 0.006) associated with a 69% increase in citations, independently of
journal impact factor, date of publication, and author country of origin using linear
regression.
Significance
This correlation between publicly available data and increased literature impact may
further motivate investigators to share their detailed research data.
2.1 INTRODUCTION
Sharing information facilitates science. Publicly sharing detailed research data–sample
attributes, clinical factors, patient outcomes, DNA sequences, raw mRNA microarray
measurements–with other researchers allows these valuable resources to contribute far
beyond their original analysis [14]. In addition to being used to confirm original results,
raw data can be used to explore related or new hypotheses, particularly when combined
with other publicly available data sets. Real data is indispensable when investigating
and developing study methods, analysis techniques, and software implementations. The
larger scientific community also benefits: sharing data encourages multiple
perspectives, helps to identify errors, discourages fraud, is useful for training new
researchers, and increases efficient use of funding and patient population resources by
avoiding duplicate data collection.
Believing that these benefits outweigh the costs of sharing research data,
many initiatives actively encourage investigators to make their data available. Some
journals, including the PLoS family, require the submission of detailed biomedical data
to publicly available databases as a condition of publication [15, 96, 97]. Since 2003, the
NIH has required a data sharing plan for all large funding grants. The growing open-
access publishing movement will perhaps increase peer pressure to share data.
However, while the general research community benefits from shared data, much
of the burden for sharing the data falls to the study investigator. Are there benefits for
the investigators themselves?
A currency of value to many investigators is the number of times their
publications are cited. Although limited as a proxy for the scientific contribution of a
paper [98], citation counts are often used in research funding and promotion decisions
and have even been assigned a salary-increase dollar value [99]. Boosting citation rate
is thus a potentially important motivator for publication authors.
In this study, we explored the relationship between the citation rate of a
publication and whether its data was made publicly available. Using cancer microarray
clinical trials, we addressed the following questions: Do trials which share their
microarray data receive more citations? Is this true even within lower profile trials? What
other data-sharing variables are associated with an increased citation rate? While this
study is not able to investigate causation, quantifying associations is a valuable first
step in understanding these relationships. Clinical microarray data provides a useful
environment for the investigation: despite being valuable for reuse and extremely costly
to collect, it is not yet universally shared.
Finally, as exploratory analysis within the subset of all trials with publicly
available microarray data, we looked at the linear regression relationships between
additional covariates and citation count. Covariates included trial size, clinical endpoint,
microarray platform, inclusion in various public databases, release of raw data, mention
of supplementary information, and reference within the Oncomine [105] repository.
Statistical analysis was performed using the stats package in R version 2.1 [108].
P-values are two-tailed.
2.3 RESULTS
We studied the citations of 85 cancer microarray clinical trials published between
January 1999 and April 2003, as identified in a systematic review by Ntzani and
Ioannidis [100] and listed in Supplementary Text S1. We found 41 of the 85 clinical trials
(48%) made their microarray data publicly available on the internet. Most data sets were
located on lab websites (28), with a few found on publisher websites (4), or within public
databases (6 in the Stanford Microarray Database (SMD) [101], 6 in Gene Expression
Omnibus (GEO) [102], 2 in ArrayExpress [103], 2 in the NCI GeneExpression Data
Portal (GEDP) (gedp.nci.nih.gov); some datasets in more than one location). The
internet locations of the datasets are listed in Supplementary Text S2. The majority of
datasets were made available concurrently with the trial publication, as illustrated within
the WayBackMachine internet archives (www.archive.org/web/web.php) for 25 of the
datasets and mention of supplementary data within the trial publication itself for 10 of
the remaining 16 datasets. As seen in Table 1, trials published in high impact journals,
prior to 2001, or with US authors were more likely to share their data.


Table 1: Characteristics of eligible publications



The cohort of 85 trials was cited an aggregate of 6239 times in 2004–2005 by
3133 distinct articles (median of 1.0 cohort citation per article, range 1–23). The 48% of
trials which shared their data received a total of 5334 citations (85% of aggregate),
distributed as shown in Figure 1.
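
The proportions quoted in this section follow directly from the reported counts. A minimal Python check, using only counts stated in the text (the helper name `share` is introduced here purely for illustration):

```python
def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole

# Counts reported in the text.
trials_total = 85
trials_sharing = 41
citations_total = 6239
citations_to_sharing = 5334

pct_trials_sharing = share(trials_sharing, trials_total)
pct_citations_to_sharing = share(citations_to_sharing, citations_total)

print(f"{pct_trials_sharing:.0f}% of trials shared their data")
print(f"{pct_citations_to_sharing:.0f}% of aggregate citations went to those trials")
```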


Figure 1: Distribution of 2004-2005 citation counts of 85 publications




Whether a trial's dataset was made publicly available was significantly associated
with the log of its 2004–2005 citation rate (69% increase in citation count; 95%
confidence interval: 18 to 143%, p=0.006), independent of journal impact factor, date of
publication, and US authorship. Detailed results of this multivariate linear regression are
given in Table 2. A similar result was found when we regressed on the number of
citations each trial received during the 24 months after its publication (45% increase in
citation count; 95% confidence interval: 1 to 109%, p = 0.050).
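
Because the regression was fitted to the log of the citation count, a coefficient b on the data-availability indicator corresponds to a percentage change of 100 * (exp(b) - 1) in citations. The Python sketch below shows this conversion. It assumes a natural-log transformation (the text does not state the base), and the coefficients it prints are back-calculated from the reported percentages for illustration only; they are not values taken from Table 2.

```python
import math

def pct_change(log_coef: float) -> float:
    """Percentage change in the outcome implied by a coefficient on a natural-log outcome."""
    return 100.0 * (math.exp(log_coef) - 1.0)

def log_coef(pct: float) -> float:
    """Inverse of pct_change: the log-scale coefficient implied by a percentage change."""
    return math.log(1.0 + pct / 100.0)

# Reported point estimate and 95% confidence bounds, converted back to the log scale.
for pct in (69.0, 18.0, 143.0):
    b = log_coef(pct)
    print(f"{pct:5.0f}% increase <-> log-scale coefficient {b:.3f}")
```

Working on the log scale is a common choice for citation counts, which tend to be strongly right-skewed; the percentage form is simply easier to read.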


Table 2: Multivariate regression on citation count of 85 publications



To confirm that these findings were not dependent on a few extremely high-
profile papers, we repeated our analysis on a subset of the cohort. We define papers
published after the year 2000 in journals with an impact factor less than 25 as lower-
profile publications. Of the 70 trials in this subset, only 27 (39%) made their data
available, although they received 1875 of 2761 (68%) aggregate citations. The


microarray databases (SMD [119], GEO [102], ArrayExpress [103], CIBEX [104],
GEDP(gedp.nci.nih.gov)) offer an obvious, centralized, free, and permanent data
storage solution. Standards have been developed to specify minimal required data
elements (MIAME [120] for microarray data, REMARK [121] for prognostic study
details), consistent data encoding (MAGE-ML [122] for microarray data), and semantic
models (BRIDG (www.bridgproject.org) for study protocol details). Software exists to
help de-identify some types of patient records (De-ID [123]). The NIH and other
agencies allow funds for data archiving and sharing. Finally, large initiatives (NCI's
caBIG [39]) are underway to build tools and communities to enable and advance
sharing data.
Research consumes considerable resources from the public trust. As data
sharing gets easier and benefits are demonstrated for the individual investigator,
hopefully authors will become more apt to share their study data and thus maximize its
usefulness to society.



3.0 AIM 2A: USING OPEN ACCESS LITERATURE TO GUIDE FULL-TEXT
QUERY FORMULATION
Background
Much scientific knowledge is contained in the details of the full-text biomedical
literature. Most research in automated retrieval presupposes that the target literature
can be downloaded and preprocessed prior to query. Unfortunately, this is not a
practical or maintainable option for most users due to licensing restrictions, website
terms of use, and sheer volume. Scientific article full-text is increasingly queryable
through portals such as PubMed Central, Highwire Press, Scirus, and Google Scholar.
However, because these portals only support very basic Boolean queries and full text is
so expressive, formulating an effective query is a difficult task for users. We propose
improving the formulation of full-text queries by using the open access literature as a
proxy for the literature to be searched. We evaluated the feasibility of this approach by
building a high-precision query for identifying studies that perform gene expression
microarray experiments.
Methodology and Results
We built decision rules from unigram and bigram features of the open access literature.
Minor syntax modifications were needed to translate the decision rules into the query
languages of PubMed Central, Highwire Press, and Google Scholar. We mapped all
retrieval results to PubMed identifiers and considered our query results as the union of
retrieved articles across all portals. Compared to our reference standard, the derived
full-text query found 56% (95% confidence interval, 52% to 61%) of intended studies,
and 90% (86% to 93%) of studies identified by the full-text search met the reference
standard criteria. Due to this relatively high precision, the derived query was better
suited to the intended application than alternative baseline MeSH® queries.
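
The evaluation described above pools portal results by PubMed identifier and scores the pooled set against the reference standard. A minimal Python sketch of that bookkeeping follows; the identifiers and the `evaluate` helper are hypothetical placeholders, not data or code from the study.

```python
def evaluate(retrieved: set, reference: set) -> tuple:
    """Return (precision, recall) of a retrieved set against a reference standard."""
    true_positives = len(retrieved & reference)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical PubMed identifiers returned by each portal for the same query.
portal_results = {
    "PubMed Central": {"1001", "1002", "1003"},
    "HighWire Press": {"1003", "1004"},
    "Google Scholar": {"1002", "1005"},
}

# The query result is the union of retrieved articles across all portals.
retrieved = set().union(*portal_results.values())

# Hypothetical reference standard of studies that created microarray data.
reference = {"1001", "1002", "1003", "1004", "1006", "1007", "1008"}

precision, recall = evaluate(retrieved, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Mapping every hit to a PubMed identifier before taking the union is what prevents the same article, found in two portals, from being counted twice.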
Significance
Using open access literature to develop queries for full-text portals is an open, flexible,
and effective method for retrieval of biomedical literature articles based on article full-
text. We hope our approach will raise awareness of the constraints and opportunities in
mainstream full-text information retrieval and provide a useful tool for today’s
researchers.
3.1 BACKGROUND
Much scientific information is available only in the full body of a scientific article. Full-text
biomedical articles contain unique and valuable information not encapsulated in titles,
abstracts, or indexing terms. Literature-based hypothesis generation, systematic
reviews, and day-to-day literature surveys often require retrieving documents based on
information in full-text only.
Progress has been made in accurately retrieving documents and passages
based on their full-text content. Research efforts, relying on advanced machine-learning
techniques and features such as parts of speech, stemmed words, n-grams, semantic
tags, and weighted tokens, have focused on situations in which complete full-text
corpora are available for preprocessing. Unfortunately, most users do not have an
extensive, local, full-text library. Establishing and maintaining a machine-readable
archive involves complex issues of permissions, licenses, storage, and formats.
Consequently, applying cutting-edge full-text information retrieval and extraction
research is not feasible for mainstream scientists.
Several portals offer a simple alternative: PubMed Central, Highwire Press,
Scirus, and Google Scholar provide full-text query interfaces to an increasingly large
subset of the biomedical literature. Users can search for full-text keywords and phrases
without maintaining a local archive; in fact, they need not have subscription nor access
privileges for the articles they are querying. Portals return a list of articles that match
the query (often with a matching snippet). Users can manually review this list and
download articles subject to individual licensing agreements.
3.3 RESULTS
3.3.1 Queries
We applied our query-formulation approach to the task of identifying studies that
performed gene expression microarray experiments. Using the open access literature
as a development corpus and links to a gene expression microarray database as a
proxy endpoint, we derived the full-text queries shown in Table 4.


Table 4: Derived microarray data creation queries for full-text portals
Portal           Query
PubMed Central   ("gene expression" [text] AND "microarray" [text] AND "cell" [text] AND "rna" [text])
                 AND ("rneasy" [text] OR "trizol" [text] OR "real-time pcr" [text])
                 NOT ("tissue microarray*" [text] OR "cpg island*" [text])
HighWire Press   Anywhere in Text, ANY: ("gene expression" AND microarray AND cell AND rna)
                 AND (rneasy OR trizol OR "real-time pcr") NOT ("tissue microarray*" OR "cpg island*")
Google Scholar   +"gene expression" +microarray +cell +rna +(rneasy OR trizol OR "real time pcr")
                 -"cpg island*" -"tissue microarray*"
Scirus           Anywhere in Text, ALL: ("gene expression" AND microarray AND cell AND rna)
                 (rneasy OR trizol OR "real-time pcr") ANDNOT ("cpg island*" OR "tissue microarray*")
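
The PubMed Central row of Table 4 has a regular shape: a conjunction of required terms, a disjunction of protocol-related terms, and an excluded disjunction. The Python sketch below assembles a query string of that shape from term lists. The helper names are introduced here for illustration, and the exact spacing a portal accepts is an assumption.

```python
def tag(term: str) -> str:
    """Quote a term and restrict it to the full-text field, PubMed Central style."""
    return f'"{term}" [text]'

def build_pmc_query(required, any_of, excluded) -> str:
    """Combine term lists as (all required) AND (any of) NOT (any excluded)."""
    must = " AND ".join(tag(t) for t in required)
    should = " OR ".join(tag(t) for t in any_of)
    must_not = " OR ".join(tag(t) for t in excluded)
    return f"({must}) AND ({should}) NOT ({must_not})"

query = build_pmc_query(
    required=["gene expression", "microarray", "cell", "rna"],
    any_of=["rneasy", "trizol", "real-time pcr"],
    excluded=["tissue microarray*", "cpg island*"],
)
print(query)
```

Keeping the terms in lists rather than in a hand-edited string makes it easier to apply the minor syntax changes each portal requires.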

3.3.2 Evaluation portal coverage
Our evaluation corpus spanned 20 journals. We preferred to execute queries in
PubMed Central when possible, since it allows automated query and results processing:
As seen in Table 5, three of the 20 journals have deposited all of their content in
PubMed Central. HighWire Press is also easy to use, though it does require manual
Finally, we compare the results of the derived query to two naïve queries based
on Medical Subject Heading (MeSH) terms. As seen in Table 7, the derived query had
better precision than either of the MeSH queries at an acceptable recall for our intended
task.


Table 7: Query accuracy compared to baseline MeSH queries
Query                                                N     precision   recall   f-measure
"gene expression profiling" [mesh] OR
"Oligonucleotide Array Sequence Analysis" [mesh]     768   81%         66%      73%
"gene expression profiling" [mesh] AND
"Oligonucleotide Array Sequence Analysis" [mesh]     768   88%         24%      38%
Derived query                                        768   90%         56%      69%
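
The f-measure values in Table 7 are consistent with the balanced F-measure, the harmonic mean of precision P and recall R: F = 2PR / (P + R). A short Python check follows, using the rounded precision and recall values from the table (so a calculation on unrounded values could in principle differ in the last digit); the row labels are shorthand introduced here.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported in Table 7.
rows = {
    "MeSH OR query": (0.81, 0.66),
    "MeSH AND query": (0.88, 0.24),
    "Derived query": (0.90, 0.56),
}

for name, (p, r) in rows.items():
    print(f"{name}: F = {f_measure(p, r):.0%}")
```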

3.4 DISCUSSION
We described a mechanism for formulating effective queries for use in publicly
available, established full-text search portals, using the open access literature as
training material. As a proof of concept, we applied this approach to a task that requires
searching the full text of research articles: identifying studies that ran gene expression
microarray experiments. The query we derived achieved 90% precision and 56% recall,
making it a better fit for our intended application than lower-precision baseline MeSH
queries. Although the evaluation demonstrates the usefulness of this approach in only
one situation, we believe the method for deriving full-text queries could have
widespread potential.
Effectively querying full-text is difficult: Synonyms, variant spellings, acronyms,
and inexperience make it difficult to form effective queries [128]. Although difficult,
searching full-text is often the only way to identify methods [85], detect harm [129],
extract detailed data, or identify all of the biomedical concepts or genes explored in the
study [130, 131]. There is also evidence that searching full-text is more effective than
searching meta-data or abstracts for identifying articles of overall relevance [132, 133].
Domain-specific biomedical NLP and data integration systems, such as
Textpresso [134], Pharmspresso [135], BioText [81], and BioLit [136], illustrate the
potential value of accessing, exploring, and analyzing full-text, though none of these
tools is designed to facilitate searching across domain-independent open-access and
closed-access biomedical literature. Other systems have been built to take a
preassembled corpus of positive and negative examples to build a filter query for
execution in PubMed [137, 138], but to our knowledge, none suggest an easily
accessed open-source training set nor result in a full-text query for use in domain-
independent, publicly accessible online portals.
Existing full-text search portals, such as Google Scholar, Scirus, Highwire Press,
and PubMed Central, differ in their features and performance [154, 155], though we
believe their full-text searching capabilities have not yet been compared. We found
differences in retrieval performance, but because our dataset was relatively small, it was
not clear if any differences between portals were due to the portal or the subset of
journals we searched.
While portals provide a source of articles, many prohibit systematic downloads
[139]. Furthermore, it is unclear whether standard licensing agreements and fair use
allow text mining, “a question on which informed people continue to disagree” [157, 158].
Luckily, open access articles are available for download and all kinds of reuse.
Evidence suggests that these articles have similar textual characteristics to traditional
journal articles [140], and so we used them as a proxy for all articles.
Our method offers several advantages over alternatives: It is easy to maintain, it
is free and open to query both open- and subscription-based content, and the user can
be in direct control of recall/precision balance by setting recall and precision thresholds.
It does have several limitations, however. This technique can only identify articles with
While our system will undoubtedly underperform compared with those at the
cutting edge of research, we believe it will raise awareness of the constraints in
mainstream full-text information retrieval and provide a useful tool for today’s
researchers.

Conclusions
Searching database entries using PubMed identifiers can identify the majority of publicly
available datasets. We urge authors of all datasets to complete the citation fields for
their dataset submissions once publication details are known, thereby ensuring their
work has maximum visibility and can contribute to subsequent studies.
4.1 BACKGROUND
The number of publicly available biomedical research datasets, such as those based on
gene expression microarray experiments, continues to increase. The ability to access
and process these large datasets enables other scientists to perform their own data
driven studies, reduces duplicate data collection, allows the study of issues that require
combining multiple datasets, and facilitates the training of future scientists through the
analysis of real experimental data.
To realize these potential benefits, it is necessary that datasets can easily be
found when needed. Biomedical databases typically include structured data fields
indicating number of data samples, experimental platform and organism and tissue-type
or disease of study. The experimental design, controls, and interventions involved are
usually described in free-text fields. Unfortunately, the content of these descriptions is
often sparse and diverse [145]. As a result, although basic queries of the structured
fields can be effective, more complex queries may require pre-processing steps [146]
and lack the accuracy required for some applications [147, 148].
Many publicly available datasets are associated with rich annotation outside the
database: the published article describing the primary generation and analysis of the
data. Centralized biomedical databases often include a “primary citation” field to link to
the original published article or articles. This unambiguous link permits a user to query
the article metadata, indexing terms, abstracts, or even the full text of the article, and
then receive links to datasets relevant to the query.
The usefulness of Medical Subject Heading (MeSH) indexing terms for
annotating gene expression datasets has been described by Butte and colleagues [147,
149, 150]. For example, they found that 53% of gene expression microarray datasets in
the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus
(GEO) database were linked to articles with disease related MeSH terms [147], that
control/intervention gene expression data are publicly available for diseases contributing
to 30% of all disease-related mortality in the United States [149], and that
approximately 10% of microarray experiments in GEO have MeSH terms related to
pharmacological substances [150]. We expect that the use of MEDLINE annotations for
dataset retrieval will increase, particularly as combining text and data analysis becomes
more common [80, 136, 148, 151-154].
To identify the links between articles and their accompanying datasets, ideally a
scientist could simply query PubMed, PubMed Central, or a specialized value added
interface (e.g. MedMiner [155], BioText [81], or others [156]) and receive links to related
datasets. This is possible within the Entrez network of databases. By appending “AND
pubmed_gds [filter]” to any PubMed query, the set of returned articles is limited to those
identified as a primary citation in a Gene Expression Omnibus (GEO) DataSet record.
While viewing PubMed results, selecting “GEO Datasets” in the Database dropdown list
under “Find related data” in the right-hand column will retrieve the associated datasets.
The data can then be explored or downloaded. In many cases, this primary citation
query process can be automated. The Entrez databases can be queried through a web
service eUtilities interface
(http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html). Other databases offer
similar web services or application programming interfaces.
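
As a rough illustration of automating the primary citation query, the Python sketch below builds (but does not send) two E-utilities request URLs: an ESearch request that appends the pubmed_gds filter to a PubMed query, and an ELink request from a PubMed identifier to the GEO DataSets database. The base URL is NCBI's current E-utilities address rather than the help-page URL cited above, and the search term and identifier are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Assumption: current NCBI E-utilities base address.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(pubmed_query: str) -> str:
    """URL for a PubMed search limited to primary citations of GEO DataSet records."""
    term = f"({pubmed_query}) AND pubmed_gds [filter]"
    return f"{EUTILS_BASE}/esearch.fcgi?" + urlencode({"db": "pubmed", "term": term})

def elink_url(pmid: str) -> str:
    """URL for retrieving GEO DataSet records linked to one PubMed identifier."""
    params = {"dbfrom": "pubmed", "db": "gds", "id": pmid}
    return f"{EUTILS_BASE}/elink.fcgi?" + urlencode(params)

# Hypothetical query and identifier, for illustration only.
print(esearch_url("breast cancer microarray"))
print(elink_url("12345678"))
```

A real script would also need to fetch and parse the responses and respect NCBI's usage guidelines; those steps are omitted here.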
As with any information retrieval strategy, retrieving datasets through their
citation field identifiers has limitations. Not all publicly available datasets are submitted
to centralized databases, and many are hosted on publisher or laboratory websites.
Dataset citation fields are often empty because datasets are frequently submitted to
databases before the research article has been published and assigned a PubMed ID.
If we use a retrieval strategy based on article metadata, how many datasets are we
missing? Are the datasets that are found a representative sample? If not, what are the
biases?
Figure 4: Datasets found or missed by PubMed ID queries, by database
(bars indicate 95% confidence intervals of proportions)


Next, we looked at univariate patterns to determine whether the datasets
retrieved through our search differed from those found only by the Ochsner search. The
odds that a dataset was about cancer, performed on an Affymetrix platform, involved
humans, or involved cultured cells were not significantly different whether the dataset
was retrievable through our search method or not (p>0.3). The recall for datasets from
disciplinary journals was similar to the recall from multidisciplinary journals (p>0.1). In
ANOVA analysis, the distribution of species was not significantly different between the
two search strategies (p>0.9).
Datasets found through PubMed identifiers were more likely to be associated
with articles in higher impact journals than datasets overlooked by this retrieval method
(p=0.01). Our PubMed identifier search found 92% of datasets from articles published
in journals with impact factors greater than 20, 88% of those with impact factors
between 10 and 20, and 73% of those with impact factors between three and 10.
Journal data sharing policy and journal scope were strongly associated with journal
impact factor (p<0.001), but stratifying our dataset by these features only slightly
reduced the association between impact factor and recall (minimum p-value for stratified
analysis was 0.06).


There was no association between the number of citations received by a study or
the study sample size and whether or not the dataset was found by our PubMed
publicly available datasets in journals without such a policy (65%), but the difference
was not statistically significant (p=0.19).
4.4 DISCUSSION
In this study we found that scripted queries of centralized microarray databases using
PubMed identifiers retrieved 76.6% of all publicly available datasets associated with the
publications. The spectrum of datasets was similar to that found by a reference search
[10] in terms of array platform, cell source, subject of study, sample size, and study
impact.
Dataset retrieval through PubMed identifiers achieved the highest recall when
applied to studies from the highest-impact journals. Additional research is needed to
understand the reasons behind this finding since it is not fully explained by journal policy
or scope, and may have to do with the implementation details of journal policy
requirements. The importance of the retrieval bias depends on the intended use of the
query results. For example, while there is likely no problem using the query to retrieve
datasets for a combination analysis, caution is required when using the results for policy
evaluation because query results are not fully representative of all online datasets.
Our evaluation has several limitations. The evaluation dataset was not chosen
randomly and does not contain a representative distribution of journals: in particular,
our evaluation subset lacked any journal with an impact factor below 2.5. Also, our
reference standard classifications may contain errors, if there exist studies with publicly
available data that were identified by neither the Ochsner search nor our PubMed
identifier query.
We found that the number of gene expression microarray dataset entries with
citation links could be increased by about 25% if all datasets now published on the
internet were uploaded to centralized databases, and all primary article citation fields
were fully completed. This is consistent with the findings of manual update efforts on
the PDB database [57, 161]. We believe encouraging authors and enabling curators to
document all links between datasets and research articles is effort well spent. In addition
5.0 AIM 3: WHO SHARES? WHO DOESN’T? FACTORS ASSOCIATED WITH
SHARING GENE EXPRESSION MICROARRAY DATA
Many initiatives encourage research investigators to share their raw research datasets
in hopes of increasing research efficiency and quality. Despite these investments of
time and money, we do not have a firm grasp on the prevalence or patterns of data
sharing and reuse; the effectiveness of initiatives; or the costs, benefits, and impact of
repurposing biomedical research data. Previous survey methods for understanding
data sharing patterns provide insight into investigator attitudes, but do not facilitate
direct measurement of data sharing behaviour or its correlates. In this study, we use
bibliometric methods to understand the prevalence and patterns with which
investigators publicly share their raw gene expression microarray datasets after study
publication.
We used automated methods to identify 11,603 publications that created gene
expression microarray data and estimated that the authors of at least 25% of these
publications deposited their data in the predominant public databases. We collected a
wide set of variables about these studies and derived 15 factors that describe
authorship, funding, institution, publication, and domain environments. Most factors
were found to be statistically associated with the prevalence of data sharing. In
particular, publishing in a journal with a relatively strong data sharing policy, having
funding from many NIH grants, publishing in an open access journal, and having prior
experience sharing data were associated with the highest data sharing rates. In
contrast, increased first author age and experience, having no experience reusing data,
and studying cancer and human subjects were associated with the lowest data sharing
rates.

For every study article, we collected 124 attributes that were used as
independent variables, as listed in the Appendix. The independent variables were
collected automatically from a wide variety of sources. Basic bibliometric metadata was
extracted from the MEDLINE record, including journal, year of publication, number of
authors, Medical Subject Heading (MeSH) terms, number of citations from PubMed
Central, inclusion in PubMed subsets for cancer, whether the journal is published with
an open-access model, and whether it had data-submission links from GenBank, PDB, and
SwissProt. The corresponding address was parsed for institution and country, following
the methods of Yu et al. [184].
Institutions were cross-referenced to the SCImago Institutions Rankings 2009
World Report (http://www.scimagoir.com/) to estimate the relative degree of research
output and impact of the institutions. The gender of the first and last authors was
estimated using the Baby Name Guesser website at
http://www.gpeters.com/names/baby-names.php. ISI Journal Impact Factors and
associated metrics were extracted from the 2008 ISI Journal Citation Reports.
NIH grant details were extracted by cross-referencing grant numbers in the
MEDLINE record with the NIH award information at
http://report.nih.gov/award/state/state.cfm. From this information, we tabulated the
amount of total funding received for each of the fiscal years from 2003 to 2008. We also
estimated the date of renewal by identifying the most recent year in which a grant
number was prefixed by a “1” or “2”, an indication that the grant is “new” or “renewed,”
respectively.
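The renewal-date heuristic described above can be sketched in Python, the language used for the data collection scripts. This is a minimal illustration rather than the thesis code: the function name, the (fiscal_year, grant_number) input shape, and the sample grant numbers are all hypothetical, and the sketch assumes the full grant number begins with the application-type digit.

```python
def most_recent_award_or_renewal(records):
    """Return (year, status) for the latest record whose grant number
    starts with "1" (new) or "2" (renewed); None if there is none.

    `records` is an iterable of (fiscal_year, grant_number) pairs.
    """
    status_by_prefix = {"1": "new", "2": "renewed"}
    best = None
    for fiscal_year, grant_number in records:
        prefix = grant_number.strip()[:1]
        if prefix in status_by_prefix:
            if best is None or fiscal_year > best[0]:
                best = (fiscal_year, status_by_prefix[prefix])
    return best


# Hypothetical grant numbers, for illustration only.
example_records = [
    (2003, "1R01XX000001-01"),
    (2005, "5R01XX000001-03"),
    (2007, "2R01XX000001-05"),
    (2008, "5R01XX000001-06"),
]
print(most_recent_award_or_renewal(example_records))
```

Records whose prefix is neither "1" nor "2" are ignored, so a grant seen only in continuation years yields no estimate.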
We quantified the content of journal data-sharing policies based on the
“Instruction for Authors” for the most commonly occurring journals. We attempted to
estimate if the paper itself reused publicly available gene expression microarray data by
looking for its inclusion in the list that GEO keeps of reuse at
http://www.ncbi.nlm.nih.gov/projects/geo/info/ucitations.html.
A list of prior publications in MEDLINE was extracted from Author-ity clusters,
2009 edition [185], for the first and last author of each article in our study. To limit the
impact of extremely large “lumped” clusters that erroneously contain the publications of
more than one actual author, we excluded prior publication lists for first or last authors in
the largest 2% of clusters and instead considered this data to be missing. For all
papers in an author’s publication history with PubMed identifiers numerically less than
the PubMed identifier of the paper in question, we queried for whether any of these prior
publications had been published in an open access journal, were included in our “gene
expression microarray creation” subset themselves, or had reused gene expression
data. We recorded the date of the earliest publication by the author and the number of
citations to date that their earlier papers received in PubMed Central.
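The two filtering steps in this paragraph can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: the function names and data shapes are hypothetical, clusters are represented as a plain mapping from a cluster id to a list of PubMed identifiers, and the comparison relies on the assumption that identifiers increase with publication date.

```python
import math


def lumped_cluster_ids(cluster_sizes, top_percent=2):
    """Return the ids of the largest `top_percent` percent of clusters.

    `cluster_sizes` maps a cluster id to its number of publications.
    """
    n_excluded = math.ceil(len(cluster_sizes) * top_percent / 100)
    ranked = sorted(cluster_sizes, key=cluster_sizes.get, reverse=True)
    return set(ranked[:n_excluded])


def prior_history(cluster_id, clusters, excluded, current_pmid):
    """Return PMIDs smaller than `current_pmid`, or None (treated as
    missing data) when the author's cluster was excluded as too large."""
    if cluster_id in excluded:
        return None
    return sorted(p for p in clusters[cluster_id] if p < current_pmid)


# Hypothetical clusters: 49 small ones and one very large "lumped" one.
clusters = {f"author{i}": [1000 + i] for i in range(49)}
clusters["author0"] = [900, 1200, 1800]
clusters["lumped"] = list(range(1, 500))

sizes = {cid: len(pmids) for cid, pmids in clusters.items()}
excluded = lumped_cluster_ids(sizes)

print(prior_history("author0", clusters, excluded, current_pmid=1500))
print(prior_history("lumped", clusters, excluded, current_pmid=1500))
```

Returning None for excluded clusters keeps "no usable history" distinct from "no prior publications", which would be an empty list.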
Data collection scripts were coded in Python version 2.5.2 (using many libraries,
including EUtils, BeautifulSoup, pyparsing, and nltk [186]) and SQLite version 3.4.
5.2.3 Statistical methods
Statistical analysis was performed in R version 2.10.1 [108]. P-values were two-tailed.
Data was visually explored using Mondrian version 1.1 [187] and the Hmisc package
[188]. We applied a square-root transformation to variables representing count data to
improve their normality prior to calculating correlations.
To calculate variable correlations, we used the hetcor function in the polycor
library. This computes polyserial correlations between pairs of numeric and ordinal
variables and polychoric correlations between two ordinal variables. We modified it to
calculate Pearson correlations between numeric variables using the rcorr function in the
Hmisc library. We used a pairwise-complete approach to missing data and used the
nearcor function in the sfsmisc library to make the correlation matrix positive definite. A
correlation heatmap was produced using the gplots library.
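The analysis itself was done in R. As a rough illustration of two of the steps, the square-root transformation of count data and the pairwise-complete treatment of missing values, here is a minimal Python sketch for the Pearson case only. It does not reproduce polyserial or polychoric correlations or the positive-definite adjustment; the function names and sample values are hypothetical.

```python
import math


def sqrt_transform(counts):
    """Square-root transform count data; None marks a missing value."""
    return [None if c is None else math.sqrt(c) for c in counts]


def pairwise_pearson(xs, ys):
    """Pearson correlation using only rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant variable
    return sxy / math.sqrt(sxx * syy)


# Hypothetical count variables with missing entries.
num_authors = sqrt_transform([1, 4, 9, None, 16])
num_grants = sqrt_transform([1, 4, None, 25, 16])
r = pairwise_pearson(num_authors, num_grants)
print(r)
```

Because each pair of variables uses a different subset of rows, a matrix assembled this way is not guaranteed to be positive definite, which is why an adjustment step such as the one described above is needed.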
We used the nFactors library to calculate and display the scree plot for our
correlations.
Since our correlation matrix was not well-behaved enough for maximum-
likelihood factor analysis, first-order exploratory factor analysis was performed with the
fa function in the psych library, using the minimum residual (minres) solution and a
promax oblique rotation. Second-order factor analysis also used the minres solution but
a varimax rotation, since we wanted these factors to be orthogonal. We computed the
loadings on the original variables for the second-order factors using the method
described by Gorsuch [189].
To compute the factor scores for the original dataset, we first had to impute the
missing values. We did this using Gibbs sampling with two iterations through the mice
library.
Using this complete dataset, we computed scores for each of our datapoints onto
all of the first and second-order factors using Bartlett’s algorithm as extracted from the
factanal function. We submitted these factor scores to a logistic regression using the
lrm function in the rms package. Continuous variables were modeled as cubic splines
with 4 knots using the rcs function from the rms package, and all two-way interactions
were explored.
Finally, we performed hierarchical supervised clustering on the datapoints to
learn which factors were most predictive and then estimated the data sharing
prevalence in a contingency table of these two clusters split at their medians.
5.3 RESULTS
Our queries for identifying microarray data-producing articles returned PubMed
identifiers for 11,603 studies.
MEDLINE fields were still “in process” for 512 records, resulting in missing data
for our MeSH-derived variables (Human, Mice, effectiveness, etc.). Impact factors were
found for all but 1,001 articles. Journal policy variables were missing for 4,107 articles.
The institution ranking attributes were missing for 6,185. We cross-referenced NIH
grant details for 3,064 studies (some grant numbers could not be parsed, because they
were incomplete or strangely formatted). We were able to determine the gender of the
first and last authors, based on the forenames in the MEDLINE record, for all but 2,841
first authors and 2,790 last authors. All but 1,765 first authors and 797 last authors
were found to have a publication history in the 2009 Author-ity clusters. A summary of
the variables can be found in the Appendix and their correlations in Figure 7.

Many of our other attributes were also associated with the prevalence of data
sharing in univariate analysis. Illustrations of these relationships are given in the
Appendix.
5.3.1 First-order factors
We tried to use a scree plot to determine the number of factors for our first-order
analysis. Since the scree plot did not have a clear drop-off, we experimented with a
range of factor counts near the optimal coordinates index (as calculated by nScree in
the nFactors R-project library) and settled on 15 factors. Our correlation matrix was
not sufficiently well-behaved for maximum-likelihood factor analysis, so we used a
minimum residual (minres) solution. We chose to rotate our factors with the promax
oblique algorithm, because we expected our first-order factors to have significant
correlations with one another. The rotated first-order factors are given in Table 9 with
loadings larger than 0.4 or less than -0.4. We named the factors based on the variables
on which they load most heavily, using abbreviations for publishing in an Open Access journal
(OA) and previously depositing data in the Gene Expression Omnibus (GEO) or
ArrayExpress (AE) databases.








Table 9: First-order factor loadings
Large NIH grant
0.97 num.post2005.morethan1000k.tr
0.96 num.post2005.morethan750k.tr
0.92 num.post2004.morethan750k.tr
0.91 num.post2004.morethan1000k.tr
0.91 num.post2005.morethan500k.tr
0.89 num.post2006.morethan1000k.tr
0.89 num.post2006.morethan750k.tr
0.86 num.post2004.morethan500k.tr
0.85 num.post2006.morethan500k.tr
0.84 num.post2003.morethan750k.tr
0.84 num.post2003.morethan1000k.tr
0.80 num.post2003.morethan500k.tr
0.74 has.U.funding
0.71 has.P.funding
0.58 nih.sum.avg.dollars.tr
0.56 nih.sum.sum.dollars.tr
0.44 nih.max.max.dollars.tr

Has journal policy
1.00 journal.policy.contains..geo.omnibus
0.95 journal.policy.at.least.requests.sharing.array
0.95 journal.policy.mentions.any.sharing
0.93 journal.policy.contains.word.microarray
0.91 journal.policy.requests.sharing.other.data
0.85 journal.policy.says.must.deposit
0.83 journal.policy.contains.word.arrayexpress
0.72 journal.policy.requires.microarray.accession
0.71 journal.policy.requests.accession
0.58 journal.policy.contains.word.miame.mged
0.48 journal.microarray.creating.count.tr
0.45 journal.policy.mentions.consequences
0.42 journal.policy.general.statement

NOT institution NCI or intramural
0.59 pubmed.is.funded.non.us.govt
0.55 institution.is.higher.ed
-0.89 institution.nci
-0.86 pubmed.is.funded.nih.intramural
-0.42 country.usa

Count of R01 & other NIH grants
1.15 has.R01.funding
1.14 has.R.funding
0.89 num.grants.via.nih.tr
0.86 nih.cumulative.years.tr
0.82 num.grant.numbers.tr
0.80 max.grant.duration.tr
0.66 pubmed.is.funded.nih
0.50 nih.max.max.dollars.tr
0.45 num.nih.is.nigms.tr
0.44 country.usa
0.42 has.T.funding
0.41 num.nih.is.niaid.tr

Journal impact
0.88 journal.5yr.impact.factor.log
0.88 journal.impact.factor.log
0.85 journal.immediacy.index.log
0.70 journal.policy.mentions.exceptions
0.54 journal.num.articles.2008.tr
0.51 journal.policy.contains.word.miame.mged
-0.61 journal.policy.contains.word.arrayexpress
-0.48 pubmed.is.open.access

Last author num prev pubs & first year pub
0.84 last.author.num.prev.pubs.tr
0.74 last.author.year.first.pub.ago.tr
0.73 last.author.num.prev.pmc.cites.tr
0.68 last.author.num.prev.other.sharing.tr
0.48 country.japan
0.44 last.author.num.prev.microarray.creations.tr

Journal policy consequences & long half-life
0.78 journal.policy.mentions.consequences
0.73 journal.cited.halflife
0.60 pubmed.is.bacteria
0.42 journal.policy.requires.microarray.accession
-0.54 pubmed.is.open.access
-0.45 journal.policy.general.statement

Institution high citations & collaboration
0.76 institution.mean.norm.citation.score
0.72 institution.international.collaboration
0.64 institution.mean.norm.impact.factor
0.41 country.germany
-0.67 country.china
-0.61 country.korea
-0.56 last.author.gender.not.found
-0.43 country.japan

NO geo reuse & YES high institution output
0.66 institution.research.output.tr
0.58 institution.harvard
0.46 has.K.funding
0.42 institution.stanford
-0.79 pubmed.is.geo.reuse
-0.62 country.australia
-0.46 institution.rank

NOT animals or mice
0.51 pubmed.is.humans
0.43 pubmed.is.diagnosis
0.40 pubmed.is.effectiveness
-0.93 pubmed.is.animals
-0.86 pubmed.is.mice

Humans & cancer
0.84 pubmed.is.humans
0.75 pubmed.is.cancer
0.67 pubmed.is.cultured.cells
0.52 institution.is.medical
0.47 pubmed.is.core.clinical.journal
-0.68 pubmed.is.plants
-0.49 pubmed.is.fungi

Institution is government & NOT higher ed
0.92 institution.is.govnt
0.70 country.germany
0.65 country.france
0.46 institution.international.collaboration
-0.78 institution.is.higher.ed
-0.56 country.canada
-0.51 institution.stanford
-0.42 institution.is.medical

NO K funding or P funding
0.56 has.R01.funding
0.49 has.R.funding
0.41 num.post2006.morethan500k.tr
0.41 num.post2006.morethan750k.tr
0.40 num.post2006.morethan1000k.tr
-0.65 has.K.funding
-0.63 has.P.funding

Authors prev GEOAE sharing & OA & microarray creation
0.83 last.author.num.prev.geoae.sharing.tr
0.74 last.author.num.prev.microarray.creations.tr
0.73 last.author.num.prev.oa.tr
0.60 first.author.num.prev.geoae.sharing.tr
0.47 first.author.num.prev.oa.tr
0.46 first.author.num.prev.microarray.creations.tr
0.40 institution.stanford
-0.44 years.ago.tr

First author num prev pubs & first year pub
0.83 first.author.num.prev.pubs.tr
0.77 first.author.year.first.pub.ago.tr
0.73 first.author.num.prev.pmc.cites.tr
0.52 first.author.num.prev.other.sharing.tr


After imputing missing values, we calculated scores for each of the 15 factors for
each of our 11,603 datapoints. In univariate analysis, several of the factors
demonstrated a correlation with frequency of data sharing, as seen in Figure 10.
Several factors seemed to have a linear relationship with data sharing across their
whole range. For example, the data sharing rate was relatively low for studies
that had the lowest scores on the factor “Authors prev GEOAE sharing & OA &
microarray creation” (in Figure 10, the first line under the heading “Authors prev GEOAE
sharing…”), higher for studies that had scores between the 25th and 50th percentiles
of all the studies in our sample, higher still for studies with scores in the third
quartile, and relatively high for studies that scored above the 75th percentile.
A trend in the opposite direction can be seen for the factor “Humans & cancer”: the
higher a study scored on that factor, the less likely its authors were to have shared their data.
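The per-quartile sharing rates summarized in Figure 10 can be computed along these lines. This is a minimal Python sketch with hypothetical names and made-up data; it splits studies into four rank-based groups of nearly equal size, which is one of several reasonable ways to define quartiles and may differ from the approach used for the figure.

```python
def sharing_rate_by_quartile(scores, shared):
    """Return the fraction of studies with shared data in each quartile
    of factor score, lowest quartile first.

    `scores` holds one factor score per study and `shared` the matching
    True/False data-sharing flags. Requires at least four studies.
    """
    ranked = sorted(zip(scores, shared), key=lambda pair: pair[0])
    n = len(ranked)
    rates = []
    for q in range(4):
        group = ranked[q * n // 4:(q + 1) * n // 4]
        rates.append(sum(1 for _, s in group if s) / len(group))
    return rates


# Made-up factor scores and sharing flags for eight studies.
factor_scores = [0.9, -1.2, 0.2, -0.3, 1.4, -0.8, 0.5, -0.1]
shared_flags = [True, False, True, False, True, False, False, True]
print(sharing_rate_by_quartile(factor_scores, shared_flags))
```

A steadily rising or falling sequence of rates across the four groups corresponds to the roughly linear trends described above.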


Figure 10: Association between shared data and first-order factors
Percentage of studies with shared data is shown for each quartile for each factor.
Univariate analysis.



Most of these factors were significantly associated with data-sharing behavior in
a multivariate logistic regression: p=0.18 for "Large NIH grant", p<0.05 for "No GEO
reuse & YES high institution output" and "No K funding or P funding", and p<0.005 for
the other first-order factors. The increase in odds of data sharing is illustrated in Figure
11, as each factor in the model is moved from its 25th percentile value to its 75th
percentile value.
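For intuition about how such an interquartile odds ratio is formed, consider the simplest case of a factor that enters the logistic model as a single linear term with coefficient beta: the odds ratio is then exp(beta × (75th percentile − 25th percentile)). The Python sketch below illustrates only that simplified case. The model described in this chapter used cubic splines, for which the odds ratio is instead obtained from the difference in fitted log-odds at the two percentile values. The function name, coefficient, and scores are hypothetical.

```python
import math
import statistics


def interquartile_odds_ratio(beta, scores):
    """Odds ratio for moving a factor score from its 25th to its 75th
    percentile, assuming a linear logistic term with coefficient `beta`."""
    q25, _, q75 = statistics.quantiles(scores, n=4, method="inclusive")
    return math.exp(beta * (q75 - q25))


# Hypothetical factor scores and coefficient.
example_scores = [0.0, 1.0, 2.0, 3.0, 4.0]
example_or = interquartile_odds_ratio(0.5, example_scores)
print(example_or)
```

An odds ratio above 1 means that a higher score on the factor is associated with greater odds of data sharing, and a value below 1 means the opposite.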


Figure 11: Odds ratios of data sharing for first-order factors, multivariate model
Odds ratios are calculated as factor scores are each varied from
their 25th percentile value to their 75th percentile value.
Horizontal lines show the 95% confidence intervals of the odds ratios.

5.3.2 Second-order factors
The strong correlations between these factors suggest that second-order factors may be
illuminating. Scree plot analysis of the correlations between the first-order factors
suggested that we explore a solution containing five second-order factors. We


data sharing. This sort of data is surely some of the most valuable for reuse, to the
extent that it can help confirm, refute, advance, and train scientists in bench-to-bedside
translational research. Further research will be required to understand the interplay of
an investigator’s motivation, opportunity, and ability that result in a low rate of data
sharing in these studies [50, 190]. We can make some guesses: As is appropriate,
concerns about privacy of human subjects’ data undoubtedly affect a researcher’s
willingness and ability (perceived or actual) to share raw study data. We do not
presume to recommend a proper balance between privacy and the societal benefit of
data sharing, but we do feel strongly that researchers should seriously consider the re-
identification risk of their data on a study-by-study basis [191], evaluate the risks and
benefits across the wide range of stakeholder interests [45], and consider an ethical
framework to make these difficult decisions [192]. Data-sharing rates could also be low
for reasons other than privacy. Cancer researchers may perceive their field as
particularly competitive, or cancer studies may have relatively strong links to industry–
two attributes previously associated with data withholding [193, 194].
NIH funding levels are associated with increased prevalence of data sharing,
though the overall probability of sharing remains low. Data sharing is infrequent even in
studies funded by grants clearly covered by the NIH Data Sharing Policy, such as those
that receive more than one million dollars per year and were awarded or renewed since 2006.
This result is consistent with reports that the NIH Data Sharing Policy is often not taken
seriously because compliance is not enforced [50].
We are intrigued that publishing in an open access journal, previously sharing
gene expression data, and previously reusing gene expression data were associated
with data sharing outcomes. The results are consistent with the results of our pilot
study, in which we found a strong association between “author experience” and data
sharing rates [195]. More research is required to understand the drivers behind the
association. Does the factor represent an attitude towards “openness” by the decision-
making authors? Does the act of sharing data lower the perceived effort of sharing data
again? Does it dispel fears induced by possible negative outcomes from sharing data?
To what extent does recognizing the value of shared data through data reuse motivate
an author to share his or her own datasets?
People often wonder whether the attitude towards data sharing varies with age.
Although we were not able to capture author age, we did estimate the number of years
since first and last authors had published their first paper. Our analysis suggests that
first authors with many years in the field are less likely to share data than those with
fewer years of experience, but we found no such association for last authors. More work is
needed to confirm this finding given the confounding factor of previous data-sharing
experience.
Gene expression publications associated with Stanford University have a very
high level of data sharing. The true level is actually much higher than that reflected in
our study: Stanford University hosts a public microarray repository, and many articles
that did not have a dataset link from GEO or ArrayExpress do mention submission to
the Stanford Microarray Database. If one were looking for a community on which to
model best practices for data sharing adoption, Stanford would be a great place to start.
Analyzing data sharing through bibliometric and data-mining attributes has
several advantages: We can look at a very large set of studies and attributes, our
results are not biased by survey response self-selection or reporting bias, and the
analysis can be repeated over time with little additional effort.
However, this approach does suffer its own limitations. Our filters for identifying
microarray creation studies do not have perfect precision, so we may have included
some non-data-creation studies in our analysis. Because studies that do not create
data will not have data deposits, their inclusion alters the composition of what we
consider to be studies that create but do not share data. Furthermore, our method for
detecting data deposits overlooks data deposits that are missing PubMed identifiers in
GEO and ArrayExpress, so our dataset misclassifies some studies that did in fact share
their data as non-data-sharing.
We made decisions to facilitate analysis, such as assuming that PubMed
identifiers were monotonically increasing with publication date and using the current
journal data-sharing policy as a surrogate for the data-sharing policy in place when
papers were published. These decisions may have introduced errors.
Missing data may have obscured important information. For example, articles
published in journals with policies that we did not examine had a lower rate of data
o Peter Suber, in Open Access News: “Many studies have shown a correlation
between OA articles and citation impact. I believe this is the first study to
document a similar correlation between OA data and citation impact.”
o viewed over 13,000 times at PLoS ONE
o 45 citations from items in Google Scholar, including citations from research
articles, books, and editorials
• an award-winning proposal (Thomson-Reuters Dissertation Proposal Scholarship for
2009), openly available online
o used as a case-study in a PhD-level course at the School of Information
Studies, McGill University
• a generalizable approach for developing practical full-text queries for use in
established academic literature portals, to be submitted for publication
o in use by a colleague at the National Core for Neuroethics at the University of
British Columbia
• an evaluation of the precision, recall, and bias of using PubMed identifiers to find
publicly available gene expression microarray datasets, accepted for publication
• an estimate of the prevalence and patterns of gene expression microarray dataset
sharing and preliminary models of data sharing behavior, to be submitted for
publication
• a publicly available dataset associating microarray study publications with data
sharing status
• open source Python data collection code and R-project statistical analyses
6.2.2 Findings
6.2.2.1 Data sharing is associated with an increased citation rate

Based on 85 cancer clinical trials, we found that publications that made their datasets
publicly available received 69% more citations than similar publications that did not
share their data. Several editorials have cited this evidence when debuting stricter data
To present a complete picture, this finding should be integrated with other individual
benefits, individual costs, societal benefits, and societal costs.
6.2.2.2 Data creation studies can be identified through full-text queries
We described and evaluated a method to identify articles that create gene expression
datasets using open access literature full text as training data and full-text portals as an
execution environment.
How useful will this method be, outside of this study? Identifying data creation
studies could be useful for investigators looking for data to reuse, for those monitoring
the adoption of various research methods, and for extracting evidence types for
biocurators.
The most important implication of this work, however, is in the general process
we used. Most research in automated retrieval presupposes that the target literature
can be downloaded and preprocessed prior to query. Unfortunately, this is not a
practical or maintainable option for most users due to licensing restrictions, website
terms of use, and sheer volume. Scientific article full text is increasingly queryable
through online portals such as PubMed Central, Highwire Press, Scirus, and Google
Scholar. Recognizing that these full-text portals can be used for broad systematic
retrieval of the biomedical literature based on words and phrases in article full text,
particularly when queries are developed, refined, and evaluated by applying machine
learning techniques to open access articles, potentially opens up large areas of
research and application.
Further research could increase the impact of this approach. A review is needed
to describe the scope and breadth of full-text proxy engines. The methods presented
here could easily be offered to the general public as an openly-available web service.
Derived queries could be improved through application of more advanced text mining
techniques. Finally, the methods will have to be refined for domains without well-
organized portals like PubMed Central and Highwire Press.

Figure 14: Association between shared data and original independent variables
The frequency of data sharing is shown for each quartile for continuous variables.
Horizontal lines illustrate 95% confidence intervals of the data sharing proportions.