Linking database submissions to primary citations with PubMed Central
Abstract
Background: Dataset submissions are growing exponentially. Links between dataset submissions and primary literature that describes the data collection are useful for many reasons: rich documentation, proper attribution, improved information retrieval, and enhanced text/data integration for analysis. Unfortunately, many database submissions do not include primary citation links, as database submissions are often made prior to publication. We suggest that automated tools can be developed to help identify links between dataset submissions and the primary literature. These tools require full text to differentiate cases of data sharing from data reuse and other contexts. In this study, we explore the possibility that deep analysis of full text may not be necessary, thereby enabling the querying of all reports in PubMed Central.

Methods: We trained machine learning tree and rule-based classifiers on full-text open-access article unigram vectors, with the existence of a primary citation link from NCBI's Gene Expression Omnibus (GEO) database submission records as the binary output class. We manually combined and simplified the classifier trees and rules to create a query compatible with the interface for PubMed Central.

Results: The query identified 40% of non-OA articles with dataset submission links from GEO (recall), and 65% of the returned articles without dataset submission links were manually judged to include statements of dataset deposit despite having no link from the database. We hope this work inspires future enhancements and highlights the opportunities for simple full-text queries in PubMed Central given the mandated influx of NIH-funded research reports.
deeper and wider scope of study, but assembling a corpus has been hindered by the complex, disparate, and decentralized access processes and licenses of publisher websites. While PMC does not permit automated downloading of non-Open Access full text (as per publisher licenses), full text can be queried from the PMC interface. Integrating the ability to query the full text of all future NIH-funded research reports in combination with MeSH terms and other NCBI Entrez database links offers exciting possibilities.

In this report, we explore the potential of one such application: linking articles that describe data collection to their database submission entries. Databases that store research datasets often include citation links to the articles that describe the initial generation and use of the datasets. As we discuss below, these links are valuable, often missing, and time-consuming to derive manually. We previously developed several NLP systems to identify declarations of database submission within research articles.2 However, these systems required access to complete full text for feature extraction. To take advantage of the PMC resource, here we develop a system restricted to rules that can be expressed within the PMC query interface.

We apply our system to gene expression microarray studies deposited in NCBI's Gene Expression Omnibus (GEO) database.3 Gene expression data are expensive to collect, often but not always shared, and valuable for reuse. The GEO database is the largest repository for gene expression datasets, is well integrated with PMC query results, and contains links from submitted datasets to primary citation reports.
Methods
Our goal was to develop a PMC query for retrieving articles that mention depositing a dataset into GEO. We developed the query using a selection of Open Access (OA) articles, and evaluated it on non-OA articles. We used a gold standard based on our previous work.2 Positive cases came from two sources: all OA articles that were linked from the GEO DataSet primary submission field, plus articles without a primary citation link from the GEO
Results
The training set was composed of open-access articles, including 550 positive examples (articles that had links from the GEO primary citation fields or were manually determined to have shared data in GEO) and 165 negatives (articles without links from GEO). We combined the rules and tree branches that occurred most frequently across the trained machine learning classifiers to compose the following PubMed Central query:
(geo OR omnibus) AND microarray AND "gene expression" AND accession NOT (databases OR user OR users OR (public AND accessed) OR (downloaded AND published))

This query retrieved 772 articles, of which 455 were not open access. The results included 385 of the 966 PubMed Central non-OA articles with links to the GEO DataSets ("pmc gds"[filter] NOT "open access"[filter]), for a recall of 40%.

Next, we limited the query to non-OA articles without a PMC link to the GEO DataSets. We manually determined that 44 of the 68 results included a statement of dataset submission to GEO within their full-text report. This indicates an overall query precision of 94% ((385+44)/455) for retrieving articles that have deposited datasets into GEO and an applicable precision of 65% (44/68) for retrieving articles that don't have PMC links but should. Our error analysis of the 24 false positives found that 13 of the articles referenced GEO datasets in the context of dataset reuse rather than submission (including 2 reusing their own work), 4 referenced GEO in the context of platform descriptions rather than datasets, and 5 didn't reference the GEO database at all but rather mentioned the word "geo" for another purpose, usually the beta-geo gene.

The PubMed database contains 4291 articles with links from GEO DataSets (in PubMed: "pubmed gds"[filter]). Thus, the addition of an estimated 115 (177*65%) novel true positive links would increase the current number of dataset-submission-to-primary-citation links by about 2.6%.

We also estimated how the query impact might increase once new NIH-funded articles are deposited in PMC. PMC contains 202 articles published in 2007, funded by the NIH, and linked from GEO DataSets. In comparison, the PubMed database contains 596 such articles, almost three times as many. Our query returned 39 hits for NIH-funded articles published in 2007 that were not linked from GEO DataSets.
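The recall and precision figures reported above can be re-derived from the raw counts. The following is a minimal sketch of that arithmetic; the variable names are illustrative and not from the original study, and the counts are copied directly from the text.

```python
# Counts as reported in the Results text; variable names are illustrative.
linked_non_oa_total = 966      # non-OA PMC articles with links to GEO DataSets
linked_non_oa_retrieved = 385  # of those, retrieved by the query
non_oa_retrieved = 455         # all non-OA articles retrieved by the query
unlinked_reviewed = 68         # retrieved non-OA articles without a GEO link
unlinked_confirmed = 44        # of those, manually confirmed as GEO deposits

# Recall: share of linked non-OA articles that the query found.
recall = linked_non_oa_retrieved / linked_non_oa_total

# Overall precision: linked hits plus manually confirmed hits,
# divided by all non-OA hits.
overall_precision = (
    linked_non_oa_retrieved + unlinked_confirmed
) / non_oa_retrieved

# Applicable precision: confirmed deposits among hits lacking a link.
applicable_precision = unlinked_confirmed / unlinked_reviewed

# Estimated novel links; 177 and the rounded 65% are taken from the text.
estimated_new_links = round(177 * 0.65)

print(f"recall: {recall:.0%}")
print(f"overall precision: {overall_precision:.0%}")
print(f"applicable precision: {applicable_precision:.0%}")
print(f"estimated new links: {estimated_new_links}")
```

Writing the overall precision with explicit parentheses around the numerator makes clear that both the linked and the manually confirmed articles count as true positives.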
If all NIH-funded articles were in PMC, and if similar patterns exist for microarray papers that share their data but do not currently have links from
Discussion
Database submissions often include a link to the research article that describes the original data collection conditions and interpretations. Our results suggest a simple query on full text can automatically identify database submission primary citation links with a precision of 94% and recall of 40%. A trivial full-text query identified articles with 90% precision and 34% recall. Precision for the subset of articles without existing links from the GEO database was 65%. The methods we describe can be used to develop queries for identifying primary citations across a wide variety of datatypes and databases.

The approach outlined in this study is much more practical than a complex regular-expression classifier running on article full text. Processing full-text articles requires not only access licenses and reuse permissions (or a limitation to open access content) but also the maintenance of a text repository and classification system. Querying full text through PubMed Central, in contrast, is publicly available, requires no infrastructure beyond an internet connection, and covers all OA and non-OA articles within PMC.

We imagine this query could be used in two ways. It could be used by dataset-seekers, by appending it onto PubMed or PMC queries to find articles with shared datasets. Alternatively, it could be used by biocurators as a tool for identifying primary citations that may be missing from their database submission fields. This latter use has broad implications, which we discuss further below.

Links between shared datasets and primary citations have many purposes. First, the citation serves as rich documentation for the dataset, whether as free text or as meta-data mark-up as illustrated by the BioLit PDB Clone (http://biolit.ucsd.edu/pdb/). Second, the citation provides a crucial mechanism for attributing recognition to the originators of the dataset upon reuse.6 Third, the citation provides a link for enhanced information retrieval or text/data integration pathways.7,8 Unfortunately, links to primary citations are often missing from database submission entries because datasets are usually submitted before publication details are known.9

Evidence suggests that a significant number of links from database submissions to the primary literature may be missing. For example, the PDB data uniformity project of 2000 found that 33% of submission entries lacked a citation. Half of these were recovered automatically using the list of submitter names, 40% through manual searches of PubMed and the Thomson ISI databases, and 10% (3% of total) were presumed to represent work that was never published.10 More recently, another large-scale PDB remediation project looked at improving the quality of many fields, including primary citations. As of May 2005, 8508 (27%) of the 31663 database submissions required remediation due to inconsistent or missing PubMed IDs and citation information. A report near the end of the remediation process11 estimated that manual searches found PubMed IDs for 1226 entries, citation information without PubMed IDs for 387 entries, and about 700 (2% of 31663) were presumed unpublished. These examples suggest that a sizeable number of entries may be missing citation fields, and that most of them are recoverable. Unfortunately, these efforts are time-intensive and thus difficult to incorporate into the workflow of busy biocurators.12 NLP is already being used to aid database curation in a variety of tasks,13 and we believe it can also help biocurators identify missing links to primary citations.
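For the biocurator use case, the query could in principle be issued programmatically rather than through the PMC web page. The sketch below only builds a request URL for the NCBI E-utilities `esearch` endpoint and makes no network call. It rests on assumptions: the study itself used the PMC query interface, the helper names are hypothetical, and the "pmc gds"[filter] name is taken from the Results text and may since have changed at NCBI.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; the study used the PMC web UI).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The full-text query exactly as reported in the Results section.
GEO_DEPOSIT_QUERY = (
    '(geo OR omnibus) AND microarray AND "gene expression" AND accession '
    "NOT (databases OR user OR users OR (public AND accessed) "
    "OR (downloaded AND published))"
)


def build_unlinked_term(base_query: str) -> str:
    """Restrict a query to PMC articles lacking a GEO DataSets link.

    Hypothetical helper; the filter name comes from the paper's text.
    """
    return f'({base_query}) NOT "pmc gds"[filter]'


def build_esearch_url(term: str, retmax: int = 100) -> str:
    """Return an esearch URL against the PMC database, with JSON output."""
    params = {"db": "pmc", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URL a curator could open to list candidate missing links.
    print(build_esearch_url(build_unlinked_term(GEO_DEPOSIT_QUERY)))
```

The returned article identifiers would still need manual confirmation before any database record is updated, given the 65% applicable precision reported above.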
Procedurally, our query results could be manually confirmed and then used to update database records. GEO asks for omitted citations (http://www.ncbi.nlm.nih.gov/geo/info/ucitations.html); we have sent them our findings and they have updated their database to include the missing links identified in this study. Other databases, however, consider the submission record the property of the submitter14 and are
Funding
National Library of Medicine (5T15-LM007059-19 to HAP, 1R01-LM009427-01 to WWC)
References
1. NOT-OD-08-033 Revised Policy on Enhancing Public Access to Archived Publications Resulting from NIH-Funded Research.
2. Piwowar, H.A. & Chapman, W.W. Identifying Data Sharing in Biomedical Literature. Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.1721.1> (2008).
3. Barrett, T., et al. NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res 35 (2007).
4. Rose, C.P., et al. Analyzing Collaborative Learning Processes Automatically: Exploiting the Advances of Computational Linguistics in Computer-Supported Collaborative Learning. International Journal of Computer Supported Collaborative Learning (In Press).
5. Witten, I.H. & Frank, E. Data Mining: Practical machine learning tools and techniques, 2nd Edition. Morgan Kaufmann, San Francisco (2005).
6. Compete, collaborate, compel. Nat Genet 39 (2007).
7. Butte, A.J. & Chen, R. Finding disease-related genomic experiments within an international repository: first steps in translational bioinformatics. AMIA Annu Symp Proc, 106-110 (2006).
8. Muller, H.M., Kenny, E.E. & Sternberg, P.W. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2 (2004).
9. Piwowar, H.A. & Chapman, W.W. A review of journal policies for sharing research data. Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.1700.1> (2008).
10. Bhat, T.N., et al. The PDB data uniformity project. Nucleic Acids Res 29, 214-218 (2001).
11. PDBj News Letter. Volume 7, March 2006 <http://www.pdbj.org/NewsLetter/newsletter_vol7_e.pdf> (2006).
12. Burkhardt, K., Schneider, B. & Ory, J. A biocurator perspective: annotation at the Research Collaboratory for Structural Bioinformatics Protein Data Bank. PLoS Computational Biology 2 (2006).
13. Karamanis, N., et al. Natural Language Processing in aid of FlyBase curators. BMC Bioinformatics 9 (2008).
14. Pennisi, E. DNA DATA: Proposal to 'Wikify' GenBank Meets Stiff Resistance. Science 319, 1598-1599 (2008).
15. Zhang, L., Ajiferuke, I. & Sampson, M. Optimizing search strategies to identify randomized controlled trials in MEDLINE. BMC Medical Research Methodology 6, 23 (2006).