Linking database submissions to primary citations with PubMed Central
Background: Dataset submissions are growing exponentially. Links between dataset submissions and primary literature that describes the data collection are useful for many reasons: rich documentation, proper attribution, improved information retrieval, and enhanced text/data integration for analysis. Unfortunately, many database submissions do not include primary citation links, as database submissions are often made prior to publication. We suggest that automated tools can be developed to help identify links between dataset submissions and the primary literature. These tools require full text to differentiate cases of data sharing from data reuse and other contexts. In this study, we explore the possibility that deep analysis of full text may not be necessary, thereby enabling the querying of all reports in PubMed Central.

Methods: We trained machine learning tree and rule-based classifiers on full-text open-access article unigram vectors, with the existence of a primary citation link from NCBI's Gene Expression Omnibus (GEO) database submission records as the binary output class. We manually combined and simplified the classifier trees and rules to create a query compatible with the interface for PubMed Central.

Results: The query identified 40% of non-OA articles with dataset submission links from GEO (recall), and 65% of the returned articles without dataset submission links were manually judged to include statements of dataset deposit despite having no link from the database.

Conclusions: We hope this work inspires future enhancements, and highlights the opportunities for simple full-text queries in PubMed Central given the mandated influx of NIH-funded research reports.
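
To make the evaluation in the Results concrete, the following is a minimal sketch of how recall and precision might be computed for a simple full-text keyword query against articles labeled by whether GEO links to them. Everything in it is an assumption for illustration: the toy corpus, the query terms, and the function names are hypothetical, and the query is not the one derived in this study. The sketch also does not model the manual review of returned articles that lack a GEO link.

```python
import re

def unigrams(text):
    """Return the set of lowercase word tokens (unigram presence features)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(query_terms, text):
    """Toy OR-query: true if any query term occurs in the full text."""
    return bool(unigrams(text) & set(query_terms))

def recall_and_precision(docs, query_terms):
    """Score a query against (full_text, has_geo_link) pairs.

    recall    = linked articles returned / all linked articles
    precision = linked articles returned / all articles returned
    """
    returned = [linked for text, linked in docs if matches(query_terms, text)]
    total_linked = sum(1 for _, linked in docs if linked)
    true_pos = sum(1 for linked in returned if linked)
    recall = true_pos / total_linked if total_linked else 0.0
    precision = true_pos / len(returned) if returned else 0.0
    return recall, precision

# Hypothetical toy corpus: (full text, whether a GEO record links to the article).
toy_docs = [
    ("Microarray data were deposited in GEO under an accession number.", True),
    ("Expression profiles are available from the authors on request.", True),
    ("We reanalyzed a published dataset of tumor samples.", False),
    ("Raw data have been deposited in a public repository.", False),
]

# Hypothetical query terms, chosen only for illustration.
toy_query = {"geo", "accession", "deposited"}

recall, precision = recall_and_precision(toy_docs, toy_query)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this framing, a returned article without a GEO link counts against precision. The study instead reviewed such articles by hand, since many of them may describe a dataset deposit that the database record simply does not cite.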