When pharmaceutical companies publish large datasets: an abundance of riches or fool's gold?
- PubMed: 20732447
Abstract
The recent announcement that GlaxoSmithKline have released a huge tranche of whole-cell malaria screening data to the public domain, accompanied by a corresponding publication, raises some issues for consideration before this exemplar instance becomes a trend. We have examined the data from a high level, by studying the molecular properties, and consider the various alerts presently in use by major pharma companies. We not only acknowledge the potential value of such data but also raise the issue of the actual value of such datasets released into the public domain. We also suggest approaches that could enhance the value of such datasets to the community and theoretically offer an immediate benefit to the search for leads for other neglected diseases.
PERSPECTIVE Drug Discovery Today
Volume 15, Numbers 19/20
October 2010
Sean Ekins 1,2,3,4,* and Antony J. Williams 5
Introduction
We are currently witnessing considerable shifts
in the ways that pharmaceutical research can be
accelerated. These approaches include decen-
tralizing research and engagement with exter-
nal research communities through
crowdsourcing. There is a marked trend toward
collaboration of all kinds [1–5]. In parallel, there
is a renewed interest in neglected disease
research (on malaria, tuberculosis [TB], kineto-
plastids and so on [6]), owing to the notable
influence of the US National Institutes of Health
(NIH), foundations such as the Bill and Melinda
Gates Foundation, The European Commission,
and increasing investment from pharmaceutical
companies and others [6,7]. In the past, drug
companies generally published only small
chunks of data in the form of molecular struc-
tures and biological (pharmacology) data when
it was convenient, as needed to influence
investors, the public and/or FDA approval, or
after a compound or project was terminated.
With the combinatorial chemistry and high-
throughput screening that have seen explosive
growth over the past decade, each large phar-
maceutical company has created massive pro-
prietary databases and invested enormous
resources in the purchase or development of
complex informatic platforms. These software
systems probably contain more data than can
be realistically mined because of the quest for
blockbuster-targeted therapies. In addition, the
quality of the data and the applicability of the
assays can add a lot of noise into the system.
Although it is not unusual for academics (and
occasionally pharmaceutical companies) to
publish and collate relatively large datasets
(from several hundred to thousands of compounds),
primarily for quantitative structure–activity analysis
(http://www.cheminformatics.org/datasets/index.shtml;
http://www.qsarworld.com/qsar-datasets.php) or compile
datasets from NIH-funded screening programs
(http://www.pdsp.med.unc.edu/indexR.html)
[8–12], pharmaceutical companies have been
less willing to make larger screening datasets
available to the public, and even with such huge
efforts, there has been little productivity in
screening for antibiotics [13]. This has now
changed with GlaxoSmithKline (GSK)’s unpre-
cedented release of approximately 13,500 in
vitro screening hits against malaria using Plasmodium
falciparum along with their associated
cytotoxicity (in HepG2 cells) data from an initial
screen of more than two million compounds
(see Ref. [14], which follows an earlier press
release from GSK:
http://www.nature.com/news/2010/100120/full/news.2010.20.html).
1359-6446/06/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.drudis.2010.08.010
Hosted GSK malaria data
Three databases initially all hosted the data: the
European Bioinformatics Institute–European
Molecular Biology Laboratory (ChEMBL,
http://www.ebi.ac.uk/chembl/), PubChem
(http://www.pubchem.ncbi.nlm.nih.gov/) and
Collaborative Drug Discovery (CDD,
http://www.collaborativedrug.com). What happens
now that the data are hosted and announced to the
community is open to prediction: the malaria
community might ignore the data, which is highly
unlikely, or it might be excited by their
availability, use them in its research and,
potentially, find a new drug. The probability of
the latter is perhaps also low, and it will
realistically take many years unless we find
ways to accelerate the process. The question,
therefore, is whether such data depositions are
an abundance of riches or fool’s gold. They might
actually be neither.
This massive contribution of data to the com-
munity creates a precedent, and it is expected that
other companies will follow suit
(http://www.nature.com/news/2010/100120/full/news.2010.20.html).
It also raises many other questions, to which we
can only begin to offer answers. Who should get to host such public
data? If the data are truly ‘open’ then anyone can
download the data and reuse, repurpose and host
the data. It will be interesting to see what ‘licen-
sing’ is applied to the different types of data when
they are exposed by various companies. The GSK
malaria screening hits data were added to the
ChEMBL dataset and the chemical structures are
available for download. In the CDD database, data
associated with public datasets are generally
made available for similarity, substructure and
Boolean searching (http://www.collaborativedrug.com/register).
The SDF file of structures and
data is also available for download. In the USA,
PubChem has established itself as the de facto
repository for screening data, so deposition here
represents a drop in the ocean of more than 27
million unique molecule structures, although
admittedly just a fraction are associated with
bioactivity data. The deposition and hosting of the
data in the three repositories (ChEMBL, PubChem
and CDD) does leverage what each database has
to offer in terms of integration with existing
information on the compounds. We also believe that
deposition into other databases that link out to
others, including the original hosting
organizations, offers great benefits. A good model
for this approach is the ChemSpider database from
the Royal Society of Chemistry
(http://www.chemspider.com), which has already
demonstrated the feasibility of such an approach
and now also hosts the GSK data. We suggest there
should be connectivity between all the databases
hosting these data and others released in the
future.
Questions and opportunities upon opening up data
We can predict that there might be increased
competition to woo pharmaceutical companies
to exclusively deposit data in various databases.
There might, however, be more utility if the
various databases collaborated to convince
companies that releasing data could be a
powerful force for change, with each group
contributing their technologies and own
expertise to the task. Unlike microarray data, for
which deposition standards such as MIAME [15]
exist, there are no such standards for chemical
structures and biological data. Who will ensure
the quality of the data and act as a host for future
annotation and potential structure validation? It
is unlikely to be the databases clamoring to host
the data. Would any organization be interested
in funding future data uploads and data
cleansing? Will companies only deposit com-
pounds into the public domain that are of less
interest to them and have been demonstrated to
be inactive against other screens, while retaining
drug-like hits and negative data (which can be
valuable for SAR creation)? Although we applaud
the pharmaceutical companies for donating data
to the community, we question whether these
contributions might create more noise and less
signal in the public compound databases. Will
people really want to mine the data if they are
perceived as cast-off data or just represent
commercially available screening libraries with
little in the way of novel chemical entities? In
many cases, we judge that most researchers will
treat these data as of minor importance – in this
case, because it is malaria-related. This might be
a mistake, however, because GSK suggest that
many of these compounds are acting on malaria
kinases or proteases that could be important for
the treatment of other diseases and that could
lead to a drug for other blockbuster diseases. Will
there be any incentives to pursue such neglected
disease data? For the time being, pharmaceutical
companies are investing more in such neglected
diseases [6], although much less than in other
major diseases.
This brings to mind the United States Environmental
Protection Agency’s ToxCast project [16], which
created a database, at major public expense, and
then invited academics to build computational
models or mine the data. Ultimately, what is
derived from the data will be dependent on the
quality of the data populated into the database.
The only incentive is if something of interest is
found; then, there might be some funding to
investigate further. If pharmaceutical companies
are to put data into the public domain, it might
be of value for these or other organizations to
consider incentivizing researchers to mine it or
to offer challenges and awards.
GSK malaria screening data analysis
Driven purely by curiosity regarding the nature
of the malaria data recently deposited into the
public domain by GSK [14], we have undertaken
a preliminary evaluation using a simple
descriptor analysis, as was performed previously
for some very large tuberculosis datasets ori-
ginally funded by the NIH [17]. In addition, we
have used some readily available substructure
alerts or ‘filters’ to identify potentially reactive
molecules. Pharmaceutical companies use such
computational filters to clean up screening sets
and remove undesirable molecules from vendor
libraries [18]. Examples include filters from GSK
[19] and Abbott [20–22]. An academic group also
developed an extensive series of more than 400
substructural features for the removal of pan-
assay interference compounds from screening
libraries [23], which have yet to be integrated
into a public resource. Our simplistic analysis
compares the GSK dataset [14] to widely available
drug-like molecules from the MicroSource
US drugs dataset (http://www.msdiscovery.com/usdrugs.html).
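The kind of simple descriptor analysis described above can be sketched roughly as follows. This is a minimal illustration and not the authors' actual workflow: it assumes that descriptors (molecular weight, logP, hydrogen-bond donors and acceptors) have already been calculated with some cheminformatics tool, and the compound records, values and function names are all hypothetical. It counts Lipinski rule-of-five violations per molecule and summarizes one descriptor by its mean and standard deviation.

```python
from statistics import mean, stdev

# Hypothetical, hand-made records; real values would come from a
# cheminformatics toolkit applied to the downloaded structures.
COMPOUNDS = [
    {"id": "cpd-1", "mw": 320.4, "logp": 2.1, "hbd": 1, "hba": 4},
    {"id": "cpd-2", "mw": 540.7, "logp": 5.6, "hbd": 2, "hba": 7},
    {"id": "cpd-3", "mw": 410.5, "logp": 4.2, "hbd": 6, "hba": 11},
]


def lipinski_violations(compound):
    """Count rule-of-five violations for one compound record."""
    return sum([
        compound["mw"] > 500,    # molecular weight above 500 Da
        compound["logp"] > 5,    # calculated logP above 5
        compound["hbd"] > 5,     # more than 5 hydrogen-bond donors
        compound["hba"] > 10,    # more than 10 hydrogen-bond acceptors
    ])


def summarize(compounds, key):
    """Return (mean, sample standard deviation) of one descriptor."""
    values = [c[key] for c in compounds]
    return mean(values), stdev(values)


def percent_with_violations(compounds):
    """Percentage of compounds with at least one violation."""
    flagged = sum(1 for c in compounds if lipinski_violations(c) > 0)
    return 100.0 * flagged / len(compounds)
```

Running the same functions over two datasets (for example, screening hits and a reference set of marketed drugs) would give the side-by-side means, standard deviations and violation rates that such a comparison relies on. Substructure alert filters would need a structure-aware toolkit and are not covered by this sketch.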
As we would expect, the percentage of GSK
malaria screening hit molecules failing the pub-
lished ‘GSK filters’ [19] is close to zero, whereas the
percentage failing the Pfizer and Abbott filters
[20] is considerably higher (57% and 76%,
respectively) because these seem to be more
conservative (Table 1). This could be interpreted
as representing different business rule decisions
instituted according to their own criteria, which
we are not in a position to critically judge. The
percentage of failures for the set of US FDA drugs
is lower for both Pfizer and Abbott filters (Table 1),
suggesting that these compounds are by no
means perfect but perhaps setting a threshold. A
recent study filtered a set of >1000 marketed
drugs, and at least 26% failed filters for molecular
features undesirable for high-throughput
screening [24]. We have also recently used the
same rules to filter sets of compounds with
activity against TB [11,12], with 81–92% failing the
Abbott filters [25], which might be related to
mechanism of action. In the GSK paper, the
authors suggest they did not find any non-specific
inhibitors of lactate dehydrogenase, although
cytotoxicity was seen in 1982 compounds [14]. A
detailed analysis of our calculated molecular
descriptors for the GSK malaria hits [14] shows
that most are normally distributed, apart from the
skewed Lipinski violations data and the bimodal
molecular weight. Table 2 shows the means and
standard deviations for each descriptor. Interest-