Precompetitive preclinical ADME/Tox data: set it free on the web to facilitate computational model building and assist drug development.

by Sean Ekins, Antony J Williams
Lab on a Chip (2010)

Abstract

Web-based technologies coupled with a drive for improved communication between scientists have resulted in the proliferation of scientific opinion, data and knowledge at an ever-increasing rate. The increasing array of chemistry-related computer-based resources now available provides chemists with a direct path to the discovery of information, once previously accessed via library services and limited to commercial and costly resources. We propose that preclinical absorption, distribution, metabolism, excretion and toxicity data as well as pharmacokinetic properties from studies published in the literature (which use animal or human tissues in vitro or from in vivo studies) are precompetitive in nature and should be freely available on the web. This could be made possible by curating the literature and patents, data donations from pharmaceutical companies and by expanding the currently freely available ChemSpider database of over 21 million molecules with physicochemical properties. This will require linkage to PubMed, PubChem and Wikipedia as well as other frequently used public databases that are currently used, mining the full text publications to extract the pertinent experimental data. These data will need to be extracted using automated and manual methods, cleaned and then published to the ChemSpider or other database such that it will be freely available to the biomedical research and clinical communities. The value of the data being accessible will improve development of drug molecules with good ADME/Tox properties, facilitate computational model building for these properties and enable researchers to not repeat the failures of past drug discovery studies.

Sean Ekins*abc and Antony J. Williams*d
Received 27th August 2009, Accepted 1st October 2009
First published as an Advance Article on the web 10th November 2009
DOI: 10.1039/b917760b
Lab Chip, 2010, 10, 13–22. This journal is © The Royal Society of Chemistry 2010
a Collaborations in Chemistry, 601 Runnymede Ave, Jenkintown, PA 19046,
USA. E-mail: ekinssean@yahoo.com; Fax: +1 215 481 0159; Tel: +1 269
930 0974
b Department of Pharmaceutical Sciences, University of Maryland,
Baltimore, MD 21202, USA
c Department of Pharmacology, University of Medicine and Dentistry of
New Jersey, Robert Wood Johnson Medical School, Piscataway, NJ
08854, USA
d Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC
27587, USA. E-mail: antony.williams@chemspider.com; Fax: +1 919
300 5321; Tel: +1 919 201 1516
Introduction
Biomedical research is fast moving towards a collaborative
network of chemists and biologists and making knowledge
available to the masses, enabling rapid sharing of information.1–4
Yet, pharmaceutical scientists (biologists and chemists in
particular) commonly find themselves overwhelmed by the
availability of information on the web, in primary commercial
databases such as CAS Scifinder (http://www.cas.org/), journals
and, commonly, a plethora of internally developed systems inside
their companies. From another perspective, that of the academic
or those in the financially constrained developing countries,
biology and chemistry information has long been limited by the
tolls associated with accessing commercial databases. Even the
calculation of relatively simple molecular properties (such as
lipophilicity) has, up until very recently, required knowledge and
ownership of informatics software. Structure searching of such
chemically aware databases used to be restricted to specialists but
now with user friendly web-based tools, even the biologist or
biomedical researcher (who is likely to be chemistry naïve) can
find such tools of value for searching for interesting molecules
from any of the commercial vendors. Today we find a major
limitation in the availability of biological information related to
chemical structures. For example the understanding of absorption,
distribution, metabolism, excretion and toxicity (ADME/Tox)
data5–7 for drugs and molecules evaluated as drug candidates
is provided in individual commercial databases such as
Symyx’ Metabolite and Toxicity
(http://www.symyx.com/products/databases/bioactivity/index.jsp),
Prous Ensemble (http://www.prous.com/products/) and
Aureus Auroscope databases
(http://www.aureus-pharma.com/Pages/Products/Aurscope.php)
as representative examples.8
As an example, in the world of chemistry there are tens if not
hundreds of chemical structure databases, many containing
molecules of biological interest, yet until recently there was no
single way to search across them. There are databases of curated
literature data, chemical vendor catalogs, molecular properties,
environmental data, toxicity data, analytical data, etc. The only
way to know whether a specific piece of information is available
for a chemical structure is to have simultaneous access to all of
these databases as well as journals and other commercial
resources for mining and integrating them. Since many of these
databases are commercial there is no way to easily determine the
availability of information either within these or in the open
access databases. The availability of molecule databases such as
PubChem (http://pubchem.ncbi.nlm.nih.gov/) has changed
scientists’ expectations of web-based databases in many ways but
only goes part way to inform us about our chemical universe, and
in particular those molecules we might be interested in for their
pharmaceutical properties. For example, while the web has
provided improved access to chemistry-related information there
has not been an online central resource allowing integrated
chemical structure combined with biology data-searching of
chemistry or biology databases, chemistry articles, patents and
web pages such as blogs and wikis. For example the commercial
company Collaborative Drug Discovery, Inc. (CDD,
www.collaborativedrug.com)3 has integrated data from the NIMH
Psychoactive Drug Screening Program (PDSP,
http://pdsp.med.unc.edu/indexR.html).9–11 This dataset includes >20 000
compound structures associated with the biology data in one
place. Previously there had been links from the PDSP database to
PubChem only. This suggests to us an enormous opportunity to
link the diverse biological data residing in other databases,
patents and publications, with molecular structures.
There are many freely available chemical compound databases
on the web and they assume different forms.2,12 These files
generally contain the chemical identifiers in the form of chemical
names (systematic and trade) and registry numbers. Since the
files are assembled in a heterogeneous manner the resulting data
are plagued with inconsistencies and data quality issues. Such an
approach to gathering and merging data is a far cry from that
taken by commercial database vendors who manually gather and
curate data. While the commercial databases offer curated data
there is certainly a price barrier to accessing the information. A
number of the free online resources are also manually curated
and, as will be discussed later, can offer as high a quality as the
commercial offerings. These resources are, however, constructed
with a specific focus in mind and therefore commonly number in
the low thousands of structures rather than the millions available
in the larger online databases. Meanwhile, there are several large
online database resources offering access to valuable data and
knowledge.
The quality of chemical information in the public domain is
generally quite low. This does not mean that the data are not of
value but that care needs to be taken in the nature of the provider
as an authority. There is, of course, no central body responsible
for the quality of data in the public domain. Databases of chemical
structure information besides PubChem include ChemIDplus
(http://chem.sis.nlm.nih.gov/chemidplus/) and ChembioFinder
(http://chembiofinder.cambridgesoft.com/chembiofinder/SimpleSearch.aspx)
which are commonly looked upon as authorities
in terms of reliable information. However, these sources are also
aggregators of information and are at risk of perpetuating errors
from the original public data and depositions. Errors in
structure–identifier pairs are common and inaccurate structure
representations, specifically with regard to stereochemistry, proliferate
across many databases. A definitive description of the challenges
regarding quality in public domain databases and the rigorous
processes required to aggregate quality data was provided by
Richard et al.13 During their assembly of the EPA DSSTox
databases (http://www.epa.gov/NCCT/dsstox/) they assembled
the chemical structures, chemical names and CAS registry
numbers for over 8000 chemicals from numerous toxicity
databases. The data they extracted were carefully curated and
validated using multiple public information sources.
The creation, hosting and support of a curated compound
database containing structures of chemical and biological
interest with integrated content is an expensive enterprise.
Historically these databases have been built as a result of
hundreds if not thousands of man years of rigorous and exacting
human effort and then, for some of the original founders in this
domain, migrated onto computer systems. In the development of
these systems some host organizations have created sizeable
revenues. The hosting of large databases, the text-based searching
of immense amounts of data and the ability to disseminate
complex forms of graphical information via standard protocols
provide an opportunity for future disruptive offerings in this
domain whereby online offerings can also become authorities
and, with the support and input of the community, can offer the
benefits of crowdsourcing for enhancing the data.
Certain areas of the scientific literature, while still of high
value, can become antiquated fairly quickly. With the capabilities
of Internet-based searching and direct access to abstracts for
the majority of publishers, even a rudimentary text search can
expose articles previously unavailable except through an
abstracting service. Search engines will increasingly be utilized
for first level searches specifically because they are simple to use,
they are fast and are free. With chemically searchable patents
also available online, at no charge, the landscape for scientists
searching for information is more open than ever. We believe if
there are data of interest to be located then Internet search
engines will enable it.
The premier curated database offerings of today have an
interesting if not challenging future ahead of them. Their
value-added enhancements of the distributed data must be significant
enough to warrant an investment in their services. As expressed
earlier, the quality of the data resulting from curation is significant
but the longevity of that distinguishing factor moving
forward is questionable. Roboticized recognition and conversion
of chemical names to chemical structures can dramatically shift
this domain and efforts have already been demonstrated in
applications with patents and publications. Should the quality of
these efforts reach a sufficient standard then today’s publishers’
business models will definitely be at risk, as free content will be
greatly expanded compared with today.
Examples of free key databases with molecule information
The following represent examples of some free databases
containing molecule information of interest to chemists and
biologists in drug discovery:
PubChem
The highest profile online database is certainly PubChem which
was launched by the NIH in 2004 to support the ‘New Pathways
to Discovery’ component of their roadmap initiative.14 PubChem
archives and organizes information about the biological activities
of chemical compounds into a comprehensive biomedical
database and is the informatics backbone for the initiative,
intended to empower the scientific community to use small
molecule chemical compounds in their research. PubChem
Compound contains over 25 million unique structures and
provides biological property information for each compound.
The majority of databases discussed in this article now use two
primary identifiers in their systems—the CAS registry number
(commercial) and a PubChem ID number (non-commercial).
This alone indicates a shift in equality of commercial versus
public compound repositories. For now, PubChem remains
focused on its initial intent to support the National Molecular
Libraries Initiative.
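Because the CAS registry number serves as one of the two primary identifiers, one basic sanity check when aggregating records is to verify its check digit. The following is a minimal sketch rather than any database's documented procedure, and the function name is hypothetical. It assumes only the published check-digit rule: the final digit equals the sum of the preceding digits, weighted 1, 2, 3, ... counting from the right, modulo 10.

```python
import re


def is_valid_cas_rn(cas_rn: str) -> bool:
    """Return True if the string has the CAS RN layout and a correct check digit.

    Hypothetical helper for illustration; it checks format and checksum only.
    """
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_rn.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    weighted_sum = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return weighted_sum % 10 == check_digit


print(is_valid_cas_rn("50-78-2"))  # registry number of aspirin
print(is_valid_cas_rn("50-78-3"))  # same digits with an altered check digit
```

A passing check digit only shows that the number is well formed. It does not show that the number is paired with the right structure, which is the harder curation problem.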
DSSTox
The EPA distributed structure-searchable toxicity (DSSTox)
database project15,16 provides a series of documented, standard-
ized and fully structure-annotated files of toxicity information.
The initial intention for the project was to deliver a public central
repository of toxicity information to allow for flexible analogue
searching, SAR model development and the building of chemical
relational databases. In order to ensure maximum uptake by the
public and allow users to integrate the data into their own
systems the DSSTox project adopted the use of the common
standard file format (SDF) to include chemical structure, text
and property information. The DSSTox datasets are among the
most highly curated public datasets available and likely the
reference standard in publicly available structure-based toxicity
data.
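To illustrate why a plain-text SDF file is easy to integrate into other systems, the sketch below reads the data fields of each record using only the Python standard library. It assumes the usual SDF layout (a structure block, then `> <FieldName>` data items, with `$$$$` closing each record). The function names, field names and sample record are invented for illustration and are not taken from the DSSTox files.

```python
from typing import Dict, Iterator, List


def _parse_block(lines: List[str]) -> Dict[str, str]:
    """Collect the title line and all '> <Field>' data items of one record."""
    record = {"title": lines[0].strip()}
    field = None
    values: List[str] = []
    for line in lines[1:]:
        if line.startswith(">") and "<" in line:
            if field is not None:
                record[field] = "\n".join(values).strip()
            field = line[line.index("<") + 1 : line.rindex(">")]
            values = []
        elif field is not None:
            values.append(line)
        # Lines before the first data item (the structure block) are skipped.
    if field is not None:
        record[field] = "\n".join(values).strip()
    return record


def iter_sdf_records(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per record; a record ends at a '$$$$' line."""
    block: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if block:
                yield _parse_block(block)
            block = []
        else:
            block.append(line)
    # Trailing lines without a closing '$$$$' are ignored in this sketch.


# Invented single-atom record; field names and values are placeholders.
SAMPLE = """Compound A
  hypothetical

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <Name>
Compound A

> <Species>
rat

$$$$
"""

for record in iter_sdf_records(SAMPLE):
    print(record)
```

The structure block is deliberately left unparsed here; a cheminformatics toolkit would normally handle it, while the text and property fields can be loaded into any relational table.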
eMolecules (http://www.emolecules.com/)
This website offers a free online database of almost 8 million
unique chemical structures. The database is assembled from data
supplied by over 150 suppliers and provides a path to identifying
a vendor for a particular chemical compound. Their database
was recently enhanced by providing access to NMR, MS and IR
spectra from Wiley-VCH for over 500 000 compounds via
ChemGate, a fee-based service. eMolecules also provides links
to many sources of data for spectra, physical properties and
biological data.
DrugBank (http://www.drugbank.ca/)
This is a manually curated resource17 assembled from the
collection information of a series of other public domain data-
bases and enhanced with additional data generated within the
laboratories of the hosts. The database aggregates both bio-
informatics and cheminformatics data and combines detailed
drug data with comprehensive drug target (i.e. protein) infor-
mation. The database contains FDA approved small molecule
and biotech drugs as well as experimental drugs. Each record in
the database, known as a DrugCard, has >80 data fields. The
information is split into drug/chemical data and drug target or
protein data and many data fields are linked to other databases.
The database supports extensive text, sequence, chemical
structure and relational query searches.
PharmGKB (http://www.pharmgkb.org/)
This database brings together human genetic variation data that
impact drug response, including curated primary genotype and
phenotype data, variants and gene–drug–disease relationships
from the literature, along with key genes and drug pathways.18
The database also contains drugs with annotations related to
some pharmacokinetic (PK) properties but it does not appear
that you can query by these properties, molecular structure, or
output the data in a format needed for computational modeling.
Wikipedia (http://en.wikipedia.org/wiki/Main_Page)
This certainly represents an important shift in the future access of
information associated with small molecules. At present there are
approximately 6000 articles with a chembox or drugbox. The
detailed information offered regarding a particular chemical or
drug can be excellent. The advantage of a wiki is that changes can
be made within a few keystrokes and the quality is immediately
enhanced. This community curation process makes Wikipedia
a very important online chemistry resource whose impact will
only expand with time.
ZINC (http://zinc.docking.org/index.shtml)
This is a free database of commercially available compounds for
virtual screening.19,20 The library contains over 10 million
molecules, each with a 3D structure and gathered from the
catalogs of compounds from vendors. All molecules in the
databases are assigned biologically relevant protonation states
and annotated with molecular properties. The database is
available for free download in several common file formats and
a web-based search page, including a molecular drawing inter-
face which allows the database to be searched.
SureChem (http://www.surechem.org/)
This site provides chemically intelligent searching of a patent
database containing over 8 million US, European and World
patents. Using extraction heuristics to identify chemical and
trade names and conversion of the extracted entities to chemical
structures using a series of name to structure conversion tools,
SureChem has delivered a database of over 10 million individual
chemical structures. The free access online portal allows scien-
tists to search the system based on structure, substructure or
similarity of structure, as well as the text-based searching
expected for patent inquiries.
ChemSpider (http://www.chemspider.com/)
ChemSpider2,12 was initially developed as a hobby project by
a small group of dedicated cheminformatics specialists. The
intention was to aggregate and index available sources of
chemical structures and their associated information into
a single searchable repository and make it available to every-
body, at no charge. ChemSpider was unveiled to the public in
March 2007 with the intention of “building a structure centric
community for chemists”. ChemSpider has grown into
a resource containing over 21 million unique chemical struc-
tures. The data sources have been gathered from chemical
vendors as well as commercial database vendors and publishers
and members of the Open Notebook Science community.
ChemSpider has also integrated the SureChem patent database
collection of structures to facilitate links between the systems.
The database can be queried using structure/substructure
searching and alphanumeric text searching of both intrinsic and
predicted molecular properties. The ChemSpider developers
also added virtual screening results using the LASSO similarity
search tool to screen the ChemSpider database against all 40
target families from the Directory of Useful Decoys (DUD)
dataset.
Fig. 1 Screenshot of a molecule record in ChemSpider. The record for Xanax shows the header of the related Wikipedia article, links to 10 patents (with over 6400 available), a series of PubMed articles (the long list is truncated for this figure) and a series of predicted properties. The full record is available here: http://www.chemspider.com/2034.
ChemSpider has enabled unique capabilities relative to the
primary public chemistry databases. These include real-time
curation of the data, association of analytical data with chemical
structures, real-time deposition of single or batch chemical
structures (including with activity data) and transaction-based
predictions of physicochemical data. The system developers have
also made available a series of web services to allow integration
to the system for the purpose of searching the system as well as
generation of InChI identifiers and conversion routines.
The system also integrates text-based searching of open
access (OA) articles. The index is expected to increase
dramatically as they extract chemical names from OA articles
and convert the names to chemical structures using name to
structure conversion algorithms. These chemical structures will
be deposited back to the ChemSpider database thereby facili-
tating structure and substructure searching in concert with text-
based searching.
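The indexing loop described here can be illustrated with a deliberately simple stand-in. Real name-to-structure tools parse systematic nomenclature; the sketch below swaps in a plain dictionary lookup, so it shows only how recognized names could be turned into structure keys that point back to articles. The dictionary, the function names and the article identifiers are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Tiny illustrative dictionary of names and SMILES strings (a stand-in for a
# real name-to-structure converter).
NAME_TO_SMILES: Dict[str, str] = {
    "ethanol": "CCO",
    "benzene": "c1ccccc1",
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
}


def extract_structures(text: str) -> List[Tuple[str, str]]:
    """Return (name, SMILES) pairs for known names, once each, in text order."""
    found: List[Tuple[str, str]] = []
    seen = set()
    for token in re.findall(r"[A-Za-z][A-Za-z0-9\-]*", text):
        name = token.lower()
        if name in NAME_TO_SMILES and name not in seen:
            seen.add(name)
            found.append((name, NAME_TO_SMILES[name]))
    return found


def build_structure_index(articles: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each recognized structure to the articles that mention it."""
    index: Dict[str, List[str]] = {}
    for article_id, text in articles.items():
        for _, smiles in extract_structures(text):
            index.setdefault(smiles, []).append(article_id)
    return index


articles = {
    "oa-1": "Aspirin was dissolved in ethanol.",
    "oa-2": "Ethanol and benzene were used as solvents.",
}
print(build_structure_index(articles))
```

Keying the index on a structure representation rather than on the name is what lets a structure or substructure query reach articles that use different synonyms for the same compound.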
ChemSpider has a focus on, and commitment to, community
curation and ease of use (Fig. 1). The social community aspects of
the system demonstrate the potential of this approach. The team
have committed to the release of a Wiki-like environment for
further annotation of the chemical structures in the database,
a project they term WiChempedia. They will utilize both available
Wikipedia content and deposited content from users to enable the
ongoing development of community curated chemistry.
ChemSpider was acquired in May 2009 by the Royal Society of
Chemistry and will continue to grow in its reach into the chemistry,
biology and biomedical research communities with
a number of specific missions:
(1) Improving the quality of available information. With
millions of indexed compounds ChemSpider has enabled
a community-based curating process to help in improving the
association between a chemical compound and a set of identifiers
(systematic names, trade names, synonyms and registry
numbers).
(2) Increased access to chemistry-related information. There
are many types of data and information that can be associated
with chemical compounds and made available to the benefit of
the chemistry community. As an example of this the association
of analytical data and the integration to patent searches have
been demonstrated and the integration to QSAR-based modeling
is presently in progress.
(3) Provide access to online tools and services. ChemSpider
already serves up the online prediction of certain chemical
properties for chemists to take advantage of, and a number of
software algorithms provided by collaborators will be added into
the system. Web services such as the recently exposed InChI and
OpenBabel services will continue to be made available as
a service to the community.
ChemSpider is acknowledged by scientists as a valuable
resource for understanding chemistry.21 ChemSpider can be
linked with other software and databases from other groups
(academia or industry). For example Collaborative Drug
Discovery, Inc. (www.collaborativedrug.com) recently provided
links to ChemSpider for molecules in this database. This enables
the users to access more information about their molecules
(Fig. 2).
There are a multitude of other examples of databases and
Wikis linking to ChemSpider. These include Wikipedia, PubChem
and many others. Other databases such as WikiProteins
and GeneWiki are presently developing their integration links to
ChemSpider. There have also been several applications of
ChemSpider for generating structure–activity relationships. An
example of this application22 used ChemSpider to provide
structures and molecule properties for a human drug metabolizing
enzyme. It should be noted that ChemSpider allows the
user to download structures of interest and molecular properties
so these could be used in other computational or analysis software
(Fig. 1). A second example used ChemSpider to derive
molecular properties for machine learning models to predict
biopharmaceutical characteristics of drugs.23 A third application
used ChemSpider to follow-up a molecule selected by pharmacophore
searching of vendor databases as a potential pregnane
X receptor antagonist (a potential target for modulating anticancer
drug metabolism, transport, etc.). Substructure searching
in ChemSpider indicated additional molecules of interest for
testing which were validated in vitro and shown to have
activity.24 The above examples illustrate how the content in
ChemSpider is useful to the scientific community involved in
drug discovery and how free connectivity between tools via the
web may enable a much broader impact. It is likely that
collaborative software in this space will also require links to
ChemSpider as a minimum. ChemSpider has seen good growth
in the number of users considering there has been no investment
in publicity, now averages over 6000 unique visitors per day and
continues to be described frequently in publications and
presentations.1,2,12,25
Fig. 2 A screenshot from the Collaborative Drug Discovery database showing the ChemSpider link below a molecule in the EPA ToxCast dataset.
Page 6
hidden
One current use for ChemSpider could be to find out as much
about a compound as possible as researchers can eliminate
undesirable leads early in the lead generation process by quickly
accessing information on the pharmacological effects, side effects
Fig. 3 The results of a text search on ‘‘gefitinib’’. The record shows the structu
InChI Key) and links to multiple data sources. The header from the Wikipedi
18 | Lab Chip, 2010, 10, 13–22and drug–drug interactions for similar compounds or compound
classes of interest, as well as their corresponding metabolites. For
example for the compound gefitinib (Iressa), what preclinical
information exists? A search was initiated through ChemSpider
re, intrinsic properties, systematic name and identifiers (InChI String and
a article is shown. The bolded names show manually validated identifiers.
This journal is ª The Royal Society of Chemistry 2010
Page 7
hidden
data are reproduced by different groups when comparing their
own proprietary compounds with a competitor compoundand produced one hit with the results shown in Fig. 3. The results
display for gefitinib includes the chemical structure, a series of
intrinsic and predicted properties, links to a number of original
data sources for associated information and a number of
alphanumeric identifiers, some of which are validated. A number
of names, database IDs and synonyms connect to Wikipedia via
the [Wiki] link, links to patents are immediately viewable and
any articles containing gefitinib or other synonyms in the title or
abstract are linked through to PubMed.
The list of data sources shown in the figure relates to various
forms of information. Each is marked with the type of infor-
mation associated with each data source to assist the user in
deciding what data to examine. Each source listed in the data
source column is hyperlinked to a description of the depositor.
Where possible the entries in the external ID column have been
hyperlinked to external information. By combining a search of
PubChem, PubMed, DrugBank, ChemSpider and SureChem, it
is possible to obtain (fairly quickly) access to a majority of the
published data on this compound. This includes data such as
drug safety information, toxicology, pharmacology, metabolic
pathways, metabolites, synthetic routes, patents and suppliers.
Interrogating the data across multiple systems is, however,
challenging and time-consuming and integration would be
valuable to compare/compete with commercial databases.
Commercial preclinical ADME/Tox databases and the
precompetitive space
The major commercial vendors of preclinical data relating to
molecules of interest include the Prous Ensemble database
which provides information on more than 127 000 bioactive
compounds in the drug research and development pipeline
relating to over 275 000 references to the biomedical and
congress literature and more than 33 000 patent families cited.
A second product, the Aureus AurSCOPEADME/DDI (drug–
drug interactions), is a fully annotated, structured database
containing biological and chemical information on metabolic
properties of drugs. The same company has a database for the
potassium channel human ether-a-go-go related gene (hERG).
This channel is particularly important pharmaceutically as many
drugs interact and cause hERG-related cardiotoxicity.
Numerous blockbuster drugs have recently been removed from
the market due to QT syndrome side effects, an abnormality
associated with the hERG and associated channels.26 In addition
they have the AurSCOPE Nuclear Receptor database and
a pharmacological activity profiler called AurPROFILER
which rapidly conducts thorough searches across all individual
AurSCOPE Target Knowledge Databases or AurSCOPEGlobal
Pharmacology Space to rapidly identify target, cell or drug/
compound profiles. Results are displayed as interactive ‘‘heat
maps’’ for easy visualization and navigation of the pharmaco-
logical space—the target or cell.
An additional commercial database is the PharmaPendium
from Elsevier (https://www.pharmapendium.com/) which
captures data from the FDA freedom of information documents
and EMEA ‘‘EPAR’’ approval documents. This database has
a large amount of preclinical and clinical data, uses the medical
dictionary for regulatory activities (MedRA) standardized
terminology and is structure/substructure searchable.

This journal is © The Royal Society of Chemistry 2010
Lab Chip, 2010, 10, 13–22 | 19

There is a movement towards collaborations between
biomedical organizations both industrial and academic that are
precompetitive in nature covering areas such as cheminformatics,
toxicology, preclinical toxicology and beyond. Examples
include those organized by the Health and Environmental
Sciences Institute (HESI,
http://www.hesiglobal.org/i4a/pages/index.cfm?pageid=3279),
the Pistoia Alliance (http://pistoiaalliance.org), the Critical
Path Institute (C-Path, http://www.c-path.org/), the Drug Safety
Executive Council (DSEC,
http://www.drugsafetycouncil.org/pages/42_dsec_mission.cfm),
Enlight Biosciences (http://www.enlightbio.com/content/about-enlight/)
and Innovative Medicines Initiative (IMI,
http://imi.europa.eu/index_en.html).27,28 We would argue that
ADME/Tox data are also precompetitive data and should be made
freely available on the web as a resource for all scientists.
ADME/Tox information (we use the term broadly to include
everything from in vivo and in vitro preclinical data) are data
that are ultimately provided for registration with regulatory
bodies and become available in package inserts for drugs or in
widely distributed publications. Generating these data is costly
and in many cases
(which may not be widely known). This is entirely unnecessary.
Why not share these data? It would certainly enable the industry
to quickly understand ADME/Tox liabilities with different
classes of compounds targeting a specific indication and enable
the generation of computer models for these properties.

We propose that the scientific community should tackle the
lack of public databases that contain preclinical ADME/Tox or
pharmacokinetic data. This can be achieved by either creating
a new database or preferably expanding the preexisting freely
available ChemSpider database with all of the ADME/Tox as
well as pharmacokinetic properties available from studies
published in the literature (which use animal or human tissues or
from in vivo studies). At the same time the number of links (and
therefore connectivity) to other currently available databases
available on the web should be increased.

How to create a freely available ADME/Tox database

It is one thing to propose the construction of such an ADME/Tox
database and another to actually execute on this vision. For
those that may take up the challenge it is perhaps worth
considering at least one strategy for how the scientific community
could build such a resource:
(1) Identify all available publications containing ADME/Tox
and PK properties data relating to molecular structures tested in
animal or human tissues in vitro or in vivo. Mine the data from
these publications relating to ADME/Tox and PK properties.
(2) Clean and organize data from these publications e.g. relate
by species, tissue, cell types and capture experimental conditions
using manual curation, create an ontology.
(3) Provide a means for other scientists to update and include
new ADME/Tox and PK properties.
(4) Encourage pharmaceutical companies to publish their
previously ‘unpublished preclinical data’ in exchange for access
to a duplicate of the database of ADME/Tox and PK data for
their own in-house efforts for internal deployment.
(5) As an example of the development and value of such
a database, ADME/Tox and PK computational models could be
built, validated and provided over the web for free (to the
academic community).
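
As a rough illustration of steps (2) and (5), the sketch below shows one way a curated measurement might be stored with its species, target hierarchy and assay context, and how a consistent single-property dataset could then be pulled out for model building. It is only a sketch under stated assumptions: the record fields, the function name `consistent_dataset` and the example values are all hypothetical and are not part of ChemSpider or any existing database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AssayRecord:
    """One curated measurement. All field names are illustrative assumptions."""
    compound: str       # compound name or identifier as reported
    species: str        # e.g. "Human"
    target_path: tuple  # hierarchy, e.g. ("Enzyme", "CYP", "CYP3", "CYP3A4")
    role: str           # "Substrate" or "Inhibitor"
    measure: str        # "Km", "Ki", "IC50", ...
    value: float
    units: str          # e.g. "uM"
    source: str         # citation or link back to the original publication


def consistent_dataset(records, species, target, role, measure, units):
    """Group values by compound for one property measured in one way."""
    grouped = defaultdict(list)
    for r in records:
        if (r.species == species and r.target_path[-1] == target
                and r.role == role and r.measure == measure
                and r.units == units):
            grouped[r.compound].append(r.value)
    return dict(grouped)


# Hypothetical records with made-up values (not literature data).
CYP3A4 = ("Enzyme", "CYP", "CYP3", "CYP3A4")
PGP = ("Transporter", "P-gp")
records = [
    AssayRecord("compound A", "Human", CYP3A4, "Inhibitor", "IC50", 0.5, "uM", "ref X"),
    AssayRecord("compound A", "Human", CYP3A4, "Inhibitor", "IC50", 0.7, "uM", "ref Y"),
    AssayRecord("compound A", "Human", CYP3A4, "Inhibitor", "Ki", 0.2, "uM", "ref Y"),
    AssayRecord("compound B", "Human", CYP3A4, "Substrate", "Km", 12.0, "uM", "ref X"),
    AssayRecord("compound C", "Rat", PGP, "Inhibitor", "IC50", 3.0, "uM", "ref Z"),
]

cyp3a4_ic50 = consistent_dataset(records, "Human", "CYP3A4", "Inhibitor", "IC50", "uM")
print(cyp3a4_ic50)
```

Keeping the source reference on every record means replicate values from different publications stay traceable, so a curator or modeller can decide how to reconcile them rather than having them silently averaged.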

Such data-mining could be achieved by using PubMed, Google
Scholar, Highwire, SureChem and other available software to
search journals, patents and the web. It will be necessary to
capture the individual publications and provide links to
electronic sources (whether open or commercial). We would suggest
specifically annotating information like those in Table 1 for the
major important enzymes like cytochrome P450s which are
involved in clinically relevant drug–drug interactions, for
example capturing substrate and inhibitor data.22 For nuclear
receptors we could capture agonist and antagonist data which
would be of most interest.24,29 A simple hierarchy could be used
in data curation and also in the final data schema (Table 2).
Further consultation with experts in this particular domain of
schema development may be necessary for such a project. The
database should be able to be updated by an array of users
(the community) and also be integrated to other web-based
databases such as clinical trial databases. These databases could
link from a compound name to the structure in, for example,
ChemSpider to show preclinical properties (enzymes inhibited,
pharmacokinetic data, etc.). Providing the database behind
company firewalls may be necessary for companies to use the
tool for their in-house and private searches and mining rather
than sending queries over the Internet.

Table 1 Targeted data types required for the ADME/Tox and PK database
ADME/Tox data | Estimated amounts of data | Data types
Cytochrome P450s and other enzymes e.g. phase II | 1000’s of data points | Km, Ki, and IC50
Transporters (e.g. P-gp, BCRP) | Several hundreds to 1000 | IC50, some substrate data
Ion channels (hERG) | <1000 for individual channels | IC50 (different cell types)
Nuclear receptors (e.g. PXR) | <1000 for individual receptors (48 receptors in human) | EC50 and fold activation
Pharmacokinetics | 1000’s for drugs and failed candidates in different species | AUC, Tmax, etc.

Table 2 Example of data schema for the ADME/Tox and PK database
1. Species: Human
2. Enzyme
3. CYP
   CYP3
   CYP3A4
   CYP3A4 Substrate
      Km
      Heteroactivation
      Homoactivation
   CYP3A4 Inhibitor
      Ki
      IC50
      % Inhibition
      Metabolic intermediate complex formation

Uses for the database

Having captured the majority of published ADME/Tox and PK
data it will be possible to generate consistent datasets (e.g. for
a single property like CYP3A4 inhibition) that can then be used
with external algorithms to generate predictive models. These
models could be automatically developed and updated30 as new
data are added to the database, with tools and descriptors that
can then be implemented back into web-based software as
predictors. Such models could possibly be facilitated by
commercial pipelining tools such as Pipeline Pilot (Accelrys, San
Diego, CA) which has been widely used with Bayesian modeling
methods.31,32 In addition to generating such models it may be
possible to derive simple rules for some of the ADME/Tox
properties. It is likely the greatest value of such a database will be
as a historic reference source for scientists in drug discovery and
prevent repetitive experiments on the same compound.

Discussion

The technologies supporting chemistry, while immature, are fast
developing to support chemical structures and reactions,
analytical data support, and integration to related data sources
via supporting software technologies. Communication in
chemistry is already witnessing a new revolution. The diversity of
information available online is expanding at a dramatic rate and
a shift to publicly available resources offers significant
opportunities in terms of the benefit to science and society. Biomedical
researchers today have access to hundreds of thousands of
chemistry, biology and clinical articles via searches on platforms
including PubMed, Google Scholar and ChemSpider. While the
general nature of text-based searches provides a familiar
environment for chemists to search and review their results, a
chemist’s natural affinity for communicating via chemical structures
demands the need to perform searches in their “natural
language”. Ask a chemist their preferred manner for searching
chemistry databases and you will generally receive a response
pointing to structure-based searching. There are certainly
commercial solutions to provide chemical structure-based
searches of literature and patent data (CAS, Infochem and
Symyx to name but a few) as well as a myriad of solutions for
managing in-house organizational data collections. The
challenge is finding chemistry, specifically chemical structures,
across the web in the databases described above and in the
thousands of books and journals.
A number of organizations generate sizeable revenues from the
creation of chemistry databases for the life sciences industry. The
Chemical Abstracts Service alone generates annual revenue in
excess of $250 million. The total annual fees for accessing
this information when other companies are included in the
calculation will significantly exceed this figure. The primary
advantage of commercial databases is that they have been
manually examined by skilled curators, addressing the tedious
task of quality data-checking. Certainly, the aggregation of data
from multiple sources, both historical and modern, from
multiple countries and languages and from sources not available
electronically, is a significant enhancement over what is available
via an Internet search alone. The question remains: for how long
will this be the case?
CAS and their CAS registry numbers (RNs) have played
a dominant role in managing a curated registry of chemical
entities and related chemical and biological literature. Their
proprietary registration system does not link to chemical struc-
tures in the public domain and their business model is likely at
risk. However, the scientific community as a whole is likely to
reap increasing benefits from the growing number of free access
services and content databases. While commercial vendors
generally have a highly moderated release cycle of new func-
tionality and capabilities, online services tend to move at a much
faster pace adding new capabilities, resolving issues and adding
fresh technologies on a rolling basis. This type of drive both
excites present users and draws new users to the expanding
offerings. Academics in particular are likely to have an increased
focus on the use of free access databases and tools as it is
demonstrative of the newfound freedom of information. The
benefit for them of course is reduced expenditure on
commercial offerings. This is even more pronounced in developing
countries where free access systems are the primary resources for
information since commercial offerings are simply out of reach
due to price barriers. One could also imagine the development of
such tools being funded by a micropayment system for those
willing to do this.
In terms of data quality issues, the Internet generation has
already demonstrated a willingness to curate, modify and
enhance the quality of content as modeled by Wikipedia. With
the appropriate enhancements in place, online curation and
markup of the data in real-time can quickly address errors in the
data as has already been demonstrated by the ChemSpider
system.
Increasing access to free and open access databases of both
chemistry and biological data is certainly impacting the manner
by which scientists access information. These databases are
additional tendrils in the web of Internet resources that continue
to expand in their proliferation of freely accessible data and
information, such as patents, open and free access peer-reviewed
publications and software tools for the manipulation of
chemistry-related data. As data-mining tools expand in their
bilities and performance, the integration of chemistry and
biology databases is likely to offer even greater opportunities to
benefit the process of drug discovery. As these databases grow in
both their content and their quality, there may be challenging
times ahead with regard to the commercial business models of
publishers versus the drive towards more freely available data.
In summary, we have proposed that there should be an effort
to build a structure-centric community for biomedical
researchers with key information relevant to drug discovery
which is precompetitive. We believe a free database of preclinical
properties will accelerate ADME/Tox and PK computational
model building, prevent different groups from repeating the same
experiments, reduce the number of animal experiments,
reduce the biological and chemical reagents used and generally
benefit the whole biomedical research community. While there
are databases that specialize in maintaining the absolute privacy
of the researchers’ data, there is a growing movement by some
scientists for the open dissemination of their data (e.g. Open
Notebook Science, http://usefulchem.wikispaces.com/). The
provision of the ChemSpider database currently fits with this
model in that data can be published to the community. It is
important to consider some opportunities and limitations of such
a free database. By providing actual experimental preclinical
data in the database it could also be used for validation of other
computational models e.g. those already integrated in Chem-
Spider or other databases/tools developed by third parties. These
molecules could represent test sets which would also be of value
to the biomedical research community in general. A major
limitation is to capture the public information as most data will
be in publications. Having access to many online journal
subscriptions relevant to ADME/Tox and PK data, e.g. ASPET
journals, Wiley, Springer, Elsevier and Nature journals,
will be essential. These academic and commercial
publishers are likely responsible for the majority of the data that
appear in this domain. It may be possible to negotiate with such
publishers to get access to their historic ADME/Tox data for this
database in return for links to the original data source. One could
limit the data collected to the recent few decades with the
assumption that more recent publications will contain more
relevant data. Going forward publishers could require that
authors deposit their ADME/Tox data in the database as
a condition of publication. This is analogous to how protein
structures are deposited in the Protein DataBank, or microarray
data are deposited in various databases. Once data are uploaded
and available in a database such as ChemSpider, public
support, recognition and validation would be obtained via
publication, web page blogs, invited oral presentations to
conferences, etc. This is critically important for the information
to gain maximum visibility and to be evaluated by the experts. As
the target audience here is predominantly biomedical researchers
it will be important to present such a database at key conferences
which may raise awareness with the maximum number of
researchers. This would also be a critical way to capture new and
previously unpublished data. In conclusion, the ADME/Tox and
PK database curation project proposed here could cost-effectively
and extensively leverage the existing ChemSpider database
and the cheminformatics expertise built to date.
Conflicts of interest statement
SE consults for Collaborative Drug Discovery Inc. and is
a member of the ChemSpider Advisory Group. AJW is employed
by the Royal Society of Chemistry which owns ChemSpider and
associated technologies.
Acknowledgements
SE acknowledges Collaborative Drug Discovery Inc and Accelrys
for providing access to their software.
References
1 A. J. Williams, Crowdsourcing, collaborations and text mining in
a world of open chemistry, 2008,
http://www.slideshare.net/AntonyWilliams/crowdsourcing-collaborations-and-text-mining-in-a-world-of-open-chemistry-presentation.
2 A. J. Williams, Internet-based tools for communication and
collaboration in chemistry, Drug Discovery Today, 2008, 13, 502–506.
3 M. Hohman, K. Gregory, K. Chibale, P. J. Smith, S. Ekins and
B. Bunin, Novel web-based tools combining chemistry informatics,
biology and social networks for drug discovery, Drug Discovery
Today, 2009, 14, 261–270.
4 D. S. Bailey and E. D. Zanders, Drug discovery in the era of
Facebook—new tools for scientific networking, Drug Discovery
Today, 2008, 13, 863–868.
5 S. Ekins, B. J. Ring, J. Grace, D. J. McRobie-Belle and
S. A. Wrighton, Present and future in vitro approaches for drug
metabolism, J. Pharmacol. Toxicol. Methods, 2000, 44, 313–324.
6 S. Ekins, C. L. Waller, P. W. Swaan, G. Cruciani, S. A. Wrighton and
J. H. Wikel, Progress in predicting human ADME parameters
in silico, J. Pharmacol. Toxicol. Methods, 2000, 44, 251–272.
7 S. Ekins and P. W. Swaan, Computational models for enzymes,
transporters, channels and receptors relevant to ADME/TOX, Rev.
Comput. Chem., 2004, 20, 333–415.
8 L. J. Jolivette and S. Ekins, Methods for predicting human drug
metabolism, Adv. Clin. Chem., 2007, 43, 131–176.
9 R. T. Strachan, G. Ferrara and B. L. Roth, Screening the
receptorome: an efficient approach for drug discovery and target
validation, Drug Discovery Today, 2006, 11, 708–716.
10 K. A. O’Connor and B. L. Roth, Finding new tricks for old drugs: an
efficient route for public-sector drug discovery, Nat. Rev. Drug
Discovery, 2005, 4, 1005–1014.
11 B. L. Roth, E. Lopez, S. Beischel, R. B. Westkaemper and
J. M. Evans, Screening the receptorome to discover the molecular
targets for plant-derived psychoactive compounds: a novel
approach for CNS drug discovery, Pharmacol. Ther., 2004, 102,
99–110.
12 A. J. Williams, A perspective of publicly accessible/open-access
chemistry databases, Drug Discovery Today, 2008, 13, 495–501.
13 A. M. Richard, L. Swirsky Gold and M. C. Nicklaus, Chemical
structure indexing of toxicity data on the Internet: Moving toward
a flat world, Curr. Opin. Drug Discovery Dev., 2006, 9, 314–325.
14 National Institutes of Health, Office of Portfolio Analysis and
Strategic Initiatives, The NIH Roadmap Initiative, 2008,
http://nihroadmap.nih.gov/.
15 A. M. Richard and C. R. Williams, Distributed structure-searchable
toxicity (DSSTox) public database network: a proposal, Mutat.
Res., 2002, 499, 27–52.
16 A. M. Richard, DSSTox web site launch: Improving public access to
databases for building structure-toxicity prediction models,
Preclinica, 2006, 2, 103–108.
17 D. S. Wishart, C. Knox, A. C. Guo, S. Shrivastava, M. Hassanali,
P. Stothard, Z. Chang and J. Woolsey, DrugBank: a comprehensive
resource for in silico drug discovery and exploration, Nucleic Acids
Res., 2006, 34, D668–672.
18 T. E. Klein, J. T. Chang, M. K. Cho, K. L. Easton, R. Fergerson,
M. Hewett, Z. Lin, Y. Liu, S. Liu, D. E. Oliver, D. L. Rubin,
F. Shafa, J. M. Stuart and R. B. Altman, Integrating genotype and
phenotype information: an overview of the PharmGKB project.
Pharmacogenetics Research Network and Knowledge Base,
Pharmacogenomics J., 2001, 1, 167–170.
19 J. J. Irwin and B. K. Shoichet, ZINC—a free database of
commercially available compounds for virtual screening, J. Chem.
Inf. Model., 2005, 45, 177–182.
20 J. J. Irwin, F. M. Raushel and B. K. Shoichet, Virtual screening
against metalloenzymes for inhibitors and substrates, Biochemistry,
2005, 44, 12316–12328.
21 G. Brumfiel, Chemists spin a web of data, Nature, 2008, 453, 139.
22 S. Ekins, M. Iyer, M. D. Krasowski and E. D. Kharasch, Molecular
characterization of CYP2B6 substrates, Curr. Drug Metab., 2008, 9,
363–373.
23 A. Khandelwal, P. Bahadduri, C. Chang, J. E. Polli, P. Swaan and
S. Ekins, Computational models to assign biopharmaceutics drug
disposition classification from molecular structure, Pharm. Res.,
2007, 24, 2249–2262.
24 S. Ekins, V. Kholodovych, N. Ai, M. Sinz, J. Gal, L. Gera,
W. J. Welsh, K. Bachmann and S. Mani, Computational discovery
of novel low micromolar human pregnane X receptor antagonists,
Mol. Pharmacol., 2008, 74, 662–672.
25 A. J. Williams, Qualifying online information resources for chemists.
Presentation at The Library of Congress, 2008, http://www.scivee.tv/
node/9267.
26 W. J. Crumb Jr, S. Ekins, D. Sarazan, J. H. Wikel, S. A. Wrighton,
C. Carlson and C. M. Beasley, Effects of antipsychotic drugs on Ito,
INa, Isus, IK1, and hERG: QT prolongation, structure activity
relationship, and network analysis, Pharm. Res., 2006, 23, 1133–
1143.
27 A. J. Hunter, The Innovative Medicines Initiative: a pre-competitive
initiative to enhance the biomedical science base of Europe to
expedite the development of new medicines for patients, Drug
Discovery Today, 2008, 13, 371–373.
28 N. Kamel, C. Compton, R. Middelveld, T. Higenbottam and
S. E. Dahlen, The Innovative Medicines Initiative (IMI): a new
opportunity for scientific collaboration between academia and
industry at the European level, Eur. Respir. J., 2008, 31, 924–
926.
29 S. Kortagere, D. Chekmarev, W. J. Welsh and S. Ekins,
Hybrid scoring and classification approaches to predict
human pregnane X receptor activators, Pharm. Res., 2009, 26,
1001–1011.
30 J. Cartmell, S. Enoch, D. Krstajic and D. E. Leahy, Automated QSPR
through Competitive Workflow, J. Comput. Aided Mol. Des., 2005,
19, 821–833.
31 D. Rogers, R. D. Brown and M. Hahn, Using extended-connectivity
fingerprints with Laplacian-modified Bayesian analysis in high-
throughput screening follow-up, J. Biomol. Screen., 2005, 10, 682–
686.
32 M. Hassan, R. D. Brown, S. Varma-O’Brien and D. Rogers,
Cheminformatics analysis and learning in a data pipelining
environment, Mol. Divers., 2006, 10, 283–299.

