Public chemical compound databases.

by Antony J Williams
Current Opinion in Drug Discovery & Development (2008)

Abstract

The internet has rapidly become the first port of call for all information searches. The increasing array of chemistry-related resources that are now available provides chemists with a direct path to the information that was previously accessed via library services and was limited by commercial and costly resources. The diversity of the information that can be accessed online is expanding at a dramatic rate, and the support for publicly available resources offers significant opportunities in terms of the benefits to science and society. While the data online do not generally meet the quality standards of manually curated sources, there are efforts underway to gather scientists together and 'crowdsource' an improvement in the quality of the available data. This review discusses the types of public compound databases that are available online and provides a series of examples. Focus is also given to the benefits and disruptions associated with the increased availability of such data and the integration of technologies to data mine this information.

Available from www.ncbi.nlm.nih.gov
Current Opinion in Drug Discovery & Development 2008 11(3):393-404
© The Thomson Corporation ISSN 1367-6733
Abbreviations
InChI: International Chemical Identifier; PDB: Protein Data Bank
Introduction
It is likely that the majority of scientists use the internet on
a daily basis. There is little doubt that the World Wide Web
is the primary portal to query for information and data and,
when coupled with the intranet services of most companies,
is the tool of choice for most general searches. For many
years, the search for scientific information would start in the
library and usually engaged professionals who were skilled
in information searching. These people would have a deep
understanding of how to navigate the plethora of databases
and resources, using their own query languages, and would
perform searches using for-fee resources. While such
skills remain of value, most scientists now conduct the
majority of their own searches and certainly utilize their
access to a no-cost, intuitive and expansive internet of
information. There has been a tremendous growth in
scientific internet resources and there are enormous
opportunities provided by such facile access to chemistry
information and data.
Bioinformatics established the trend of providing online
access to data, and chemistry, in many ways, is far behind.
Open-access databases, such as GenBank [1] and the
Protein Data Bank (PDB) [2], have been assisting biologists
and chemists to extract biological relevance from gene and
protein sequences for over two decades. It is possible that
the differences in access between the two scientific fields
largely result from publishers in chemistry discouraging the
open flow of data and information. This is true not only for
scientific articles, but also for chemistry databases. With
the changing expectations of society in terms of freedom
of access to information, and the efforts of many public
access advocates, a shift toward both free and open access
(vide infra) chemistry-related information is well underway
and is likely to accelerate.
Murray-Rust envisages a world in which all scientific
information is instantly available [3•]. This emerging world
of e-science or cyberscholarship seeks "to develop the tools,
content and social attitudes to support multidisciplinary,
collaborative science. Its immediate aims are to find ways
of sharing information in a form that is appropriate to all
readers." This review discusses the research already
underway to support this noble and valid effort to provide
enhanced public access to chemistry data, and specifically
focuses on public chemical compound databases.
Public chemistry databases
There are many tens of indexes of chemistry databases
available online and the reader is encouraged to perform
one or more generic searches on 'chemistry databases'
to retrieve a list of related information. The author's
preferred database index is the Chemical Information
Sources Wiki, originally created by Gary Wiggins [4•]. While
the availability of freely accessible information is of clear
value to scientists, there are risks in terms of the quality of
the information. This issue of quality is one that the
mainstream publishers focus on during their peer review,
editorial and curation processes, and their efforts certainly
provide added value in terms of access to qualified scientific
information. No process is perfect, however, and inaccuracies
appear even in reviewed publications and databases.
That said, public compound databases are likely to have
a significant disruptive impact on the business models of
publishers, especially because of the increased capabilities
and diversity of data they offer.
Author address
Antony J Williams, ChemZoo Inc, 904 Tamaras Circle, Wake Forest, NC 27587, USA
Email: antony.williams@chemspider.com
Keywords
Blogs, chemical structure databases, cheminformatics, data mining, internet chemistry, open data, public
databases, wikis
There are many freely available chemical compound
databases on the web and they are assembled in a variety of
different forms. The simplest form is a collection of chemical
structures aggregated into a single file, generally a structure
data file (SDF) [5], which is made available, gratis, for people
to download and import into a database for searching and
viewing. There are hundreds of such files online and they
are commonly available from chemical vendors in order
to advertise their catalog collections. These files generally
contain chemical identifiers in the form of chemical names
(systematic and trade) and registry numbers, and can also
include experimental or physical properties, file-specific
identifiers and pricing information. There are aggregators
who gather such files of chemical structures and related
information and assemble them into a single public database,
some examples of which are discussed below; however, as
the files are assembled in a heterogeneous manner, the
resulting data are plagued with inconsistencies and data
quality issues. Such a method of gathering and merging
data is notably different from the manual approach taken by
commercial database vendors, such as Chemical Abstracts
Service (CAS) [6], InfoChem GmbH [7] and Symyx
Technologies Inc [8].
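To make the SDF-based form of distribution more concrete, the following is a minimal sketch, in Python, of how the identifier and property fields in such a file might be read before being loaded into a database. It assumes the common SDF layout in which each record holds a molfile block ending in "M  END", followed by "> <FIELD>" data items, and is closed by a "$$$$" line. The sample records, the field names (NAME, CAS_RN, PRICE_USD) and the empty placeholder molfile blocks are illustrative assumptions and do not come from any real vendor file.

```python
from io import StringIO

# Illustrative two-record file. The molfile blocks are empty placeholders
# (0 atoms, 0 bonds) and the field names are assumptions for this sketch.
SAMPLE = """\
aspirin
  hypothetical-vendor

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <NAME>
aspirin

> <CAS_RN>
50-78-2

$$$$
caffeine
  hypothetical-vendor

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <NAME>
caffeine

> <CAS_RN>
58-08-2

> <PRICE_USD>
12.50

$$$$
"""


def _parse_record(lines):
    """Split one record into its molfile block and its data fields."""
    mol_lines, fields = [], {}
    field, values, in_mol = None, [], True
    for line in lines:
        if in_mol:
            mol_lines.append(line)
            if line.startswith("M  END"):
                in_mol = False
            continue
        if line.startswith(">"):
            # Data header such as "> <CAS_RN>": take the text between < and >.
            start, end = line.find("<"), line.rfind(">")
            field = line[start + 1:end] if 0 <= start < end else None
            values = []
        elif line.strip() == "":
            # A blank line terminates the current data item.
            if field is not None:
                fields[field] = "\n".join(values)
                field = None
        elif field is not None:
            values.append(line)
    if field is not None:  # data item not followed by a blank line
        fields[field] = "\n".join(values)
    return {"molblock": "\n".join(mol_lines), "fields": fields}


def parse_sdf(handle):
    """Yield one dict per record from an open SDF text handle."""
    lines = []
    for raw in handle:
        line = raw.rstrip("\n")
        if line.strip() == "$$$$":  # record delimiter
            yield _parse_record(lines)
            lines = []
        else:
            lines.append(line)
    if any(text.strip() for text in lines):  # tolerate a missing final delimiter
        yield _parse_record(lines)


if __name__ == "__main__":
    for record in parse_sdf(StringIO(SAMPLE)):
        print(record["fields"])
```

Because each supplier chooses its own field names and conventions, an aggregator would still need to map fields such as CAS_RN onto a common schema after parsing, which is one route by which the inconsistencies described above can arise.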
While the commercial databases offer curated data, there
is a price barrier to accessing the information. A number of
the free online resources are also manually curated and, as
will be discussed later, can offer data equal in quality to
that of their commercial counterparts; however, these free
resources are constructed with a specific focus in mind and
therefore commonly contain structures that number in the
low thousands rather than in the millions that are available
in the larger commercial online databases. While it is
impossible to be exhaustive within the confines of a review
article of this nature, an overview of a number of online
public compound databases, focusing specifically on free
access databases, is provided.
Data access – The difference between open and free access
The confusion surrounding the differences between open
access [9] and free access persists [10], but both
help the advancement of science by facilitating the sharing
of data, information and knowledge with no price or access
barriers. The first major international statement on open
access was the Budapest Open Access Initiative, in February
2002, which stated: "By 'open access' to this literature, we
mean its free availability on the public internet, permitting
any users to read, download, copy, distribute, print, search,
or link to the full texts of these articles, crawl them for
indexing, pass them as data to software, or use them for any
other lawful purpose, without financial, legal, or technical
barriers other than those inseparable from gaining access
to the internet itself. The only constraint on reproduction
and distribution, and the only role for copyright in this
domain, should be to give authors control over the integrity
of their work and the right to be properly acknowledged and
cited." [11]. Free access is not equivalent to open access,
but a simple definition has been suggested: "Free access
is access that removes price barriers, but not necessarily
any permission barriers." [12]. For the purpose of this
article, open data is also of interest; according to an online
resource, "Open data is a philosophy and practice requiring
that certain data are freely available to everyone, without
restrictions from copyright, patents or other mechanisms of
control." [13]. There is no commonly agreed upon definition
of open data, but as a result of public access advocates and
research groups in this area, attempts to address this issue
are underway [14•,15••,16-18].
The majority of scientists cannot, however, differentiate
between free and open access as both provide free
access to information of value to their research. In a
similar way, the majority of scientists do not care about
the distinctions between open and closed data; they
utilize free access public chemical compound databases
on an as-needed basis, derive value from the content and
move on. CAS [6] and their CAS Registry Numbers [19]
have played a dominant role in managing a curated registry
of chemical entities and related chemical and biological
literature. Their proprietary registration system does not
link to chemical structures in the public domain and, because
of this, their business model is considered to be at risk
[20••,21].
Data quality – The necessity for curation
Before reviewing examples of public compound databases,
the issues of data quality should be examined. All chemical
compound databases contain errors that arise for a number
of reasons, including errors in transcription, historical
errors (a compound was 'correct' when entered into the
database, but was later re-characterized and not updated)
and issues with graphical representation. The quality of
chemical information in the public domain is generally
quite low; this does not mean that the data are not of value,
but that consideration of the nature of the provider as an
authority is required. There is, of course, no central body
responsible for the quality of data in the public domain.
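Some transcription errors in identifiers can be caught mechanically. As one illustration, a CAS Registry Number ends in a check digit: the remaining digits, read from right to left, are multiplied by 1, 2, 3 and so on, and the sum modulo 10 must equal the final digit. The sketch below, in Python, applies that rule; the function name is an assumption made for this example. The check only shows that a number is well formed. It cannot show that the number is paired with the correct structure or name.

```python
import re

# CAS Registry Numbers have the form NNNNNNN-NN-C: two to seven digits,
# two digits, then a single check digit.
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas_rn(text):
    """Return True if text is a well-formed CAS RN with a correct check digit."""
    match = _CAS_PATTERN.match(text.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


if __name__ == "__main__":
    for candidate in ("50-78-2", "7732-18-5", "50-87-2"):
        print(candidate, is_valid_cas_rn(candidate))
```

In this example the third candidate is the first with two digits transposed, the kind of slip a check digit is designed to expose.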
Databases of chemical structure information such as
PubChem [22••], ChemIDPlus [23] and ChemFinder [24] are
commonly considered to be authorities in terms of reliable
information; however, these sources are also aggregators
of information and are at risk of perpetuating errors from
the original public data and depositions. Errors in
structure-identifier pairs are common [25] and inaccurate structure
representations, specifically with regard to stereochemistry,
proliferate across many databases. A definitive description of
the challenges regarding quality in public domain databases,
and the rigorous processes required to aggregate quality
data, has been provided by Richards et al [26••]. An
example of integrating data is provided by the construction
of the US Environmental Protection Agency's DSSTox
databases, for which chemical structures, chemical names
and CAS Registry Numbers for over 8000 chemicals from
numerous toxicity databases were assembled. These data
were carefully curated and validated using multiple public
information sources [27].
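One simple check that an aggregator might run before manual curation is to look for identifiers that arrive with more than one structure. The following Python sketch illustrates the idea under stated assumptions: the source names, the identifiers and the structure strings are hypothetical, and structures are compared as raw strings. It is not the procedure used for DSSTox or by any database named in this review. A realistic pipeline would first convert each structure to a canonical form, such as an InChI, so that different drawings of the same compound compare as equal.

```python
from collections import defaultdict

# Hypothetical depositions: (source, identifier, structure string).
# Two sources disagree on ID-0001: one records stereochemistry, one omits it.
RECORDS = [
    ("source_a", "ID-0001", "C[C@H](N)C(=O)O"),
    ("source_b", "ID-0001", "CC(N)C(=O)O"),
    ("source_c", "ID-0001", "C[C@H](N)C(=O)O"),
    ("source_a", "ID-0002", "CCO"),
    ("source_b", "ID-0002", "CCO"),
]


def find_conflicts(records):
    """Return identifiers that are paired with more than one structure.

    The result maps each conflicting identifier to a dict of
    structure string -> sorted list of sources that supplied it.
    """
    sources_by_pair = defaultdict(lambda: defaultdict(set))
    for source, identifier, structure in records:
        sources_by_pair[identifier][structure].add(source)
    return {
        identifier: {
            structure: sorted(sources)
            for structure, sources in by_structure.items()
        }
        for identifier, by_structure in sources_by_pair.items()
        if len(by_structure) > 1
    }


if __name__ == "__main__":
    for identifier, structures in find_conflicts(RECORDS).items():
        print(identifier)
        for structure, sources in structures.items():
            print("  ", structure, "from", ", ".join(sources))
```

A report of this kind only flags candidates for review. Deciding which structure is correct still requires the kind of curation against multiple sources described above.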
