Recent advances in chemoinformatics.
- PubMed: 17511441
Recent Advances in Chemoinformatics
Dimitris K. Agrafiotis,*,† Deepak Bandyopadhyay,† Jörg K. Wegner,‡ and Herman van Vlijmen‡
Johnson & Johnson Pharmaceutical Research & Development, L.L.C., 665 Stockton Drive, Exton,
Pennsylvania 19341, and Tibotec BVBA, Gen De Wittelaan L 11B 3, 2800 Mechelen, Belgium
Received February 12, 2007
Chemoinformatics is a large scientific discipline that deals with the storage, organization, management,
retrieval, analysis, dissemination, visualization, and use of chemical information. Chemoinformatics techniques
are used extensively in drug discovery and development. Although many consider it a mature field, the
advent of high-throughput experimental techniques and the need to analyze very large data sets have brought
new life and challenges to it. Here, we review a selection of papers published in 2006 that caught our
attention with regard to the novelty of the methodology that was presented. The field is seeing significant
growth, which will be further catalyzed by the widespread availability of public databases to support the
development and validation of new approaches.
INTRODUCTION
Chemoinformatics is a vast discipline, standing at the interface between chemistry, biology, and computer science. Despite being perceived by many as a mature field, it saw considerable growth in 2006. This growth is evidenced by the fact that significant advances are no longer confined to the pages of a few specialty publications but appear across a wide range of mainstream chemistry and general science journals such as JACS and PNAS.
In this review, we highlight a few papers that were published in 2006 and that we found intriguing for a variety of reasons. The review is not intended to be exhaustive or authoritative. It reflects strictly the views of the authors and their long-standing interest in new computational methodology. In order to manage the scope and length of the article, important studies describing primarily the application, validation, and comparison of various chemoinformatics techniques as well as incremental enhancements to established methodologies have not been included.
The remaining sections are organized in seven general areas: (1) advances in conformational analysis and pharmacophore development; (2) de novo and fragment-based design; (3) QSAR; (4) chemogenomics; (5) free energy and solvation; (6) geometric algorithms and combinatorial optimization; and (7) molecule mining. This is by no means an authoritative classification but rather an attempt to organize our thoughts into coherent themes and focus the readers’ attention on topics pertinent to their own interests.
CONFORMATIONAL ANALYSIS AND PHARMACOPHORE DEVELOPMENT
Conformational sampling is a problem of central importance in computer-aided drug design. Several modeling techniques depend critically on the diversity of conformations sampled during the search, including protein docking, pharmacophore modeling, 3D database searching, and 3D-QSAR, to name a few. Recent analyses of crystal structures of protein-ligand complexes have shown that bioactive conformations tend to be more extended than random ones¹,² and may lie several kcal/mol higher in energy than their respective global minima.³ There have been a number of comparative studies of conformational analysis tools, focusing primarily on the ability to identify the bioactive conformation. While this is certainly a desired goal, our knowledge of pharmacologically relevant conformational space is very limited, and the ability to identify the bioactive conformation can only be guaranteed if the search method casts a wide net over the potential energy surface. Reproducing known ligand geometries is insufficient because these represent an extremely limited and biased sampling of all bound ligand conformations. Indeed, most ligands have never been crystallized in their own targets, even fewer have been crystallized in important countertargets, and many protein classes have never been crystallized at all.
While diversity is sometimes a goal in its own right, as in many approaches to library design, thoroughness of conformational sampling is usually not an end in itself, nor is it the sovereign virtue for a conformational search method. For example, Omega (OMEGA 1.8.1, distributed by OpenEye Scientific Software (www.eyesopen.com)) is extremely fast, with sampling suitable for many applications. However, thorough sampling is an important means to many further ends, and any practicing computational chemist would want to know which methods sample the full ensemble of accessible conformations.
* Corresponding author phone: (610)458-6045; fax: (610)458-8249; e-mail: dagrafio@prdus.jnj.com.
† Johnson & Johnson Pharmaceutical Research & Development, L.L.C.
‡ Tibotec BVBA.
J. Chem. Inf. Model. 2007, 47, 1279-1293
10.1021/ci700059g CCC: $37.00 2007 American Chemical Society
Published on Web 05/19/2007
One study that is particularly indicative of the state of current conformational search techniques was recently published by Carta et al.⁴ It is well-known that many stochastic 3D modeling techniques are very sensitive to starting configurations and random number effects, and the results can differ when a calculation is repeated under slightly different initial conditions. The paper by Carta et al. demonstrates that this reproducibility problem plagues systematic methods as well. More specifically, it examined how different permutations of the connection table affected the conformations generated by Corina, Omega, Catalyst, and Rubicon. The authors used Daylight and in-house utilities to generate different (noncanonical) variants of SMILES⁵,⁶ (Simplified Molecular Input Line Entry System) and SD representations⁷ for 17 bioactive ligands, effectively changing the order of the atoms and bonds while keeping the topology intact. Each variant was subjected to conformational search, using the same set of parameters for conformer generation. The results were evaluated, among other ways, by looking at the distribution of rmsds to the crystallographically determined bioactive conformation. Indeed, it was shown that Omega and Rubicon produced very different distributions of rmsds for the canonical and noncanonical SMILES/SD variants, suggesting that the methods exhibit an intrinsic bias and are highly dependent on the atom and bond ordering. In contrast, Catalyst was found to be much less sensitive to permuted input. Principal component visualization of the conformational ensembles generated by each permuted input further revealed that the canonical and permuted SMILES sampled distinct regions of conformational space, and, in at least one case, the conformations generated by the permuted variants were much closer to the bioactive conformation.
Based on these findings, the authors recommend the use of multiple permuted inputs in order to improve the performance of methods such as Omega and Rubicon. Although this approach treats the symptom rather than the cause, it only requires a way to generate permuted connection tables and, therefore, can be used with any conformational search program to circumvent its intrinsic bias and enhance its sampling capacity.
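As a rough illustration of what "permuting the connection table" means, the sketch below reorders the atoms and bonds of a small molecule while leaving its topology unchanged. It is not the Daylight or in-house tooling used by Carta et al.; the function names, the tuple layout of the bond list, and the acetic acid example are all assumptions made for this example. The signature function is only a cheap order-independent sanity check, not a full graph isomorphism test.

```python
import random


def permute_connection_table(atoms, bonds, seed=None):
    """Return a reordered copy of a connection table with the same topology.

    atoms: list of element symbols, indexed by atom number.
    bonds: list of (i, j, order) tuples referring to atom indices.
    (This layout is a hypothetical one chosen for the sketch.)
    """
    rng = random.Random(seed)
    new_index = list(range(len(atoms)))  # new_index[old] -> new position
    rng.shuffle(new_index)

    new_atoms = [None] * len(atoms)
    for old, new in enumerate(new_index):
        new_atoms[new] = atoms[old]

    new_bonds = []
    for i, j, order in bonds:
        a, b = new_index[i], new_index[j]
        if rng.random() < 0.5:  # also vary which endpoint is listed first
            a, b = b, a
        new_bonds.append((a, b, order))
    rng.shuffle(new_bonds)  # and the order in which bonds are listed
    return new_atoms, new_bonds


def topology_signature(atoms, bonds):
    """Order-independent summary of the bonds (sanity check only)."""
    degree = [0] * len(atoms)
    for i, j, _ in bonds:
        degree[i] += 1
        degree[j] += 1
    return sorted(
        (tuple(sorted([(atoms[i], degree[i]), (atoms[j], degree[j])])), order)
        for i, j, order in bonds
    )


# Heavy atoms of acetic acid: CH3-C(=O)-OH
ATOMS = ["C", "C", "O", "O"]
BONDS = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]

variants = [permute_connection_table(ATOMS, BONDS, seed=k) for k in range(5)]
for atoms, bonds in variants:
    assert topology_signature(atoms, bonds) == topology_signature(ATOMS, BONDS)
```

Each variant would then be written out and passed to the conformer generator with identical settings, so that any difference in the resulting ensembles can be attributed to atom and bond ordering alone.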
Another approach aimed at expanding the range of geometries sampled during conformational search was presented by Izrailev et al.⁸ The method is based on a self-organizing algorithm known as stochastic proximity embedding (SPE)⁹ for producing coordinates in a low-dimensional space that best preserve a set of distance constraints. This algorithm was subsequently extended to the problem of conformational sampling using a distance geometry formalism.¹⁰ SPE generates conformations that satisfy a set of interatomic distance constraints derived from the molecule’s connection table and defined in the form of lower and upper bounds {l_ij} and {u_ij}. While the method was originally shown to provide a good sampling of conformational space, it was observed that “extreme” conformations located near the periphery of conformational space were not as likely to be visited, and, therefore, important conformations could be missed.
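To make the idea of embedding under distance bounds concrete, here is a minimal sketch of a pairwise refinement scheme in the spirit of SPE. The text above only states that coordinates are produced so as to satisfy lower and upper bounds; the specific update rule (pick a random pair, and if its distance violates a bound, move both atoms along their connecting line toward the nearest bound), the linearly decaying learning rate, and all names and parameter values are assumptions of this example rather than details taken from the cited papers.

```python
import math
import random


def spe_embed(n_atoms, lower, upper, n_steps=20000, seed=0, dim=3):
    """Find coordinates whose pair distances fall within [lower, upper].

    lower, upper: dicts mapping (i, j) atom-index pairs to distance bounds.
    """
    rng = random.Random(seed)
    coords = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_atoms)]
    pairs = list(lower)
    for step in range(n_steps):
        lam = 1.0 - 0.99 * step / n_steps  # learning rate decays from 1.0 to ~0.01
        i, j = rng.choice(pairs)
        diff = [a - b for a, b in zip(coords[i], coords[j])]
        d = math.sqrt(sum(x * x for x in diff))
        lo, hi = lower[(i, j)], upper[(i, j)]
        if lo <= d <= hi:
            continue  # constraint already satisfied, leave the pair alone
        target = lo if d < lo else hi  # nearest violated bound
        scale = 0.5 * lam * (target - d) / (d + 1e-9)
        for k in range(dim):
            coords[i][k] += scale * diff[k]
            coords[j][k] -= scale * diff[k]
    return coords


def max_violation(coords, lower, upper):
    """Largest amount by which any pair distance falls outside its bounds."""
    worst = 0.0
    for (i, j), lo in lower.items():
        d = math.dist(coords[i], coords[j])
        worst = max(worst, lo - d, d - upper[(i, j)])
    return worst


# Toy system: three atoms whose pairwise distances must lie in [1.0, 1.2].
LOWER = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
UPPER = {(0, 1): 1.2, (0, 2): 1.2, (1, 2): 1.2}
conformation = spe_embed(3, LOWER, UPPER)
print("largest bound violation:", max_violation(conformation, LOWER, UPPER))
```

With a learning rate of 1 a single update places the selected pair exactly on the violated bound; lowering the rate over time damps the disturbance each correction causes to neighboring pairs.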
To alleviate this problem, the authors introduced a boosting heuristic that can be used in conjunction with SPE or any other distance geometry algorithm to bias the search toward more extended or more compact geometries. The method generates increasingly extended (or compact) conformations through a series of embeddings, each seeded on the result of the previous one. In the first iteration, a normal SPE embedding is performed, generating a chemically sensible conformation c_1. The lower bounds of all atom pairs {l_ij} are then replaced by the actual interatomic distances {d_ij} in conformation c_1 and used along with the unchanged upper bounds {u_ij} to perform a second embedding to generate another conformation, c_2. This process is repeated for a prescribed number of iterations. The lower bounds are then restored to their original default values, and a new sequence of embeddings is performed using a different random number seed. Because the lower bounds in any iteration are always equal to or greater than those in the previous iterations, successively more extended conformations are generated. This process will never yield a set of distance constraints that are impossible to satisfy, because there exists at least one conformation (i.e., the one generated in the preceding iteration) that satisfies them. An analogous procedure can be used to generate increasingly compact conformations.
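The boosting loop can be sketched in a few lines around any embedder. The version below is a sketch under stated assumptions, not the authors' implementation: `embed` is a toy pairwise-refinement stand-in for SPE, each embedding is assumed to start from fresh random coordinates (so that "seeding" happens only through the updated bounds), the `min`/`max` clamps are guards against numerical noise that have no effect when an embedding satisfies its bounds exactly, and the butane-like bounds in Å are illustrative numbers only.

```python
import math
import random


def embed(n_atoms, lower, upper, rng, n_steps=20000):
    """Toy stand-in embedder: pairwise refinement from a random start."""
    coords = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_atoms)]
    pairs = list(lower)
    for step in range(n_steps):
        lam = 1.0 - 0.99 * step / n_steps
        i, j = rng.choice(pairs)
        d = math.dist(coords[i], coords[j])
        lo, hi = lower[(i, j)], upper[(i, j)]
        if lo <= d <= hi:
            continue
        target = lo if d < lo else hi
        scale = 0.5 * lam * (target - d) / (d + 1e-9)
        for k in range(3):
            delta = scale * (coords[i][k] - coords[j][k])
            coords[i][k] += delta
            coords[j][k] -= delta
    return coords


def boost_extended(n_atoms, lower, upper, n_iter=4, n_sequences=2, seed=0):
    """Run sequences of embeddings with progressively raised lower bounds.

    Returns, per sequence, a list of (coords, lower_bounds_used) pairs.
    """
    results = []
    for s in range(n_sequences):
        rng = random.Random(seed + s)  # a different seed for each sequence
        work_lower = dict(lower)       # restore the original lower bounds
        sequence = []
        for _ in range(n_iter):
            coords = embed(n_atoms, work_lower, upper, rng)
            sequence.append((coords, dict(work_lower)))
            # Replace each lower bound by the distance just realized.
            for i, j in work_lower:
                d = math.dist(coords[i], coords[j])
                work_lower[(i, j)] = min(max(work_lower[(i, j)], d), upper[(i, j)])
        results.append(sequence)
    return results


# Illustrative bounds (in Å) for a four-atom, butane-like chain.
LOWER = {(0, 1): 1.50, (1, 2): 1.50, (2, 3): 1.50,
         (0, 2): 2.45, (1, 3): 2.45, (0, 3): 2.60}
UPPER = {(0, 1): 1.54, (1, 2): 1.54, (2, 3): 1.54,
         (0, 2): 2.55, (1, 3): 2.55, (0, 3): 3.90}

for sequence in boost_extended(4, LOWER, UPPER):
    print([round(math.dist(c[0], c[3]), 2) for c, _ in sequence])
```

The printed values are the end-to-end (atom 0 to atom 3) distances along each sequence. For the compact variant, the same loop would lower the upper bounds to the realized distances instead of raising the lower bounds.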
Conformational boosting was subsequently validated against seven widely used conformational sampling techniques implemented in the Rubicon, Catalyst, MacroModel, Omega, and MOE software packages and was found, along with Catalyst, to be significantly more effective in sampling the full range of geometric sizes attainable by any given molecule compared to the other methods, which showed distinct preferences for either more extended or more compact geometries.¹¹,¹² Since bioactive conformations tend to be extended and often fall outside the range sampled by an unbiased search, this heuristic significantly improves the chances of finding such conformations.
One important technique that benefits greatly from proper sampling of conformational space is pharmacophore modeling. A pharmacophore is the spatial arrangement of steric and electronic features that are necessary to confer the optimal interaction with a particular biomolecular target and to trigger (or block) its biological response. Ligand-based drug design methods that attempt to identify a pharmacophore from a set of active compounds have been known to fail when some of the compounds have different binding modes from the rest. Current approaches for pharmacophore identification typically use manual curation or consensus to remove actives that are presumed to bind with different binding modes from the majority that share a common mode. PharmID¹³ is a new algorithm for pharmacophore detection that overcomes these problems by a statistical sampling approach, Gibbs Sampling,¹⁴ that picks the most likely binding conformations and key binding features simultaneously and iteratively. PharmID’s breakthrough lies in transforming the complex problem of matching N molecules with up to M conformations each into a simpler one of comparing each conformation of each molecule against a model of the active conformation and its key features. The method derives the probability that each feature is important and that each conformation is the active conformation, starting with no knowledge of important features or binding conformations. Each one of these probabilities iteratively determines the other one, and thus PharmID quickly converges to the correct answer for a large set of examples.
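The alternating estimation described above can be illustrated with a generic Gibbs-style sampler. This is a loose sketch and not PharmID itself: the scoring function (a small pseudocount plus the number of feature matches with the conformations currently selected for the other molecules), the data layout, and every name below are assumptions made for the example. Features are represented as hashable tuples, and each conformation is a set of such features.

```python
import random
from collections import Counter


def gibbs_select(molecules, n_sweeps=300, pseudo=0.1, seed=0):
    """Estimate conformation and feature probabilities by Gibbs-style sampling.

    molecules: list of molecules; each molecule is a list of conformations;
               each conformation is a frozenset of hashable features.
    """
    rng = random.Random(seed)
    chosen = [rng.randrange(len(confs)) for confs in molecules]  # random start
    conf_counts = [[0] * len(confs) for confs in molecules]
    feature_counts = Counter()

    for _ in range(n_sweeps):
        for m, confs in enumerate(molecules):
            # Feature model built from the other molecules' current choices.
            model = Counter()
            for k, other in enumerate(molecules):
                if k != m:
                    model.update(other[chosen[k]])
            # Score every conformation of molecule m against that model,
            # then resample its selected conformation in proportion.
            weights = [pseudo + sum(model[f] for f in conf) for conf in confs]
            chosen[m] = rng.choices(range(len(confs)), weights=weights)[0]
        for m, confs in enumerate(molecules):
            conf_counts[m][chosen[m]] += 1
            feature_counts.update(confs[chosen[m]])

    conf_prob = [[c / n_sweeps for c in row] for row in conf_counts]
    total = n_sweeps * len(molecules)
    feature_prob = {f: c / total for f, c in feature_counts.items()}
    return conf_prob, feature_prob


# Toy data: one conformation per molecule carries the shared features.
SHARED = frozenset({("donor", "acceptor", 3), ("acceptor", "ring", 5),
                    ("donor", "ring", 4)})
SHARED_INDEX = [0, 1, 2, 0]
MOLECULES = []
for m, shared_at in enumerate(SHARED_INDEX):
    confs = [frozenset({("decoy", m, c)}) for c in range(3)]
    confs[shared_at] = SHARED
    MOLECULES.append(confs)

conf_prob, feature_prob = gibbs_select(MOLECULES)
for m, row in enumerate(conf_prob):
    print(m, [round(p, 2) for p in row])
```

In this toy setup the conformations that share features reinforce one another: once a few molecules have selected them, the model built from those choices makes the matching conformation of every other molecule far more likely to be picked.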
The algorithm begins with a set of distinct conformations
for each molecule in the alignment, on which pharmacophore
groups are defined using SMARTS. Pairs or triples of
pharmacophore group types and their binned distances are
defined as unique features. For each conformation of each