The Blue Obelisk-interoperability in chemical informatics.
Journal of Chemical Information and Modeling (2006)
- PubMed: 16711717
The Blue Obelisk - Interoperability in Chemical Informatics
Rajarshi Guha,† Michael T. Howard,‡ Geoffrey R. Hutchison,§ Peter Murray-Rust,|
Henry Rzepa,⊥ Christoph Steinbeck,*,# Jörg Wegner,∇ and Egon L. Willighagen O
Pennsylvania State University, University Park, Pennsylvania 16804-3000, Jmol Project, U.S.A.,
Cornell University, Ithaca, New York 14853, Cambridge University, Cambridge CB2 1TN, Great Britain,
Imperial College, London SW7 2AZ, Great Britain, Cologne University Bioinformatics Center (CUBIC),
Zülpicher Str. 47, D-50674 Köln, Germany, University of Tübingen, Tübingen, Germany, and
Jmol project, The Netherlands
Received September 12, 2005
The Blue Obelisk Movement (http://www.blueobelisk.org/) is the name used by a diverse Internet group
promoting reusable chemistry via open source software development, consistent and complementary
chemoinformatics research, open data, and open standards. We outline recent examples of cooperation in
the Blue Obelisk group: a shared dictionary of chemoinformatics algorithms and their implementations,
drawing from our various software projects; a shared repository of chemoinformatics data including
elemental properties, atomic radii, isotopes, atom typing rules, and so forth; and Web services for the
platform-independent use of chemoinformatics programs.
1. INTRODUCTION
While the past 20 or 30 years of development in
chemoinformatics have created a plethora of published software
systems and algorithms for solving chemical problems, little
effort has been spent in providing the community with open
components and data, to be reused and improved by
communal efforts. Bioinformatics, with its much younger
history, adopted the principles taught by success stories of
the open source movement in general, and Linux in
particular, from the very beginning. Recent years, however,
have seen the emergence of open tools and databases also
in chemical informatics.[1-4] These draw on the existing ideas
of independent peer review and scientific collaboration,
mixed with “open source” software development paradigms.
Community involvement, including assessments, suggestions,
critiques, and rapid evolution, is a core component of these
efforts. The benefits of open source software have been
discussed in great detail by Eric Raymond in his seminal
work The Cathedral and the Bazaar and subsequent works.[5]
The Open Source Initiative (OSI) summarizes: “Open source
promotes software reliability and quality by supporting
independent peer review and rapid evolution of source code.
To be OSI certified, the software must be distributed under
a license that guarantees the right to read, redistribute,
modify, and use the software freely.”[6]
In the beginning, most scientific software was free. It was
so difficult to port that scientists did not bother about
licenses; one was delighted if someone else could get it
working on another machine. But the 1980s saw the value
of chemical informatics and the need to “productize” it. Much
of this was meritorious, as it brought informatics into the
classroom and the research lab and helped pay for some
chemistry research, but it also had hidden costs, which we
are now facing. In particular, these costs include
non-interoperability and centralized control of informatics.
Now, several open chemistry and chemoinformatics projects
(Table 1) have pooled forces to enhance interoperability
between these tools in a movement we call “The Blue
Obelisk” (BO). The name originates from an informal
meeting place in San Diego, California, during the American
Chemical Society 2005 Spring National Meeting (see Figure
1) and was coined by one of the authors. Because contributors
to the component projects live around the world, few
had met in person, instead collaborating and meeting via
the Internet.
We identify three core areas for the Blue Obelisk Movement:
• Open Source. One can use other people’s code without
further permission, including changing it for one’s own use
and distributing it again.
• Open Standards. One can find visible community mechanisms
for protocols and communicating information. The
mechanisms for creating and maintaining these standards
cover a wide spectrum of human organizations, including
various degrees of consent. We have been heavily influenced
by the mantra of the Internet Engineering Task Force: “rough
consensus and running code”.
• Open Data. One can obtain all data in the public domain
when wanted and reuse it for whatever purpose. This is an
underused term, which we are resurrecting. It is independent
of “open access” and has relevance to “closed access” as
well.
As outlined above, these areas are independent of the
concept of “open access” to read publications freely. Instead,
* Corresponding author phone: +49 (0)221 470-7426; fax: +49 (0)221 470-7786; e-mail: c.steinbeck@uni-koeln.de.
† Pennsylvania State University.
‡ Jmol project, http://www.jmol.org.
§ Cornell University.
| Cambridge University.
⊥ Imperial College.
# Cologne University Bioinformatics Center.
∇ University of Tübingen.
O Jmol project, http://www.jmol.org.
J. Chem. Inf. Model. 2006, 46, 991-998
10.1021/ci050400b CCC: $33.50 © 2006 American Chemical Society
Published on Web 02/22/2006
the three points focus on access to the scientific data,
algorithms, and implementations themselves, rather than the
formatted manuscript. In particular, we believe that these
concepts strongly continue the spirit of communal peer
review and reproducibility at the heart of modern scientific
research.
It is well-known in software development that 80% of the
costs are caused by maintaining software and not by the
initial implementation.[7] This holds both for in-house
development in pharmaceutical companies and for development
by commercial chemoinformatics suppliers. Besides
judging software by its standardized functional quality, it
can also be compared on the basis of its long-term stability
and interoperability. Openly standardized algorithms and
chemical information can help to reduce the maintenance
costs, because developers can reuse available modules or test
their tools against open source software and open data. This
reduces the risk for both the “buy” and “build” strategies
for software implementation. We agree with DeLano[8] that
the try-before-buy paradigm for open source software does
not necessarily require open standards. Open specifications
for standard algorithms such as kekulization,[9] chirality
coding,[10] and atom typing,[11] however, are indispensable in
academic chemoinformatics research to build better, more
stable, and more reproducible chemical information systems.
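
To make the idea of an open specification concrete, here is a minimal Python sketch in which atom typing rules live in a plain data table that any program could read and test against. The rule table, the type labels, and the function name `assign_atom_types` are hypothetical and chosen only for illustration; they do not reproduce the atom typing scheme of any Blue Obelisk project or of the cited literature, and the toy model ignores bond orders and charges.

```python
# Hypothetical open rule table: (element, number of bonded neighbours) -> label.
# The labels are illustrative only and do not reproduce any published scheme.
ATOM_TYPE_RULES = {
    ("C", 4): "C.sp3",
    ("C", 3): "C.sp2",
    ("C", 2): "C.sp",
    ("O", 2): "O.sp3",
    ("O", 1): "O.sp2",
    ("H", 1): "H",
}


def assign_atom_types(elements, bonds, rules=ATOM_TYPE_RULES):
    """Return one type label per atom, or "unknown" when no rule matches.

    elements: list of element symbols, one per atom.
    bonds: list of (i, j) atom index pairs; bond order is ignored here.
    """
    neighbour_count = [0] * len(elements)
    for i, j in bonds:
        neighbour_count[i] += 1
        neighbour_count[j] += 1
    return [
        rules.get((element, count), "unknown")
        for element, count in zip(elements, neighbour_count)
    ]


# Formaldehyde, H2C=O: atom 0 is C, atom 1 is O, atoms 2 and 3 are H.
formaldehyde_elements = ["C", "O", "H", "H"]
formaldehyde_bonds = [(0, 1), (0, 2), (0, 3)]
print(assign_atom_types(formaldehyde_elements, formaldehyde_bonds))
```

Because the rules are data rather than code, two independent programs could load the same table and be compared atom by atom on a shared set of test molecules, which is the kind of cross-checking that open specifications make possible.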
In this contribution, we outline several examples of how
the Blue Obelisk projects address this need: a shared
dictionary of chemoinformatics algorithms and their
implementations, drawing from our various software
projects; a shared repository of chemoinformatics data,
including elemental properties, atomic radii, isotopes, and
atom typing rules; a set of Web-based chemoinformatics services;
and the process of providing open algorithms and data. All
of these projects were developed with continual community
involvement and an open standardization process, and they provide
open data to key chemoinformatics processes. Anyone can
take part; we welcome those in commercial organizations,
academia, government, and so forth, and contributions come
as code, compilations of data and molecules, testing, and
more.
2. THE IMPORTANCE OF OPEN SPECIFICATIONS FOR
ALGORITHMS AND DATA
The World Wide Web as it is used today is a collection
of linked HTML pages and other data formats. Whenever
there is chemical or other scientific knowledge or data
published via this mechanism, it is often difficult or
impossible to discover, because it lacks the semantics that
would help machines (the only practical way to harvest
information “from the Internet”) to identify and classify it.
Recognizing this lack, Tim Berners-Lee introduced the
concept he termed the “Semantic Web”. The Semantic Web
is a mesh of information linked up in such a way as to be
easily processable by machines, on a global scale. One can
think of it as being an efficient way of representing data on
the World Wide Web, or as a globally linked database. The
analogue of the Semantic Web for the currently heavily
researched idea of global networks of computational
resources, so-called Grids, is the Semantic Grid.
A Semantic Web, and even more a Semantic Grid, is
predicated on the supply of information and services without
requiring the user to know the details of how the resource
was obtained. The “users”, who may be humans or robots,
request precise services but should be unconcerned with exactly
how or where they originate. For example, the calculation
of a molecular property might depend on a precise method
but should not, in principle, depend on the actual program
used, its version, the operating system, and the machine
involved.
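
One way to read this requirement is that a request should name the method explicitly, so that any conforming program returns the same value. The Python sketch below illustrates this for a mass calculation from a molecular formula. The method names `"average"` and `"monoisotopic"`, the function names, and the simple formula parser (no parentheses, charges, or isotope labels) are assumptions made for this example and are not taken from any Blue Obelisk dictionary or service.

```python
import re

# Abridged standard atomic weights and monoisotopic masses, in unified
# atomic mass units, for the few elements used in this sketch.
AVERAGE_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}
MONOISOTOPIC_MASS = {"H": 1.007825, "C": 12.000000, "N": 14.003074, "O": 15.994915}

# Hypothetical method identifiers; a shared dictionary would define real ones.
METHODS = {"average": AVERAGE_MASS, "monoisotopic": MONOISOTOPIC_MASS}


def parse_formula(formula):
    """Count atoms in a simple formula such as "C2H6O" (no parentheses)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molecular_mass(formula, method):
    """Sum the per-element masses defined by the named method."""
    table = METHODS[method]  # raises KeyError for an unspecified method
    return sum(table[symbol] * n for symbol, n in parse_formula(formula).items())


# The same formula gives different numbers depending on the method requested.
for method in ("average", "monoisotopic"):
    print(method, round(molecular_mass("C2H6O", method), 4))
```

The caller states which method is wanted and receives a value defined by that method; which program, version, or machine performed the sum is irrelevant to the result.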
We note that many chemical calculations are described in
an imprecise manner. For example, “molecular weight” is
Table 1. Current Blue Obelisk Projects
project                             URL                          principal authors
CML, JUMBO[12]                      http://cml.sf.net/           P.M.-R., H.R.
JChemPaint[13]                      http://jchempaint.sf.net/    C.S., E.L.W.
Jmol                                http://jmol.sf.net/          M.T.H., E.L.W.
NMRShiftDB[3]                       http://www.nmrshiftdb.org/   C.S.
JOElib                              http://joelib.sf.net/        J.W.
Kalzium                             http://edu.kde.org/kalzium/  Carsten Niehaus
Octet                               http://octet.sf.net/         Rich Apodaca
Open Babel                          http://openbabel.sf.net/     G.R.H.
QSAR                                http://qsar.sf.net/          E.L.W., R.G., C.S., J.W.
The Chemistry Development Kit[1]    http://cdk.sf.net/           E.L.W., C.S.
WWMM                                http://wwmm.sf.net/          P.M.-R.
Figure 1. Where it all began. The Blue Obelisk in San Diego,
California, at the 2005 American Chemical Society meeting.
992 J. Chem. Inf. Model., Vol. 46, No. 3, 2006 GUHA ET AL.



