A semantic GRID for molecular science
Available from www.dspace.cam.ac.uk
Peter Murray-Rust a, Robert C Glen a, Henry S Rzepa b, James J P Stewart c, Joe
A Townsend a, Egon L Willighagen d, Yong Zhang a
a Unilever Centre for Molecular Informatics, Chemistry Department, University of
Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK, b Chemistry Department,
Imperial College, London, SW7 2AY, UK, c Stewart Computational Chemistry, 15210
Paddington Circle, Colorado Springs CO 80921-2512 US, d Laboratory of Analytical
Chemistry, Toernooiveld 1, 6525 ED Nijmegen, NL
Abstract
The properties of molecules have very well defined semantics and allow the creation of a
semantic GRID. Markup languages (CML - Chemical Markup Language) and dictionary-
based ontologies have been designed to support a wide range of applications, including
chemical supply, publication and the safety of compounds. Many properties can be computed
by Quantum Mechanical (QM) programs and we have developed a "black-box" system based
on XML wrappers for all components. This is installed on a Condor system on which we have
computed properties for 250,000 compounds. The results of this will be available in an
OpenData/OpenSource peer-to-peer (P2P) system (WorldWide Molecular Matrix - WWMM).
Introduction
Over 30 million chemical compounds are known, many of importance in healthcare,
biosciences and new products. It is of fundamental importance to know their properties,
including implications for safety. The UK's Royal Commission on Environmental Pollution
(http://www.rcep.org.uk) has recently emphasised the importance of having this information
and stressed the very low percentage of compounds for which adequate data are available.
It is common to attempt to model the biological and other safety properties of molecules from
known physical properties. Sometimes these have been measured but in most cases they must
be predicted. Many properties can, in principle, be calculated by solving Schroedinger's
equation, although this has often been prohibitively expensive. In particular, calculations scale
badly, often O(N^3) to O(N^6), where N is the number of atoms or electrons. However, recent
advances include:
• O(N) scaling (usually through localisation of parts of the molecule in the algorithm).
• Farm-like availability of unused compute cycles on non-specialist machines in
heterogeneous environments.
• Semi-empirical parameterised methods (QM Hamiltonians such as MOPAC's PM5
are calibrated against experimental properties).
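To make the scaling argument concrete, the following minimal sketch compares how the cost of a calculation grows with system size under different scaling laws. It assumes only that cost is proportional to N raised to a fixed exponent; the function name `relative_cost` and the example sizes are illustrative and do not come from any QM program.

```python
def relative_cost(n_small: int, n_large: int, exponent: int) -> float:
    """Return the factor by which cost grows when system size increases
    from n_small to n_large, assuming cost is proportional to N**exponent."""
    return (n_large / n_small) ** exponent


if __name__ == "__main__":
    # Compare linear, cubic and sixth-power scaling for a tenfold larger system.
    for exponent in (1, 3, 6):
        factor = relative_cost(10, 100, exponent)
        print(f"O(N^{exponent}): cost grows by a factor of {factor:,.0f}")
```

Under this assumption, going from 10 to 100 atoms multiplies the cost by 10 for O(N) scaling, by 1,000 for O(N^3) and by 1,000,000 for O(N^6), which is why linear-scaling methods matter for large collections of molecules.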
Raw computer power is often not the major challenge. We report below the automatic
computation of properties of 250,000 molecules but stress that the analysis and dissemination of
results is much more problematic. The data and codes are extremely heterogeneous, with no
common infrastructure, and are often based on column-based FORTRAN-like input. Novices
must study at the feet of experts until they master it. Anecdotally we estimate that input errors
render many millions of jobs wasted globally per year. The programs have virtually no
interoperability, and numeric output is often poorly defined without explicit scientific units.
Metadata (when, where, why, what, how) are universally absent.
Paradoxically this is one of the best test beds for developing a semantic GRID! The
underlying (implicit) semantics are extremely stable: the molecule-property relationship (Fig.
1) dates from the 19th century.
Fig. 1. A Molecule has many properties (perhaps with repeat measurements) defined by their
types (ptype) and citations (cit).
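The relationship in Fig. 1 can be sketched as a small data structure. This is a minimal illustration, not the CML schema: the class names, the `value` and `units` fields and the helper `values_of` are assumptions introduced here, and the numbers in the usage example are placeholders rather than measured data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Property:
    ptype: str    # property type, e.g. a term from a dictionary
    value: float  # numeric value (assumed scalar for this sketch)
    units: str    # explicit scientific units
    cit: str      # citation or other provenance for this value


@dataclass
class Molecule:
    name: str
    properties: List[Property] = field(default_factory=list)

    def values_of(self, ptype: str) -> List[Property]:
        """Return every recorded value of one property type,
        including repeat measurements from different citations."""
        return [p for p in self.properties if p.ptype == ptype]


if __name__ == "__main__":
    # Placeholder values only; they are not real measurements.
    mol = Molecule("example molecule")
    mol.properties.append(Property("example_property", 1.0, "unit", "citation A"))
    mol.properties.append(Property("example_property", 1.1, "unit", "citation B"))
    mol.properties.append(Property("other_property", 5.0, "unit", "citation A"))
    for p in mol.values_of("example_property"):
        print(p.value, p.units, p.cit)
```

Keeping the units and the citation on every value means that repeat measurements of the same property type remain distinguishable.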
The codes themselves are very reliable with excellent implicit semantics. In most cases
algorithms (if not always source code) are fully described and agreed. Moreover there is now
great demand for computational chemistry for many domains outside chemistry, such as
materials, safety, biosciences, earth sciences and nanotechnology. These "customers"
increasingly want a "black-box" approach that provides "useable" results on demand.
This challenges much current practice where users are expected to understand the physics of
the calculations, and the many pitfalls. Often there has to be an "expert on tap". While the
quality of input and the interpretation of output remain suspect, this is still essential, but we
believe that the process can be increasingly "semantically wrapped". A set of rules can decide
which molecules are unsuitable for calculations, and what level of accuracy can be expected
or afforded for the others. For example, a small rigid molecule containing light elements will
often give excellent results on a routine basis, while large floppy molecules, and those with metals
or unpaired electrons, are immediately filtered out.
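A rule set of this kind might be sketched as follows. This is a hedged illustration rather than the filter used in the work described: the thresholds `MAX_ATOMS` and `MAX_ROTATABLE_BONDS`, the deliberately incomplete `METALS` set and the `MoleculeSummary` fields are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Assumed, deliberately incomplete set of metal symbols (illustration only).
METALS = frozenset({"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn", "Pd", "Pt"})
MAX_ATOMS = 50           # hypothetical threshold for a "large" molecule
MAX_ROTATABLE_BONDS = 5  # hypothetical threshold for a "floppy" molecule


@dataclass(frozen=True)
class MoleculeSummary:
    elements: FrozenSet[str]   # element symbols present in the molecule
    n_atoms: int
    n_rotatable_bonds: int
    n_unpaired_electrons: int


def rejection_reasons(mol: MoleculeSummary) -> List[str]:
    """Return the reasons a molecule is filtered out; an empty list
    means it passes every rule in this sketch."""
    reasons = []
    if mol.elements & METALS:
        reasons.append("contains metal")
    if mol.n_unpaired_electrons > 0:
        reasons.append("unpaired electrons")
    if mol.n_atoms > MAX_ATOMS and mol.n_rotatable_bonds > MAX_ROTATABLE_BONDS:
        reasons.append("large and floppy")
    return reasons


if __name__ == "__main__":
    small_rigid = MoleculeSummary(frozenset({"C", "H"}), 12, 0, 0)
    with_metal = MoleculeSummary(frozenset({"Fe", "C", "H"}), 21, 0, 0)
    print(rejection_reasons(small_rigid))  # passes: no reasons reported
    print(rejection_reasons(with_metal))   # rejected: contains metal
```

Returning the reasons instead of a bare yes/no lets the decision be recorded alongside the job, so a customer can see why a molecule was excluded.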
Most importantly we provide metadata for each job. This allows the customer to make their
own decision as to whether the results are "fit for purpose". Computer-based tools can also
analyse the results both for known problems and to discover types of molecules that show
pathological behaviour. The traditional code manuals can evolve to a rule-set taking decisions
or advising users on options.
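As an illustration of per-job metadata, the sketch below records the "when, where, why, what, how" of a job and serialises it to XML with Python's standard library. The element names (`jobMetadata` and its children) are hypothetical and are not CML; the field values in the usage example are placeholders.

```python
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass


@dataclass
class JobMetadata:
    when: str   # time the job was run
    where: str  # machine or pool that ran it
    why: str    # purpose of the calculation
    what: str   # molecule or input identifier
    how: str    # program and method used


def to_xml(meta: JobMetadata) -> str:
    """Serialise the metadata as a flat XML element.
    Element names are hypothetical and chosen for this sketch only."""
    root = ET.Element("jobMetadata")
    for key, value in asdict(meta).items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    meta = JobMetadata(
        when="placeholder timestamp",
        where="placeholder Condor node",
        why="placeholder purpose",
        what="placeholder molecule identifier",
        how="MOPAC, PM5 Hamiltonian",
    )
    print(to_xml(meta))
```

A record like this travels with the numeric results, so that a later reader or program can judge whether a result is fit for purpose.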
The Chemical Semantic GRID
In our earlier vision of the Chemical Semantic Web [3] we foresaw scientists asking for
chemical information on demand, often without knowing the details of the science involved.
This slightly heretical approach is driven by the pace and heterogeneity of multidisciplinary
science, exemplified well in bioinformatics. A scientist must retrieve data from many
domains and integrate them without the help of human experts. A key factor in the Semantic
Web is transferring expertise into computer representation ("ontologies").
Here we extend this to a Semantic GRID with high-throughput computing on demand. In
molecular science this is challenging as it does not map easily onto current informatics
practice. There is no equivalent to the publicly funded international bioinformatics institutes
that provide Open Data (see below). Most published molecular data is published piecewise in


