Why Grid-based Data Mining Matters? Fighting Natural Disasters on the Grid: From SARS to Land Slides
A.K.T.P. Au, V. Curcin, M. M. Ghanem, N. Giannadakis, Y. Guo, M. A. Jafri, M. Osmond,
A. Oleynikov, A.S. Rowe, J. Syed, P. Wendel and Y. Zhang
Department of Computing, Imperial College London,
180 Queens Gate, London, SW7 2AZ
{aktp, vc100, mmg, ng300, yg, jafri, mo197, aio00, asr99, jas5, pjw4, yzhan}@doc.ic.ac.uk
Abstract
The Discovery Net UK e-Science project has built a framework and infrastructure for knowledge
discovery services over data collected from high-throughput sensors. In this paper we provide an
overview of the Discovery Net approach and highlight some of the scientific applications
constructed by end-user scientists using the Discovery Net system. These applications include
genome annotation, the analysis of SARS evolution patterns, monitoring air pollution data and the
analysis of earthquake and land slide satellite images.
1. The challenge of discovering new knowledge
By their simplest definition, e-Science platforms
are Internet-enabled working environments
that allow distributed scientists to form a virtual
organization in which they can share data and
computing resources and collaborate on the
analysis of the data to derive new knowledge.
The vision of e-Science platforms, which are
common in the UK and Europe, is closely
related to the vision of computational grids in
the US. However, current research into
fundamental Grid technologies, such as Globus
[1], has concentrated mainly on the provision of
protocols, services and tools for creating
co-ordinated, transparent and secure globally
accessible computational systems. These
technologies follow a service methodology for
finding both computation and data services for
performing computationally intensive or
data-intensive tasks. Delivering this low-level
infrastructure is essential, but it does not
provide end users with easy-to-use tools that
help them create their scientific applications.
Compared to Grid computing platforms, e-Science
platforms concentrate mainly on providing
higher-level, application-oriented environments
that enable end-user scientists to derive new
knowledge when devices, sensors, databases,
analysis components and computational resources
are all accessible over the Internet or the Grid.
The Discovery Net system is one such e-Science
platform, dedicated to empowering end users to
conduct knowledge discovery activities easily and
seamlessly. The system is currently used by a
number of application groups in different fields,
including life science, environmental monitoring
and geo-hazard modeling.
In the remainder of this paper we describe
our experience of building the Discovery Net
platform for grid-based knowledge discovery
and data mining, and of using it to conduct data
mining over scientific experimental data.
2. Knowledge Discovery in e-Science
e-Science concerns the development of new
practices and methods to find knowledge.
Non-trivial, actionable knowledge cannot be batch
generated by a set of predefined methods, but
rather the creativity and expertise of the
scientist is necessary to formulate new
approaches. Whilst the dynamic nature of
massively distributed service-oriented
architectures promises to give scientists
powerful tools, it also raises many issues
of complexity.
New resources such as online data sources,
algorithms and methods defined as processes
are becoming available daily. A single process
may need to integrate techniques from a range
of disciplines such as data mining, text mining,
image mining, bioinformatics, or
chemoinformatics, and may be created by a
multidisciplinary team of experts. A major
challenge is to effectively coordinate these
resources in a discovery environment in order to
create knowledge.
As examples of e-Science data analysis
processes, consider the following scenarios:
a. Scientists collaborating on the analysis of a
newly discovered viral genome such as
SARS and studying its evolution.
b. Scientists collaborating on the analysis of
environmental air pollution data and
correlating it with available medical records
and traffic data.
c. Scientists collaborating on the analysis of
satellite images for modelling the possible
effects of earthquakes on populated
regions.
In each of the above scenarios a scientific
knowledge discovery process conducted in an
open environment proceeds by making use of
distributed data and resources. The main
features of such processes can be summarised
as:
1. The processes typically operate as data and
application integration pipelines. At
different stages of the knowledge discovery
process, researchers need to access,
integrate and analyse data from disparate
sources, in order to use that data to find
patterns and models, and feed these models
to further stages in the process. At each
stage, new analysis is conducted by
dynamically combining new data with
previously developed models.
2. There are typically many different data
analysis software components that can be
used to analyse the data. Some of these
components may reside on the user's local
machine, while others may be tied to
execution on remote servers, e.g. via a
web-service interface or even simply via a web
page interface. New software components,
services and tools are continually being
made available, either as downloadable
code or as remote services over the Internet
for access by various groups. An individual
researcher needs to be able to locate such
software components and integrate them
within their analysis procedures.
3. The discovery process itself is almost
always conducted by teams of collaborating
researchers who need to share the data sets,
the results derived from these data sets,
and, more importantly, details about how
these results were derived. In this case,
recording an audit trail of how a particular
analysis result (or new knowledge) was
acquired and used is essential since it
allows researchers to document and manage
their discovery procedures.
4. Since the whole discovery process is
executable, the end user may want to wrap
it as an executable program (or software
component) for access and use by other
researchers. In this case it is essential to
provide methods that allow such processes
to be automatically converted into
executable code, and that allow information
about them to be published to allow users
to locate and access such code.
5. Finally, with a large number of discovery
processes being generated by different
research groups, it is essential to be able to
store such processes within a process
warehouse, in which scientists can
search for, retrieve and re-use procedures
developed for one scenario in similar
scenarios. Furthermore, the availability of
such a warehouse will help them in
managing intellectual property activities
such as patent applications, peer reviews
and publications.
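The five features above can be illustrated with a minimal Python sketch. It is not the Discovery Net implementation or API: the names Step, Process and ProcessWarehouse, the in-memory store and the toy pollution readings are all assumptions introduced here for illustration. The sketch shows a process as a pipeline of steps (feature 1), an audit trail of how each result was derived (feature 3), a process wrapped as a reusable component (feature 4) and a simple searchable warehouse (feature 5).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    """One named stage of a discovery process (hypothetical type)."""
    name: str
    func: Callable[[Any], Any]


@dataclass
class Process:
    """An ordered pipeline of steps that records an audit trail when run."""
    name: str
    steps: List[Step]
    audit_trail: List[Dict[str, Any]] = field(default_factory=list)

    def run(self, data: Any) -> Any:
        for step in self.steps:
            result = step.func(data)
            # Record how each intermediate result was derived.
            self.audit_trail.append(
                {"step": step.name, "input": data, "output": result}
            )
            data = result
        return data

    def as_component(self) -> Step:
        """Wrap the whole process as a single step usable in other pipelines."""
        return Step(self.name, self.run)


class ProcessWarehouse:
    """In-memory stand-in for a process warehouse, searchable by name."""

    def __init__(self) -> None:
        self._store: Dict[str, Process] = {}

    def publish(self, process: Process) -> None:
        self._store[process.name] = process

    def search(self, keyword: str) -> List[str]:
        # Case-insensitive substring match on process names.
        return sorted(n for n in self._store if keyword.lower() in n.lower())

    def retrieve(self, name: str) -> Process:
        return self._store[name]


# Toy pollution readings; the values are invented for illustration only.
readings = [12.0, 48.5, 7.25, 91.0]

clean = Process(
    "pollution-threshold",
    [
        Step("drop_low", lambda xs: [x for x in xs if x >= 10.0]),
        Step("count", len),
    ],
)

warehouse = ProcessWarehouse()
warehouse.publish(clean)
n_high = warehouse.retrieve("pollution-threshold").run(readings)
```

A real system would replace the lambdas with calls to local or remote analysis services (feature 2) and persist both the warehouse and the audit trail; the in-memory versions here only keep the example self-contained.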