The Design of Discovery Net: Towards Open Grid Services for Knowledge Discovery

by Salman AlSairafi, Filippia-Sofia Emmanouil, Moustafa Ghanem, Nikolaos Giannadakis, Yike Guo, Dimitrios Kalaitzopoulos, Michelle Osmond, Anthony Rowe, Jameel Syed, Patrick Wendel
International Journal of High Performance Computing Applications (2003)

Abstract

With the emergence of distributed resources and grid technologies there is a need to provide higher level informatics infrastructures allowing scientists to easily create and execute meaningful data integration and analysis processes that take advantage of the distributed nature of the available resources. These resources typically include heterogeneous data sources, computational resources for task execution and various application-specific services. The effort of the high performance community has so far mainly focused on the delivery of low-level informatics infrastructures enabling the basic needs of grid applications. Such infrastructures are essential but do not directly help end-users in creating generic and re-usable applications. In this paper, we present the Discovery Net architecture for building grid-based knowledge discovery applications. Our architecture enables the creation of high-level, re-usable and distributed application workflows that use a variety of common types of distributed resources. It is built on top of standard protocols and standard infrastructures such as Globus but also defines its own protocols such as the Discovery Process Mark-up Language for data flow management. We discuss an implementation of our architecture and evaluate it by building a real-time genome annotation environment on top.

DEPARTMENT OF COMPUTING, IMPERIAL COLLEGE,
180 QUEEN’S GATE, LONDON SW7 2BZ, UK
1 Introduction
1.1 MOTIVATION
The design and features of the Discovery Net architecture
originally developed from the needs of the
knowledge discovery process as applied to the field of
bioinformatics, where complicated data analysis work-
flows are typically constructed in a data-pipelined
approach. At different stages of these workflows, also
called discovery pipelines, there are requirements to
acquire, integrate and analyze data from disparate sources,
to use that data in finding patterns and models, and to
feed these models to further analysis stages. In each
stage new analysis is conducted by dynamically com-
bining new data with previously developed models.
As a motivating example, consider an automated lab-
oratory experiment where a range of sensors produces
large volumes of data about the activity of genes in can-
cerous cells. A short time series is produced that records
how each gene responds to the introduction of a possi-
ble drug. The initial requirement of the analysis is to fil-
ter interesting time series from uninteresting ones; one
approach is to use data clustering techniques (Eisen et
al., 1998). If a group of interesting genes is found, then
a crucial step in the scientific discovery process is to
verify if the clusters can be explained by referring to
existing biological knowledge.
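
The filtering and grouping steps of this example can be sketched as follows. This is a minimal illustration only, not the clustering method of Eisen et al. and not part of Discovery Net: the range-based filter, the greedy correlation grouping, the thresholds and the gene names are all assumptions made for the example.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def is_interesting(series, min_range=1.0):
    """Hypothetical filter: keep genes whose response varies by at least min_range."""
    return max(series) - min(series) >= min_range

def group_by_correlation(profiles, threshold=0.9):
    """Greedy grouping: a gene joins the first cluster whose seed profile it matches."""
    clusters = []  # list of (seed_series, [gene names])
    for gene, series in profiles.items():
        for seed, members in clusters:
            if pearson(seed, series) >= threshold:
                members.append(gene)
                break
        else:
            # No existing cluster is similar enough, so this gene seeds a new one.
            clusters.append((series, [gene]))
    return [members for _, members in clusters]

# Made-up short time series, one per gene (illustrative data only).
profiles = {
    "geneA": [0.0, 1.0, 2.0, 3.0],
    "geneB": [0.1, 1.1, 2.1, 2.9],
    "geneC": [3.0, 2.0, 1.0, 0.0],
    "geneD": [1.0, 1.0, 1.1, 1.0],
}
interesting = {g: s for g, s in profiles.items() if is_interesting(s)}
clusters = group_by_correlation(interesting)
```

In a real pipeline the resulting groups would then be checked against external biological knowledge, which is the step that motivates the integration features described next.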
This simple discovery pipeline has four main features
that are common to the knowledge discovery process as
applied within many scientific communities. We first
describe these four features, and then describe in more
detail the requirements that allow an informatics infra-
structure to support a wide range of complex discovery
pipelines.
1.1.1 Features of Discovery Pipelines. Dynamic
Information Integration: The first feature of discovery
pipelines is that they may include dynamic queries to
decentralized and semi-structured data sources. Bio-
informatics researchers have made available a signifi-
cant amount of information on the Internet about vari-
ous biological items and processes (Genes, Proteins,
Metabolism and Regulation). These semi-structured
resources can be accessed from remote online databases
over the Internet through a range of search mechanisms,
from key-based lookups to biosequence similarity
searches. The need to integrate this information within
the discovery process is inevitable since it dictates how
the discovery may proceed.
Workflow Management and Auditing: The second
feature is that recording how the results of the analysis
were reached and used may be as important as the
results themselves, since such a record provides an audit
The International Journal of High Performance Computing Applications,
Volume 17, No. 3, Fall 2003, pp. 297–315
© 2003 Sage Publications
trail of the discovery procedure. This recorded audit trail
allows researchers to document and manage their dis-
covery procedures, re-use the same procedure in similar
scenarios, and in many cases it is an essential compo-
nent in managing intellectual property activities such as
patent applications, peer reviews and publications.
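
As an illustration of the audit-trail idea, the sketch below runs a small pipeline and records, for every step, its name, its parameters and a digest of its input and output. It is a hypothetical example under stated assumptions: the record format and the names `run_with_audit`, `fingerprint`, `drop_below` and `scale` are invented here and do not describe how Discovery Net stores its audit information.

```python
import hashlib
import json

def fingerprint(value):
    """Short, stable digest of a JSON-serialisable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def run_with_audit(steps, data):
    """Run (name, function, params) steps in order and log each one."""
    trail = []
    for name, func, params in steps:
        result = func(data, **params)
        trail.append({
            "step": name,
            "params": params,
            "input": fingerprint(data),
            "output": fingerprint(result),
        })
        data = result
    return data, trail

# Two toy analysis steps (illustrative only).
def drop_below(values, threshold):
    return [v for v in values if v >= threshold]

def scale(values, factor):
    return [v * factor for v in values]

# The procedure is plain data, so it can be stored, shared and re-run.
steps = [
    ("filter", drop_below, {"threshold": 2}),
    ("normalise", scale, {"factor": 0.5}),
]
result, trail = run_with_audit(steps, [1, 2, 3, 4])

# Re-using the same procedure in a similar scenario with different data.
other_result, other_trail = run_with_audit(steps, [5, 1])
```

Because each entry links the digest of its input to the digest of the previous output, the trail documents how a result was derived as well as what it was.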
Remote Execution: The third feature is that the analysis
components used within discovery pipelines can themselves be tied
to remote computing resources, e.g. similarity searches
over DNA sequences executing on a shared high perfor-
mance machine. New services and tools for executing
similar or related operations are continually being made
accessible over the Internet by various researchers, and
there is a need to make them available for use in newly
created discovery pipelines.
Collaborative Knowledge Discovery: The fourth fea-
ture is that the discovery process itself is almost always
conducted by teams of collaborating researchers who
need to share the datasets, the results derived from these
datasets and, more importantly, the details of
how these results were derived.
This data-pipelined approach is gaining ground beyond
the life sciences, in fields where similar needs arise for
cross-referencing patterns discovered in a dataset with patterns and
data stored in remote databases, and for using shared
high performance resources. Examples abound in the
analysis of heterogeneous data in fields such as geological
analysis, environmental sciences, astronomy, and particle
physics. Irrespective of the application area, supporting
the data-pipelined knowledge discovery process requires
the provision of knowledge discovery tools that can flex-
ibly operate in an open system by allowing:
• the dynamic retrieval and construction of required
datasets;
• the execution of data mining algorithms on distrib-
uted computing servers;
• the dynamic integration of new servers, new data-
bases and new algorithms within the knowledge dis-
covery process.
The above requirements contrast with the services
offered by existing knowledge discovery tools that
mainly focus on extracting knowledge within closed sys-
tems such as a centralized database or a data warehouse
where all the data required for an analysis task can be
materialized locally at any time, and fed to data mining
algorithms and tools that were pre-defined at the config-
uration stage of the tool.
1.2 REQUIREMENTS
Having described some of the features of discovery
pipelines, we now formalize the requirements for a knowl-
edge discovery infrastructure that can effectively and
efficiently support them. We describe these require-
ments along three axes while bearing in mind that the
main goal of such an infrastructure is to support collab-
orative and grid-based data integration and analysis.
1.2.1 Data Requirements. The first axis for our
analysis covers how data is accessed, managed and inte-
grated from within the desired infrastructure. Firstly, such
an infrastructure must naturally provide well-defined
and optimized data management and must be able to
handle large datasets of any type. More precisely, col-
laborative data analysis requires a higher level of data
access than provided by sequential files or input streams.
The use of relational databases, although common, can
be an obstacle to achieving high performance since typi-
cal relational databases perform well only if the user
application takes great care in defining the structure of
the data and its access patterns. The dynamic definition,
derivation and refinement of datasets is an important
part of typical data analysis workflows along with sta-
tistical, data mining and data integration operations. To
efficiently support all such operations, the required infra-
structure must be able to provide efficient and light-
weight table management services.
Secondly, since the data resources can be located on
and used from any location or resource accessible to the
user, the required infrastructure must be able to support
dynamic access, integration and structuring of data from
multiple heterogeneous data sources. Finally, and in order
to preserve the overall quality of its supported services,
the infrastructure must naturally provide optimized data
transmission between the available resources.
1.2.2 Execution Requirements. The second axis for
our analysis covers the features of the execution environment
required for such an infrastructure. Because both the
amount of data being analyzed and the complexity of the
algorithms used keep increasing, the ability to access and
use distributed high performance computing resources to
execute these analyses is a clear requirement for the
desired infrastructure. This
infrastructure must be able to utilize all resources made
available to an application in order to maximize the
application’s performance. However, the infrastructure
should also, as much as possible, separate the applica-
tion definition level from the planning of its execution
on available resources. An application should be able to
keep its analytical definition separate from the details
of its execution, so that it can be easily published and
re-used in different contexts and on different resources.
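
The separation between an application's definition and the planning of its execution can be sketched as follows. This is a minimal illustration under stated assumptions: the workflow format, the operation names and the resource names are hypothetical, and the sketch does not reproduce the Discovery Process Mark-up Language or any Discovery Net interface.

```python
# Analytical definition: which operations run and in which order.
# It names no machines, so it can be published and re-used unchanged.
WORKFLOW = [
    {"id": "load", "op": "load_sequences", "after": []},
    {"id": "search", "op": "similarity_search", "after": ["load"]},
    {"id": "report", "op": "summarise", "after": ["search"]},
]

def plan(workflow, capabilities, default="local"):
    """Execution planning: map each step to a resource offering its operation."""
    placement = {}
    for step in workflow:
        hosts = [name for name, ops in sorted(capabilities.items())
                 if step["op"] in ops]
        placement[step["id"]] = hosts[0] if hosts else default
    return placement

def execution_order(workflow):
    """Order steps so that each one runs after the steps it depends on."""
    done, order = set(), []
    pending = list(workflow)
    while pending:
        ready = [s for s in pending if all(d in done for d in s["after"])]
        if not ready:
            raise ValueError("cyclic or unsatisfied dependency")
        for step in ready:
            order.append(step["id"])
            done.add(step["id"])
        pending = [s for s in pending if s["id"] not in done]
    return order

# Two hypothetical environments running the same definition.
site_a = {"hpc-cluster": {"similarity_search"}}
site_b = {}

plan_a = plan(WORKFLOW, site_a)
plan_b = plan(WORKFLOW, site_b)
order = execution_order(WORKFLOW)
```

Only the placement differs between the two sites; the workflow definition and its execution order stay the same, which is the property the requirement asks for.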