A Distributed Architecture for Data Mining and Integration
- ISBN: 9781605585895
Malcolm P. Atkinson
National e-Science Centre
School of Informatics
University of Edinburgh, UK
mpa@nesc.ac.uk
Jano I. van Hemert
National e-Science Centre
School of Informatics
University of Edinburgh, UK
j.vanhemert@ed.ac.uk
Liangxiu Han
National e-Science Centre
School of Informatics
University of Edinburgh, UK
liangxiu.han@ed.ac.uk
Ally Hume
EPCC
University of Edinburgh, UK
a.hume@epcc.ed.ac.uk
Chee Sun Liew
National e-Science Centre
School of Informatics
University of Edinburgh, UK
c.s.liew@nesc.ac.uk
ABSTRACT
This paper presents the rationale for a new architecture to
support a significant increase in the scale of data integration
and data mining. It proposes the composition into one framework
of (1) data mining and (2) data access and integration. We name
the combined activity "DMI". It supports enactment of DMI
processes across heterogeneous and distributed data resources
and data mining services. It posits that a useful division can
be made between the facilities established to support the
definition of DMI processes and the computational infrastructure
provided to enact DMI processes. Communication between those two
divisions is restricted to requests submitted to gateway
services in a canonical DMI language. Larger-scale processes are
enabled by incremental refinement of DMI-process definitions
often by recomposition of lower-level definitions. Autonomous
evolution of data resources and services is supported by types
and descriptions which will support detection of inconsistencies
and semi-automatic insertion of adaptations. These architectural
ideas are being evaluated in a feasibility study that involves
an application scenario and representatives of the community.
Categories and Subject Descriptors
C.1.4 [Parallel Architectures]: Distributed architectures
General Terms
Algorithms, Design, Languages
Keywords
Data mining; Data integration; Distributed computing;
Data-aware Distributed Computing; Service-oriented architectures
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
DADC’09, June 9–10, 2009, Munich, Germany.
Copyright 2009 ACM 978-1-60558-589-5/09/06 ...$5.00.
1. INTRODUCTION
We report the rationale for a new architecture, the DMI
architecture, for combined data integration and data mining,
under development in the ADMIRE project¹. The principal
innovations are: (1) a decoupling of the enactment technology
from the tools used to prepare DMI processes, which in turn
enables (2) multiple independent DMI enactment services, some of
which may be tightly coupled with curated data collections, and
(3) the enactment of each DMI process by distributing it over
these services, cf. distributed queries.
The DMI architecture is intended to enable society to make
better use of the rapidly expanding wealth of data. The number
of sources of data is increasing while, at the same time, the
diversity, complexity and scale of these data resources are also
growing dramatically. This cornucopia of data offers much
potential: a combinatorial explosion of opportunities for
knowledge discovery, improved decisions and better policies.
Today, most of these opportunities are not realised because
composing data from multiple sources and extracting information
is too difficult. The proposed DMI architecture must make all of
the stages of DMI process development and enactment, as
identified in [14], easier and more economical.
This data-rich environment with a growing commitment to the
effective exploitation of data requires an architecture that
must simultaneously address a number of sources of scale and
complexity. The following list is indicative:
- The scale and complexity of each data source grows. The DMI
  architecture addresses this with data-flow technology to
  reduce data handling and to move data reduction and
  transformation operations closer to data sources. These data
  transformations can be updated to prevent changes in the forms
  of data provided by a data resource from propagating to other
  parts of a DMI workflow unnecessarily.
- The number and variety of independent data sources increases.
  Warehousing and virtualisation become infeasible at the
  envisaged scale; the DMI architecture addresses this by
  proposing dynamic composition of processes.
- The computational complexity of extracting information grows
  as a result of the above and of increasingly sophisticated
  application requirements. The DMI architecture addresses [...]
  computing (DADC) engineers and by supporting the incremental
  definition and revision of libraries and patterns.
¹ EU FP7 ICT 215024, www.admire-project.eu
- The number of application domains using DMI grows, becomes
  more diverse and engages more users. The DMI architecture
  addresses this by recognising communities of users, by
  supporting them with their own environments and by delivering
  packaged production versions of DMI processes.
- The number of experts involved in developing new DMI
  processes and supporting their application grows. The DMI
  architecture addresses this by separating support for DMI
  experts from that for DADC engineers and application-domain
  users. Support for communities with aligned DMI interests is
  achieved by enabling sharing between DMI-developers'
  workbenches via a common registry for their community.
- The number of providers of data and DMI services grows. The
  DMI architecture separates the organisation of environments
  for DMI-process development from the complexities of
  DMI-service provision by interposing DMI gateways using a
  canonical language.
- The growing sophistication of information extraction from
  large bodies of data requires ever more complex and refined
  workflows. The DMI architecture addresses this by structuring
  collections of components into libraries that correspond to a
  conceptual structure captured in DMI ontologies and by
  supporting the incremental refinement of libraries and the DMI
  processes that use them. This encourages greater
  contemporaneous effort by supporting concurrent independent
  development by three separate categories of experts working
  both for providers and users.
- The providers of data and services autonomously change their
  offered services and schema at a rate which defies manual
  adaptation when many data resources are in use. The DMI
  architecture proposes to exploit type systems, semantic
  description, community effort and light-weight composition to
  semi-automatically adapt to change and to pool the
  intelligence of human interventions.
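
The last item in the list above can be illustrated with a minimal sketch, which is not the ADMIRE implementation. It assumes that each step of a DMI process declares its input and output types as plain strings and that a community maintains a shared table of adapters keyed by type pair. The names `Step`, `ADAPTERS`, `register_adapter`, `adapt` and `enact`, and the temperature example, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One step of a DMI process with declared input and output types."""
    name: str
    in_type: str
    out_type: str
    run: Callable


# Hypothetical shared table of adapters, keyed by (from_type, to_type).
ADAPTERS: Dict[Tuple[str, str], Step] = {}


def register_adapter(adapter: Step) -> None:
    """Record an adapter so that later mismatches can reuse it."""
    ADAPTERS[(adapter.in_type, adapter.out_type)] = adapter


def adapt(steps: List[Step]) -> Tuple[List[Step], List[Tuple[str, str]]]:
    """Detect type mismatches between consecutive steps.

    A known adapter is inserted automatically; a mismatch with no
    known adapter is reported so that a person can intervene.
    """
    repaired: List[Step] = []
    unresolved: List[Tuple[str, str]] = []
    for step in steps:
        if repaired and repaired[-1].out_type != step.in_type:
            adapter = ADAPTERS.get((repaired[-1].out_type, step.in_type))
            if adapter is None:
                unresolved.append((repaired[-1].name, step.name))
            else:
                repaired.append(adapter)
        repaired.append(step)
    return repaired, unresolved


def enact(steps: List[Step]):
    """Run the steps in order, feeding each result to the next step."""
    value = None
    for step in steps:
        value = step.run(value)
    return value


# Assumed scenario: a provider changed its output from Celsius to Kelvin.
source = Step("read_temps", "none", "kelvin", lambda _: [300.0, 310.0])
mean = Step("mean", "celsius", "number", lambda xs: sum(xs) / len(xs))
register_adapter(Step("kelvin_to_celsius", "kelvin", "celsius",
                      lambda xs: [x - 273.15 for x in xs]))

steps, unresolved = adapt([source, mean])
print([s.name for s in steps], unresolved)
print(round(enact(steps), 2))
```

In this sketch a human intervention amounts to registering a new adapter once; every later process that meets the same type pair is then repaired without further manual work, which is one possible reading of pooling the intelligence of human interventions.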
Definitions are in Table 1. The principal elements of the
architecture are presented in Section 2. Section 3 introduces
the canonical language used to send requests to DMI gateways.
The evaluation of the DMI architecture using prototypes and test
cases is described in Section 4. Related work is summarised in
Section 5 and Section 6 concludes with an assessment of progress
and the plans for further work. More detailed information about
the architecture and the work underway in the ADMIRE project to
evaluate it can be found in [5].
2. DMI ARCHITECTURE
Figure 1 shows how the complexities of matching the diversity of
user requirements at the tools level can be separated from the
complexity of the enactment level, accommodating the diversity
of data resources and services, by interposing the single
canonical domain of discourse represented by the DMI language
(see Section 3). Our hypothesis is that, by enforcing this
logical decoupling, both the tools development and the platform
engineering will proceed rapidly and independently. Of course,
this depends on the quality of the abstract machine and the
language operating at the gateway. Developing that quality is a
research goal.
component           a computational item used in a DMI, i.e. data
                    collections, data resources, functions, PE, PE
                    instances & types.
connection          a pipe streaming data between PE.
CRISP-DM            six phases of data mining [14].
data collection     a coherent collection of data, e.g. a file, a set
                    of files, a relational table, a set of tables, an
                    XML document, etc.
data resource       a service that provides data and may accept data,
                    e.g. a file service or a DBMS.
DADC engineer       a person who builds distributed systems that
                    dynamically adapt to the data they handle.
DMI experts         specialists in developing DMI processes.
domain experts      specialists in applying DMI in their domain.
DMI gateway         a service that processes DMI requests.
DMI portal          an interface for submitting canned DMI requests.
DMI process         a sequence of computational steps to a DMI goal.
enactment           a computation implementing a DMI process.
library             a collection of PE, functions and types.
pattern             a recurring structure within DMI processes.
processing element  an algorithm for a step in DMI (abbr: PE).
PE instance         a PE plus its processing state.
registry            holds descriptions of all the possible DMI
                    components.
repository          holds definitions and implementations of all of
                    the DMI components that are generated within the
                    DMI architecture.
session             a dynamically created service providing access to
                    parts of the state of an enactment.
streaming           passing values incrementally along a connection.
type                a formal description of a class of values.

Table 1: Definitions used in this paper
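
Several of the definitions in Table 1 (processing element, PE instance, connection and streaming) can be made concrete with a small sketch. It uses Python generators as a stand-in and is not the ADMIRE enactment technology: a PE is modelled as a function from an input stream to an output stream, a PE instance as the running generator that holds its processing state, and a connection as the iterator passed from one PE to the next. The helper names `select`, `transform`, `connect` and `readings` are hypothetical.

```python
from typing import Callable, Iterable, Iterator


def select(predicate: Callable) -> Callable[[Iterable], Iterator]:
    """Build a PE that passes on only the values satisfying `predicate`."""
    def pe(stream: Iterable) -> Iterator:
        for value in stream:
            if predicate(value):
                yield value
    return pe


def transform(fn: Callable) -> Callable[[Iterable], Iterator]:
    """Build a PE that applies `fn` to each value as it arrives."""
    def pe(stream: Iterable) -> Iterator:
        for value in stream:
            yield fn(value)
    return pe


def connect(source: Iterable, *pes: Callable) -> Iterator:
    """Chain PEs with connections; values are streamed one at a time.

    Calling each PE creates a generator, which plays the role of a PE
    instance: the algorithm plus its current processing state.
    """
    stream = iter(source)
    for pe in pes:
        stream = pe(stream)
    return stream


def readings() -> Iterator:
    """Hypothetical data resource yielding (sensor, value) rows."""
    yield from [("a", 3), ("b", 12), ("a", 20), ("c", 7)]


# Reduce the data next to the source, then transform what remains.
pipeline = connect(
    readings(),
    select(lambda row: row[1] > 5),
    transform(lambda row: row[0]),
)
result = list(pipeline)
print(result)
```

Because each value is pulled through the whole chain before the next one is read, the filter placed directly after the source discards rows before any later PE handles them, which corresponds to moving data reduction closer to the data source.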
We propose supporting user interaction with DMI systems through
two mechanisms:
- DMI workbenches that provide a coherent set of tools designed
  to support a particular category of DMI-process developers.
  DMI workbenches may take many forms to support particular
  developer styles and application-developer requirements.
- DMI portals that permit application-domain experts to use,
  repeatedly and conveniently, DMI processes that have been
  developed and packaged at the above workbenches. On each
  occasion that such processes are enacted, users will specify
  parameters, trigger the submission, observe progress and
  collect results through carefully prepared user interfaces.
The design and provision of DMI workbenches and portals will be
reported elsewhere. To allow providers of DMI services to
amortise and smooth their costs over many communities of users
and to allow selection of DMI services from multiple providers,
the DMI architecture provides a many-to-many relationship
between workbenches (and portals they have set up) and the DMI
gateways, as is shown in Figure 2.
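
The many-to-many relationship, together with the restriction that tools and enactment communicate only through requests in the canonical language, can be sketched as follows. This is an illustration under stated assumptions and not the ADMIRE interface: the classes `Workbench` and `Gateway`, the `submit` methods and the returned ticket strings are hypothetical, and the request is treated as opaque text because the DMI language itself is not shown here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Gateway:
    """A DMI gateway: the only entry point to one enactment platform."""
    name: str
    received: List[str] = field(default_factory=list)

    def submit(self, request: str) -> str:
        """Accept a request written in the canonical DMI language."""
        self.received.append(request)
        return f"{self.name}:accepted:{len(self.received)}"


@dataclass
class Workbench:
    """A tools-level environment that may use several gateways."""
    name: str
    gateways: List[Gateway] = field(default_factory=list)

    def submit(self, request: str, gateway_name: str) -> str:
        """Send the request text to one of the gateways in use."""
        for gateway in self.gateways:
            if gateway.name == gateway_name:
                return gateway.submit(request)
        raise LookupError(f"{self.name} does not use gateway {gateway_name}")


# Illustrative labels: three workbenches sharing two gateways.
gw_a, gw_b = Gateway("a"), Gateway("b")
wb_A = Workbench("A", [gw_a, gw_b])
wb_B = Workbench("B", [gw_a])
wb_C = Workbench("C", [gw_b])

request = "<request text in the canonical DMI language>"
tickets = [
    wb_A.submit(request, "a"),
    wb_A.submit(request, "b"),
    wb_B.submit(request, "a"),
    wb_C.submit(request, "b"),
]
print(tickets)
```

Nothing in a workbench depends on how a gateway enacts a request, so a provider can serve many communities through one gateway and a community can choose between providers by changing only the gateway it names.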
A community of developers will use a number of workbenches,
e.g. A, B & C, that all use one registry, 1. This registry holds
descriptions of the DMI components that they are developing or
have obtained from the gateways they are using, e.g. a and b.
Some of these descriptions will refer to representations of
implementations in a repository, N. A gateway has its own
registry which describes all of the resources, services and
components it is able to work with. Requests to a gateway can
interrogate this information, can update it


