An architecture for distributed enterprise data mining
Lecture Notes in Computer Science (1999)
- ISSN: 03029743
Abstract
The requirements for data mining systems for large organisations and enterprises range from logical and physical distribution of large data and heterogeneous computational resources to the general need for high performance at a level that is sufficient for interactive work. This work categorises the requirements and describes the Kensington software architecture that addresses these demands. The system is capable of transparently supporting parallel computation at two levels, and we describe a configuration for trans-atlantic distributed parallel data mining that was demonstrated at the recent Supercomputing conference.
J. Chattratichat, J. Darlington, Y. Guo, S. Hedvall, M. Köhler, and J. Syed
Data Mining Group
Imperial College Parallel Computing Centre
180 Queen's Gate, London SW7 2BZ, UK
{jcS, jd, yg, dshl, mk, jasS}@doc.ic.ac.uk
1 Introduction
Data Mining, or Knowledge Discovery in Databases, is concerned with extracting
useful and new information from data, and provides the basis for leveraging the
investments in data assets. It combines the fields of databases and data
warehousing with algorithms from machine learning and methods from statistics
to gain insight into hidden structures within the data. In order to apply the
knowledge from the Data Mining process, the results need to be analysed, often
with the help of visualisation tools, as well as integrated into the business
process.
Data mining systems for enterprises and large organisations have to overcome
unique challenges. They need to combine access to diverse and distributed data
sources with the large computational power required for many mining tasks. The
data mining process, as perceived by the analysts, knowledge workers and
end-users of the discovered knowledge, is an interactive one that functions
best when the system responds fast enough for interactive work. The analyses
are usually refined during several iterations through the cycle of data
selection, pre-processing, model building and model analysis. The best results
are usually achieved by combining models from different techniques, which calls
for a wide variety of integrated tools within the system, as well as openness
for future extensions.
In large organisations, data from numerous sources needs to be accessed
and combined to provide comprehensive analyses, and work groups of analysts
require access to the same data and results. For this purpose the existing
networking infrastructure, typically based on Internet technology, is to be
re-used.
Confidentiality becomes a key issue, and the system architecture needs to
provide security features at all levels of access. The different needs of
enterprises require that a system offers a wide range of configuration options,
so that it is possible to scale applications from a few client workstations to
high-performance server machines.
In this article we will discuss the implications of the above requirements,
focusing on the Kensington solution that employs Internet and distributed
component technologies for deployment on high-performance servers such as
distributed-memory and shared-memory parallel machines. The next section will
discuss the key functional requirements that have been outlined so far. The
following section outlines the design and implementation of the Kensington
enterprise data mining system, in particular Java- and CORBA-based networking
and component technology. We then describe a scenario for distributed data
mining that was demonstrated at SuperComputing'98 as part of the award-winning
Terabyte Challenge. The final section concludes and outlines future trends in
the field.
2 Enterprise Data Mining Requirements
Data mining system architectures for enterprises have to meet a range of
demands from the field of data analysis and the additional needs that arise when
handling large amounts of data inside an organisation. Modern data mining
applications are expected to provide a high degree of integration while
retaining flexibility. In this way they can efficiently support different types
of analyses over the organisation's data. Data mining is understood to be an
iterative process for the analyst [FPSS96], especially in the initial
exploratory phases of the analytical task. Therefore, a high degree of
interactivity is required, often combined with the need for visualisation of
the data and the analytical results.
The field of data mining is developing rapidly, and the methods applied in a
tool today may be superseded by more advanced algorithms in the near future.
Furthermore, the convergence with statistical methods has only just started,
and will grow in pace over the next few years. The need for enhancement of the
existing tool set has to be reflected by a software architecture that enables the
straightforward integration of new analytical components. In a similar vein, the
results from the analytical functions need to be presented in portable formats,
as most analysts will want to use different specialist packages to further refine
or report the results.
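The extensibility and portability requirements above can be illustrated with a
minimal Java sketch. It is not taken from Kensington: the names
AnalyticalComponent, ColumnMeans and ComponentRegistry are hypothetical, and
the column-mean computation merely stands in for a real mining algorithm. The
sketch only shows how a registry keyed by name allows a new component to be
added without changing existing code, and how a result can be returned as
plain comma-separated text that other packages can read.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plug-in contract: a component consumes rows of numeric values
// and returns its result as portable plain text.
interface AnalyticalComponent {
    String name();
    String run(List<double[]> rows);
}

// Illustrative component: per-column means, exported as one CSV line.
// Assumes all rows have the same number of columns.
class ColumnMeans implements AnalyticalComponent {
    public String name() { return "column-means"; }

    public String run(List<double[]> rows) {
        if (rows.isEmpty()) return "";
        double[] sums = new double[rows.get(0).length];
        for (double[] row : rows) {
            for (int i = 0; i < sums.length; i++) sums[i] += row[i];
        }
        StringBuilder csv = new StringBuilder();
        for (int i = 0; i < sums.length; i++) {
            if (i > 0) csv.append(',');
            csv.append(sums[i] / rows.size());
        }
        return csv.toString();
    }
}

// Registry: new components are registered by name; callers and existing
// components are not modified when the tool set grows.
class ComponentRegistry {
    private final Map<String, AnalyticalComponent> components = new LinkedHashMap<>();

    void register(AnalyticalComponent component) {
        components.put(component.name(), component);
    }

    String run(String name, List<double[]> rows) {
        AnalyticalComponent component = components.get(name);
        if (component == null) {
            throw new IllegalArgumentException("unknown component: " + name);
        }
        return component.run(rows);
    }
}

class RegistryDemo {
    public static void main(String[] args) {
        ComponentRegistry registry = new ComponentRegistry();
        registry.register(new ColumnMeans());
        List<double[]> rows = Arrays.asList(new double[] {1.0, 2.0},
                                            new double[] {3.0, 4.0});
        System.out.println(registry.run("column-means", rows)); // prints 2.0,3.0
    }
}
```

Adding a further algorithm under this assumption means writing one more class
that implements AnalyticalComponent and registering it; a real system would
also need typed model descriptions rather than a bare text line.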
In large organisations, the amount and the distribution of the data become
an additional challenge. The size of the data may make it impractical to move it
between sites for individual analytical tasks. Instead, data mining operations are
required to execute "close to the database". In the absence of dedicated support
for data mining and other analytical algorithms in the database management
systems, this can be achieved by setting up high-performance servers in close
proximity to the databases. The overall data mining system will then have to
manage the distributed execution of the analytical tasks and the combination
of the partial results into a meaningful total. Also, this approach can some-


