
A data mining architecture for distributed environments

by M Z Ashrafi, D Taniar, K A Smith
Innovative Internet Computing Systems (2002)

H. Unger, T. Böhme, and A. Mikler (Eds.): I²CS 2002, LNCS 2346, pp. 27-38, 2002.
© Springer-Verlag Berlin Heidelberg 2002


A Data Mining Architecture for Distributed
Environments
Mafruz Zaman Ashrafi, David Taniar, and Kate Smith
School of Business Systems, Monash University
PO BOX 63B, Clayton 3800, Australia
{Mafruz.Ashrafi,David.Taniar,Kate.Smith}@infotech.monash.edu.au
Abstract. Data mining offers tools for the discovery of relationships, patterns
and knowledge from a massive database in order to guide decisions about future
activities. Applications from various domains have adopted this technique to
perform data analysis efficiently. Several issues need to be addressed when
such techniques are applied to data that are large in size and geographically
distributed across various sites. In this paper we describe a system architecture
for a scalable and portable distributed data mining application. The system contains
modules for secure distributed communication, database connectivity,
organized data management and efficient data analysis for generating a global
mining model. Performance evaluation of the system is also carried out and
presented.
1 Introduction
The widespread use of computers and the advance in database technology have
provided huge amounts of data. The explosive growth of data in databases has
generated an urgent need for efficient data mining techniques to discover useful
information and knowledge. On the other hand, the emergence of network-based
distributed computing, such as private intranets, the Internet, and wireless networks,
has created a natural demand for scalable data mining techniques that can exploit
the full benefit of such computing environments.
Distributed Data Mining (DDM) aims to discover knowledge from different data
sources geographically distributed on multiple sites and to combine it to build a global
data-mining model [3,4,8]. However, several issues emerge when data mining
techniques are used on such systems. A distributed computing system has an
additional level of complexity compared with a centralized or host-based system. It
may need to deal with heterogeneous platforms, multiple databases and possibly
different schemas, with the design and implementation of scalable and effective
protocols for communication among the nodes, and with the selective and efficient
use of the information that is gathered from several nodes [9].
A fundamental challenge for DDM is to develop mining techniques without having
to communicate data unnecessarily. Such functionality is required for reasons of
efficiency, accuracy and privacy. In addition, appropriate protocols, languages, and
network services are required for mining distributed data to handle the required
metadata and mapping.
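To illustrate what mining without communicating data unnecessarily can mean in practice, the following minimal sketch has each site summarise its own transactions as item counts, so that only the counts, not the raw records, travel to a coordinator. The class and method names (LocalCountMerger, countItems, mergeCounts) are hypothetical and are not taken from the architecture described in this paper.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not from the paper: each site summarises its own
// transactions as item counts, and only those counts cross the network.
class LocalCountMerger {

    // Runs at a site: count how many local transactions contain each item.
    static Map<String, Integer> countItems(List<List<String>> transactions) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> transaction : transactions) {
            // distinct() so an item repeated within one transaction counts once
            transaction.stream().distinct()
                    .forEach(item -> counts.merge(item, 1, Integer::sum));
        }
        return counts;
    }

    // Runs at the coordinator: add up the per-site summaries.
    static Map<String, Integer> mergeCounts(List<Map<String, Integer>> perSite) {
        Map<String, Integer> global = new HashMap<>();
        for (Map<String, Integer> siteCounts : perSite) {
            siteCounts.forEach((item, n) -> global.merge(item, n, Integer::sum));
        }
        return global;
    }

    public static void main(String[] args) {
        Map<String, Integer> siteA = countItems(List.of(
                List.of("bread", "milk"), List.of("bread")));
        Map<String, Integer> siteB = countItems(List.of(
                List.of("milk"), List.of("bread", "milk")));
        // Prints the global item counts; map iteration order is unspecified.
        System.out.println(mergeCounts(List.of(siteA, siteB)));
    }
}
```

The traffic in such a scheme grows with the number of distinct items rather than with the number of transactions, which is one way the efficiency and privacy concerns above might be addressed.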
In this paper, we present a system architecture for developing mining applications
for distributed systems. The proposed architecture is not focused on any particular
data mining algorithms, since our intention is not to propose new algorithms but to
suggest a system infrastructure that makes it possible to plug in any mining algorithm
and enable it to participate in a highly distributed real-time system. The system is
implemented in Java because it supports portable distributed programming on multiple
platforms. Java threads, sockets, data compression, and JDBC were utilized.
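As a rough illustration of how a pluggable algorithm and data compression could fit together in Java, the sketch below defines a hypothetical MiningAlgorithm interface and a ModelCodec helper that compresses a textual model with the standard java.util.zip GZIP streams before it would be written to a socket. Both names, and the choice of a plain string as the model format, are assumptions made for this example; the paper does not publish its interfaces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical plug-in contract (an assumption, not the paper's API):
// any algorithm that can turn local rows into a textual model fits.
interface MiningAlgorithm {
    String mine(List<String> localRows);
}

// Hypothetical helper that shrinks a model before it is sent to a peer.
class ModelCodec {

    // Compress the model text with GZIP; the result could be written
    // to a socket's output stream.
    static byte[] compress(String model) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(model.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reverse of compress(): restore the model text on the receiving site.
    static String decompress(byte[] payload) {
        try (GZIPInputStream gzip =
                     new GZIPInputStream(new ByteArrayInputStream(payload))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A trivial stand-in algorithm: the "model" is just the row count.
        MiningAlgorithm rowCounter = rows -> "rows=" + rows.size();
        String model = rowCounter.mine(List.of("r1", "r2", "r3"));
        byte[] wire = compress(model);
        System.out.println(decompress(wire));
    }
}
```

Keeping the algorithm behind a single-method interface means a new technique can be swapped in without touching the transport code, which is the kind of separation the architecture aims for.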
2 Related Work
In this section, we provide some background material and related work in this area.
Several systems, including JAM, PADMA, Papyrus, BODHI, Kensington, PaDDMAS,
and DMA, have been developed or proposed for distributed data mining.
JAM [3] is a distributed agent-based data mining system that uses a meta-learning
technique. It develops local patterns of fraudulent activity by mining the local
databases of several financial institutions. Final patterns are then generated by
combining these local patterns. It assumes that each data site consists of a local
database, learning agents, meta-learning agents and configuration modules, which
perform the major tasks of distributed computing by sending and receiving requests
to and from different sites.
PADMA [7] is an agent-based architecture for parallel/distributed data mining. It
is a document analysis tool that works in a distributed environment based on
cooperative agents. It aims to develop a flexible system that exploits parallelism
in data mining. The data-mining agents in PADMA perform several parallel relational
operations with the information extracted from the documents. The authors report on
a PADMA implementation for unstructured text mining, although the architecture is
not domain specific.
The Papyrus [4] system is able to mine distributed data sources on local and wide
area clusters and in a super cluster scenario. It uses meta-clusters to generate local
models, which are exchanged to generate a global model. The authors report that
the system can support the moving of large volumes of mining data. The idea is
founded on a theory similar to that of the JAM system. However, Papyrus uses a model
representation language (PMML) and a storage system called Osiris.
BODHI [8] is a hierarchical agent-based distributed learning system. The
system was designed to create a communication system and run-time environment for
Collective Data Mining. It employs local learning techniques to build models at each
distributed site and then moves these models to a centralized location. The models are
then combined to build a meta-model whose inputs are the outputs of the various models.
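A meta-model whose inputs are the outputs of local models can be sketched in a few lines. The example below uses majority voting, which is only the simplest possible combiner and is chosen here for brevity; it is not claimed to be the combination scheme used by BODHI or JAM. The class name MajorityVoteMetaModel and the toy threshold rules standing in for the local models are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical meta-model: its inputs are the labels predicted by the
// local models, and its output is the label that most of them agree on.
class MajorityVoteMetaModel {
    private final List<Function<double[], String>> localModels;

    MajorityVoteMetaModel(List<Function<double[], String>> localModels) {
        this.localModels = localModels;
    }

    String predict(double[] features) {
        // Collect one vote per local model.
        Map<String, Integer> votes = new HashMap<>();
        for (Function<double[], String> model : localModels) {
            votes.merge(model.apply(features), 1, Integer::sum);
        }
        // Return the label with the most votes; ties are broken arbitrarily.
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalStateException("no local models"));
    }

    public static void main(String[] args) {
        // Toy stand-ins for models learned at three different sites.
        Function<double[], String> siteA = x -> x[0] > 0.5 ? "fraud" : "ok";
        Function<double[], String> siteB = x -> x[1] > 0.5 ? "fraud" : "ok";
        Function<double[], String> siteC = x -> x[0] + x[1] > 0.8 ? "fraud" : "ok";
        MajorityVoteMetaModel meta =
                new MajorityVoteMetaModel(List.of(siteA, siteB, siteC));
        System.out.println(meta.predict(new double[] {0.9, 0.1}));
    }
}
```

A learned combiner, trained on the local models' outputs, would replace the vote in predict while keeping the same structure.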
The Kensington [13] architecture is based on a distributed component environment
located on different nodes of a generic network, such as the Internet or an intranet.
Kensington provides different components such as user-oriented components,
application servers and third-level servers. It wraps the analysis algorithms as
Enterprise JavaBeans components. PaDDMAS [8] is a Parallel and Distributed Data
