
Mining Distributed Evolving Data Streams using Fractal GP Ensembles

by Gianluigi Folino, Clara Pizzuti, Giandomenico Spezzano
Proceedings of the 10th European Conference on Genetic Programming (2007)

Abstract

A Genetic Programming based boosting ensemble method for the classification of distributed streaming data is proposed. The approach handles flows of data coming from multiple locations by building a global model obtained by aggregating the local models coming from each node. A main characteristic of the presented algorithm is its adaptability in the presence of concept drift. Changes in data can cause serious deterioration of the ensemble performance. Our approach is able to discover changes by adopting a strategy based on self-similarity of the ensemble behavior, measured by its fractal dimension, and to revise itself by promptly restoring classification accuracy. Experimental results on a synthetic data set show the validity of the approach in maintaining an accurate and up-to-date GP ensemble.


Institute for High Performance Computing and Networking, CNR-ICAR
Via P. Bucci 41C
87036 Rende (CS), Italy
{folino,pizzuti,spezzano}@icar.cnr.it
1 Introduction
Ensemble learning algorithms [1,5,2,8] based on Genetic Programming (GP)
[11,16,12,3,7] have been attracting increasing interest in the research
community because of the improvements that GP obtains when enriched with these
methods. These approaches have been applied to many real-world problems
and assume that all training data is available at once. However, in the last
few years, many organizations have been collecting tremendous amounts of data
that arrive in the form of continuous streams. Credit card transaction flows,
telephone records, sensor network data, and network event logs are just some
examples of streaming data. Processing this kind of data poses two main
challenges to existing data mining methods. The first concerns performance and
the second adaptability.
Many data stream algorithms have been developed over the last decade for
processing and mining data streams that arrive at a single location or at multiple
locations. Some of these algorithms, known as centralized data stream mining
(CDSM) algorithms, require that the data be sent to one single location before
processing. These algorithms, however, are not applicable in cases where the
data, computation, and other resources are distributed and cannot or should
not be centralized for a variety of reasons, e.g. low bandwidth, security, privacy
issues, and load balancing. In many cases the cost of centralizing the data can
be prohibitive and the owners may have privacy constraints. Unlike traditional
centralized systems, distributed data mining (DDM) systems offer a
fundamentally distributed solution for analyzing data without necessarily
requiring collection of the data at a single central site. Typically, DDM
algorithms involve local data analysis to extract knowledge structures
represented in models and patterns, followed by the generation of a global model
through the aggregation of the local results.
M. Ebner et al. (Eds.): EuroGP 2007, LNCS 4445, pp. 160–169, 2007.
© Springer-Verlag Berlin Heidelberg 2007
The ensemble paradigm is particularly suitable to support the DDM model.
However, to extract knowledge from streaming information the ensemble must
adapt its behavior to changes that occur in the data over time.
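The pattern of local analysis followed by aggregation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-feature threshold rule stands in for whatever learner a node actually runs (the paper uses GP classifiers), majority voting is just one possible aggregation rule, and the names train_local_model and aggregate are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence

Sample = Sequence[float]
Classifier = Callable[[Sample], int]


def train_local_model(samples: List[Sample], labels: List[int]) -> Classifier:
    """Hypothetical node-local learner: a threshold on feature 0.

    Assumes both classes occur locally and class 1 has the larger mean.
    """
    pos = [x[0] for x, y in zip(samples, labels) if y == 1]
    neg = [x[0] for x, y in zip(samples, labels) if y == 0]
    # Place the decision threshold midway between the two class means.
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: 1 if x[0] > threshold else 0


def aggregate(local_models: List[Classifier]) -> Classifier:
    """Build a global model that returns the majority vote of the local models."""
    def global_model(x: Sample) -> int:
        votes = Counter(model(x) for model in local_models)
        return votes.most_common(1)[0][0]
    return global_model


# Each node sees only its own data; only the trained models are shared.
node_data = [
    ([[0.1], [0.2], [0.8], [0.9]], [0, 0, 1, 1]),
    ([[0.0], [0.3], [0.7], [1.0]], [0, 0, 1, 1]),
    ([[0.2], [0.4], [0.6], [1.0]], [0, 0, 1, 1]),
]
global_model = aggregate([train_local_model(s, l) for s, l in node_data])
print(global_model([0.9]), global_model([0.1]))
```

The raw samples never leave their node in this sketch, which is the property that makes the DDM setting attractive when bandwidth or privacy rules out centralization.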
Incremental or online methods [9,18] are one approach to supporting adaptive
ensembles on evolving data streams. These methods build a single model that
represents the entire data stream and continuously refine that model as data
flows in. However, maintaining a single up-to-date model might prevent valuable
information from being used, since previously trained classifiers have been
discarded. Furthermore, incremental methods are not able to capture new trends
in the stream. In fact, traditional algorithms assume that data is static, i.e.
a concept, represented by a set of features, does not change because of
modifications of the external environment. In the above mentioned applications,
instead, a concept may drift for several reasons, for example sensor failures or
increases in telephone or network traffic. Concept drift can cause serious
deterioration of the ensemble performance, and thus detecting it makes it
possible to design an ensemble that is able to revise itself and promptly
restore its classification accuracy.
Another approach to mining evolving data streams is to capture changes in data
by measuring online accuracy deviation over time and deciding to recompute the
ensemble if the deviation has exceeded a pre-specified threshold. These methods
are more effective and make it possible to handle the concept drift problem in
order to capture time-evolving trends and patterns in the stream.
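A minimal sketch of such a threshold-based monitor is given here, assuming a sliding window of recent predictions and a baseline accuracy taken from the first full window. The class name AccuracyDriftMonitor and the parameters window and threshold are hypothetical and are not taken from the paper.

```python
from collections import deque


class AccuracyDriftMonitor:
    """Signal drift when windowed accuracy drops too far below a baseline."""

    def __init__(self, window: int = 100, threshold: float = 0.15) -> None:
        self.window = window
        self.threshold = threshold            # pre-specified allowed deviation
        self.recent = deque(maxlen=window)    # 1.0 = correct, 0.0 = wrong
        self.baseline = None                  # accuracy of the first full window

    def update(self, correct: bool) -> bool:
        """Record one prediction outcome; return True if drift is signalled."""
        self.recent.append(1.0 if correct else 0.0)
        if len(self.recent) < self.window:
            return False                      # not enough evidence yet
        accuracy = sum(self.recent) / self.window
        if self.baseline is None:
            self.baseline = accuracy
            return False
        if self.baseline - accuracy > self.threshold:
            # Deviation exceeded: the caller should recompute the ensemble.
            self.baseline = None
            self.recent.clear()
            return True
        return False


monitor = AccuracyDriftMonitor(window=10, threshold=0.2)
outcomes = [True] * 10 + [False] * 5          # accuracy collapses after step 10
signals = [monitor.update(ok) for ok in outcomes]
print(signals.index(True))
```

After a signal the monitor resets its baseline, so the rebuilt ensemble is judged against its own fresh accuracy rather than against the pre-drift level.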
In this paper we propose a distributed data stream mining approach based
on the adoption of an ensemble learning method to aggregate models trained
on distributed nodes, and enriched with a change detection strategy to reveal
changes in evolving data streams. We present an adaptive GP boosting ensemble
algorithm for classifying data streams that maintains an accurate and up-to-date
ensemble of classifiers for continuous flows of data with concept drifts.
The algorithm uses a DDM approach where not only is the data distributed, but
it is also non-stationary and arrives in the form of multiple streams.
The method is efficient since each node of the network works with its local
data and communicates the local model it computes to the other peer nodes
to obtain the results. A main characteristic of the algorithm is its ability to
discover changes by adopting a strategy based on self-similarity of the ensemble
behavior, measured by its fractal dimension, and to revise itself by promptly
restoring classification accuracy. Experimental results on a synthetic data set
show the validity of the approach in maintaining an accurate and up-to-date GP
ensemble.
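This excerpt does not say how the fractal dimension is estimated. As an illustration only, the sketch below uses a box-counting estimate, which is one common way to quantify self-similarity of a point set. The function name box_counting_dimension, the choice of scales, and the assumption that points lie in the unit hypercube are all hypothetical.

```python
import math
from typing import Iterable, Sequence


def box_counting_dimension(points: Iterable[Sequence[float]],
                           scales: Sequence[int] = (2, 4, 8, 16, 32)) -> float:
    """Estimate the box-counting dimension of points in the unit hypercube.

    For each scale k the unit cube is cut into k boxes per axis, the occupied
    boxes are counted, and the dimension is taken as the least-squares slope
    of log(count) against log(k).
    """
    pts = list(points)
    xs, ys = [], []
    for k in scales:
        # Map every point to the integer index of the box that contains it.
        boxes = {tuple(min(int(c * k), k - 1) for c in p) for p in pts}
        xs.append(math.log(k))
        ys.append(math.log(len(boxes)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Sanity checks on sets whose dimension is known: a line and a filled square.
line = [(i / 1000, i / 1000) for i in range(1000)]
square = [((i + 0.5) / 64, (j + 0.5) / 64) for i in range(64) for j in range(64)]
print(box_counting_dimension(line), box_counting_dimension(square))
```

Under this reading, a detector could track the estimate over successive windows of ensemble behavior and treat a marked change as a sign of drift, but that use is an assumption rather than a description of the authors' procedure.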
