
A framework for mining evolving trends in Web data streams using dynamic learning and retrospective validation

by O Nasraoui, C Rojas, C Cardona
Computer Networks (2006)

Abstract

The expanding and dynamic nature of the Web poses enormous challenges to most data mining techniques that try to extract patterns from Web data, such as Web usage and Web content. While scalable data mining methods are expected to cope with the size challenge, coping with evolving trends in noisy data in a continuous fashion, without any unnecessary stoppages and reconfigurations, is still an open challenge. This dynamic, single-pass setting can be cast within the framework of mining evolving data streams. The harsh restrictions imposed by the "you only get to see it once" constraint on stream data call for different computational models, which may furthermore bring some interesting surprises in the behavior of some well-known similarity measures during clustering, and even during validation. In this paper, we study the effect of similarity measures on the mining process and on the interpretation of the mined patterns under the harsh single-pass requirement. We propose a simple similarity measure that has the advantage of explicitly coupling the precision and coverage criteria to the early learning stages. Even though the cosine similarity and its close relatives, such as the Jaccard measure, have been prevalent in the majority of Web data clustering approaches, they may fail to explicitly seek profiles that achieve high coverage and high precision simultaneously. We also formulate a validation strategy and adapt several metrics rooted in information retrieval to the challenging task of validating a learned stream synopsis in dynamic environments.
Our experiments confirm that the MinPC similarity generally performs better than the cosine similarity, and that its advantage can be expected to be more pronounced for data sets that are more challenging in terms of the amount of noise and/or overlap, and in terms of the level of change in the underlying profiles/topics (known sub-categories of the input data) as the input stream unravels. In our simulations, we study the task of mining and tracking trends and profiles in evolving text and Web usage data streams in a single pass, and under different trend sequencing scenarios.
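To make the contrast between these measures concrete, the following Python sketch compares them on a binary session and a binary profile, each represented as a set of URLs. It rests on assumptions that the abstract does not state: precision is taken as the fraction of profile items present in the session, coverage as the fraction of session items present in the profile, and MinPC as the minimum of the two (a guess based on its name). The function name `overlap_measures` and the sample URLs are hypothetical.

```python
from math import sqrt


def overlap_measures(session: set, profile: set) -> dict:
    """Compare a binary session with a binary profile (both sets of URLs)."""
    if not session or not profile:
        raise ValueError("session and profile must be non-empty")
    common = len(session & profile)
    # ASSUMPTION: these definitions are illustrative, not taken from the paper.
    precision = common / len(profile)  # share of the profile seen in the session
    coverage = common / len(session)   # share of the session explained by the profile
    return {
        "cosine": common / sqrt(len(session) * len(profile)),
        "jaccard": common / len(session | profile),
        "precision": precision,
        "coverage": coverage,
        # ASSUMPTION: MinPC read as min(precision, coverage), based on its name.
        "min_pc": min(precision, coverage),
    }


# A nine-page session matched against a one-page profile it fully contains.
session = {"/a", "/b", "/c", "/d", "/e", "/f", "/g", "/h", "/i"}
profile = {"/a"}
scores = overlap_measures(session, profile)
for name, value in scores.items():
    print(f"{name:9s} {value:.3f}")
```

For binary sets the cosine equals the geometric mean of precision and coverage, so it can never fall below their minimum. In this example the one-page profile receives a cosine of about 0.33 although it covers only one ninth of the session, which is the kind of imbalance a minimum-based measure penalizes directly.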


Available from linkinghub.elsevier.com
“You cannot step twice into the same stream. For as you are stepping in, other waters are ever flowing on to you.” HERACLITUS, c. 535–475 BC, quoted by Plato.
* Corresponding author.
E-mail addresses: olfa.nasraoui@louisville.edu (O. Nasraoui), c.rojas@louisville.edu (C. Rojas), ccardona@magnify.com (C. Cardona).
1 This research was done while C. Cardona was at the University of Memphis.
Computer Networks 50 (2006) 1488–1512
1389-1286/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
Olfa Nasraoui a,*, Carlos Rojas a, Cesar Cardona b,1
a Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY 40292, United States
b Magnify Inc., Chicago, United States
Available online 27 December 2005
doi:10.1016/j.comnet.2005.10.021
www.elsevier.com/locate/comnet
Keywords: Mining evolving data streams; Web clickstreams; Web mining; Text mining; User profiles

1. Introduction

The Web has been a relentless generator of data that comes in a variety of forms, ranging from Web content data that forms the substance of most Web documents, to the daily trails left by visitors as they surf through a Website, also known as Web usage data. Hidden in this data often lurk interesting knowledge or patterns, such as Web user access trends or profiles, that can be used to achieve various objectives, including supporting customer relationship management and personalization of the user’s experience on a Website.
Recently, data mining techniques have been applied to extract usage patterns from Web log data [3,6,18–21,24–26,29,30]. Most of these efforts have proposed using various data mining or machine learning techniques to model and understand Web user activity. In [29], clustering was used to segment user sessions into clusters or profiles that can later form the basis for personalization. In [21], the notion of an adaptive Website was proposed, where the user’s access pattern can be used to automatically synthesize index pages. The work in [6] is based on using association rule discovery as the basis for modeling Web user activity, while the approach proposed in [3] used Markov Random Fields to model Web navigation patterns for the purpose of prediction. The work in [30] proposed building data cubes from Web log data, and later applying online analytical processing (OLAP) and data mining on the cube model. [25] presents a complete Web Usage Mining (WUM) system that extracts patterns from Web log data with a variety of data mining techniques. New relational clustering techniques with robustness to noise were used to discover user profiles that can overlap in [20,19], while a density-based evolutionary clustering technique is proposed to discover multi-resolution and robust user profiles in [18]. The K-Means algorithm was used in [24] to segment user sequences into different clusters. An extensive survey of different approaches to Web usage mining can be found in [26]. It is interesting to note that an incremental way to update a Web usage mining model was proposed in [3]. In this approach, the user navigation records are modeled by a hypertext probabilistic grammar (HPG) whose higher probability generated strings correspond to the user’s preferred trails. The model had the advantages of being self-contained (i.e., has all statistics needed to mine all the data accumulated), as well as compact (the model was in the form of a tree whose size depends on the number of items instead of the number of users, which enhances scalability). The HPG model was incremental, in the sense that when more log data became available, it could be incorporated in the model without the need of rebuilding the grammar from scratch.
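The incremental property described here can be illustrated with a much simpler stand-in than the HPG of [3]: a table of page-to-page transition counts that absorbs each new session as it arrives. This is only a sketch under stated assumptions, not the grammar or tree structure of [3]; the class name `TransitionModel`, the `<start>` marker, and the sample sessions are hypothetical.

```python
from collections import defaultdict


class TransitionModel:
    """First-order page-transition counts that can be updated at any time.

    ASSUMPTION: a simplified illustration of incremental updating,
    not the hypertext probabilistic grammar itself.
    """

    START = "<start>"  # hypothetical marker for the beginning of a session

    def __init__(self):
        # counts[src][dst] = number of observed transitions src -> dst
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, session):
        """Fold one session (a list of pages) into the existing counts."""
        prev = self.START
        for page in session:
            self.counts[prev][page] += 1
            prev = page

    def probability(self, src, dst):
        """Estimate P(dst | src) from the counts gathered so far."""
        row = self.counts.get(src, {})
        total = sum(row.values())
        return row.get(dst, 0) / total if total else 0.0


model = TransitionModel()
model.update(["/home", "/products", "/cart"])
model.update(["/home", "/about"])
before = model.probability("/home", "/products")

# New log data arrives later; nothing is rebuilt, only counts are incremented.
model.update(["/home", "/products"])
after = model.probability("/home", "/products")
print(before, after)
```

The storage grows with the number of distinct pages and transitions rather than with the number of users, which mirrors the compactness argument made for the HPG, and each update touches only the counts along the new session.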
Unfortunately, with the exception of [3] (which provided a scalable way to model Web user navigation, but did not explicitly address the change/evolvability aspect of this data), all the aforementioned methods assume that the entire pre-processed Web session data could reside in main memory. This can be a disadvantage for systems with limited main memory when the Web session data is huge, since extensive I/O operations would be needed to shuffle chunks of data in and out, thus compromising scalability. Today’s Websites are a source of an exploding amount of clickstream data that can put the scalability of any data mining technique into question.
Moreover, the Web access patterns on a Website are very dynamic in nature, due not only to the dynamics of Website content and structure, but also to changes in the users’ interests, and thus their navigation patterns. The access patterns can be observed to change depending on the time of day, day of week, and according to seasonal patterns or other external events. As an alternative to locking the state of the Web access patterns in a frozen state depending on when the Web log data was collected, an intelligent Web usage mining system should be able to continuously learn in the presence of such conditions without ungraceful stoppages, reconfigurations, or restarting from scratch. For all these reasons, Web usage data should be considered as a reflection of a dynamic environment which therefore requires dynamic learning of the user access patterns. This dynamic setting can be cast within the framework of mining evolving data streams. Data streams are massive data sets that arrive with a throughput so high that the data can only be analyzed sequentially and in a single pass. The discovery of useful patterns from data streams is referred to as stream data mining. In particular, a recent explosion of applications generating and analyzing data streams has added new unprecedented challenges for clustering algorithms if they are to be able to track changing clusters in streams using only the new data points because storing past data is not even an option [1,2,5,10]. Because most data streams unleash data points or measurements in a non-arbitrary order, they are inherently attached to a temporal aspect, meaning that the patterns that could be discovered from them follow dynamic
