
Temporal Structure Learning for Clustering Massive Data Streams in Real-Time

by Michael Hahsler, Margaret H Dunham
Learning (2011)

Michael Hahsler† Margaret H. Dunham‡
Abstract
This paper describes one of the first attempts to model the
temporal structure of massive data streams in real-time using
data stream clustering. Recently, many data stream clustering
algorithms have been developed which efficiently find
a partition of the data points in a data stream. However,
these algorithms disregard the information represented by
the temporal order of the data points in the stream which for
many applications is an important part of the data stream.
In this paper we propose a new framework called Temporal
Relationships Among Clusters for Data Streams (TRACDS)
which allows us to learn the temporal structure while clus-
tering a data stream. We identify, organize and describe the
clustering operations which are used by state-of-the-art data
stream clustering algorithms. Then we show that by defining
a set of new operations to transform Markov Chains with
states representing clusters dynamically, we can efficiently
capture temporal ordering information. This framework al-
lows us to preserve temporal relationships among clusters for
any state-of-the-art data stream clustering algorithm with
only minimal overhead.
To investigate the usefulness of TRACDS, we evaluate
the improvement of TRACDS over pure data stream cluster-
ing for anomaly detection using several synthetic and real-
world data sets. The experiments show that TRACDS is
able to considerably improve the results even if we intro-
duce a high rate of incorrect time stamps which is typical
for real-world data streams.
Keywords: Data stream, clustering, temporal structure,
Markov chain
1 Problem Specification
Algorithms for clustering data streams [4,5,8,13,23,24,
26–28] have focused on many characteristics of stream
data (e.g., limited storage but potentially unbounded
size of data stream, single pass over the data, near real-
time processing, concept drift), but the fact that data
arrives in a temporal order, which is perhaps one of the
most important aspects of the data stream, is typically
not directly used.

* To appear in the SIAM Conference on Data Mining 2011 (SDM11)
† Southern Methodist University, mhahsler@lyle.smu.edu
‡ Southern Methodist University, mhd@lyle.smu.edu

Figure 1: Stream Clustering: (a) Partitioning of a data
stream using standard (data stream) clustering neglects
the temporal aspect of the data. (b) With TRACDS
temporal relationships between clusters are learned
dynamically as an evolving Markov chain (transitions
between clusters are represented by arcs).

Figure 1(a) illustrates what happens during data
stream clustering. The data stream is represented by
the ordered sequence of shapes in the upper half of
the figure. Different shapes represent events that are
similar enough to be put in the same cluster. In
the left lower half, the larger symbols represent the
clusters annotated with the number of events assigned
to each cluster. Since the data volume produced by a
data stream is typically unbounded, it is infeasible to
store each event assigned to a cluster. Rather, cluster-
wide summaries called cluster features or synopses
that include descriptive statistics for a cluster (mean,
variance, etc.) are stored. During this summarization,
any timestamp available for events is either lost or
treated as if it were any other attribute. For example,
when in the stream in Figure 1 a Hexagon event occurs,
the next event is a Circle event 4/5 = 80% of the
time (ignoring last event); a Circle is followed by a
Hexagon 100% of the time; and a Triangle always
follows a Triangle. Although the temporal order of
events is one of the salient concepts of stream data, this
ordering relationship of the clusters is lost completely
after clustering.
One of the major applications using data stream
clustering is rare event (anomaly) detection. When
temporal order information is discarded during cluster
formation, we might sacrifice important indicators for
detecting or predicting rare events (e.g., intrusion in a
computer network, flooding, heat waves, hurricanes, or
eruptions of volcanoes based on climate data). For example,
for intrusion detection in a large computer network we
may observe a user change from behavior type A to
behavior type B, both represented by clusters labeled
as non-suspicious behavior. However, the transition
from A to B might be extremely unusual and itself
indicate an intrusion event. This can only be detected
if the temporal structure of the data is preserved in the
clustering.
We argue that, for many applications, temporal and
ordering aspects should be treated as an integral part of
clustering events in data streams. However, we are not
aware of research that combines clustering with
preserving the temporal structure
in the data stream that meets the requirements for data
stream processing (single-pass over the data, only store
synopses for clusters, etc.).
In this paper we present the TRACDS framework
which provides a transparent way to add temporal or-
dering information in the form of a dynamically chang-
ing Markov Chain to data stream clustering algorithms
(see Figure 1(b)). The framework generalizes, for data
stream clustering, the ideas developed by Dunham et al.
for the Extensible Markov Model (EMM) [11]. In this
paper we identify a set of clustering operations which is
sufficient to describe state-of-the-art data stream
clustering algorithms and we develop a complete set of
TRACDS operations to efficiently manipulate Markov
Chains to learn temporal information for clustering.
This clean separation between clustering and TRACDS
operations makes the framework directly applicable to
any state-of-the-art data stream clustering algorithm.
We will show that the framework can be implemented to
add only minimal overhead to the data stream cluster-
ing algorithm and thus it is suitable to handle massive
data streams.
While data stream clustering and Markov chains
are well-known techniques, this paper provides two novel
contributions.
1. This paper is one of the first to attempt to model
the temporal structure of massive data streams in
real-time.
2. We combine data stream clustering and Markov
chains which are dynamically changed by a set of
operations in a new and efficient way.
This paper is organized as follows. We start with
related work in Section 2. In Section 3 we formally
introduce the TRACDS framework. In Section 4 we
use several synthetic and real-world data sets to ana-
lyze the improvement of TRACDS over pure clustering
for anomaly detection. Finally, Section 5 discusses di-
rections for future work.
2 Related Work
2.1 Data Stream Clustering. Clustering, the
grouping of similar objects, has been around since the
beginning of human existence. It was not until the
1800s, however, that formal algorithms were developed.
Excellent surveys of clustering techniques have been
published that summarize these developments (e.g.,
[14]). Traditional clustering is often viewed as a par-
titioning or segmentation of a static data set where the
order of observation has no relevance. Dynamic cluster-
ing techniques were developed to incrementally cluster
data and take into account the temporal nature of data
by specifically looking at how the clusters change as
data arrives [15].
With the advent of the streaming data concept,
many clustering researchers began to adapt clustering
techniques to streaming data. The salient features of
data streams are that data continues to arrive and that
it is impractical to keep all of the data. Data stream
management techniques look at various strategies (such
as time windows and snapshots) to handle unbounded
streams. Clustering techniques must be incremental
and usually consider some sort of dynamic updating
of the clusters. Barbara [6] identified the requirements
for clustering of data streams to include: compactness;
"fast, incremental processing" and identification of
outliers.
An early algorithm called STREAM [13] partitions
the data stream into segments, clusters each segment in-
dividually by solving the k-medians problem and then
iteratively reclusters the resulting centers to obtain a
final clustering. Since data streams can potentially grow
unbounded, data stream clustering algorithms do not
store all data points but rather use a summary or syn-
opsis consisting of aggregate statistics to store informa-
tion about each cluster. The idea was introduced in
the non-data stream clustering algorithm BIRCH [30]
where the summaries were called cluster feature vectors.
The data stream clustering algorithm CluStream [4] in-
troduces micro-clusters represented by summary statis-
tics. Micro-clusters are handled by the algorithm online.
However, to create a final clustering, micro-clusters have
to be merged based on their summary statistics, resulting
in a simpler clustering. As this reclustering is
performed offline, it can be done with any regular clustering
algorithm.
An important feature of data streams is that their
structure may change over time. Most of the following
clustering algorithms handle change by using an expo-
nential fading model to reduce the weight of older data
points. An exponential fading is used since it can easily
be applied directly to most summary statistics. Den-
Stream [8] maintains micro-clusters in real time and uses
a variant of the density-based GDBSCAN [21] to produce
a final clustering for users. HPStream [5] finds
clusters that are well defined in different subsets of the
dimensions of the data. WSTREAM [23] uses a kernel
density estimation to find rectangular windows to
represent clusters. The windows can move, contract, expand
and be merged over time. E-STREAM [27] adds cluster
splitting by maintaining histograms for each cluster and
dimension. While the popular density-based clustering
method OPTICS [18] is not suitable for data streams,
a variant called OpticsStream [24] can be used to visu-
alize the clustering structure and structure changes in
data streams. Recently, the density-based data stream
clustering algorithms D-Stream [26] and MR-Stream [28]
were developed. D-Stream uses an online component to
map each data point into a predefined grid and then
uses an offline component to cluster the grid based on
density. MR-Stream facilitates the discovery of clusters
at multiple resolutions by using a grid of cells that can
dynamically be sub-divided into more cells using a tree
data structure. Recently, data stream clustering algo-
rithms for massive data [2] and uncertain data [3] have
also been introduced.
All these approaches center on finding clusters of
data points given some notion of similarity but neglect
the temporal ordering structure of the data which might
be crucial to understanding the underlying data.
2.2 Incorporating Temporal Order. There have
been several research efforts to incorporate temporal
order information into clustering. We briefly review
some of these concepts prior to introducing what we
propose for TRACDS. The term temporal clustering is
generally used to mean applying clustering to time series
data [29]. Typically either several time series are clustered
to find sets of similar series or subsequences are
clustered to find similar parts in time series [16]. Since
we are interested in clustering individual data points
while preserving the temporal order, neither of these
approaches is applicable. The temporal structure between
data points in time series is typically modeled
by auto-regressive models (e.g., ARIMA), which are more
difficult to apply to multivariate data [25] and typically do not
honor the restrictions for data streams (e.g., single pass
over the data). We are not aware of research which
combines auto-regressive models with clustering massive
multivariate data streams while preserving temporal
structure efficiently.
Evolutionary clustering [9] considers the problem of
clustering data over time with the goal to trade off the
two potentially conflicting criteria of representing the
current data as faithfully as possible by the clustering
while preventing dramatic shifts in the clustering from
one timestep to the next. Similarly, the MONIC
framework [22] uses the term cluster transitions to refer
to the fact that a cluster may change over time (e.g.,
change its density, move or merge with another cluster).
While these approaches deal with the fact that there is
a need to detect and evaluate these changes, they still
largely ignore the order information inherent in the data
stream.
C-TREND [1] captures the temporal ordering concept
among clusters with transitions between clusters
obtained based on predefined temporal partitions. It
tries to tackle a problem similar to what we address in
this paper; however, it suffers from many restrictions.
In particular, C-TREND is not suited for data streams
as a fixed number of partitions must be created before
classical clustering algorithms, which need all data, are
used. Also, only transitions between partitions in time
are identified and all transitions between clusters within
each partition are ignored.
In the following we introduce TRACDS, a framework
which is able to efficiently capture temporal order
information for data stream clustering.
3 Introduction to TRACDS
Although TRACDS does not implement data stream
clustering itself, we have to introduce the concepts
before moving on to our framework. Clustering is
typically thought of in terms of partitioning a data
set consisting of observations into several (typically k,
a predefined number) groups of similar observations
where observations in different groups are less similar.
Formally a clustering can be defined as:
Definition 3.1. (Clustering) A clustering $\zeta$ is a
partitioning of a data set $D$ into $k$ subsets $C_1, C_2, \dots, C_k$
called clusters such that
(1) $C_i \cap C_j = \emptyset$ for all $i \neq j$,
(2) $\bigcup_{i=1}^{k} C_i \subseteq D$, and
(3) the value of a specified cost function $f_c(\zeta)$ is
minimized (typically by a heuristic).
The requirement that clusters do not share data
points means that we deal with crisp and not soft par-
titions where a data point could be assigned to several
partitions typically with varying degree of membership.
In the above definition we do not require that all data
points in D are assigned to a cluster. Some data points
might be labeled as outliers and stay unassigned.
Data stream clustering algorithms take into account
that for many applications data is arriving continuously
and that the number of clusters may not be known
in advance. To deal with the fact that data streams
may produce more data than is practical to store, data
stream clustering algorithms work with synopses for
clusters (sometimes called clustering features or CFs)
instead of keeping all the data points.
Definition 3.2. (Data Stream Clustering) At
each point in time $t$ a data stream clustering $\zeta_t$ is
a partitioning, as defined in Definition 3.1, of $D_t$,
the data seen thus far, into $k$ components. However,
instead of all data points assigned to clusters
$C_1, C_2, \dots, C_k$ only synopses $\tilde{c}_1, \tilde{c}_2, \dots, \tilde{c}_k$ are stored and
$k$ is allowed to change over time. The synopses $\tilde{c}_i$ with
$i = 1, 2, \dots, k$ contain summary information about the
size, distribution and location of the data points in $C_i$.
Still, data stream clustering algorithms only partition
the data and temporal aspects (e.g., order or timestamps)
are not preserved in the clustering (with the
exception that old data might be removed or weighted to
be able to reflect concept drift). We propose to model
the temporal structure between clusters as an evolving
Markov Chain (MC) which at each point in time
represents a regular time-homogeneous MC, but which is
updated using a set of well-defined operations when new
data is available. In the following we will restrict the
discussion to first order evolving MCs. First order MCs
work well as an approximation for many applications;
however, as for regular MCs, it is possible to extend the
idea to higher order models [17].
A (first order) discrete parameter Markov Chain
[19] is a special case of a Markov Process in discrete time
and with a discrete state space. It is characterized by a
sequence of random variables $\{X_t\} = \langle X_1, X_2, X_3, \dots \rangle$
with $t$ being the time index. All random variables share
the same domain $\mathrm{dom}(X_t) = S = \{s_1, s_2, \dots, s_k\}$, a set
called the state space. The Markov property states that
for a first order model the next state is only dependent
on the current state. Formally,
$$P(X_{t+1} = s_{t+1} \mid X_t = s_t, \dots, X_1 = s_1) = P(X_{t+1} = s_{t+1} \mid X_t = s_t),$$
where $s_1, \dots, s_t, s_{t+1} \in S$.
For simplicity we use for transition probabilities
the notation $a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$, $i, j = 1, 2, \dots, k$,
where it is appropriate. A time-homogeneous
MC can be represented as a $k \times k$ transition matrix $A = (a_{ij})$
containing the transition probabilities from each
state to all other states. Another representation is as
a graph with the states as vertices and the arcs labeled
with transition probabilities. Transition probabilities
can be easily estimated from the observed transition
counts $c_{ij}$ (transitions from state $s_i$ to $s_j$) using the
maximum likelihood method by $a_{ij} = c_{ij} / n_i$ where
$n_i = \sum_{j=1}^{k} c_{ij}$.
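
The maximum likelihood estimate above amounts to normalizing each row of the count matrix. The following is a minimal sketch, not the authors' implementation; the function name and the handling of states without outgoing transitions (an all-zero row) are assumptions made for illustration. The counts are those reported for Example 1 in Figure 2(a).

```python
def transition_probabilities(counts):
    """Row-normalize a square matrix of counts c_ij into a_ij = c_ij / n_i."""
    probs = []
    for row in counts:
        n_i = sum(row)  # total number of observed transitions leaving state i
        if n_i == 0:
            # Assumption: a state without outgoing transitions gets a zero row.
            probs.append([0.0] * len(row))
        else:
            probs.append([c_ij / n_i for c_ij in row])
    return probs


# Transition counts from Figure 2(a) (states 1 to 4).
C = [
    [0, 1, 0, 0],
    [0, 0, 3, 0],
    [0, 1, 0, 2],
    [0, 1, 0, 1],
]
A = transition_probabilities(C)
```

With these counts, state 3 moves to state 2 with estimated probability 1/3 and to state 4 with probability 2/3.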
MCs are very useful to keep track of temporal
information using the Markov property as a relaxation.
With a MC it is easy to predict the probability of future
states or predict missing values based on the temporal
structure of the data. It is also easy to calculate the
probability of a new sequence of length $l$ given a MC as
the product of transition probabilities:
$$P(X_l = s_l, X_{l-1} = s_{l-1}, \dots, X_1 = s_1) = P(X_1 = s_1) \prod_{i=1}^{l-1} P(X_{i+1} = s_{i+1} \mid X_i = s_i)$$
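
As a small illustration of this product, the sketch below scores a state sequence against a transition matrix. The two-state chain, the initial distribution and the function name are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sequence_probability(states, initial, A):
    """Probability of a state sequence under a first-order, time-homogeneous MC."""
    if not states:
        return 1.0
    p = initial[states[0]]  # P(X_1 = s_1)
    for current, nxt in zip(states, states[1:]):
        p *= A[current][nxt]  # P(X_{i+1} = s_{i+1} | X_i = s_i)
    return p


# Hypothetical two-state chain (state indices 0 and 1), for illustration only.
A = [[0.9, 0.1],
     [0.5, 0.5]]
initial = [0.5, 0.5]
p = sequence_probability([0, 0, 1, 1], initial, A)  # 0.5 * 0.9 * 0.1 * 0.5
```

For long sequences, summing log-probabilities instead of multiplying avoids numerical underflow.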
Data streams typically contain dimensions with
continuous data and/or have discrete dimensions with
a large number of domain values [2]. In addition, the
data may continue to arrive resulting in a large number
of observations. For the MC we use in TRACDS, data
points have to be mapped onto a manageable number of
states. This mapping is already done by the used data
stream clustering algorithm. The clustering algorithm
assigns each point to a cluster (or micro-cluster) which
is represented by a state in the MC.
Definition 3.3. (TRACDS) TRACDS is defined at
each point in time $t$ as a tuple $T = (S, C, s_c)$ (we omit
the subscript $t$ for readability), where $S$ is the set of $k$ states,
$C$ is the $k \times k$ transition count matrix specifying the MC
over the states in $S$ and $s_c \in S$ keeps track of the current
state, i.e., the state to which the last observation was
assigned. Given a data stream clustering $\zeta$, TRACDS
has the following properties:
(1) At each point in time $t$ there is a one-to-one
correspondence between the clusters in $\zeta$ and the
states $S$.
(2) The transition probability $a_{ij}$ estimated from the
transition counts in $C$ represents the probability that
given a data point in cluster $i$, the next data point
in the data stream will belong to cluster $j$ with
$i, j = 1, 2, \dots, k$.
(3) $T$ is created online in parallel to the data stream
clustering $\zeta$.
(4) An empty clustering with no data points is represented
by an empty $C$ with $S = \emptyset$ and we define $s_c$
to be $\epsilon$ to indicate that there is no current state.
In order to satisfy the properties in Definition 3.3
T has to evolve over time to reflect all changes to the
clustering. Note that since k in the clustering can
change over time also the number of states in T has
to adapt. In the following we will identify all operations
that data stream clustering algorithms perform and
present a set of well-defined operations, called TRACDS
operations, which are used to update T accordingly.
3.1 Data Stream Clustering Operations. The
MONIC framework [22] deals with the evolution of
clusters over time. It is not suitable for data streams
as it is not online and does not support relationships
between clusters at a given point in time. However,
it presents a technique which is independent of the
used clustering algorithm and in that sense is similar
to TRACDS. MONIC identifies so-called external and
internal cluster transitions reflecting the change from
a clustering at time point t to a later clustering at
t + 1 (e.g., cluster survives, cluster is absorbed, cluster
moves). Although these cluster transitions reflect the
changes of clusters between two points in time, they
are a good starting point to identify typical building
blocks (which we call clustering operations) of data stream
clustering algorithms. Such building blocks are, for
example, adding a new incoming data point to an
existing cluster or creating a new cluster. Formally a
clustering operation is defined as:
Definition 3.4. (Clustering Operation) A clustering
operation is defined as a function
$$\zeta_{t+1} = q(\zeta_t, x),$$
which is used by the data stream clustering algorithm to
update the clustering given additional information $x$ (a
new data point, the index of the cluster to be deleted,
etc.).
Any data stream clustering algorithm can be de-
scribed as a sequence of such clustering operations that
are triggered by the specifics of the clustering algorithm
itself. We identify the necessary clustering operations
for state-of-the-art data stream clustering algorithms in
Table 1. In the following we will define these operations.
Some clustering operations are triggered by the
data stream clustering algorithm when a new data point
is available for the data stream. The two typical operations
are:
• $q_{\mathrm{assign}}(\zeta, x)$: Assign the new data point $x$ to an
existing cluster. The clustering algorithm uses
the cluster summaries in $\zeta$ to find the appropriate
cluster $i$ and then updates $\tilde{c}_i$.
• $q_{\mathrm{create}}(\zeta, x)$: Create a new cluster. At some point
(e.g., if assigning a new data point to an existing
cluster is not appropriate) a new empty cluster
represented by $\tilde{c}_{k+1}$ is added to $\zeta$ and $k$ is incremented
by one.
Several operations can be triggered by other events.
For example by a clean-up process which is scheduled
at regular intervals or by the clustering algorithm when
it runs out of memory. These include:
• $q_{\mathrm{remove}}(\zeta, x)$: Remove a cluster. Here $x$ is $i$, the
index of the cluster to be removed. In this case the
associated summary $\tilde{c}_i$ is removed from $\zeta$ and $k$ is
decremented by one.
• $q_{\mathrm{merge}}(\zeta, x)$: Merge two clusters. Here input $x$
contains $i$ and $j$, the indices of two clusters to
be merged. First a new merged cluster is created
by combining the two summaries $\tilde{c}_i$ and $\tilde{c}_j$ in an
appropriate way. Then the two original clusters are
removed.
• $q_{\mathrm{fade}}(\zeta, x)$: Fade the cluster structure. This adapts
the cluster structure over time by reducing the
influence of older data points. The input $x$ is empty
in this operation as it is a clustering-wide function.
Since only a summary and not the original data
points are available, fading has to be done on this
summary information by updating each summary
$\tilde{c}_i$ for $i = 1, 2, \dots, k$. Typically (see [5, 8, 23, 24]) a
decay function $f(t) = 2^{-\lambda t}$ is used to specify the
weight of data points added $t$ timesteps in the past.
This fading can be done iteratively on summary
statistics if they exhibit the properties of additivity
and temporal multiplicity defined in [5].
• $q_{\mathrm{split}}(\zeta, x)$: Split a cluster. Given $i$, an input cluster
index, two new clusters, $j$ and $l$ with appropriate
summaries, $\tilde{c}_j$ and $\tilde{c}_l$, are created. Subsequently $i$
is removed.
Note that the actions of merging and fading imply
that the clustering summaries themselves should be
additive so that they can be combined during the merge
operation. This was a salient feature of the clustering
features proposed in BIRCH. If not additive, then some
known function must exist to combine them. Also
notice that a reclustering operation, which is used in
many stream clustering algorithms, is accomplished by
a series of merges. The split operation is currently only
supported by E-Stream [27] to split a large cluster up
into several smaller and denser clusters.
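
To make the role of additivity concrete, the following sketch uses a BIRCH-style summary consisting of a point count, a per-dimension linear sum and a per-dimension sum of squares. The class and method names are hypothetical, and the fading step simply scales every statistic by $2^{-\lambda}$ per timestep; this is one possible reading under these assumptions, not the synopsis of any particular algorithm in Table 1.

```python
from dataclasses import dataclass


@dataclass
class Synopsis:
    n: float           # (possibly faded) number of points in the cluster
    linear_sum: list   # per-dimension sum of the points
    square_sum: list   # per-dimension sum of squared values

    @classmethod
    def from_point(cls, x):
        return cls(1.0, list(x), [v * v for v in x])

    def merge(self, other):
        """Additivity: the summary of a union is the component-wise sum."""
        return Synopsis(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            [a + b for a, b in zip(self.square_sum, other.square_sum)],
        )

    def fade(self, lam, steps=1):
        """Scale every statistic by 2^(-lam * steps)."""
        w = 2.0 ** (-lam * steps)
        return Synopsis(
            self.n * w,
            [a * w for a in self.linear_sum],
            [a * w for a in self.square_sum],
        )

    def mean(self):
        return [a / self.n for a in self.linear_sum]


a = Synopsis.from_point([1.0, 2.0])
b = Synopsis.from_point([3.0, 4.0])
merged = a.merge(b)           # summary of both points
faded = merged.fade(lam=1.0)  # weight halves, the mean is unchanged
```

Because all three statistics are scaled by the same factor, fading lowers the weight of a cluster without moving its center.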
The exact procedure of how and why operations
like deleting and merging are executed is defined by the
data stream clustering algorithm used. For TRACDS it
is only important that we can define operations to keep
T consistent with the clustering.

Operation CluStream HPStream DenStream WSTREAM OpticsStream E-Stream D-Stream MR-Stream
q_assign x x x x x x x x
q_create x x x x x x x x
q_remove x x x x x x x x
q_merge offline offline x offline x offline offline
q_fade x x x x x x x
q_split x
Table 1: Clustering operations used by data stream clustering algorithms (marked with x). Offline in row q_merge
indicates that merging is only used as an offline reclustering step.

3.2 TRACDS Operations. As indicated earlier,
TRACDS operations are triggered by the stream clus-
tering operations stated above. Each clustering oper-
ation triggers a unique TRACDS operation which up-
dates T . As x is used to indicate the input to the stream
clustering operations, we use y to indicate the input to
the TRACDS operations. y is uniquely determined by
the clustering operation.
Definition 3.5. (TRACDS Operation) A
TRACDS operation is a function
$$T_{t+1} = r(T_t, y),$$
which is used to update the TRACDS data structure
using information $y$ provided by a clustering operation
(defined above).
In the following we will define for each clustering
operation q a unique TRACDS operation r. We use the
same subscript to identify which TRACDS operation
corresponds to which clustering operation.
• $r_{\mathrm{assign}}(T, y)$: Record a state transition. $y$ identifies
$s_i$, the state corresponding to the cluster the new
data point was assigned to in $\zeta$. If the current
state is known ($s_c \neq \epsilon$), update $C$ by setting
$c_{s_c,s_i} \leftarrow c_{s_c,s_i} + 1$. Finally, we set the current state
to the new state, $s_c \leftarrow s_i$.
• $r_{\mathrm{create}}(T, y)$: Create a new state. $y$ is empty in this
case. To represent the new cluster, we have to add
a state $s_{k+1}$ to $T$ by $S \leftarrow S \cup \{s_{k+1}\}$. We enlarge
$C$ by a row and a column for this state, and finally,
we set the current state to the newly added state,
$s_c \leftarrow s_{k+1}$.
• $r_{\mathrm{remove}}(T, y)$: Remove a state. To remove state $s_i$,
identified in $y$, let $S \leftarrow S \setminus \{s_i\}$ and remove
row $i$ and column $i$ in the transition count matrix
$C$. This deletes the state. If $s_c$ is $s_i$, then set the
current state to no state, $s_c \leftarrow \epsilon$.
• $r_{\mathrm{merge}}(T, y)$: Merge two states. To merge two
states $s_i$ and $s_j$, input in $y$, into a new state $s_m$,
the following steps are performed:
1. Create the new state $s_m$ (see $r_{\mathrm{create}}$ above).
2. Create the outgoing and incoming arcs for $s_m$:
for all $l \in \{1, 2, \dots, k\}$ let $c_{ml} \leftarrow c_{il} + c_{jl}$
and $c_{lm} \leftarrow c_{li} + c_{lj}$.
3. Delete the old states $s_i$ and $s_j$ (see $r_{\mathrm{remove}}$
above).
4. If $s_c$ is either $s_i$ or $s_j$, then set the current
state to the new state, $s_c \leftarrow s_m$.
• $r_{\mathrm{fade}}(T, y)$: Fade the transition probabilities which
represent the cluster structure. $y$ is empty in this
case. The fading strategy used on the cluster synopses
by the data stream clustering algorithm must
also be used on the transition count matrix $C$,
resulting in a fading effect consistent with the
clustering. Clustering algorithms typically use exponential
decay $f(t) = 2^{-\lambda t}$, which is multiplicative, and
repeatedly applying
$$C_{t+1} = 2^{-\lambda} C_t$$
results in the desired compounded fading effect.
• $r_{\mathrm{split}}(T, y)$: Split states. $y$ contains $s_i$, where $i$ is
the index of the cluster to be split. As with fading,
the splitting strategy used must be consistent with
the one implemented by the clustering algorithm.
Since only synopsis information is available for the
clustering as well as for the transition counts, only
some heuristic can be used here (e.g., assign the
transition counts proportionally to the number of
observations assigned to each new cluster).
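
The operations above can be sketched as a small class. This is a hedged illustration rather than the authors' implementation (which is part of the R package rEMM): the class name `Tracds`, the nested-dictionary storage and the use of `None` for "no current state" are assumptions. Two further assumptions are marked in the comments: `create` leaves the current state untouched so that the following `assign` records the transition from the previous state, which is the reading under which the counts of Example 1 match Figure 2(a), and `merge` follows step 2 literally for the states that existed before the merge. The split operation is omitted because it depends on a heuristic chosen together with the clustering algorithm.

```python
class Tracds:
    """Evolving Markov chain (S, C, s_c) stored as nested dicts of counts."""

    def __init__(self):
        self.counts = {}     # counts[i][j] = c_ij; the keys are the states S
        self.current = None  # s_c; None stands for "no current state"

    def create(self, state):
        # Add a row and a column of zeros for the new state.
        # Assumption: s_c is updated by the assign call that follows.
        self.counts[state] = {}
        for s in self.counts:
            self.counts[s].setdefault(state, 0.0)  # new column
            self.counts[state].setdefault(s, 0.0)  # new row

    def assign(self, state):
        # Record the transition s_c -> state, then move the current state.
        if self.current is not None:
            self.counts[self.current][state] += 1
        self.current = state

    def remove(self, state):
        # Drop row and column; forget s_c if it pointed to the removed state.
        del self.counts[state]
        for row in self.counts.values():
            del row[state]
        if self.current == state:
            self.current = None

    def merge(self, i, j, merged):
        self.create(merged)
        # Literal reading of step 2: l ranges over the states that existed
        # before the merge, so counts between i and j are not carried over.
        for l in [s for s in self.counts if s != merged]:
            self.counts[merged][l] = self.counts[i][l] + self.counts[j][l]
            self.counts[l][merged] = self.counts[l][i] + self.counts[l][j]
        was_current = self.current in (i, j)
        self.remove(i)
        self.remove(j)
        if was_current:
            self.current = merged

    def fade(self, lam):
        # One fading step: C <- 2^(-lam) * C
        w = 2.0 ** (-lam)
        for row in self.counts.values():
            for s in row:
                row[s] *= w


# Hypothetical stream of cluster labels, for illustration only.
t = Tracds()
for label in ["a", "b", "a", "b", "c"]:
    if label not in t.counts:
        t.create(label)
    t.assign(label)
t.merge("a", "b", "ab")
```

After the merge, the single transition from "b" to "c" is attributed to the merged state "ab", and the current state remains "c".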
The following example illustrates the use of some of
the TRACDS clustering operations.
Table 2: Sequence of operations for Example 1

Cluster      TRACDS      Manipulation                s_c
assignment   operation   of C
(initial)                C is 0 × 0                  ε
1            r_create    expand C to 1 × 1
             r_assign    no manipulation             1
2            r_create    expand C to 2 × 2
             r_assign    c_{1,2} ← c_{1,2} + 1       2
3            r_create    expand C to 3 × 3
             r_assign    c_{2,3} ← c_{2,3} + 1       3
2            r_assign    c_{3,2} ← c_{3,2} + 1       2
3            r_assign    c_{2,3} ← c_{2,3} + 1       3
4            r_create    expand C to 4 × 4
             r_assign    c_{3,4} ← c_{3,4} + 1       4
4            r_assign    c_{4,4} ← c_{4,4} + 1       4
2            r_assign    c_{4,2} ← c_{4,2} + 1       2
3            r_assign    c_{2,3} ← c_{2,3} + 1       3
4            r_assign    c_{3,4} ← c_{3,4} + 1       4

Example 1. (Create a TRACDS) A data stream
clustering algorithm starts with an empty clustering $\zeta$
and we also start with an empty TRACDS data structure
T with no states, $S = \emptyset$, and the starting state $s_c$
set to $\epsilon$ indicating that there is no state yet. We assume
that the clustering algorithm assigns ten incoming data
points to the clusters 1, 2, 3, 2, 3, 4, 4, 2, 3, 4. This
sequence of assignments triggers the 14 TRACDS oper-
ations shown in Table 2. The table shows the assumed
cluster assignments by the clustering algorithm, the exe-
cuted TRACDS operations and manipulations to C and
sc. We have 10 operations to assign points to clusters
and 4 operations to create the 4 needed clusters/states.
Creating new clusters/states increases the size of C and
adding a data point increases the counts inside the ma-
trix. The transition count matrix C resulting from the
14 operations is shown in Figure 2(a). A graph rep-
resentation of the transition count matrix is shown in
Figure 2(b). The nodes are labeled with the cluster/state
labels 1–4 and larger nodes represent clusters with more
observations. The arcs represent transitions with heav-
ier arcs indicating a higher transition count.
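
Example 1 can be replayed in a few lines. The sketch below is self-contained and uses a plain list of rows for C; the function name is hypothetical, and it assumes that creating a state only enlarges C while the subsequent assignment records the transition from the previous state.

```python
def replay(assignments):
    """Feed cluster labels one by one; return (states, counts, n_ops)."""
    states = []     # S, in order of creation
    counts = []     # C as a list of rows
    current = None  # index of s_c, None = no current state
    n_ops = 0
    for label in assignments:
        if label not in states:
            # create: enlarge C by one column and one row of zeros
            states.append(label)
            for row in counts:
                row.append(0)
            counts.append([0] * len(states))
            n_ops += 1
        # assign: count the transition from the current state, then move on
        j = states.index(label)
        if current is not None:
            counts[current][j] += 1
        current = j
        n_ops += 1
    return states, counts, n_ops


states, C, n_ops = replay([1, 2, 3, 2, 3, 4, 4, 2, 3, 4])
for row in C:
    print(row)
```

The run performs 14 operations (4 creations and 10 assignments) and yields the rows [0, 1, 0, 0], [0, 0, 3, 0], [0, 1, 0, 2] and [0, 1, 0, 1], which agree with the transition count matrix in Figure 2(a).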
3.3 Implementation and Complexity of
TRACDS. TRACDS operations can be imple-
mented separately from the clustering operations.
We only need a very light-weight interface through
which we can observe what operations the clustering
algorithm executes plus minimal additional information
(e.g., the cluster index for a new data point). Adding
such an interface to existing clustering algorithms is
straightforward.

      1  2  3  4
  1   0  1  0  0
  2   0  0  3  0
  3   0  1  0  2
  4   0  1  0  1
        (a)

Figure 2: (a) Transition count matrix C and (b) graph
representation of the Markov Chain for Example 1

The space and time complexity of maintaining
TRACDS depends on the data structure used to store
the transition count matrix C. A suitable data structure
is a two-dimensional array with $k' \ge k$ columns/rows
where we allow columns and the corresponding rows to
be marked as currently unused. On such a data structure
the most frequently used operation, assigning a new
data point, can be done in constant time. Deleting clusters,
creating new clusters and merging clusters takes
$O(k)$. We only have to reorganize the data structure,
at a cost of $O(k^2)$, if no more unused rows/columns are
available. Note that most data stream clustering algorithms
make sure that k does not grow without bound, which
reduces or might even avoid the need for the more
expensive reorganization operation. Fading is also a more
expensive operation since we have to go through all $k^2$
cells. However, a strategy similar to the one used in [28]
can be used to significantly reduce the burden by
storing a timestamp of when the last fading was performed on
a count and then only applying the compounded fading when a
transition is updated.
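
The timestamp idea can be sketched as follows. This is an illustrative reading of the strategy described above, not the scheme of [28]: the class name, the dictionary keyed by state pairs and the global step counter are assumptions. A fading step only advances a counter; the pending decay $2^{-\lambda \Delta t}$ is applied to a cell when that cell is read or incremented.

```python
class LazyFadedCounts:
    """Transition counts whose exponential fading is applied on access."""

    def __init__(self, lam):
        self.lam = lam
        self.now = 0     # number of fading steps requested so far
        self.cells = {}  # (i, j) -> (value, step of the last applied fade)

    def tick(self):
        """Request one fading step; constant time, no cell is touched."""
        self.now += 1

    def get(self, i, j):
        """Return c_ij with all pending fading steps applied."""
        value, last = self.cells.get((i, j), (0.0, self.now))
        return value * 2.0 ** (-self.lam * (self.now - last))

    def increment(self, i, j):
        """Apply the compounded pending fade, then count the transition."""
        self.cells[(i, j)] = (self.get(i, j) + 1.0, self.now)


f = LazyFadedCounts(lam=1.0)
f.increment(0, 1)       # c_01 = 1
f.tick()
f.tick()                # two fading steps are pending for c_01
after_two_steps = f.get(0, 1)
f.increment(0, 1)       # fade first, then add one
after_increment = f.get(0, 1)
```

Storing only the cells that have been touched also keeps the structure sparse, at the cost of a dictionary lookup per access.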
Space requirements are $O(k'^2)$ where $k' \ge k$ is the
chosen size for C. However, since for certain types of
data C might be very sparsely populated it is possible
to use a list-based data structure to reduce the space
requirements at the expense of the time complexity of
the operations.
The computational needs of TRACDS directly de-
pend on the number and type of clustering operations
executed by the used data stream clustering algorithm.
Our experiments show that the time spent on TRACDS
operations is negligible compared to the clustering op-
erations (see next section).
4 Evaluation
In this section we evaluate the improvement of
TRACDS compared to standard stream clustering. Al-
though TRACDS is potentially useful for other appli-
cations we focus here on the improvement for anomaly
detection. Anomaly detection is an important and well
researched stream mining task. It is the basis for many
applications like anomaly detection in weather related
data and intrusion detection in computer networks. Al-
though many anomaly detection techniques were pro-
posed (see [10] for a current survey), we only consider
a data stream clustering based approach here since we
are only interested in analyzing if TRACDS can improve
over standard clustering.
As the baseline for unsupervised anomaly detection
via clustering, we follow the simple approach by Eskin
et al. [12]. Clustering with a fixed distance threshold
around a center is used to get local density estimates,
and all members assigned to a cluster i are classified as
outliers if the density in the cluster is low, i.e., $n_i < c$
where $n_i$ is the number of points assigned to cluster i
and c is a suitable threshold. This approach was found
to perform favorably compared to more complicated and
computationally expensive approaches using support
vector machines and k-Nearest Neighbor [12].
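The baseline rule can be illustrated with a few lines of Python. The function name density_outliers and the representation of the clustering result as a list of cluster labels are assumptions made for this sketch.

```python
from collections import Counter


def density_outliers(assignments, c):
    """Flag every point whose cluster holds fewer than c points (n_i < c)."""
    sizes = Counter(assignments)          # n_i for every cluster i
    return [sizes[cluster] < c for cluster in assignments]
```

For example, with assignments [0, 0, 0, 1, 2, 2] and c = 2, only the single point in cluster 1 is flagged.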
For TRACDS we only use temporal information
given by transition probabilities in the MC in T.
TRACDS also has access to cluster size which might
further improve results, but we ignore it here since we want
to concentrate only on the captured temporal order
information. We classify each data point as an anomaly
if the transition probability from cluster i, the cluster
for the previous data point in the stream, to cluster j,
the cluster of the current point, is below a pre-specified
threshold, i.e., $a_{ij} < T$.
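A sketch of this rule in Python is given below. It makes two simplifying assumptions that are not stated in the text: transition probabilities are estimated as row-normalized counts over the whole sequence of cluster assignments before any point is scored, and the first point, which has no predecessor, is never flagged. The function names are hypothetical.

```python
from collections import defaultdict


def transition_probabilities(states):
    """Estimate a_ij as count(i -> j) divided by the number of transitions leaving i."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, j in zip(states, states[1:]):
        counts[i][j] += 1
    return {
        i: {j: n / sum(row.values()) for j, n in row.items()}
        for i, row in counts.items()
    }


def temporal_anomalies(states, threshold):
    """Flag a point if the transition into its cluster has probability below threshold."""
    if not states:
        return []
    probs = transition_probabilities(states)
    flags = [False]                       # assumption: first point has no predecessor
    for i, j in zip(states, states[1:]):
        flags.append(probs[i][j] < threshold)
    return flags
```

In the sequence [0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1] the transition from cluster 0 to cluster 0 occurs once out of six transitions leaving cluster 0, so with a threshold of 0.2 only the point at index 7 is flagged.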
As the data stream clustering algorithm we use
the threshold nearest neighbor (tNN) algorithm we
implemented together with TRACDS in the R-extension
package rEMM.¹ This simple data stream clustering
algorithm is very similar to the one used in [12]. It
represents clusters as synopses containing the position
and the size of each cluster. It assigns a new data point
to an existing cluster if it is within a fixed threshold
of its center. If a data point cannot be assigned
to any existing cluster, a new cluster is created for
the data point. After assignment the data point is
discarded. Note that the clustering algorithm used
here is only of secondary importance. TRACDS only
uses the clustering results as input and we evaluate if
TRACDS can improve the results of pure clustering.
Other data stream clustering algorithms which might
produce better results for anomaly detection will also
improve the accuracy of TRACDS.
¹ http://CRAN.R-project.org/package=rEMM
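The assignment step of such a threshold-based clusterer can be sketched in Python as follows. This is not the rEMM implementation: the class ThresholdNN is hypothetical, Euclidean distance is assumed, and the cluster center is assumed to stay fixed at the first point of the cluster.

```python
import math


class ThresholdNN:
    """Threshold nearest neighbor clustering with (center, size) synopses."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.centers = []     # position of each cluster
        self.sizes = []       # number of points assigned to each cluster

    def assign(self, point):
        """Return the cluster index for `point`, creating a cluster if needed."""
        best, best_dist = None, math.inf
        for idx, center in enumerate(self.centers):
            d = math.dist(point, center)      # Euclidean distance (assumption)
            if d < best_dist:
                best, best_dist = idx, d
        if best is None or best_dist > self.threshold:
            # No existing cluster is close enough: open a new one at this point.
            self.centers.append(tuple(point))
            self.sizes.append(0)
            best = len(self.centers) - 1
        self.sizes[best] += 1
        return best                           # the point itself is not stored
```

The returned index is the only information a temporal model on top of the clustering needs for each arriving point.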
4.1 Synthetic data. We first use several synthetic
data sets to evaluate our approach and analyze its
sensitivity to data dimensionality and imperfections in the
temporal structure, e.g., caused by data points arriving
out-of-order, incorrect time stamps or by a weak
temporal structure in the data. To make the experiments
in this paper reproducible, we included the data
generator used in package rEMM. The data generation
process creates a data set consisting of k clusters in roughly
$[0, 1]^d$. For simplicity, the data points for each cluster
are drawn from a multivariate normal distribution given
a random mean and a random variance/covariance
matrix for each cluster. The temporal aspect is modeled by
a fixed subsequence of length n through the k clusters.
In each step, a transition probability $p_t$ governs whether
the next data point is in the same cluster or in a
randomly chosen other cluster, thus we can create slowly
or fast changing data. For the complete sequence, the
subsequence is repeated l times, creating a fixed
temporal structure. The data set of size $N = nl$ is
generated by drawing a data point from the cluster
corresponding to each position in the sequence. To introduce
imperfections in the temporal sequence (i.e., incorrect
time stamps) we swap two consecutive observations with
probability $p_s$. Note that each swap of two observations
influences three transitions, so a rather low setting of
$p_s$ can distort the temporal structure significantly.
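The temporal part of this generation process can be sketched in Python as follows. This is not the generator shipped with rEMM. The function names are hypothetical, $p_t$ is interpreted here as the probability of moving to a different cluster, and swaps are applied in a single left-to-right pass, so an observation may occasionally be carried forward by more than one position.

```python
import random


def cluster_subsequence(k, n, p_t, rng):
    """Fixed subsequence of length n through k clusters."""
    seq = [rng.randrange(k)]
    for _ in range(n - 1):
        current = seq[-1]
        # Assumption: with probability p_t move to a randomly chosen other cluster.
        if k > 1 and rng.random() < p_t:
            current = rng.choice([c for c in range(k) if c != current])
        seq.append(current)
    return seq


def swap_consecutive(seq, p_s, rng):
    """Swap neighbouring observations with probability p_s (incorrect time stamps)."""
    out = list(seq)
    for pos in range(len(out) - 1):
        if rng.random() < p_s:
            out[pos], out[pos + 1] = out[pos + 1], out[pos]
    return out


def temporal_sequence(k, n, l, p_t, p_s, seed=0):
    """Cluster sequence of length N = n * l with optional swap noise."""
    rng = random.Random(seed)
    base = cluster_subsequence(k, n, p_t, rng) * l   # repeat the subsequence l times
    return swap_consecutive(base, p_s, rng)
```

Drawing the actual data points then amounts to sampling from the distribution of the cluster named at each position of the returned sequence.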
Anomalies are introduced by replacing data points
in the data set with probability $p_a$ by randomly chosen
data points in $[0, 1]^d$. These data points potentially
lie far away from clusters and can be easily detected.
However, they might also fall within existing clusters
and therefore are hard to detect. Since these anomalies
violate the temporal structure of the data, TRACDS
can be used for detection.
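The replacement step can be sketched in Python as follows. The function inject_anomalies is a hypothetical helper, and drawing each coordinate independently and uniformly from [0, 1) is an assumption about what "randomly chosen" means here.

```python
import random


def inject_anomalies(points, p_a, seed=0):
    """Replace each point with probability p_a by a uniform random point."""
    rng = random.Random(seed)
    out, labels = [], []
    for point in points:
        is_anomaly = rng.random() < p_a
        if is_anomaly:
            # Assumption: every coordinate is drawn uniformly from [0, 1).
            out.append(tuple(rng.random() for _ in point))
        else:
            out.append(tuple(point))
        labels.append(is_anomaly)
    return out, labels
```

The returned labels serve as ground truth when the detection results are evaluated.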
An example of synthetic data generated by the
procedure described above and used in the experiments
below is shown in Figure 3. The clusters have different
shapes and densities, and most are not well separated. Most
of the anomalies are far away from the clusters but
several are close to or even within clusters of regular data,
so it is expected that detecting these anomalies, if
possible at all, is a hard task. Figure 4 shows the
Receiver Operating Characteristic (ROC) curves [20]
with false positive rate (FPR) and true positive rate
(TPR) for the two anomaly detection approaches. The
ROC curve is formed by connecting the FPR/TPR
combinations obtained using different values for the two
algorithms' thresholds, c and T. It can clearly be
seen that TRACDS improves the results over simple
clustering, resulting in a larger area under the ROC
curve (AUC).
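How such a curve and its area can be computed is sketched below in Python. A point is flagged when its score (the cluster size $n_i$ or the transition probability $a_{ij}$) falls below the threshold, which matches both rules above. The function names are hypothetical, the area is computed with the trapezoidal rule, and both classes are assumed to be present in the labels.

```python
def roc_points(scores, labels, thresholds):
    """FPR/TPR pairs for the rule 'flag as anomaly if score < threshold'."""
    positives = sum(labels)                  # number of true anomalies
    negatives = len(labels) - positives      # number of regular points
    points = {(0.0, 0.0), (1.0, 1.0)}        # end points of every ROC curve
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s < t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s < t and not y)
        points.add((fp / negatives, tp / positives))
    return sorted(points)


def auc(points):
    """Area under the piecewise linear curve through the sorted points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

A detector that ranks every anomaly below every regular point reaches an area of 1, while a detector without discriminative power stays near 0.5.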
For the following experiments we create several data
Figure 7: Execution time on KDD-99 data set.
[Plot not reproduced. Axes: N, Time (sec.) and k; legend: Clustering, TRACDS.]
5 Conclusion and Future Work
Current work in the temporal clustering field
concentrates on the evolution of data streams and how the
clustering can follow or detect such changes. In this
paper we addressed a completely different problem. We
deal with the temporal (or order) relationship between
clusters. We presented TRACDS, a framework which
can be used to efficiently learn online a model of a
massive data stream's temporal structure.
We systematically evaluated TRACDS for anomaly
detection on synthetic data with the result that if there
is a strong temporal structure in the data, TRACDS can
improve accuracy significantly. Experiments on
real-world data support these findings and show that the
TRACDS framework only minimally increases runtime.
Since this is one of the first papers dealing with
modeling the temporal order structure of massive data
streams, there are many directions for further research
and applications:
• We will study the use of different data stream
clustering algorithms as the base for TRACDS. Since
the algorithms have different strategies for
assigning data points, and for merging, removing and
fading clusters, we need to evaluate the impact on
performance in terms of runtime and accuracy for
different applications and data structure choices.
• We plan to investigate the use of higher order
Markov models to learn a more detailed temporal
structure. The space complexity for storing the
complete transition matrix increases exponentially
with the order of the chain. Also, difficulties with
getting reliable transition probability estimates for
higher order models are expected. However, we
can deal with these problems. For example, fading
transition counts and then removing low counts
can help with avoiding the estimation problem. It
can be seen as removing noise and the resulting
transition matrix can be relatively sparse and thus
can be stored in a more compact way. Also, we can
resort to using lower order transition probabilities for
transitions where not enough data is available to
reliably estimate the higher order probabilities (see
variable order Markov chains [7]).
• It is straightforward to calculate the probability of
future states given the current state and a Markov
chain. The Markov chain learned by TRACDS can
thus be used to predict the cluster a future data
point will belong to. An example application is to
impute missing values in a stream by identifying
the cluster which a data point most likely would
have been assigned to given the temporal structure
and then to use the cluster's center as the imputed
value.
• Since TRACDS learns a temporal model in form of
a Markov chain, we can evaluate dissimilarities
between data streams by using dissimilarities between
the learned models. Applied to genetic sequences,
this will lead to new computationally efficient
approaches of sequence clustering and sequence
classification for high volume genetic sequence data
based on TRACDS models.
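The prediction idea from the third item above can be illustrated with a short Python sketch. The count matrix in the usage note is made up for illustration, and the function names normalize_rows and predict are hypothetical.

```python
def normalize_rows(counts):
    """Turn a transition count matrix into row-wise transition probabilities."""
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs


def predict(probs, start, steps):
    """Distribution over clusters `steps` transitions after cluster `start`."""
    size = len(probs)
    dist = [0.0] * size
    dist[start] = 1.0
    for _ in range(steps):
        # One step of the chain: multiply the row vector by the transition matrix.
        dist = [sum(dist[i] * probs[i][j] for i in range(size)) for j in range(size)]
    return dist
```

With the illustrative counts [[0, 2, 0], [0, 0, 3], [1, 0, 1]], a stream currently in cluster 0 is predicted to be in cluster 2 two steps later; for imputation, the center of that cluster would serve as the imputed value.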
Acknowledgments
This work is supported in part by the U.S. National
Science Foundation under contract number IIS-0948893.
References
[1] G. Adomavicius and J. Bockstedt. C-TREND: Temporal
cluster graphs for identifying and visualizing trends
in multiattribute transactional data. IEEE Transactions
on Knowledge and Data Engineering, 20(6):721-735,
June 2008.
[2] C. Aggarwal. A framework for clustering
massive-domain data streams. In IEEE 25th International
Conference on Data Engineering (ICDE '09), pages
102-113, March 29 2009-April 2 2009.
[3] C. Aggarwal and P. Yu. A framework for clustering
uncertain data streams. In IEEE 24th International
Conference on Data Engineering (ICDE 2008), pages
150-159, 2008.
[4] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A
framework for clustering evolving data streams. In
Proceedings of the International Conference on Very
Large Data Bases (VLDB '03), pages 81-92, 2003.
[5] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A
framework for projected clustering of high dimensional
data streams. In Proceedings of the Thirtieth
International Conference on Very Large Data Bases (VLDB
'04), pages 852-863, 2004.
[6] D. Barbará. Requirements for clustering data streams.
SIGKDD Explorations, 3(2):23-27, 2002.
[7] R. Begleiter, R. El-Yaniv, and G. Yona. On prediction
using variable order Markov models. Journal of
Artificial Intelligence Research, 22:385-421, 2004.
[8] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based
clustering over an evolving data stream with
noise. In Proceedings of the 2006 SIAM International
Conference on Data Mining, pages 328-339. SIAM,
2006.
[9] D. Chakrabarti, R. Kumar, and A. Tomkins.
Evolutionary clustering. In KDD '06: Proceedings of
the 12th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 554-560.
ACM, 2006.
[10] V. Chandola, A. Banerjee, and V. Kumar. Anomaly
detection: A survey. ACM Computing Surveys,
41(3):1-58, 2009.
[11] M. H. Dunham, Y. Meng, and J. Huang. Extensible
Markov model. In Proceedings IEEE ICDM Conference,
pages 371-374. IEEE, November 2004.
[12] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and
S. Stolfo. A geometric framework for unsupervised
anomaly detection: Detecting intrusions in unlabeled
data. In Data Mining for Security Applications.
Kluwer, 2002.
[13] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and
L. O'Callaghan. Clustering data streams: Theory and
practice. IEEE Transactions on Knowledge and Data
Engineering, 15(3):515-528, 2003.
[14] A. Jain, M. Murty, and P. Flynn. Data clustering:
A review. ACM Computing Surveys, 31(3):264-323,
September 1999.
[15] Y. Kambayashi, T. Hayashi, and S. Yajima. Dynamic
clustering procedures for bibliographic data. In
Proceedings of the ACM SIGIR Conference, pages 90-99,
June 1981.
[16] E. Keogh, J. Lin, and W. Truppel. Clustering of time
series subsequences is meaningless: Implications for
previous and future research. In ICDM '03: Proceedings
of the Third IEEE International Conference on
Data Mining, page 115. IEEE Computer Society, 2003.
[17] M. Kijima. Markov Processes for Stochastic Modeling.
Stochastic Modeling Series. Chapman & Hall/CRC,
1997.
[18] H.-P. Kriegel, P. Kröger, and I. Gotlibovich.
Incremental OPTICS: Efficient computation of updates in a
hierarchical cluster ordering. In Data Warehousing and
Knowledge Discovery, volume 2737 of Lecture Notes in
Computer Science, pages 224-233. Springer, 2003.
[19] E. Parzen. Stochastic Processes. Society for Industrial
Mathematics, 1999.
[20] F. Provost and T. Fawcett. Analysis and visualization
of classifier performance: Comparison under imprecise
class and cost distributions. In D. Heckerman,
H. Mannila, and D. Pregibon, editors, Proceedings of the 3rd
International Conference on Knowledge Discovery and
Data Mining, pages 43-48, Newport Beach, CA,
August 1997. AAAI Press.
[21] J. Sander, M. Ester, H.-P. Kriegel, and X. Xu.
Density-based clustering in spatial databases: The algorithm
GDBSCAN and its applications. Data Mining and
Knowledge Discovery, 2(2):169-194, 1998.
[22] M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and
R. Schult. MONIC: Modeling and monitoring cluster
transitions. In Proceedings of the 12th ACM SIGKDD
International Conference on Knowledge Discovery and
Data Mining, Philadelphia, PA, USA, pages 706-711,
2006.
[23] D. Tasoulis, N. Adams, and D. Hand. Unsupervised
clustering in streaming data. In IEEE International
Workshop on Mining Evolving and Streaming Data.
Sixth IEEE International Conference on Data Mining
(ICDM 2006), pages 638-642, Dec. 2006.
[24] D. K. Tasoulis, G. Ross, and N. M. Adams. Visualising
the cluster structure of data streams. In Advances
in Intelligent Data Analysis VII, Lecture Notes in
Computer Science, pages 81-92. Springer, 2007.
[25] R. S. Tsay, D. Peña, and A. E. Pankratz. Outliers in
multivariate time series. Biometrika, 87(4):789-804,
2000.
[26] L. Tu and Y. Chen. Stream data clustering based
on grid density and attraction. ACM Transactions on
Knowledge Discovery from Data, 3(3):1-27, 2009.
[27] K. Udommanetanakit, T. Rakthanmanon, and
K. Waiyamai. E-stream: Evolution-based technique
for stream clustering. In ADMA '07: Proceedings of
the 3rd international conference on Advanced Data
Mining and Applications, pages 605-615.
Springer-Verlag, Berlin, Heidelberg, 2007.
[28] L. Wan, W. K. Ng, X. H. Dang, P. S. Yu, and
K. Zhang. Density-based clustering of data streams at
multiple resolutions. ACM Transactions on Knowledge
Discovery from Data, 3(3):1-28, 2009.
[29] T. Warren Liao. Clustering of time series data - a
survey. Pattern Recognition, 38(11):1857-1874, November
2005.
[30] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH:
An efficient data clustering method for very large
databases. In Proceedings of the 1996 ACM SIGMOD
International Conference on Management of Data,
pages 103-114. ACM, 1996.
