rEMM : Extensible Markov Model for Data Stream Clustering in R

by Michael Hahsler, Margaret H Dunham
Journal Of Statistical Software (2010)

JSS Journal of Statistical Software
July 2010, Volume 35, Issue 5. http://www.jstatsoft.org/
Michael Hahsler
Southern Methodist University
Margaret H. Dunham
Southern Methodist University
Abstract
Clustering streams of continuously arriving data has become an important application
of data mining in recent years and efficient algorithms have been proposed by several
researchers. However, clustering alone neglects the fact that data in a data stream is
not only characterized by the proximity of data points which is used by clustering, but
also by a temporal component. The extensible Markov model (EMM) adds the temporal
component to data stream clustering by superimposing a dynamically adapting Markov
chain. In this paper we introduce the implementation of the R extension package rEMM
which implements EMM and we discuss some examples and applications.
Keywords: data mining, data streams, clustering, Markov chain.
1. Introduction
Clustering data streams (Guha, Mishra, Motwani, and O'Callaghan 2000) has become an
important field in recent years. A data stream is an ordered and potentially infinite sequence
of data points $\langle y_1, y_2, y_3, \ldots \rangle$. Such streams of constantly arriving data are generated by
many types of applications and include web click-stream data, computer network monitoring
data, telecommunication connection data, readings from sensor nets, stock quotes, etc. An
important property of data streams for clustering is that data streams often produce massive
amounts of data which have to be processed in (or close to) real time since it is impractical
to permanently store the data (transient data). This leads to the following requirements:
• The data stream can only be processed in a single pass or scan and typically only in
the order of arrival.
• Only a minimal amount of data can be retained and the clusters have to be represented
in an extremely concise way.
• Data stream characteristics may change over time (e.g., clusters move, merge, disappear
or new clusters may appear).
Many algorithms for data stream clustering have been proposed recently. For example,
O'Callaghan, Mishra, Meyerson, Guha, and Motwani (2002) (see also Guha, Meyerson,
Mishra, Motwani, and O'Callaghan 2003) study the k-medians problem. Their algorithm,
called STREAM, divides the data stream into pieces, clusters each piece individually and
then iteratively reclusters the resulting centers to obtain a final clustering. Aggarwal, Han,
Wang, and Yu (2003) present CluStream which uses micro-clusters (an extension of cluster
feature vectors used by BIRCH, Zhang, Ramakrishnan, and Livny 1996). Micro-clusters can
be deleted and merged and permanently stored at different points in time, which makes it
possible to create final clusterings (by reclustering micro-clusters with k-means) for different
time frames. Even though CluStream allows clusters to evolve over time, the ordering of the
arriving data points in the stream is lost. Kriegel, Kröger, and Gotlibovich (2003) and
Tasoulis, Ross, and Adams (2007) present variants of the density-based method OPTICS
(Ankerst, Breunig, Kriegel, and Sander 1999) suitable for streaming data. Aggarwal, Han,
Wang, and Yu (2004) introduce HPStream which finds clusters that are well defined in
different subsets of the dimensions of the data. The set of dimensions for each cluster can
evolve over time and a fading function is used to discount the influence of older data points
by fading the entire cluster structure. Cao, Ester, Qian, and Zhou (2006) introduce
DenStream which maintains micro-clusters in real time and uses a variant of GDBSCAN
(Sander, Ester, Kriegel, and Xu 1998) to produce a final clustering for users. Tasoulis,
Adams, and Hand (2006) present WSTREAM, which uses kernel density estimation to find
rectangular windows to represent clusters. The windows can move, contract, expand and be
merged over time. More recent density-based data stream clustering algorithms are D-Stream
(Tu and Chen 2009) and MR-Stream (Wan, Ng, Dang, Yu, and Zhang 2009). D-Stream uses
an online component to map each data point into a predefined grid and then uses an offline
component to cluster the grid based on density. MR-Stream facilitates the discovery of
clusters at multiple resolutions by using a grid of cells that can dynamically be sub-divided
into more cells using a tree data structure.
All approaches center on finding clusters of data points based on some notion of proximity,
but neglect the temporal structure of the data stream which might be crucial to understanding
the underlying processes. For example, for intrusion detection a user might change
from behavior A to behavior B, both represented by clusters labeled non-suspicious behavior,
but the transition from A to B might be extremely unusual and give away an intrusion
event. The extensible Markov model (EMM) originally developed by Dunham, Meng,
and Huang (2004) provides a technique to add temporal information in the form of an evolving
Markov chain (MC) to data stream clustering algorithms. Clusters correspond to states in
the Markov chain and transitions represent the temporal information in the data. EMM
was successfully applied to rare event and intrusion detection (Meng, Dunham, Marchetti,
and Huang 2006; Isaksson, Meng, and Dunham 2006; Meng and Dunham 2006c), web usage
mining (Lu, Dunham, and Meng 2006), and identifying emerging events and developing
trends (Meng and Dunham 2006a,b). In this paper we describe an implementation of EMM in
the extension package rEMM for the R environment for statistical computing (R Development
Core Team 2010). The package is available from the Comprehensive R Archive Network at
http://CRAN.R-project.org/package=rEMM.
Although the traditional Markov chain is an excellent modeling technique for a static set of
temporal data, it cannot be applied directly to stream data. As the content of stream data is
not known a priori, the requirement of a fixed transition matrix is too restrictive. The dynamic
nature of EMM resolves this problem. Although there have been a few other approaches to
the use of dynamic Markov chains (Cormack and Horspool 1987; Ostendorf and Singer 1997;
Goldberg and Mataric 1999), none of the others provide the complete flexibility needed by
stream clustering to create, merge, and delete clusters.
This paper is organized as follows. In the next section we introduce the concept of EMM and
show that all operations needed for adding EMM to data stream clustering algorithms can
be performed efficiently. Section 3 introduces the simple data stream clustering algorithm
implemented in rEMM. In Section 4 we discuss implementation details of the package.
Sections 5 and 6 provide examples for the package's functionality and apply EMM to analyzing
river flow data and to genetic sequences. We conclude with Section 7.
2. Extensible Markov model
The extensible Markov model (EMM) can be understood as an evolving Markov chain (MC)
which at each point in time represents a regular time-homogeneous MC which is updated
when new data is available. In the following we will restrict the discussion to first-order EMM
but, as for a regular MC, it is straightforward to extend EMM to higher-order models (Kijima
1997).
Markov chain. A (first-order) discrete parameter Markov chain (Parzen 1999) is a special
case of a Markov process in discrete time and with a discrete state space. It is characterized
by a sequence $\langle X_1, X_2, \ldots \rangle$ of random variables $X_t$ with $t$ being the time index. All random
variables have the same domain $\mathrm{dom}(X_t) = S = \{s_1, s_2, \ldots, s_K\}$, a set called the state space.
The Markov property states that the next state is only dependent on the current state.
Formally,
$$P(X_{t+1} = s \mid X_t = s_t, \ldots, X_1 = s_1) = P(X_{t+1} = s \mid X_t = s_t) \tag{1}$$
where $s, s_t \in S$. For simplicity we use for transition probabilities the notation
$$a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$$
where it is appropriate. A time-homogeneous MC can be represented by a graph with the states
as vertices and the edges labeled with transition probabilities. Another representation is as a
$K \times K$ transition matrix $A$ containing the transition probabilities from each state to all other
states.
$$A = \begin{pmatrix}
a_{11} & a_{12} & \ldots & a_{1K} \\
a_{21} & a_{22} & \ldots & a_{2K} \\
\vdots & \vdots & \ddots & \vdots \\
a_{K1} & a_{K2} & \ldots & a_{KK}
\end{pmatrix} \tag{2}$$
MCs are very useful to keep track of temporal information using the Markov property as a
relaxation. With an MC it is easy to forecast the probability of future states. For example, the
probability to get from a given state to any other state in $n$ time steps is given by the matrix
$A^n$. With an MC it is also easy to calculate the probability of a new sequence of length $t$ as
the product of transition probabilities:
$$P(X_t = s_t, X_{t-1} = s_{t-1}, \ldots, X_1 = s_1) = P(X_1 = s_1) \prod_{i=1}^{t-1} P(X_{i+1} = s_{i+1} \mid X_i = s_i) \tag{3}$$
The probabilities of a Markov chain can be directly estimated from data using the maximum
likelihood method by
$$a_{ij} = c_{ij} / n_i, \tag{4}$$
where $c_{ij}$ is the observed count of transitions from $s_i$ to $s_j$ in the data and $n_i = \sum_{k=1}^{K} c_{ik}$,
the sum of all outgoing transitions from $s_i$.
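Equations 3 and 4 can be illustrated with a few lines of base R. The following is only a minimal sketch and not the rEMM implementation; the state sequence and all variable names are made up for illustration. It counts transitions in a single pass, estimates the transition probabilities by maximum likelihood, and then uses the estimate for a two-step forecast and for the probability of a short new sequence.

```r
# Hypothetical state sequence over K = 3 states (illustration only).
states <- c(1, 2, 1, 2, 3, 1, 2, 3, 3, 1)
K <- 3

# Transition count matrix C: C[i, j] counts observed transitions s_i -> s_j.
C <- matrix(0, K, K)
for (t in seq_len(length(states) - 1)) {
  C[states[t], states[t + 1]] <- C[states[t], states[t + 1]] + 1
}

# Maximum likelihood estimate a_ij = c_ij / n_i (Equation 4).
# Every state in this toy sequence has outgoing transitions, so n_i > 0.
n <- rowSums(C)
A <- C / n

# Two-step transition probabilities are given by the matrix power A^2.
A2 <- A %*% A

# Probability of the new sequence <1, 2, 3> given that it starts in state 1,
# i.e., the product in Equation 3 without the factor P(X_1 = s_1).
seq_new <- c(1, 2, 3)
p_seq <- prod(A[cbind(head(seq_new, -1), tail(seq_new, -1))])
```

Because only the counts are stored, the estimate can be refreshed at any time by recomputing `C / n`, which is what makes the representation suitable for streams.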
Stream data and Markov chains. Data streams typically contain dimensions with continuous
data and/or have discrete dimensions with a large number of domain values (Aggarwal
2009). In addition, the data may continue to arrive resulting in a possibly infinite number of
observations. Therefore data points have to be mapped onto a manageable number of states.
This mapping is done online as data arrives using data stream clustering where each cluster
(or micro-cluster) is represented by a state in the MC. Because of this one-to-one relationship
we often use cluster and state as synonyms for EMM.
The transition count information is obtained during the clustering process by using an additional
data structure efficiently representing the MC transitions. Since it only uses information
(assignment of a data point to a cluster) which is created by the clustering algorithm anyway,
the computational overhead is minimal. When the clustering algorithm creates, merges
or deletes clusters, the corresponding states in the MC are also created, merged or deleted
resulting in the evolving MC. Note that $K$, the size of the set of clusters and of states $S$, is
not fixed for EMMs and will change over time.
In the following we look at the additional data structures and the operations on these structures
which are necessary to extend an existing data stream clustering algorithm for EMM.
Data structures for the EMM. Typically algorithms for data stream clustering use a
very compact representation for each cluster consisting of a description of the center and how
many data points were assigned to the cluster so far. Some algorithms also keep summary
information of the dispersion of the data points assigned to each cluster. Since the cluster
also represents a state in the EMM we need to add a data structure to store the outgoing
edges and their counts. For each cluster $i$ representing state $s_i$ we need to store a transition
count vector $c_i$. All transition counts in an EMM can be seen as a $K \times K$ transition count
matrix $C$ composed of all transition count vectors. It is easy to calculate the estimated
transition probability matrix from the transition count matrix (see Equation 4). Note that $n_i$ in
Equation 4 normally is the same as the number of data points assigned to cluster $i$ maintained
by the clustering algorithm. If we manipulate the clustering using certain operations, e.g., by
deleting clusters or fading the cluster structure (see below), the values of $n_i$ calculated from $C$
will diverge from the number of assigned data points maintained by the clustering algorithm.
However, this is desirable since it ensures that the probabilities calculated for the transition
probability matrix $A$ stay consistent and keep adding up to unity.
For EMM we also need to keep track of the current state $\gamma \in \{\epsilon, 1, 2, \ldots, K\}$ which is either no
state ($\epsilon$; before the first data point has arrived) or the index of one of the $K$ states. We store
the transitions from $\epsilon$ to the first state in form of an initial transition count vector $c^\epsilon$ of length
$K$. Note that the superscript is used to indicate that this is the special count vector from $\epsilon$ to
all existing states. The initial transition probability vector is calculated by
$$p^\epsilon = c^\epsilon \Big/ \sum_{k=1}^{K} c^\epsilon_k.$$
For a single continuous data stream, only one of the elements of $p^\epsilon$ is one and all others are
zero. However, if we have a data stream that naturally should be split into several sequences
(e.g., a sequence for each day for stock exchange data), $p^\epsilon$ is the probability of each state to be
the first state in a sequence (see also the genetic sequence analysis application in Section 6.2).
Thus in addition to the current state $\gamma$ there are only two data structures needed by EMM:
the transition count matrix, $C$, and the initial transition count vector, $c^\epsilon$. These are
only related to maintaining the transition information. No additional data is needed for the
clusters themselves.
EMM clustering operations. We now define how the operations typically performed by
data stream clustering algorithms on (micro-)clusters can be mirrored for the EMM.
Adding a data point to an existing cluster. When a data point is added to an existing
cluster $i$, the EMM has to update the transition count from the current state $\gamma$ to the
new state $s_i$ by setting $c_{\gamma i} = c_{\gamma i} + 1$. Finally the current state is set to the new state
by $\gamma = i$.
Creating a new cluster. This operation increases the number of clusters/states from $K$ to
$K + 1$ by adding a new (micro-)cluster. To store the transition counts from/to this new
cluster, we enlarge the transition count matrix $C$ by a row and a column which are
initialized to zero.
Deleting clusters. When a cluster $i$ (typically an outlier cluster) is deleted by the clustering
algorithm, all we need to do is to remove the row $i$ and column $i$ in the transition count
matrix $C$. This deletes the corresponding state $s_i$ and reduces $K$ to $K - 1$.
Merging clusters. When two clusters $i$ and $j$ are merged into a new cluster $m$, we need to:
1. Create new state $s_m$ in $C$ (see creating a new cluster above).
2. Compute the outgoing edges for $s_m$ by $c_{mk} = c_{ik} + c_{jk}$, $k = 1, 2, \ldots, K$.
3. Compute the incoming edges for $s_m$ by $c_{km} = c_{ki} + c_{kj}$, $k = 1, 2, \ldots, K$.
4. Delete columns and rows for the old states $s_i$ and $s_j$ from $C$ (see deleting clusters
above).
It is straightforward to extend the merge operation to an arbitrary number of clusters
at a time. Merging states also covers reclustering which is done by many data stream
clustering algorithms to create a final clustering for the user/application.
Splitting clusters. Splitting micro-clusters is typically not implemented in data stream
clustering algorithms since the individual data points are not stored and therefore it is not
clear how to create two new meaningful clusters. When clusters are "split" by algorithms
like BIRCH, it typically only means that one or several micro-clusters are assigned to a
different cluster of micro-clusters. This case does not affect the EMM, since the states
are attached to the micro-clusters and thus will move with them to the new cluster.
However, if splitting cluster $i$ into two new clusters $n$ and $m$ is necessary, we replace $s_i$
by the two states, $s_n$ and $s_m$, with equal incoming and outgoing transition probabilities
by splitting the counts between $s_n$ and $s_m$ proportional to $n_n$ and $n_m$:
$$c_{nk} = n_n (c_{ik} / n_i), \quad k = 1, 2, \ldots, K$$
$$c_{kn} = n_n (c_{ki} / n_i), \quad k = 1, 2, \ldots, K$$
$$c_{mk} = n_m (c_{ik} / n_i), \quad k = 1, 2, \ldots, K$$
$$c_{km} = n_m (c_{ki} / n_i), \quad k = 1, 2, \ldots, K$$
After the split we delete $s_i$.
Fading the cluster structure. Clusterings and EMMs adapt to changes in data over time.
New data points influence the clusters and transition probabilities. However, to enable
the EMM to learn the temporal structure, it also has to forget old data. Fading the
cluster structure is for example used by HPStream (Aggarwal et al. 2004). Fading is
achieved by reducing the weight of old observations in the data stream over time. We
use a decay rate $\lambda \ge 0$ to specify the weight over time. We define the weight for data
that is $t$ timesteps in the past by the following function, which is strictly decreasing for $\lambda > 0$:
$$w_t = 2^{-\lambda t}. \tag{5}$$
Since data points are not stored, the weighting has to be performed on the transition
counts. This is easy since the weight defined above is multiplicative:
$$w_t = \prod_{i=1}^{t} 2^{-\lambda} \tag{6}$$
and thus can be applied iteratively. This property allows us to fade all transition counts
in the EMM by
$$C_{t+1} = 2^{-\lambda} C_t \quad \text{and} \quad c^\epsilon_{t+1} = 2^{-\lambda} c^\epsilon_t$$
each time step resulting in a compounded fading effect. The exact time of fading is
decided by the clustering algorithm. Fading can be used before each new data point is
added, or at other regular intervals appropriate for the application.
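The operations above only manipulate the transition count matrix, which the following base R sketch tries to make concrete. It is an illustration under stated assumptions and not the code used in rEMM: the 4-state count matrix and the helper names (`add_transition`, `add_state`, `delete_state`, `merge_states`, `fade`) are hypothetical, and the merged state is simply appended as the last state. Transitions between the two merged states are kept as a self-transition of the new state.

```r
# Hypothetical transition count matrix for K = 4 states (rows: from, columns: to).
C <- matrix(c(0, 3, 0, 1,
              2, 0, 1, 0,
              0, 1, 0, 2,
              1, 0, 1, 0), nrow = 4, byrow = TRUE)

# Adding a data point to cluster i: count the transition current -> i.
add_transition <- function(C, current, i) {
  C[current, i] <- C[current, i] + 1
  C
}

# Creating a new cluster: enlarge C by a zero row and a zero column.
add_state <- function(C) {
  K <- nrow(C)
  rbind(cbind(C, rep(0, K)), rep(0, K + 1))
}

# Deleting cluster i: remove row i and column i.
delete_state <- function(C, i) C[-i, -i, drop = FALSE]

# Merging clusters i and j into a new state that is appended last.
merge_states <- function(C, i, j) {
  out  <- C[i, ] + C[j, ]          # outgoing counts c_mk = c_ik + c_jk
  inc  <- C[, i] + C[, j]          # incoming counts c_km = c_ki + c_kj
  self <- out[i] + out[j]          # transitions among i and j become a self-loop
  keep <- setdiff(seq_len(nrow(C)), c(i, j))
  rbind(cbind(C[keep, keep, drop = FALSE], inc[keep]),
        c(out[keep], self))
}

# Fading: multiply all counts by 2^(-lambda) once per time step.
fade <- function(C, lambda) 2^(-lambda) * C

C_merged <- merge_states(C, 2, 3)  # 3 states remain, total count unchanged
C_faded  <- fade(fade(C, 1), 1)    # two steps with lambda = 1 give weight 2^(-2)
```

Note that merging preserves the total number of counted transitions, while deleting and fading reduce it, which is why the row sums of the count matrix can diverge from the cluster counts kept by the clustering algorithm.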
The discussed operations cover all cases typically needed to incorporate EMM into existing
data stream clustering algorithms. For example, BIRCH (Zhang et al. 1996), CluStream
(Aggarwal et al. 2003), DenStream (Cao et al. 2006) or WSTREAM (Tasoulis et al. 2006) can
be extended to maintain temporal information in the form of an EMM.
Next we introduce the simple data stream clustering algorithm called threshold nearest
neighbor clustering algorithm implemented in rEMM.
3. Threshold nearest neighbor clustering algorithm
Although the EMM concept can be built on top of any stream clustering algorithm that uses
exclusively the operations described above, we discuss here only a simple algorithm used in
our initial R implementation. The clustering algorithm applies a variation of the nearest
neighbor (NN) algorithm which instead of always placing a new observation in the closest
existing cluster creates a new cluster if no existing cluster is near enough. To specify what
near enough means, a threshold value must be provided. We call this algorithm threshold NN
(tNN). The clusters produced by tNN can be considered micro-clusters which can be merged
later on in an optional reclustering phase. To represent (micro-)clusters, we use the following
information:
• Cluster centers
• Number of data points assigned to the cluster
In Euclidean space we use centroids as cluster centers since they can be easily incrementally
updated as new data points are added to the cluster by
$$z_{t+1} = \frac{n}{n+1} z_t + \frac{1}{n+1} y$$
where $z_t$ is the old centroid for a cluster containing $n$ points, $y$ is the new data point and
$z_{t+1}$ is the updated centroid for $n + 1$ data points (see, e.g., BIRCH by Zhang et al. (1996)).
Finding canonical centroids in non-Euclidean space typically has no closed form and is a
computationally expensive optimization problem which needs access to all data points belonging
to the cluster (Leisch 2006). Since we do not store the data points for our clusters, even exact
medoids cannot be found and we have to resort to fixed pseudo medoids or moving pseudo
centroids. We define fixed pseudo medoids as the first data point which creates a new cluster.
The idea is that since we use a fixed threshold around the center, points will be added around
the initial data point which makes it a reasonable center possibly close to the real medoid. As
an alternative approach, if we have at least a linear space, we define moving pseudo centroids
as the first data point and then, to approximate the adjustment, we apply a simple updating
scheme that moves a pseudo centroid towards each new data point that is assigned to its
cluster:
$$z_{t+1} = (1 - \alpha) z_t + \alpha y$$
where $\alpha$ controls how much the pseudo centroid moves in the direction of the new data point.
Typically we use $\alpha = \frac{1}{n+1}$ which results in an approximation of the centroid that is equal to
adjustments made for centroids in Euclidean space.
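That the updating scheme with $\alpha = 1/(n+1)$ reproduces the ordinary centroid in Euclidean space can be checked with a small base R sketch. The function name `update_centroid` and the four data points are hypothetical and only serve as an illustration.

```r
# Move the (pseudo) centroid z of a cluster with n points towards the new point y.
# With the default alpha = 1 / (n + 1) this is the incremental mean update.
update_centroid <- function(z, n, y, alpha = 1 / (n + 1)) {
  (1 - alpha) * z + alpha * y
}

# Four made-up two-dimensional data points that all fall into one cluster.
pts <- rbind(c(1, 2), c(3, 4), c(5, 0), c(7, 6))

z <- pts[1, ]   # the first data point creates the cluster
n <- 1
for (r in 2:nrow(pts)) {
  z <- update_centroid(z, n, pts[r, ])
  n <- n + 1
}

# The incrementally updated center equals the batch centroid (4, 3).
same_as_mean <- isTRUE(all.equal(z, colMeans(pts)))
```

Only the current center and the count `n` are needed for the update, so no data point has to be stored.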
Note that we do not store the sums and sum of squares of observations like BIRCH (Zhang
et al. 1996) and similar micro-cluster based algorithms since this only helps with calculating
measures meaningful in Euclidean space and the clustering algorithm here is intended to be
independent from the chosen proximity measure.
Figure 1: UML class diagram for `EMM' (classes `tNN', `TRACDS' and `EMM').
Algorithm to add a new data point to a clustering:
1. Compute dissimilarities between the new data point and the k centers.
2. Find the closest cluster with a dissimilarity smaller than the threshold.
3. If such a cluster exists then assign the new point to the cluster and adjust the cluster
center.
4. Otherwise create a new cluster for the point.
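The four steps can be sketched in base R as follows. This is a simplified illustration and not the implementation in rEMM: it assumes Euclidean distance and centroids, and the function name `tnn_add`, the list layout of `model` and the toy stream are hypothetical.

```r
# Add one data point y to a tNN clustering stored in a list with
# centers (K x d matrix), counts (length K) and last (index of assigned cluster).
tnn_add <- function(model, y, threshold) {
  K <- nrow(model$centers)
  if (K > 0) {
    # Step 1: dissimilarities (here Euclidean) between y and all K centers.
    d <- sqrt(rowSums(sweep(model$centers, 2, y)^2))
    # Step 2: closest cluster, accepted only if it is within the threshold.
    i <- which.min(d)
    if (d[i] < threshold) {
      # Step 3: assign y to cluster i and adjust its centroid incrementally.
      n <- model$counts[i]
      model$centers[i, ] <- (n * model$centers[i, ] + y) / (n + 1)
      model$counts[i] <- n + 1
      model$last <- i
      return(model)
    }
  }
  # Step 4: no cluster is near enough, so y starts a new cluster.
  model$centers <- rbind(model$centers, y, deparse.level = 0)
  model$counts <- c(model$counts, 1)
  model$last <- K + 1
  model
}

# Process a small made-up stream strictly one point after the other.
model <- list(centers = matrix(numeric(0), 0, 2), counts = numeric(0), last = NA)
stream <- rbind(c(0, 0), c(0.1, 0), c(5, 5), c(5.2, 5), c(0, 0.1))
assigned <- numeric(0)
for (r in seq_len(nrow(stream))) {
  model <- tnn_add(model, stream[r, ], threshold = 1)
  assigned <- c(assigned, model$last)
}
```

The vector `assigned` is the cluster (state) sequence of the stream; each pair of consecutive entries is exactly the information the temporal layer needs to update one transition count.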
To observe memory limitations, clusters with very low counts (outliers) can be removed or
close clusters can be merged during clustering.
The clustering produces a set of micro-clusters. These micro-clusters can be directly used for
an application or they can be reclustered to create a final clustering to present to a user or
to be used by an application. For reclustering, the micro-cluster centers are treated as data
points and clustered by an arbitrary algorithm (hierarchical clustering, k-means, k-medoids,
etc.). This choice of clustering algorithm gives the user the flexibility to accommodate a priori
knowledge about the data and the shape of expected clusters. For example, for spherical
clusters k-means or k-medoids can be used and if clusters of arbitrary shape are expected,
hierarchical clustering with single linkage makes sense. Reclustering micro-clusters results in
merging the corresponding states in the MC.
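As an illustration of how reclustering carries over to the Markov chain, the following base R sketch reclusters five hypothetical micro-cluster centers with single-linkage hierarchical clustering and merges the corresponding states by summing the transition counts within and between the new groups. The centers, the count matrix and the variable names are made up for this example, and the indicator-matrix formulation is one convenient way to write the merge, not necessarily how rEMM does it.

```r
# Five made-up micro-cluster centers and their transition count matrix.
centers <- rbind(c(0, 0), c(0.5, 0), c(5, 5), c(5.5, 5), c(10, 0))
C <- matrix(c(0, 2, 0, 0, 0,
              1, 0, 1, 0, 0,
              0, 0, 0, 2, 0,
              0, 0, 1, 0, 1,
              1, 0, 0, 0, 0), nrow = 5, byrow = TRUE)

# Recluster the centers: single-linkage hierarchical clustering, cut into 3 groups.
hc <- hclust(dist(centers), method = "single")
assignment <- cutree(hc, k = 3)

# Indicator matrix M: M[i, g] = 1 if micro-cluster i belongs to final cluster g.
M <- outer(assignment, sort(unique(assignment)), "==") * 1

# Merged counts: entry (g, h) sums all counts from members of g to members of h.
C_merged <- t(M) %*% C %*% M
```

Transitions between micro-clusters that end up in the same final cluster appear on the diagonal of `C_merged`, i.e., as self-transitions of the merged state.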
4. Implementation details
Package rEMM implements the simple data stream clustering algorithm threshold NN (tNN)
described above with an added temporal EMM layer. The package uses the S4 class system
and builds on the infrastructure provided by the packages proxy (Meyer and Buchta 2010)
for dissimilarity computation, cluster (Maechler, Rousseeuw, Struyf, and Hubert 2010) for
clustering, and Rgraphviz (Gentry, Long, Gentleman, Falcon, Hahne, and Sarkar 2010) for
one of the visualization options.
The central class in the package is `EMM' which contains two classes, class `tNN' which contains
all information pertaining to the clustering and class `TRACDS' (short for temporal relationship
among clusters for data streams) for the temporal aspects. Figure 1 shows the UML class
diagram (Fowler 2004). The advantage of separating the classes is that for future development
it is easier to replace the clustering algorithm or perform changes on the temporal layer without
breaking the whole system.
Class `tNN' contains slots for all the clustering information used by threshold NN:
• Used dissimilarity measure.
• Dissimilarity threshold for micro-clusters.
• An indicator if (pseudo) centroids or pseudo medoids are used.
• The cluster centers as a $K \times d$ matrix containing the centers ($d$-dimensional vectors) for
the $K$ clusters currently used. Note that $K$ changes over time when clusters are added
or deleted.
• The cluster count vector $n = (n_1, n_2, \ldots, n_K)$ with the number of data points currently
assigned to each cluster.
Class `TRACDS' contains exclusively temporal information:
• The Markov chain is represented by an object of the internal class `SimpleMC' which
allows for fast manipulation of the transition count matrix $C$. It also stores the initial
transition count vector $c^\epsilon$.
• Current state $\gamma$ as a state index. NA represents no state ($\epsilon$).
An `EMM' object is created by function EMM() which initializes an empty clustering with a
temporal layer. Several methods are defined for either class `tNN' or `TRACDS'. Only methods
which need clustering and temporal information together (e.g., building a new EMM or
plotting an EMM) are directly defined for `EMM'. Since `EMM' contains `tNN' and `TRACDS',
all methods can directly be used for `EMM' objects. The reason for the separation is flexibility for
future development.
The temporal layer information can be accessed using size() (number of states), states()
(names of states), current_state() (get current state), transition() (access count or
probability of a certain transition), transition_matrix() (compute a transition count or
probability matrix), initial_transition() (get initial transition count vector). To access
information about the clustering, we provide the functions clusters() (names of clusters),
cluster_counts() (number of observations assigned to each cluster) and cluster_centers()
(centroids/medoids of clusters).
Clustering and building the EMM is integrated in the function build(). It adds new data
points by first clustering and then updating the MC structure. For convenience, build() can
be called with several data points as a matrix; however, internally the data points (rows) are
processed sequentially.
To process multiple sequences, reset() is provided. It sets the current state to no state
($\gamma = \epsilon$). The next observation will start a new sequence and the initial transition count vector
will be updated. For convenience, a row of all NAs in a sequence of data points supplied to
build() as a matrix also works as a reset.
rEMM implements cluster structure fading by two mechanisms. First, build() has a decay
rate parameter lambda. If this parameter is set, build() automatically fades all counts before
a new data point is added. The second mechanism is to explicitly call the function fade()
whenever fading is needed. This has the advantage that the overhead of manipulating all
counts in the EMM can be reduced and that fading can be used in a more flexible manner.
For example, if the data points are arriving at an irregular rate, fade() could be called at
regular time intervals (e.g., every second).
To manipulate states/clusters and transitions, rEMM offers a wide array of functions.
remove_clusters() and remove_transitions() remove user specified states/clusters or
transitions from the model. To find rare clusters or transitions with a count below a specified
threshold rare_clusters() and rare_transitions() can be used. prune() combines
finding rare clusters or transitions and removing them into a convenience function. For some
applications transitions from a state to itself might not be interesting. These transitions
can be removed by using remove_selftransitions(). The last manipulation function is
merge_clusters() which combines several clusters/states into a single cluster/state.
As described above, the threshold NN data stream clustering algorithm can use an optional
reclustering phase to combine micro-clusters into a final clustering. For reclustering we provide
several wrapper functions for popular clustering methods in rEMM: recluster_hclust()
for hierarchical clustering, recluster_kmeans() for k-means and recluster_pam() for
k-medoids. However, it is easy to use any other clustering method. All that is needed is a
vector with the cluster assignments for each state/cluster. This vector can be supplied to
merge_clusters() with clustering = TRUE to create a reclustered EMM. Optionally new
centers calculated by the clustering algorithm can also be supplied to merge_clusters() as
the parameter new_center.
Predicting a future state and calculating the probability of a new sequence are implemented
as predict() and score(), respectively.
The helper function find_clusters() returns the cluster/state sequence for given data
points. The matching can be nearest neighbor or exact. Nearest neighbor always returns
a matching cluster, while exact will return no cluster (NA) if a data point does not fall within
the threshold of any cluster.
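The difference between the two matching modes can be sketched in base R. The function `match_cluster` below is a hypothetical stand-in written for this illustration (Euclidean distance, two made-up centers); it is not the package's find_clusters() and makes no claim about its arguments.

```r
# Return the index of the matching cluster for data point y.
# exact = FALSE: nearest neighbor matching, always returns a cluster.
# exact = TRUE:  returns NA if y is not within the threshold of any cluster.
match_cluster <- function(centers, y, threshold, exact = FALSE) {
  d <- sqrt(rowSums(sweep(centers, 2, y)^2))
  i <- which.min(d)
  if (exact && d[i] >= threshold) NA_integer_ else i
}

centers <- rbind(c(0, 0), c(5, 5))
y <- c(1, 1)                        # distance sqrt(2) to the first center

nn_match    <- match_cluster(centers, y, threshold = 1)                # cluster 1
exact_match <- match_cluster(centers, y, threshold = 1, exact = TRUE)  # NA
```

Exact matching is the natural choice when new data should be flagged as not covered by the model, while nearest neighbor matching always yields a state sequence.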
Finally, plot() implements several visualization methods for class `EMM'.
In the next section we give some examples of how to use rEMM in practice.
5. Examples
5.1. Basic usage
First, we load the package and a simple data set called EMMTraffic, which comes with the
package and was used by Dunham et al. (2004) to illustrate EMMs. Each of the 12 observations
in this hypothetical data set is a vector of seven values obtained from sensors located
at specific points on roads. Each sensor collects a count of the number of vehicles which have
crossed this sensor in the preceding time interval.
R> library("rEMM")
R> data("EMMTraffic")
R> EMMTraffic
   Loc_1 Loc_2 Loc_3 Loc_4 Loc_5 Loc_6 Loc_7
1     20    50   100    30    25     4    10
2     20    80    50    20    10    10    10
3     40    30    75    20    30    20    25
4     15    60    30    30    10    10    15
5     40    15    25    10    35    40     9
6      5     5    40    35    10     5     4
7      0    35    55     2     1     3     5
8     20    60    30    11    20    15    10
9     45    40    15    18    20    20    15
10    15    20    40    40    10    10    14
11     5    45    55    10    10    15     0
12    10    30    10     4    15    15    10
We use EMM() to create a new EMM object using extended Jaccard as proximity measure
and a dissimilarity threshold of 0.2. For the extended Jaccard measure pseudo medoids are
automatically chosen (use centroids = TRUE in EMM() to use pseudo centroids). Then we
build a model using the EMMTraffic data set. Note that build() takes the whole data set
at once, but this is only for convenience. Internally the data points are processed as a data
stream, strictly one after the other in a single pass.
R> emm <- EMM(threshold = 0.2, measure = "eJaccard")
R> emm <- build(emm, EMMTraffic)
R> size(emm)
[1] 7
The resulting EMM has 7 states. The number of data points represented by each cluster can
be accessed via cluster_counts().
R> cluster_counts(emm)
1 2 3 4 5 6 7
2 3 1 2 2 1 1
Cluster 2 has the most assigned data points, with a count of three. The cluster centers can
be inspected using cluster_centers().
R> cluster_centers(emm)
  Loc_1 Loc_2 Loc_3 Loc_4 Loc_5 Loc_6 Loc_7
1    20    50   100    30    25     4    10
2    20    80    50    20    10    10    10
3    40    15    25    10    35    40     9
4     5     5    40    35    10     5     4
5     0    35    55     2     1     3     5
6    45    40    15    18    20    20    15
7    10    30    10     4    15    15    10
plot() for `EMM' objects provides several visualization methods, for example, as a graph.
R> plot(emm, method = "graph")
The resulting graph is presented in Figure 2. In this representation the vertex size and the
arrow width code for the number of observations represented by each state and the transition
counts, i.e., more popular clusters and transitions are more prominently displayed.
The current transition probability matrix of the EMM can be calculated using
transition_matrix().
Figure 2: Graph representation of an EMM for the EMMTraffic data set.
R> transition_matrix(emm)
       1   2      3 4 5      6   7
1 0.0000 1.0 0.0000 0 0 0.0000 0.0
2 0.3333 0.0 0.3333 0 0 0.3333 0.0
3 0.0000 0.0 0.0000 1 0 0.0000 0.0
4 0.0000 0.0 0.0000 0 1 0.0000 0.0
5 0.0000 0.5 0.0000 0 0 0.0000 0.5
6 0.0000 0.0 0.0000 1 0 0.0000 0.0
7 0.0000 0.0 0.0000 0 0 0.0000 1.0
Alternatively, we can also get the raw transition count matrix.
R> transition_matrix(emm, type = "counts")
  1 2 3 4 5 6 7
1 0 2 0 0 0 0 0
2 1 0 1 0 0 1 0
3 0 0 0 1 0 0 0
4 0 0 0 0 2 0 0
5 0 1 0 0 0 0 1
6 0 0 0 1 0 0 0
7 0 0 0 0 0 0 0
Individual transition probabilities or counts can be obtained more efficiently via transition().
R> transition(emm, "1", "2", type = "probability")
[1] 1
Using the EMM model, we can predict a future cluster given a current cluster. For example,
we can predict the most likely cluster two time steps away from cluster 2.
R> predict(emm, n = 2, current = "2")
[1] "4"
predict() with probabilities = TRUE produces the probability distribution over all clusters.
R> predict(emm, n = 2, current = "2", probabilities = TRUE)
     1      2      3      4      5      6      7
0.0000 0.3333 0.0000 0.6667 0.0000 0.0000 0.0000
In this example cluster 4 was predicted since it has the highest probability. If several clusters
have the same probability the tie is randomly broken.
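The two-step distribution shown above can be reproduced from the transition probability matrix alone, since the n-step probabilities of a Markov chain are given by the n-th power of its transition matrix. The following base R sketch (not the package's implementation) uses the matrix printed by transition_matrix(emm); note that which.max() returns the first maximum instead of breaking ties randomly.

```r
# Transition probabilities as printed by transition_matrix(emm) above.
states <- as.character(1:7)
P <- matrix(c(
  0,   1,   0,   0, 0, 0,   0,
  1/3, 0,   1/3, 0, 0, 1/3, 0,
  0,   0,   0,   1, 0, 0,   0,
  0,   0,   0,   0, 1, 0,   0,
  0,   0.5, 0,   0, 0, 0,   0.5,
  0,   0,   0,   1, 0, 0,   0,
  0,   0,   0,   0, 0, 0,   1
), nrow = 7, byrow = TRUE, dimnames = list(states, states))

# Row "2" of P %*% P is the distribution two time steps after cluster 2.
two_step <- (P %*% P)["2", ]
round(two_step, 4)

# Most likely cluster (first maximum; ties are not randomized here).
names(which.max(two_step))
```

From cluster 2 the chain moves to clusters 1, 3 or 6 with probability 1/3 each; clusters 3 and 6 both lead to cluster 4, which explains the probability of 2/3 for cluster 4.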
5.2. Manipulating EMMs
EMMs can be manipulated by removing clusters or transitions and by merging clusters.
Figure 3(a) shows again the EMM for the EMMTraffic data set created above. We can
remove a cluster with remove_clusters(). For example, we remove cluster 3 and display the
resulting EMM in Figure 3(b).
R> emm_3removed <- remove_clusters(emm, "3")
R> plot(emm_3removed, method = "graph")
Removing transitions is done with remove_transitions(). In the following example we
remove the transition from cluster 5 to cluster 2 from the original EMM for EMMTraffic in
Figure 3(a). The resulting graph is shown in Figure 3(c).
R> emm_52removed <- remove_transitions(emm, "5", "2")
R> plot(emm_52removed, method = "graph")
Clusters can be merged using merge_clusters(). Here we merge clusters 2 and 5 into a
combined cluster. The combined cluster automatically gets the name of the first cluster in
the merge vector. The resulting EMM is shown in Figure 3(d).
R> emm_25merged <- merge_clusters(emm, c("2", "5"))
R> plot(emm_25merged, method = "graph")
Note that a transition from the combined cluster 2 to itself is created which represents the
transition from cluster 5 to cluster 2 in the original EMM.
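Why merging creates this self-transition can be seen on the transition count matrix. The sketch below is plain base R and only covers the transition counts (it ignores cluster centers); merge_states() is a hypothetical helper written for this illustration and is not part of rEMM.

```r
# Transition counts of the original EMM (rows: from, columns: to).
states <- as.character(1:7)
counts <- matrix(c(
  0, 2, 0, 0, 0, 0, 0,
  1, 0, 1, 0, 0, 1, 0,
  0, 0, 0, 1, 0, 0, 0,
  0, 0, 0, 0, 2, 0, 0,
  0, 1, 0, 0, 0, 0, 1,
  0, 0, 0, 1, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0
), nrow = 7, byrow = TRUE, dimnames = list(states, states))

# Hypothetical helper: fold state `drop` into state `keep`.
merge_states <- function(counts, keep, drop) {
  counts[keep, ] <- counts[keep, ] + counts[drop, ]  # outgoing transitions
  counts[, keep] <- counts[, keep] + counts[, drop]  # incoming transitions
  remaining <- setdiff(rownames(counts), drop)
  counts[remaining, remaining]
}

merged <- merge_states(counts, keep = "2", drop = "5")
merged["2", ]
```

The former transition from cluster 5 to cluster 2 now starts and ends in the combined cluster 2, so merged["2", "2"] is 1, and the two transitions from cluster 4 to cluster 5 now point to cluster 2.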
Figure 3: Graph representation for an EMM for the EMMTraffic data set. (a) shows the
original EMM, in (b) cluster 3 is removed, in (c) the transition from cluster 5 to cluster 2 is
removed, and in (d) clusters 2 and 5 are merged.
5.3. Using cluster structure fading and pruning
EMMs can adapt to changes in data over time. This is achieved by fading the cluster structure
using a decay rate. To show the effect, we train an EMM on the EMMTraffic data with a
rather high decay rate of $\lambda = 1$. Since the weight is calculated by $w_t = 2^{-\lambda t}$, the observations
are weighted $1, \frac{1}{2}, \frac{1}{4}, \ldots$.
R> emm_fading <- EMM(threshold = 0.2, measure = "eJaccard", lambda = 1)
R> emm_fading <- build(emm_fading, EMMTraffic)
R> plot(emm_fading, method = "graph")
The resulting graph is shown in Figure 4(b). The clusters which were created earlier on (clus-
ters with lower index number) are smaller (represent a lower weighted number of observations)
compared to the original EMM without fading displayed in Figure 4(a).
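The fading weights can be computed directly from the weight function given above. The following base R sketch evaluates the weights for the most recent observations with a decay rate of 1; the cluster that was last observed 5 and 6 time steps ago is a hypothetical example and is not taken from the data.

```r
lambda <- 1

# Age of an observation in time steps (0 = the most recent observation).
age <- 0:4
weights <- 2^(-lambda * age)
weights

# Hypothetical cluster whose only two observations are 5 and 6 steps old:
# its weighted count is the sum of the faded weights.
faded_count <- sum(2^(-lambda * c(5, 6)))
faded_count
```

With a decay rate of 1 the weight halves with every time step, so the hypothetical cluster retains a weighted count of only 0.046875 although it represents two observations.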
Figure 4: Graph representation of an EMM for the EMMTraffic data set. (a) shows the
original EMM. (b) shows an EMM with a learning rate of $\lambda = 1$. (c) shows the EMM with learning rate
after pruning with a count threshold of 0.1.
Over time clusters in an EMM can become obsolete and no new observations are assigned
to them. Similarly, transitions might become obsolete over time. To simplify the model and
improve efficiency, such obsolete clusters and transitions can be pruned. For the example
here, we prune all clusters and transitions which have a weighted count of less than 0.1 and
show the resulting model in Figure 4(c).
R> emm_pruned <- prune(emm_fading, count_threshold = 0.1)
R> plot(emm_pruned, method = "graph")
5.4. Visualization options
We use a simulated data set called EMMsim which is included in rEMM. The data contains four
well separated clusters in $\mathbb{R}^2$. Each cluster is represented by a bivariate normally distributed
random variable $X_i \sim N_2(\mu, \Sigma)$, where $\mu$ are the coordinates of the mean of the distribution and $\Sigma$
is the covariance matrix.
The temporal structure of the data is modeled by the fixed sequence $\langle 1, 2, 1, 3, 4 \rangle$ through
the four clusters which is repeated 40 times (200 data points) for the training data set and 5
times (25 data points) for the test data.
R> data("EMMsim")
The simple graph representation in Figure 6(a) shows a rather complicated graph for the
EMM. However, Figure 6(b) with the vertices positioned to represent similarities between
cluster centers shows more structure. The clusters clearly fall into four groups. The projection
of the cluster centers onto the data set in Figure 6(c) shows that the four groups represent
the four clusters in the data where the larger clusters are split into several micro-clusters. We
will introduce reclustering to simplify the structure in a later section.
5.5. Scoring new sequences
A score of how likely it is that a sequence was generated by a given EMM model can be
calculated by the length-normalized product or sum of probabilities on the path along the
new sequence. The scores for a new sequence of length $l$ are defined as:
$$P_{\mathrm{prod}} = \sqrt[l-1]{\prod_{i=1}^{l-1} a_{s(i)s(i+1)}} \qquad (7)$$
$$P_{\mathrm{sum}} = \frac{1}{l-1} \sum_{i=1}^{l-1} a_{s(i)s(i+1)} \qquad (8)$$
where $s(i)$ is the state to which the $i$th data point in the new sequence is assigned. Points
are assigned to the closest cluster only if the distance to the center is smaller than the
threshold. Data points which are not within the threshold of any cluster stay unassigned.
Note that for a sequence of length $l$ we have $l - 1$ transitions. If we want to take the initial
transition probability also into account we extend the above equations by the additional initial
probability $a_{\epsilon,s(1)}$:
$$P_{\mathrm{prod}} = \sqrt[l]{a_{\epsilon,s(1)} \prod_{i=1}^{l-1} a_{s(i)s(i+1)}} \qquad (9)$$
$$P_{\mathrm{sum}} = \frac{1}{l} \left( a_{\epsilon,s(1)} + \sum_{i=1}^{l-1} a_{s(i)s(i+1)} \right) \qquad (10)$$
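The four scores can be written down in a few lines of base R. This is a minimal sketch of the equations above and not the code used by score(); the function names score_prod() and score_sum() as well as the probabilities in the example are hypothetical.

```r
# a: transition probabilities along the path (length l - 1)
# initial: optional initial transition probability of the first state
score_prod <- function(a, initial = NULL) {
  p <- c(initial, a)           # l - 1 factors, or l factors with initial
  prod(p)^(1 / length(p))      # length-normalized product
}

score_sum <- function(a, initial = NULL) {
  mean(c(initial, a))          # length-normalized sum
}

# Hypothetical path with three transitions (sequence length l = 4).
a <- c(0.5, 0.8, 0.5)

score_prod(a)
score_sum(a)
score_prod(a, initial = 0.4)
score_sum(a, initial = 0.4)
```

Because the product score is a geometric mean, a single transition with probability zero forces it to zero, whereas the sum score only decreases.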
As an example, we calculate how well the test data fits the EMM created for the EMMsim
data in the section above. The test data is supplied together with the training set in rEMM.
R> score(emm, EMMsim_test, method = "prod", match_cluster = "exact",
+ plus_one = FALSE, initial_transition = FALSE)
[1] 0
R> score(emm, EMMsim_test, method = "sum", match_cluster = "exact",
+ plus_one = FALSE, initial_transition = FALSE)
[1] 0.227
Even though the test data was generated using exactly the same model as the training
data, the normalized product (method = "prod") produces a score of 0 and the normal-
ized sum (method = "sum") is also low. To analyze the problem we can look at the transition
table for the test sequence. The transition table is computed by transition_table().
R> transition_table(emm, EMMsim_test, match_cluster = "exact",
+ plus_one = FALSE)
from to prob
1 1 17 0.14286
2 17 3 0.15385
3 3 14 0.58333
4 14 5 0.73333
5 5 10 0.03571
6 10 7 0.33333
7 7 9 0.06667
8 9 14 0.14286
9 14 15 0.26667
10 15 1 0.00000
11 1 17 0.14286
12 17 3 0.15385
13 3 14 0.58333
14 14 5 0.73333
15 5 <NA> 0.00000
16 <NA> 7 0.00000
17 7 18 0.00000
18 18 14 0.00000
19 14 5 0.73333
20 5 3 0.10714
21 3 17 0.16667
22 17 <NA> 0.00000
23 <NA> 4 0.00000
24 4 15 0.36842
The low score is caused by data points that do not fall within the threshold for any cluster
(<NA> above) and by missing transitions in the matching sequence of clusters (counts and
probabilities of zero above). These missing transitions are the result of the fragmentation
of the real clusters into many micro-clusters (see Figures 6(b) and (c)). Suppose we have
two clusters called cluster A and cluster B and after an observation in cluster A always an
observation in cluster B follows. If now cluster A and cluster B are represented by many
micro-clusters each, it is likely that we find a pair of micro-clusters (one in A and one in B)
for which we did not see a transition yet and thus will have a transition count/probability of
zero.
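The effect of the zero entries on the two scores can be verified from the prob column of the transition table above. The following base R sketch copies the 24 probabilities from that table and recomputes both scores without using rEMM.

```r
# Column "prob" of the transition table above (24 transitions).
prob <- c(0.14286, 0.15385, 0.58333, 0.73333, 0.03571, 0.33333,
          0.06667, 0.14286, 0.26667, 0.00000, 0.14286, 0.15385,
          0.58333, 0.73333, 0.00000, 0.00000, 0.00000, 0.00000,
          0.73333, 0.10714, 0.16667, 0.00000, 0.00000, 0.36842)

# Number of transitions with probability zero.
sum(prob == 0)

# Normalized product: a single zero is enough to make the score zero.
prod(prob)^(1 / length(prob))

# Normalized sum: the zeros only pull the average down.
round(mean(prob), 3)
```

Seven of the 24 transitions have a probability of zero, which gives a product score of 0 and a sum score of 0.227, the values reported by score() above.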
To reduce the problem of not being able to match a data point to a cluster we can use a
nearest neighbor approach instead of exact matching (match_cluster = "nn" is the default
for score()). Here a new data point is assigned to the closest cluster even if it falls outside
the threshold. The problem with missing transitions can be reduced by starting with a prior
Figure 9: Gauged flow (in $m^3/s$) of the river Derwent at the Long Bridge catchment.
catchments of the river Derwent and two of its main tributaries in northern England. The
data was collected daily for roughly 5 years (1918 observations) from November 1, 1971 to
January 31, 1977. The catchments are Long Bridge, Matlock Bath, Chat Sworth, What Stand
Well, Ashford (river Wye) and Wind Field Park (river Amber).
The data set is interesting since it contains annual changes of river levels and also some special
flooding events.
R> data("Derwent")
R> summary(Derwent)
Long Bridge Matlock Bath Chat Sworth What Stand Well
Min. : 2.78 Min. : 2.61 Min. : 0.30 Min. : 0.74
1st Qu.: 7.10 1st Qu.: 5.08 1st Qu.: 1.32 1st Qu.: 2.27
Median : 10.95 Median : 7.89 Median : 2.16 Median : 3.13
Mean : 14.33 Mean : 10.64 Mean : 2.67 Mean : 4.51
3rd Qu.: 17.09 3rd Qu.: 12.78 3rd Qu.: 3.45 3rd Qu.: 4.87
Max. :109.30 Max. :104.60 Max. :16.06 Max. :72.79
Wye@Ashford Amber@Wind Field Park
Min. : 0.030 Min. : 0.010
1st Qu.: 0.180 1st Qu.: 0.040
Median : 0.330 Median : 0.090
Mean : 0.544 Mean : 0.143
3rd Qu.: 0.640 3rd Qu.: 0.160
Max. : 6.280 Max. : 4.160
NA's :31.000 NA's :252.000
Figure 10: Distribution of state counts of the EMM for the Derwent data.
From the summary we see that the gauged flows vary significantly among catchments (mean
flows range from 0.143 to 14.33). The influence of differences in average flows can be removed
by scaling the data before building the EMM. From the summary we also see that for the Ashford
and Wind Field Park catchments a significant number of observations are not available. EMM
deals with these missing values by using only the non-missing dimensions of the observations
for the proximity calculations (see package proxy for details).
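What scaling does to columns on very different levels, and how missing values are carried along, can be illustrated with base R's scale(). The small matrix below is hypothetical and only mimics one catchment with large flows and one with small flows and a missing value; it is not part of the Derwent data.

```r
# Hypothetical flows: one large and one small catchment, one value missing.
x <- cbind(big = c(10, 20, 30, 40),
           small = c(0.1, NA, 0.3, 0.2))

# scale() centers each column and divides it by its standard deviation;
# missing values are ignored when the means and standard deviations are
# computed and stay missing in the result.
x_scaled <- scale(x)
x_scaled

# After scaling both columns have mean 0 and standard deviation 1.
colMeans(x_scaled, na.rm = TRUE)
apply(x_scaled, 2, sd, na.rm = TRUE)
```

After scaling, both columns contribute on the same scale to a distance computation, and the missing entry is still marked as NA so that it can be left out of the proximity calculation.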
R> plot(Derwent[, 1], type = "l", ylab = "Gauged flow",
+ main = colnames(Derwent)[1])
In Figure 9 we can see the annual flow pattern for the Long Bridge catchment with higher
flows in September to March and lower flows in the summer months. The first year seems
to have more variability in the summer months and the second year has an unusual event
(around the index of 600 in Figure 9) with a flow above $100\,m^3/s$ which can be classified as
flooding.
We build an EMM from the (centered and) scaled river data using Euclidean distance between
the vectors containing the flows from the six catchments and experimentally found a distance
threshold of 3 (just above the 3rd quartile of the distance distribution between all scaled
observations) to give useful results.
R> Derwent_scaled <- scale(Derwent)
R> emm <- EMM(measure = "euclidean", threshold = 3)
R> emm <- build(emm, Derwent_scaled)
R> plot(emm, method = "cluster_counts", log = "y")
The resulting EMM has 20 clusters/states. Figure 10 shows that the cluster counts have a
very skewed distribution with clusters 1 and 2 representing most observations and clusters 5,
9, 10, 11, 12, 17, 19 and 20 being extremely rare.
Figure 11: Cluster centers of the EMM for the Derwent data set projected on 2-dimensional
space. (a) shows the full EMM and (b) shows a pruned EMM (only the most frequently used
states). The two plotted dimensions explain 84.65% of the point variability in (a) and 91.41% in (b).
R> plot(emm)
The projection of the cluster centers into 2-dimensional space in Figure 11(a) reveals that all
but clusters 11 and 12 are placed closely together.
Next we look at frequent clusters and transitions. We define rare here as all clusters/transitions
that represent less than 0.5% of the observations. On average this translates into less than
two daily observations per year. We calculate a count threshold, use prune() to remove rare
clusters/transitions and then we plot the pruned EMM.
R> rare_threshold <- sum(cluster_counts(emm)) * 0.005
R> rare_threshold
[1] 9.59
R> plot(prune(emm, rare_threshold))
The pruned model depicted in Figure 11(b) shows that 5 clusters represent approximately
99.5% of the river's behavior. All five clusters come from the lower half of the large group
of clusters in Figure 11(a). Clusters 1 and 2 are the most frequently occurring clusters and
the wide bidirectional arrow connecting them means that transitions between these
two clusters are common. To analyze the meaning of the two outlier clusters (11 and 12)
identified in Figure 11(a) above, we plot the flows at a catchment and mark the observations
for these states.
Figure 12: Gauged flow (in $m^3/s$) at (a) the Long Bridge catchment and (b) the Amber at
the Wind Field Park catchment. Outliers (states 11 and 12) are marked.
R> catchment <- 1
R> plot(Derwent[, catchment], type = "l", ylab = "Gauged flows",
+ main = colnames(Derwent)[catchment])
R> state_sequence <- find_clusters(emm, Derwent_scaled)
R> mark_states <- function(states, state_sequence, ys, col = 0,
+ label = NULL, ...) {
+ x <- which(state_sequence %in% states)
+ points(x, ys[x], col = col, ...)
+ if (!is.null(label))
+ text(x, ys[x], label, pos = 4, col = col)
+ }
R> mark_states("11", state_sequence, Derwent[, catchment],
+ col = "blue", label = "11")
R> mark_states("12", state_sequence, Derwent[, catchment],
+ col = "red", label = "12")
In Figure 12(a) we see that cluster 12 has a river flow in excess of $100\,m^3/s$ which only
happened once in the observation period. Cluster 11 seems to be a regular observation with
medium flow around $20\,m^3/s$ and it needs more analysis to find out why this cluster is also
an outlier directly leading to cluster 12.
R> catchment <- 6
R> plot(Derwent[, catchment], type = "l", ylab = "Gauged flow",
+ main = colnames(Derwent)[catchment])
R> mark_states("11", state_sequence, Derwent[, catchment],
+ col = "blue", label = "11")
R> mark_states("12", state_sequence, Derwent[, catchment],
+ col = "red", label = "12")
The catchment at Wind Field Park is at the Amber river which is a tributary of the Derwent
and we see in Figure 12(b) that the day before the flood occurs, the flow shoots up to
$4\,m^3/s$ which is caught by cluster 11. The temporal structure clearly indicates that a flood
is imminent the next day.
6.2. Genetic sequence analysis
The rEMM package also contains examples for 16S ribosomal RNA (rRNA) sequences for
the two phylogenetic classes, Alphaproteobacteria and Mollicutes. 16S rRNA is a component
of the ribosomal subunit 30S and is regularly used for phylogenetic studies (e.g., see Wang,
Garrity, Tiedje, and Cole 2007). Typically alignment heuristics like BLAST (Altschul, Gish,
Miller, Myers, and Lipman 1990) or a hidden Markov model (HMM, e.g., Hughey and Krogh
1996) are used for evaluating the similarity between two or more sequences. However, these
procedures are computationally very expensive.
An alternative approach is to describe the structure in terms of the occurrence frequency
of so-called n-words, subsequences of length n. Counting the occurrences of the $4^n$ (there
are four different nucleotide types) n-words is straightforward and computing similarities
between frequency profiles is very efficient. Because no alignment is computed, such methods
are called alignment-free (Vinga and Almeida 2003).
rEMM contains preprocessed sequence data for 30 16S sequences of the phylogenetic class
Mollicutes. The sequences were preprocessed by cutting them into windows of length 100
nucleotides without overlap and then for each window the occurrence of triplets of nucleotides
was counted resulting in $4^3 = 64$ counts per window. Each window will be used as an
observation to build the EMM. The counts for the 30 sequences are organized as a matrix
and sequences are separated by rows of NA resulting in resetting the EMM during the build
process.
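The preprocessing described above can be sketched in base R as follows. This is not the code that was used to create the data in rEMM: the helper functions split_windows() and count_triplets() and the short example sequence are hypothetical, triplets are assumed to be counted at every position within a window (overlapping), and an incomplete window at the end of a sequence is assumed to be dropped. A window length of 10 instead of 100 is used to keep the example small.

```r
# All 4^3 = 64 possible triplets in alphabetical order.
bases <- c("A", "C", "G", "T")
words <- sort(do.call(paste0,
  expand.grid(bases, bases, bases, stringsAsFactors = FALSE)))

# Cut a sequence into non-overlapping windows of equal width
# (assumption: a shorter remainder at the end is dropped).
split_windows <- function(sequence, width) {
  starts <- seq(1, nchar(sequence) - width + 1, by = width)
  substring(sequence, starts, starts + width - 1)
}

# Count the triplets starting at every position of a window
# (assumption: overlapping triplets).
count_triplets <- function(window) {
  starts <- seq_len(nchar(window) - 2)
  found <- substring(window, starts, starts + 2)
  counts <- tabulate(match(found, words), nbins = length(words))
  names(counts) <- words
  counts
}

# Hypothetical sequence of 25 nucleotides, cut into windows of length 10.
sequence <- "ACGTACGTACGGTTACGATCCGATA"
windows <- split_windows(sequence, width = 10)
counts <- t(vapply(windows, count_triplets, integer(length(words))))

dim(counts)
counts[1, c("ACG", "CGT", "GTA", "TAC")]
```

Each row of the resulting matrix is one observation with 64 counts. For several sequences, the count matrices would be stacked with a row of NA between them, as described above.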
Vinga and Almeida (2003) review dissimilarity measures used for alignment-free methods.
The most commonly used measures are Euclidean distance, d2 distance (a weighted Euclidean
distance), Mahalanobis distance and Kullback-Leibler discrepancy (KLD). Since Wu, Hsieh,
and Li (2001) find in their experiments that KLD provides good results while it still can be
computed as fast as Euclidean distance, it is also used here. Since KLD becomes $\infty$ for
counts of zero, we add one to all counts which conceptually means that we start building the
EMM with a prior that all triplets have equal occurrence probability (see Wu et al. 2001).
We use an experimentally found threshold of 0.1.
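The reason for adding one to all counts can be illustrated with a small base R sketch. The function kld() below implements the textbook Kullback-Leibler discrepancy between two normalized count vectors; it is a hypothetical helper, and whether it matches the measure used by the package in every detail is not verified here. The count vectors are also hypothetical.

```r
# Textbook Kullback-Leibler discrepancy between two count vectors.
# Terms with p = 0 contribute 0; a term with p > 0 and q = 0 is infinite.
kld <- function(x, y) {
  p <- x / sum(x)
  q <- y / sum(y)
  sum(ifelse(p > 0, p * log(p / q), 0))
}

# Hypothetical counts for three words; each vector contains a zero.
x <- c(3, 0, 1)
y <- c(2, 2, 0)

kld(x, y)          # infinite because y has a zero where x has a count
kld(x + 1, y + 1)  # finite after adding one to all counts
```

Without the added counts, a single word that occurs in one window but not in the other makes the discrepancy infinite, so no distance threshold could ever be met.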
R> data("16S")
R> emm <- EMM(threshold = 0.1, "Kullback")
R> emm <- build(emm, Mollicutes16S + 1)
R> plot(emm, method = "graph")
R> it <- initial_transition(emm)
R> it[it > 0]
1 23 36 43 47
0.70000 0.06667 0.13333 0.03333 0.06667
Figure 13: An EMM representing 16S sequences from the class Mollicutes represented as a
graph.
The graph representation of the EMM is shown in Figure 13. Note that each cluster/state
in the EMM corresponds to one or more windows of the rRNA sequence (the size of the
cluster indicates the number of windows). The initial transition probabilities show that most
sequences start the first count window in cluster 1. Several interesting observations can be
made from this representation.
• There exists a path through the graph using only the largest clusters and widest arrows
which represents the most common sequence of windows.
• There are several places in the EMM where almost all sequences converge (e.g., 4
and 14).
• There are places with high variability where many possible parallel paths exist (e.g., 7,
27, 20, 35, 33, 28, 65, 71).
• The window composition changes over the whole sequences since there are no edges
going back or skipping states on the way down.
In general it is interesting that the graph has no loops since Deschavanne, Giron, Vilain, Fagot,
and Fertil (1999) found in their study using Chaos Game Representation that the variability
along genomes and among genomes is low. However, they looked at longer sequences and we
look here at the micro structure of a very short sequence. These observations merit closer
analysis by biologists.
7. Conclusion
Temporal structure in data streams is ignored by current data stream clustering algorithms.
A temporal EMM layer can be used to retain such structure. In this paper we showed that
a temporal EMM layer can be added to any data stream clustering algorithm which works
with dynamically creating, deleting and merging clusters. As an example, we implemented in
rEMM a simple data stream clustering algorithm and the temporal EMM layer and demon-
strated its usefulness with two applications.
Future work will include extending popular data stream clustering algorithms with EMM,
incorporating higher order models and adding support for reading data directly from data stream
sources.
Acknowledgments
The authors would like to thank the anonymous reviewers for their valuable comments. Part
of the research presented in this paper was supported by NSF IIS-0948893.
References
Aggarwal C (2009). "A Framework for Clustering Massive-Domain Data Streams." In IEEE
25th International Conference on Data Engineering (ICDE '09), pp. 102–113.
Aggarwal CC, Han J, Wang J, Yu PS (2003). "A Framework for Clustering Evolving Data
Streams." In Proceedings of the International Conference on Very Large Data Bases (VLDB
'03), pp. 81–92.
Aggarwal CC, Han J, Wang J, Yu PS (2004). "A Framework for Projected Clustering of High
Dimensional Data Streams." In Proceedings of the Thirtieth International Conference on
Very Large Data Bases (VLDB '04), pp. 852–863. ISBN 0-12-088469-0.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). "Basic Local Alignment
Search Tool." Journal of Molecular Biology, 215(3), 403–410.
Ankerst M, Breunig MM, Kriegel HP, Sander J (1999). "OPTICS: Ordering Points To Identify
the Clustering Structure." In Proceedings of the 1999 ACM SIGMOD International
Conference on Management of Data, pp. 49–60.
Cao F, Ester M, Qian W, Zhou A (2006). "Density-Based Clustering over an Evolving Data
Stream with Noise." In Proceedings of the 2006 SIAM International Conference on Data
Mining, pp. 328–339. SIAM.
Cormack GV, Horspool RNS (1987). "Data Compression Using Dynamic Markov Modeling."
The Computer Journal, 30(6).
Cox T, Cox M (2001). Multidimensional Scaling. Chapman and Hall.
Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B (1999). "Genomic Signature: Characterization
and Classification of Species Assessed by Chaos Game Representation of Sequences."
Molecular Biology and Evolution, 16(10), 1391–1399.
Dunham MH, Meng Y, Huang J (2004). "Extensible Markov Model." In Proceedings IEEE
ICDM Conference, pp. 371–374. IEEE.
Fowler M (2004). UML Distilled: A Brief Guide to the Standard Object Modeling Language.
3rd edition. Addison-Wesley Professional.
Gentry J, Long L, Gentleman R, Falcon S, Hahne F, Sarkar D (2010). Rgraphviz: Provides
Plotting Capabilities for R Graph Objects. R package version 1.26.0, URL
http://www.bioconductor.org/packages/2.6/bioc/html/Rgraphviz.html.
Goldberg D, Mataric MJ (1999). "Coordinating Mobile Robot Group Behavior Using a Model
of Interaction Dynamics." In Proceedings of the Third International Conference on
Autonomous Agents.
Guha S, Meyerson A, Mishra N, Motwani R, O'Callaghan L (2003). "Clustering Data Streams:
Theory and Practice." IEEE Transactions on Knowledge and Data Engineering, 15(3),
515–528.
Guha S, Mishra N, Motwani R, O'Callaghan L (2000). "Clustering Data Streams." In
Proceedings of the ACM Symposium on Foundations of Computer Science, pp. 359–366.
Hughey R, Krogh A (1996). "Hidden Markov Models for Sequence Analysis: Extension and
Analysis of the Basic Method." Computational Applications in Bioscience, 12(2), 95–107.
Isaksson C, Meng Y, Dunham MH (2006). "Risk Leveling of Network Traffic Anomalies."
International Journal of Computer Science and Network Security, 6(6), 258–265.
Kijima M (1997). Markov Processes for Stochastic Modeling. Stochastic Modeling Series.
Chapman & Hall/CRC, Boca Raton.
Kriegel HP, Kröger P, Gotlibovich I (2003). "Incremental OPTICS: Efficient Computation
of Updates in a Hierarchical Cluster Ordering." In Data Warehousing and Knowledge
Discovery, volume 2737 of Lecture Notes in Computer Science, pp. 224–233.
Springer-Verlag.
Leisch F (2006). "A Toolbox for K-Centroids Cluster Analysis." Computational Statistics &
Data Analysis, 51(2), 526–544.
Lu L, Dunham MH, Meng Y (2006). "Mining Significant Usage Patterns from Clickstream
Data." In Advances in Web Mining and Web Usage Analysis, volume 4198 of Lecture Notes
in Computer Science. Springer-Verlag.
Maechler M, Rousseeuw P, Struyf A, Hubert M (2010). cluster: Cluster Analysis Basics
and Extensions. R package version 1.13.1, URL
http://CRAN.R-project.org/package=cluster.
Meng Y, Dunham MH (2006a). "Efficient Mining of Emerging Events in a Dynamic
Spatiotemporal Environment." In Advances in Knowledge Discovery and Data Mining, volume
3918 of Lecture Notes in Computer Science, pp. 750–754. Springer-Verlag.
Meng Y, Dunham MH (2006b). "Mining Developing Trends of Dynamic Spatiotemporal Data
Streams." Journal of Computers, 1(3), 43–50.
Meng Y, Dunham MH (2006c). "Online Mining of Risk Level of Traffic Anomalies with User's
Feedbacks." In Proceedings of the IEEE International Conference on Granular Computing,
pp. 176–181.
Meng Y, Dunham MH, Marchetti F, Huang J (2006). "Rare Event Detection in a
Spatiotemporal Environment." In Proceedings of the IEEE International Conference on Granular
Computing, pp. 629–634.
Meyer D, Buchta C (2010). proxy: Distance and Similarity Measures. R package version
0.4-6, URL http://CRAN.R-project.org/package=proxy.
O'Callaghan L, Mishra N, Meyerson A, Guha S, Motwani R (2002). "Streaming-data
Algorithms for High-quality Clustering." In Proceedings of the 18th International Conference
on Data Engineering (ICDE'02), pp. 685–. IEEE Computer Society.
Ostendorf M, Singer H (1997). "HMM Topology Design Using Maximum Likelihood Successive
State Splitting." Computer Speech and Language, 11(1), 17–41.
Parzen E (1999). Stochastic Processes. Society for Industrial Mathematics.
R Development Core Team (2010). R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL
http://www.R-project.org/.
Sander J, Ester M, Kriegel HP, Xu X (1998). "Density-Based Clustering in Spatial Databases:
The Algorithm GDBSCAN and Its Applications." Data Mining and Knowledge Discovery,
2(2), 169–194.
Tasoulis D, Adams N, Hand D (2006). "Unsupervised Clustering in Streaming Data." In IEEE
International Workshop on Mining Evolving and Streaming Data. Sixth IEEE International
Conference on Data Mining (ICDM 2006), pp. 638–642.
Tasoulis DK, Ross G, Adams NM (2007). "Visualising the Cluster Structure of Data Streams."
In Advances in Intelligent Data Analysis VII, Lecture Notes in Computer Science,
pp. 81–92. Springer-Verlag.
Tu L, Chen Y (2009). "Stream Data Clustering Based on Grid Density and Attraction." ACM
Transactions on Knowledge Discovery from Data, 3(3), 1–27. ISSN 1556-4681.
Vinga S, Almeida J (2003). "Alignment-Free Sequence Comparison - A Review."
Bioinformatics, 19(4), 513–523.
Wan L, Ng WK, Dang XH, Yu PS, Zhang K (2009). "Density-Based Clustering of Data
Streams at Multiple Resolutions." ACM Transactions on Knowledge Discovery from Data,
3(3), 1–28. ISSN 1556-4681.
Wang Q, Garrity GM, Tiedje JM, Cole JR (2007). "Naive Bayesian Classifier for Rapid
Assignment of rRNA Sequences into the new Bacterial Taxonomy." Applied and Environmental
Microbiology, 73(16), 5261–5267.
Wu TJ, Hsieh YC, Li LA (2001). "Statistical Measures of DNA Sequence Dissimilarity under
Markov Chain Models of Base Composition." Biometrics, 57(2), 441–448.
Zhang T, Ramakrishnan R, Livny M (1996). "BIRCH: An Efficient Data Clustering Method
for Very Large Databases." In Proceedings of the 1996 ACM SIGMOD International
Conference on Management of Data, pp. 103–114. ACM.
Affiliation:
Michael Hahsler
Computer Science and Engineering
Lyle School of Engineering
Southern Methodist University
P.O. Box 750122
Dallas, TX 75275-0122, United States of America
E-mail: mhahsler@lyle.smu.edu
URL: http://lyle.smu.edu/~mhahsler/
Margaret H. Dunham
Computer Science and Engineering
Lyle School of Engineering
Southern Methodist University
P.O. Box 750122
Dallas, TX 75275-0122, United States of America
E-mail: mhd@lyle.smu.edu
URL: http://lyle.smu.edu/~mhd/
Journal of Statistical Software http://www.jstatsoft.org/
published by the American Statistical Association http://www.amstat.org/
Volume 35, Issue 5 Submitted: 2009-05-26
July 2010 Accepted: 2010-05-10
