
Mining the structure of user activity using cluster stability

by Jeffrey Heer, E H Chi
Proceedings of the Workshop on Web Analytics SIAM Conference on Data Mining (2002)

Abstract

Recent research has explored web user session clustering as a means of understanding user activity and interests on the World Wide Web. Though the proposed techniques have proven to be useful and effective, they require that one either specify the number of clusters in advance or browse a large hierarchy of clusters to find the optimal depth at which to describe user activity. In this paper, we examine the utility of a stability-based technique for automatically determining the optimal number of clusters in the context of web user session clustering. We present two case studies evaluating the technique's effectiveness.


Jeffrey Heer, Ed H. Chi
PARC (Palo Alto Research Center)
Palo Alto, CA 94304, USA
{jheer, echi}@parc.com
Keywords
Clustering, Stability, Cluster Analysis, Log Analysis, Web Mining, User Interest, User Sessions
1 INTRODUCTION
As the Web continues to evolve and expand, ingraining itself into the fabric of our everyday lives, it
becomes increasingly important to get accurate pictures of web usage and activity. Nearly every aspect of
the web experience can be improved by understanding the composition of goals and activities on a web
site. This includes an array of topics ranging from server performance, to page caching and prefetching, to
content and navigation design.
Determining the composition of user interests on the Web is a daunting task. Given the massive size of the
Web, along with the time and resource costs involved in traditional techniques such as contextual inquiry
and user surveys, it becomes clear that accurate, automated techniques are necessary for acquiring this
information. One promising automated approach is user session clustering, which, using web usage logs,
attempts to group site visits into common activities such as product catalog browsing, job seeking, and
article browsing.
While a number of session clustering approaches have been proposed, with varying degrees of complexity
and accuracy [8], they all share a common drawback: none of the proposed methods currently discovers the
optimal structure of the data to be clustered. That is, they are unable to determine the number of clusters
that best represents the high-level composition of user activity on a site. This leaves web analysts with the
need to browse through large categorization hierarchies or try a number of cluster counts until a seemingly
acceptable choice is reached.
In this paper, we build upon our previous work in user session clustering, incorporating a recently proposed
method for automatically determining the optimal number of clusters. We then evaluate this method to test
its effectiveness for user session clustering.
The remainder of the paper is organized as follows: First, we discuss related work in user session clustering
and automatic cluster count determination. Next, we present our session clustering method in greater detail
and describe a stability-based method for determining the structure of clustered data. We then evaluate this
method, performing case studies on a pair of web sites, and present the results. Finally, we offer some
concluding remarks.
2 RELATED WORK
2.1 Web User Session Clustering
A number of web mining analysis tools have emerged which offer basic summarization of web activity by
grouping user actions into activities such as reading bulletin board messages, finding product information,
or searching for technical support. A number of clustering approaches have been proposed, all of which use
web server logs to generate a model of user actions that is then grouped with a clustering algorithm.
Shahabi et al. describe a prototype system that uses viewing time as the primary feature to describe a user
session and then clusters the sessions using K-Means clustering [16].
Fu et al. suggested a technique called Generalization-based Clustering, which uses page URLs to
construct a hierarchy which is then used to categorize the pages [6]. The page accesses in each user session
are described using these page categorizations and are then clustered using the BIRCH algorithm [20].
Banerjee et al. utilized the combination of time spent on a page and Longest Common Subsequences
(LCS) to cluster the user sessions [1]. The LCS algorithm is first applied on all pairs of user sessions. Then
each LCS path is reduced using the page hierarchy in a generalization-based approach called Concept-based
Clustering. This is basically a simplified form of Generalization-based Clustering, using only the topmost
level of the page hierarchy. Similarities between LCS paths are then computed as a function of page
viewing time, creating a similarity graph that is then partitioned using the Metis algorithm [11].
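To make the first step of this pipeline concrete, the following is a minimal sketch of the standard dynamic-programming LCS computation applied to a single pair of sessions. The function name and the example URL paths are illustrative assumptions and are not taken from [1]; the hierarchy reduction and viewing-time weighting are not shown.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of two page sequences."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back through the table to recover the pages themselves.
    path = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            path.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return path[::-1]


# Two hypothetical sessions, each an ordered list of requested paths.
session_a = ["/", "/products", "/cart", "/checkout"]
session_b = ["/", "/search", "/products", "/checkout"]
print(longest_common_subsequence(session_a, session_b))
# prints ['/', '/products', '/checkout']
```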
Finally, Heer and Chi proposed a technique that utilizes a number of information sources to create a model
of user profiles, which can then be grouped using standard clustering algorithms [8, 9]. This method utilizes
data features from content and structure, in addition to the URLs, sequence ordering, and timing data
already contained in logs. A user study on www.xerox.com found that the method accurately grouped users
by their surfing goals [9]. We will revisit both this method and its evaluation later in this paper.
One common drawback to these clustering techniques is that they contain no measure of the structure of the
data being clustered. For the partitioning approaches, no means for determining the appropriate number of
clusters is provided. For hierarchical clustering approaches, the analyst must probe the entire hierarchy and
manually search for the right levels at which to describe the sessions.
To ease the burden of web analysts who wish to apply these techniques, it is necessary to augment these
methods such that they also find the structure (or lack thereof) in the usage data. Optimally, this would
include finding structure at multiple levels of granularity, determining both high-level groupings such as
Product Browsing or Job Searching, and more specific groupings such as Engineering Positions and Sales
and Marketing Jobs. In our experience [8], this can be achieved with a human supervisor making the
decisions on the optimal clustering structure, merging and reclustering groups as necessary. We now
examine proposed techniques for automating this process.
2.2 Determining Cluster Counts
A number of methods for automatically determining the structure of clustered data have been proposed,
including statistical modeling methods, cluster dispersion measures, and analyses of cluster stability.
Fraley & Raftery [4] describe the use of an approximated Bayes factor, computed with the EM
(Expectation-Maximization) algorithm, to compare statistical models of cluster data and simultaneously
choose both the desired number of clusters and clustering technique.
Calinski & Harabasz [3] proposed maximizing a normalized ratio of between-cluster to within-cluster distances
as a means of choosing the optimal number of clusters. A 1985 study by Milligan and Cooper [13]
determined this to be the best metric among those considered in the study.
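As an illustration, the sketch below computes this ratio in its commonly stated form, (B / (k - 1)) / (W / (n - k)), where B and W are the between-cluster and within-cluster sums of squared distances, n is the number of points, and k is the number of clusters. The function name and the toy data are our own assumptions; the sketch also assumes at least two clusters and a nonzero within-cluster sum.

```python
def calinski_harabasz(points, labels):
    """Normalized ratio of between-cluster to within-cluster dispersion."""
    def mean(ps):
        return tuple(sum(c) / len(ps) for c in zip(*ps))

    def dist2(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    # Group the points by their cluster label.
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)

    n, k = len(points), len(groups)
    overall = mean(points)
    between = sum(len(ps) * dist2(mean(ps), overall) for ps in groups.values())
    within = sum(dist2(p, mean(ps)) for ps in groups.values() for p in ps)
    return (between / (k - 1)) / (within / (n - k))


# Toy one-dimensional data with two obvious groups.
points = [(0.0,), (2.0,), (10.0,), (12.0,)]
good = calinski_harabasz(points, [0, 0, 1, 1])
bad = calinski_harabasz(points, [0, 1, 0, 1])
print(good, bad)
```

Under this criterion, one would compute the score for each candidate k and keep the clustering with the largest value; here the labeling that respects the two groups scores far higher than the one that mixes them.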
A number of methods based on cluster dispersion, or the within-cluster sum of squared distances, have been
proposed, including approaches by Hartigan [7], Krzanowski and Lai [12], and Tibshirani et al. [18]. Of
recent interest is the work of Tibshirani et al., who proposed the Gap statistic for determining the optimal
number of clusters. The method computes the within-cluster dispersion for increasing values of k, and
compares the change in these values against a reference null distribution. Tibshirani et al. explored using
both a uniform reference distribution over the range of each feature, and a uniform reference in the
principal component orientation.
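For reference, the Gap statistic is usually written as follows. The notation is an assumption on our part rather than something fixed by this paper: C_r denotes cluster r, n_r its size, d_{ii'} the squared distance between points i and i', and E*_n the expectation under the reference distribution.

```latex
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} \sum_{i, i' \in C_r} d_{ii'},
\qquad
\mathrm{Gap}_n(k) = E_n^{*}\left[\log W_k\right] - \log W_k
```

In this formulation the chosen k is the smallest one for which Gap(k) >= Gap(k+1) - s_{k+1}, where s_{k+1} accounts for the simulation error in estimating the reference expectation.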
Cluster stability has also been proposed as a criterion for determining the structure of data. Building on
previous work in stability measurement [17] and cluster comparison [5], Ben-Hur et al. proposed a
stability-based method for finding the optimal number of clusters [2]. Their technique samples a space of
clusterings for each choice of k, and uses a clustering similarity metric to generate a distribution of stability
values. This distribution is then used to choose the most stable clustering.
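The general idea can be sketched as follows. This is a rough illustration under stated assumptions, not the procedure of [2]: the toy k-means routine with farthest-first initialization, the pairwise Jaccard coefficient as the similarity metric, the 0.8 sampling fraction, the number of sampled pairs, and all function names are choices made here for the sake of a short, self-contained example.

```python
import random
from itertools import combinations


def dist2(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iters=20):
    """Toy k-means with farthest-first initialization; returns one label per point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def jaccard(labels_a, labels_b):
    """Pairs co-clustered in both labelings over pairs co-clustered in either."""
    both = either = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b
        either += same_a or same_b
    return both / either if either else 1.0


def stability_scores(points, k, pairs=10, fraction=0.8, seed=0):
    """Similarity of clusterings of random subsample pairs, on their shared points."""
    rng = random.Random(seed)
    n, size = len(points), int(fraction * len(points))
    scores = []
    for _ in range(pairs):
        idx_a = rng.sample(range(n), size)
        idx_b = rng.sample(range(n), size)
        lab_a = dict(zip(idx_a, kmeans([points[i] for i in idx_a], k)))
        lab_b = dict(zip(idx_b, kmeans([points[i] for i in idx_b], k)))
        shared = sorted(set(idx_a) & set(idx_b))
        scores.append(jaccard([lab_a[i] for i in shared], [lab_b[i] for i in shared]))
    return scores


# Synthetic data: three tight, well-separated groups of ten points each.
blobs = [(cx + dx, cy + dy)
         for cx, cy in [(0, 0), (100, 0), (0, 100)]
         for dx in (0.0, 0.5)
         for dy in (0.0, 0.5, 1.0, 1.5, 2.0)]

for k in range(2, 6):
    scores = stability_scores(blobs, k)
    print(k, round(sum(scores) / len(scores), 3))
```

A value of k whose scores concentrate near 1 indicates that resampling the data barely changes the clustering; on this synthetic data, every score for k = 3 is exactly 1.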
For the purposes of our research, we decided to investigate the stability-based measure as formulated by
Ben-Hur et al. Not only does this approach match most closely with our own preliminary formulations, but the
technique also showed the best results in an evaluation across an array of different data sets [2].
Additionally, as Ben-Hur et al. discussed, stability-based methods are independent of model and technique,
and furthermore do not make any assumptions as to cluster shape or density, as some other methods do.
