Journal of Machine Learning Research 3 (2002) 583-617    Submitted 4/02; Published 12/02

Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions

Alexander Strehl    alexander@strehl.com
Joydeep Ghosh    ghosh@ece.utexas.edu
Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, TX 78712, USA

Editor: Claire Cardie

Abstract

This paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. We first identify several application scenarios for the resultant "knowledge reuse" framework that we call cluster ensembles. The cluster ensemble problem is then formalized as a combinatorial optimization problem in terms of shared mutual information. In addition to a direct maximization approach, we propose three effective and efficient techniques for obtaining high-quality combiners (consensus functions). The first combiner induces a similarity measure from the partitionings and then reclusters the objects. The second combiner is based on hypergraph partitioning. The third one collapses groups of clusters into meta-clusters which then compete for each object to determine the combined clustering. Due to the low computational costs of our techniques, it is quite feasible to use a supra-consensus function that evaluates all three approaches against the objective function and picks the best solution for a given situation. We evaluate the effectiveness of cluster ensembles in three qualitatively different application scenarios: (i) where the original clusters were formed based on non-identical sets of features, (ii) where the original clustering algorithms worked on non-identical sets of objects, and (iii) where a common data-set is used and the main purpose of combining multiple clusterings is to improve the quality and robustness of the solution.
Promising results are obtained in all three situations for synthetic as well as real data-sets.

Keywords: cluster analysis, clustering, partitioning, unsupervised learning, multi-learner systems, ensemble, mutual information, consensus functions, knowledge reuse

© 2002 Alexander Strehl and Joydeep Ghosh.

1. Introduction

The notion of integrating multiple data sources and/or learned models is found in several disciplines, for example, the combining of estimators in econometrics (Granger, 1989), evidences in rule-based systems (Barnett, 1981) and multi-sensor data fusion (Dasarathy, 1994). A simple but effective type of multi-learner system is an ensemble in which each component learner (typically a regressor or classifier) tries to solve the same task. While early studies on combining multiple rankings, such as the works by Borda and Condorcet, pre-date the French Revolution (Ghosh, 2002a), this area noticeably came to life in the past
decade, and now even boasts its own series of dedicated workshops (Kittler and Roli, 2002). Until now the main goal of ensembles has been to improve the accuracy and robustness of a given classification or regression task, and spectacular improvements have been obtained for a wide variety of data sets (Sharkey, 1999).

Unlike classification or regression settings, there have been very few approaches proposed for combining multiple clusterings.¹ Notable exceptions include:

• strict consensus clustering for designing evolutionary trees, typically leading to a solution at a much lower resolution than that of the individual solutions, and
• combining the results of several clusterings of a given data-set, where each solution resides in a common, known feature space, for example, combining multiple sets of cluster centers obtained by using k-means with different initializations (Bradley and Fayyad, 1998).

In this paper, we introduce the problem of combining multiple partitionings of a set of objects without accessing the original features. We call this the cluster ensemble problem, and will motivate this new, constrained formulation shortly. Note that since the combiner can only examine the cluster label but not the original features, this is a framework for knowledge reuse (Bollacker and Ghosh, 1999). The cluster ensemble design problem is more difficult than designing classifier ensembles since cluster labels are symbolic and so one must also solve a correspondence problem. In addition, the number and shape of clusters provided by the individual solutions may vary based on the clustering method as well as on the particular view of the data available to that method. Moreover, the desired number of clusters is often not known in advance.
In fact, the "right" number of clusters in a data-set often depends on the scale at which the data is inspected, and sometimes equally valid (but substantially different) answers can be obtained for the same data (Chakaravathy and Ghosh, 1996).

We call a particular clustering algorithm with a specific view of the data a clusterer. Each clusterer outputs a clustering or labeling, comprising the group labels for some or all objects. Some clusterers may provide additional information such as descriptions of cluster means, but we shall not use such information in this paper.

There are two primary motivations for developing cluster ensembles as defined above: to exploit and reuse existing knowledge implicit in legacy clusterings, and to enable clustering over distributed data-sets in cases where the raw data cannot be shared or pooled together because of restrictions due to ownership, privacy, storage, etc. Let us consider these two application domains in greater detail.

1. See Section 5 on related work for details.

Knowledge Reuse. In several applications, a variety of clusterings for the objects under consideration may already exist, and one desires to either integrate these clusterings into a single solution, or use this information to influence a new clustering (perhaps based on a different set of features) of these objects. Our first encounter with this application scenario was when clustering visitors to an e-tailing website based on market basket analysis, in order to facilitate a direct marketing campaign (Strehl and Ghosh, 2000). The company already had a variety of legacy customer segmentations based on demographics, credit rating, geographical region and purchasing patterns in
their retail stores, etc. They were obviously reluctant to throw out all this domain knowledge, and instead wanted to reuse such pre-existing knowledge to create a single consolidated clustering. Note that since the legacy clusterings were largely provided by human experts or by other companies using proprietary methods, the information in the legacy segmentations had to be used without going back to the original features or the "algorithms" that were used to obtain these clusterings. This experience was instrumental in our formulation of the cluster ensemble problem. Another notable aspect of this engagement was that the two sets of customers, purchasing from retail outlets and from the website respectively, had significant overlap but were not identical. Thus the cluster ensemble problem has to allow for missing labels in individual clusterings.

There are several other applications where legacy clusterings are available and a constrained use of such information is useful. For example, one may wish to combine or reconcile a clustering or categorization of web pages based on text analysis with those already available from Yahoo! or DMOZ (according to manually-crafted taxonomies), from Internet service providers according to request patterns and frequencies, and those indicated by a user's personal bookmarks according to his or her preferences. As a second example, clustering of mortgage loan applications based on the information in the application forms can be supplemented by segmentations of the applicants indicated by external sources such as the FICO scores provided by Fair Isaac.

Distributed Computing. The desire to perform distributed data mining is being increasingly felt in both government and industry. Often, related information is acquired and stored in geographically distributed locations due to organizational or operational constraints (Kargupta and Chan, 2000), and one needs to process data in situ as far as possible.
In contrast, machine learning algorithms invariably assume that data is available in a single centralized location. One can argue that by transferring all the data to a single location and performing a series of merges and joins, one can get a single (albeit very large) flat file, and our favorite algorithms can then be used after randomizing and subsampling this file. But in practice, such an approach may not be feasible because of the computational, bandwidth and storage costs. In certain cases, it may not even be possible due to a variety of real-life constraints including security, privacy, the proprietary nature of data and the accompanying ownership issues, the need for fault tolerant distribution of data and services, real-time processing requirements or statutory constraints imposed by law (Prodromidis et al., 2000). Interestingly, the severity of such constraints has become very evident of late as several government agencies attempt to integrate their databases and analytical techniques.

A cluster ensemble can be employed in "privacy-preserving" scenarios where it is not possible to centrally collect all records for cluster analysis, but the distributed computing entities can share smaller amounts of higher level information such as cluster labels.

The ensemble can be used for feature-distributed clustering in situations where each processor/clusterer has access to only a limited number of features or attributes of each object, i.e., it observes a particular aspect or view of the data. Aspects can be completely disjoint features or have partial overlaps. In gene function prediction, separate gene clusterings can be obtained from diverse sources such as gene sequence
comparisons, combinations of DNA microarray data from many independent experiments, and mining of the biological literature such as MEDLINE.

An orthogonal scenario is object-distributed clustering, wherein each processor/clusterer has access to only a subset of all objects, and can thus only cluster the observed objects. For example, corporations tend to split their customers regionally for more efficient management. Analysis such as clustering is often performed locally, and a cluster ensemble provides a way of obtaining a holistic analysis without complete integration of the local data warehouses.

One can also consider the use of cluster ensembles for the same reasons as classification ensembles, namely to improve the quality and robustness of results. For classification or regression problems, it has been analytically shown that the gains from using ensemble methods involving strong learners are directly related to the amount of diversity among the individual component models (Krogh and Vedelsby, 1995; Tumer and Ghosh, 1999). One desires that each individual model be powerful, but at the same time, these models should have different inductive biases and thus generalize in distinct ways (Dietterich, 2001). So it is not surprising that ensembles are most popular for integrating relatively unstable models such as decision trees and multi-layered perceptrons. If diversity is indeed found to be beneficial in the clustering context, then it can be created in numerous ways, including:

• using different features to represent the objects.
For example, images can be represented by their pixels, histograms, location and parameters of perceptual primitives or 3D scene coordinates
• varying the number and/or location of initial cluster centers in iterative algorithms such as k-means
• varying the order of data presentation in on-line methods such as BIRCH
• using a portfolio of very different clustering algorithms such as density based, k-means or soft variants such as fuzzy c-means, graph partitioning based, statistical mechanics based, etc.

It is well known that the comparative performance of different clustering methods can vary significantly across data-sets. For example, the popular k-means algorithm performs miserably in several situations where the data cannot be accurately characterized by a mixture of k Gaussians with identical covariance matrices (Karypis et al., 1999). In fact, for difficult data-sets, comparative studies across multiple clustering algorithms typically show much more variability in results than studies comparing the results of strong learners for classification (Richard and Lippmann, 1991). Thus there could be a potential for greater gains when using an ensemble for the purpose of improving clustering quality.

Note that, in contrast to the knowledge reuse and distributed clustering scenarios, in this situation the combination mechanism could have had access to the original features. Our restriction that the consensus mechanism can only use cluster labels is in this case solely to simplify the problem and limit the scope of the solution, just as combiners of multiple classifiers are often based solely on the classifier outputs (for example, voting and averaging methods), although a richer design space is available.
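
The comparison with classifier combiners, together with the correspondence problem that arises because cluster labels are symbolic, can be illustrated with a minimal Python sketch. It applies a per-object plurality vote first to classifier outputs and then to cluster labels. The helper `plurality_vote`, the toy label lists, and the hand-made relabeling are assumptions made for this illustration; they are not one of the consensus functions proposed in this paper.

```python
from collections import Counter


def plurality_vote(labelings):
    """Return, for each object, the label that most of the labelings assign to it."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*labelings)]


# Classifier outputs: "a" and "b" mean the same class for every classifier.
classifier_outputs = [
    ["a", "a", "a", "b", "b", "b"],
    ["a", "a", "b", "b", "b", "b"],
    ["a", "a", "a", "a", "b", "b"],
]
voted_classes = plurality_vote(classifier_outputs)

# Cluster labels are symbolic: k1 and k2 describe the same grouping,
# {0, 1, 2} and {3, 4, 5}, but name the two groups in opposite order.
k1 = [1, 1, 1, 2, 2, 2]
k2 = [2, 2, 2, 1, 1, 1]
k3 = [1, 1, 2, 2, 2, 2]
naive_vote = plurality_vote([k1, k2, k3])

# Hand-made relabeling of k2 (assumed known here; finding such a mapping
# automatically is the correspondence problem).
swap = {1: 2, 2: 1}
k2_aligned = [swap[label] for label in k2]
aligned_vote = plurality_vote([k1, k2_aligned, k3])

print(voted_classes)
print(naive_vote)
print(aligned_vote)
```

In this toy case the naive vote reproduces `k3`, even though `k1` and `k2` agree on the grouping, because their opposite label names cancel each other. Once `k2` is relabeled, the vote recovers the grouping shared by `k1` and `k2`.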
Figure 1: The Cluster Ensemble. A consensus function $\Gamma$ combines clusterings $\lambda^{(q)}$ from a variety of sources, without resorting to the original object features $\mathcal{X}$ or algorithms $\Phi$.

A final, related motivation for using a cluster ensemble is to build a robust clustering portfolio that can perform well over a wide range of data-sets with little hand-tuning. For example, by using an ensemble that includes approaches such as k-means, SOM (Kohonen, 1995) and DBSCAN (Ester et al., 1996), that typically work well in low-dimensional metric spaces, as well as algorithms tailored for high-dimensional sparse spaces such as spherical k-means (Dhillon and Modha, 2001) and Jaccard-based graph-partitioning (Strehl and Ghosh, 2000), one may perform very well in three-dimensional as well as in 30000-dimensional spaces without having to switch models. This characteristic is very attractive to the general practitioner.

Notation. Let $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ denote a set of objects/samples/points. A partitioning of these $n$ objects into $k$ clusters can be represented as a set of $k$ sets of objects $\{\mathcal{C}_\ell \mid \ell = 1, \ldots, k\}$ or as a label vector $\lambda \in \mathbb{N}^n$. A clusterer $\Phi$ is a function that delivers a label vector given a tuple of objects. Figure 1 shows the basic setup of the cluster ensemble: A set of $r$ labelings $\lambda^{(1,\ldots,r)}$ is combined into a single labeling $\lambda$ (the consensus labeling) using a consensus function $\Gamma$. Vector/matrix transposition is indicated with a superscript $\dagger$. A superscript in brackets denotes an index and not an exponent.

Organization. In the next section, we formally define the design of a cluster ensemble as an optimization problem and propose an appropriate objective function. In Section 3, we propose and compare three effective and efficient combining schemes, $\Gamma$, to tackle the combinatorial complexity of the problem.
In Section 4 we describe applications of cluster ensembles for the scenarios described above, and show results on both real and artificial data.
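
The two representations of a partitioning introduced in the Notation paragraph (a label vector and a set of clusters) can be made concrete with a short Python sketch. It converts between them, tolerates missing labels, and checks whether two label vectors describe the same partitioning regardless of label names. The helper names, the use of `None` for a missing label, zero-based object indices, and the toy data are all assumptions made for this illustration; the paper does not prescribe an implementation.

```python
def clusters_from_labels(labels):
    """Group object indices by label; objects labeled None (missing) are skipped."""
    clusters = {}
    for index, label in enumerate(labels):
        if label is not None:
            clusters.setdefault(label, set()).add(index)
    return clusters


def labels_from_clusters(clusters, n):
    """Build a label vector of length n from a list of disjoint index sets.

    Clusters are numbered 1..k in list order; uncovered objects stay None.
    """
    labels = [None] * n
    for number, members in enumerate(clusters, start=1):
        for index in members:
            labels[index] = number
    return labels


def same_partitioning(labels_a, labels_b):
    """True if both label vectors induce identical groups, ignoring label names."""
    groups_a = {frozenset(c) for c in clusters_from_labels(labels_a).values()}
    groups_b = {frozenset(c) for c in clusters_from_labels(labels_b).values()}
    return groups_a == groups_b


# r = 3 labelings of n = 6 objects (toy data, indices 0..5 instead of x_1..x_n).
lambda_1 = [1, 1, 2, 2, 3, 3]
lambda_2 = [3, 3, 1, 1, 2, 2]        # same grouping as lambda_1, other label names
lambda_3 = [1, 1, 1, 2, None, 2]     # object 4 was not seen by this clusterer
ensemble = [lambda_1, lambda_2, lambda_3]

print(clusters_from_labels(lambda_1))
print(same_partitioning(lambda_1, lambda_2))
print(same_partitioning(lambda_1, lambda_3))
```

A consensus function in the sense of Figure 1 would take a list such as `ensemble` as its only input and return one further label vector of length n.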