arXiv:physics/0004009v1 [physics.bio-ph] 4 Apr 2000

Coupled Two-Way Clustering Analysis of Gene Microarray Data

G. Getz, E. Levine and E. Domany
Department of Physics of Complex Systems, Weizmann Inst. of Science, Rehovot 76100, Israel

February 2, 2008

Abstract

We present a novel coupled two-way clustering approach to gene microarray data analysis. The main idea is to identify subsets of the genes and samples, such that when one of these is used to cluster the other, stable and significant partitions emerge. The search for such subsets is a computationally complex task: we present an algorithm, based on iterative clustering, which performs such a search. This analysis is especially suitable for gene microarray data, where the contributions of a variety of biological mechanisms to the gene expression levels are entangled in a large body of experimental data. The method was applied to two gene microarray data sets, on colon cancer and leukemia. By identifying relevant subsets of the data and focusing on them we were able to discover partitions and correlations that were masked and hidden when the full dataset was used in the analysis. Some of these partitions have a clear biological interpretation; others can serve to identify possible directions for future research.

Introduction

In a typical DNA microarray experiment, expression levels of thousands of genes are recorded over a few tens of different samples¹ [1, 3, 4]. Hence this new technology has given rise to a new computational challenge: to make sense of such massive expression data [5, 6, 7]. The sizes of the datasets and their complexity call for multivariate clustering techniques [8, 9], which are essential for extracting correlated patterns and the natural classes present in a set of N data points, or objects, represented as points in the multidimensional space defined by D measured features.

Gene microarray data are fairly special in that it makes good sense to perform clustering analysis in two ways [1, 2].
The first views the n_s samples as the N = n_s objects to be clustered, with the n_g genes' levels of expression in a particular sample playing the role of the features, representing that sample as a point in a D = n_g dimensional space. The different phases of a cellular process emerge from grouping together samples with similar or related expression profiles. The other, no less natural way looks for clusters of genes that act correlatively on the different samples. This view considers the N = n_g genes as the objects to be clustered, each represented by its expression profile, as measured over all the samples, as a point in a D = n_s dimensional space.

Whereas in previous work [1, 2, 10] the samples and genes were clustered completely independently, we introduce and perform here a coupled two-way clustering (CTWC) analysis. Our philosophy is to narrow down both the features that we use and the data points that are clustered. We believe that only a small subset of the genes participate in any cellular process of interest, which takes place only in a subset of the samples; by focusing on small subsets, we lower the noise induced by the other samples and genes. We look for pairs of a relatively small subset F of features (either genes or samples) and a subset O of objects (samples or genes), such that when the set O is clustered using the features F, stable and significant partitions are obtained. Finding such pairs of subsets is a rather complex mathematical problem; the CTWC method produces such pairs in an iterative clustering process.

CTWC can be performed with any clustering algorithm. We tested it in conjunction with several clustering methods, but present here only results that were obtained using the super-paramagnetic clustering algorithm (SPC) [16, 11, 12], which is especially suitable for gene microarray data analysis due to its robustness against noise and its "natural" ability to identify stable clusters.
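To make the two views and the (O, F) pairs concrete, here is a minimal NumPy sketch. It assumes the expression data are stored in an array with one row per gene and one column per sample; the function name `object_feature_matrix`, its arguments and the toy values are illustrative assumptions, not part of the original method.

```python
import numpy as np

def object_feature_matrix(A, gene_idx, sample_idx, objects="samples"):
    """Build the data matrix for one (O, F) pair: one row per object,
    one column per feature.

    A          : expression matrix, shape (n_genes, n_samples)  [assumed layout]
    gene_idx   : indices of the chosen gene subset
    sample_idx : indices of the chosen sample subset
    objects    : "samples" -> samples are the objects, genes the features
                 "genes"   -> genes are the objects, samples the features
    """
    sub = A[np.ix_(gene_idx, sample_idx)]  # genes x samples sub-matrix
    if objects == "samples":
        return sub.T                       # rows = samples, columns = genes
    if objects == "genes":
        return sub                         # rows = genes, columns = samples
    raise ValueError("objects must be 'samples' or 'genes'")

# Toy data: 4 genes (rows) measured over 3 samples (columns).
A_toy = np.arange(12).reshape(4, 3)

# Samples 1 and 2 as objects, described by genes 0 and 2.
X_samples = object_feature_matrix(A_toy, [0, 2], [1, 2], objects="samples")

# Genes 0 and 2 as objects, described by samples 1 and 2.
X_genes = object_feature_matrix(A_toy, [0, 2], [1, 2], objects="genes")
```

Either matrix can then be handed to whatever clustering routine is in use; switching between the two views amounts to a transpose of the same sub-matrix.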
The CTWC clustering scheme was applied to two gene microarray data sets, one from a colon cancer experiment [1] and the other from a leukemia experiment [3]. From both datasets we were able to "mine" new partitions and correlations that had not been obtained in an unsupervised fashion by previously used methods. Some of these new partitions have a clear, well-understood biological interpretation. We do not report here discoveries of biologically relevant, previously unknown results. The main point of our message is twofold: (a) we were able to identify biologically relevant partitions in an unsupervised way and (b) other, no less natural new partitions were also found, which may contain new, important information and for which one should seek biological interpretation.

¹ By "sample" we refer to any kind of living matter that is being tested, e.g. different tissues [1], cell populations collected at different times [2], etc.

Coupled Two Way Clustering

Motivation and Algorithm

The results of every gene microarray experiment can be summarized as a set of numbers, which we organize in an expression level matrix A. A row of this matrix corresponds to a single gene, while each column represents a particular sample. Our normalization is described in detail later.

In a typical experiment, the expression levels of thousands of genes are measured simultaneously. Gene expression is influenced by the cell type, cell phase, external signals and more [13]. The expression level matrix is therefore the result of all these processes mixed together. Our goal is to separate and identify these processes and to extract as much information as possible about them. The main point is that each biological process on which we wish to focus may involve a relatively small subset of the genes that are present on a microarray; the large majority of the genes constitute a noisy background which may mask the effect of the small subset. The same may happen with respect to samples.

The CTWC procedure which we now describe is designed to identify subsets of genes and samples, such that a single process is the main contributor to the expression of the gene subset over the sample subset. We start by clustering the samples and the genes of the full data set and identify all stable clusters of either samples or genes. We scan these clusters one by one. The expression levels of the genes of each cluster are used as the feature set F to represent object sets. The different object sets O contain either all the samples or any sample cluster. Similarly, we scan all stable clusters of samples and use them as the feature set F to identify stable clusters of genes. We keep track of all the stable clusters that are generated, of both genes, denoted as v^g, and samples, denoted as v^s.
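The scan described above, together with the bookkeeping and stopping rule given in the text that follows, can be sketched in Python as below. This is only a sketch under stated assumptions: `toy_stable_clusters` is a hypothetical stand-in for a real clustering method (the paper uses SPC, which is not implemented here), and the names `ctwc`, `history`, `tried` and `max_iter` are illustrative choices rather than the authors' definitions.

```python
import numpy as np

def toy_stable_clusters(X, min_gap=1.0, min_size=2):
    """Hypothetical stand-in for a real clustering method such as SPC.

    Scores each object (row of X) by its mean feature value and splits the
    objects at the largest gap between sorted scores, provided that gap is
    at least `min_gap`. Returns a list of arrays of row indices.
    """
    score = X.mean(axis=1)
    order = np.argsort(score)
    gaps = np.diff(score[order])
    if gaps.size == 0 or gaps.max() < min_gap:
        return []
    cut = int(np.argmax(gaps)) + 1
    return [p for p in (order[:cut], order[cut:]) if p.size >= min_size]

def ctwc(A, find_clusters=toy_stable_clusters, max_iter=10):
    """Sketch of the coupled two-way loop on a genes x samples matrix A."""
    n_g, n_s = A.shape
    V_g = [frozenset(range(n_g))]   # gene clusters, starting with all genes
    V_s = [frozenset(range(n_s))]   # sample clusters, starting with all samples
    history = []                    # (object kind, object set, feature set, cluster)
    tried = set()                   # (kind, objects, features) already clustered

    for _ in range(max_iter):
        new_found = False
        # Snapshots: clusters found in this pass are used in the next one.
        passes = (("genes", list(V_g), list(V_s)),
                  ("samples", list(V_s), list(V_g)))
        for kind, object_sets, feature_sets in passes:
            for O in object_sets:
                for F in feature_sets:
                    if (kind, O, F) in tried:
                        continue
                    tried.add((kind, O, F))
                    o, f = sorted(O), sorted(F)
                    # Rows of X are objects, columns are features.
                    X = A[np.ix_(o, f)] if kind == "genes" else A[np.ix_(f, o)].T
                    target = V_g if kind == "genes" else V_s
                    for rows in find_clusters(X):
                        c = frozenset(o[i] for i in rows)
                        if c not in target:
                            target.append(c)
                            history.append((kind, O, F, c))
                            new_found = True
        if not new_found:           # stop when no new clusters are generated
            break
    return V_g, V_s, history

# Toy data: genes 0-1 respond only in samples 2-3; genes 2-3 are flat.
A_toy = np.array([[0, 0, 5, 5],
                  [0, 0, 5, 5],
                  [3, 3, 3, 3],
                  [3, 3, 3, 3]], dtype=float)
V_g, V_s, history = ctwc(A_toy)
```

With this toy clusterer, the two gene groups are not separated when all four samples serve as features, but they are recovered once one of the sample clusters found in the first pass is used as the feature set, which is the masking effect the text describes. The `history` list plays the role of the saved pointers from each cluster back to the object and feature sets that produced it.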
The gene clusters are accumulated in a list V^g and the sample clusters in a list V^s. Furthermore, we keep the entire chain of clustering analyses that have been performed (which subset was used as objects, which subset was used as features, and which stable clusters were identified).

When new clusters are found, we use them in the next iteration. At each iteration step we cluster a subset of the objects (either samples or genes) using a subset of the features (genes or samples). The procedure stops when no new relevant information is generated. The output of the CTWC algorithm consists of the final sets V^g and V^s and the pointers that identify how all stable clusters of genes and samples were generated. A precise, step-by-step definition of the algorithm is given in Fig. 1.

Analyzing the clusters obtained by CTWC

The output of CTWC has two important components. First, it provides a broad list of gene and sample clusters. Second, for each cluster (of samples, say) we know which subset (of samples) was clustered to find it, and which features (genes) were used to represent it. We also know, for every cluster C, which other clusters can be identified by using C as the feature set. We present here a brief selection of the possible ways one can utilize this kind of information. Implementations of the particular uses listed here are described in the Applications section.

Identifying genes that partition the samples according to a known classification. This particular application is supervised. Denote by C a known classification of the samples, say into two classes, c_1 and c_2. CTWC provides an easy way to rank the clusters of genes in V^g by their ability to separate the samples according to C. It should be noted that CTWC not only provides a list of candidate gene clusters one should check, but also a unique method of testing them.
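The test rests on two overlap scores, purity and efficiency, whose definitions are given in the text that follows: purity is the fraction of a sample cluster's members that belong to a known class, and efficiency is the fraction of that class captured by the cluster. A minimal Python sketch, assuming clusters and classes are represented as collections of sample indices (the function name and the toy indices are illustrative, not from the paper):

```python
def purity_efficiency(cluster, known_class):
    """Return (purity, efficiency) of a sample cluster with respect to a known class.

    purity     = |cluster ∩ class| / |cluster|
    efficiency = |cluster ∩ class| / |class|
    """
    cluster, known_class = set(cluster), set(known_class)
    if not cluster or not known_class:
        raise ValueError("cluster and known_class must be non-empty")
    overlap = len(cluster & known_class)
    return overlap / len(cluster), overlap / len(known_class)

# Toy example: a cluster of four samples, three of which belong to a
# known class that has five members in total.
p, e = purity_efficiency({0, 1, 2, 3}, {0, 1, 2, 5, 6})
```

Here three of the four clustered samples belong to the class, so the purity is 3/4, and three of the five class members are captured, so the efficiency is 3/5. A cluster scoring close to 1 on both measures reproduces the known class almost exactly.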
First we evaluate, for each cluster of samples v^s in V^s, two scores, purity and efficiency, which reflect the extent to which assignment of the samples to v^s corresponds to the classification C. These figures of merit are defined (for c_1, say) as

\[
\mathrm{purity}(v^s \mid c_1) = \frac{|v^s \cap c_1|}{|v^s|}, \qquad
\mathrm{efficiency}(v^s \mid c_1) = \frac{|v^s \cap c_1|}{|c_1|}.
\]

Once a cluster v^s with high purity and efficiency has been found, we can use the saved pointers to read off the cluster (or clusters) of genes that were used as the feature set to yield v^s in our clustering procedure. Clustering, as opposed to classification, discovers only those partitions of the data which are, in some sense, "natural". Hence by this method we identify the most natural group of genes that can be used to induce a desired classification.

Needless to say, one can also test a gene cluster v^g that was provided by CTWC using more standard statistics, such as the t-test [14] or the Jensen-Shannon distance [15]. Both compare the expression levels of the genes of v^g on the two groups of samples, c_1 and c_2, partitioned according to C. Alternatively, one can also use the genes of v^g to train a classifier to separate the samples according to C [3], and use the success of the classifier to measure whether the expression levels of the genes in v^g do or do not correspond to the classification.

Discovering new partitions. Every cluster v^s of V^s is a subset of all the samples, the members of which have been linked to each other and separated from the other samples on the basis of the expression levels of some co-expressed subset of genes. It is therefore reasonable to argue that the cluster v^s has been formed for some biological or experimental reason.

As a first step towards understanding the reason for the formation of a robust cluster v^s, one should try to relate it to some previously known classification (for example, in terms of purity and efficiency). Clusters which cannot be