On testing the significance of sets of genes

by Bradley Efron, Robert Tibshirani
Annals of Applied Statistics (2007)

Abstract

This paper discusses the problem of identifying differentially expressed groups of genes from a microarray experiment. The groups of genes are externally defined, for example, sets of gene pathways derived from biological databases. Our starting point is the interesting Gene Set Enrichment Analysis (GSEA) procedure of Subramanian et al. [Proc. Natl. Acad. Sci. USA 102 (2005) 15545–15550]. We study the problem in some generality and propose two potential improvements to GSEA: the maxmean statistic for summarizing gene-sets, and restandardization for more accurate inferences. We discuss a variety of examples and extensions, including the use of gene-set scores for class predictions. We also describe a new R language package GSA that implements our ideas.


Available from arxiv.org

arXiv:math/0610667v2 [math.ST] 4 Sep 2007
The Annals of Applied Statistics 2007, Vol. 1, No. 1, 107–129. DOI: 10.1214/07-AOAS101. © Institute of Mathematical Statistics, 2007.

By Bradley Efron (1) and Robert Tibshirani (2), Stanford University

Received October 2006; revised January 2007.
(1) Supported in part by NSF Grant DMS-05-05673 and National Institutes of Health Contract 8RO1 EB002784.
(2) Supported in part by NSF Grant DMS-99-71405 and National Institutes of Health Contract N01-HV-28183.
Key words and phrases. Multiple testing, gene set enrichment, hypothesis testing.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Statistics, 2007, Vol. 1, No. 1, 107–129. This reprint differs from the original in pagination and typographic detail.

1. Introduction. We discuss the problem of identifying differentially expressed groups of genes from a set of microarray experiments. In the usual situation we have N genes measured on n microarrays, under two different experimental conditions, such as control and treatment. The number of genes N is usually large, say, at least a few thousand, while the number of samples n is smaller, say, a hundred or fewer. This problem is an example of multiple hypothesis testing with a large number of tests, one that often arises in genomic and proteomic applications, and also in signal processing. We focus mostly on the gene expression problem, but our proposed methods are more widely applicable.

Most approaches start by computing a two-sample t-statistic z_j for each gene. Genes having t-statistics larger than a pre-defined cutoff (in absolute value) are declared significant, and then the family-wise error rate or false discovery rate of the resulting gene list is assessed by comparing the tail area from a null distribution of the statistic. This null distribution is derived from data permutations, or from asymptotic theory.

In an interesting and useful paper, Subramanian et al. (2005) proposed a method called Gene Set Enrichment Analysis (GSEA) for assessing the significance of pre-defined gene-sets, rather than individual genes. The gene-sets can be derived from different sources, for example, the sets of genes representing biological pathways in the cell, or sets of genes whose DNA sequences are close together on the cell's chromosomes. The idea is that these gene-sets are closely related and, hence, will have similar expression patterns. By borrowing strength across the gene-set, there is potential for increased statistical power. In addition, in comparing study results on the same disease from different labs, one might get more reproducibility from gene-sets than from individual genes, because of biological and technical variability.

The GSEA method works roughly as follows. We begin with a pre-defined collection of gene-sets S_1, S_2, ..., S_K. We compute the t-statistic z_j for all N genes in our data. Let z_k = (z_1, z_2, ..., z_m) be the gene scores for the m genes in gene-set S_k. In GSEA we then compute a gene-set score S_k(z_k) for each gene-set S_k, equal to essentially a signed version of the Kolmogorov–Smirnov statistic between the values {z_j, j ∈ S_k} and their complement {z_j, j ∉ S_k}, with the sign taken positive or negative depending on the direction of shift. The idea is that if some or all of the genes in gene-set S_k have higher (or lower) values of z_j than expected, their summary score S_k should be large. An absolute cutoff value is defined, and values of S_k above (or below) the cutoff are declared significant. The GSEA method then does many permutations of the sample labels and recomputes the statistic on each permuted dataset. This information is used to estimate the false discovery rate of the list of significant gene-sets.
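The steps just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GSEA implementation: the text describes the GSEA score only as "essentially" a signed Kolmogorov–Smirnov statistic, so the plain signed two-sample version below is a simplification, and the function names (t_statistic, gene_scores, signed_ks, permutation_null) and the toy data are hypothetical.

```python
import math
import random


def t_statistic(control, treated):
    """Pooled-variance two-sample t-statistic (treated minus control)."""
    n0, n1 = len(control), len(treated)
    m0, m1 = sum(control) / n0, sum(treated) / n1
    ss = sum((v - m0) ** 2 for v in control) + sum((v - m1) ** 2 for v in treated)
    pooled = ss / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(pooled * (1 / n0 + 1 / n1))


def gene_scores(data, labels):
    """data[j][i] is gene j on array i; labels[i] is 0 (control) or 1 (treatment)."""
    scores = []
    for row in data:
        control = [v for v, g in zip(row, labels) if g == 0]
        treated = [v for v, g in zip(row, labels) if g == 1]
        scores.append(t_statistic(control, treated))
    return scores


def signed_ks(z, gene_set):
    """Largest-magnitude gap between the empirical CDF of the complement and
    that of the gene-set; positive when the gene-set scores are shifted upward.
    Assumption: a plain signed KS gap, not the exact GSEA enrichment score."""
    members = set(gene_set)
    n_in = len(members)
    n_out = len(z) - n_in
    cdf_in = cdf_out = best = 0.0
    for j in sorted(range(len(z)), key=lambda idx: z[idx]):
        if j in members:
            cdf_in += 1 / n_in
        else:
            cdf_out += 1 / n_out
        gap = cdf_out - cdf_in
        if abs(gap) > abs(best):
            best = gap
    return best


def permutation_null(data, labels, gene_set, n_perm, rng):
    """Recompute the gene-set score after shuffling the sample labels n_perm times."""
    null = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        null.append(signed_ks(gene_scores(data, shuffled), gene_set))
    return null


# Toy usage: 40 genes, 5 + 5 arrays, genes 0-4 raised by 3 units under treatment.
rng = random.Random(1)
labels = [0] * 5 + [1] * 5
data = [[rng.gauss(0, 1) + (3.0 if j < 5 and g == 1 else 0.0) for g in labels]
        for j in range(40)]
observed = signed_ks(gene_scores(data, labels), range(5))
null = permutation_null(data, labels, range(5), 100, rng)
print(observed, max(abs(s) for s in null))
```

Comparing the observed score with the spread of the permutation scores is the informal version of the tail-area comparison described in the text; turning it into a false discovery rate over many gene-sets requires additional steps not shown here.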

The Bioconductor package limma offers an analysis option similar to GSEA, but uses instead the simple average of the scores z_k [see Smyth (2004)], randomizing over the set of genes rather than the set of samples, which we call "row randomization" here. Other related ideas may be found in Pavlidis et al. (2002) and Rahnenführer et al. (2004). Nobel and Wright (2005) propose the "SAFE" methodology, a quite general permutation-based approach to the enrichment testing problem. Newton et al. (2006) propose random set scoring methods for assessing the significance of gene-set enrichment. Zahn et al. (2006) propose an alternative to GSEA that uses a Van der Waerden statistic in place of the Kolmogorov–Smirnov statistic and bootstrap sampling of the arrays instead of a permutation distribution. Other papers that address the problem of testing for differentially expressed sets of genes include Szabo et al. (2003), Frisina et al. (2004), Lu et al. (2005) and Dettling et al. (2005). One of our goals here is to make explicit the choices involved between the various randomization schemes.
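To make the contrast with sample-label permutation concrete, here is a minimal Python sketch of row randomization applied to a simple-average gene-set score. It is an illustration only and does not reproduce limma or any other package: the helper names (mean_score, row_randomization_null, upper_tail_p), the toy scores, and the "+1" convention in the tail fraction are all assumptions made for this example.

```python
import random


def mean_score(values):
    """Simple average of the gene scores in a gene-set."""
    return sum(values) / len(values)


def row_randomization_null(z, set_size, score, n_rand, rng):
    """Score n_rand random gene-sets of set_size genes, drawn without
    replacement from the full list of gene scores z. The sample labels
    (and hence z itself) stay fixed; only set membership is randomized."""
    return [score(rng.sample(z, set_size)) for _ in range(n_rand)]


def upper_tail_p(observed, null):
    """Fraction of null scores at least as large as the observed score.
    The +1 in numerator and denominator is an assumed convention that
    keeps the estimate away from exactly zero."""
    return (1 + sum(s >= observed for s in null)) / (1 + len(null))


# Toy usage: 1000 null gene scores, gene-set = first 20 genes.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(1000)]
observed = mean_score(z[:20])
null = row_randomization_null(z, 20, mean_score, 200, rng)
print(upper_tail_p(observed, null))
```

Row randomization asks whether a gene-set looks unusual relative to other sets of the same size drawn from the same experiment, whereas permuting sample labels asks whether its score is unusual relative to data with no treatment effect at all.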
In studying the GSEA work, we have found some shortcomings and ways it could be improved. GSEA's dependence on Kolmogorov–Smirnov statistics is a reasonable choice, but not a necessary one. This paper puts the GSEA procedure in a more theoretical framework that allows us to investigate questions of efficiency for gene-set inference; a new procedure based on the "maxmean" statistic is suggested that has superior power characteristics versus familiar location/scale alternatives.

Here are two simulated data examples that illustrate some of the main issues, and allow us to introduce our proposed solution. We generated data on 1000 genes and 50 samples, with each consecutive nonoverlapping block of 20 genes considered to be a gene-set. The first 25 samples are the control group, and the second 25 samples are the treatment group. First we generated each data value as i.i.d. N(0,1). Then the constant 2.5 was added to the first 10 genes in the treatment group. Thus, half of the first gene-set (first block of 20 genes) has a higher average expression in the treatment group, while all other gene-sets have no average difference in the two groups. The left panel of Figure 1 shows a histogram (black lines) of the GSEA scores for the 50 gene-sets. The first gene-set clearly stands out, with a value of about 0.9. We did 200 permutations of the control-treatment labels, producing the dashed histogram in the top left panel of Figure 1. The first gene-set stands out on the right side of the histogram. So the GSEA method has performed reasonably well in this example.

In the paper we study alternative summary statistics for gene-sets. Our favorite is something we call the "maxmean statistic": we compute the average of the positive parts of each z_i in S, and also the negative parts, and choose the one that is larger in absolute value. The results for maxmean in this example are shown in the right panel of Figure 1.
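
The maxmean definition and the first simulated example can be sketched in Python as follows. This is a hedged illustration, not the GSA package: the function names are hypothetical, the choice to return a signed value is an assumption (the text says only that the part larger in absolute value is chosen), and the gene scores are pooled-variance t-statistics computed from freshly simulated data rather than the authors' own draws. Both averages are taken over all m genes in the set, with the positive part of a negative score counted as zero and vice versa.

```python
import math
import random


def maxmean(z):
    """Average the positive parts and the negative parts of the scores over
    all genes in the set, and keep whichever is larger in absolute value.
    Assumption: the result carries the sign of the winning part."""
    m = len(z)
    pos = sum(v for v in z if v > 0) / m
    neg = sum(-v for v in z if v < 0) / m
    return pos if pos >= neg else -neg


def t_stat(control, treated):
    """Pooled-variance two-sample t-statistic (treated minus control)."""
    n0, n1 = len(control), len(treated)
    m0, m1 = sum(control) / n0, sum(treated) / n1
    ss = sum((v - m0) ** 2 for v in control) + sum((v - m1) ** 2 for v in treated)
    pooled = ss / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(pooled * (1 / n0 + 1 / n1))


def simulate_scores(rng, n_genes=1000, n_per_group=25, set_size=20,
                    n_shifted=10, shift=2.5):
    """Simulate i.i.d. N(0,1) data, add `shift` to the first `n_shifted` genes
    in the treatment group, and return one maxmean score per consecutive
    block of `set_size` genes."""
    z = []
    for j in range(n_genes):
        bump = shift if j < n_shifted else 0.0
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(0, 1) + bump for _ in range(n_per_group)]
        z.append(t_stat(control, treated))
    return [maxmean(z[k:k + set_size]) for k in range(0, n_genes, set_size)]


scores = simulate_scores(random.Random(0))
top = max(range(len(scores)), key=lambda k: abs(scores[k]))
print(len(scores), top, scores[top])
```

Because only half of the first gene-set is shifted, a plain average of its scores would be diluted by the ten unshifted genes in the same way, but a set with equal numbers of up- and down-regulated genes would average to roughly zero; taking the larger of the two one-sided averages is what lets maxmean respond to shifts in either direction.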
The first gene-set stands out more clearly than it does in the left panel. In this paper we show by both analytic calculations and simulations that the maxmean statistic is generally more powerful than GSEA.

Now consider a different problem. We generated data exactly as before, except that the first 10 genes in every gene-set are 2.5 units higher in the treatment group. The top left panel of Figure 2 shows a histogram of the maxmean scores for the 50 gene-sets, and a histogram of the scores from 200 permutations of the sample labels (dashed). All of the scores look significantly large compared to the permutation values. But given the way that the data were generated, there seems to be nothing special about any one gene-set. To quantify this, we "row randomized" the 1000 genes, leaving the sample labels as is. The first 20 genes in the scrambled set became the first gene-set, the second 20 genes became the second gene-set, and so on. We did this 200 times, recomputing the maxmean statistic on each scrambled set. The results are shown in the bottom left panel of Figure 2. None of
