This paper discusses the problem of identifying differentially expressed groups of genes from a microarray experiment. The groups of genes are externally defined, for example, sets of gene pathways derived from biological databases. Our starting point is the interesting Gene Set Enrichment Analysis (GSEA) procedure of Subramanian et al. [Proc. Natl. Acad. Sci. USA 102 (2005) 15545–15550]. We study the problem in some generality and propose two potential improvements to GSEA: the maxmean statistic for summarizing gene-sets, and restandardization for more accurate inferences. We discuss a variety of examples and extensions, including the use of gene-set scores for class predictions. We also describe a new R language package GSA that implements our ideas.

1. Introduction. We discuss the problem of identifying differentially expressed groups of genes from a set of microarray experiments. In the usual situation we have N genes measured on n microarrays, under two different experimental conditions, such as control and treatment. The number of genes N is usually large, say, at least a few thousand, while the number of samples n is smaller, say, a hundred or fewer.

This problem is an example of multiple hypothesis testing with a large number of tests, one that often arises in genomic and proteomic applications, and also in signal processing. We focus mostly on the gene expression problem, but our proposed methods are more widely applicable.

Most approaches start by computing a two-sample t-statistic z_j for each gene. Genes having t-statistics larger than a pre-defined cutoff (in absolute value) are declared significant, and then the family-wise error rate or false discovery rate of the resulting gene list is assessed using the tail area of a null distribution of the statistic. This null distribution is derived from data permutations, or from asymptotic theory.

In an interesting and useful paper, Subramanian et al. (2005) proposed a method called Gene Set Enrichment Analysis (GSEA) for assessing the significance of pre-defined gene-sets, rather than individual genes. The gene-sets can be derived from different sources, for example, the sets of genes representing biological pathways
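
The gene-by-gene screening step described in the introduction (a two-sample t-statistic for each of the N genes, with a null distribution obtained by permuting the n sample labels) can be illustrated with a minimal sketch. This is not the GSA package or the GSEA procedure. The function names `t_statistic`, `gene_t_statistics` and `permutation_p_values` are hypothetical, and the pooled-variance form of the t-statistic and the add-one p-value adjustment are assumptions made here for illustration.

```python
import random
from math import sqrt
from statistics import mean, variance


def t_statistic(x, y):
    """Pooled-variance two-sample t-statistic (assumed form) for y versus x.

    Assumes each group has at least two values and nonzero pooled variance.
    """
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(y) - mean(x)) / sqrt(pooled * (1.0 / nx + 1.0 / ny))


def gene_t_statistics(expr, labels):
    """Compute one t-statistic per gene.

    expr   : list of N rows (genes), each holding n expression values
    labels : list of n class labels, 0 = control, 1 = treatment
    """
    stats = []
    for row in expr:
        control = [v for v, g in zip(row, labels) if g == 0]
        treatment = [v for v, g in zip(row, labels) if g == 1]
        stats.append(t_statistic(control, treatment))
    return stats


def permutation_p_values(expr, labels, n_perm=200, seed=0):
    """Two-sided permutation p-value for each gene.

    The sample labels are shuffled n_perm times; for each gene we count how
    often the permuted |t| is at least as large as the observed |t|.
    The (count + 1) / (n_perm + 1) adjustment is an illustrative choice.
    """
    rng = random.Random(seed)
    observed = gene_t_statistics(expr, labels)
    exceed = [0] * len(expr)
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)
        null = gene_t_statistics(expr, perm)
        for j, (z, z0) in enumerate(zip(observed, null)):
            if abs(z0) >= abs(z):
                exceed[j] += 1
    p_values = [(c + 1) / (n_perm + 1) for c in exceed]
    return observed, p_values
```

Calling `permutation_p_values(expr, labels)` returns the observed statistics z_j together with one p-value per gene; a cutoff on either would then define the list of genes declared significant. Gene-set methods such as GSEA start from the same per-gene statistics but summarize them over each pre-defined set rather than thresholding genes one at a time.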
Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1). https://doi.org/10.1214/07-aoas101