GSEA-P: a desktop application for...
The MSigDB contains over 3000 gene sets of different types: (i) sets representing genes in the same chromosome or cytogenetic band, (ii) gene sets representing metabolic and signaling pathways from eight publicly available, manually curated pathway databases, (iii) genes reported in the literature as coexpressed in response to genetic or chemical perturbations, (iv) genes sharing conserved upstream regulatory motifs and (v) sets of genes in expression neighborhoods of cancer-related genes. Users may use this resource or define their own gene sets relevant to the process or phenotype they are investigating.

Version 1.0 of the GSEA-P software and MSigDB were originally released in spring 2005. There are currently over 3500 registered users. Version 2.0 of both the software and the database represents a substantial enhancement of the features, interface and content, which we describe below.

2 FEATURES

Version 2.0 of the GSEA-P Java desktop software contains a complete implementation of the GSEA methodology, including leading edge analysis, as well as several usability improvements based on user feedback. New features include a gene set browser to search, download and map gene sets from the MSigDB database. We have also developed a website with comprehensive software documentation and Gene Set Cards with annotations including the source and biological relevance of MSigDB gene sets. A complete list of the new and improved features is in Supplementary Table 1.

2.1 Enrichment analysis

In enrichment analysis, a user seeks to determine whether the members of a gene set are over-represented at the top (or bottom) of a ranked list of markers that have been ordered by their correlation with a specified phenotype. This functionality is central to the GSEA-P 2.0 software and is accessed via the ‘Run GSEA page’. Users select a dataset, a phenotype and a gene set collection, and set parameters to run an enrichment analysis.
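To make the notion of over-representation at the top or bottom of a ranked list concrete, the following is a minimal Python sketch of a running-sum enrichment score (ES). It assumes the weighted running-sum statistic described in Subramanian et al. (2005): the sum increases at each gene set member, in proportion to its correlation raised to a power `p`, and decreases by a constant step at each non-member. The function name, argument names and toy data are hypothetical, and this is an illustration rather than the GSEA-P Java implementation; it omits permutation testing and normalization.

```python
def running_enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (es, running) for one gene set.

    ranked_genes: gene identifiers ordered by correlation with the phenotype.
    scores: correlation value for each gene, in the same order.
    gene_set: iterable of gene identifiers.
    p: weighting exponent (assumed default of 1.0).
    """
    members = set(gene_set)
    hits = [gene in members for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_total = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running = []
    current = 0.0
    for score, is_hit in zip(scores, hits):
        if is_hit:
            # Step up, weighted by the gene's correlation.
            current += abs(score) ** p / hit_total
        else:
            # Step down by a constant amount for each non-member.
            current -= 1.0 / n_miss
        running.append(current)

    # The ES is the maximum deviation of the running sum from zero.
    es = max(running, key=abs)
    return es, running


# Toy example with hypothetical genes and correlations.
genes = ["A", "B", "C", "D", "E", "F"]
corr = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es, running = running_enrichment_score(genes, corr, {"A", "B"})
```

In this toy example the two set members sit at the top of the list, so the running sum peaks at the second position and then decays back to zero; a set concentrated at the bottom of the list would instead give a negative ES.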
We have improved this interface by enabling conversion of the dataset and gene sets to the same identifier format (i.e. gene symbols) before running the analysis (see ‘Chip2Chip’ description below and Supplementary Fig. 1). To address the need for alternative or specialized gene ranking procedures, we now provide a means within the GSEA-P software to run the analysis on a user-provided ranked gene list. Importantly, in this new release, enrichment results are saved to an XML-formatted local database and hence are available for downstream analysis with other GSEA components (see ‘Leading edge analysis’ below) and for integration with other software programs.

2.2 Enrichment reports

GSEA-P 2.0 produces richly annotated HTML reports of enrichment results. In addition to statistical details such as the ES, P-value and FDR, we now provide a link to gene set annotations at the MSigDB website. These annotations allow users to view the full details of the provenance and content of a gene set in a structure similar to that of the GeneCards resource (Rebhan et al., 1997). The GSEA report also contains improved enrichment plots (Supplementary Fig. 2).

2.3 Leading edge analysis

After an enrichment analysis has been performed, it is often useful to examine and compare the genes in high-scoring sets that occur before the maximum of the running ES. These genes can be thought of as the core of a gene set that drives the enrichment signal. By grouping leading edge subsets, high-scoring gene sets can often be categorized into similar and distinct biological processes. To facilitate leading edge analysis, GSEA-P 2.0 provides an interactive viewer that can be run after a GSEA process completes. The user selects gene sets for leading edge analysis, after which the program: (1) computes the core matrix over all selected gene sets, (2) clusters this matrix and (3) visualizes the result in a heat map (Supplementary Fig. 3A).
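As an illustration of step (1), the sketch below extracts a leading edge subset from a precomputed running ES and assembles a binary gene-by-set membership matrix; it also includes a Jaccard coefficient helper as one way to quantify the overlap between two subsets. All function names and the toy data are hypothetical. Treating genes after the trough as the leading edge for a negative ES is an assumption that goes beyond the description above, and clustering and heat map rendering are omitted. This is not the GSEA-P implementation.

```python
def leading_edge(ranked_genes, gene_set, running):
    """Return the gene set members that fall on the leading edge.

    running: running ES value at each position of ranked_genes.
    """
    members = set(gene_set)
    # Position of the maximum deviation of the running sum from zero.
    peak = max(range(len(running)), key=lambda i: abs(running[i]))
    if running[peak] >= 0:
        # Positive ES: members at or before the peak.
        window = ranked_genes[: peak + 1]
    else:
        # Negative ES (assumption): members after the trough.
        window = ranked_genes[peak + 1:]
    return [gene for gene in window if gene in members]


def membership_matrix(leading_edges):
    """Build a 0/1 matrix from {set_name: leading edge genes}.

    Returns (genes, rows), where rows maps each set name to a list of
    0/1 flags aligned with the sorted gene list.
    """
    genes = sorted({g for subset in leading_edges.values() for g in subset})
    rows = {}
    for name, subset in leading_edges.items():
        present = set(subset)
        rows[name] = [1 if g in present else 0 for g in genes]
    return genes, rows


def jaccard(a, b):
    """Jaccard coefficient: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy example with hypothetical genes and running ES values.
ranked = ["A", "B", "C", "D", "E", "F"]
running = [0.6, 1.0, 0.75, 0.5, 0.25, 0.0]
core = leading_edge(ranked, {"A", "B"}, running)
genes, rows = membership_matrix({"set1": core, "set2": ["B", "C"]})
```

The rows of such a matrix could then be passed to any standard hierarchical clustering routine before plotting.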
Additionally, similarities between gene sets can be visualized by the Jaccard coefficient (Supplementary Fig. 3B).

2.4 Batch analysis mode

To support the analysis of a large number of datasets or the integration of GSEA into a data analysis pipeline, GSEA-P 2.0 can run in ‘headless’ mode as part of a shell script or load sharing facility. The analysis performed and the reports produced in this mode are identical to those produced with the graphical user interface.

2.5 Mapping identifiers between platforms with Chip2Chip

Microarray platforms come from a number of manufacturers who use a variety of identifiers to represent gene transcripts. Additionally, cross-species comparisons require ortholog mappings. Several tools such as NetAffx (Liu et al., 2003) provide the ability to map a given list of genes between platforms. However, these programs are often restricted to a particular vendor, or are cumbersome to use when mapping a large collection of gene sets because they are tailored to map a single input list. To address this need, the GSEA-P 2.0 software provides a new utility called Chip2Chip that maps identifiers between platforms. Currently, GSEA-P 2.0 supports mappings between 93 platforms. Chip2Chip can convert between Entrez gene symbols and any of these platforms, or between identifiers for any two of these chip types (Supplementary Fig. 4).

2.6 Integrated gene set browser & query interface

To give users of GSEA-P 2.0 easy access to the substantially enlarged MSigDB 2.0 collection, we have embedded a gene set browser into the software, which lets users quickly search MSigDB for gene sets through a graphical user interface (Supplementary Fig. 5). By providing an integrated program, we enable the seamless interoperation of gene set analytics with the MSigDB gene sets database.

2.7 Documentation

The website accompanying GSEA-P 2.0 includes extensive documentation: a user guide describing all aspects of the
software, an illustrated tutorial, a frequently asked questions section, as well as four examples of GSEA analysis and results. The documentation is packaged into a GSEA Wiki site which will grow over time.

ACKNOWLEDGEMENTS

The authors wish to thank members of the Cancer Program at the Broad Institute for suggestions. They also thank Jide software for a free license to their component suite.

Conflict of Interest: none declared.

REFERENCES

Bourquin,J.P. et al. (2006) Identification of distinct molecular phenotypes in acute megakaryoblastic leukemia by gene expression profiling. Proc. Natl Acad. Sci. USA, 103, 3339–3344.

Liu,G. et al. (2003) NetAffx: Affymetrix probesets and annotations. Nucleic Acids Res., 31, 82–86.

Mootha,V.K. et al. (2003) PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet., 34, 267–273.

Rebhan,M. et al. (1997) GeneCards: Encyclopedia for Genes, Proteins and Diseases, Weizmann Institute of Science, Bioinformatics Unit and Genome Center. Trends in Genetics, 13, 163.

Subramanian,A. et al. (2005) From the Cover: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA, 102, 15545–15550.

Sweet-Cordero,A. et al. (2005) An oncogenic KRAS2 expression signature identified by cross-species gene-expression analysis. Nat. Genet., 37, 48–55.