FIRMA: a method for detection of alternative splicing from exon array data

by E Purdom, K M Simpson, M D Robinson, J G Conboy, A V Lapuk, T P Speed
Bioinformatics (2008)

Abstract

Motivation: Analyses of EST data show that alternative splicing is much more widespread than once thought. The advent of exon and tiling microarrays means that researchers now have the capacity to experimentally measure alternative splicing on a genome-wide level. New methods are needed to analyze the data from these arrays.

Results: We present a method, finding isoforms using robust multichip analysis (FIRMA), for detecting differential alternative splicing in exon array data. FIRMA has been developed for Affymetrix exon arrays, but could in principle be extended to other exon arrays, tiling arrays or splice junction arrays. We have evaluated the method using simulated data, and have also applied it to two datasets: a panel of 11 human tissues and a set of 10 pairs of matched normal and tumor colon tissue. FIRMA is able to detect exons in several genes confirmed by reverse transcriptase PCR.

Availability: R code implementing our methods is contributed to the package aroma.affymetrix.

Contact: epurdom@stat.berkeley.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


Available from www.pubmedcentral.nih.gov

BIOINFORMATICS ORIGINAL PAPER
Vol. 24 no. 15 2008, pages 1707–1714
doi:10.1093/bioinformatics/btn284

Gene expression

E. Purdom1,*, K. M. Simpson2, M. D. Robinson2,3, J. G. Conboy4, A. V. Lapuk4 and T. P. Speed1,2

1Department of Statistics, University of California at Berkeley, 367 Evans Hall #3860, Berkeley, CA 94720-3860, USA, 2The Walter and Eliza Hall Institute, 1G Royal Parade, Parkville, Victoria, 3050, 3Department of Medical Biology, University of Melbourne, Parkville, Victoria 3010, Australia and 4Life Sciences Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA

Received on February 25, 2008; revised on May 18, 2008; accepted on June 6, 2008
Advance Access publication June 23, 2008
Associate Editor: David Rocke
*To whom correspondence should be addressed.

1 INTRODUCTION

Alternative splicing is thought to have several roles in complex organisms, primarily in increasing protein diversity (Maniatis and Tasic, 2002). It can affect the intracellular localization, binding properties or stability of a protein, or regulate its expression via nonsense-mediated decay (NMD) (Stamm et al., 2005). These events usually occur in a regulated manner, but if an aberrant splicing event occurs, it can be causative for, or symptomatic of, disease. More than 15% of heritable human diseases are known to be associated with mutations in splice sites or in splicing regulatory elements (Matlin et al., 2005). In particular, aberrant pre-mRNA splicing events are known to be implicated in several types of cancer (Brinkman, 2004; Venables, 2004).

Previously thought to be a relatively uncommon phenomenon, alternative splicing has recently been shown to be widespread throughout the genome. Analyses of data on human expressed sequence tags (ESTs) give estimated lower bounds between 35% and 59% for the proportion of genes which have at least one splice variant (Modrek and Lee, 2002). The frequency of functional alternative splicing events is probably lower than this. Several groups have searched for alternative splicing events conserved between human and mouse, and their results suggest that the proportion of functionally alternatively spliced genes is approximately 10% (Sorek et al., 2004; Sugnet et al., 2004; Yeo et al., 2005). A weakness of all EST-based methods is that they are biased towards genes which have greater EST coverage (Modrek and Lee, 2002).

Several kinds of alternative splicing have been observed (see Black, 2003, for a recent review).
The most common form is skipping or inclusion of one or more 'cassette' exons (roughly 40–50% of cases based on bioinformatic evidence; Clark and Thanaraj, 2002; Sugnet et al., 2004), these being exons which are wholly present in some transcripts and wholly absent in others. Alternatively, mutually exclusive cassette exon usage can take place, e.g. exon A or exon B forms part of the transcript, but never A and B together (more generally, multiple exons can exhibit mutual exclusivity). Usage of alternative 3′ or 5′ splice sites can result in shortening or lengthening of an exon. Other types of alternative splicing that have been observed are alternative promoter usage, alternative polyadenylation sites and intron retention. Additionally, any combination of the above may occur in an alternatively spliced transcript (Black, 2003).

Skipping or inclusion of internal cassette exons is the most common kind of alternative splicing, and possibly the easiest to detect and verify. For this reason, we have focused on identifying specific exons showing patterns of differential alternative expression and have not approached the problem of reconstructing more complicated transcript patterns.

Our algorithm FIRMA has been developed for analyzing the Affymetrix exon array (Affymetrix, Santa Clara, California, USA), which queries the expression level of well-annotated as well as predicted exons. In brief, FIRMA scores each exon as to whether its probes systematically deviate from the expected gene expression level. With a small number of probes per exon (four or fewer), this is a challenging microarray platform to analyze: such deviations can come from a myriad of biological and technical factors unrelated to alternative splicing. We show that FIRMA performs well in detecting exon-specific changes in expression and therefore can contribute substantially to the detection of regulated alternative splicing. Of course, a single scoring method can only be one step in the analysis, and any results must be evaluated in the light of these other complications.

© 2008 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

2 MATERIALS

The GeneChip Human Exon 1.0 ST (sense target) array is a whole-genome array, containing over 1.4 million probesets of up to four perfect match (PM) probes each, spread across exons from all known genes, plus a number of additional regions based on other annotation sources, including GENSCAN predictions and ESTs from dbEST. In the design phase, sequences from all the annotation sources were mapped to the July 2003 version of the human genome (UCSC hg16, NCBI 34). Regions which had some evidence from one or more sources for being transcribed were divided into probe selection regions (PSR) according to the presence of canonical splice sites, CDS start and stop positions or polyadenylation sites. Probes were then selected from within PSRs 25 bp in length. Each PSR corresponds to a probeset, which generally contains four possibly overlapping probes (sometimes fewer). About a quarter of the probesets are based solely on EST evidence, while another quarter are based solely on GENSCAN predictions (GeneChip Exon Array Design Technical Note, Affymetrix). The array contains only PM probes, with a small number of generic mismatch probes for the purposes of background correction. There are no probes which span exon–exon junctions.

Association of probesets with genes is not made at design time. Instead, these 'main-design' probesets are annotated afterwards, using their alignment to the genome (Exon Probeset Annotations Whitepaper, Affymetrix).
This process has been undertaken by Affymetrix, first for NCBI Build 34 of the genome, and more recently for Build 35. The result is that each probeset is assigned to a 'transcript cluster', and also has an annotation quality indicator associated with it.

3 ALGORITHM

Our method is developed to evaluate levels of alternative splicing in two situations: where there are neither replicates nor pre-defined groups in the samples, or where alternative splicing does not consistently follow the groupings that exist. The second situation can be quite common, for example in disease versus normal, where alternative splicing may exist in only a proportion of the diseased samples for a given gene. Alternatively, there may be groupings, such as tissue type, where patterns of alternative splicing are shared amongst several tissue types, but the tissue types that share a similar splicing pattern may be different in different genes. For this reason, our algorithm is sample-by-exon specific: each exon and sample pairing is given a score that is comparable across samples, genes or exons.

3.1 Alternative splicing detection method

The two major steps in our method are estimation of the expression levels of each gene, using the robust multichip analysis (RMA) approach (Irizarry et al., 2003), and detection of alternative splicing using a suitably defined score from auxiliary information from the estimation step. We call the combined approach finding isoforms using robust multichip analysis (FIRMA).

RMA itself involves three steps: background correction, normalization and summarization of the probe-level data (Irizarry et al., 2003). The following discussion assumes that the first two steps have already been performed. We extract the normalized probe-level data for the probes belonging to a transcript cluster. The exact set of probesets which are used may depend on the aim of the experiment.
If this is detection of novel exons, then all probesets might be used (though note the problems with non-expressed regions mentioned in the Discussion). If alternative splicing of well-annotated exons is of more interest, then the analysis might be restricted to well-annotated probesets. In the development below, we will refer to 'exons', though in fact the analysis is done on probesets, which usually coincide with an exon.

The final step in RMA is to estimate the gene expression level of each sample by fitting the following additive model for each gene:

$$\log_2(PM_{ik}) = c_i + p_k + \varepsilon_{ik}, \qquad (1)$$

where $c_i$ is the chip effect (expression level) for chip $i$, $p_k$ is the probe effect for probe $k$ (which can be interpreted as a relative probe affinity if we use the constraint $\sum_k p_k = 0$), and $\log_2(PM_{ik})$ is the log (base 2) of the background-corrected, normalized PM signal for probe $k$ on chip $i$ (Irizarry et al., 2003). The model is fitted using iteratively reweighted least squares (IRLS) (Marazzi, 1993).

For the exon array, we can consider a more general additive model which includes the possibility of alternative splicing or different levels of expression per exon,

$$\log_2(PM_{ijk(j)}) = c_i + e_j + d_{ij} + p_{k(j)} + \varepsilon_{ijk(j)}, \qquad (2)$$

where (again assuming a zero-sum constraint for these parameters) $e_j$ is the relative change in exon expression for exon $j$, $d_{ij}$ is the interaction between chip and exon giving the relative change for sample $i$ in exon $j$, and $p_{k(j)}$ is the nested relative probe effect for the $k$-th probe in exon $j$. The parameter $d_{ij}$ indicates the discrepancy of a given sample in exon $j$ from the expected expression for that exon. It is large values of this parameter that indicate differential alternative splicing. Rather than fit this extended model and estimate $d_{ij}$ explicitly, we propose to fit the standard RMA model in (1) for the exon array.
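As an illustration of what fitting model (1) by IRLS can look like, the following is a minimal Python sketch and not the aroma.affymetrix implementation. It assumes Huber weights with tuning constant 1.345 and a MAD-based scale estimate, neither of which is specified in the text above; the function name `fit_additive_irls` and the simulated data are hypothetical.

```python
import numpy as np

def fit_additive_irls(y, k_huber=1.345, n_iter=25):
    """Robustly fit y[i, k] = c[i] + p[k] + error, with sum(p) = 0.

    y is a (chips x probes) matrix of log2 PM values. Huber weights and
    the MAD scale are assumptions of this sketch, not taken from the paper.
    """
    n_chips, n_probes = y.shape
    n_obs = y.size
    # Row-major flattening: observation n belongs to chip n // n_probes
    # and probe n % n_probes.
    chip_idx, probe_idx = np.divmod(np.arange(n_obs), n_probes)
    X = np.zeros((n_obs, n_chips + n_probes))
    X[np.arange(n_obs), chip_idx] = 1.0
    X[np.arange(n_obs), n_chips + probe_idx] = 1.0
    yv = y.ravel()
    w = np.ones(n_obs)
    for _ in range(n_iter):
        sw = np.sqrt(w)
        # Weighted least squares; lstsq copes with the rank-deficient design.
        beta, *_ = np.linalg.lstsq(X * sw[:, None], yv * sw, rcond=None)
        r = yv - X @ beta
        scale = np.median(np.abs(r)) / 0.6745
        if scale == 0:
            break
        u = np.abs(r) / scale
        # Huber weights: 1 for small residuals, k/u for large ones.
        w = k_huber / np.maximum(u, k_huber)
    c, p = beta[:n_chips], beta[n_chips:]
    # Impose the zero-sum constraint on the probe effects.
    shift = p.mean()
    c, p = c + shift, p - shift
    resid = y - c[:, None] - p[None, :]
    return c, p, resid

# Simulated gene: 6 chips, 8 probes, one probe strongly reduced on chip 2.
rng = np.random.default_rng(0)
true_c = np.array([8.0, 8.5, 9.0, 7.5, 8.2, 8.8])
true_p = np.array([-0.6, 0.1, 0.4, -0.2, 0.3, 0.5, -0.4, -0.1])
y = true_c[:, None] + true_p[None, :] + rng.normal(0, 0.05, (6, 8))
y[2, 3] -= 3.0
c_hat, p_hat, resid = fit_additive_irls(y)
print(np.round(resid[2], 2))
```

Because the fit is robust, the perturbed observation has little influence on the estimated chip and probe effects and shows up instead as a large negative residual.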
If there is a large discrepancy in some samples (a large $d_{ij}$) then we will see this as large residuals for the probes for that sample in that exon. In this way, we frame the problem of detecting alternative splicing as a problem of outlier detection, rather than estimation of an interaction effect. By robustly fitting without the term $d_{ij}$ we avoid the additional noise that would be added to all of our parameter estimates, since there are at most four observations to estimate this term. We do assume, however, limited levels of alternative splicing so that the other terms in the model are still well estimated with our robust estimation procedure even though $d_{ij}$ is excluded from the model.

Based on this logic, let

$$r_{ijk} = y_{ijk} - \hat{c}_i - \hat{p}_k$$

be the residuals from fitting the standard model in Equation (1), where $y_{ijk}$ denotes the observed $\log_2$ PM value. Then for each exon $j$ and sample $i$, a summary score based on the four residuals from exon $j$ and sample $i$ gives a measure of the discrepancy $d_{ij}$ in the expression of the exon in that sample. Any number of scoring functions could be used. The most obvious choice is the mean. More robust alternatives would be the median residual, the lower quartile or even the smallest of the absolute residuals. We considered these various options in scoring. Ultimately, we determined that the median of the residuals in an
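The candidate summaries listed above can be sketched in Python as follows. This is an illustration only: it computes raw per-sample, per-exon summaries of a residual matrix and is not the final FIRMA score, whose definition is cut off in this excerpt. The function name `exon_scores`, the exon labels and the simulated residuals are hypothetical.

```python
import numpy as np

def exon_scores(resid, exon_of_probe, how="median"):
    """Summarize residuals per sample and exon.

    resid: (samples x probes) residuals from the fit of model (1).
    exon_of_probe: length-probes array giving the exon label of each probe.
    how: one of the candidate summaries named in the text.
    """
    summaries = {
        "mean": lambda r: r.mean(axis=1),
        "median": lambda r: np.median(r, axis=1),
        "lower_quartile": lambda r: np.percentile(r, 25, axis=1),
        "min_abs": lambda r: np.abs(r).min(axis=1),
    }
    summarize = summaries[how]
    exon_of_probe = np.asarray(exon_of_probe)
    exons = np.unique(exon_of_probe)
    # One column of scores per exon, one row per sample.
    scores = np.column_stack(
        [summarize(resid[:, exon_of_probe == e]) for e in exons]
    )
    return exons, scores

# Simulated residuals: 5 samples, 3 exons with 4 probes each.
rng = np.random.default_rng(1)
resid = rng.normal(0, 0.1, (5, 12))
resid[1, 4:8] -= 2.0  # all probes of exon "e2" are low in sample 1
labels = np.repeat(["e1", "e2", "e3"], 4)
exons, scores = exon_scores(resid, labels, how="median")
print(exons)
print(np.round(scores, 2))
```

With only four probes per exon, the median is much less sensitive than the mean to a single aberrant probe, which is one reason a robust summary is attractive here.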
