Improving RNA-Seq expression estimates by correcting for fragment bias
Genome Biology (2011)
- DOI: 10.1186/gb-2011-12-3-r22
- PubMed: 21410973
Available from www.pubmedcentral.nih.gov
Abstract
The biochemistry of RNA-Seq library preparation results in cDNA fragments that are not uniformly distributed within the transcripts they represent. This non-uniformity must be accounted for when estimating expression levels, and we show how to perform the needed corrections using a likelihood based approach. We find improvements in expression estimates as measured by correlation with independently performed qRT-PCR and show that correction of bias leads to improved replicability of results across libraries and sequencing technologies.
METHOD (Open Access)
Adam Roberts1, Cole Trapnell2,3, Julie Donaghey2, John L Rinn2,3 and Lior Pachter1,4*
* Correspondence: lpachter@math.berkeley.edu
1Department of Computer Science, 387 Soda Hall, UC Berkeley, Berkeley, CA 94720, USA
Full list of author information is available at the end of the article
Roberts et al. Genome Biology 2011, 12:R22
http://genomebiology.com/2011/12/3/R22
© 2011 Roberts et al. licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background
RNA-Seq technology offers the possibility of accurately measuring transcript abundances in a sample of RNA by sequencing of double stranded cDNA [1]. Unfortunately, current technological limitations of sequencers require that the cDNA molecules represent only partial fragments of the RNA being probed. The cDNA fragments are obtained by a series of steps, often including reverse transcription primed by random hexamers (RH), or by oligo (dT). Most protocols also include a fragmentation step, typically RNA hydrolysis or nebulization, or alternatively cDNA fragmentation by DNase I treatment or sonication. Many sequencing technologies also require constrained cDNA lengths, so a final gel cutting step for size selection may be included. Figure 1 shows how some of these procedures are combined in a typical experiment.

The randomness inherent in many of the preparation steps for RNA-Seq leads to fragments whose starting points (relative to the transcripts from which they were sequenced) appear to be chosen approximately uniformly at random. This observation has been the basis of assumptions underlying a number of RNA-Seq analysis approaches that, in computer science terms, invert the "reduction" of transcriptome estimation to DNA sequencing [2-6]. However, recent careful analysis has revealed both positional [7] and sequence-specific [8,9] biases in sequenced fragments. Positional bias refers to a local effect in which fragments are preferentially located towards either the beginning or end of transcripts. Sequence-specific bias is a global effect where the sequence surrounding the beginning or end of potential fragments affects their likelihood of being selected for sequencing. These biases can affect expression estimates [10], and it is therefore important to correct for them during RNA-Seq analysis.

Although many biases can be traced back to specifics of the preparation protocols (see Figure 2 and [8]), it is currently not possible to predict fragment distributions directly from a protocol. This is due to many factors, including uncertainty in the biochemistry of many steps and the unknown shape and effect of RNA secondary structure on certain procedures [10]. It is therefore desirable to estimate the extent and nature of bias indirectly by inferring it from the data (fragment alignments) in an experiment. However, such inference is non-trivial due to the fact that fragment abundances are proportional to transcript abundances, so that the expression levels of transcripts from which fragments originate must be taken into account when estimating bias, as Figure 2 demonstrates. At the same time, expression estimates made without correcting for bias may lead to the over- or under-representation of fragments. Therefore the problems of bias estimation and expression estimation are fundamentally linked, and must be solved together. Likelihood based approaches are well suited to resolving this difficulty, as the bias and abundance parameters can be estimated jointly by maximizing a likelihood function for the data.
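To make the link between bias estimation and expression estimation concrete, the following is a minimal sketch and not the model developed in this paper. It assumes a toy Poisson model in which the expected number of fragments starting at a position is the transcript abundance multiplied by a weight for that position's sequence context. The function name `joint_estimate`, the four context classes, and all numeric values are hypothetical. The two update steps are the coordinate-wise maximizers of the Poisson likelihood under this assumed model, applied alternately.

```python
import random

def joint_estimate(counts, contexts, n_contexts, n_iter=500):
    """Alternately maximize a toy Poisson likelihood (an assumed model).

    counts[t][i]   : fragments starting at position i of transcript t
    contexts[t][i] : context class (0..n_contexts-1) at that position
    Expected count is assumed to be abundance[t] * bias[contexts[t][i]].
    """
    n_tx = len(counts)
    bias = [1.0] * n_contexts  # start from the uniform (no-bias) model
    for _ in range(n_iter):
        # Abundance step: total count divided by bias-weighted length.
        abundance = [sum(counts[t]) / sum(bias[c] for c in contexts[t])
                     for t in range(n_tx)]
        # Bias step: observed count per context over abundance-weighted exposure.
        num = [0.0] * n_contexts
        den = [0.0] * n_contexts
        for t in range(n_tx):
            for x, c in zip(counts[t], contexts[t]):
                num[c] += x
                den[c] += abundance[t]
        bias = [n / d if d > 0 else 1.0 for n, d in zip(num, den)]
        # Fix the scale ambiguity by forcing the mean bias weight to 1.
        mean_bias = sum(bias) / n_contexts
        bias = [b / mean_bias for b in bias]
    # Final abundance step so abundances match the normalized bias weights.
    abundance = [sum(counts[t]) / sum(bias[c] for c in contexts[t])
                 for t in range(n_tx)]
    return abundance, bias

# Hypothetical, noise-free toy data: three transcripts of 200 positions each
# with different context compositions.
rng = random.Random(0)
true_abundance = [5.0, 20.0, 1.0]
true_bias = [0.5, 1.0, 1.5, 1.0]  # mean is 1 by construction
composition = [[0.7, 0.1, 0.1, 0.1], [0.25] * 4, [0.1, 0.1, 0.7, 0.1]]
contexts = [rng.choices(range(4), weights=w, k=200) for w in composition]
counts = [[true_abundance[t] * true_bias[c] for c in contexts[t]]
          for t in range(3)]

# Naive estimate that ignores bias: fragments per position.
naive = [sum(row) / len(row) for row in counts]
est_abundance, est_bias = joint_estimate(counts, contexts, n_contexts=4)
print("naive:", naive)
print("joint:", est_abundance, est_bias)
```

In this toy setting the naive per-position estimate is distorted for transcripts enriched in under- or over-represented contexts, while the bias weights cannot be read off the raw counts without knowing the abundances. This is the circular dependence the Background describes.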
Figure 1 Overview of a typical RNA-Seq experiment. Steps labeled in the figure: (1) fragmentation of RNA; (2) random priming to make sscDNA (first-strand synthesis); (3) construction of dscDNA (second-strand synthesis); (4) size selection; (5) sequencing; (6) mapping. RNA is initially fragmented (1) followed by first-strand synthesis priming (2), which selects the 3′ fragment end (in transcript orientation), to make single stranded cDNA. Double stranded cDNA created during second-strand synthesis (3), which selects the 5′ fragment end, is then size selected (4) resulting in fragments suitable for sequencing (5). Sequenced reads are mapped to opposite strands of the genome (6), and in the case of known transcript or fragment strandedness, the read alignments reveal the 5′ and 3′ ends of the sequenced fragment (see Supplementary methods in Additional file 3). All arrows are directed 5′ to 3′ in transcript orientation.
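As a rough illustration of the idealized, bias-free view of the fragmentation and size-selection steps in Figure 1, the sketch below samples fragments whose start points are uniform along each transcript. It is a simplified stand-in, not a simulation of the actual biochemistry: the function `simulate_fragments`, the uniform fragment-length window standing in for the gel cutout, and all lengths and abundances are assumptions chosen for illustration.

```python
import random

def simulate_fragments(tx_lengths, abundances, n_fragments,
                       min_len=150, max_len=250, seed=0):
    """Sample (transcript, start, end) fragments under a uniform start model.

    Transcripts are drawn in proportion to abundance * length, the start is
    uniform along the transcript, and the fragment length is uniform in
    [min_len, max_len] as a crude stand-in for size selection. Fragments
    that would run off the transcript end are rejected and redrawn.
    """
    rng = random.Random(seed)
    weights = [a * n for a, n in zip(abundances, tx_lengths)]
    indices = range(len(tx_lengths))
    fragments = []
    while len(fragments) < n_fragments:
        t = rng.choices(indices, weights=weights)[0]
        start = rng.randrange(tx_lengths[t])      # uniform start position
        length = rng.randint(min_len, max_len)    # inclusive bounds
        end = start + length
        if end <= tx_lengths[t]:                  # keep only complete fragments
            fragments.append((t, start, end))
    return fragments

# Hypothetical example: two transcripts of equal length with a tenfold
# abundance difference, and one longer transcript at low abundance.
tx_lengths = [1000, 1000, 2000]
abundances = [1.0, 10.0, 1.0]
frags = simulate_fragments(tx_lengths, abundances, n_fragments=2000)
per_tx = [sum(1 for t, _, _ in frags if t == k) for k in range(len(tx_lengths))]
print(per_tx)
```

Under this model, the number of fragments per transcript scales with abundance times the number of valid start positions. This is the proportionality that the uniform sampling assumption relies on and that positional and sequence-specific bias violate.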





