Comprehensive comparative analysis of strand-specific RNA sequencing methods.
- PubMed: 20711195
Abstract
Strand-specific, massively parallel cDNA sequencing (RNA-seq) is a powerful tool for transcript discovery, genome annotation and expression profiling. There are multiple published methods for strand-specific RNA-seq, but no consensus exists as to how to choose between them. Here we developed a comprehensive computational pipeline to compare library quality metrics from any RNA-seq method. Using the well-annotated Saccharomyces cerevisiae transcriptome as a benchmark, we compared seven library-construction protocols, including both published and our own methods. We found marked differences in strand specificity, library complexity, evenness and continuity of coverage, agreement with known annotations and accuracy for expression profiling. Weighing each method's performance and ease, we identified the dUTP second-strand marking and the Illumina RNA ligation methods as the leading protocols, with the former benefitting from the current availability of paired-end sequencing. Our analysis provides a comprehensive benchmark, and our computational pipeline is applicable for assessment of future protocols in other organisms.
NATURE METHODS | VOL.7 NO.9 | SEPTEMBER 2010 | 709
Recent advances in massively parallel cDNA sequencing (RNA-seq) have opened the way for comprehensive analysis of any transcriptome¹. In principle, RNA-seq allows analysis of all expressed transcripts, with three key goals: (i) annotating the structures of all transcribed genes including their 5′ and 3′ ends and all splice junctions²⁻⁴, (ii) quantifying expression of each transcript⁵,⁶ and (iii) measuring the extent of alternative splicing⁷⁻¹¹.
Standard libraries for RNA-seq do not preserve information about which strand was originally transcribed. Synthesis of randomly primed double-stranded cDNA followed by addition of adaptors for next-generation sequencing leads to the loss of information about which strand was present in the original mRNA template. In some cases, strand information can be inferred by subsequent computational analyses using, for example, open reading frame (ORF) information in protein-coding genes, biases in coverage between 5′ and 3′ ends⁴ or splice-site orientation in eukaryotic genomes⁴,¹⁰,¹¹.
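As an illustration of the splice-site route to strand inference mentioned above, the following is a minimal sketch, not the procedure used in any of the cited studies. It assumes canonical GT-AG introns only, and the function name and coordinate convention are hypothetical choices made for this example.

```python
def infer_strand_from_intron(genome_seq: str, intron_start: int, intron_end: int) -> str:
    """Guess the transcribed strand from an intron's terminal dinucleotides.

    Assumption: intron_start and intron_end are 0-based, end-exclusive
    coordinates on the reference (+) strand, and only canonical GT-AG
    introns are recognized.
    """
    first_two = genome_seq[intron_start:intron_start + 2].upper()
    last_two = genome_seq[intron_end - 2:intron_end].upper()
    if (first_two, last_two) == ("GT", "AG"):
        return "+"  # donor and acceptor read correctly on the + strand
    if (first_two, last_two) == ("CT", "AC"):
        return "-"  # reverse complement of GT...AG, so the gene is on the - strand
    return "?"      # non-canonical or unspliced: strand cannot be inferred


# Toy sequences: three flanking bases, an 8-base "intron", three flanking bases.
print(infer_strand_from_intron("AAAGTCCCCAGTTT", 3, 11))
print(infer_strand_from_intron("AAACTGGGGACTTT", 3, 11))
```

Reads that do not span a splice junction return "?" here, which is one reason this kind of inference is of limited use in genomes with few introns.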
Nevertheless, direct information on the originating strand can substantially enhance the value of an RNA-seq experiment. For example, such information would help to accurately identify antisense transcripts, with potential regulatory roles¹², determine the transcribed strand of other noncoding RNAs, demarcate the exact boundaries of adjacent genes transcribed on opposite strands and resolve the correct expression levels of coding or noncoding overlapping transcripts. These tasks are particularly challenging in small microbial genomes, prokaryotic and eukaryotic, in which genes are densely coded, with overlapping untranslated regions (UTRs) or ORFs and in which splice-site information is limited or nonexistent.
Many methods have been recently developed for strand-specific RNA-seq, and they fall into two main classes. One class relies on attaching different adaptors in a known orientation relative to the 5′ and 3′ ends of the RNA transcript (Fig. 1a). These protocols generate a cDNA library flanked by two distinct adaptor sequences, marking the 5′ end and the 3′ end of the original mRNA. A second class of methods relies on marking one strand by chemical modification, either on the RNA itself by bisulfite treatment or during second-strand cDNA synthesis followed by degradation of the unmarked strand (Fig. 1b). Both modification methods essentially follow the standard protocol for RNA-seq with the exception of these marking steps.
Although standard RNA-seq largely relies on one protocol, the great diversity of published protocols for strand-specific RNA-seq poses several challenges. First, when conducting an experiment, researchers are challenged to identify a suitable protocol. Furthermore, if protocols vary considerably in their performance, the chosen method can dramatically affect the conclusions drawn from an experiment, confounding interpretation and comparison across studies. There is therefore a substantial need for a systematic evaluation of the performance of different protocols for strand-specific RNA-seq.
Here we present a comprehensive comparison of seven protocols for strand-specific RNA-seq. Using Saccharomyces cerevisiae poly(A)⁺ RNA, we built a compendium of libraries using these protocols and sequenced each of them on an Illumina Genome Analyzer instrument to deep coverage. We developed a computational pipeline to assess each library’s quality according to library complexity, strand specificity, evenness and continuity of coverage, agreement with known genome annotation and quantitative accuracy for expression profiling, in addition to considering the ease of laboratory and computational manipulations. We identified the dUTP and Illumina RNA ligation methods as the leading protocols, with the dUTP library providing the added benefit of the ability to conduct paired-end sequencing.

Joshua Z Levin¹,⁶, Moran Yassour¹⁻³,⁶, Xian Adiconis¹, Chad Nusbaum¹, Dawn Anne Thompson¹, Nir Friedman³,⁴, Andreas Gnirke¹ & Aviv Regev¹,²,⁵
¹Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, Massachusetts, USA. ²Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA. ³School of Engineering and Computer Science, Hebrew University, Jerusalem, Israel. ⁴Alexander Silberman Institute of Life Sciences, Hebrew University, Jerusalem, Israel. ⁵Howard Hughes Medical Institute, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA. ⁶These authors contributed equally to this work. Correspondence should be addressed to J.Z.L. (jlevin@broadinstitute.org) or A.R. (aregev@broad.mit.edu).
RECEIVED 26 MARCH; ACCEPTED 20 JULY; PUBLISHED ONLINE 15 AUGUST 2010; DOI:10.1038/NMETH.1491
© 2010 Nature America, Inc. All rights reserved.
RESULTS
A comparison of strand-specific RNA-seq
We evaluated 13 strand-specific libraries. We constructed 11 libraries based on seven strand-specific RNA-seq methods (Fig. 1), including two variations for four of the methods. We also compiled comparable data for two published libraries: a dUTP library¹³ and a library based on another (eighth) method from the differential adaptor class¹⁴ (3′ split adaptor; Supplementary Fig. 1). Finally, we prepared a standard, non–strand-specific cDNA library to use as a control in these comparisons.
We explored two different variations for four of the seven methods to improve our libraries (Online Methods). These variations were the addition of actinomycin D to the ‘not not so random’ (NNSR) library protocol, two published variations of the bisulfite library protocol (‘H’ and ‘S’; Online Methods¹⁵,¹⁶), different size-selection methods for the Illumina RNA ligation libraries and different reverse transcription primers for the dUTP libraries. We present results only for the ‘S’ bisulfite library because we found no substantial differences between the two libraries in our analyses.
We used each method to prepare a cDNA library for Illumina sequencing from S. cerevisiae poly(A)⁺ RNA. We chose S. cerevisiae because this eukaryotic model organism has an exceptionally well-annotated genome, facilitating quality evaluations. We used paired-end Illumina sequencing for each library (Online Methods), except for the RNA ligation and Illumina RNA ligation libraries, which we sequenced only from the 3′ end of each cDNA because of the RNA adaptors used in these protocols. These approaches could be modified in the future to accommodate paired-end sequencing by changing the RNA adaptor and PCR primer sequences.
An analysis framework for assessing RNA-seq libraries
To compare the quality of the different libraries, we defined six assessment criteria (Fig. 2) implemented in a computational pipeline (Online Methods). These criteria were library complexity, defined as the number of unique reads (Fig. 2a); strand specificity, defined as the number of reads mapping to known transcribed regions at the expected strand (Fig. 2b); evenness and continuity of coverage at annotated transcripts (Fig. 2c,d); performance at 5′ and 3′ ends, defined as agreement with known end annotation (Fig. 2d); and performance in expression profiling, defined by sensitivity, linearity and dynamic range. With the exception of strand specificity, we compared each criterion to that for the control library. We focused on only one variation per method unless there were substantial differences in performance between variations. We provide the full evaluation results in Supplementary Tables 1–2 and Supplementary Figures 2–4.
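To make the first two criteria concrete, the sketch below computes a library-complexity figure and a strand-specificity figure from a toy list of mapped reads. It is an illustration under stated assumptions rather than the pipeline described in Online Methods: the `Read` and `Gene` records, the use of the read start position to assign a read to a gene, and the reporting of fractions alongside counts are all choices made for this example.

```python
from typing import List, NamedTuple, Optional, Tuple


class Read(NamedTuple):
    chrom: str
    start: int   # 0-based leftmost mapped position
    strand: str  # "+" or "-": strand of the original RNA implied by the read


class Gene(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # annotated strand


def library_complexity(reads: List[Read]) -> Tuple[int, float]:
    """Return the number of distinct reads and their fraction of all reads."""
    distinct = len(set(reads))
    fraction = distinct / len(reads) if reads else 0.0
    return distinct, fraction


def strand_specificity(reads: List[Read], genes: List[Gene]) -> Optional[float]:
    """Fraction of reads in annotated genes that lie on the annotated strand.

    Simplification: a read is assigned to the first gene containing its start
    position, so loci with overlapping genes on opposite strands are not
    handled properly here.
    """
    sense = antisense = 0
    for read in reads:
        for gene in genes:
            if read.chrom == gene.chrom and gene.start <= read.start < gene.end:
                if read.strand == gene.strand:
                    sense += 1
                else:
                    antisense += 1
                break
    total = sense + antisense
    return sense / total if total else None


reads = [Read("chrI", 10, "+"), Read("chrI", 10, "+"), Read("chrI", 20, "+"),
         Read("chrI", 30, "-"), Read("chrI", 500, "-")]
genes = [Gene("chrI", 0, 100, "+")]
print(library_complexity(reads))         # (4, 0.8)
print(strand_specificity(reads, genes))  # 0.75
```

In the toy data, one duplicated read lowers the complexity fraction, and one antisense read inside the gene lowers the specificity; the read at position 500 falls outside any annotation and is ignored by the specificity measure.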
Equal sampling of reads enables direct library comparisons
We mapped each library’s reads to the S. cerevisiae genome using Arachne¹⁷. For paired-end libraries, we mapped unique pairs with opposite orientations and an appropriate separation; for single-end libraries, we identified unique mappings for individual reads¹⁷ (Online Methods).
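The pairing rule described above can be sketched as a simple filter over candidate alignments. This is not Arachne's algorithm or interface; the `Alignment` record, the function names and the 500-base separation limit are hypothetical values chosen only to illustrate "opposite orientations and an appropriate separation" together with a uniqueness requirement.

```python
from typing import List, NamedTuple, Optional, Tuple


class Alignment(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"


def is_concordant(a: Alignment, b: Alignment, max_separation: int = 500) -> bool:
    """True if two mate alignments face each other within max_separation bases.

    max_separation is an assumed cutoff for this example, not a value
    taken from the paper.
    """
    if a.chrom != b.chrom or a.strand == b.strand:
        return False
    fwd, rev = (a, b) if a.strand == "+" else (b, a)
    span = rev.end - fwd.start  # outer distance covered by the pair
    return fwd.start <= rev.start and 0 < span <= max_separation


def unique_concordant_pair(
    hits1: List[Alignment], hits2: List[Alignment], max_separation: int = 500
) -> Optional[Tuple[Alignment, Alignment]]:
    """Return the pair only if exactly one concordant combination exists."""
    pairs = [(a, b) for a in hits1 for b in hits2
             if is_concordant(a, b, max_separation)]
    return pairs[0] if len(pairs) == 1 else None


mate1 = [Alignment("chrI", 100, 176, "+")]
mate2 = [Alignment("chrI", 300, 376, "-"), Alignment("chrII", 5, 81, "-")]
print(unique_concordant_pair(mate1, mate2))
```

Here the second mate has two candidate placements, but only one is concordant with the first mate, so the pair is kept; if both placements were concordant, the pair would be discarded as ambiguous.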
The libraries had a broad range of yields, measured by the total number of reads and by the number of reads or paired reads mapping to a unique location (Supplementary Table 1). In this initial comparison, the dUTP library had the highest percentage of paired-end mapped reads (Supplementary Table 1). The Illumina RNA ligation–solid-phase reversible immobilization (SPRI) library, which we prepared using SPRI-based size selection, had a smaller percentage of unique reads than the Illumina RNA ligation library, which we prepared using gel-based size selection (35% versus 59%; Supplementary Table 1). This was likely due to the difficulty in physically removing cDNAs shorter than 76 base pairs with the SPRI method, resulting in the ends
[Figure 1 panel labels, recovered from the image]
Panel a (differential adaptor methods):
- RNA ligation²⁹: 3′ and 5′ adaptors ligated sequentially to RNA with cleanup (gel size selection after each ligation).
- Illumina RNA ligation: 3′ preadenylated adaptors and 5′ adaptors ligated sequentially to RNA without cleanup (S. Luo and G. Schroth, personal communication); no gel size selection.
- SMART³⁰: nontemplate ‘C’s on 5′ end of cDNA; template switch, PCR.
- SMART–RNA ligation (hybrid): adaptor ligated on 3′ end of RNA and nontemplate ‘C’s on 5′ end of cDNA; template switching, PCR.
- NNSR priming³¹: first- and second-strand cDNA synthesis with adaptors on ends of the primers.
Panel b (differential marking methods):
- Bisulfite¹⁵,¹⁶: convert ‘C’s to ‘U’s in RNA.
- dUTP second strand¹³: second-strand synthesis with dUTP; remove ‘U’s after adaptor ligation and size selection (USER).

Figure 1 | Methods for strand-specific RNA-seq. (a,b) Salient details for differential adaptor methods including RNA ligation²⁹, SMART³⁰ and NNSR priming³¹ (a) and differential marking methods (b). USER, uracil-specific excision reagent. mRNA is shown in gray and cDNA in black. For differential adaptor methods, 5′ adaptors are shown in blue, and 3′ adaptors are shown in red.




