Hindawi Publishing Corporation
Journal of Biomedicine and Biotechnology
Volume 2010, Article ID 853916, 19 pages
doi:10.1155/2010/853916

Review Article

Uncovering the Complexity of Transcriptomes with RNA-Seq

Valerio Costa,1 Claudia Angelini,2 Italia De Feis,2 and Alfredo Ciccodicola1

1 Institute of Genetics and Biophysics "A. Buzzati-Traverso", IGB-CNR, 80131 Naples, Italy
2 Istituto per le Applicazioni del Calcolo "Mauro Picone", IAC-CNR, 80131 Naples, Italy

Correspondence should be addressed to Valerio Costa, costav@igb.cnr.it

Received 22 February 2010; Accepted 7 April 2010

Academic Editor: Momiao Xiong

Copyright © 2010 Valerio Costa et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In recent years, the introduction of massively parallel sequencing platforms for Next Generation Sequencing (NGS) protocols, able to simultaneously sequence hundreds of thousands of DNA fragments, has dramatically changed the landscape of genetics studies. RNA-Seq for transcriptome studies, ChIP-Seq for DNA-protein interactions, and CNV-Seq for large genome nucleotide variations are only some of the intriguing new applications supported by these innovative platforms. Among them, RNA-Seq is perhaps the most complex NGS application. Expression levels of specific genes, differential splicing, and allele-specific expression of transcripts can be accurately determined by RNA-Seq experiments to address many biological issues. These attributes are not readily achievable with the previously widespread hybridization-based or tag sequence-based approaches. However, the unprecedented level of sensitivity and the large amount of data produced by NGS platforms bring clear advantages as well as new challenges and issues.
While this technology brings great power to make new biological observations and discoveries, it also requires considerable effort in the development of new bioinformatics tools to deal with these massive data files. This paper gives a survey of the RNA-Seq methodology, focusing in particular on the challenges that this application presents from both a biological and a bioinformatics point of view.

1. Introduction

It is commonly known that genetic information is conveyed from DNA to proteins via messenger RNA (mRNA) through a finely regulated process. To achieve such regulation, the concerted action of multiple cis-acting proteins that bind to gene flanking regions ("core" and "auxiliary" regions) is necessary [1]. In particular, core elements, located at the exon boundaries, are strictly required for initiating the pre-mRNA processing events, whereas auxiliary elements, variable in number and location, are crucial for their ability to enhance or inhibit the basal splicing activity of a gene.

Until recently (less than 10 years ago), the central dogma of genetics used the term "gene" to indicate a DNA portion whose corresponding mRNA encodes a protein. According to this view, RNA was considered a "bridge" in the transfer of biological information between DNA and proteins, whereas the identity of each expressed gene, together with its transcriptional level, was commonly referred to as the "transcriptome" [2]. The transcriptome was considered to consist mainly of ribosomal RNA (rRNA, 80–90%), transfer RNA (tRNA, 5–15%), mRNA (2–4%), and a small fraction of intragenic (i.e., intronic) and intergenic noncoding RNA (ncRNA, 1%) with undefined regulatory functions [3]. In particular, both intragenic and intergenic sequences, enriched in repetitive elements, have long been considered genetically inert and mainly composed of "junk" or "selfish" DNA [4].
More recently it has been shown that the amount of noncoding DNA (ncDNA) increases with organism complexity, ranging from 0.25% of the genome in prokaryotes to 98.8% in humans [5]. These observations have strengthened the evidence that ncDNA, rather than being junk DNA, is likely to represent the main driving force behind the diversity and biological complexity of living organisms.

Since the dawn of genetics, the relationship between DNA content and the biological complexity of living organisms has been a fruitful field of speculation and debate [6]. To date, several studies, including recent analyses performed during the ENCODE project, have shown the pervasive nature of eukaryotic transcription, with almost the full length of the nonrepeat regions of the genome being transcribed [7].
The unexpected level of complexity that emerged with the discovery of endogenous small interfering RNA (siRNA) and microRNA (miRNA) was only the tip of the iceberg [8]. Long intergenic noncoding RNA (lincRNA), promoter- and terminator-associated small RNA (PASR and TASR, respectively), transcription start site-associated RNA (TSSa-RNA), transcription initiation RNA (tiRNA), and many others [8] represent part of the interspersed and crosslinking pieces of a complicated transcription puzzle. To complicate matters further, there is evidence that most of the pervasive transcripts identified thus far have been found only in specific cell lines (in most cases mutant cell lines) under particular growth conditions, and/or in particular tissues. In light of this, discovering and interpreting the complexity of a transcriptome is a crucial aim for understanding the functional elements of a genome. Revealing the complexity of the genetic code of living organisms by analyzing the molecular constituents of cells and tissues will lead to a more complete knowledge of many biological issues, such as the onset and progression of disease.

The main goal of whole-transcriptome analyses is to identify, characterize, and catalogue all the transcripts expressed within a specific cell or tissue at a particular stage, with the potential to determine the correct splicing patterns and the structure of genes, and to quantify the differential expression of transcripts in both physiological and pathological conditions [9].

In the last 15 years, the development of hybridization technology, together with tag sequence-based approaches, provided a first deep insight into this field. However, the arrival on the marketplace of the NGS platforms, with all their "Seq" applications, has completely revolutionized the way we think about molecular biology.
The aim of this paper is to give an overview of the RNA-Seq methodology, highlighting the challenges that this application presents from both the biological and the bioinformatics point of view.

2. Next Generation Sequencing Technologies

From the first complete nucleotide sequence of a gene, published in 1964 by Holley [10], to the initial developments of Maxam and Gilbert [11] and Sanger et al. [12] in the 1970s (see Figure 1), the world of nucleic acid sequencing was an RNA world, and the history of nucleic acid sequencing technology was largely contained within the history of RNA sequencing.

In the last 30 years, molecular biology has undergone great advances, and 2004 will be remembered as the year that revolutionized the field: with the introduction of massively parallel sequencing platforms, the Next Generation Sequencing era [13–15] began. The pioneer of these instruments was the Roche (454) Genome Sequencer (GS) in 2004 (http://www.454.com/), able to simultaneously sequence several hundred thousand DNA fragments with a read length greater than 100 base pairs (bp). The current GS FLX Titanium produces more than 1 million reads in excess of 400 bp. It was followed in 2006 by the Illumina Genome Analyzer (GA) (http://www.illumina.com/), capable of generating tens of millions of 32-bp reads. Today, the Illumina GAIIx produces 200 million 75–100 bp reads. The last to arrive in the marketplace were the Applied Biosystems platform based on Sequencing by Oligo Ligation and Detection (SOLiD) (http://www3.appliedbiosystems.com/AB_Home/index.htm), capable of producing 400 million 50-bp reads, and the Helicos BioScience HeliScope (http://www.helicosbio.com/), the first single-molecule sequencer, which produces 400 million 25–35 bp reads.

While the individual approaches vary considerably in their technical details, the essence of these systems is the miniaturization of individual sequencing reactions.
Each of these miniaturized reactions is seeded with DNA molecules at limiting dilutions, such that there is a single DNA molecule in each, which is first amplified and then sequenced. More precisely, the genomic DNA is randomly broken into smaller fragments from which either fragment templates or mate-pair templates are created. A common theme among NGS technologies is that the template is attached to a solid surface or support (immobilization by primer or template) or indirectly immobilized (by linking a polymerase to the support). The immobilization of spatially separated templates allows thousands to billions of sequencing reactions to run simultaneously. The physical design of these instruments allows for an optimal spatial arrangement of each reaction, enabling an efficient readout by laser scanning (or other methods) of millions of individual sequencing reactions on a standard glass slide. While the immense volume of data generated is attractive, it is arguable that the greatest benefit of these new technologies is the elimination of the cloning step for the DNA fragments to be sequenced. All current methods allow the direct use of small DNA/RNA fragments without requiring their insertion into a plasmid or other vector, thereby removing a costly and time-consuming step of traditional Sanger sequencing.

There is no doubt that the arrival of NGS technologies in the marketplace has changed the way we think about scientific approaches in basic, applied, and clinical research. The broadest application of NGS may be the resequencing of different genomes, and in particular human genomes, to enhance our understanding of how genetic differences affect health and disease.
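For a rough sense of scale, the read counts and read lengths quoted earlier in this section for each platform can be converted into approximate bases per run. The sketch below is only illustrative arithmetic: where the text gives a range or a lower bound, the lower value is assumed, and the names `PLATFORMS`, `bases_per_run`, and `yield_table` are hypothetical helpers introduced here, not part of any sequencing software.

```python
# Per-run figures as quoted in the text. Where the text gives a range or a
# lower bound, the lower value is ASSUMED here for illustration only.
PLATFORMS = {
    # name: (reads per run, read length in bp)
    "Roche 454 GS FLX Titanium": (1_000_000, 400),  # ">1 million reads", ">400 bp"
    "Illumina GAIIx": (200_000_000, 75),            # 75-100 bp; lower end assumed
    "AB SOLiD": (400_000_000, 50),
    "Helicos HeliScope": (400_000_000, 25),         # 25-35 bp; lower end assumed
}


def bases_per_run(reads, read_length_bp):
    """Approximate raw yield: number of reads times read length."""
    return reads * read_length_bp


def yield_table(platforms):
    """Map each platform name to its approximate bases per run."""
    return {
        name: bases_per_run(reads, length)
        for name, (reads, length) in platforms.items()
    }


if __name__ == "__main__":
    # Print platforms from lowest to highest approximate yield, in gigabases.
    for name, total in sorted(yield_table(PLATFORMS).items(), key=lambda kv: kv[1]):
        print(f"{name:28s} {total / 1e9:6.1f} Gb")
```

Under these assumptions, a 454 run comes to roughly 0.4 Gb and a SOLiD run to roughly 20 Gb, which illustrates the trade-off between read length and read number across platforms. These are raw estimates that ignore quality filtering and mappability.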
Indeed, these platforms have been quickly applied to many genomic contexts, giving rise to the following "Seq" protocols: RNA-Seq for transcriptomics, ChIP-Seq for DNA-protein interactions, DNase-Seq for the identification of the most active regulatory regions, CNV-Seq for copy number variation, and methyl-Seq for genome-wide profiling of epigenetic marks.

Figure 1: Evolution of DNA revolution.
1953: James Watson and Francis Crick deduce DNA's conformation from experimental clues and model building.
1958: Matthew Meselson and Franklin Stahl demonstrate how DNA replicates.
1961-1963: Researchers crack the genetic code linking gene and protein.
1964: Robert Holley completes the first nucleotide sequence of the gene encoding yeast alanine tRNA.
1972: Paul Berg and colleagues create the first recombinant DNA molecules.
1977: Frederick Sanger, Allan Maxam, and Walter Gilbert pioneer DNA sequencing.
1985: Kary Mullis invents PCR.
1986: The idea to sequence the human genome is broached. Leroy Hood and Lloyd Smith automate DNA sequencing.
1986-1987: US DOE officially begins the human genome project. US NIH takes over the genome project, with James Watson at the helm.
1990: Sequencing of human and model organism genomes begins. BLAST algorithm developed to align DNA sequences.
1994: Detailed genetic map of the human genome published, including 5840 mapped loci.
1995: Researchers at The Institute for Genomic Research publish the first genome sequence of an organism: H. influenzae.
1996: International human genome project consortium establishes the "Bermuda rules" for public data release.
1999: First human chromosome sequence published.
2000: Fruit fly genome sequenced, validating Celera's whole-genome shotgun method. First assembly of the human genome completed by the UCSC group.
2001: In mid-February, Science and Nature publish the first draft of the human genome sequence.
2003: In April, the human genome sequence is completed, 2 years earlier than planned.
2004: Introduction of massively parallel sequencing platforms, giving rise to "next generation sequencing".

3. RNA-Seq

RNA-Seq is perhaps one of the most complex next-generation applications. Expression levels, differential splicing, allele-specific expression, RNA editing, and fusion transcripts constitute important information when comparing samples for disease-related studies. These attributes, not readily available by hybridization-based or tag sequence-based approaches, can now be obtained far more easily and precisely if sufficient sequence coverage is achieved. However, many other essential subtleties in RNA-Seq data remain to be faced and understood.
Hybridization-based approaches typically refer to the microarray platforms. Until recently, these platforms have offered the scientific community a very useful tool to simultaneously investigate thousands of features within a single experiment, providing a reliable, rapid, and cost-effective technology to analyze gene expression patterns. Due to their nature, they suffer from background and cross-hybridization issues and allow researchers to measure only the relative abundance of the RNA transcripts included in the array design [16]. This technology, which measures gene expression by simply quantifying (via an indirect method) the hybridized and labeled cDNA, does not allow the detection of RNA transcripts from repeated sequences and offers a limited dynamic range. It is therefore unable to detect very subtle changes in gene expression levels, which are critical in understanding any biological response to exogenous stimuli and/or environmental changes [9, 17, 18].

Other methods, such as Serial and Cap Analysis of Gene Expression (SAGE and CAGE, respectively) and Polony Multiplex Analysis of Gene Expression (PMAGE), are tag-based sequencing methods that measure the absolute abundance of transcripts in a cell, tissue, or organ and, unlike microarrays, do not require prior knowledge of any gene sequence [19]. These analyses consist of generating sequence tags from fragmented cDNA and concatenating them prior to cloning and sequencing [20]. SAGE is a powerful technique that can therefore be viewed as an unbiased digital microarray assay. However, although SAGE sequencing has been successfully used to explore the transcriptional landscape of various genetic disorders, such as diabetes [21, 22], cardiovascular diseases [23], and Down syndrome [24, 25], its cloning and sequencing steps are quite laborious and have thus far limited its use.

In contrast, RNA-Seq on NGS platforms has clear advantages over the existing approaches [9, 26].
First, unlike hybridization-based technologies, RNA-Seq is not limited to the detection of known transcripts, thus allowing the identification, characterization, and quantification of new splice isoforms. In addition, it allows researchers to determine the correct gene annotation, also defining, at single-nucleotide resolution, the transcriptional boundaries of genes and the expressed Single Nucleotide Polymorphisms (SNPs). Other advantages of RNA-Seq compared to microarrays are the low "background signal", the absence of an upper limit for quantification and, consequently, the larger dynamic range of expression levels over which transcripts can be detected. RNA-Seq data also show high levels of reproducibility for both technical and biological replicates.