Detection of splice junctions from paired-end RNA-seq data by SpliceMap

by Kin Fai Au, Hui Jiang, Lan Lin, Yi Xing, Wing Hung Wong
Nucleic Acids Research (2010)

Abstract

Alternative splicing is a prevalent post-transcriptional process, which is not only important to normal cellular function but is also involved in human diseases. The newly developed second generation sequencing technique provides high-throughput data (RNA-seq data) to study alternative splicing events in different types of cells. Here, we present a computational method, SpliceMap, to detect splice junctions from RNA-seq data. This method does not depend on any existing annotation of gene structures and is capable of finding novel splice junctions with high sensitivity and specificity. It can handle long reads (50–100 nt) and can exploit paired-read information to improve mapping accuracy. Several parameters are included in the output to indicate the reliability of each predicted junction and to help filter out false predictions. We applied SpliceMap to analyze 23 million paired 50-nt reads from human brain tissue. The results show that, at this depth of sequencing, RNA-seq can support reliable detection of splice junctions except for those that are present at very low levels. Compared to current methods, SpliceMap can achieve 12% higher sensitivity without sacrificing specificity.

Available from www.pubmedcentral.nih.gov

Kin Fai Au1, Hui Jiang1,2, Lan Lin3, Yi Xing3 and Wing Hung Wong1,*

1Department of Statistics, Stanford University, Stanford, CA 94305, 2Stanford Genome Technology Center, 855 California Ave, Palo Alto, CA 94304 and 3Department of Internal Medicine and Department of Biomedical Engineering, University of Iowa, Iowa City, IA, 52242, USA

Received December 7, 2009; Revised March 10, 2010; Accepted March 12, 2010

INTRODUCTION

RNA splicing is an important post-transcriptional step where one or more segments of the pre-mRNA are spliced out and the remaining segments (exons) are concatenated to form the mature mRNA product.
By alternative splicing, it is possible to produce different transcripts (isoforms) from the same genetic locus. This process occurs in over 90% of multi-exon human genes (1,2) and greatly increases the diversity of possible transcripts in the transcriptome. Aberrant RNA splicing has been found to be associated with many human diseases (3,4). For this reason, techniques to identify and quantify splicing events are important to biology and medicine.

The most popular way to study the structure and abundance of spliced transcripts is through sequencing of expressed sequence tags (ESTs) (5). Traditionally, such studies were expensive and inefficient due to the low throughput of the Sanger method, which was the main sequencing technology used in EST projects. However, with the recent advent of second generation sequencing technology (SGS), it is now feasible to conduct deep and comprehensive sequencing of transcriptomes in a high throughput and cost effective manner (6–8), making it possible to detect rare alternative splicing events.

In such RNA-seq projects, tens or hundreds of millions of short sequences (30–100 nt) are read randomly from the population of transcripts under study. The first step of the analysis is thus the mapping of each short read to a reference genome to determine the genetic loci that may give rise to this read. For reads that are sampled completely within exonic regions, this mapping task can be handled by any existing short-read mapping program, such as ELAND (Cox, unpublished software) and SeqMap (9). However, the reads that are of most interest to us for novel isoform discovery are the ones that span exon-exon junctions. These ‘junction reads’ cannot be mapped directly to the genome. One approach is to map the reads onto the known transcript sequences from the currently annotated exon library. Since the exon library is incomplete, this method cannot find the junctions that involve novel splicing events (10).
In another approach, used in the recently developed TopHat (11) program, reads that are mappable on the reference genome are grouped into distinct clusters such that the reads within each cluster are linked together through overlapping regions. Each cluster then defines a putative exonic region. Subsequently, exon-exon junctions can be searched based on these putative exon definitions. Clustering is a natural approach to find novel junctions in the first RNA-seq experiments (1,6,10,12–17) because the data generated at the early stage of the development of SGS are mostly very short reads (25–36 nt) that are not suitable for direct de novo detection of exon-exon junctions. However, the technology is improving rapidly, and currently the usable length of reads from some SGS instruments like the Illumina Genome Analyzer is typically in the range of 50–100 nt. The increased read length opens up the possibility to directly map the exon-exon junction without any reference to putative or annotated exons.

Here we report a novel algorithm, based on the idea of using the mapping of half-reads as a way to identify the approximate location of a junction. Moreover, this method can be adapted to incorporate the extra information contained in paired-end sequencing data, to achieve a much higher level of specificity than attainable by single-end sequencing. The method is implemented in a freely available Python program named SpliceMap (http://biogibbs.stanford.edu/ kinfai/SpliceMap/).

*To whom correspondence should be addressed. Tel: +1 650 725 2915; Fax: +1 650 725 8977; Email: whwong@stanford.edu

Nucleic Acids Research, 2010, Vol. 38, No. 14, 4570–4578. Published online 5 April 2010. doi:10.1093/nar/gkq211

© The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

MATERIALS AND METHODS

SpliceMap uses only the reference genomic sequence to find junctions, independently of existing exon annotation. It is possible to explore all exon splicing events, including known and novel ones, if the sequencing is of sufficient depth. The core notion is to first pin down the junction boundary on one of the two exons involved in the splicing event, before mapping the full junction. A read that spans a junction must have a match in the reference genome that is not shorter than half its length.
Such a match then provides a seeding that can be used to identify a small genomic region for the search of the corresponding junction. There are four main steps in SpliceMap: half-read mapping, seeding selection, junction search and paired-end filtering (Figure 1). The last step is not applied when the data are not paired-end.

For reads longer than 50 nt, we extract from them several overlapping 50-nt reads and then apply the standard method. For example, we split a 100-nt read into three segments (1–50, 26–75 and 51–100). An extra filter is added in the post-processing step for the long-read data to check the results against the full-length information. In this way, we can find multiple junctions from a single long read.

Half-read mapping

Taking advantage of the reasonably long reads (50 nt) offered by the newest models of second-generation sequencers, the half length (25 nt) can be reliably aligned to the reference genomic sequence with high probability. In this step, SpliceMap maps both halves of the read to the reference genome using any currently available short-read mapping tool, such as SeqMap (9) and ELAND. The maximum number of mismatches allowed for the half-read mapping can be chosen based on the quality of the data and the read length. After mapping, the following steps are carried out chromosome by chromosome.

Seeding selection

We use the mapped hits of a half-read to narrow the search regions of the junction. These hits are extended base by base in the following step. Thus, we call the half-read mapped hit a ‘seeding’. The mapped hits from the above step are examined for seeding selection. Although uniquely mapped hits are more reliable as seedings for the junction search, one should not simply exclude all multiply mapped reads (i.e. reads mapped to more than one location), because doing so would greatly diminish the chance of detecting junctions with homologous sequences elsewhere, such as those in paralogous genes or pseudogenes.
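The read handling described above (splitting long reads into overlapping 50-nt windows, then cutting each window into two halves for mapping) can be illustrated with a minimal Python sketch. The function names, the 0-based offsets and the treatment of read lengths that do not fit the 25-nt step are assumptions made for illustration only; they are not taken from the SpliceMap source code.

```python
def split_long_read(read, window=50, step=25):
    """Cut a long read into overlapping fixed-length windows.

    Returns (offset, subsequence) pairs with 0-based offsets, so a
    100-nt read yields offsets 0, 25 and 50 (positions 1-50, 26-75
    and 51-100 in 1-based terms).
    """
    if len(read) <= window:
        return [(0, read)]
    starts = list(range(0, len(read) - window + 1, step))
    # Assumption: if the read length does not fit the step exactly, add a
    # final window anchored at the read end so that no base is left out.
    if starts[-1] != len(read) - window:
        starts.append(len(read) - window)
    return [(s, read[s:s + window]) for s in starts]


def half_reads(read):
    """Return the two halves of a read that are mapped separately."""
    mid = len(read) // 2
    return read[:mid], read[mid:]


def intact_half(read_len, junction_offset):
    """Name the half that lies entirely within one exon.

    junction_offset is the number of read bases upstream of the splice
    point. Whatever its value, the longer side of the junction is at
    least half the read, so one half always maps without a gap.
    """
    mid = read_len // 2
    return "second" if junction_offset <= mid else "first"


windows = split_long_read("ACGT" * 25)        # a 100-nt toy read
offsets = [offset for offset, _ in windows]   # [0, 25, 50]
```

For a 50-nt read with 30 bases upstream of the splice point, `intact_half(50, 30)` reports the first half, which is the half that would serve as the seeding in the steps that follow.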
Instead of rejecting all multiply mapped hits, SpliceMap excludes only those hits that are within 400,000 nt of another hit from the same half-read. This is because, if two identical regions lie within 400,000 nt of each other, false splice predictions tend to form between these two regions, both of which match the reads perfectly.

Junction search

For each seeding identified, the alignment on the reference genome is then extended base by base to find the splicing point (Figure 1). SpliceMap subsequently tries to find the partner splicing point that provides a perfect match of the corresponding residual sequence of the original read, within a user-specified distance (set to 400,000 nt in our examples). When the full reads are 50 nt in size, candidate splicing points must meet two criteria: first, the alignment extension cannot be longer than 40 nt and the residual length has to be at least 10 nt; and second, the splicing point must be next to the canonical dinucleotide splicing signals GT and AG for donor and acceptor sites, respectively, because they appear in 98% of known splice sites (18). The mapping of the residual sequence is achieved by searching for a 10-nt seeding in a pre-computed chromosome-wide hash table and then extending it to complete the full alignment. In order to reduce false positive junctions, the results are discarded if the search yields multiple matches of the residual sequence satisfying the above criteria.

Paired-end filtering

When paired-end reads are available, the pairing information is used in this step to improve the specificity of junction detection. First, in the previous steps, three types of hits are identified as ‘good hits’, namely exonic hits, extension hits and junction hits. An exonic hit occurs if the two halves of a full read are mapped to locations that differ by exactly half of the read length.
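Returning to the seeding selection and junction search steps described above, the following Python sketch shows one way the stated criteria could fit together. It is a simplified illustration under several assumptions: only forward-strand reads whose first half is the seeding are handled, matching is exact, the chromosome is held as an in-memory string, and all function names are hypothetical rather than taken from SpliceMap itself.

```python
from collections import defaultdict

K = 10  # length of the residual seeding looked up in the hash table


def filter_seeds(hits, min_separation=400_000):
    """Drop hits lying within min_separation of another hit of the same half-read.

    hits is a list of (chromosome, position) pairs for one half-read.
    """
    kept = []
    for i, (chrom, pos) in enumerate(hits):
        crowded = any(
            j != i and c == chrom and abs(p - pos) <= min_separation
            for j, (c, p) in enumerate(hits)
        )
        if not crowded:
            kept.append((chrom, pos))
    return kept


def build_kmer_index(genome):
    """Chromosome-wide hash table: K-mer -> list of 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - K + 1):
        index[genome[i:i + K]].append(i)
    return index


def find_junction(read, genome, index, seed_pos,
                  max_intron=400_000, min_residual=K):
    """Return (donor, acceptor) or None.

    donor is the first intronic base; acceptor is the first base of the
    downstream exon (both 0-based). seed_pos is where the first half maps.
    """
    half = len(read) // 2
    ext = half
    # Extend the seeding base by base while read and genome agree.
    while (ext < len(read) and seed_pos + ext < len(genome)
           and genome[seed_pos + ext] == read[ext]):
        ext += 1
    if ext == len(read):
        return None  # assumption: a fully aligned read is an exonic hit
    matches = set()
    # Candidate splice points: the residual must keep at least min_residual nt.
    for e in range(half, min(ext, len(read) - min_residual) + 1):
        donor = seed_pos + e
        if genome[donor:donor + 2] != "GT":
            continue
        residual = read[e:]
        for start in index.get(residual[:K], ()):
            if not 4 <= start - donor <= max_intron:
                continue
            if genome[start - 2:start] == "AG" and genome.startswith(residual, start):
                matches.add((donor, start))
    # Ambiguous residual placements are discarded to limit false positives.
    return matches.pop() if len(matches) == 1 else None


# Toy chromosome: 30-nt exon, 24-nt GT...AG intron, 25-nt exon.
exon1 = "ACGATCGATT" + "ACGCATTACC" + "GATACCATAC"
intron = "GT" + "C" * 20 + "AG"
exon2 = "TTGACCTAGA" + "CATTGCAACT" + "GGACC"
genome = exon1 + intron + exon2
read = exon1 + exon2[:20]  # 50-nt read with 30 nt upstream of the junction
junction = find_junction(read, genome, build_kmer_index(genome), seed_pos=0)
```

In this toy case the search returns the single junction between the two exons. If the residual sequence also matched a second AG-flanked location within the search distance, the function would return `None`, mirroring the rule that ambiguous results are discarded.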
On the other hand, if a half-read hit can be extended maximally to an alignment length that is suitably long but still shorter than the full read length, then it is regarded as an extension hit. Finally, junction hits are identified as above. To qualify as reliable hits, the hits generated from the two reads of a read pair must satisfy the following conditions: (i) both hits are ‘good hits’; (ii) their distance is not longer than 400,000 nt; (iii) the mapping direction and the
