Sign up & Download
Sign in

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome

by Ben Langmead, Cole Trapnell, Mihai Pop, Steven L Salzberg
Genome Biology ()

Abstract

Bowtie: a new ultrafast memory-efficient tool for the alignment of short DNA sequence reads to large genomes.

Cite this document (BETA)

Available from www.pubmedcentral.nih.gov
Page 1
hidden

Ultrafast and memory-efficient al...

Genome Biology 2009, 10:R25 Open Access 2009 Langmead et al.
Software Volume 10, Issue 3, Article R25 Ultrafast and memory-efficient alignment of short DNA sequences to the human genome Ben Langmead, Cole Trapnell, Mihai Pop and Steven L Salzberg Address: Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA. Correspondence: Ben Langmead. Email: langmead@cs.umd.edu �� 2009 Langmead et al. licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Bowtie: short-read alignment pBowtie: a new ultrafast memory-efficient tool for the alignment of short DNA sequence reads to large genomes./p Abstract Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu. Rationale Improvements in the efficiency of DNA sequencing have both broadened the applications for sequencing and dramatically increased the size of sequencing datasets. Technologies from Illumina (San Diego, CA, USA) and Applied Biosystems (Fos- ter City, CA, USA) have been used to profile methylation pat- terns (MeDIP-Seq) [1], to map DNA-protein interactions (ChIP-Seq) [2], and to identify differentially expressed genes (RNA-Seq) [3] in the human genome and other species. The Illumina instrument was recently used to re-sequence three human genomes, one from a cancer patient and two from pre- viously unsequenced ethnic groups [4-6]. Each of these stud- ies required the alignment of large numbers of short DNA sequences ('short reads') onto the human genome. For exam- ple, two of the studies [4,5] used the short read alignment tool Maq [7] to align more than 130 billion bases (about 45�� cov- erage) of short Illumina reads to a human reference genome in order to detect genetic variations. The third human re- sequencing study [6] used the SOAP program [8] to align more than 100 billion bases to the reference genome. In addi- tion to these projects, the 1,000 Genomes project is in the process of using high-throughput sequencing instruments to sequence a total of about six trillion base pairs of human DNA [9]. With existing methods, the computational cost of aligning many short reads to a mammalian genome is very large. For example, extrapolating from the results presented here in Tables 1 and 2, one can see that Maq would require more than 5 central processing unit (CPU)-months and SOAP more than 3 CPU-years to align the 140 billion bases from the study by Ley and coworkers [5]. Although using Maq or SOAP for this purpose has been shown to be feasible by using multiple CPUs, there is a clear need for new tools that consume less time and computational resources. Maq and SOAP take the same basic algorithmic approach as other recent read mapping tools such as RMAP [10], ZOOM [11], and SHRiMP [12]. Each tool builds a hash table of short oligomers present in either the reads (SHRiMP, Maq, RMAP, and ZOOM) or the reference (SOAP). Some employ recent theoretical advances to align reads quickly without sacrificing sensitivity. For example, ZOOM uses 'spaced seeds' to signif- icantly outperform RMAP, which is based on a simpler algo- Published: 4 March 2009 Genome Biology 2009, 10:R25 (doi:10.1186/gb-2009-10-3-r25) Received: 21 October 2008 Revised: 19 December 2008 Accepted: 4 March 2009 The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2009/10/3/R25
Page 2
hidden
http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.2 Genome Biology 2009, 10:R25 rithm developed by Baeza-Yaetes and Perleberg [13]. Spaced seeds have been shown to yield higher sensitivity than contig- uous seeds of the same length [14,15]. SHRiMP employs a combination of spaced seeds and the Smith-Waterman [16] algorithm to align reads with high sensitivity at the expense of speed. Eland is a commercial alignment program available from Illumina that uses a hash-based algorithm to align reads. Bowtie uses a different and novel indexing strategy to create an ultrafast, memory-efficient short read aligner geared toward mammalian re-sequencing. In our experiments using reads from the 1,000 Genomes project, Bowtie aligns 35-base pair (bp) reads at a rate of more than 25 million reads per CPU-hour, which is more than 35 times faster than Maq and 300 times faster than SOAP under the same conditions (see Tables 1 and 2). Bowtie employs a Burrows-Wheeler index based on the full-text minute-space (FM) index, which has a memory footprint of only about 1.3 gigabytes (GB) for the human genome. The small footprint allows Bowtie to run on a typical desktop computer with 2 GB of RAM. The index is small enough to be distributed over the internet and to be stored on disk and re-used. Multiple processor cores can be used simultaneously to achieve even greater alignment speed. We have used Bowtie to align 14.3�� coverage worth of human Illumina reads from the 1,000 Genomes project in about 14 hours on a single desktop computer with four processor cores. Bowtie makes a number of compromises to achieve this speed, but these trade-offs are reasonable within the context of mammalian re-sequencing projects. If one or more exact matches exist for a read, then Bowtie is guaranteed to report one, but if the best match is an inexact one then Bowtie is not guaranteed in all cases to find the highest quality alignment. With its highest performance settings, Bowtie may fail to align a small number of reads with valid alignments, if those reads have multiple mismatches. If the stronger guarantees are desired, Bowtie supports options that increase accuracy at the cost of some performance. For instance, the '--best' option will guarantee that all alignments reported are best in terms Table 1 Bowtie alignment performance versus SOAP and Maq Platform CPU time Wall clock time Reads mapped per hour (millions) Peak virtual memory footprint (megabytes) Bowtie speed-up Reads aligned (%) Bowtie -v 2 Server 15 m 7 s 15 m 41 s 33.8 1,149 - 67.4 SOAP 91 h 57 m 35 s 91 h 47 m 46 s 0.10 13,619 351�� 67.3 Bowtie PC 16 m 41 s 17 m 57 s 29.5 1,353 - 71.9 Maq 17 h 46 m 35 s 17 h 53 m 7 s 0.49 804 59.8�� 74.7 Bowtie Server 17 m 58 s 18 m 26 s 28.8 1,353 - 71.9 Maq 32 h 56 m 53 s 32 h 58 m 39 s 0.27 804 107�� 74.7 The performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6 when aligning 8.84 M reads from the 1,000 Genome project (National Center for Biotechnology Information Short Read Archive: SRR001115) trimmed to 35 base pairs. The 'soap.contig' version of the SOAP binary was used. SOAP could not be run on the PC because SOAP's memory footprint exceeds the PC's physical memory. For the SOAP comparison, Bowtie was invoked with '-v 2' to mimic SOAP's default matching policy (which allows up to two mismatches in the alignment and disregards quality values). For the Maq comparison Bowtie is run with its default policy, which mimics Maq's default policy of allowing up to two mismatches during the first 28 bases and enforcing an overall limit of 70 on the sum of the quality values at all mismatched positions. To make Bowtie's memory footprint more comparable to Maq's, Bowtie is invoked with the '-z' option in all experiments to ensure only the forward or mirror index is resident in memory at one time. CPU, central processing unit. Table 2 Bowtie alignment performance versus Maq with filtered read set Platform CPU time Wall clock time Reads mapped per hour (millions) Peak virtual memory footprint (megabytes) Bowtie speed up Reads aligned (%) Bowtie PC 16 m 39 s 17 m 47 s 29.8 1,353 - 74.9 Maq 11 h 15 m 58 s 11 h 22 m 2 s 0.78 804 38.4�� 78.0 Bowtie Server 18 m 20 s 18 m 46 s 28.3 1,352 - 74.9 Maq 18 h 49 m 7 s 18 h 50 m 16 s 0.47 804 60.2�� 78.0 Performance and sensitivity of Bowtie v0.9.6 and Maq v0.6.6 when the read set is filtered using Maq's 'catfilter' command to eliminate poly-A artifacts. The filter eliminates 438,145 out of 8,839,010 reads. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit.

Readership Statistics

1344 Readers on Mendeley
by Discipline
 
 
 
by Academic Status
 
33% Ph.D. Student
 
15% Post Doc
 
13% Student (Master)
by Country
 
36% United States
 
8% Germany
 
8% United Kingdom

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in