Whole-genome sequence assembly fo...
Whole-Genome Sequence Assembly for Mammalian Genomes: Arachne 2 David B. Jaffe,1,2 Jonathan Butler,1 Sante Gnerre,1 Evan Mauceli,1 Kerstin Lindblad-Toh,1 Jill P. Mesirov,1 Michael C. Zody,1 and Eric S. Lander1,3 1Whitehead Institute/MIT Center for Genome Research, Cambridge, Massachusetts 02141, USA 3Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA We previously described the whole-genome assembly program Arachne, presenting assemblies of simulated data for small to mid-sized genomes. Here we describe algorithmic adaptations to the program, allowing for assembly of mammalian-size genomes, and also improving the assembly of smaller genomes. Three principal changes were simultaneously made and applied to the assembly of the mouse genome, during a six-month period of development: (1) Supercontigs (scaffolds) were iteratively broken and rejoined using several criteria, yielding a 64-fold increase in length (N50), and apparent elimination of all global misjoins (2) gaps between contigs in supercontigs were filled (partially or completely) by insertion of reads, as suggested by pairing within the supercontig, increasing the N50 contig length by 50% (3) memory usage was reduced fourfold. The outcome of this mouse assembly and its analysis are described in (Mouse Genome Sequencing Consortium 2002). In the whole-genome shotgun method of genome sequenc- ing, as presently practiced, the entire genome is sheared to specified approximate sizes (generally chosen from the range 2���200 kb), yielding random fragments (called inserts), whose ends can then be determined, yielding sequence reads of size about 600 bp, given in pairs whose genomic separation is approximately known from the insert size (Sanger et al. 1977 Edwards et al. 1990). In principle, the genome can then be reconstructed in silico from these reads, using their overlap and pairing as glue, provided that the inserts gave sufficiently even representation of the genome, that an appropriate mix of insert lengths (short to long)was used, and that the DNA itself was sufficiently free of polymorphism (Sanger et al. 1982 Fleischmann et al. 1995 Myers et al. 2000 Aparicio et al. 2002 Yu et al. 2002). This assembly process produces sequence units (called supercontigs or scaffolds), which under the best of circum- stances can approximate chromosomes, but which in general are smaller, less contiguous, and have errors. Within a super- contig, contiguous segments (or contigs)are punctuated by gaps (captured by paired reads)whose sizes are approximately known (as a consequence of knowing the insert lengths). The resulting draft sequence may be an end goal or may be the starting point for production of finished sequence, through additional clone-based sequencing. For whole-genome shotgun reads from large repetitive genomes such as those of mammals, assembly is a computa- tional challenge (because a priori any read may overlap any other read), and an algorithmic challenge, because repetitive structures and defects in data make it easy to falsely join dis- tant parts of the genome. Still, assemblies consisting of either a mixture of whole-genome shotgun and clone-based reads, or only whole-genome shotgun reads have been produced (human: Venter et al. 2001 mouse chromosome 16: Mural et al. 2002). Presented with the publicly available mouse whole- genome data set, we and another group (Mullikin and Ning 2003)set out to produce high-quality draft sequence for the mouse genome. Our starting point was the work presented in (Batzoglou et al. 2002), wherein the Arachne software was used to assemble simulated reads from small to mid-sized ge- nomes. In this paper we describe algorithmic enhancements to the program, made while assembling the mouse data set and yielding the mouse assembly described in (Mouse Ge- nome Sequencing Consortium 2002). These enhancements fall under three general headings: global assembly improvement, local assembly improvement, and computational performance improvement. These changes were made over a six-month period during which successively larger collections of mouse reads were made available to us and which we assembled, while modifying our algorithms. To expose the motivation for the solutions, we proceed where possible by explaining our successive attempts and their outcomes. Global Assembly Improvement We begin our analysis at the point where the mouse assembly had been run up to the end of the released Arachne code (Batzoglou et al. 2002). The N50 supercontig size was 265 kb (half the total bases in the supercontigs lay in supercontigs of size 265 kb or larger). By improving our algorithms, we hoped to raise this value to 1 Mb. To that end, we set out to design code to look off the ends of each supercontig, in the hopes of finding genomically adjacent supercontigs with which it could be merged. Manual examination of the data revealed many cases where it was clear in principle how to merge, but many of the cases were complicated by the presence of multiple supercon- tigs that linked off the end of a given supercontig (Fig. 1). To specify precisely how such cases should be handled, we posed the following mathematical problem (J, see Methods): Off a particular end of a given supercontig, consider all supercon- tigs it links to, and which, based on pairing separations, should lie off the end for each ordering of these supercontigs, 2Corresponding author. E-MAIL jaffe@genome.wi.mit.edu FAX (617) 258-9108. Article and publication are at http://www.genome.org/cgi/doi/10.1101/ gr.828403. Methods 13:91���96 ��2003 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/03 $5.00 www.genome.org Genome Research 91 www.genome.org
find the nonoverlapping positioning of them which mini- mizes ���stretching��� of links mark the order as acceptable if the stretching is modest, at most 2.5 standard deviations deter- mine whether among the acceptable orders, there is only one supercontig, nearest the given one if so, merge the given supercontig with this nearest neighbor. This technique facilitated some mergers, but other po- tential mergers were inhibited by misassemblies. This sug- gested that to optimize supercontigs, we needed to break su- percontigs at suspicious points, rejoin them as described (J), and repeat, iteratively. Multiple breaking schemes were required because there are diverse sources of assembly error. Genomic repeats are the most important source, but there are others, which act in combination with repeats to create ���hazardous��� sections of the genome to assemble. For example, there are regions of the genome whose assembly coverage is low or of low quality, either for random reasons, or because of the nature of the genome itself, as reflected in cloning bias or sequencing dif- ficulty. Misassemblies arise from errors in linking (themselves arising from chimeric clones, from intraplate contamination, from labeling errors, from duplicated sequencing of clones). Misassemblies also arise from chimeric reads (from chimeric clones or bleed-through of contaminating sequence). Finally, misassemblies can arise from undiagnosed algorithmic defects in the assembly process. Given this complex background, we did not expect a single supercontig breaking approach to suffice, and so we developed several methods, not all of which are correlated with particular sources of assembly error. We describe some of these methods now, and another later. Negative Breaking These methods detect regions of the assembly which are weakly held together. A simple case (which we call method B1)is where we break supercontigs between contigs if the template coverage across the gap is only one. Evidently this situation arises now because earlier in the assembly process, we merged supercontigs connected by only one link. Never- theless, in aggregate, it was not a mistake to do this initially, as some were reinforced subsequently by longer links. As a generalization, in method B2, we break supercontigs at gaps whose covering templates fall nearly on top of each other. This subsumes the case where the same clone was end- sequenced more than once, giving rise to the appearance of multiple links. Next, in method B3, we break supercontigs at places within a contig where only sequence holds together the supercontig (no links hold it together see Methods). In the most glaring cases, we found that the supercontig was held together by only a thread: a single poor, short join be- tween two reads. However, the same criterion was designed to identify subtler mistakes, in which templates might falsely overlap by about a read length. Further, after assembling mouse, we devised criteria to identify disguised instances of the same phenomenon, exemplified in Figure 2. Breaking Off Ends The ends of supercontigs are enriched for misassembly. In method B4, we removed 10 kb of sequence from the ends of long supercontigs, breaking mid-contig. No specific evidence of error was required we expect that most of these supercon- tig ends were in fact correct. In combination, these methods proved effective. After applying them (iteratively breaking and rejoining), the N50 supercontig size rose to 11 Mb, surpassing our initial expec- tation of 1 Mb, but possibly aided by false global joins. To assess this possibility, we aligned the mouse genetic markers (Dietrich et al. 1996)to the assembly. Because genetic markers are the most reliable source of long-range ordering, and are not used by Arachne, this was an independent test for assem- bly integrity. Yet from the alignments of the genetic markers to the assembly, we could see instances where supercontigs crossed chromosomes, proof of global misassembly. Although the genetic map could be used to evaluate the assembly, substantial spacing between markers prohibited its use in editing the assembly. Therefore we devised a supercon- tig breaking criterion which might remedy the defects uncov- ered by the genetic map, without using the map itself. Positive Breaking We found examples where there were multiple consistent links from the middle of one supercontig to another. These supercontigs were adequately held together [as evidenced by inapplicability of (B1���4)], but strong evidence supported an alternative structure. A specific criterion was devised to ad- dress this. In method B5 (described further in Methods), we found correlated clusters of links between pairs of supercon- tigs, which could not be explained without breaking them. Then we broke the supercontigs, immediately after the last link, mid-contig. This criterion depended on heuristics: The minimum number of links in a cluster, and especially, the minimum acceptable spread of the cluster, where the spread is the minimum (over the two participating supercontigs)of the total span in bases of the link end reads on a given supercon- Figure 1 Joining of supercontigs. Three supercontigs (a, b, c) are seen off the end of supercontig s. There are two or more read pair links from s to each of them. Each has an optimal position relative to s, determined by the insert lengths corresponding to the read pairs. However, each insert length has a standard deviation associated to it, and so the positions of a, b, and c relative to s also have standard deviations. Supposing that we allow each of them to slide from their optimal positions by up to 2.5 standard deviations, but that we do not allow overlap between any of the supercontigs, is there more than one possible order for the supercontigs? Among the possible orders, does a always appear first (after s)? If so, we join supercontig s to supercontig a. Figure 2 A disguised instance where sequence join alone holds together a supercontig. A long supercontig (blue) from one part of the genome subsumes a small foreign inset (red) from a completely different part of the genome, held together by a single point of at- tachment within a contig (bicolor): in fact only a sequence join ties blue to red. This was not recognized in the version of the code which produced the released mouse assembly (Mouse Genome Sequencing Consortium 2002). Resolution: break at the bicolor juncture, move the red sequence to where it links in another supercontig. Jaffe et al. 92 Genome Research www.genome.org