Review

Generations of sequencing technologies

Erik Pettersson *, Joakim Lundeberg, Afshin Ahmadian

Department of Gene Technology, Royal Institute of Technology (KTH), AlbaNova University Center, Roslagstullsbacken 21, SE-106 91 Stockholm, Sweden

Article info

Article history: Received 2 September 2008; Accepted 2 October 2008; Available online 21 November 2008

Keywords: DNA sequencing; Next generation

Abstract

Advancements in the field of DNA sequencing are changing the scientific horizon and promising an era of personalized medicine for elevated human health. Although platforms are improving at the rate of Moore's Law, thereby reducing the sequencing costs by a factor of two or three each year, we find ourselves at a point in history where individual genomes are starting to appear but where the cost is still too high for routine sequencing of whole genomes. These needs will be met by miniaturized and parallelized platforms that allow a lower sample and template consumption thereby increasing speed and reducing costs. Current massively parallel, state-of-the-art systems are providing significantly improved throughput over Sanger systems and future single-molecule approaches will continue the exponential improvements in the field.

© 2008 Elsevier Inc. All rights reserved.

Contents

Introduction
Present generation of DNA sequencing technologies
Terminating chains
Hybridization to tiling arrays
Parallelized Pyrosequencing
Reverse termination
Ligating degenerated probes
Future generation of DNA sequencing technologies
Single-molecule sequencing
Remarks
References

Introduction

The ability to swiftly and accurately gain knowledge of nucleic acid composition is essential to many of the biological sciences. As the pace of progress is high and we are moving towards an era of synthetic genomics and personalized medicine, the demand for highly efficient sequencing technologies is obvious, where effortless deciphering of genetic sequences will shed light on novel biological functions and phenotypic differences. Metagenomic endeavors [1–3] are providing new tools in the art of genetic engineering, thereby enabling the design of artificial life in the service of humanity [4]. These future synthetic organisms may produce petrol substitutes or provide systems for mopping up excessive carbon dioxide in the atmosphere [5–7]. Perhaps even more captivating is the possibility of resequencing larger and larger fractions of human genomes at an ever decreasing cost, an effort that will elucidate phenotypic variants, extending the comprehension of disease susceptibility and pharmacogenomics, permitting personalized medicine.
Although we have not yet reached the long envisioned $1000 genome [8], novel approaches and refinements of existing methods are reducing the cost per base by the day while increasing the throughput. The establishment of a reference genome in the beginning of this decade [9,10] is now permitting cost-effective resequencing of ever larger fractions of human genomes. The Advanced Sequencing Technology Development Awards initiated by the National Human Genome Research Institute (NHGRI) in 2004 [11] are beginning to show results. Advancements for the next generation sequencing methods include not only current state-of-the-art systems from 454 [12,13], Illumina [14,15] and Applied Biosystems [16] but also single-molecule detection approaches, capable of recognizing incorporation or hybridization events on single molecules. Further into the future lies more direct recognition of unamplified material, i.e. nanopores or nano edges relying on physical recognition of the bases in an unmodified DNA strand, rather than detecting chemical incorporation.

Genomics 93 (2009) 105–111
* Corresponding author. Fax: +46 8 5537 8481. E-mail address: eriq@kth.se (E. Pettersson).
0888-7543/$ – see front matter © 2008 Elsevier Inc. All rights reserved. doi:10.1016/j.ygeno.2008.10.003
Journal homepage: www.elsevier.com/locate/ygeno

The drop in cost has led to the initiation of several sequencing projects aiming at elucidating the variation not covered by SNP arrays. In the Personal Genome Project [17–19], the exon regions of ten genomes are to be sequenced and compared. Researchers at the Beijing Genomics Institute (BGI) [20] are determined to sequence 100 individuals of Han Chinese origin during the upcoming three years in the Yanhuang Project and recently, an international consortium announced the "1000 Genomes Project", where the sequence of 1000 individuals will provide "A catalogue of human genetic variation" [21]. The improvements in sequencing technology and reduction in cost have allowed the first personal genomics company [22] to begin the sequencing of customers' genomes. To allow for a further reduction in cost, the X PRIZE Foundation in Santa Monica, CA, has introduced the Archon X PRIZE for Genomics [23] and will award a sum of $10 million to the first team that can design a system capable of sequencing 100 human genomes in 10 days. Additional requirements are an error rate of no more than one in 100,000 bases, a coverage of at least 98% and a cost of no more than $10,000 for each sequenced genome. Many of the different sequencing categories are represented in the Archon X PRIZE challenge and the research world is closing in on the $1000 genome. The race is on.

Present generation of DNA sequencing technologies

There are many factors to consider in DNA sequencing, such as read length, bases per second and raw accuracy. All the work in the field has led to an exponential reduction in cost per base. Sanger sequencing has been one of the most influential innovations in biological research since it was first presented in 1977. A little more than 20 years later, a bioluminescence sequencing-by-synthesis approach saw the light of day [24].
Today, Pyrosequencing has evolved at 454 Life Sciences, generating about five hundred million bases of raw sequence in just a few hours [12]. This throughput is something that Sanger sequencing, although heavily refined and improved over the years, cannot easily match in its current form. However, during the last year, Illumina and Applied Biosystems have introduced sequencing systems offering even higher throughput than the systems provided by 454, generating billions of bases in a single run. These novel methods all rely on parallel, cyclic interrogation of sequences from spatially separated clonal amplicons. Although read lengths are shorter and sequence extraction from individual features is slower than with the Sanger method, the parallelized process offers a much higher total throughput and reduces cost significantly by generating thousands of bases per second. By shearing the template and sequencing the single fragments in parallel, oversampling may provide improved coverage and the possibility of stitching together the original sequence while increasing total accuracy. Already today these high-throughput methods are expanding our knowledge, also in the related fields of transcriptome and proteome research. Gene expression analysis with whole-transcriptome sequencing is possible and furthermore, in proteome research, by sequencing DNA extracted by antibodies targeting DNA-binding proteins (ChIP-Seq), transcription factor binding sites and chromatin modifications can be investigated [25,26].

Terminating chains

Since 1977, a total nucleic acid polymer of approximately 10^11 bases has been determined with Sanger's chain termination sequencing method [27]. By halting the elongation with a labeled, and thereby identifiable, dideoxyribonucleotide triphosphate (ddNTP), the length of the fragment can be utilized for interrogating the base identity of the terminating base [28].
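
The read-out principle of chain termination can be illustrated with a minimal sketch. It assumes an idealized input that the article does not specify: each terminated fragment is given as a pair of its length (counted from the end of the primer) and the label of its terminating ddNTP. The function name `read_chain_terminated` and the data layout are hypothetical and serve only as illustration.

```python
from typing import Dict, Iterable, Tuple

BASES = {"A", "C", "G", "T"}

def read_chain_terminated(fragments: Iterable[Tuple[int, str]]) -> str:
    """Read a sequence from (fragment length, terminating label) pairs.

    Sorting by length mimics the separation by size; the label of the
    terminating ddNTP at each length gives the base at that position.
    Lengths without any observed fragment are reported as 'N'.
    """
    calls: Dict[int, str] = {}
    for length, label in fragments:
        if label not in BASES:
            raise ValueError(f"unknown ddNTP label: {label!r}")
        # All fragments of the same length must end in the same base.
        if calls.setdefault(length, label) != label:
            raise ValueError(f"conflicting labels at length {length}")
    if not calls:
        return ""
    return "".join(calls.get(n, "N") for n in range(min(calls), max(calls) + 1))

# Hypothetical, unordered fragments from one reaction.
fragments = [(4, "T"), (1, "G"), (7, "A"), (2, "A"), (5, "A"), (3, "T"), (6, "C")]
print(read_chain_terminated(fragments))  # prints GATTACA
```

Real instruments work on fluorescence traces rather than clean pairs, so this sketch only captures the logic of ordering by length and reading the terminal label.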
In its current form, fluorescently labeled ddNTPs [29,30] are mixed with regular, non-labeled, non-terminating nucleotides in a cycle sequencing reaction [31,32], rendering elongation stops at all positions in the template. Capillary electrophoresis can then be applied for separating sequences by length and providing subsequent interrogation of the terminating base (see Fig. 1A). Although the cost was initially high, refinements and automation have improved cost-effectiveness significantly. In 1985, $10 allowed reading one single base, while the same amount of money rendered 10,000 bases 20 years later [8,27]. Current instruments provided by Applied Biosystems deliver read lengths of up to 1000 bases and high raw accuracy, and allow for 384 samples to be sequenced in parallel, generating 24 bases per instrument second. Projects of multiplexing and miniaturization in order to reduce reagent volumes, lower consumable costs and increase throughput are being pursued [33,34].

Hybridization to tiling arrays

The concept of allele-specific hybridization (ASH) has been used for resequencing and genotyping purposes by expanding a probe set, targeting a specific position in the genome, to include interrogation of each of the four possible nucleotides [35]. A tiling array can be fabricated with probe sets targeting each position in the reference genome. Read length is given by the probe length (often 25 bp) and base calling is performed by examining the signal intensities for the different probes of each set. Accuracy is an issue and is dependent on the ability of the assay to discriminate between exact matches and those with a single base difference. Performance may vary significantly due to different base compositions (different thermal annealing properties) of different regions, resulting in problems with false positives as well as with large inaccessible regions composed of repetitive sequence stretches [36,37].
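
Base calling from a probe set can be sketched as follows. This is a simplified illustration, not the algorithm used by any particular array platform: each position is assumed to have one intensity per candidate base, and a call is made only when the strongest signal clearly exceeds the runner-up. The threshold `min_ratio` and the function names are hypothetical.

```python
from typing import Dict, List

BASES = ("A", "C", "G", "T")

def call_base(intensities: Dict[str, float], min_ratio: float = 1.5) -> str:
    """Call the base whose probe gives the strongest signal.

    Returns 'N' when the best signal is not positive or is less than
    min_ratio times the second best, i.e. when match and single-base
    mismatch cannot be told apart (min_ratio is an assumed threshold).
    """
    ranked = sorted(BASES, key=lambda b: intensities[b], reverse=True)
    best, second = intensities[ranked[0]], intensities[ranked[1]]
    if best <= 0:
        return "N"
    if second > 0 and best / second < min_ratio:
        return "N"
    return ranked[0]

def call_sequence(probe_sets: List[Dict[str, float]]) -> str:
    """Call one base per tiled position."""
    return "".join(call_base(p) for p in probe_sets)

# Hypothetical intensities for three consecutive positions.
probe_sets = [
    {"A": 900.0, "C": 100.0, "G": 120.0, "T": 80.0},  # clear A
    {"A": 300.0, "C": 280.0, "G": 50.0, "T": 40.0},   # ambiguous A/C
    {"A": 10.0, "C": 20.0, "G": 700.0, "T": 30.0},    # clear G
]
print(call_sequence(probe_sets))  # prints ANG
```

The ambiguous middle position shows why discrimination between an exact match and a single-base mismatch limits accuracy: regions with unfavorable annealing properties yield signal ratios too close to call.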
The throughput is an obvious benefit, since all bases are interrogated simultaneously, and the concept has been applied to resequencing of human chromosome 21 by Perlegen [37] and of HIV [36]. By representing all possible sequences for a given probe length, de novo sequencing can be performed and overlapping sequences used for sequence assembly [38]. In a recent report, the genomes of Bacteriophage λ and Escherichia coli were resequenced by "shotgun sequencing by hybridization" with an accuracy of 99.93% and a raw throughput of 320 Mbp/day [39].

Parallelized Pyrosequencing

The Genome Sequencer FLX by 454 Life Sciences [13] and Roche depends on an emulsion PCR followed by parallel and individual Pyrosequencing of the clonally amplified beads in a PicoTiterPlate (see Fig. 1B). Emulsion PCR is a clonal amplification performed in an oil-aqueous emulsion. Unlike digestion of a genome with restriction endonucleases, shearing will provide randomly fragmented pieces of more or less similar length. By the addition of general adaptor sequences to the fragments, only one primer pair is required for amplification. In the emulsion PCR, a primer-coated bead, a DNA fragment and other necessary components for PCR (including the second general primer) are isolated in a water micro-reactor, favoring a 1:1 bead to fragment ratio. Once the emulsion is broken, beads not carrying any amplified DNA are removed in an enrichment process [12,40]. The amplified and enriched beads are then distributed on the PicoTiterPlate, where a well (44 μm in diameter) allows fixation of one bead (28 μm in diameter) [12]. However, out of the 1.6 million wells, not all will contain a bead and not all of those that do will give a useful sequence. Following the distribution of the DNA-carrying beads to the PicoTiterPlate, Pyrosequencing will be performed. Pyrosequencing is a sequencing-by-synthesis method where a successful nucleotide incorporation event is detected as emitted photons [41].
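
Why the emulsion PCR step described above favors a 1:1 bead to fragment ratio can be illustrated with a small calculation. The sketch assumes, which the article does not state, that fragments are distributed over bead-containing micro-reactors according to a Poisson distribution with a mean of `lam` fragments per reactor. Under that assumption, reactors are empty (beads later removed by enrichment), hold exactly one fragment (clonal beads), or hold several (mixed signal). All names are hypothetical.

```python
import math
from typing import Dict

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k fragments when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def loading_fractions(lam: float) -> Dict[str, float]:
    """Fractions of micro-reactors that are empty, clonal or mixed."""
    empty = poisson_pmf(0, lam)
    single = poisson_pmf(1, lam)
    return {"empty": empty, "single": single, "multiple": 1.0 - empty - single}

# Compare a few assumed mean loadings.
for lam in (0.5, 1.0, 2.0):
    f = loading_fractions(lam)
    print(f"lam={lam}: empty={f['empty']:.3f} "
          f"single={f['single']:.3f} multiple={f['multiple']:.3f}")
```

In this model the fraction of clonal reactors peaks at a mean of one fragment per reactor, while a lower mean trades more empty beads for fewer mixed ones.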
Since the single-stranded DNA fragments on the beads have been amplified with general tags, a general primer is annealed permitting an elongation