Macronuclear Genome Sequence of the Ciliate Tetrahymena thermophila, a Model Eukaryote

Jonathan A. Eisen1¤a*, Robert S. Coyne1, Martin Wu1, Dongying Wu1, Mathangi Thiagarajan1, Jennifer R. Wortman1, Jonathan H. Badger1, Qinghu Ren1, Paolo Amedeo1, Kristie M. Jones1, Luke J. Tallon1, Arthur L. Delcher1¤b, Steven L. Salzberg1¤b, Joana C. Silva1, Brian J. Haas1, William H. Majoros1¤c, Maryam Farzad1¤d, Jane M. Carlton1¤e, Roger K. Smith Jr.1¤f, Jyoti Garg2, Ronald E. Pearlman2,3, Kathleen M. Karrer4, Lei Sun4, Gerard Manning5, Nels C. Elde6¤g, Aaron P. Turkewitz6, David J. Asai7, David E. Wilkes7, Yufeng Wang8, Hong Cai9, Kathleen Collins10, B. Andrew Stewart10, Suzanne R. Lee10, Katarzyna Wilamowska11, Zasha Weinberg11¤h, Walter L. Ruzzo11, Dorota Wloga12, Jacek Gaertig12, Joseph Frankel13, Che-Chia Tsao14, Martin A. Gorovsky14, Patrick J. Keeling15, Ross F. Waller15¤j, Nicola J. Patron15¤j, J. Michael Cherry16, Nicholas A. Stover16, Cynthia J. Krieger16, Christina del Toro17¤k, Hilary F. Ryder17¤l, Sondra C. Williamson17, Rebecca A. Barbeau17¤m, Eileen P. Hamilton17, Eduardo Orias17

1 The Institute for Genomic Research, Rockville, Maryland, United States of America
2 Department of Biology, York University, Toronto, Ontario, Canada
3 Centre for Research in Mass Spectrometry, York University, Toronto, Ontario, Canada
4 Department of Biological Sciences, Marquette University, Milwaukee, Wisconsin, United States of America
5 Razavi-Newman Center for Bioinformatics, The Salk Institute for Biological Studies, San Diego, California, United States of America
6 Department of Molecular Genetics and Cell Biology, University of Chicago, Chicago, Illinois, United States of America
7 Department of Biology, Harvey Mudd College, Claremont, California, United States of America
8 Department of Biology, University of Texas at San Antonio, San Antonio, Texas, United States of America
9 Department of Electrical Engineering, University of Texas at San Antonio, San Antonio, Texas, United States of America
10 Department of Molecular and Cellular Biology, University of California Berkeley, Berkeley, California, United States of America
11 Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America
12 Department of Cellular Biology, University of Georgia, Athens, Georgia, United States of America
13 Department of Biological Sciences, University of Iowa, Iowa City, Iowa, United States of America
14 Department of Biology, University of Rochester, Rochester, New York, United States of America
15 Canadian Institute for Advanced Research, Department of Botany, University of British Columbia, Vancouver, British Columbia, Canada
16 Department of Genetics, Stanford University, Stanford, California, United States of America
17 Department of Molecular, Cellular, and Developmental Biology, University of California Santa Barbara, Santa Barbara, California, United States of America

The ciliate Tetrahymena thermophila is a model organism for molecular and cellular biology.
Like other ciliates, this species has separate germline and soma functions that are embodied by distinct nuclei within a single cell. The germline-like micronucleus (MIC) has its genome held in reserve for sexual reproduction. The soma-like macronucleus (MAC), which possesses a genome processed from that of the MIC, is the center of gene expression and does not directly contribute DNA to sexual progeny. We report here the shotgun sequencing, assembly, and analysis of the MAC genome of T. thermophila, which is approximately 104 Mb in length and composed of approximately 225 chromosomes. Overall, the gene set is robust, with more than 27,000 predicted protein-coding genes, 15,000 of which have strong matches to genes in other organisms. The functional diversity encoded by these genes is substantial and reflects the complexity of processes required for a free-living, predatory, single-celled organism. This is highlighted by the abundance of lineage-specific duplications of genes with predicted roles in sensing and responding to environmental conditions (e.g., kinases), using diverse resources (e.g., proteases and transporters), and generating structural complexity (e.g., kinesins and dyneins). In contrast to the other lineages of alveolates (apicomplexans and dinoflagellates), no compelling evidence could be found for plastid-derived genes in the genome. UGA, the only T. thermophila stop codon, is used in some genes to encode selenocysteine, thus making this organism the first known with the potential to translate all 64 codons in nuclear genes into amino acids. We present genomic evidence supporting the hypothesis that the excision of DNA from the MIC to generate the MAC specifically targets foreign DNA as a form of genome self-defense. The combination of the genome sequence, the functional diversity encoded therein, and the presence of some pathways missing from other model organisms makes T. 
thermophila an ideal model for functional genomic studies to address biological, biomedical, and biotechnological questions of fundamental importance.

Citation: Eisen JA, Coyne RS, Wu M, Wu D, Thiagarajan M, et al. (2006) Macronuclear genome sequence of the ciliate Tetrahymena thermophila, a model eukaryote. PLoS Biol 4(9): e286. DOI: 10.1371/journal.pbio.0040286

PLoS Biology | www.plosbiology.org September 2006 | Volume 4 | Issue 9 | e286

Introduction

Tetrahymena thermophila is a single-celled model organism for unicellular eukaryotic biology. Studies of T. thermophila (referred to as T. pyriformis variety 1 or syngen 1 prior to 1976) have contributed to fundamental biological discoveries such as catalytic RNA, telomeric repeats [4,5], telomerase, and the function of histone acetylation. T. thermophila is advantageous as a model eukaryotic system because it grows rapidly to high density in a variety of media and conditions, its life cycle allows the use of conventional tools of genetic analysis, and molecular genetic tools for sequence-enabled experimental analysis of gene function have been developed [8,9]. In addition, although it is unicellular, it possesses many core processes conserved across a wide diversity of eukaryotes (including humans) that are not found in other single-celled model systems (e.g., the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe).

T. thermophila is a member of the phylum Ciliophora, which also includes the genera Paramecium, Oxytricha, and Ichthyophthirius. A cartoon showing the phylogenetic position of T. thermophila relative to other eukaryotes for which the genomes have been sequenced is shown in Figure 1. The ciliates are one of three major evolutionary lineages that make up the alveolates. The other two lineages are dinoflagellates and the exclusively parasitic apicomplexa, which includes the Plasmodium species that cause malaria. Although experimental tools are improving for the apicomplexa [10–12], they can still be challenging to work with, and in some situations T. thermophila can serve as a useful “distant cousin” model for this group.

As is typical of ciliates, T. thermophila cells exhibit nuclear dimorphism. Each cell has two nuclei, the micronucleus (MIC) and the macronucleus (MAC), containing distinct but closely related genomes. The MIC is diploid and contains five pairs of chromosomes. It is the germline, the store of genetic information for the progeny produced by conjugation in the sexual stage of the T. thermophila life cycle. Conjugation involves meiosis, fusion of haploid MIC gametes to produce a new zygotic MIC, and differentiation of new MACs from mitotic copies of the zygotic MIC (for details, see ). After formation of the MAC, cells reproduce asexually until the next sexual conjugation. During this asexual growth, all gene expression occurs in the MAC, which is thus considered the somatic nucleus.

The MAC genome derives from that of the MIC, but the two genomes are quite distinct. During MAC differentiation, several types of developmentally programmed DNA rearrangements occur [16,17] (Figure 2). One such rearrangement is the deletion of segments of the MIC genome known as internally eliminated sequences (IESs). It is estimated that approximately 6,000 IESs are removed, resulting in the MAC genome being an estimated 10% to 20% smaller than that of the MIC.
A key aspect of the process is the preferential removal of repetitive DNA, which results in 90% to 100% of MIC repeats being eliminated [19,20]. Thus the process can be considered analogous to and more extreme than other forms of repeat element silencing phenomena such as repeat-induced point mutation (RIP) in Neurospora and heterochromatin formation [21,22].

A second programmed DNA rearrangement is the site-specific fragmentation at each location of the 15-base pair (bp) chromosome breakage sequence (Cbs) [23–25]. During fragmentation, sections of the MIC genome containing each Cbs, as well as up to 30 bp on either side, are deleted. Telomeres are then added to each new end, generating some 250 to 300 MAC chromosomes [28,29].

Another process that occurs during MAC differentiation is the amplification of the number of copies of the MAC chromosomes. The rDNA chromosome, which encodes the 5.8S, 17S, and 26S rRNAs, is maintained at an average of 9,000 copies per MAC. Six other chromosomes that have been examined are each maintained at an average of 45 copies per MAC. During asexual reproduction, the MAC divides amitotically, with apparently random distribution of chromosome copies that behave as if acentromeric. In contrast, MIC chromosomes are metacentric and are distributed mitotically [33,34]. Parental MAC DNA is not transmitted to sexual progeny, although it does have an epigenetic influence on postzygotic MAC genome rearrangement, mediated by RNA interference.

The Tetrahymena research community has coordinated an effort to develop genomic tools for T. thermophila [9,36]. The MAC genome was selected for initial sequencing because it contains all the expressed genes and because the complexity of the assembly process was expected to be reduced due to the lower amounts of repetitive DNA.
These advantages, however, are countered by some complexities not seen in other eukaryotic genome projects, including the presence of several hundred medium-sized to small chromosomes, the possibility of unequal copy number of at least some chromosomes, the existence of polymorphisms that are generated during MAC development, and the inability to completely separate the MIC from the MAC prior to DNA isolation.

We report here on the shotgun sequencing, assembly, and analysis of the MAC genome of T. thermophila strain SB210, an inbred strain B derivative that has been extensively used for genetic mapping and for the isolation of mutants. We discuss how the complexities of sequencing the MAC were successfully addressed, as well as the biological and evolutionary implications of our analysis of the genome sequence.

Academic Editor: Mikhail Gelfand, Institute for Information Transmission Problems, Russian Federation

Received January 4, 2006; Accepted June 23, 2006; Published August 29, 2006

DOI: 10.1371/journal.pbio.0040286

Copyright: © 2006 Eisen et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: bp, base pairs; Cbs, chromosome breakage sequence; CM, covariance model; EST, expressed sequence tag; IES, internal eliminated sequence; ITR, inverted terminal repeat; MAC, macronucleus/macronuclear; MIC, micronucleus/micronuclear; ncRNA, noncoding RNA; RIP, repeat induced point mutation; SCI, single-cell isolation; Sec, selenocysteine; TE, transposable element; TGD, Tetrahymena Genome Database; TIGR, The Institute for Genomic Research; VIC, voltage-gated ion channel

* To whom correspondence should be addressed.
E-mail: firstname.lastname@example.org

¤a Current address: University of California Davis Genome Center, Section of Evolution and Ecology, School of Biological Sciences and Department of Medical Microbiology and Immunology, School of Medicine, University of California Davis, Davis, California, United States of America
¤b Current address: Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
¤c Current address: Duke Institute for Genome Sciences and Policy, Duke University, Durham, North Carolina, United States of America
¤d Current address: Agilent Technologies, Inc., Santa Clara, California, United States of America
¤e Current address: Department of Medical Parasitology, New York University School of Medicine, New York, New York, United States of America
¤f Current address: Dupont Agriculture and Nutrition, Wilmington, Delaware, United States of America
¤g Current address: Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
¤h Current address: Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut, United States of America
¤j Current address: School of Botany, The University of Melbourne, Melbourne, Australia
¤k Current address: Meharry Medical College, Nashville, Tennessee, United States of America
¤l Current address: Dartmouth-Hitchcock Medical Center, Lebanon, New Hampshire, United States of America
¤m Current address: Lung Biology Center, University of California San Francisco, San Francisco, California, United States of America
Results/Discussion

Genome Assembly and General Chromosome Structure

Sequencing and assembly. Using physical isolation methods, MACs were purified from a culture of T. thermophila strain SB210 and used to create multiple differentially sized shotgun sequencing libraries (Table S1). Construction of large (greater than 10 kb) insert libraries was not successful, a common problem in working with AT-rich genomes. Approximately 1.2 million paired end sequences were generated from the libraries and assembled using the Celera Assembler. In an initial assembly, the mitochondrial genome (mtDNA, which was present due to some contamination of the MAC preparation with mitochondria) and the highly amplified rDNA chromosome did not assemble well compared to the published sequences of these molecules [38,39]. This was probably because contigs from these molecules had higher depths of coverage than those from other chromosomes, which caused the Celera Assembler to treat them as repetitive DNA. Thus we divided sequence reads into three bins (mtDNA, rDNA, and bulk MAC DNA) and generated assemblies for each bin separately. This resulted in a moderate improvement, and the three separate assemblies

Figure 1. Unrooted Consensus Phylogeny of Major Eukaryotic Lineages
Representative genera are shown for which whole genome sequence data are either in progress (marked with asterisks *) or available. The ciliates, dinoflagellates, and apicomplexans constitute the alveolates (lighter yellow box). Branch lengths do not correspond to phylogenetic distances. Adapted from the more detailed consensus in .
DOI: 10.1371/journal.pbio.0040286.g001

Figure 2. Relationship between MIC and MAC Chromosomes
The top horizontal bar shows a small portion of one of the five pairs of MIC chromosomes. MAC-destined sequences are shown in alternating shades of gray. MIC-specific IESs (internally eliminated sequences) are shown as blue rectangles, and sites of the 15-bp Cbs are shown as red bars (not to scale). Below the top bar are shown macronuclear chromosomes derived from the above region of the MIC by deletion of IESs, site-specific cleavage at Cbs sites, and amplification. Telomeres are added to the newly generated ends (green bars). Most of the MAC chromosomes are amplified to approximately 45 copies (only three shown). Through the process of phenotypic assortment, initially heterozygous loci generally become homozygous in each lineage within approximately 100 vegetative fissions. Polymorphisms located on the same MAC chromosome tend to co-assort.
DOI: 10.1371/journal.pbio.0040286.g002
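
The phenotypic assortment mentioned in the Figure 2 legend can be illustrated with a toy simulation. The sketch below is not taken from the paper. It assumes a simple model in which every copy of a MAC chromosome is duplicated once per fission and one daughter MAC then receives a random half of the duplicates, and it tracks a single locus with two alleles. The function name `fissions_to_fixation` and all of its parameters are hypothetical and chosen only for illustration.

```python
import random


def fissions_to_fixation(copies=45, start_a=None, rng=None, max_fissions=100_000):
    """Follow one MAC lineage until a single allele remains at a locus.

    Assumed toy model (not from the paper): each of the `copies` chromosome
    copies is duplicated, and the daughter MAC receives `copies` of the
    2 * copies duplicates, drawn at random without replacement.

    Returns (number of fissions simulated, final count of allele A).
    """
    rng = rng or random.Random()
    # Number of copies carrying allele A; default is a roughly even mix.
    a = copies // 2 if start_a is None else start_a
    fissions = 0
    while 0 < a < copies and fissions < max_fissions:
        # Pool after replication: True marks a copy carrying allele A.
        pool = [True] * (2 * a) + [False] * (2 * (copies - a))
        # Random partition: the followed daughter keeps `copies` of them.
        a = sum(rng.sample(pool, copies))
        fissions += 1
    return fissions, a


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    runs = [fissions_to_fixation(rng=rng)[0] for _ in range(200)]
    print("mean fissions until one allele is lost:", sum(runs) / len(runs))
```

Under these assumptions the allele count drifts at random until it reaches 0 or 45, which is the behavior the legend describes as heterozygous loci becoming homozygous within a lineage. The model ignores selection, copy number control, and linkage between loci, so the number of fissions it reports is indicative only.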