
The Chado Natural Diversity module: a new generic database schema for large-scale phenotyping and genotyping data.

by Sook Jung, Naama Menda, Seth Redmond, Robert M Buels, Maren Friesen, Yuri Bendana, Lacey-Anne Sanderson, Hilmar Lapp, Taein Lee, Bob MacCallum, Kirstin E Bett, Scott Cain, Dave Clements, Lukas A Mueller, Dorrie Main
Database: The Journal of Biological Databases and Curation (2011)

Abstract

Linking phenotypic with genotypic diversity has become a major requirement for basic and applied genome-centric biological research. To meet this need, a comprehensive database backend for efficiently storing, querying and analyzing large experimental data sets is necessary. Chado, a generic, modular, community-based database schema is widely used in the biological community to store information associated with genome sequence data. To meet the need to also accommodate large-scale phenotyping and genotyping projects, a new Chado module called Natural Diversity has been developed. The module strictly adheres to the Chado remit of being generic and ontology driven. The flexibility of the new module is demonstrated in its capacity to store any type of experiment that either uses or generates specimens or stock organisms. Experiments may be grouped or structured hierarchically, whereas any kind of biological entity can be stored as the observed unit, from a specimen to be used in genotyping or phenotyping experiments, to a group of species collected in the field that will undergo further lab analysis. We describe details of the Natural Diversity module, including the design approach, the relational schema and use cases implemented in several databases.


Database tool

Sook Jung1,*,†, Naama Menda2,*,†, Seth Redmond3,‡, Robert M. Buels2, Maren Friesen4, Yuri Bendana4, Lacey-Anne Sanderson5, Hilmar Lapp6, Taein Lee1, Bob MacCallum3, Kirstin E. Bett5, Scott Cain7, Dave Clements6,§, Lukas A. Mueller2 and Dorrie Main1

1Department of Horticulture and Landscape, Washington State University, Pullman, WA 99164, 2Boyce Thompson Institute for Plant Research, Ithaca, NY 14853, USA, 3Imperial College London, London SW7 2AZ, UK, 4University of Southern California, Los Angeles, CA 90089, USA, 5Department of Plant Sciences, University of Saskatchewan, Saskatoon, SK, S7N 5A8, Canada, 6National Evolutionary Synthesis Center (NESCent), Durham, NC, USA and 7Ontario Institute for Cancer Research, Toronto, Ontario, M5G 0A3, Canada

*Corresponding author: Tel: 509-335-7093; Fax: 509-335-8660; Email: sook_jung@wsu.edu
Correspondence may also be addressed to Naama Menda. Tel: 607-254-3569; Fax: 607-254-1242; Email: naama.menda@cornell.edu
†These authors contributed equally to this work.
‡Present address: Seth Redmond, Pasteur Institute, 28 Rue Du Docteur Roux, Paris, 75015, France.
§Present address: Dave Clements, Department of Biology, Emory University, Atlanta, GA 30322, USA.

Submitted 17 July 2011; Revised 21 October 2011; Accepted 23 October 2011

© The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Database, Vol. 2011, Article ID bar051, doi:10.1093/database/bar051

Introduction

In the last 20 years, high-throughput technology developments have revolutionized biology and transformed it into an information-based science. The first spur in data generation occurred in the early 1990s when large-scale sequencing became available through Sanger technology for relatively small-scale projects such as EST and BAC sequencing (1,2). In more recent years, the number of nucleic acids and protein sequences freely available at data repositories, such as GenBank, has grown exponentially as higher throughput and relatively inexpensive sequencing and genotyping platforms have enabled widespread generation of genomic data. Although Model-Organism Databases (MODs) were designed for storing genomic data and their derived annotations, they all face similar challenges in answering the needs of the developer and user community, including how to store the data efficiently, and how to adapt to newly emerging data types and represent them in a meaningful way so that biological questions can be answered.

At the core of the Generic Model Organism Database (GMOD) is a generic schema, named Chado, which was initially designed for storing Drosophila data at FlyBase, with the vision of creating a reusable and generic open source schema (3). Chado is ontology-driven and modular, and thus highly flexible. Chado's design principles enable the same schema to be used in projects with widely different metadata. Metadata can be modified or added as new data types become available. Its modular design allows developers to select those parts needed to manage their data, and to add new modules when advances in biology require new data types. Currently, the Chado schema consists of 18 modules, covering sequence, phenotype, genotype, ontologies, publications and phylogenies (http://gmod.org/wiki/Chado), with 23 genomic databases reporting that they use some or all of the Chado modules (http://gmod.org/wiki/GMOD_Users). As a GMOD component, Chado is open source, and therefore any user can contribute to the schema and the underlying code (http://sourceforge.net/projects/gmod/), provided those contributions are consistent with the Chado generic design principle.

The initial development of Chado focused on genome sequence data, but as more complex data was generated, such as microarray and expression data, new modules were added to accommodate new data types. Chado has also proven useful for handling multiple closely related organisms. Clade Oriented Databases (CODs) using Chado include the Sol Genomics Network [SGN, http://solgenomics.net/ (4)], Genome Database for Rosaceae [GDR, www.rosaceae.org (5)], Citrus Genome Database (www.citrusgenomedb.org), Cool Season Food Legume Genome Database (www.gabcsfl.org), KnowPulse (http://knowpulse2.usask.ca/portal/) and the Genome Database for Vaccinium (www.vaccinium.org).

Unlike the exponential increase in sequencing data, phenotypic data has been growing at a much slower pace.
Although count, structure and functional annotation for genes can be derived in silico using sequence similarity and other methods, analysis tools for correlating phenotype with genotype fall far behind those for sequence analysis (6,7). A problem in phenotyping is the lack of genetic diversity in cultivated plants, which underwent heavy selection during domestication (8–11), causing a decrease in genotypic variation. As a result, cultivated plants may have as little as 5% of the natural diversity found in their wild counterparts. With such low allele pools, there is also a dramatic decrease in phenotypic variation.

Another problem is related to difficulties in high-throughput phenotyping. For genotyping, one can choose from numerous available technologies according to desired quality and funds available. The end product is always a molecular sequence, and hence a uniform data type for which there are well understood and standard ways for processing and storage. In contrast, the data collection process for phenotyping is slow, expensive and dependent on the subjective judgment of the person collecting the data, generally with no set standards for terminology or descriptors to capture phenotype observations. Moreover, many phenotypes are subtle or even undetectable to the naked eye. Traits controlled by multiple quantitative trait loci may have dozens of underlying genes, each having a small additive effect (12). In addition, traits may be sensitive to environmental effects and may exhibit interactions with the genotype. Because of these and other challenges, phenotype data are notoriously difficult to record.

Although new technology, such as automated computerized facilities for growing plant germplasm (13) and software for taking multiple computerized measurements of a specimen (14), may facilitate high-throughput phenotyping, the challenge of capturing phenotypic diversity data remains complex, expensive and error-prone.
Despite these difficulties, breeding programs generate large volumes of phenotypic data, which poses a challenge for databases to efficiently store, query and analyze these data. In addition, such programs require genetic information to be integrated with phenotype data for progeny selection and crossing designs. Large-scale genotyping, mostly based on next-generation sequencing-based SNP markers, is now routinely performed on the same group of individual plants or animals for which phenotype assays were done. Large-scale phenotyping and genotyping experiments are currently practiced in various projects (15–18) as well as in applied breeding experiments. Shared among these projects is the challenge of determining how to best manage these data.

In maize, three monocot databases, Panzea (19), Gramene (20) and GrainGenes (21), jointly developed the Genomic Diversity and Phenotype Data Model (GDPDM) to capture molecular and phenotypic diversity data. The core schema of GDPDM consists of tables for germplasm, phenotype, genotype and environment that capture associations between phenotypes and genotypes. Although GDPDM performs well for its creators, the genericity of the schema is limited, and its design deviates enough from Chado's principles to make it difficult to adapt to other modules in Chado.

As model organism and clade-oriented databases, which were already using Chado for storing and managing their data, increasingly faced the need for storing large-scale diversity data efficiently, several of them collaborated on developing a schema module for Chado capable of recording a wide variety of phenotyping and genotyping experiments in a way that maintains links to stocks and germplasms.

An initial version of a natural diversity module for Chado was developed at the National Evolutionary Synthesis Center (NESCent) in collaboration with W. Owen McMillan, who was a Center fellow at the time.
McMillan is part of a community of researchers who use neotropical butterflies of the genus Heliconius as an emerging model system to study evolutionary genomics of Müllerian
