
Community-based gene structure annotation.

by Shannon D Schlueter, Matthew D Wilkerson, Eva Huala, Seung Y Rhee, Volker Brendel
Trends in Plant Science, Vol. 10, No. 1, January 2005

Abstract

Uncertainty and inconsistency of gene structure annotation remain limitations on research in the genome era, frustrating both biologists and bioinformaticians, who have to sort out annotation errors for their genes of interest or to generate trustworthy datasets for algorithmic development. It is unrealistic to hope for better software solutions in the near future that would solve all the problems. The issue is all the more urgent with more species being sequenced and analyzed by comparative genomics - erroneous annotations could easily propagate, whereas correct annotations in one species will greatly facilitate annotation of novel genomes. We propose a dynamic, economically feasible solution to the annotation predicament: broad-based, web-technology-enabled community annotation, a prototype of which is now in use for Arabidopsis.


Shannon D. Schlueter¹, Matthew D. Wilkerson¹, Eva Huala², Seung Y. Rhee² and Volker Brendel¹,³

¹Department of Genetics, Development and Cell Biology, Iowa State University, 2112 Molecular Biology Building, Ames, IA 50011, USA
²Carnegie Institution, Department of Plant Biology, 260 Panama Street, Stanford, CA 94305, USA
³Department of Statistics, Iowa State University, Ames, IA 50011, USA

When is a genome finished?

For all plant and animal species, presentation of the ‘finished genome’ is considered to be a major milestone in the study of its genetics. However, ambiguous claims of this highly prized accomplishment beg the question of the meaning and worth of such announcements. Competitive and controversial claims concerning the completion of the human genome have been widely discussed [1]. In the area of plant genetics, the completed Arabidopsis genome was reported at the end of 2000 [2].
At that time, the genomic assembly comprised 115 409 949 base pairs covering the five chromosomes and leaving only an estimated 10 Mb of centromeric and ribosomal DNA (rDNA) repeat regions not sequenced. The total length of the assembled genome has increased by about 1 Mb per year (http://www.plantgdb.org/AtGDB/resource.php).

A more demanding definition of a ‘finished genome’ requires extensive annotation of the assembled chromosome sequences in addition to the mere sequence report. In particular, researchers using the genome as a model system require annotation of the protein-coding genes as the basis for assessing the transcriptome and proteome of the species. At the time of the Arabidopsis genome release, 25 498 protein-coding genes were annotated on the genome sequence. Since that time, this annotation challenge has continued to receive serious consideration for Arabidopsis, as evidenced by a ~10% increase in the number of annotated gene structures during the past three years [3] and continuing correction of erroneous initial annotations [4].

Perhaps the most ambitious and accurate definition of a ‘finished genome’ should include functional characterization of all the genes, a goal of the Arabidopsis 2010 project [5]. It is clear that each, successively more comprehensive, definition requires completion of the less ambitious tasks. The complexities of providing comprehensive annotation, whether that annotation is structural or functional, depend on an accurately defined gene structure. Because our collective understanding of genes and genome function continually advances, and users of the genome annotation naturally expect it to remain up to date with recent discoveries, the definition of a finished genome is necessarily a bit of a moving target. Currently, a considerable time lag between completion of sequencing and completion of annotation appears to be unavoidable.
This is because, even though sequencing is largely automated and robotic, and sequence assembly is largely routine (at least for genome regions that are not highly repetitive), accurate sequence annotation entirely by gene-finding software has remained elusive [6].

Current efforts towards more accurate and comprehensive gene structure annotation have focused on expressed sequence tag (EST) and full-length cDNA mapping onto the Arabidopsis genome [7–9] and combinations of computational and experimental approaches [10,11]. These studies have underscored the utility of spliced alignment to identify non-coding exons and to correct inaccurate computational gene predictions that formed the basis of the initial genome annotation. In particular, the results of cDNA mapping point to inherent limitations of high-throughput computational gene prediction, including difficulties in predicting exact exon borders, problems with distinguishing intergenic regions from introns and lack of models capable of identifying untranslated mRNA regions. However, these recent efforts have also not been entirely immune to the problems of large-scale automated annotation. For example, novel algorithmic changes incorporated into the newest annotation release [12] have inadvertently resulted in the ambiguous assignment of ESTs to multiple adjacent genes, thereby falsely extending their gene structure annotations (e.g. Figure 2 in Ref. [13]). Inclusion of draft sequences of clones that are too repetitive to finish with existing technology, although useful as a way to improve genome coverage with the available fragments of sequence data, has had some undesirable consequences, such as the inclusion of pBlueScript vector sequences in the genome sequence (http://www.plantgdb.org/AtGDB/Annotation/vector.php). The scope and complexity of the genome annotation task would seem to imply that shortcomings and mistakes are simply unavoidable in the early to middle stages of finishing a genome. Hild et al. [14] have discussed similar challenges with respect to the Drosophila genome annotation.

Corresponding author: Brendel, V. (vbrendel@iastate.edu). Available online 15 December 2004. Opinion, TRENDS in Plant Science Vol.10 No.1 January 2005. www.sciencedirect.com 1360-1385/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.tplants.2004.11.002

Arabidopsis genome annotation

The Arabidopsis research community currently has several ways to access genome data. TAIR (The Arabidopsis Information Resource http://www.arabidopsis.org/ [15]), TIGR (The Institute for Genomic Research http://www.tigr.org/tdb/), MATdb (MIPS Arabidopsis thaliana Databases http://mips.gsf.de/proj/thal/db/ [16]), SIGnAL (Salk Institute Genomic Analysis Laboratory http://signal.salk.edu/ [10]), and AtGDB (The Arabidopsis thaliana Genome Database at PlantGDB http://www.plantgdb.org/AtGDB/ [9,13]) provide web-based genome browsers for Arabidopsis that display gene structure annotation and comparisons with spliced alignment of ESTs and cDNAs. In addition to its genome browser, TAIR provides a comprehensive access point for Arabidopsis data, including information about genes, sequences, proteins, microarrays, germplasms, polymorphisms, seed and DNA stocks, and the research community. TAIR's curation efforts include the functional annotation of genes, with an emphasis on capturing experimental data from the literature and using controlled vocabularies [17].
Since the first release of the genome sequence in 2000, TIGR has maintained and updated the Arabidopsis genome annotation, making the updates publicly available in periodic releases, ending with the TIGR 5.0 release in January 2004, visible also at both AtGDB and TAIR. Because TIGR's role in maintaining and improving the genome annotation has come to an end, other mechanisms must be put in place to ensure that the genome data remain as error-free and up to date as possible. In response to this need, TAIR is currently setting up its own automated pipeline for improving gene models using new EST and cDNA data and manual methods for updating gene structures in response to community input. Although TAIR will work to eliminate the previously reported problems associated with automated gene structure annotation, automated methods will never be as flexible as a human curator in handling unusual cases or making use of new kinds of data. However, manual curation efforts by trained curators are limited by the size of the curation team and the amount of time needed to resolve each problematic gene structure annotation.

Even with well-organized community resources to support the informatics needs of a genome project, genome annotation remains a difficult task because, ultimately, all gene models will have to be evaluated by human experts. We have argued previously [18,19] that the only promising solution to this quandary is involvement of the user community and the development of enabling technology that streamlines user input, curation of user contributions and dissemination of approved user contributions. The purpose of this article is to introduce web-based gene structure annotation tools that are directly linked into AtGDB and TAIR and that will, we believe, facilitate broad-based community participation in the genome annotation task.
To assist in evaluating the quality of specific gene structure annotation and to determine the overall quality of the current Arabidopsis annotation, we have developed a system at AtGDB that allows gene structure comparison in the genomic context (http://www.plantgdb.org/AtGDB/Annotation/). The system, called Genome Annotation EVALuation (GAEVAL, pronounced ‘gavel’), highlights inconsistencies between current gene structure annotation and the cognate placement of spliced aligned ESTs and (full-length) cDNAs. The reference for current gene structure annotation is provided by the mRNA fields in the GenBank deposited chromosome sequence files (Accession nos. NC_003070, NC_003071, NC_003074, NC_003075, NC_003076). The cognate spliced alignments were derived with the GeneSeqer program as described previously [9] and provide the ability to identify non-coding exons, to confirm splicing boundaries and to correct inaccurate ab initio gene predictions [4,6]. Additionally, owing to the nature of cognate mapping, these spliced alignments provide higher accuracy when evaluating genes from multigene families by explicitly using only sequences native to the specific locus for annotation.

Quality assessment of predicted gene structures

Alignments are first evaluated to determine their native locus and, if necessary, the specific transcript isoform derived from the locus. A scoring system for comparing the spliced alignment with overlapping gene annotations was devised to aid in this determination (http://www.plantgdb.org/AtGDB/Annotation/gaeval/). Once a transcript isoform has been identified from which the EST or cDNA originated, all corresponding spliced alignments are compared with the predicted gene structure. This comparison is used to judge the accuracy of the gene annotation and to assign a quality flag for immediate appraisal of annotation validity.
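The comparison between spliced alignments and an annotated gene structure can be illustrated with a minimal Python sketch. This is not GAEVAL's scoring system, which is documented at the AtGDB URL above. The data model (introns as coordinate pairs), the function name `compare_introns` and the use of exact coordinate matching are assumptions made purely for illustration, and the coordinates are invented.

```python
from typing import List, Set, Tuple

# An intron as (first intron base, last intron base) in genomic coordinates.
Intron = Tuple[int, int]


def compare_introns(annotated: List[Intron], alignments: List[List[Intron]]):
    """Compare annotated introns with introns seen in cognate spliced alignments.

    Returns three lists:
      confirmed   - annotated introns matched exactly by at least one alignment
      unsupported - annotated introns matched by no alignment
      conflicting - alignment introns that are absent from the annotation
    """
    observed: Set[Intron] = {intron for aln in alignments for intron in aln}
    confirmed = [i for i in annotated if i in observed]
    unsupported = [i for i in annotated if i not in observed]
    conflicting = sorted(observed - set(annotated))
    return confirmed, unsupported, conflicting


# Hypothetical gene model with two introns and two cognate EST alignments.
annotation = [(1200, 1450), (1600, 1900)]
est_alignments = [
    [(1200, 1450)],                # agrees with the first annotated intron
    [(1200, 1450), (1600, 1880)],  # second intron uses a different acceptor site
]

confirmed, unsupported, conflicting = compare_introns(annotation, est_alignments)
print(confirmed)    # [(1200, 1450)]
print(unsupported)  # [(1600, 1900)]
print(conflicting)  # [(1600, 1880)]
```

In this toy case the second annotated intron would be highlighted as inconsistent with the transcript evidence, which is the kind of discrepancy a curator or community annotator would then inspect.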
Five levels of annotation quality were established (Figure 1). The first quality level corresponds to an unconfirmed gene annotation for which no EST or cDNA evidence is currently available. These gene structure annotations are generally based entirely on ab initio computational prediction. Further analysis using homologous ESTs and cDNAs can be used to provide estimates of the annotation accuracy [20,21]. Annotations of quality levels beyond the first level benefit from the spliced alignment of ESTs and cDNAs. Increasing quality levels (Figure 1) represent increasing confidence in the accuracy and completeness of an annotation. Ultimately, the fifth level of quality assignment is given to gene annotations completely tiled by cognate ESTs or cDNAs, with all splice site boundaries supported. These annotations
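As a rough illustration of the two quality levels that the text defines explicitly, the following Python sketch assigns level 1 when no cognate evidence exists and level 5 when every exon base is tiled and every annotated intron is supported. The criteria for levels 2 to 4 are given in Figure 1 and are not reproduced here, so the sketch returns `None` for them. The function names, the interval representation and the example coordinates are hypothetical and do not describe the GAEVAL implementation.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) genomic coordinates


def is_fully_tiled(exons: List[Interval], blocks: List[Interval]) -> bool:
    """True if every annotated exon base is covered by at least one aligned block."""
    ordered = sorted(blocks)
    for start, end in exons:
        pos = start  # next exon base that still needs coverage
        for b_start, b_end in ordered:
            if b_start > pos:
                break  # gap: no block covers position `pos`
            if b_end >= pos:
                pos = b_end + 1
            if pos > end:
                break  # exon completely covered
        if pos <= end:
            return False
    return True


def quality_endpoint(
    exons: List[Interval],
    introns: List[Interval],
    blocks: List[Interval],
    alignment_introns: List[Interval],
) -> Optional[int]:
    """Return 1 or 5 for the two levels described in the text, else None."""
    if not blocks:
        return 1  # unconfirmed: no EST or cDNA evidence at all
    introns_supported = set(introns) <= set(alignment_introns)
    if introns_supported and is_fully_tiled(exons, blocks):
        return 5  # completely tiled, all splice boundaries supported
    return None  # intermediate levels 2-4 depend on Figure 1 (not modelled)


# Hypothetical two-exon gene and cognate alignment blocks.
exons = [(100, 200), (300, 400)]
introns = [(201, 299)]
blocks = [(100, 200), (300, 350), (340, 400)]
aln_introns = [(201, 299)]

print(quality_endpoint(exons, introns, blocks, aln_introns))  # 5
print(quality_endpoint(exons, introns, [], []))               # 1
```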
