RESEARCH ARTICLE  Open Access

Using RNA-Seq for gene identification, polymorphism detection and transcript profiling in two alfalfa genotypes with divergent cell wall composition in stems

S Samuel Yang1*, Zheng Jin Tu2, Foo Cheung3,5, Wayne Wenzhong Xu2, JoAnn FS Lamb1,4, Hans-Joachim G Jung1,4, Carroll P Vance1,4* and John W Gronwald1,4*

1USDA-Agricultural Research Service, Plant Science Research Unit, St. Paul, MN, 55108, USA. Full list of author information is available at the end of the article.

Yang et al. BMC Genomics 2011, 12:199 http://www.biomedcentral.com/1471-2164/12/199

© 2011 Yang et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Alfalfa [Medicago sativa (L.) sativa], a widely grown perennial forage, has potential for development as a cellulosic ethanol feedstock. However, the genomics of alfalfa, a non-model species, is still in its infancy. The recent advent of RNA-Seq, a massively parallel sequencing method for transcriptome analysis, provides an opportunity to expand the identification of alfalfa genes and polymorphisms, and to conduct in-depth transcript profiling.

Results: Cell walls in stems of alfalfa genotype 708 have higher cellulose and lower lignin concentrations compared to cell walls in stems of genotype 773. Using the Illumina GA-II platform, a total of 198,861,304 expressed sequence tags (ESTs, 76 bp in length) were generated from cDNA libraries derived from elongating stem (ES) and post-elongation stem (PES) internodes of 708 and 773. In addition, 341,984 ESTs were generated from ES and PES internodes of genotype 773 using the GS FLX Titanium platform. The first alfalfa (Medicago sativa) gene index (MSGI 1.0) was assembled using the Sanger ESTs available from GenBank, the GS FLX Titanium EST sequences, and the de novo assembled Illumina sequences. MSGI 1.0 contains 124,025 unique sequences including 22,729 tentative consensus sequences (TCs), 22,315 singletons and 78,981 pseudo-singletons. We identified a total of 1,294 simple sequence repeats (SSRs) among the sequences in MSGI 1.0. In addition, a total of 10,826 single nucleotide polymorphisms (SNPs) were predicted between the two genotypes. Out of 55 SNPs randomly selected for experimental validation, 47 (85%) were polymorphic between the two genotypes. We also identified numerous allelic variations within each genotype. Digital gene expression analysis identified numerous candidate genes that may play a role in stem development as well as candidate genes that may contribute to the differences in cell wall composition in stems of the two genotypes.

Conclusions: Our results demonstrate that RNA-Seq can be successfully used for gene identification, polymorphism detection and transcript profiling in alfalfa, a non-model, allogamous, autotetraploid species. The alfalfa gene index assembled in this study, and the SNPs, SSRs and candidate genes identified can be used to improve alfalfa as a forage crop and cellulosic feedstock.
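
The counts quoted in the abstract can be cross-checked with simple arithmetic. The Python sketch below re-adds the three MSGI 1.0 sequence classes and recomputes the SNP validation rate; the variable names are illustrative and are not taken from the study.

```python
# Sequence classes reported for MSGI 1.0 in the abstract
tentative_consensus = 22_729
singletons = 22_315
pseudo_singletons = 78_981
reported_total = 124_025

# The three classes should add up to the reported number of unique sequences
components_total = tentative_consensus + singletons + pseudo_singletons

# SNP validation figures reported in the abstract
snps_tested = 55
snps_polymorphic = 47
validation_rate = snps_polymorphic / snps_tested

print(f"Sum of sequence classes: {components_total:,}")
print(f"Matches reported total:  {components_total == reported_total}")
print(f"SNP validation rate:     {validation_rate:.1%}")
```
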

Background

The advent of next generation high-throughput sequencing has revolutionized the analysis of genomes and transcriptomes [1-5]. When applied to the transcriptome, this methodology is referred to as RNA-Seq (RNA sequencing). RNA-Seq has been used for gene annotation, expression analysis and SNP discovery [6,7]. This methodology has also proven useful for discovery of novel transcripts (coding and non-coding) and identification of alternative splice variants [5,8]. It is expected that RNA-Seq methodologies will supersede microarrays for transcript profiling because of higher sensitivity, base-pair resolution and the larger range of expression values that can be detected [3,5,9]. Furthermore, in contrast to microarrays, RNA-Seq does not require prior knowledge of gene sequences. However, RNA-Seq presents bioinformatic challenges because the millions of short sequence reads generated by the methodology must be assembled.

RNA-Seq has been successfully used for annotation, transcript profiling and/or SNP discovery in a number of plant species. For model plant species with sequenced genomes, sequence reads can be mapped to the reference genome. The model species to which RNA-Seq analysis has been applied include Arabidopsis [10,11], soybean [12,13], rice, maize and Medicago truncatula. There are also examples of the application of RNA-Seq to non-model plant species that lack a reference genome. In the absence of a reference genome, de novo assembly of sequence reads into contigs is required. RNA-Seq has been used for transcript profiling in Eucalyptus grandis, grape (Vitis vinifera L.), California poppy (Eschscholzia californica), avocado (Persea americana), Pachycladon enysii and Artemisia annua. In Eucalyptus grandis and rape (Brassica napus), RNA-Seq was used for SNP discovery [17,21].

Alfalfa is the most widely cultivated forage legume in the world and the fourth most widely grown crop in the US [22,23].
In addition to its value as a livestock feed, alfalfa also has potential as a cellulosic ethanol feedstock [24,25]. Alfalfa is an allogamous autotetraploid with complex polysomic inheritance [26-28]. Progress in improving the agronomic traits of this species using traditional breeding approaches based on phenotypic selection has been slow. For the most part, genomic approaches for crop improvement (e.g., molecular breeding) have not been applied to this legume because of limited genomic resources. As of February 2010, there were 12,371 alfalfa ESTs available in the public database. A few SSRs have been detected but SNPs have not yet been identified [28-30].

Recently, we reported on the results of transcript profiling and single feature polymorphism (SFP) detection in alfalfa using the Medicago GeneChip as a cross-species platform [25,31]. The Medicago GeneChip contains probe sets designed for the model plant, Medicago truncatula, a diploid relative of alfalfa. Using a method based on probe affinity differences and affinity shape power, we identified over 10,000 SFPs in the stem internodes of alfalfa genotypes 252 and 1283 that differed in cellulose and lignin concentrations in cell walls. In a subsequent study using the Medicago GeneChip for transcript profiling of alfalfa genotypes 252 and 1283, interspecies variable regions and SFPs were masked prior to data analysis, resulting in a 2-fold increase in the number of differentially expressed genes detected in stem internodes of the two genotypes. Although the research of Yang et al. [25,31] significantly advanced alfalfa genomics, the use of a cross-species platform for microarray analysis limits the sensitivity and specificity of transcriptome analysis and polymorphism detection.

The stem tissue of alfalfa is important in determining the value of this forage as a livestock feed and cellulosic feedstock.
Increasing the cellulose and decreasing the lignin content in cell walls in stems would improve alfalfa for both uses. In this study, we applied RNA-Seq to gene identification, polymorphism detection and transcript profiling of two alfalfa clonal lines (708, 773) that differ in cell wall composition in stems. The results were used to assemble the first gene index for alfalfa (MSGI 1.0). Our research also provides the first report of high-throughput SNP detection and digital gene expression analysis in the alfalfa transcriptome.

Results and discussion

Cell wall composition of stems of genotypes 708 and 773

The alfalfa genotypes 708 and 773 used in this study were selected for divergent cell wall composition in stems under field conditions (see Methods for details). Cell wall composition of greenhouse-grown stems used for RNA sampling in the current study is shown in Table 1. Cell wall concentration in stems of the two clones did not differ. In contrast, cellulose content (defined as glucose) in the stems of genotype 708 was 5.2% greater compared to genotype 773 (p < 0.05) (Table 1). In addition, galactose and mannose concentrations were 14.2% (p < 0.05) and 8.5% (p < 0.01) greater, respectively, in stems of genotype 708 compared to genotype 773 (Table 1). Klason lignin concentration in the cell wall was 8.0% greater in stems of 773 compared to stems of 708 (p < 0.05) (Table 1). These genotypes consistently displayed differences in cell wall cellulose and lignin content in stems when plants were grown under different field environments (Figure 1) and in the greenhouse (Table 1).

RNA-Seq using the Illumina GA-II platform

For RNA-Seq analysis, we constructed a total of four cDNA libraries derived from elongating stem (ES) and
post-elongation stem (PES) internodes of alfalfa genotypes 708 and 773 (see Methods for details). In alfalfa stems, genes associated with primary cell wall development are preferentially expressed in ES internodes while genes associated with secondary xylem development are enriched in PES internodes. For sequencing by synthesis using the Illumina GA-II platform, cDNA libraries 708ES, 708PES and 773ES were run on two lanes per library while the 773PES library was run on one lane. A total of 234,908,899 EST reads were generated by a single run of 76 cycles. After filtering low quality reads, a total of 198,861,304 reads (76 bp in size) were selected for further analysis (see Methods for details). The Illumina reads generated in this study are available at the NCBI SRA browser (accession number GSE26757; http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26757).

De novo assembly of short RNA-Seq reads without a known reference is a challenging task, especially for alfalfa, an allogamous autotetraploid with complex polysomic inheritance. In this study, we used the Velvet algorithm for de novo assembly of the 198,861,304 Illumina reads (76 bp) into a total of 132,153 unique sequences with an average length of 284 bp (Additional file 1). The Velvet algorithm has also been used successfully for de novo transcriptome assembly in previous studies [33,34]. The Velvet algorithm was originally developed for de novo assembly of genome sequences, where the coverage is expected to be homogeneous throughout the genome. However, the coverage of transcripts is highly heterogeneous due to differences in gene expression. Previous studies showed that de novo assembly using the Velvet program with longer k-mers results in a more contiguous transcript assembly but lower transcript diversity compared to shorter k-mers [32,33].
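
The effect of k-mer length on weakly expressed transcripts can be illustrated with the relation between base coverage C and k-mer coverage given in the Velvet documentation, Ck = C × (L − k + 1) / L for reads of length L. The Python sketch below applies it to the 76-bp read length reported above. The k values and the two coverage depths are hypothetical and are not the settings or measurements of this study.

```python
def kmers_per_read(read_len: int, k: int) -> int:
    """Number of overlapping k-mers contained in one read of length read_len."""
    if not 1 <= k <= read_len:
        raise ValueError("k must be between 1 and the read length")
    return read_len - k + 1


def kmer_coverage(base_coverage: float, read_len: int, k: int) -> float:
    """k-mer coverage Ck = C * (L - k + 1) / L for base coverage C."""
    return base_coverage * kmers_per_read(read_len, k) / read_len


READ_LEN = 76            # Illumina read length reported in the text
HIGH_EXPRESSION = 200.0  # hypothetical base coverage of an abundant transcript
LOW_EXPRESSION = 5.0     # hypothetical base coverage of a rare transcript

# Illustrative k values only; the study's actual setting is given in its Methods
for k in (21, 41, 61):
    print(
        f"k={k}: {kmers_per_read(READ_LEN, k)} k-mers per read, "
        f"Ck(high)={kmer_coverage(HIGH_EXPRESSION, READ_LEN, k):.1f}, "
        f"Ck(low)={kmer_coverage(LOW_EXPRESSION, READ_LEN, k):.1f}"
    )
```

With these illustrative inputs, the k-mer coverage of the rare transcript falls from roughly 3.7 at k = 21 to roughly 1.1 at k = 61, which is one way to see why a long k-mer favors contiguity for abundant transcripts at the expense of transcript diversity.
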
Although several recent studies introduced new algorithms and methodologies developed for de novo transcriptome assembly [35-38], a consensus standard protocol has not yet emerged. In this study, we optimized our Velvet de novo transcriptome assembly to favor transcript contiguity with high specificity as opposed to increased transcript diversity (see Methods for details). To compensate for the limitations of the high k-mer value that we selected for the Velvet assembly in this study (lower diversity and a probable bias toward highly expressed genes), we generated additional ESTs using the GS FLX Titanium platform.

RNA-Seq using the GS FLX Titanium platform

We generated a total of 341,984 additional ESTs (average length 243 bp, minimum length 40 bp, maximum length 792 bp) using the GS FLX Titanium platform (http://www.454.com). The additional EST sequences were generated from the cDNA libraries derived from ES (124,533 ESTs, average length 230 bp) and PES (217,451 ESTs, average length 256 bp) internodes of genotype 773. The additional ESTs obtained using the GS FLX Titanium platform increased the diversity of transcripts discovered and hence provided broader coverage of the alfalfa transcriptome than would have been achieved based on the de novo assembly of the Illumina reads alone. The additional ESTs are also available at the NCBI SRA browser (accession number GSE26757; http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26757).

Alfalfa Gene Index 1.0 (MSGI 1.0)

We used the Gene Index Assembly protocol [39,40] for reference transcriptome assembly in alfalfa.

Figure 1 Regression analyses of cellulose and Klason lignin concentrations in stems of two alfalfa genotypes. The stems of genotype 708 were consistently higher in cellulose and lower in Klason lignin compared to stems of genotype 773 across twelve environmental indexes (field environments). The high r2 values for all regression lines suggest that genotypic differences in stem cellulose and Klason lignin concentrations were environmentally stable.
[Plot: x-axis, Environmental Index; y-axes, Stem Cellulose Concentration (g/kg dry matter) and Stem Klason Lignin Concentration (g/kg dry matter). Regression lines: Clone 708 Stem Cellulose, y=305+1.2x, r2=0.94; Clone 773 Stem Cellulose, y=284+1.2x, r2=0.94; Clone 773 Stem Klason Lignin, y=169+1.1x, r2=0.80; Clone 708 Stem Klason Lignin, y=146+1.0x, r2=0.84.]

Table 1 Comparison of cell wall components in stems of genotypes 708 and 773 on a cell wall basis

Component        Genotype 708    Genotype 773    SEM     p-value
                 (values in g kg⁻¹ cell wall)
Klason lignin    162             175             2       p < 0.05
Glucose          443             421             2       p < 0.05
Xylose           137             149             3       NS
Arabinose        39              39              1       NS
Galactose        32              28              1       p < 0.05
Mannose          33.1            30.5            0.1     p < 0.01
Rhamnose         11.5            11.4            0.4     NS
Fucose           3.01            3.1             0.03    NS
Uronic acids     139             142             6       NS

Values are least square means based on an analysis of variance with three biological replicates for each clone arranged in a randomized complete block design (see Methods for details). SEM = Standard error of mean, NS = Non-significant (p > 0.05).

This