Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation

Abstract

Background: Microbiome-wide gene expression profiling through high-throughput RNA sequencing ('metatranscriptomics') offers a powerful means to functionally interrogate complex microbial communities. Key to successful exploitation of these datasets is the ability to confidently match relatively short sequence reads to known bacterial transcripts. In the absence of reference genomes, such annotation efforts may be enhanced by assembling reads into longer contiguous sequences ('contigs') prior to database searches. Since reads from homologous transcripts may derive from several species, represented at different abundance levels, it is not clear how well current assembly pipelines perform on metatranscriptomic datasets. Here we evaluate the performance of four currently employed assemblers: the de novo transcriptome assemblers Trinity and Oases, the metagenomic assembler MetaVelvet, and the recently developed metatranscriptomic assembler IDBA-MT.

Results: We evaluated the performance of the assemblers on a previously published dataset of single-end RNA sequence reads derived from the large intestine of an inbred non-obese diabetic mouse model of type 1 diabetes. We found that Trinity performed best as judged by the number of contigs assembled, the number of reads assigned to contigs, and the number of reads that could be annotated to a known bacterial transcript. Without assembly, only 15.5% of RNA sequence reads could be annotated to a known transcript, compared with 50.3% with Trinity assembly. Paired-end reads generated from the same mouse samples resulted in modest performance gains. A database search estimated that the assemblies are unlikely to erroneously merge multiple unrelated genes sharing a region of similarity (<2% of contigs). A simulated dataset based on ten species confirmed these findings. A more complex simulated dataset based on 72 species showed that more assembly errors were introduced than would be expected from sequencing quality alone. Through the detailed evaluation of assembly performance, the insights provided by this study will help drive the design of future metatranscriptomic analyses.

Conclusion: Assembly of metatranscriptome datasets greatly improved read annotation. Of the four assemblers evaluated, Trinity provided the best performance. For more complex datasets, reads generated from transcripts sharing considerable sequence similarity can be a source of significant assembly error, suggesting a need to collate reads on the basis of common taxonomic origin prior to assembly.

Figures

  • Figure 1 Trinity-based assembly of short-read metatranscriptomic data improves annotation. The de novo transcriptome assembler, Trinity [19], was applied to a metatranscriptomic dataset generated from a non-obese diabetic (NOD) mouse cecal sample (NOD503CecMN). The probability of obtaining a significant sequence alignment (bit score >50) to a known protein increases with contig length. Contigs greater than 79 bp demonstrate greater annotation potential compared to unassembled reads.
  • Figure 2 Performance of three short-read assemblers on a single-end metatranscriptomic dataset. Three different single-end assemblers (with varying k-mer parameters where appropriate) were applied to the NOD503CecMN single-end dataset and evaluated on the basis of: 1) the probability of contigs of different lengths having significant sequence similarity (bit score >50) to a known protein, as well as the percentage of reads which could be annotated (top panel), and 2) contig length distributions (bottom panel). While the assemblers varied greatly in the contig length distribution, number of contigs assembled, and number of reads which could be matched to an annotated contig, all contigs over 180 bp, irrespective of the assembler used to generate them, had a consistently high probability of having significant sequence similarity to a known protein.
  • Figure 3 Performance of four short-read assemblers on both single- and paired-end metatranscriptomic datasets. Assembly performance was assessed using both single-end and paired-end datasets generated from the NOD503CecMN sample. Comparisons between the two datasets are presented for each assembler/parameter combination except IDBA-MT which requires paired-end data. Assemblers were evaluated on the basis of: 1) number of contigs assembled, 2) percentage of reads that map to assembled contigs, and 3) whether contigs have sequence similarity to a known protein at two levels of stringency.
  • Table 1 Overlap in assemblies
  • Figure 4 Identification and evaluation of misassembled contigs. (A) Strategy used to identify misassembled contigs with the potential to align to multiple bacterial proteins. First, we perform a database search to identify proteins aligning to the contig (1). Next, iterating from the start of the contig, we identify the set of highest scoring non-overlapping alignments (2). Based on these, the contig is subsequently fragmented (3). (B) Incidence of misassemblies, as defined by the heuristic presented in (A), in assemblies of both the single-end and paired-end read datasets generated from the NOD503CecMN sample (left panel). Also shown is the proportion of intact contigs and fragments which align <90% of their length to a known protein (right panel).
  • Figure 5 Overview of metatranscriptome simulation pipeline based on FluxSimulator. For each species considered, the genome sequence and ORF annotation file are used to create a list of predicted transcripts. FluxSimulator then assigns each gene a random expression value based on Zipf’s law to create a library of the mRNA molecules that are present in the sample. Given a list of experimental input parameters (sequence errors, sample bias, and relative species abundance), a set of simulated metatranscriptomic reads is generated based on the set of transcripts provided. A gold standard assembly is then generated by aligning reads to the original transcripts and obtaining consensus sequences from the resulting alignments.
  • Figure 6 Accuracy of simulated metatranscriptome assemblies. For each simulated dataset, the accuracy of the reconstructed transcripts is evaluated based on their matches to the original set of transcripts used to generate the datasets. (A) Ten species dataset. (B) 72 species dataset. Shown is the percentage of contigs in each assembly which contain a region of at least one read length (76 bp) which does not align to a transcript at a variety of sequence cutoffs (97%–100% sequence identity). The gold standard assembly indicates the number of predicted misassemblies that are the result of introduced sequence errors during generation of the simulated datasets. Note this is higher for the ten species dataset as it includes a larger number of contigs than are generated by the assemblers (see text).
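
The misassembly heuristic described for Figure 4(A) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hit` record, the function names, the greedy tie-breaking rule, and the reuse of the bit score >50 threshold from Figures 1 and 2 are all assumptions made for illustration. Hits are taken to be already parsed from a database search, with 0-based, end-exclusive contig coordinates.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One contig-to-protein alignment (hypothetical record layout)."""
    protein: str
    start: int        # 0-based start on the contig, inclusive
    end: int          # end on the contig, exclusive
    bit_score: float


def select_hits(hits: List[Hit], min_bit_score: float = 50.0) -> List[Hit]:
    """Walk along the contig from its start and keep, among overlapping
    alignments, only the highest scoring one (greedy simplification)."""
    significant = [h for h in hits if h.bit_score > min_bit_score]
    selected: List[Hit] = []
    for hit in sorted(significant, key=lambda h: (h.start, h.end)):
        if selected and hit.start < selected[-1].end:
            # Overlaps the last kept alignment: keep the better of the two.
            if hit.bit_score > selected[-1].bit_score:
                selected[-1] = hit
        else:
            selected.append(hit)
    return selected


def fragment_contig(sequence: str, hits: List[Hit]) -> List[Tuple[str, str]]:
    """Cut the contig into one fragment per selected alignment."""
    return [(h.protein, sequence[h.start:h.end]) for h in select_hits(hits)]


def is_putative_misassembly(hits: List[Hit]) -> bool:
    """Flag contigs whose selected alignments point to more than one protein."""
    return len({h.protein for h in select_hits(hits)}) > 1


# Toy data, invented for illustration only.
contig = "A" * 300
toy_hits = [
    Hit("protA", 0, 120, 80.0),
    Hit("protA_like", 10, 110, 60.0),   # overlaps protA with a lower score
    Hit("weak", 120, 150, 30.0),        # below the score threshold
    Hit("protB", 150, 300, 95.0),
]
for protein, fragment in fragment_contig(contig, toy_hits):
    print(protein, len(fragment))
```

A greedy pass is used here because the caption describes iterating from the start of the contig; a weighted interval scheduling approach that maximizes the total score would be an alternative reading of the same description.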

Cite

Celaj, A., Markle, J., Danska, J., & Parkinson, J. (2014). Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation. Microbiome, 2(1). https://doi.org/10.1186/2049-2618-2-39
