
Widespread RNA and DNA sequence differences in the human transcriptome.

by M Li, I X Wang, Y Li, A Bruzel, A L Richards, J M Toung, V G Cheung
Science

Abstract

The transmission of information from DNA to RNA is a critical process. We compared RNA sequences from human B cells of 27 individuals to the corresponding DNA sequences from the same individuals and uncovered more than 10,000 exonic sites where the RNA sequences do not match that of the DNA. All 12 possible categories of discordances were observed. These differences were nonrandom as many sites were found in multiple individuals and in different cell types, including primary skin cells and brain tissues. Using mass spectrometry, we detected peptides that are translated from the discordant RNA sequences and thus do not correspond exactly to the DNA sequences. These widespread RNA-DNA differences in the human transcriptome provide a yet unexplored aspect of genome variation.


Very Few RNA and DNA Sequence Differences in the Human Transcriptome

Daniel R. Schrider1,2*, Jean-Francois Gout1, Matthew W. Hahn1,2

1 Department of Biology, Indiana University, Bloomington, Indiana, United States of America
2 School of Informatics and Computing, Indiana University, Bloomington, Indiana, United States of America

Abstract

RNA editing is an important cellular process by which the nucleotides in a mature RNA transcript are altered to cause them to differ from the corresponding DNA sequence. While this process yields essential transcripts in humans and other organisms, it is believed to occur at a relatively small number of loci. The rarity of RNA editing has been challenged by a recent comparison of human RNA and DNA sequence data from 27 individuals, which revealed that over 10,000 human exonic sites appear to exhibit RNA-DNA differences (RDDs). Many of these differences could not have been caused by either of the two previously known human RNA editing mechanisms (ADAR-mediated A→G substitutions or APOBEC1-mediated C→U switches), suggesting that a previously unknown mechanism of RNA editing may be active in humans. Here, we reanalyze these data and demonstrate that genomic sequences exist in these same individuals or in the human genome that match the majority of RDDs. Our results suggest that the majority of these RDD events were observed due to accurate transcription of sequences paralogous to the apparently edited gene but differing at the edited site. In light of our results it seems prudent to conclude that if indeed an unknown mechanism is causing RDD events in humans, such events occur at a much lower frequency than originally proposed.

Citation: Schrider DR, Gout J-F, Hahn MW (2011) Very Few RNA and DNA Sequence Differences in the Human Transcriptome. PLoS ONE 6(10): e25842. doi:10.1371/journal.pone.0025842

Editor: Philip Awadalla, University of Montreal, Canada

Received August 3, 2011; Accepted September 12, 2011; Published October 12, 2011

Copyright: © 2011 Schrider et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: D.R.S. is supported by National Institutes of Health Genetics, Cellular & Molecular Sciences Training Grant GM007757. J.F.G. is supported by National Science Foundation Grant EF-0827411. M.W.H. is supported by a fellowship from the Alfred P. Sloan Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: dschride@indiana.edu

Daniel R. Schrider and Jean-Francois Gout contributed equally to this work.

Introduction

The accurate transcription of genomic DNA into RNA is essential for carrying out cellular processes, as RNA transcripts are either translated into functional proteins or perform functions directly. However, in humans [1], plant chloroplasts and mitochondria [2], and certain viruses (e.g., [3]), there are known cases of RNA transcripts differing from the transcribed DNA at specific positions. For example, in humans, adenosine deaminases acting on RNA (ADARs) replace certain adenosines (A) with inosines, which then act as guanosines (G) during translation [4,5]; furthermore, the protein APOBEC1 causes a small number of C→U changes [6,7,8]. Many of these RNA editing events result in alternative proteins that are useful to the organism, and alterations of the frequency of certain RNA editing events can negatively affect organismal function [9].
Despite the demonstrated benefits of RNA editing events, RNA editing is currently viewed as a relatively rare phenomenon, with one comprehensive study identifying only several hundred A→G changes in the human transcriptome [10]. However, a recent study comparing RNA and DNA sequences from 27 human individuals challenges this view [11]. In this study, Li et al. [11] discovered more than 10,000 human exonic sites where the RNA sequences appeared to differ from DNA sequences obtained from the same individual. Interestingly, the majority of these RNA-DNA differences (RDDs) produce changes other than the typical A→G or C→U changes expected by known mechanisms of RNA editing [6,7]. This surprising result implies that most RDDs are produced by some as yet unknown molecular mechanism. Perhaps even more strikingly, this study found a much larger number of modified sites in human mRNAs (10,210) than any study to date, suggesting that RDDs are an important contributor to transcriptomic diversity. Li et al. [11] experimentally confirmed that many of these modified RNA sequences do exist and sometimes result in altered proteins, and are therefore not artifacts of sequencing error. Furthermore, by restricting their analysis to mostly invariant sites, they minimized the likelihood that unsampled genetic variation at the RDD site could result in false positives; comparison with previous studies also ensured the accuracy of their genotype calls at each RDD site. However, the authors did not take adequate steps to ensure that the modified RNA could not have resulted from accurate transcription of DNA somewhere else in the genome. Their only check of the DNA sequences present in each individual was to ensure that RNA-seq reads mapped uniquely to the annotated human GENCODE mRNA sequences [11]. Unfortunately, this step is not enough to ensure the absence of genomic sequences matching the modified sequences.
For example, spurious RDDs would be observed if a highly similar paralog absent from GENCODE and differing from the edited locus at the RDD site was transcribed and translated. In this case the RNA-seq reads supporting RDD events could be derived from sequences other than the seemingly modified gene.
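The scenario described above can be made concrete with a toy example. This is only an illustrative sketch: the sequences, the names `annotated_gene` and `paralog`, and the helper functions are hypothetical and are not taken from either study. A read transcribed faithfully from an unannotated paralog shows one mismatch when the annotated gene is the only candidate considered, and no mismatch once the paralog is included.

```python
def mismatches(read, ref):
    """Return the 0-based positions where two equal-length sequences differ."""
    if len(read) != len(ref):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(read, ref)) if a != b]


def best_match(read, candidates):
    """Assign the read to the candidate with the fewest mismatches.

    Returns (candidate name, list of mismatch positions).
    """
    name = min(candidates, key=lambda n: len(mismatches(read, candidates[n])))
    return name, mismatches(read, candidates[name])


# Hypothetical sequences: the paralog differs from the annotated gene
# only at position 9 (G in the gene, A in the paralog).
annotated_gene = "ACGTTGCAAGCTTAGGCTA"
paralog = "ACGTTGCAAACTTAGGCTA"

# A read transcribed accurately from the paralog (no editing involved).
read = paralog

# Annotation-only search: the read is forced onto the annotated gene,
# and the difference at position 9 looks like an RNA-DNA difference.
annotation_only = best_match(read, {"annotated_gene": annotated_gene})

# Search that also includes the paralog: the read matches it exactly.
with_paralog = best_match(
    read, {"annotated_gene": annotated_gene, "paralog": paralog}
)

print(annotation_only)
print(with_paralog)
```

In the annotation-only case the single mismatch would be reported as an apparent G-to-A difference, even though the read matches genomic sequence exactly.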
Because Li et al. only searched their RNA-seq reads against the GENCODE sequences, there are actually three different potential sources of spurious RDDs. First, transcribed sequences paralogous to a GENCODE gene present in the reference genome but not included in the GENCODE predictions of protein-coding genes would be incorrectly inferred to be an RDD if nucleotides differed between the two sequences. Second, even ensuring that sequences are unique in the reference genome does not ensure that they are the result of RNA editing: paralogous sequences present in the reference genome that contain a single-nucleotide polymorphism in the sampled individual (such that the exact sequence is not present in the reference genome) would also appear to be RDDs. Finally, nucleotide differences in segregating copy number variants [12] that are absent from the reference assembly and contain a single nucleotide difference from their paralogous sequence in the reference genome would also be inferred to be RDDs. In any of these cases the paralogous sequence, if transcribed, has the potential to confound the analysis by producing evidence of post-transcriptional modifications where none exist.

We examined both the DNA and RNA sequences analyzed by Li et al. [11], and found that the vast majority of apparent RDD sites identified in their study match genomic sequence and are therefore most likely the result of accurate transcription of paralogous sequence rather than some unknown RNA editing mechanism.

Results

In order to determine the extent to which RDD events could be erroneously called due to transcription of paralogous sequences matching RDD sites, we first asked whether RDD calls made by Li et al. [11] matched sequence elsewhere in the genome by searching their 10,210 RDD sites against the reference genome, a step not taken in the original paper.
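The idea behind such a search can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the paper (whose search tool and parameters are given in its Materials and Methods): it uses exact string matching on a tiny made-up genome, and the function names, the `flank` parameter, and the sequences are all hypothetical. The query is the flanking sequence around a site with the RDD nucleotide substituted in; any hit elsewhere in the genome is a candidate genomic source for the apparently edited RNA.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (uppercase A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def rdd_query(genome, pos, rdd_nt, flank):
    """Flanking window around `pos` with the RDD nucleotide substituted in."""
    left = genome[max(0, pos - flank):pos]
    right = genome[pos + 1:pos + 1 + flank]
    return left + rdd_nt + right


def find_all(haystack, needle):
    """All start positions of `needle` in `haystack`, overlaps included."""
    hits = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits


def genomic_matches(genome, pos, rdd_nt, flank=5):
    """Exact matches to the RDD-containing window on either strand.

    Returns a list of (start position, strand) tuples.
    """
    query = rdd_query(genome, pos, rdd_nt, flank)
    hits = [(h, "+") for h in find_all(genome, query)]
    hits += [(h, "-") for h in find_all(genome, revcomp(query))]
    return hits


# Hypothetical toy genome: a gene (C at the site of interest) and a
# paralog that carries T at the corresponding position.
gene = "GATTACAGCTT"
paralog = "GATTATAGCTT"
genome = "CCCCC" + gene + "AAAAAAAA" + paralog + "GGGGG"
site = 5 + 5  # genome coordinate of the C in the middle of the gene

print(genomic_matches(genome, site, "T"))  # RDD nucleotide found in the paralog
print(genomic_matches(genome, site, "G"))  # no genomic source for this nucleotide
```

A real analysis would need an aligner that tolerates mismatches and handles a full genome; exact matching is used here only to keep the sketch short.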
When extracting RDD sites and flanking sequences from the reference genome in order to perform this search, we noticed that at 39 of these RDD sites the reference genome exhibited the nucleotide reported by Li et al. to be present in the mRNA but not in the genome (which we will refer to as the "RDD nucleotide"). This suggests that these 39 RDD events were reported in error. We then searched the remaining 10,179 RDD sites against the reference genome (see Materials and Methods) and found that 890 of these RDD sites have a paralog in the reference genome that exhibits the RDD nucleotide. The observation of RNA-DNA sequence differences at these sites suggests that the inferred RDDs are more likely due to transcription of these paralogous sequences than RNA alterations. This explanation is supported by the fact that 674 (75.7%) of these paralogs are found in transcribed regions of the genome, and 640 (71.9%) are located within an annotated gene (Materials and Methods). We also found that 1,316 additional RDD sites have at least one paralog in the reference genome not containing the RDD nucleotide. However, such paralogs could contain polymorphisms such that the transcription of these sequences would result in the appearance of RDDs, if the polymorphic allele not present in the reference genome matches the RDD nucleotide. Again, this possibility is supported by the large percentage of such paralogs found in genes (86.5%) or transcribed regions of the genome (86.2%). In total, RDD sites are much more likely to have a paralog than an average human gene (80.6% of RDD sites versus 68.3% of all human genes; P < 2.2 × 10^-16; Fisher's Exact Test using paralogy assignments from ref. [13]).

In addition to paralogs present in the reference genome, duplication polymorphisms absent from the reference genome could also create the appearance of RNA-DNA differences.
This possibility seems especially relevant given that 3,893 of the remaining RDD sites are either within a duplication listed in the Database of Genomic Variants [14] or have a paralog within such a duplication, a 1.5-fold enrichment of RDD sites for copy number-variable regions of the genome (P < 0.001; see Materials and Methods). For both of the above possibilities to explain the appearance of RDDs, there must be genomic DNA present in an individual (and not the reference genome) that matches the RDD nucleotide. We therefore asked whether Li et al.'s RDD calls for each individual were matched by genomic reads from the same individual, again, a step not taken in the original paper. Because the list of individuals exhibiting each RDD site was not made available, we attempted to recapitulate Li et al.'s results by mapping their RNA-seq data to a database of transcripts containing the 10,210 RDD sites. We used the short-read mapping program BWA [15] to map all RNA-seq reads and applied Li et al.'s criteria for detecting RDD events and determining which events occur in which individuals (Materials and Methods). For most individuals, the number of RDD events we called closely matched the corresponding number of events found by Li et al. (compare our Supplementary Figure S1 with Figure 1B from Li et al.; the exact number of events originally found in each individual was not provided by the authors), suggesting that we fairly accurately recreated their set of RDDs. Next, we mapped genomic reads from these same individuals available through the 1000 Genomes project [16] to these RDD sites, and found that on average 30.5% of RDD events called in an individual are matched by at least one genomic read from that same individual containing the RDD nucleotide. This result suggests that a substantial proportion of RDD sites called by Li et al. may not be the result of some type of RNA editing event.
Instead, there are likely paralogous sequences matching the RDD nucleotide in some or all of the 27 individuals, and these apparently edited transcripts could be the result of transcription of these sequences.

Given the low genomic sequencing coverage of many of the 27 individuals [16], we suspected that even more of Li et al.'s RDD sites could have been false positives. We reasoned that if an RDD site matched genomic sequence from any individual, whether that individual met the criteria for exhibiting the specific RDD event or not, the RDD site was likely not a true editing event. We therefore examined genomic reads from all individuals to determine how many RDD sites matched genomic sequences present in this sample. In total, we found that 74% of RDD sites have at least one genomic read matching the RDD nucleotide in at least one individual. Because some of these matches could be due to simple sequencing errors in genomic reads, we used more stringent criteria to identify a higher-confidence set of genomic sequences, and examined the numbers of reads not matching either the genomic nucleotide or the RDD nucleotide to verify that sequencing error had a minimal impact on this analysis (Materials and Methods). These methods found that the majority (5,666 or 55%) of the 10,210 RDD sites match genomic sequence from at least one individual. In total, 5,900 (57.8%) of the 10,210 RDD sites match either sequence from one of the 27 individuals or from the human reference genome. Table S1 provides a list of the 10,210 RDDs, and whether or not we find evidence for a genomic explanation for the event.

If RNA editing is largely restricted to A→G substitutions, and if the 5,900 RDD sites matching genomic sequence data are truly spurious, then the remaining 4,310 RDD sites in Li et al.'s set should be enriched for A→G changes.
This is indeed the case, as the percentage of all RDD sites that are A→G differences increases from 22.8% to 23.5% (P = 0.013; Fisher's Exact Test) when RDDs matching genomic sequence are removed from the
