
Identification of allele-specific alternative mRNA processing via transcriptome sequencing.

by Gang Li, Jae Hoon Bahn, Jae-Hyung Lee, Guangdun Peng, Zugen Chen, Stanley F Nelson, Xinshu Xiao
Nucleic Acids Research (2012)

Abstract

Establishing the functional roles of genetic variants remains a significant challenge in the post-genomic era. Here, we present a method, allele-specific alternative mRNA processing (ASARP), to identify genetically influenced mRNA processing events using transcriptome sequencing (RNA-Seq) data. The method examines RNA-Seq data at both single-nucleotide and whole-gene/isoform levels to identify allele-specific expression (ASE) and existence of allele-specific regulation of mRNA processing. We applied the methods to data obtained from the human glioblastoma cell line U87MG and primary breast cancer tissues and found that 26-45% of all genes with sufficient read coverage demonstrated ASE, with significant overlap between the two cell types. Our methods predicted potential mechanisms underlying ASE due to regulations affecting either whole-gene-level expression or alternative mRNA processing, including alternative splicing, alternative polyadenylation and alternative transcriptional initiation. Allele-specific alternative splicing and alternative polyadenylation may explain ASE in hundreds of genes in each cell type. Reporter studies following these predictions identified the causal single nucleotide variants (SNVs) for several allele-specific alternative splicing events. Finally, many genes identified in our study were also reported as disease/phenotype-associated genes in genome-wide association studies. Future applications of our approach may provide ample insights for a better understanding of the genetic basis of gene regulation underlying phenotypic diversity and disease mechanisms.


Gang Li1, Jae Hoon Bahn1, Jae-Hyung Lee1, Guangdun Peng1, Zugen Chen2, Stanley F. Nelson2,3,4 and Xinshu Xiao1,4,*

1 Department of Integrative Biology and Physiology, 2 Department of Human Genetics, 3 Department of Pathology and Laboratory Medicine, David Geffen School of Medicine and 4 Molecular Biology Institute, University of California Los Angeles, Los Angeles, CA 90095, USA

Received November 16, 2011; Revised March 2, 2012; Accepted March 14, 2012

*To whom correspondence should be addressed. Tel: +1 310 206 6522; Fax: +1 310 206 9184; Email: gxxiao@ucla.edu

Nucleic Acids Research, 2012, 1–13. doi:10.1093/nar/gks280. Advance Access published March 29, 2012.

© The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

INTRODUCTION

Recent advances in sequencing technologies have enabled an extraordinary expansion of the catalogs of genetic variants in disease genomes or across populations. However, significant challenges still exist in establishing the functional roles of such variants. To date, only a minority of genetic variants identified by genome-wide association studies (GWASs) elicit protein-coding changes. A large number of variants are expected to influence cis-regulation of gene expression (1). Thus far, the most common approach used to predict regulatory variants is expression quantitative trait loci (eQTL) mapping (1). In this approach, massive-scale parallel expression assays are required to identify statistical associations between genotypes and gene expression in populations with a diverse genetic background (2,3). Such studies often focus on the association between genetic variants and whole-gene expression levels, without differentiating isoforms resulting from alternative mRNA processing.

Allele-specific expression (ASE) is an attractive alternative method to infer the existence of cis-acting regulatory variants (4). In an ASE study, the relative proportion of mRNA expression levels of the two alleles of a heterozygous variant is measured in the same cellular environment within the same subject (4,5). Thus, a major advantage of the method is that the alternative alleles serve as within-sample controls of each other, eliminating environmental or trans-acting influences that alter gene expression and making it optimal for detecting cis-acting differences. If the regulatory variants are located in intronic or untranscribed regions, those in the mRNAs may serve as markers for the existence of causal variants. Identification of autosomal ASE might be the most direct method to identify functional cis-regulation, which can be followed up by detailed experimental analyses.
However, observation of ASE in a gene does not normally suggest which type of cis-regulatory mechanism is responsible for ASE but rather that such mechanisms exist. Cis-acting regulation by genetic variants may affect different aspects of gene expression, e.g. transcription, alternative mRNA processing or mRNA stability. Genetic control of transcription often results in changes in whole-gene expression levels, which have been the focus of many eQTL studies. Other mechanisms, such as alternative mRNA processing, were much less often examined despite their known importance in example genes (6–8). Results from large-scale exon array studies showed that genetic influence on alternative mRNA processing could add remarkable complexity to molecular diversity (9). However, investigations of such relationships using microarrays normally require a large number of subjects and arrays to ensure statistical power.

Here, we present methods to analyze transcriptome sequencing (RNA-Seq) data and demonstrate that data from a single subject enabled identification of many genes and alternatively processed regions that are under genetic influence. RNA-Seq provides concurrent allelic and gene expression data. Thus, it allows expression analyses at different levels, including the single-nucleotide, alternatively processed mRNA isoform and whole-gene levels. Integrative analysis of the information at multiple levels allows in-depth understanding of the transcriptome, an advantage of RNA-Seq rarely exploited in previous work. This advantage enabled us to develop pipelines that first identify ASE patterns and then infer their potential involvement in alternative mRNA processing, including alternative splicing, alternative 3′ processing and alternative 5′ initiation. We applied this method to two human cancer data sets: our in-house RNA-Seq data obtained from the glioblastoma cell line U87MG and a public RNA-Seq data set from a breast cancer patient. Our study demonstrated that ASE analysis of individual samples via RNA-Seq can provide substantial insights about the genetic control of gene expression, a potentially much more cost-effective approach than existing methods relying on massive-scale parallel expression assays of a large number of samples.

MATERIALS AND METHODS

Cell culture, RNA purification and RNA-Seq data acquisition

U87MG cells were purchased from American Type Culture Collection (ATCC) and maintained in DMEM high glucose medium supplemented with pyruvate, L-glutamine and 10% fetal bovine serum (FBS) (Hyclone). Total RNA was isolated using the mirVana kit (Ambion), according to the manufacturer's instructions. We used the standard Illumina protocol to prepare libraries for RNA-Seq (http://www.illumina.com/support/documentation.ilmn). Briefly, 10 μg total RNA was first processed via poly-A selection and fragmentation. We generated first-strand cDNA using random hexamer-primed reverse transcription and subsequently used it to generate second-strand cDNA using RNase H and DNA polymerase. Sequencing adapters were ligated using the Illumina Paired-End sample prep kit. Fragments of 200 bp were isolated by gel electrophoresis, amplified by 15 cycles of PCR and sequenced on the Illumina Genome Analyzer IIx (Cofactor Genomics) in the paired-end sequencing mode (2 × 60 nt reads).

RNA-Seq reads mapping

The same mapping methods as in our previous work (10) were used. Briefly, reads were mapped to the human genome and Ensembl-defined transcriptome using multiple tools including Bowtie (11), BLAT (12) and Tophat (13). The two reads in a pair were mapped separately. Alignments of a read with more than 12 mismatches were discarded. Read pairs were then examined for uniqueness and correct pairing. A uniquely mapped pair was required to have fewer than six mismatches on each read and not to map anywhere else in the genome as a pair with 12 or fewer mismatches on each read. Since the genomic locations of heterozygous single nucleotide variants (SNVs) were provided by whole-genome sequencing of U87MG (14), we corrected the number of mismatches in reads harboring the non-reference allele of an SNV such that reads with SNVs were treated without a bias. Only uniquely paired reads were used for subsequent analyses. In addition, we removed all duplicate reads (those mapped to the same genomic locations as a pair) except the one with the best quality score in the mismatch positions (if any).

Identification of ASE of SNVs

For each heterozygous SNV, we first obtained the number of RNA-Seq reads mapped to its alleles. Since the first read position was observed to have relatively large sequencing errors in our data, we excluded reads whose SNVs were located at the first nucleotide. We then calculated the allelic ratio, defined as the number of reads mapped to the reference allele divided by the total number of reads covering an SNV. To identify ASE patterns, we used the Chi-square goodness-of-fit test to determine whether the allelic ratio deviates from the expected ratio of 0.5 (i.e. when the two alleles are equally expressed). SNVs were excluded if they were potentially in regions with copy number variants, as determined by the read depth of the genome sequencing data (14,15). In this analysis, only SNVs with at least 20 RNA-Seq reads were included to reach adequate statistical power (see 'Results' section). Significant ASE patterns were determined using a false discovery rate (FDR) cutoff of 5% based on a modified Benjamini–Hochberg method (16,17) to account for possible correlations of ASE patterns in a gene. The FDR was also estimated using biological replicates (see 'Results' section) or an explicit simulation procedure. In this procedure, for each heterozygous SNV location, we randomly assigned each mapped read in the data set to either allele (with equal probability). Following this randomization, the ASE patterns were identified as described above and an FDR was calculated.
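The per-SNV test described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the input layout (a dict of SNV id to reference and alternative read counts) are hypothetical, and the standard Benjamini–Hochberg procedure stands in for the modified variant cited in the text, whose details are not given here. The randomization helper draws each read's allele with probability 0.5, as in the simulation procedure; how the FDR is then computed from the randomized calls is not specified in the text, so the sketch stops at producing the null counts.

```python
import math
import random

MIN_READS = 20      # minimum RNA-Seq reads per SNV, as stated in the text
FDR_CUTOFF = 0.05   # 5% FDR cutoff, as stated in the text


def allelic_ratio(ref_reads, alt_reads):
    """Reads carrying the reference allele divided by all reads covering the SNV."""
    return ref_reads / (ref_reads + alt_reads)


def chi_square_pvalue(ref_reads, alt_reads):
    """Chi-square goodness-of-fit test (1 d.f.) against an expected ratio of 0.5."""
    expected = (ref_reads + alt_reads) / 2.0
    stat = ((ref_reads - expected) ** 2 + (alt_reads - expected) ** 2) / expected
    # With one degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(pvalues):
    """Standard BH adjusted p-values (assumption: not the paper's modified variant)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_ase(snv_counts):
    """Return the set of SNV ids with significant ASE at the FDR cutoff."""
    tested = {k: v for k, v in snv_counts.items() if sum(v) >= MIN_READS}
    ids = list(tested)
    pvals = [chi_square_pvalue(*tested[k]) for k in ids]
    qvals = benjamini_hochberg(pvals)
    return {k for k, q in zip(ids, qvals) if q <= FDR_CUTOFF}


def randomize_counts(snv_counts, seed=0):
    """Null data: reassign every read at each SNV to either allele with p = 0.5."""
    rng = random.Random(seed)
    null_counts = {}
    for snv, (ref, alt) in snv_counts.items():
        total = ref + alt
        ref_null = sum(rng.random() < 0.5 for _ in range(total))
        null_counts[snv] = (ref_null, total - ref_null)
    return null_counts


# Toy counts (made up for illustration): (reference reads, alternative reads).
counts = {"snv_a": (40, 10), "snv_b": (26, 24), "snv_c": (9, 3)}
significant = call_ase(counts)              # snv_c is skipped: fewer than 20 reads
null_calls = call_ase(randomize_counts(counts))
```

In this toy input only `snv_a` passes the cutoff: its 40:10 split gives a chi-square statistic of 18, whereas `snv_b` at 26:24 is close to the expected 0.5 ratio. Running `call_ase` on the randomized counts gives the calls expected under equal allelic expression, which is the quantity a simulation-based FDR estimate would build on.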
