
Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing

by Graham A Heap, Jennie H M Yang, Kate Downes, Barry C Healy, Karen A Hunt, Nicholas Bockett, Lude Franke, Patrick C Dubois, Charles A Mein, Richard J Dobson, Thomas J Albert, Matthew J Rodesch, David G Clayton, John A Todd, David A Van Heel, Vincent Plagnol
Human Molecular Genetics (2010)

Abstract

Many disease-associated variants identified by genome-wide association (GWA) studies are expected to regulate gene expression. Allele-specific expression (ASE) quantifies transcription from both haplotypes using individuals heterozygous at tested SNPs. We performed deep human transcriptome-wide resequencing (RNA-seq) for ASE analysis and expression quantitative trait locus discovery. We resequenced double poly(A)-selected RNA from primary CD4+ T cells (n = 4 individuals, both activated and untreated conditions) and developed tools for paired-end RNA-seq alignment and ASE analysis. We generated an average of 20 million uniquely mapping 45 base reads per sample. We obtained sufficient read depth to test 1371 unique transcripts for ASE. Multiple biases inflate the false discovery rate which we estimate to be 50% for random SNPs. However, after controlling for these biases and considering the subset of SNPs that pass HapMap QC, 4.6% of heterozygous SNP-sample pairs show evidence of imbalance (P < 0.001). We validated four findings by both bacterial cloning and Sanger sequencing assays. We also found convincing evidence for allelic imbalance at multiple reporter exonic SNPs in CD6 for two samples heterozygous at the multiple sclerosis-associated variant rs17824933, linking GWA findings with variation in gene expression. Finally, we show in CD4+ T cells from a further individual that high-throughput sequencing of genomic DNA and RNA-seq following enrichment for targeted gene sequences by sequence capture methods offers an unbiased means to increase the read depth for transcripts of interest, and therefore a method to investigate the regulatory role of many disease-associated genetic variants.

Graham A. Heap¹, Jennie H.M. Yang², Kate Downes², Barry C. Healy², Karen A. Hunt¹, Nicholas Bockett¹, Lude Franke¹, Patrick C. Dubois¹, Charles A. Mein³, Richard J. Dobson³, Thomas J. Albert⁴, Matthew J. Rodesch⁴, David G. Clayton², John A. Todd², David A. van Heel¹,† and Vincent Plagnol²,*,†

¹Centre for Digestive Diseases, Blizard Institute of Cell and Molecular Science, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London E1 2AT, UK
²Department of Medical Genetics, Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory, Cambridge Institute for Medical Research, University of Cambridge, Cambridge CB2 0XY, UK
³Genome Centre, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, UK
⁴Roche NimbleGen, 504 S. Rosa Rd. Madison, WI 35393

Received May 14, 2009; Revised September 17, 2009; Accepted October 9, 2009

†These authors contributed equally to this work.
*To whom correspondence should be addressed. Tel: +44 1223762107; Fax: +44 1223762102; Email: vincent.plagnol@cimr.cam.ac.uk

© The Author 2009. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Human Molecular Genetics, 2010, Vol. 19, No. 1 122–134 doi:10.1093/hmg/ddp473 Advance Access published on October 13, 2009

INTRODUCTION

Genome-wide association (GWA) studies using single nucleotide polymorphism (SNP) maps have revolutionized the mapping of common genetic loci determining susceptibility to a wide range of common, multifactorial disorders (1), in particular autoimmune diseases (2). The next steps to follow up on these findings are the identification of particular
candidate variants and haplotypes, and the investigation of the molecular effects of these genetic variants. Because current evidence suggests that only a small fraction of the causal loci consists of variants (non-synonymous SNPs, copy-number variants or indels) directly affecting the protein amino-acid sequence, we expect a large fraction of the loci to have a regulatory role on gene expression via effects on transcription, message stability and splicing. To investigate the potential effects of candidate causal variants and haplotypes on gene regulation, researchers have been correlating SNPs with inherited gene expression, known as expression quantitative trait loci (eQTLs). The combination of genome-wide genotyping with quantification of mRNA transcripts using microarray technology in sufficiently large cohorts has already demonstrated the widespread presence of eQTLs in the human genome (3–7). Most of these studies (3–5), however, used lymphoblastoid cell lines immortalized using Epstein-Barr virus and relied on observing differences between individuals despite the large inter-individual variability of gene expression measurements that is not explained by cis genetic variation, in addition to the limited accuracy of hybridization-based gene expression assays. This high variability generated by environmental factors and additional non-measured genetic or epigenetic variability significantly reduces the statistical power for eQTL discovery. Therefore, measurement of expression levels across multiple individuals may be so noisy that reliable correlations between SNP alleles and gene expression levels cannot always be demonstrated when the difference of expression between haplotypes is small (less than 1.3-fold). Moreover, cell lines may not be representative of in vivo biology and may introduce even greater variability (8–10), and, therefore, gene expression analyses using purified primary cell populations are urgently required (6–7,11).
An alternative experimental design well suited to address these limitations is allele-specific expression (ASE) analysis. This approach quantifies (un)equal transcription (or splicing) from the two alleles or haplotypes using RNA samples from individuals who are heterozygous at the eQTL SNP of interest. The elegant ASE approach has the major advantage of assessing expression within an individual rather than across subjects, thereby avoiding major sources of error and variation. In parallel, recent advances in high-throughput resequencing technologies have enabled highly quantitative sequencing-based analysis of human transcriptomes [RNA-seq (10,12,13)]. Because these techniques separately resequence both haplotypes, they have the potential to be used for the quantification of allelic imbalance, provided that a heterozygous SNP which can be used as a marker for each haplotype exists in the transcript of interest. The potential of this method for ASE analysis has been demonstrated in pooled cDNA samples and human cell lines (14). Here, we extend this approach to eight independently sequenced human poly(A)-selected transcriptomes obtained from primary cells from healthy donors using high-throughput paired-end (PE) resequencing. In the context of the recently identified shared pathways between multiple autoimmune disorders (2,15) that motivated this study, many of the most relevant genes in regions identified by GWA studies are immune genes that are highly expressed in CD4+ T cells. This observation suggested the use of primary CD4+ T cells in the current ASE study, thus illustrating the potential of this approach to identify regulatory effects in purified primary cell subsets.
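The within-individual comparison described above can be illustrated with a minimal sketch. It assumes that reads carrying each allele at a heterozygous reporter SNP have already been counted, and it tests the counts against a 50:50 expectation with an exact two-sided binomial test. The function names and the example counts are hypothetical, and this simple test is an illustration only: it is not necessarily the test used in the study and it ignores the mapping and other biases that the abstract warns about.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial P-value: twice the smaller tail, capped at 1."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    lower_tail = sum(pmf[: k + 1])  # P(X <= k)
    upper_tail = sum(pmf[k:])       # P(X >= k)
    return min(1.0, 2 * min(lower_tail, upper_tail))


def allelic_imbalance(ref_reads, alt_reads):
    """Return (reference-allele fraction, P-value against a 50:50 expectation).

    ref_reads and alt_reads are hypothetical per-allele read counts at one
    heterozygous SNP in one sample.
    """
    depth = ref_reads + alt_reads
    return ref_reads / depth, binomial_two_sided_p(ref_reads, depth)


# Hypothetical counts: balanced expression versus a strong imbalance.
for ref, alt in [(25, 25), (40, 10)]:
    fraction, p_value = allelic_imbalance(ref, alt)
    print(f"ref={ref} alt={alt} fraction={fraction:.2f} P={p_value:.2e}")
```

Because both alleles are measured in the same RNA sample, environmental and trans-acting effects are shared between them, which is why a simple count comparison can be informative at all.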
RESULTS

Data description

We used Illumina Genome Analyzer II (GAII) high-throughput resequencing of cDNA libraries obtained from poly(A)-purified mRNA from four individuals analysed under T-cell activation (stimulated) or unstimulated conditions (see Materials and Methods), resulting in a total of eight samples. We obtained 45 bp reads, the majority of which are paired end (i.e. containing reads from both the 3′ and 5′ ends of a 250 bp fragment, see Table 1), that were mapped to a transcriptome reference sequence set specifically constructed for PE RNA-seq (see Materials and Methods). This reference set includes a spliced transcript for each annotated gene (Ensembl CCDS), as well as additional sequences for introns and non-standard splice junctions. A full version of the reference gDNA genome with annotated gene regions masked was added to enable: (i) capture of transcribed, but not annotated, chromosome regions; (ii) detection of gDNA contamination of the mRNA preparation; and, importantly, (iii) assessment of repetitive sequence. Only reads mapping with high confidence to a unique location in our reference sequence set (defined as quality reads, see Materials and Methods) were included in this study. Owing to the complex nature of the RNA-seq reference genome, taking advantage of PE sequence reads relies on the ability of the mapping algorithm to map 'chimeric' fragments: for example, the first read of a pair may map to a non-standard exon–exon junction sequence and the second read to the main spliced transcript. An algorithm implementing this feature was provided by the novoalign (www.novocraft.com) software package, which we used to align resequencing reads to our reference set. Another useful feature provided by novoalign is the ability to set a lower penalty for alignment when a 'chimeric' paired read maps to two sequences that are part of the same gene.
Transcript coverage

In the absence of experimental biases, the ability to detect an allelic imbalance using ASE depends on two parameters: the strength of the allelic imbalance and the read depth at the reporter heterozygous SNP. We analytically computed the read depth required to demonstrate allelic imbalance for different allelic ratios. Power calculations (Fig. 1) show that for a read depth of 50 and a 67:33 allelic imbalance, which corresponds to an average of one cycle difference in a qPCR experiment between individuals homozygous at both alleles (a two-fold difference), the probability of observing a P-value more significant than 0.001 is 19%. Therefore, to remove SNPs providing almost no power to detect allelic imbalance, we only tested for ASE SNPs with read depth ≥50. While this approach is not limited to previously known SNPs, we first tested 589 673 dbSNPs for ASE (obtained from Ensembl release 52) and located in annotated spliced
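The power calculation described under Transcript coverage can be approximated with a short sketch. A two-fold expression difference between haplotypes corresponds to an allelic fraction of 2/(1 + 2) ≈ 0.67, which is the 67:33 imbalance quoted in the text. The code below assumes an exact two-sided binomial test against a 50:50 null and no experimental bias; the function names are hypothetical and the authors' analytical method may differ in detail, so the printed values should be read as indicative rather than as a reproduction of Fig. 1.

```python
from math import comb


def binom_pmf(n, p):
    """Probability mass function of Binomial(n, p) as a list indexed by k."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def ase_power(depth, allelic_fraction, alpha=0.001):
    """Probability that an exact two-sided binomial test against 50:50
    gives P < alpha when the true allelic fraction is `allelic_fraction`."""
    null = binom_pmf(depth, 0.5)
    alt = binom_pmf(depth, allelic_fraction)
    power = 0.0
    for k in range(depth + 1):
        # Two-sided P-value under the null: twice the smaller tail, capped at 1.
        p_value = min(1.0, 2 * min(sum(null[: k + 1]), sum(null[k:])))
        if p_value < alpha:
            power += alt[k]  # this read count would be called imbalanced
    return power


# Power at a 67:33 imbalance for several read depths.
for depth in (20, 50, 100, 200):
    print(depth, round(ase_power(depth, 0.67), 3))
```

Under these assumptions the power at a read depth of 50 comes out at a little under one in five, in line with the 19% quoted above, and it rises steeply with depth, which is the rationale for discarding low-coverage SNPs.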
