Systematic inference of copy-number genotypes from personal genome sequencing data reveals extensive olfactory receptor gene content diversity


Abstract

Copy-number variations (CNVs) are widespread in the human genome, but comprehensive assignments of integer locus copy-numbers (i.e., copy-number genotypes) that, for example, enable discrimination of homozygous from heterozygous CNVs, have remained challenging. Here we present CopySeq, a novel computational approach with an underlying statistical framework that analyzes the depth-of-coverage of high-throughput DNA sequencing reads, and can incorporate paired-end and breakpoint junction analysis based CNV-analysis approaches, to infer locus copy-number genotypes. We benchmarked CopySeq by genotyping 500 chromosome 1 CNV regions in 150 personal genomes sequenced at low-coverage. The assessed copy-number genotypes were highly concordant with our performed qPCR experiments (Pearson correlation coefficient 0.94), and with the published results of two microarray platforms (95-99% concordance). We further demonstrated the utility of CopySeq for analyzing gene regions enriched for segmental duplications by comprehensively inferring copy-number genotypes in the CNV-enriched >800 olfactory receptor (OR) human gene and pseudogene loci. CopySeq revealed that OR loci display an extensive range of locus copy-numbers across individuals, with zero to two copies in some OR loci, and two to nine copies in others. Among genetic variants affecting OR loci we identified deleterious variants including CNVs and SNPs affecting ~15% and ~20% of the human OR gene repertoire, respectively, implying that genetic variants with a possible impact on smell perception are widespread. Finally, we found that for several OR loci the reference genome appears to represent a minor-frequency variant, implying a necessary revision of the OR repertoire for future functional studies.
CopySeq can ascertain genomic structural variation in specific gene families as well as at a genome-wide scale, where it may enable the quantitative evaluation of CNVs in genome-wide association studies involving high-throughput sequencing. © 2010 Waszak et al.

Figures

  • Figure 1. Schematic illustration of CopySeq. A. ‘Locus selection’, i.e., definition and selection of loci of interest for copy-number genotyping. B. ‘Mappability assessment’, i.e., construction of k-mer mappability locus maps. Sequence sub-stretches not uniquely mappable by k-mers are identified in each locus (represented by red blocks) and masked (i.e., excluded from further analysis). C. ‘Read-mapping’, by default carried out with MAQ [24] (other read-mappers, such as BWA [23], can optionally be applied). D. ‘Copy-number genotyping’: The locus-specific read-depth is determined, and the locus-specific ‘read-depth ratio’ computed and corrected both for the locus-specific k-mer mappability as well as for G+C-content bias (see Materials and Methods). A Gaussian classifier infers locus copy numbers by comparing locus-specific read-depth ratios with read-depth ratio distributions which are expected for different copy-number genotypes (distributions for the copy-number genotypes 0, 1, 2, 3, 4, and 5 are indicated with different colors). E. Copy-number genotypes are reported. doi:10.1371/journal.pcbi.1000988.g001
  • Figure 2. Copy-number genotyping results in a chromosome 1 CNV set. A. Copy-number genotyping concordance between CopySeq- and microarray-based [14] copy-number genotypes inferred for 99 CNVs on chromosome 1 in 118 individuals, using different CNV size cutoffs. Plotted circles represent the total number of high-confidence genotypes, with the largest circle corresponding to >10,000 copy-number genotypes and the smallest to 348 copy-number genotypes. As expected, the genotyping concordance increases with higher CNV size cutoffs. B–C. Copy-number genotyping results for chromosome 1 example CNVs across 150 individuals, i.e., a bi-allelic deletion (chr1:150,822,330–150,853,218; see B) as well as a bi-allelic duplication (chr1:164,451,105–164,460,994; see C). Copy-number genotypes inferred by CopySeq are indicated with different colors: ‘0’, red; ‘1’, orange; ‘2’, grey; ‘3’, blue; ‘4’, purple. Individuals have been arranged according to population: squares, CEU; triangles, CHB+JPT; circles, YRI. The scaled read-depth ratio (indicated on the y-axis) has been calculated by multiplying the read-depth ratio by two. doi:10.1371/journal.pcbi.1000988.g002
  • Figure 3. Copy-number genotype inference in olfactory receptor (OR) loci across 150 individuals. A. Distribution of locus-specific read-depth measurements in 808 OR loci. Altogether 121,200 data points are depicted (808 loci times 150 samples). Points relate the GC-adjusted read-depth to the expected read-depth, which is estimated based on the k-mer mappability of a locus and the genomic sequencing coverage of a sample. CopySeq copy-number genotypes are indicated by colors (bottom to top): ‘0’, red; ‘1’, orange; ‘2’, grey; ‘3’, blue; ‘4’, purple; ‘5’, green; ‘6’, brown; ‘7’, yellow; ‘8’, light blue; ‘9’, black. B–D. Dissecting a complex CNV region with CopySeq. The displayed region (chr11:4,921,968–4,930,581) harbors a multi-allelic CNV involving both a deletion and a duplication. The deletion results in an OR51A2–OR51A4 fusion-gene [4]. Read-depths are shown on the left and the inferred locus-structure on the right. CopySeq was carried out in conjunction with breakpoint-junction analysis [26], generating the following copy-number genotypes. NA19138: ‘2’ for OR51A4, ‘2’ for OR51A2, ‘0’ for the fusion-gene (B); NA12716: ‘0’, ‘0’, ‘2’ (C); NA19172: ‘2’, ‘4’, ‘0’ (D). Orange and blue boxes indicate open-reading frames (ORFs), and orange/blue lines denote the respective loci (with 3′ and 5′ regions). Both ORFs are on the reverse strand of the reference genome. The gene fusion occurred near the ORFs’ 5′-end within a sequence stretch where both share extensive homology (thus, no reads map to this stretch uniquely). E. Copy-number genotype map of OR loci in 150 individuals. Each bar represents the frequency of a copy-number genotype (y-axis) at a particular OR locus (x-axis). Colors indicate copy-number genotype frequencies (color scheme is on the right). doi:10.1371/journal.pcbi.1000988.g003
  • Figure 4. Distribution of inter-individual copy-number differences in autosomal OR loci. A. Commonly variable loci account for the majority of inter-individual OR copy number differences. OR loci were ranked by the frequency at which they displayed a copy-number genotype other than ‘2’ (indicating a CNV), followed by iterative exclusion of the rarest CNVs (i.e., first the loci that most rarely vary in copy-number were excluded, then the more common ones). Pair-wise copy-number differences between all samples were calculated, and average copy-number differences across all pair-wise comparisons determined. The y-axis indicates the inter-individual copy-number difference as a percentage of the maximum average copy-number difference, and the x-axis indicates the percentage of all copy-number variable (polymorphic) OR loci for each OR frequency rank step. For example, ~15% of the OR loci account for ~80% of the inter-individual OR copy-number differences between any two samples. B. Distribution of inter-individual OR copy number differences computed separately for each pair of samples. Pair-wise copy-number differences were computed as quantitative differences between copy-number genotype values summed up over all OR loci between pairs of samples (x-axis). (In this regard, for example, the difference for a given locus is 2, if in one sample a copy-number genotype of ‘0’ and in the other a copy-number genotype of ‘2’ is inferred.) Blue solid line: OR genes; red solid line: OR pseudogenes; red dotted line: OR pseudogenes, excluding the CNV-enriched OR7E family. doi:10.1371/journal.pcbi.1000988.g004
  • Figure 5. Heritability of CNVs in a parent-offspring trio of European ancestry. A. Chromosomal origin of the largest human OR genomic cluster and pedigree of the European family. B–D. CNV inheritance, indicated in terms of scaled read-depth ratios and inferred copy-number genotypes among 96 bi-allelic OR loci located in the largest human OR cluster (11@55.6; see nomenclature in http://genome.weizmann.ac.il/horde/; chr11:54,842,512–56,344,668). The x-axis represents genomic coordinates, and individual OR positions are marked by ticks. The copy-number genotypes identified in NA12891 (B), NA12892 (C), and NA12878 (D), were inferred based on low-coverage genomic data (Table S1) and are consistent with Mendelian segregation. Bi-allelic CNVs were classified according to copy-number genotypes identified in the European (CEU) individuals. Copy-number genotypes are color-coded: ‘1’, orange; ‘2’, grey; ‘3’, blue. doi:10.1371/journal.pcbi.1000988.g005
  • Figure 6. Concordance of copy-number genotypes inferred in OR loci with microarray-based calls and qPCR experiments. A. Comparison of >5,000 copy-number genotypes inferred in OR loci, using CopySeq, with microarray-based [14] copy-number genotypes. The comparison is based on 46 OR loci, assessed in 118 individuals. Circle size indicates the number of comparisons falling into a certain bin (the largest circle, representing >3,000 copy-number genotypes, corresponds to concordant copy-number genotype calls of the homozygous reference allele, i.e., copy-number = ‘2’). Blue lines denote the function y = x and have been included to facilitate evaluation of the data. B. Validation of 50 copy-number genotypes in 5 OR loci × 10 samples by qPCR. Experimentally determined qPCR values are expressed in terms of adjusted Ct values, which were estimated as described in the Materials and Methods section. doi:10.1371/journal.pcbi.1000988.g006
  • Figure 7. Analysis of ‘young’ and ‘ancient’ ORs. The figure displays the distribution of sequence identities with the most similar (‘nearest’) paralog for non-variable, bi-allelic, and multi-allelic OR loci. Each point represents the sequence identity of an OR to its nearest paralog (y-axis), and the type of locus (non-variable, NV; bi-allelic, BI; multi-allelic, MU). Green points: OR locus lacks a one-to-one ortholog in the chimpanzee genome; blue points: OR locus has a one-to-one ortholog in the chimpanzee genome (as assessed by comparing human and chimpanzee ORFs at the DNA level using BLAST, and classifying as one-to-one orthologs sequences displaying mutually highest sequence identity). Blue and green rhomboids represent the corresponding distribution average; red rhomboids represent averages for NV, BI, and MU. Rhomboid error bars represent 95% confidence intervals of the average. doi:10.1371/journal.pcbi.1000988.g007
  • Figure 8. Analysis of the population distribution of bi-allelic OR loci reveals shared and population-specific CNVs. Venn diagram of 265 bi-allelic OR loci, which were distributed according to their recorded presence in the three populations analyzed (CEU, CHB+JPT, and YRI). Numbers in parentheses indicate OR loci in which a single copy-number genotype other than ‘2’ (indicating a CNV) was observed across 150 individuals; these loci may display rare, rather than population-specific CNVs. doi:10.1371/journal.pcbi.1000988.g008
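The genotyping step described in Figure 1D and the scaled read-depth ratio of Figure 2 can be illustrated with a minimal Python sketch. It is not the CopySeq implementation: the function names, the equal-prior Gaussian model, the shared standard deviation `sd=0.25`, and the assumption that the expected scaled ratio for copy number c is exactly c are all hypothetical choices made for illustration. The mappability and G+C corrections are assumed to have been applied already to the depths passed in.

```python
import math


def scaled_read_depth_ratio(observed_depth, expected_depth):
    """Read-depth ratio multiplied by two, so a diploid locus is near 2.

    Both depths are assumed to be already corrected for k-mer mappability
    and G+C content (those corrections are not modelled here).
    """
    if expected_depth <= 0:
        raise ValueError("expected_depth must be positive")
    return 2.0 * observed_depth / expected_depth


def classify_copy_number(scaled_ratio, max_copies=5, sd=0.25):
    """Pick the copy number whose Gaussian best explains the scaled ratio.

    Hypothetical model: one Gaussian per copy number c in 0..max_copies,
    centred at c, with a shared standard deviation and equal priors.
    Returns (copy_number, posterior_probability).
    """
    # Log-likelihoods avoid numerical underflow for ratios far from any mean.
    log_norm = math.log(sd * math.sqrt(2.0 * math.pi))
    log_liks = [
        -0.5 * ((scaled_ratio - c) / sd) ** 2 - log_norm
        for c in range(max_copies + 1)
    ]
    best = max(range(len(log_liks)), key=log_liks.__getitem__)
    # Posterior under equal priors, computed relative to the best class.
    weights = [math.exp(ll - log_liks[best]) for ll in log_liks]
    posterior = 1.0 / sum(weights)
    return best, posterior


if __name__ == "__main__":
    ratio = scaled_read_depth_ratio(observed_depth=15.0, expected_depth=30.0)
    print(ratio, classify_copy_number(ratio))
```

A low posterior (for example, a scaled ratio halfway between two integers) could be used to flag a genotype as low-confidence, in the spirit of the "high-confidence genotypes" mentioned in Figure 2, although the confidence criterion used by the authors is not stated in this excerpt.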

Waszak, S. M., Hasin, Y., Zichner, T., Olender, T., Keydar, I., Khen, M., … Korbel, J. O. (2010). Systematic inference of copy-number genotypes from personal genome sequencing data reveals extensive olfactory receptor gene content diversity. PLoS Computational Biology, 6(11). https://doi.org/10.1371/journal.pcbi.1000988