
A second generation human haplotype map of over 3.1 million SNPs.

by Kelly A Frazer, Dennis G Ballinger, David R Cox, David A Hinds, Laura L Stuve, Richard A Gibbs, John W Belmont, Andrew Boudreau, Paul Hardenbol, Suzanne M Leal, Shiran Pasternak, David A Wheeler, Thomas D Willis, Fuli Yu, Huanming Yang, Changqing Zeng, Yang Gao, Haoran Hu, Weitao Hu, Chaohua Li, Wei Lin, Siqi Liu, Hao Pan, Xiaoli Tang, Jian Wang, Wei Wang, Jun Yu, Bo Zhang, Qingrun Zhang, Hongbin Zhao, Hui Zhao, Jun Zhou, Stacey B Gabriel, Rachel Barry, Brendan Blumenstiel, Amy Camargo, Matthew Defelice, Maura Faggart, Mary Goyette, Supriya Gupta, Jamie Moore, Huy Nguyen, Robert C Onofrio, Melissa Parkin, Jessica Roy, Erich Stahl, Ellen Winchester, Liuda Ziaugra, David Altshuler, Yan Shen, Zhijian Yao, Wei Huang, Xun Chu, Yungang He, Li Jin, Yangfan Liu, Yayun Shen, Weiwei Sun, Haifeng Wang, Yi Wang, Ying Wang, Xiaoyan Xiong, Liang Xu, Mary M Y Waye, Stephen K W Tsui, Hong Xue, J Tze-Fei Wong, Luana M Galver, Jian-Bing Fan, Kevin Gunderson, Sarah S Murray, Arnold R Oliphant, Mark S Chee, Alexandre Montpetit, Fanny Chagnon, Vincent Ferretti, Martin Leboeuf, Jean-François Olivier, Michael S Phillips, Stéphanie Roumy, Clémentine Sallée, Andrei Verner, Thomas J Hudson, Pui-Yan Kwok, Dongmei Cai, Daniel C Koboldt, Raymond D Miller, Ludmila Pawlikowska, Patricia Taillon-Miller, Ming Xiao, Lap-Chee Tsui, William Mak, You Qiang Song, Paul K H Tam, Yusuke Nakamura, Takahisa Kawaguchi, Takuya Kitamoto, Takashi Morizono, Atsushi Nagashima, Yozo Ohnishi, Akihiro Sekine, Toshihiro Tanaka, Tatsuhiko Tsunoda, Panos Deloukas, Christine P Bird, Marcos Delgado, Emmanouil T Dermitzakis, Rhian Gwilliam, Sarah Hunt, Jonathan Morrison, Don Powell, Barbara E Stranger, Pamela Whittaker, David R Bentley, Mark J Daly, Paul I W De Bakker, Jeff Barrett, Yves R Chretien, Julian Maller, Steve McCarroll, Nick Patterson, Itsik Pe'er, Alkes Price, Shaun Purcell, Daniel J Richter, Pardis Sabeti, Richa Saxena, Stephen F Schaffner, Pak C Sham, Patrick Varilly, David Altshuler, Lincoln D Stein, 
Lalitha Krishnan, Albert Vernon Smith, Marcela K Tello-Ruiz, Gudmundur A Thorisson, Aravinda Chakravarti, Peter E Chen, David J Cutler, Carl S Kashuk, Shin Lin, Gonçalo R Abecasis, Weihua Guan, Yun Li, Heather M Munro, Zhaohui Steve Qin, Daryl J Thomas, Gilean McVean, Adam Auton, Leonardo Bottolo, Niall Cardin, Susana Eyheramendy, Colin Freeman, Jonathan Marchini, Simon Myers, Chris Spencer, Matthew Stephens, Peter Donnelly, Lon R Cardon, Geraldine Clarke, David M Evans, Andrew P Morris, Bruce S Weir, Tatsuhiko Tsunoda, James C Mullikin, Stephen T Sherry, Michael Feolo, Andrew Skol, Houcan Zhang, Changqing Zeng, Hui Zhao, Ichiro Matsuda, Yoshimitsu Fukushima, Darryl R Macer, Eiko Suda, Charles N Rotimi, Clement A Adebamowo, Ike Ajayi, Toyin Aniagwu, Patricia A Marshall, Chibuzor Nkwodimmah, Charmaine D M Royal, Mark F Leppert, Missy Dixon, Andy Peiffer, Renzong Qiu, Alastair Kent, Kazuto Kato, Norio Niikawa, Isaac F Adewole, Bartha M Knoppers, Morris W Foster, Ellen Wright Clayton, Jessica Watkin, Richard A Gibbs, John W Belmont, Donna Muzny, Lynne Nazareth, Erica Sodergren, George M Weinstock, David A Wheeler, Imtaz Yakub, Stacey B Gabriel, Robert C Onofrio, Daniel J Richter, Liuda Ziaugra, Bruce W Birren, Mark J Daly, David Altshuler, Richard K Wilson, Lucinda L Fulton, Jane Rogers, John Burton, Nigel P Carter, Christopher M Clee, Mark Griffiths, Matthew C Jones, Kirsten McLay, Robert W Plumb, Mark T Ross, Sarah K Sims, David L Willey, Zhu Chen, Hua Han, Le Kang, Martin Godbout, John C Wallenburg, Paul L'Archevêque, Guy Bellemare, Koji Saeki, Hongguang Wang, Daochang An, Hongbo Fu, Qing Li, Zhen Wang, Renwu Wang, Arthur L Holden, Lisa D Brooks, Jean E McEwen, Mark S Guyer, Vivian Ota Wang, Jane L Peterson, Michael Shi, Jack Spiegel, Lawrence M Sung, Lynn F Zacharia, Francis S Collins, Karen Kennedy, Ruth Jamieson, John Stewart
Nature (2007)

Abstract

We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs) genotyped in 270 individuals from four geographically diverse populations and includes 25–35% of common SNP variation in the populations surveyed. The map is estimated to capture untyped common variation with an average maximum r² of between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide genotyping products captures common Phase II SNPs with an average maximum r² of up to 0.8 in African and up to 0.95 in non-African populations, and that potential gains in power in association studies can be obtained through imputation. These data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10–30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased differentiation at non-synonymous, compared to synonymous, SNPs, resulting from systematic differences in the strength or efficacy of natural selection between populations.
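The "average maximum r²" quoted in the abstract can be illustrated with a small calculation. The sketch below is only an illustration under stated assumptions, not the consortium's pipeline: it assumes phased haplotypes coded 0/1, and the function names (`r_squared`, `max_r_squared`, `average_max_r_squared`) and the toy data are hypothetical. For each untyped SNP it takes the best r² against any typed SNP, then averages those maxima.

```python
from typing import Sequence


def r_squared(a: Sequence[int], b: Sequence[int]) -> float:
    """Squared correlation between two biallelic SNPs on phased haplotypes (0/1 coding)."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    p_a = sum(a) / n                                  # frequency of allele 1 at SNP a
    p_b = sum(b) / n                                  # frequency of allele 1 at SNP b
    p_ab = sum(x & y for x, y in zip(a, b)) / n       # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # assumption: treat r² as 0 when either SNP is monomorphic
    d = p_ab - p_a * p_b                              # LD coefficient D
    return d * d / denom


def max_r_squared(target: Sequence[int], typed: Sequence[Sequence[int]]) -> float:
    """Best r² between one untyped SNP and any SNP in the typed set."""
    return max((r_squared(target, t) for t in typed), default=0.0)


def average_max_r_squared(untyped: Sequence[Sequence[int]],
                          typed: Sequence[Sequence[int]]) -> float:
    """Mean over untyped SNPs of their maximum r² to the typed set."""
    if not untyped:
        raise ValueError("need at least one untyped SNP")
    return sum(max_r_squared(u, typed) for u in untyped) / len(untyped)


if __name__ == "__main__":
    # Toy data: 8 haplotypes, 2 typed SNPs, 2 untyped SNPs (all values invented).
    typed = [[0, 0, 1, 1, 0, 1, 0, 1],
             [1, 1, 1, 0, 0, 0, 0, 0]]
    untyped = [[0, 0, 1, 1, 0, 1, 0, 1],   # perfect proxy for the first typed SNP
               [1, 1, 0, 0, 0, 0, 0, 0]]   # only partly captured by the second
    print(f"{average_max_r_squared(untyped, typed):.3f}")  # → 0.778
```

In a real analysis the search for the best proxy would normally be restricted to typed SNPs within some physical window; that restriction is omitted here for brevity.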


The International HapMap Consortium*

Advances made possible by the Phase I haplotype map

The International HapMap Project was launched in 2002 with the aim of providing a public resource to accelerate medical genetic research.
The objective was to genotype at least one common SNP every 5 kilobases (kb) across the euchromatic portion of the genome in 270 individuals from four geographically diverse populations (refs 1, 2): 30 mother–father–adult child trios from the Yoruba in Ibadan, Nigeria (abbreviated YRI); 30 trios of northern and western European ancestry living in Utah from the Centre d'Etude du Polymorphisme Humain (CEPH) collection (CEU); 45 unrelated Han Chinese individuals in Beijing, China (CHB); and 45 unrelated Japanese individuals in Tokyo, Japan (JPT). The YRI samples and the CEU samples each form an analysis panel; the CHB and JPT samples together form an analysis panel. Approximately 1.3 million SNPs were genotyped in Phase I of the project, and a description of this resource was published in 2005 (ref. 3).

The initial HapMap Project data had a central role in the development of methods for the design and analysis of genome-wide association studies. These advances, alongside the release of commercial platforms for performing economically viable genome-wide genotyping, have led to a new phase in human medical genetics. Already, large-scale studies have identified novel loci involved in multiple complex diseases (refs 4, 5). In addition, the HapMap data have led to novel insights into the distribution and causes of recombination hotspots (refs 3, 6), the prevalence of structural variation (refs 7, 8) and the identity of genes that have experienced recent adaptive evolution (refs 3, 9). Because the HapMap cell lines are publicly available, many groups have been able to integrate their own experimental data with the genome-wide SNP data to gain new insight into copy-number variation (ref. 10), the relationship between classical human leukocyte antigen (HLA) types and SNP variation (ref. 11), and heritable influences on gene expression (refs 12–14).
The ability to combine genome-wide data on such diverse aspects of genetic variation with molecular phenotypes collected in the same samples provides a powerful framework to study the connection of DNA sequence to function.

In Phase II of the HapMap Project, a further 2.1 million SNPs were successfully genotyped on the same individuals. The resulting HapMap has an SNP density of approximately one per kilobase and is estimated to contain approximately 25–35% of all the 9–10 million common SNPs (minor allele frequency (MAF) ≥ 0.05) in the assembled human genome (that is, excluding gaps in the reference sequence alignment; see Supplementary Text 1), although this number shows extensive local variation. This paper describes the Phase II resource, its implications for genome-wide association studies and additional insights into the fine-scale structure of linkage disequilibrium, recombination and natural selection.

Construction of the Phase II HapMap

Most of the additional genotype data for the Phase II HapMap were obtained using the Perlegen amplicon-based platform (ref. 15). Briefly, this platform uses custom oligonucleotide arrays to type SNPs in DNA segmentally amplified via long-range polymerase chain reaction (PCR). Genotyping was attempted at 4,373,926 distinct SNPs, which corresponds, with exceptions (see Methods), to nearly all SNPs in dbSNP release 122 for which an assay could be designed. Additional submissions were included from the Affymetrix GeneChip Mapping Array 500K set, the Illumina HumanHap100 and HumanHap300 SNP assays, a set of ~11,000 non-synonymous SNPs genotyped by Affymetrix (ParAllele) and a set of ~4,500 SNPs within the extended major histocompatibility complex (MHC) (ref. 11). Genotype submissions were subjected to the same quality control (QC) filters as described previously (see Methods) and mapped to NCBI build 35 (University of California at Santa Cruz (UCSC) hg17) of the human genome.
The re-mapping of SNPs from Phase I of the project identified 21,177 SNPs that had an ambiguous position or some other feature indicative of low reliability; these are not included in the filtered Phase II data release. All genotype data are available from the HapMap Data Coordination Center (http://www.hapmap.org) and dbSNP (http://www.ncbi.nlm.nih.gov/SNP); analyses described in this paper refer to release 21a.

*Lists of participants and affiliations appear at the end of the paper. Vol 449 | 18 October 2007 | doi:10.1038/nature06258 | page 851 | ©2007 Nature Publishing Group

Three data sets are available: 'redundant unfiltered'
contains all genotype submissions, 'redundant filtered' contains all submissions that pass QC, and 'non-redundant filtered' contains a single QC+ submission for each SNP in each analysis panel.

The QC filters remove SNPs showing gross errors. However, it is also important to understand the magnitude and structure of more subtle genotyping errors among SNPs that pass QC. We therefore carried out a series of analyses to assess the influence of the long-range PCR amplicon structure on genotyping error, the concordance rates between genotype calls from different genotyping platforms and between those platforms and re-sequencing assays, as well as the rates of false monomorphism and mis-mapping of SNPs (see Supplementary Text 2, Supplementary Figs 1–3 and Supplementary Tables 1–4). We estimate that the average per-genotype accuracy is at least 99.5%. However, there are higher rates of missing data and genotype discrepancies at non-reference alleles, with some clustering of errors resulting from the amplicon design and a few incorrectly mapped SNPs.

Table 1 shows the numbers of SNPs attempted and converted to QC+ SNPs in each analysis panel (Supplementary Table 5 shows a breakdown by each major submission). Haplotypes and missing data were estimated for each analysis panel separately using both trio information and statistical methods based on the coalescent model (see Methods). To enable cross-population comparisons, a consensus data set was created consisting of 3,107,620 SNPs that were QC+ in all analysis panels and polymorphic in at least one analysis panel. The equivalent figure from Phase I was 931,340 SNPs. Unless stated otherwise, all analyses have been carried out on the consensus data set. An additional set of haplotypes was created for those SNPs in the consensus where a putative ancestral state could be assigned by comparison of the human alleles to the orthologous position in the chimpanzee and rhesus macaque genomes.
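The consensus data set described above amounts to a set intersection followed by a polymorphism check. The following is a minimal sketch under assumptions: the SNP identifiers, the dictionary layout, and the function names `consensus_snps` and `common_in_any` are all hypothetical, and per-panel minor allele frequencies are taken as given rather than estimated from genotypes.

```python
from typing import Dict, Set

PANELS = ("YRI", "CEU", "CHB+JPT")


def consensus_snps(qc_passed: Dict[str, Set[str]],
                   maf: Dict[str, Dict[str, float]]) -> Set[str]:
    """SNPs that passed QC in every panel and are polymorphic (MAF > 0) in at least one."""
    in_all_panels = set.intersection(*(qc_passed[p] for p in PANELS))
    return {snp for snp in in_all_panels
            if any(maf[p].get(snp, 0.0) > 0 for p in PANELS)}


def common_in_any(snps: Set[str],
                  maf: Dict[str, Dict[str, float]],
                  threshold: float = 0.05) -> Set[str]:
    """Subset of SNPs with MAF at or above the threshold in at least one panel."""
    return {snp for snp in snps
            if any(maf[p].get(snp, 0.0) >= threshold for p in PANELS)}


if __name__ == "__main__":
    # Invented toy input: four hypothetical SNP identifiers.
    qc_passed = {
        "YRI": {"snpA", "snpB", "snpC", "snpD"},
        "CEU": {"snpA", "snpB", "snpC", "snpD"},
        "CHB+JPT": {"snpA", "snpB", "snpD"},          # snpC failed QC in this panel
    }
    maf = {
        "YRI": {"snpA": 0.20, "snpB": 0.0, "snpC": 0.30, "snpD": 0.01},
        "CEU": {"snpA": 0.00, "snpB": 0.0, "snpC": 0.25, "snpD": 0.02},
        "CHB+JPT": {"snpA": 0.10, "snpB": 0.0, "snpD": 0.00},
    }
    consensus = consensus_snps(qc_passed, maf)
    print(sorted(consensus))                      # → ['snpA', 'snpD']
    print(sorted(common_in_any(consensus, maf)))  # → ['snpA']

    # The counts reported in Table 1 are consistent with this definition:
    # QC+ in three panels minus monomorphic in all three gives the consensus size.
    assert 3_643_433 - 535_813 == 3_107_620
```

The final assertion uses only figures from Table 1 and shows that the consensus count is the three-panel QC+ count less the SNPs that are monomorphic everywhere.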
The variation in SNP density within the Phase II HapMap is shown in Fig. 1. On average there are 1.14 genotyped polymorphic SNPs per kilobase (average spacing is 875 base pairs (bp)) and 98.6% of the assembled genome is within 5 kb of the nearest polymorphic SNP. Still, there is heterogeneity in genotyped SNP density at both broad (Fig. 1a) and fine (Fig. 1b) scales. Furthermore, there are systematic changes in genotyped SNP density around genomic features including genes (Fig. 1c).

The Phase II HapMap differs from the Phase I HapMap not only in SNP spacing, but also in minor allele frequency distribution and patterns of linkage disequilibrium (Supplementary Fig. 4). Because the criteria for choosing additional SNPs did not include consideration of SNP spacing or preferential selection for high MAF, the SNPs added in Phase II are, on average, more clustered and have lower MAF than the Phase I SNPs. Because MAF predictably influences the distribution of linkage disequilibrium statistics, the average r² at a given physical distance is typically lower in Phase II than in Phase I; conversely, the |D′| statistic is typically higher (data not shown). One notable consequence is that the Phase II HapMap includes a better representation of rare variation than the Phase I HapMap.

The increased resolution provided by Phase II of the project is illustrated in Fig. 2. Broadly, an additional SNP added to a region shows one of three patterns. First, it may be very similar in distribution to SNPs present in Phase I.
Second, it may provide detailed resolution of haplotype structure (for example, a group of chromosomes with identical local haplotypes in Phase I can be shown in Phase II to carry

Table 1 | Summary of Phase II HapMap data (release 21)

Phase    SNP categories                    YRI                CEU                CHB+JPT
I        Assays submitted                  1,304,199          1,344,616          1,306,125
         Passed QC                         1,177,312 (90%)    1,217,902 (91%)    1,187,800 (91%)
         Did not pass QC                   126,887 (10%)      126,714 (9%)       118,325 (9%)
           >20% missing                    82,463 (65%)       95,684 (76%)       78,323 (66%)
           >1 duplicate inconsistent       6,049 (5%)         5,126 (4%)         9,242 (8%)
           >1 mendelian error              18,916 (15%)       11,310 (9%)        N/A
           <0.001 Hardy–Weinberg P-value   10,265 (8%)        8,922 (7%)         13,722 (12%)
           Other failures                  19,345 (15%)       13,858 (11%)       20,674 (17%)
II       Assays submitted                  5,044,989          5,044,996          5,043,775
         Passed QC                         3,150,433 (62%)    3,204,709 (64%)    3,244,897 (64%)
         Did not pass QC                   1,894,556 (38%)    1,840,287 (36%)    1,798,878 (36%)
           >20% missing                    1,419,000 (75%)    1,398,166 (76%)    1,403,543 (78%)
           >1 duplicate inconsistent       0 (0%)             0 (0%)             6,617 (0%)
           >1 mendelian error              172,339 (9%)       127,923 (7%)       N/A
           <0.001 Hardy–Weinberg P-value   96,231 (5%)        82,268 (4%)        108,880 (6%)
           Other failures                  334,511 (18%)      337,906 (18%)      340,370 (19%)
Overall  Assays submitted                  6,349,188          6,389,612          6,349,900
         Passed QC                         4,327,745 (68%)    4,422,611 (69%)    4,432,697 (70%)
         Did not pass QC                   2,021,443 (32%)    1,967,001 (31%)    1,917,203 (30%)
           >20% missing                    1,501,463 (74%)    1,493,850 (76%)    1,481,866 (77%)
           >1 duplicate inconsistent       6,049 (0%)         5,126 (0%)         15,859 (1%)
           >1 mendelian error              191,255 (9%)       139,233 (7%)       N/A
           <0.001 Hardy–Weinberg P-value   106,496 (5%)       91,190 (5%)        122,602 (6%)
           Other failures                  353,856 (18%)      351,764 (18%)      361,044 (19%)
         Non-redundant (unique) SNPs       3,796,934          3,868,157          3,890,416
           Monomorphic                     861,299 (23%)      1,246,183 (32%)    1,410,152 (36%)
           Polymorphic                     2,935,635 (77%)    2,621,974 (68%)    2,480,264 (64%)

SNP categories                                                 All analysis panels
Unique QC-passed SNPs                                          4,000,107
Passed in one analysis panel                                   88,140 (2%)
Passed in two analysis panels                                  268,534 (7%)
Passed in three analysis panels (QC+3)                         3,643,433 (91%)
QC+3 and monomorphic across three analysis panels              535,813
QC+3 and polymorphic in at least one analysis panel            3,107,620
QC+3 and polymorphic in all three analysis panels              2,006,352
QC+3 and MAF ≥ 0.05 in at least one of three analysis panels   2,819,322
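The failure categories in Table 1 can be read as a per-SNP filter applied within each analysis panel. The sketch below shows one way such a filter might look; it is an assumption-laden illustration, not the project's QC code, whose exact definitions are in the Methods and are not reproduced in this excerpt. The `SnpSummary` structure and function names are hypothetical, and the Hardy–Weinberg P-value uses a one-degree-of-freedom chi-square approximation chosen here for simplicity.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class SnpSummary:
    """Per-SNP summary for one analysis panel (hypothetical structure)."""
    n_samples: int                 # samples with an attempted genotype
    n_missing: int                 # samples with no call
    duplicate_discrepancies: int   # inconsistent calls among duplicate samples
    mendelian_errors: int          # use 0 for panels without trios (CHB+JPT)
    n_aa: int                      # genotype counts, assumed to be from unrelated individuals
    n_ab: int
    n_bb: int


def hwe_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square (1 d.f.) approximation to the Hardy-Weinberg test P-value."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0                    # monomorphic: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 d.f.: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def qc_failures(s: SnpSummary) -> List[str]:
    """Return the Table 1 failure categories that this SNP triggers (empty list = passes)."""
    reasons = []
    if s.n_samples > 0 and s.n_missing / s.n_samples > 0.20:
        reasons.append(">20% missing")
    if s.duplicate_discrepancies > 1:
        reasons.append(">1 duplicate inconsistent")
    if s.mendelian_errors > 1:
        reasons.append(">1 mendelian error")
    if hwe_pvalue(s.n_aa, s.n_ab, s.n_bb) < 0.001:
        reasons.append("<0.001 Hardy-Weinberg P-value")
    return reasons


if __name__ == "__main__":
    good = SnpSummary(90, 2, 0, 0, n_aa=40, n_ab=40, n_bb=8)
    bad = SnpSummary(90, 30, 2, 0, n_aa=30, n_ab=0, n_bb=30)
    print(qc_failures(good))  # → []
    print(qc_failures(bad))
```

Because a SNP can fail on more than one criterion, the function returns every triggered category, which matches the way the percentages under "Did not pass QC" in Table 1 can sum to more than 100%.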

