A second generation human haplotype map of over 3.1 million SNPs.
- PubMed: 17943122
Abstract
We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs) genotyped in 270 individuals from four geographically diverse populations and includes 25–35% of common SNP variation in the populations surveyed. The map is estimated to capture untyped common variation with an average maximum r² of between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide genotyping products captures common Phase II SNPs with an average maximum r² of up to 0.8 in African and up to 0.95 in non-African populations, and that potential gains in power in association studies can be obtained through imputation. These data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10–30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased differentiation at non-synonymous, compared to synonymous, SNPs, resulting from systematic differences in the strength or efficacy of natural selection between populations.
The International HapMap Consortium*
Advances made possible by the Phase I haplotype map
The International HapMap Project was launched in 2002 with the aim of providing a public resource to accelerate medical genetic research. The objective was to genotype at least one common SNP every 5 kilobases (kb) across the euchromatic portion of the genome in 270 individuals from four geographically diverse populations (refs 1, 2): 30 mother–father–adult child trios from the Yoruba in Ibadan, Nigeria (abbreviated YRI); 30 trios of northern and western European ancestry living in Utah from the Centre d’Etude du Polymorphisme Humain (CEPH) collection (CEU); 45 unrelated Han Chinese individuals in Beijing, China (CHB); and 45 unrelated Japanese individuals in Tokyo, Japan (JPT). The YRI samples and the CEU samples each form an analysis panel; the CHB and JPT samples together form an analysis panel. Approximately 1.3 million SNPs were genotyped in Phase I of the project, and a description of this resource was published in 2005 (ref. 3).
The initial HapMap Project data had a central role in the development of methods for the design and analysis of genome-wide association studies. These advances, alongside the release of commercial platforms for performing economically viable genome-wide genotyping, have led to a new phase in human medical genetics. Already, large-scale studies have identified novel loci involved in multiple complex diseases (refs 4, 5). In addition, the HapMap data have led to novel insights into the distribution and causes of recombination hotspots (refs 3, 6), the prevalence of structural variation (refs 7, 8) and the identity of genes that have experienced recent adaptive evolution (refs 3, 9). Because the HapMap cell lines are publicly available, many groups have been able to integrate their own experimental data with the genome-wide SNP data to gain new insight into copy-number variation (ref. 10), the relationship between classical human leukocyte antigen (HLA) types and SNP variation (ref. 11), and heritable influences on gene expression (refs 12–14). The ability to combine genome-wide data on such diverse aspects of genetic variation with molecular phenotypes collected in the same samples provides a powerful framework to study the connection of DNA sequence to function.
In Phase II of the HapMap Project, a further 2.1 million SNPs were successfully genotyped on the same individuals. The resulting HapMap has an SNP density of approximately one per kilobase and is estimated to contain approximately 25–35% of all the 9–10 million common SNPs (minor allele frequency (MAF) ≥ 0.05) in the assembled human genome (that is, excluding gaps in the reference sequence alignment; see Supplementary Text 1), although this number shows extensive local variation. This paper describes the Phase II resource, its implications for genome-wide association studies and additional insights into the fine-scale structure of linkage disequilibrium, recombination and natural selection.
Construction of the Phase II HapMap
Most of the additional genotype data for the Phase II HapMap were obtained using the Perlegen amplicon-based platform (ref. 15). Briefly, this platform uses custom oligonucleotide arrays to type SNPs in DNA segmentally amplified via long-range polymerase chain reaction (PCR). Genotyping was attempted at 4,373,926 distinct SNPs, which corresponds, with exceptions (see Methods), to nearly all SNPs in dbSNP release 122 for which an assay could be designed. Additional submissions were included from the Affymetrix GeneChip Mapping Array 500K set, the Illumina HumanHap100 and HumanHap300 SNP assays, a set of ~11,000 non-synonymous SNPs genotyped by Affymetrix (ParAllele) and a set of ~4,500 SNPs within the extended major histocompatibility complex (MHC) (ref. 11). Genotype submissions were subjected to the same quality control (QC) filters as described previously (see Methods) and mapped to NCBI build 35 (University of California at Santa Cruz (UCSC) hg17) of the human genome. The re-mapping of SNPs from Phase I of the project identified 21,177 SNPs that had an ambiguous position or some other feature indicative of low reliability; these are not included in the filtered Phase II
data release. All genotype data are available from the HapMap Data Coordination Center (http://www.hapmap.org) and dbSNP (http://www.ncbi.nlm.nih.gov/SNP); analyses described in this paper refer to release 21a. Three data sets are available: ‘redundant unfiltered’ contains all genotype submissions, ‘redundant filtered’ contains all submissions that pass QC, and ‘non-redundant filtered’ contains a single QC+ submission for each SNP in each analysis panel.
*Lists of participants and affiliations appear at the end of the paper.
Vol 449 | 18 October 2007 | doi:10.1038/nature06258
Nature ©2007 Publishing Group
The QC filters remove SNPs showing gross errors. However, it is also important to understand the magnitude and structure of more subtle genotyping errors among SNPs that pass QC. We therefore carried out a series of analyses to assess the influence of the long-range PCR amplicon structure on genotyping error, the concordance rates between genotype calls from different genotyping platforms and between those platforms and re-sequencing assays, as well as the rates of false monomorphism and mis-mapping of SNPs (see Supplementary Text 2, Supplementary Figs 1–3 and Supplementary Tables 1–4). We estimate that the average per genotype accuracy is at least 99.5%. However, there are higher rates of missing data and genotype discrepancies at non-reference alleles, with some clustering of errors resulting from the amplicon design and a few incorrectly mapped SNPs.
Table 1 shows the numbers of SNPs attempted and converted to QC+ SNPs in each analysis panel (Supplementary Table 5 shows a breakdown by each major submission). Haplotypes and missing data were estimated for each analysis panel separately using both trio information and statistical methods based on the coalescent model (see Methods). To enable cross-population comparisons, a consensus data set was created consisting of 3,107,620 SNPs that were QC+ in all analysis panels and polymorphic in at least one analysis panel. The equivalent figure from Phase I was 931,340 SNPs. Unless stated otherwise, all analyses have been carried out on the consensus data set. An additional set of haplotypes was created for those SNPs in the consensus where a putative ancestral state could be assigned by comparison of the human alleles to the orthologous position in the chimpanzee and rhesus macaque genomes.
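The consensus rule described above (passing QC in every analysis panel and polymorphic in at least one) amounts to a set intersection followed by a set union. The following is a minimal sketch on toy SNP identifiers; the function name and the dictionary layout are illustrative assumptions and are not taken from the HapMap pipeline.

```python
def consensus_snps(qc_pass, polymorphic):
    """Return SNP IDs that pass QC in all panels and vary in at least one.

    qc_pass, polymorphic: dict mapping panel name -> set of SNP IDs
    (a hypothetical layout chosen for illustration).
    """
    # Passing QC in every analysis panel: intersect the per-panel sets.
    in_all_panels = set.intersection(*qc_pass.values())
    # Polymorphic in at least one panel: union of the per-panel sets.
    poly_somewhere = set.union(*polymorphic.values())
    return in_all_panels & poly_somewhere


# Toy data, not real HapMap genotypes.
qc_pass = {
    "YRI": {"rs1", "rs2", "rs3", "rs4"},
    "CEU": {"rs1", "rs2", "rs3"},
    "CHB+JPT": {"rs1", "rs2", "rs3", "rs5"},
}
polymorphic = {
    "YRI": {"rs1", "rs4"},
    "CEU": {"rs1", "rs2"},
    "CHB+JPT": {"rs5"},
}
consensus = consensus_snps(qc_pass, polymorphic)
print(sorted(consensus))  # ['rs1', 'rs2']
```

The same bookkeeping can be read off Table 1: 3,643,433 SNPs passed QC in all three analysis panels, 535,813 of those were monomorphic in all three, and 3,643,433 − 535,813 = 3,107,620.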
The variation in SNP density within the Phase II HapMap is shown in Fig. 1. On average there are 1.14 genotyped polymorphic SNPs per kilobase (average spacing is 875 base pairs (bp)) and 98.6% of the assembled genome is within 5 kb of the nearest polymorphic SNP. Still, there is heterogeneity in genotyped SNP density at both broad (Fig. 1a) and fine (Fig. 1b) scales. Furthermore, there are systematic changes in genotyped SNP density around genomic features including genes (Fig. 1c).
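Density and average spacing are reciprocal: a spacing of 875 bp corresponds to 1,000 / 875 ≈ 1.14 SNPs per kilobase. The sketch below shows one way such summaries, including the fraction of a region lying within 5 kb of a SNP, might be computed from a sorted list of positions. The function name, the coordinate convention and the toy positions are assumptions for illustration only.

```python
def density_summary(positions, region_length, window=5000):
    """Summarise SNP density for one contiguous region.

    positions: sorted 1-based SNP coordinates (at least two).
    region_length: length of the region in bp.
    window: distance in bp used for the coverage statistic.
    Returns (SNPs per kb, mean spacing in bp, fraction of bases
    within `window` bp of a SNP).
    """
    snps_per_kb = len(positions) / (region_length / 1000)
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mean_spacing = sum(gaps) / len(gaps)

    # Merge the intervals [p - window, p + window], clipped to the
    # region, and count the bases they cover.
    covered = 0
    cur_start = cur_end = None
    for p in positions:
        start = max(1, p - window)
        end = min(region_length, p + window)
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return snps_per_kb, mean_spacing, covered / region_length


# Toy region of 30 kb with three SNPs (not real data).
per_kb, spacing, frac = density_summary([1000, 2000, 20000], 30000)
print(per_kb, spacing, round(frac, 4))
```

Merging intervals before counting avoids double-counting bases that lie near more than one SNP.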
The Phase II HapMap differs from the Phase I HapMap not only in SNP spacing, but also in minor allele frequency distribution and patterns of linkage disequilibrium (Supplementary Fig. 4). Because the criteria for choosing additional SNPs did not include consideration of SNP spacing or preferential selection for high MAF, the SNPs added in Phase II are, on average, more clustered and have lower MAF than the Phase I SNPs. Because MAF predictably influences the distribution of linkage disequilibrium statistics, the average r² at a given physical distance is typically lower in Phase II than in Phase I; conversely, the |D′| statistic is typically higher (data not shown). One notable consequence is that the Phase II HapMap includes a better representation of rare variation than the Phase I HapMap.
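The opposite behaviour of r² and |D′| at low MAF can be illustrated with a small calculation. The sketch below uses the standard textbook definitions, D = p_AB − p_A·p_B, r² = D² / (p_A(1 − p_A)p_B(1 − p_B)) and |D′| = |D| / D_max; these are assumed here, since the paper does not restate them. The function name and the toy haplotypes are illustrative.

```python
def ld_stats(haplotypes):
    """Return (r2, abs_d_prime) for two biallelic, polymorphic SNPs.

    haplotypes: list of (a, b) pairs, one per phased chromosome,
    with each allele coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

    # D_max depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return r2, abs(d) / d_max


# Toy example: a rare allele (frequency 0.05) that only ever occurs
# on chromosomes carrying a common allele (frequency 0.5).
rare_on_common = [(1, 1)] * 2 + [(1, 0)] * 18 + [(0, 0)] * 20
r2_rare, dprime_rare = ld_stats(rare_on_common)

# Toy example: two SNPs with identical allele distributions.
identical = [(1, 1)] * 20 + [(0, 0)] * 20
r2_same, dprime_same = ld_stats(identical)

print(round(r2_rare, 3), round(dprime_rare, 3))
print(round(r2_same, 3), round(dprime_same, 3))
```

In the first toy case only three of the four possible haplotypes occur, so |D′| is 1 while r² is about 0.05, because the two SNPs have very different allele frequencies. In the second case both statistics equal 1.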
The increased resolution provided by Phase II of the project is
illustrated in Fig. 2. Broadly, an additional SNP added to a region
shows one of three patterns. First, it may be very similar in distribution
to SNPs present in Phase I. Second, it may provide detailed resolution
of haplotype structure (for example, a group of chromosomes with
identical local haplotypes in Phase I can be shown in Phase II to carry
Table 1 | Summary of Phase II HapMap data (release 21)

                                             Analysis panel
Phase    SNP categories                      YRI                CEU                CHB+JPT
I        Assays submitted                    1,304,199          1,344,616          1,306,125
         Passed QC                           1,177,312 (90%)    1,217,902 (91%)    1,187,800 (91%)
         Did not pass QC                     126,887 (10%)      126,714 (9%)       118,325 (9%)
           >20% missing                      82,463 (65%)       95,684 (76%)       78,323 (66%)
           >1 duplicate inconsistent         6,049 (5%)         5,126 (4%)         9,242 (8%)
           >1 mendelian error                18,916 (15%)       11,310 (9%)        N/A
           <0.001 Hardy–Weinberg P-value     10,265 (8%)        8,922 (7%)         13,722 (12%)
           Other failures                    19,345 (15%)       13,858 (11%)       20,674 (17%)
II       Assays submitted                    5,044,989          5,044,996          5,043,775
         Passed QC                           3,150,433 (62%)    3,204,709 (64%)    3,244,897 (64%)
         Did not pass QC                     1,894,556 (38%)    1,840,287 (36%)    1,798,878 (36%)
           >20% missing                      1,419,000 (75%)    1,398,166 (76%)    1,403,543 (78%)
           >1 duplicate inconsistent         0 (0%)             0 (0%)             6,617 (0%)
           >1 mendelian error                172,339 (9%)       127,923 (7%)       N/A
           <0.001 Hardy–Weinberg P-value     96,231 (5%)        82,268 (4%)        108,880 (6%)
           Other failures                    334,511 (18%)      337,906 (18%)      340,370 (19%)
Overall  Assays submitted                    6,349,188          6,389,612          6,349,900
         Passed QC                           4,327,745 (68%)    4,422,611 (69%)    4,432,697 (70%)
         Did not pass QC                     2,021,443 (32%)    1,967,001 (31%)    1,917,203 (30%)
           >20% missing                      1,501,463 (74%)    1,493,850 (76%)    1,481,866 (77%)
           >1 duplicate inconsistent         6,049 (0%)         5,126 (0%)         15,859 (1%)
           >1 mendelian error                191,255 (9%)       139,233 (7%)       N/A
           <0.001 Hardy–Weinberg P-value     106,496 (5%)       91,190 (5%)        122,602 (6%)
           Other failures                    353,856 (18%)      351,764 (18%)      361,044 (19%)
         Non-redundant (unique) SNPs         3,796,934          3,868,157          3,890,416
           Monomorphic                       861,299 (23%)      1,246,183 (32%)    1,410,152 (36%)
           Polymorphic                       2,935,635 (77%)    2,621,974 (68%)    2,480,264 (64%)

SNP categories                                                   All analysis panels
Unique QC-passed SNPs                                            4,000,107
  Passed in one analysis panel                                   88,140 (2%)
  Passed in two analysis panels                                  268,534 (7%)
  Passed in three analysis panels (QC+3)                         3,643,433 (91%)
QC+3 and monomorphic across three analysis panels                535,813
QC+3 and polymorphic in at least one analysis panel              3,107,620
QC+3 and polymorphic in all three analysis panels                2,006,352
QC+3 and MAF ≥ 0.05 in at least one of three analysis panels     2,819,322
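The failure categories listed in Table 1 (more than 20% missing genotypes, more than one mendelian error, Hardy–Weinberg P-value below 0.001) can be sketched as per-SNP checks. What follows is a simplified illustration, not the project's QC code. A chi-square approximation is assumed for the Hardy–Weinberg test, the test is applied to all supplied samples rather than to unrelated individuals only, and the duplicate-consistency check is omitted. All function names are hypothetical.

```python
import math


def missing_rate(genotypes):
    """Fraction of missing calls; a call is 0, 1, 2 (allele count) or None."""
    return sum(g is None for g in genotypes) / len(genotypes)


def hwe_chi2_pvalue(genotypes):
    """P-value of a 1-d.f. chi-square test for Hardy-Weinberg proportions.

    Assumption: a chi-square approximation, used here for brevity.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    counts = [called.count(k) for k in (0, 1, 2)]
    q = (counts[1] + 2 * counts[2]) / (2 * n)  # frequency of the counted allele
    p = 1 - q
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: nothing to test
    chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    # Survival function of a chi-square variable with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2))


def mendelian_error(father, mother, child):
    """True if the child's genotype cannot arise from the parental genotypes."""
    gametes = {0: {0}, 1: {0, 1}, 2: {1}}
    possible = {a + b for a in gametes[father] for b in gametes[mother]}
    return child not in possible


def passes_qc(genotypes, trios=()):
    """Apply the three thresholds; trios are (father, mother, child) indices."""
    if missing_rate(genotypes) > 0.20:
        return False
    errors = 0
    for f, m, c in trios:
        gf, gm, gc = genotypes[f], genotypes[m], genotypes[c]
        if None not in (gf, gm, gc) and mendelian_error(gf, gm, gc):
            errors += 1
    if errors > 1:
        return False
    return hwe_chi2_pvalue(genotypes) >= 0.001


# Toy genotype vectors (not real data).
in_equilibrium = [0] * 25 + [1] * 50 + [2] * 25
too_many_missing = [None] * 30 + [0] * 20 + [1] * 30 + [2] * 20
no_heterozygotes = [0] * 50 + [2] * 50

print(passes_qc(in_equilibrium))    # True
print(passes_qc(too_many_missing))  # False
print(passes_qc(no_heterozygotes))  # False
```

The N/A entries for mendelian errors in the CHB+JPT column follow from the study design: that panel consists of unrelated individuals, so there are no trios to check.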