A haplotype map of the human genome.

by The International HapMap Consortium*
Nature (2005)

Abstract

Inherited genetic variation has a critical but as yet largely uncharacterized role in human disease. Here we report a public database of common variation in the human genome: more than one million single nucleotide polymorphisms (SNPs) for which accurate and complete genotypes have been obtained in 269 DNA samples from four populations, including ten 500-kilobase regions in which essentially all information about common DNA variation has been extracted. These data document the generality of recombination hotspots, a block-like structure of linkage disequilibrium and low haplotype diversity, leading to substantial correlations of SNPs with many of their neighbours. We show how the HapMap resource can guide the design and analysis of genetic association studies, shed light on structural variation and recombination, and identify loci that may have been subject to natural selection during human evolution.

© 2005 Nature Publishing Group

Despite the ever-accelerating pace of biomedical research, the root
causes of common human diseases remain largely unknown, preventative
measures are generally inadequate, and available treatments
are seldom curative. Family history is one of the strongest risk factors
for nearly all diseases—including cardiovascular disease, cancer,
diabetes, autoimmunity, psychiatric illnesses and many others—
providing the tantalizing but elusive clue that inherited genetic
variation has an important role in the pathogenesis of disease.
Identifying the causal genes and variants would represent an important
step in the path towards improved prevention, diagnosis and
treatment of disease.
More than a thousand genes for rare, highly heritable ‘mendelian’
disorders have been identified, in which variation in a single gene is
both necessary and sufficient to cause disease. Common disorders, in
contrast, have proven much more challenging to study, as they
are thought to be due to the combined effect of many different
susceptibility DNA variants interacting with environmental factors.
Studies of common diseases have fallen into two broad categories:
family-based linkage studies across the entire genome, and population-based
association studies of individual candidate genes.
Although there have been notable successes, progress has been slow
due to the inherent limitations of the methods; linkage analysis has
low power except when a single locus explains a substantial fraction
of disease, and association studies of one or a few candidate genes
examine only a small fraction of the ‘universe’ of sequence variation
in each patient.
A comprehensive search for genetic influences on disease would
involve examining all genetic differences in a large number of affected
individuals and controls. It may eventually become possible to
accomplish this by complete genome resequencing. In the meantime,
it is increasingly practical to systematically test common genetic
variants for their role in disease; such variants explain much of the
genetic diversity in our species, a consequence of the historically
small size and shared ancestry of the human population.
Recent experience bears out the hypothesis that common variants
have an important role in disease, with a partial list of validated
examples including HLA (autoimmunity and infection) [1], APOE4
(Alzheimer’s disease, lipids) [2], Factor V Leiden (deep vein thrombosis) [3],
PPARG (encoding PPARγ; type 2 diabetes) [4,5], KCNJ11 (type 2
diabetes) [6], PTPN22 (rheumatoid arthritis and type 1 diabetes) [7,8],
insulin (type 1 diabetes) [9], CTLA4 (autoimmune thyroid disease, type
1 diabetes) [10], NOD2 (inflammatory bowel disease) [11,12], complement
factor H (age-related macular degeneration) [13–15] and RET (Hirschsprung
disease) [16,17], among many others.
Systematic studies of common genetic variants are facilitated by
the fact that individuals who carry a particular SNP allele at one site
often predictably carry specific alleles at other nearby variant sites.
This correlation is known as linkage disequilibrium (LD); a particular
combination of alleles along a chromosome is termed a haplotype.
LD exists because of the shared ancestry of contemporary chromosomes.
When a new causal variant arises through mutation (whether
a single nucleotide change, insertion/deletion, or structural alteration),
it is initially tethered to a unique chromosome on which it
occurred, marked by a distinct combination of genetic variants.
Recombination and mutation subsequently act to erode this association,
but do so slowly (each occurring at an average rate of about
10⁻⁸ per base pair (bp) per generation) as compared to the number
of generations (typically 10⁴ to 10⁵) since the mutational event.
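As a rough illustration of the two ideas in this paragraph, the sketch below computes the standard pairwise LD statistics (D and r²) from phased two-site haplotypes, and then applies the textbook decay model D_t = D_0(1 − c)^t to show how slowly recombination erodes an association over short distances. The function names, the toy haplotypes and the assumption of a uniform rate of 10⁻⁸ per bp per generation are illustrative choices, not the analysis used for the HapMap data.

```python
from collections import Counter


def ld_stats(haplotypes):
    """Return (D, r2) for two biallelic sites.

    `haplotypes` is a list of (allele_at_site_1, allele_at_site_2) pairs,
    alleles coded 0/1, one pair per phased chromosome.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at site 1
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at site 2
    p_ab = counts[(1, 1)] / n                 # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                      # coefficient of disequilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


def ld_remaining(distance_bp, generations, rate_per_bp=1e-8):
    """Fraction of the initial D expected to survive recombination.

    Assumes (hypothetically) a uniform recombination rate, so the
    per-generation recombination fraction c grows linearly with
    distance and is capped at 0.5.
    """
    c = min(0.5, distance_bp * rate_per_bp)
    return (1.0 - c) ** generations


# Toy example: two sites whose alleles always travel together.
perfect = [(1, 1)] * 5 + [(0, 0)] * 5
print(ld_stats(perfect))

# Sites 10 kb apart, 10^4 generations after the mutation arose.
print(round(ld_remaining(10_000, 10_000), 2))  # roughly 0.37
```

Under these assumptions, sites 10 kb apart retain about a third of their initial disequilibrium after 10⁴ generations, whereas sites 1 Mb apart retain essentially none, which is why correlations are expected mainly among nearby SNPs.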
The correlations between causal mutations and the haplotypes on
which they arose have long served as a tool for human genetic
research: first finding association to a haplotype, and then subsequently
identifying the causal mutation(s) that it carries. This was
pioneered in studies of the HLA region, extended to identify causal
genes for mendelian diseases (for example, cystic fibrosis [18] and
diastrophic dysplasia [19]), and most recently for complex disorders
such as age-related macular degeneration [13–15].
Early information documented the existence of LD in the human
genome [20,21]; however, these studies were limited (for technical
reasons) to a small number of regions with incomplete data, and
general patterns were challenging to discern. With the sequencing of
the human genome and development of high-throughput genomic
methods, it became clear that the human genome generally
displays more LD [22] than under simple population genetic models [23],
and that LD is more varied across regions, and more segmentally
structured [24–30], than had previously been supposed. These observations
indicated that LD-based methods would generally have
great value (because nearby SNPs were typically correlated with
many of their neighbours), and also that LD relationships would
need to be empirically determined across the genome by studying
polymorphisms at high density in population samples.
*Lists of participants and affiliations appear at the end of the paper.
Vol 437|27 October 2005|doi:10.1038/nature04226
The International HapMap Project was launched in October 2002
to create a public, genome-wide database of common human
sequence variation, providing information needed as a guide to
genetic studies of clinical phenotypes [31]. The project had become
practical by the confluence of the following: (1) the availability of
the human genome sequence; (2) databases of common SNPs
(subsequently enriched by this project) from which genotyping
assays could be designed; (3) insights into human LD; (4) development
of inexpensive, accurate technologies for high-throughput SNP
genotyping; (5) web-based tools for storing and sharing data; and
(6) frameworks to address associated ethical and cultural issues [32].
The project follows the data release principles of an international
community resource project (http://www.wellcome.ac.uk/doc_WTD003208.html),
sharing information rapidly and without restriction on its use.
The HapMap data were generated with the primary aim of guiding
the design and analysis of medical genetic studies. In addition, the
advent of genome-wide variation resources such as the HapMap
opens a new era in population genetics, offering an unprecedented
opportunity to investigate the evolutionary forces that have shaped
variation in natural populations.
The Phase I HapMap
Phase I of the HapMap Project set as a goal genotyping at least one
common SNP every 5 kilobases (kb) across the genome in each of 269
DNA samples. For the sake of practicality, and motivated by the allele
frequency distribution of variants in the human genome, a minor
allele frequency (MAF) of 0.05 or greater was targeted for study. (For
simplicity, in this paper we will use the term ‘common’ to mean a
SNP with MAF ≥ 0.05.) The project has a Phase II, which is
attempting genotyping of an additional 4.6 million SNPs in each of
the HapMap samples.
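To make the 'common' threshold concrete, here is a minimal sketch of how a minor allele frequency could be computed from unphased genotype calls and compared against 0.05. The genotype encoding (two-letter strings, `None` for a missing call) and the function names are assumptions made for illustration; they are not the project's data format.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return the MAF of one biallelic SNP.

    `genotypes` is a list of two-character strings such as "AG";
    missing calls are given as None and are skipped.
    """
    counts = Counter()
    for call in genotypes:
        if call is not None:
            counts.update(call)        # counts each of the two alleles
    if len(counts) > 2:
        raise ValueError("more than two alleles observed")
    if len(counts) < 2:
        return 0.0                     # monomorphic or no data
    return min(counts.values()) / sum(counts.values())


def is_common(genotypes, threshold=0.05):
    """True if the SNP meets the MAF >= 0.05 definition of 'common'."""
    return minor_allele_frequency(genotypes) >= threshold


# 20 individuals, two of them heterozygous: 2 minor alleles out of 40.
calls = ["AA"] * 18 + ["AG"] * 2
print(minor_allele_frequency(calls), is_common(calls))
```

With 2 minor alleles among 40 chromosomes the SNP sits exactly on the 0.05 boundary and counts as common; with a single heterozygote it would not.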
To compare the genome-wide resource to a more complete
database of common variation (one in which all common SNPs
and many rarer ones have been discovered and tested), a representative
collection of ten regions, each 500 kb in length, was selected from
the ENCODE (Encyclopedia of DNA Elements) Project [33]. Each
500-kb region was sequenced in 48 individuals, and all SNPs in
these regions (discovered or in dbSNP) were genotyped in the
complete set of 269 DNA samples.
The specific samples examined are: (1) 90 individuals (30 parent–
offspring trios) from the Yoruba in Ibadan, Nigeria (abbreviation
YRI); (2) 90 individuals (30 trios) in Utah, USA, from the Centre
d’Etude du Polymorphisme Humain collection (abbreviation CEU);
(3) 45 Han Chinese in Beijing, China (abbreviation CHB); (4) 44
Japanese in Tokyo, Japan (abbreviation JPT).
Because none of the samples was collected to be representative of a
larger population such as ‘Yoruba’, ‘Northern and Western European’,
‘Han Chinese’, or ‘Japanese’ (let alone of all populations from ‘Africa’,
‘Europe’, or ‘Asia’), we recommend using a specific local identifier
(for example, ‘Yoruba in Ibadan, Nigeria’) to describe the samples
initially. Because the CHB and JPT allele frequencies are generally
very similar, some analyses below combine these data sets. When
doing so, we refer to three ‘analysis panels’ (YRI, CEU, CHB+JPT) to
avoid confusing this analytical approach with the concept of a
‘population’.
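The relationship between the four sample sets and the three analysis panels can be written down as a small lookup table. The sketch below is only a bookkeeping aid built from the sample counts given above; the variable and function names are hypothetical.

```python
# Sample counts per collection, as listed in the text.
SAMPLES = {"YRI": 90, "CEU": 90, "CHB": 45, "JPT": 44}

# Analysis panels: CHB and JPT are combined for some analyses.
ANALYSIS_PANELS = {
    "YRI": ["YRI"],
    "CEU": ["CEU"],
    "CHB+JPT": ["CHB", "JPT"],
}


def panel_sizes(samples=SAMPLES, panels=ANALYSIS_PANELS):
    """Return the number of DNA samples in each analysis panel."""
    return {
        name: sum(samples[member] for member in members)
        for name, members in panels.items()
    }


sizes = panel_sizes()
print(sizes)                 # per-panel sample counts
print(sum(sizes.values()))   # total across panels: 269
```

Keeping the panel definition separate from the sample labels mirrors the distinction drawn in the text between an analytical grouping and a 'population'.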
Important details about the design of the HapMap Project are
presented in the Methods, including: (1) organization of the project;
(2) selection of DNA samples for study; (3) increasing the number
and annotation of SNPs in the public SNP map (dbSNP) from
2.6 million to 9.2 million (Fig. 1); (4) targeted sequencing of the ten
ENCODE regions, including evaluations of false-positive and false-
negative rates; (5) genotyping for the genome-wide map; (6) intense
efforts that monitored and established the high quality of the data;
and (7) data coordination and distribution through the project Data
Coordination Center (DCC) (http://www.hapmap.org).
Description of the data. The Phase I HapMap contains 1,007,329
SNPs that passed a set of quality control (QC) filters (see Methods) in
each of the three analysis panels, and are polymorphic across the 269
samples. SNP genotyping was distributed across centres by chromosomal
region, with several technologies employed (Table 1). Each
centre followed the same standard rules for SNP selection, quality
control and data release; all SNPs were genotyped in the full set of 269
samples. Some centres genotyped more SNPs than required by the
rules.
Extensive, blinded quality assessment (QA) exercises documented
that these data are highly accurate (99.7%) and complete (99.3%, see
Table 1 | Genotyping centres
Centre | Chromosomes | Technology
RIKEN | 5, 11, 14, 15, 16, 17, 19 | Third Wave Invader
Wellcome Trust Sanger Institute | 1, 6, 10, 13, 20 | Illumina BeadArray
McGill University and Génome Québec Innovation Centre | 2, 4p | Illumina BeadArray
Chinese HapMap Consortium* | 3, 8p, 21 | Sequenom MassExtend, Illumina BeadArray
Illumina | 8q, 9, 18q, 22, X | Illumina BeadArray
Broad Institute of Harvard and MIT | 4q, 7q, 18p, Y, mtDNA | Sequenom MassExtend, Illumina BeadArray
Baylor College of Medicine with ParAllele BioScience | 12 | ParAllele MIP
University of California, San Francisco, with Washington University in St Louis | 7p | PerkinElmer AcycloPrime-FP
Perlegen Sciences | 5 Mb (ENCODE) on 2, 4, 7, 8, 9, 12, 18 in CEU | High-density oligonucleotide array
*The Chinese HapMap Consortium consists of the Beijing Genomics Institute, the Chinese National Human Genome Center at Beijing, the University of Hong Kong, the Hong Kong University of Science and Technology, the Chinese University of Hong Kong, and the Chinese National Human Genome Center at Shanghai.
Figure 1 | Number of SNPs in dbSNP over time. The cumulative number of
non-redundant SNPs (each mapped to a single location in the genome) is
shown as a solid line, as well as the number of SNPs validated by genotyping
(dotted line) and double-hit status (dashed line). Years are divided into
quarters (Q1–Q4).