Inference of human population his...
LETTER doi:10.1038/nature10231 Inference of human population history from individual whole-genome sequences Heng Li1,2 & Richard Durbin1 The history of human population size is important for understand- ing human evolution. Various studies1–5 have found evidence for a founder event (bottleneck) in East Asian and European popula- tions, associated with the human dispersal out-of-Africa event around 60 thousand years (kyr) ago. However, these studies have hadto assumesimplified demographic modelswith few parameters, and they do not provide a precise date for the start and stop times of the bottleneck. Here, with fewer assumptions on population size changes, we present a more detailed history of human population sizes between approximately ten thousand and a million years ago, usingthepairwisesequentiallyMarkoviancoalescentmodelapplied to the complete diploid genome sequences of a Chinese male (YH)6, a Korean male (SJK)7, three European individuals (J. C. Venter8, NA12891 and NA12878 (ref. 9)) and two Yoruba males (NA18507 (ref. 10) and NA19239). We infer that European and Chinese popu- lations had very similar population-size histories before 10–20 kyr ago. Both populations experienced a severe bottleneck 10–60 kyr ago, whereas African populations experienced a milder bottleneck from which they recovered earlier. All three populations have an elevated effective population size between 60 and 250kyr ago, pos- sibly due to population substructure11. We also infer that the dif- ferentiationofgeneticallymodernhumansmayhavestartedasearly as 100–120 kyr ago12, but considerable genetic exchanges may still have occurred until 20–40 kyr ago. The distribution of the time since the most recent common ancestor (TMRCA) between two alleles in an individual provides information about the history of change in population size over time. Existing methods for reconstructing the detailed TMRCA distribution have analysed large samples of individuals at non-recombining loci like mitochondrial DNA13. However, the statistical resolution of inferences from any one locus is poor, and power fades rapidly upon moving back in time because there are few independent lineages probing deep time depths (in humans, no information is available from mitochondrial DNA beyond about 200 kyr ago, when all humans share a common maternal ancestor11). In contrast, a diploid genome sequence contains hundreds of thousands of independent loci, each with its own TMRCA between the two alleles carried by an individual. In principle, it should be possible to reconstruct the TMRCA distribution across the auto- somes and the X chromosome by studying how the local density of heterozygous sites changes across the genome, reflecting segments of constant TMRCA separated by historical recombination events. To explore whether we could use this idea to learn about the detailed TMRCA distribution from a diploid whole-genome sequence, we pro- posed the pairwise sequentially Markovian coalescent (PSMC) model, whichis a specialization to the case of two chromosomes of the sequen- tially Markovian coalescent model14 (Fig. 1a). The free parameters of this model include the scaled mutation rate, the recombination rate and piecewise constant ancestral population sizes (see Methods). We scaled results to real time, assuming 25 years per generation and a neutral mutation rate of 2.5 3 1028 per generation15. The con- sequences of uncertainty in the two scaling parameters will be dis- cussed later in the text. To validate our model, we simulated one-hundred 30-megabase (Mb) sequences with a sharp out-of-Africa bottleneck followed by a population expansion, and inferred population-size history with PSMC (Fig. 2a). PSMC was able to recover the parameters used in the simulation and the variance of the estimate was small between 20 kyr ago and 3 Myr ago. More recently than 20 kyr ago or more anciently than 3 Myr ago, few recombination events are left in the present sequence, which reduces the power of PSMC. Therefore, the estimated effective population size (Ne) in these time intervals was not as accurate and had large variance. To test the robustness of the model, we introduced variable mutation rates and recombination hotspots in the simulation (Supplementary Information). The inference was still close to the true history (Fig. 2b) and a uniform rate of single nucleo- tide polymorphism (SNP) ascertainment errors did not change our qualitative results either (Supplementary Fig. 2). The simulations did, however, reveal a limitation of PSMC in recovering sudden changes in effectivepopulationsize.Forexample,theinstantaneousreductionfrom 12,000 to 1,200 at 100 kyr ago in the simulation was spread over several preceding tens of thousands ofyears in the PSMC reconstruction. 1 The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK. 2 Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA. 0 20 40 60 80 100 120 140 160 180 200 Coordinate (kb) 0 50 100 150 200 TMRCA (×1,000 generations) Discretized TMRCA (hidden states) Diploid sequence (observation) Ancestral recombinations (changes of hidden states) ... emissions ... Heterozygote Homozygote Past ... emissions ... Inferred segmental TMRCA (a HMM path) a b Figure 1 | Illustration of the PSMC model and its application to simulated data. a, The PSMC infers the local time to the most recent common ancestor (TMRCA) on the basis of the local density of heterozygotes, using a hidden Markov model in which the observation is a diploid sequence, the hidden states are discretized TMRCA and the transitions represent ancestral recombination events. b, We used the ms software to simulate the TMRCA relating the two alleles of an individual across a 200-kb region (the thick red line), and inferred the local TMRCA at each locus using the PSMC (the heat map). The inference usually includes the correct time, with the greatest errors at transition points. 2 8 J U L Y 2 0 1 1 | V O L 4 7 5 | N A T U R E | 4 9 3 Macmillan Publishers Limited. All rights reserved ©2011
We applied the PSMC model to real data from recently published genome sequences (see Table 1, which defines the acronyms for sam- ples used elsewhere in the text and figures). Figure 3a shows that all populations are very similar in their estimated Ne history between 150 and 1,500 kyr ago. The Yoruba (YRI) genome differentiates from non- African populations around 100–120 kyr ago (at 110 kyr ago, NeYRI 5 15,313 6 559 and NeCHN 5 12,829 6 485). This evidence of early population differentiation is potentially consistent with the archaeological evidence of anatomically modern humans found in the Near East around 100 kyr ago12. European and East Asian popula- tions are nearly identical in estimated Ne before 11 kyr ago. From a peak of 13,500 at 150 kyr ago, the Ne dropped by a factor of ten to 1,200 between 40 and 20 kyr ago, before a sharp increase, the precise mag- nitude of which we do not have the power to measure. We also observed a less marked bottleneck in YRI from a peak of 16,100 around 100–150 kyr ago to 5,700 at 50 kyr ago, recovering earlier16 than the out-of-Africa populations, with an increase back to 8,700 by 20 kyr ago, coinciding with the Last Glacial Maximum. All populationsshowed increased Ne between 60 and 200 kyr ago, about the time of origin of anatomically modern humans17. An alternative to an increase in actual population size during this time would be that there was population structureinvolvingseparationandadmixture11,16 (SupplementaryFig5). We also saw an increase in estimated Ne before 1 million years (Myr) ago inall populations, with a sharp increase before 3 Myr ago.Although it is tempting to read into this the transition from the previously esti- mated larger Ne at the time of the split from the chimpanzee18, our method may also be subject to artefacts in this region, due to regions of balancing selection or to clustered false heterozygotes related to segmental duplications (Supplementary Fig. 3). Analysis of a European female X chromosome (EUR3.X) yielded a history similar to that from autosomes scaled by 0.75, as expected for the X chromosome (Fig. 3b). We did not observe a more severe Table 1 | Properties of the input sequences Label Description Coverage Number of called bases (bp) Number of heterozygotes (bp) Heterozygosity (31,000) YRI1.A (ref. 10) NA18507 autosomes 340 2.14 3 109 2.17 3 106 1.013 YRI2.A (ref. 9) NA19239 autosomes 329 2.11 3 109 2.21 3 106 1.051 EUR1.A (ref. 8) Venter autosomes 39 2.13 3 109 1.23 3 106 0.578 EUR2.A (ref. 9) NA12891 autosomes 338 2.11 3 109 1.67 3 106 0.791 KOR.A (ref. 7) SJK autosomes 320 2.13 3 109 1.47 3 106 0.690 CHN.A (ref. 6) YH autosomes 330 2.19 3 109 1.52 3 106 0.694 YRI3.X (ref. 9) NA19240 X chromosome 338 1.06 3 108 7.16 3 104 0.673 EUR3.X (ref. 9) NA12878 X chromosome 335 1.10 3 108 4.80 3 104 0.436 KOR–CHN.X SJK–YH combined X chromosome - 1.02 3 108 3.97 3 104 0.390 YRI1–EUR1.X NA18507–Venter combined X chromosome - 0.83 3 108 5.56 3 104 0.670 YRI1–KOR.X NA18507–KOR combined X chromosome - 1.00 3 108 6.69 3 104 0.669 YRI1–CHN.X NA18507–YH combined X chromosome - 1.06 3 108 6.95 3 104 0.657 Coverage equals theaveragenumber of reads coveringHapMap3 loci.A base is saidto be called if it passesall filters described (seeMethods). Therelatively lower coveragefor EUR1.A leadsto higher sampling bias at heterozygotes, which leads to underestimated heterozygosity, but this can be corrected by adjusting the neutral mutation rate in scaling (Supplementary Information, section 1.2). 0 1 2 3 4 5 104 105 106 107 Effective population size (×10 4 ) True history Bootstrapping estimate Direct estimate 0 1 2 3 4 5 104 105 106 107 Effective population size (×10 4 ) Years (g = 25, μ = 2.5×10–8) True history Direct estimate Variable mutation rate With recombination hotspots a b Figure 2 | PSMC estimate on simulated data. a, PSMC estimate on data simulated by msHOT. The blue curve is the population-size history used in simulation the red curve is the PSMC estimate on the originally simulated sequence the 100 thin green curves are the PSMC estimates on 100 sequences randomly resampled from the original sequence. b, PSMC estimate on data with a variable mutation rate or with hotspots. g, generation time m, mutation rate. b 0 1 2 3 4 5 104 105 106 107 Effective population size (x10 4 ) YRI1.A YRI2.A EUR.1.A[0.29] EUR2.A KOR.A[0.10] CHN.A[0.05] 0 1 2 3 4 5 104 105 106 107 Effective population size (x10 4 ) Years (g = 25, μ = 2.5×10–8) EUR2.A EUR3.X(α = 2)/0.75 YRI1–EUR1.X(α = 2)/0.75 YRI1–KOR.X(α = 2)/0.75 YRI1–CHN.X(α = 2)/0.75 Simulated African–Asian a Figure 3 | PSMC estimate on real data. a, Population sizes inferred from autosomes of six individuals. 5%, 10% and 29% of heterozygotes are assumed to be missing in CHN.A, KOR.A and EUR1.A, respectively. b, Population sizes inferred from male-combined X chromosomes and the simulated African– Asian combined sequences from the best-fit model in ref. 21. Sizes inferred from X-chromosome data are scaled by 4/3. The neutral mutation rate on X, which is used in time-scaling, is estimated with the ratio of male-to-female mutation rate, a, equal to 2 (see Methods). RESEARCH LETTER 4 9 4 | N A T U R E | V O L 4 7 5 | 2 8 J U L Y 2 0 1 1 Macmillan Publishers Limited. All rights reserved ©2011