Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study.
- PubMed: 15969739
Abstract
The identification of genetically homogeneous groups of individuals is a long-standing issue in population genetics. A recent Bayesian algorithm implemented in the software STRUCTURE allows the identification of such groups. However, the ability of this algorithm to detect the true number of clusters (K) in a sample of individuals when patterns of dispersal among populations are not homogeneous has not been tested. The goal of this study is to carry out such tests, using various dispersal scenarios from data generated with an individual-based model. We found that in most cases the estimated 'log probability of data' does not provide a correct estimation of the number of clusters, K. However, using an ad hoc statistic ΔK based on the rate of change in the log probability of data between successive K values, we found that STRUCTURE accurately detects the uppermost hierarchical level of structure for the scenarios we tested. As might be expected, the results are sensitive to the type of genetic marker used (AFLP vs. microsatellite), the number of loci scored, the number of populations sampled, and the number of individuals typed in each sample.
Author-supplied keywords
AFLP, hierarchical structure, microsatellite, simulations, STRUCTURE software
G. Evanno, S. Regnaut and J. Goudet
Department of Ecology and Evolution, Biology building, University of Lausanne, CH 1015 Lausanne, Switzerland
Received 5 October 2004; revision accepted 17 February 2005
Introduction
Population genetics deals with the variations of allele frequencies between and within populations. The most widely used measures of population structure are Wright's F statistics (Wright 1931). To calculate these indices, one needs first to define groups of individuals and then to use their genotypes to compute variance in allele frequencies.
Thus, a fundamental prerequisite of any inference on the genetic structure of populations is the definition of populations themselves. Population determination is usually based upon geographical origin of samples or phenotypes. However, the genetic structure of populations is not always reflected in the geographical proximity of individuals. Populations that are not discretely distributed can nevertheless be genetically structured, due to unidentified barriers to gene flow. In addition, groups of individuals with different geographical locations, behavioural patterns or phenotypes are not necessarily genetically differentiated (for instance, migratory bats from the same breeding roost could be sampled thousands of kilometres apart in winter; see, e.g. Petit et al. 2001).
Among the methods not assuming predefined structure, tree-based methods use genetic distance between individuals and tree construction algorithms such as UPGMA or neighbour joining to group them in clusters (e.g. Saitou & Nei 1987). Similarly, multivariate analyses such as multidimensional scaling can help in identifying clusters of individuals. However, these graphical methods are only loosely connected to statistical procedures allowing the identification of homogeneous clusters of individuals.
An alternative model-based method developed recently by Pritchard et al. (2000) and implemented in the software STRUCTURE aims at delineating clusters of individuals on the basis of their genotypes at multiple loci using a Bayesian approach. The model accounts for the presence of Hardy–Weinberg or linkage disequilibrium by introducing population structure and attempts to find population groupings that (as far as possible) are not in disequilibrium (Pritchard et al. 2000). The estimated log probability of data Pr(X|K) (equation 12 in Pritchard et al. 2000) for each value of K is given, allowing the estimation of the most likely number of clusters.
A quantification of how likely each individual
Correspondence: Jérôme Goudet, Fax: +41 21 692 42 65; E-mail: Jerome.goudet@unil.ch
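The abstract describes ΔK only as a statistic based on the rate of change in the log probability of data between successive K values. The sketch below shows one way such a statistic could be computed from replicate runs. It assumes the form |L(K+1) − 2L(K) + L(K−1)| / SD[L(K)], where L(K) is the mean estimated ln Pr(X|K) across runs; that exact form is an assumption here, since this excerpt does not spell it out. The function name `delta_k` and the input numbers are hypothetical and are not results from the study.

```python
from statistics import mean, stdev


def delta_k(log_probs):
    """Return {K: delta K} for every K whose neighbours K-1 and K+1 were also run.

    log_probs maps each tested K to the list of estimated ln Pr(X|K) values
    from replicate runs (at least two runs per K, so that an SD exists).
    """
    mean_l = {k: mean(runs) for k, runs in log_probs.items()}
    scores = {}
    for k in sorted(log_probs):
        if k - 1 not in mean_l or k + 1 not in mean_l:
            continue  # second difference is undefined at the ends of the K range
        # Assumed form: second-order rate of change of the mean log probability.
        second_diff = abs(mean_l[k + 1] - 2 * mean_l[k] + mean_l[k - 1])
        spread = stdev(log_probs[k])  # sample SD across replicate runs at this K
        if spread == 0:
            raise ValueError(f"zero variance across runs at K={k}")
        scores[k] = second_diff / spread
    return scores


# Made-up log probabilities for illustration only (not results from the study).
example_runs = {
    1: [-5000, -5010, -4990],
    2: [-4000, -4005, -3995],
    3: [-3900, -3950, -3850],
    4: [-3850, -3900, -3800],
}
scores = delta_k(example_runs)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In this made-up input the mean log probability keeps rising up to K = 4, while the ratio peaks at K = 2, which is the kind of disagreement between the two criteria that the abstract reports.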



