
Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study.

by G Evanno, S Regnaut, J Goudet
Molecular Ecology (2005)

Abstract

The identification of genetically homogeneous groups of individuals is a long-standing issue in population genetics. A recent Bayesian algorithm implemented in the software STRUCTURE allows the identification of such groups. However, the ability of this algorithm to detect the true number of clusters (K) in a sample of individuals when patterns of dispersal among populations are not homogeneous has not been tested. The goal of this study is to carry out such tests, using various dispersal scenarios from data generated with an individual-based model. We found that in most cases the estimated 'log probability of data' does not provide a correct estimation of the number of clusters, K. However, using an ad hoc statistic ΔK based on the rate of change in the log probability of data between successive K values, we found that STRUCTURE accurately detects the uppermost hierarchical level of structure for the scenarios we tested. As might be expected, the results are sensitive to the type of genetic marker used (AFLP vs. microsatellite), the number of loci scored, the number of populations sampled, and the number of individuals typed in each sample.
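As a rough illustration of the ad hoc statistic described above, the sketch below computes a second-order rate of change of L(K), the estimated log probability of data, across successive K values. The exact form used here, the mean over replicate runs of |L(K+1) - 2L(K) + L(K-1)| divided by the standard deviation of L(K), is an assumption based on how the statistic is commonly stated, since the abstract gives no formula. The function name `delta_k`, the input layout, and the pairing of replicate runs by index are hypothetical choices made only for this example.

```python
from statistics import mean, stdev


def delta_k(log_prob):
    """Return {K: DeltaK} for every interior K value.

    log_prob maps each K to a list of L(K) values, one per replicate run.
    Assumption: every K has the same number of replicates, and replicates
    are paired by list index when taking second differences.
    """
    ks = sorted(log_prob)
    if any(b - a != 1 for a, b in zip(ks, ks[1:])):
        raise ValueError("K values must be consecutive integers")

    result = {}
    for k in ks[1:-1]:
        # |L(K+1) - 2 L(K) + L(K-1)|, computed replicate by replicate
        second = [
            abs(up - 2 * mid + down)
            for down, mid, up in zip(log_prob[k - 1], log_prob[k], log_prob[k + 1])
        ]
        spread = stdev(log_prob[k])  # sample SD of L(K) across replicates
        # Assumption: a zero SD is reported as infinity rather than raising.
        result[k] = mean(second) / spread if spread > 0 else float("inf")
    return result


if __name__ == "__main__":
    # Made-up L(K) values for illustration only; not results from the study.
    runs = {
        1: [-1000, -1002, -998],
        2: [-800, -801, -799],
        3: [-500, -501, -499],
        4: [-490, -492, -488],
        5: [-485, -480, -490],
    }
    scores = delta_k(runs)
    print(scores)
    print("K with the largest DeltaK:", max(scores, key=scores.get))
```

Under these assumptions ΔK is undefined for the smallest and largest K tested, so the range of K values run should extend at least one step beyond the number of clusters expected.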


Available from www.ncbi.nlm.nih.gov

Molecular Ecology (2005) 14, 2611–2620. doi: 10.1111/j.1365-294X.2005.02553.x. © 2005 Blackwell Publishing Ltd
G. EVANNO, S. REGNAUT and J. GOUDET
Department of Ecology and Evolution, Biology building, University of Lausanne, CH 1015 Lausanne, Switzerland

Keywords: AFLP, hierarchical structure, microsatellite, simulations, STRUCTURE software

Received 5 October 2004; revision accepted 17 February 2005

Introduction

Population genetics deals with the variation of allele frequencies between and within populations. The most widely used measures of population structure are Wright's F statistics (Wright 1931). To calculate these indices, one first needs to define groups of individuals and then use their genotypes to compute the variance in allele frequencies.
Thus, a fundamental prerequisite of any inference on the genetic structure of populations is the definition of populations themselves. Population determination is usually based upon geographical origin of samples or phenotypes. However, the genetic structure of populations is not always reflected in the geographical proximity of individuals. Populations that are not discretely distributed can nevertheless be genetically structured, due to unidentified barriers to gene flow. In addition, groups of individuals with different geographical locations, behavioural patterns or phenotypes are not necessarily genetically differentiated (for instance, migratory bats from the same breeding roost could be sampled thousands of kilometres apart in winter; see, e.g. Petit et al. 2001).

Among the methods not assuming predefined structure, tree-based methods use genetic distance between individuals and tree construction algorithms such as UPGMA or neighbour joining to group them in clusters (e.g. Saitou & Nei 1987). Similarly, multivariate analyses such as multidimensional scaling can help in identifying clusters of individuals. However, these graphical methods are only loosely connected to statistical procedures allowing the identification of homogeneous clusters of individuals. An alternative model-based method developed recently by Pritchard et al. (2000) and implemented in the software STRUCTURE aims at delineating clusters of individuals on the basis of their genotypes at multiple loci using a Bayesian approach. The model accounts for the presence of Hardy–Weinberg or linkage disequilibrium by introducing population structure and attempts to find population groupings that (as far as possible) are not in disequilibrium (Pritchard et al. 2000). The estimated log probability of data Pr(X|K) (equation 12 in Pritchard et al. 2000) for each value of K is given, allowing the estimation of the most likely number of clusters.
A quantification of how likely each individual is to belong to each group is also given, information that can then be used to assign individuals to populations. While the authors warn that Pr(X|K) is really only an indication of the number of clusters and an ad hoc guide (p. 949 in Pritchard et al. 2000; p. 3 in Pritchard & Wen 2003), the program has been widely used to this end. More generally, it has been used for detection of genetic structure in sample populations for medical purposes (Pritchard & Donnelly 2001; Satten et al. 2001), assignment studies (Rosenberg et al. 2001), population admixture and hybridization analysis (Beaumont et al. 2001; Goossens et al. 2002; Randi & Lucchini 2002), migration and dispersal analysis (Arnaud et al. 2003; Cegelski et al. 2003; Berry et al. 2004) and also to detect, with or without success, cryptic genetic structure of natural populations (Rosenberg et al. 2002; Caizergues et al. 2003).

Among the Bayesian clustering methods, STRUCTURE is the most widely used. While other methods have been developed (Banks & Eichert 2000; Dawson & Belkhir 2001; Corander et al. 2003) and still other methods for the assignment of individuals to populations exist (but require a priori knowledge of source populations: Paetkau et al. 1995; Rannala & Mountain 1997; Cornuet et al. 1999), we will focus here exclusively on the software STRUCTURE. Tests and comparative studies using empirical data sets have been performed to assess STRUCTURE's ability to assign individuals to their known cluster of origin (Pritchard & Donnelly 2001; Rosenberg et al. 2001; Manel et al. 2002; Turakulov & Easteal 2003). Most of these studies have shown the software to be efficient in assigning individuals to their populations of origin (albeit most are based on simulations with a limited number of populations and no dispersal between them).

Correspondence: Jérôme Goudet, Fax: +41 21 692 42 65; E-mail: Jerome.goudet@unil.ch

However, little is known about the crucial ability of STRUCTURE to detect the real number of clusters (K) that compose a data set. Pritchard et al. (2000) showed that STRUCTURE easily detects two to four highly differentiated populations, but studies in molecular ecology usually include many more populations, and very often these populations are not evenly distributed in space.

Many studies have described migration patterns departing from Wright's island model and including several hierarchical levels and/or isolation by distance. For instance, Chapuisat et al. (1997), Giles et al. (1998), Bouzat & Johnson (2004) or Trouvé et al. (2005) have documented situations with a hierarchical pattern of population structure, as groups are themselves clusters of differentiated populations. Another pattern frequently described is a contact zone between otherwise isolated populations. This situation implies a relative genetic isolation between the two groups of populations and sometimes also a pattern of isolation by distance within each group. Such a migration scheme was found for instance by Lugon-Moulin et al. (1999), who describe two longitudinal geographical patterns of isolated shrew populations separated by a zone through which dispersal is strongly reduced.

Many of these studies have been conducted using microsatellite markers to assess polymorphism. These DNA markers are widely used because they are both codominant and highly polymorphic (Jarne & Lagoda 1996). However, their development is relatively expensive, time consuming and can be difficult. An alternative family of markers also commonly used in population studies is amplified fragment length polymorphisms (AFLPs) (Vos et al. 1995). AFLPs generate hundreds of polymorphic bands and are easier to develop than microsatellites, but they have the potential inconvenience of being dominant (a DNA band is either present or absent). These two types of markers have different properties.
For instance, Gaudeul et al. (2004) reported very different levels of population structuring inferred from AFLPs and microsatellite markers. Both AFLP and microsatellites can be used for assignment studies, but their respective ability to delineate clusters of individuals has not been compared so far.

The goal of this study is to test the ability of the algorithm underlying the software STRUCTURE to detect the number of clusters in situations including more than two populations. While the program is increasingly used, it is unknown whether it can efficiently detect the real number of clusters in hierarchical systems where migration between populations is uneven. We present an evaluation of the performance of the method under three models of population structure: the island model, a contact zone, and a hierarchical island model. For each model, we simulated AFLP and microsatellite genotypic data sets that were subsequently run in STRUCTURE, and then we analysed the output. We find that ΔK, an ad hoc quantity related to the second order rate of change of the log probability of data with respect to the number of clusters, is a good predictor of the real number of clusters. STRUCTURE identifies groups of individuals corresponding to the uppermost hierarchical level, and performs well with both dominant and codominant markers.

Materials and methods

Simulation of the three migration models

We used the software EASYPOP (Balloux 2001) to generate genotypic data from three different models of population structure: an island model, a hierarchical island model and a contact-zone model (Fig. 1). For all simulations and models of population structure, the mutation process followed the K-allele model (equal probability of mutation to any allelic state) at a rate of μ = 10⁻³. The modelled organisms are diploid, hermaphroditic and randomly mating (excluding selfing).
Each simulation was run for 10 000 generations to obtain populations at drift, migration and mutation equilibrium. For each model, we generated 10 replicates where each individual genotype was made of 100 microsatellite loci, each with 10 possible allelic states.
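
To make the mutation step of such a simulation concrete, here is a minimal sketch of a K-allele-model update for one diploid genotype, using the figures quoted above (100 loci, 10 allelic states, a mutation rate of 10⁻³). It is not EASYPOP code. The function names, the random starting genotype, and the choice to let a mutation redraw among all states (so that it may land on the current one) are assumptions made for illustration.

```python
import random

N_LOCI = 100    # microsatellite loci per individual (from the text)
N_STATES = 10   # possible allelic states per locus (from the text)
MU = 1e-3       # mutation rate per allele copy per generation (from the text)


def random_genotype(rng, n_loci=N_LOCI, n_states=N_STATES):
    """Hypothetical starting point: one (allele, allele) pair per locus."""
    return [(rng.randrange(n_states), rng.randrange(n_states))
            for _ in range(n_loci)]


def mutate_kam(genotype, rng, mu=MU, n_states=N_STATES):
    """Apply one generation of K-allele-model mutation to a diploid genotype."""
    def step(allele):
        if rng.random() < mu:
            # Assumption: every state is equally likely, including the
            # current one, so some "mutations" leave the allele unchanged.
            return rng.randrange(n_states)
        return allele

    return [(step(a), step(b)) for a, b in genotype]


if __name__ == "__main__":
    rng = random.Random(2005)  # fixed seed so the run is repeatable
    parent = random_genotype(rng)
    child = mutate_kam(parent, rng)
    changed = sum(p != c for p, c in zip(parent, child))
    print("loci changed in one generation:", changed)
```

With 200 allele copies per individual and a rate of 10⁻³, the expected number of mutation events per individual per generation is 0.2, so most single-generation calls return the genotype unchanged.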
