There are several ways in which the basic model that we have described here might be modified to produce better performance in particular cases.

For example, in Models and methods and Applications to data we assumed relatively noninformative priors for q. However, in some situations there might be considerable information about likely values of q, and the estimation procedure could be improved by using that information.

For example, in estimating admixture proportions for African Americans, it would be possible to improve the estimation procedure by making use of existing information about the extent of European admixture (e.

A second way in which the basic model can be modified involves changing the way in which the allele frequencies P are estimated. Throughout this article, we have assumed that the allele frequencies in different populations are uncorrelated with one another. This is a convenient approximation for populations that are not extremely closely related and, as we have seen, can produce accurate clustering.

However, loosely speaking, the model of uncorrelated allele frequencies says that we do not normally expect to see populations with very similar allele frequencies. This property has the result that the clustering algorithm may tend to merge subpopulations that share similar frequencies. An alternative, which we have implemented in our software package, is to permit allele frequencies to be correlated across populations (appendix, Model with correlated allele frequencies).

In a series of additional simulations, we have found that this allows us to perform accurate assignments of individuals in very closely related populations, though possibly at the cost of making us likely to overestimate K.

Our basic model might also be modified to allow for linkage among marker loci. Normally, we would not expect to see linkage disequilibrium within subpopulations, except between markers that are extremely close together.

This means that in situations where there is little admixture, our assumption of independence among loci will be quite accurate. However, we might expect to see strong correlations among linked loci when there is recent admixture. This occurs because an individual who is admixed will inherit large chromosomal segments from one population or another.

Thus, when the map order of marker loci is known, it should be possible to improve the accuracy of the estimation for such individuals by modeling the inheritance of these segments.

In this article we have devoted considerable attention to the problem of inferring K.

This is an important practical problem from the standpoint of model choice. We need to have some way of deciding which clustering model is most appropriate for interpreting the data. However, we stress that care should be taken in the interpretation of the inferred value of K. Second, it has been observed that in Bayesian model-based clustering, the posterior distribution of K tends to be quite dependent on the priors and modeling assumptions, even though estimates of the other parameters (e.

There are also biological reasons to be careful interpreting K. The population model that we have adopted here is obviously an idealization. We anticipate that it will be flexible enough to permit appropriate clustering for a wide range of population structures. As another example, imagine a species that lives on a continuous plane, but has low dispersal rates, so that allele frequencies vary continuously across the plane.

If we sample at K distinct locations, we might infer the presence of K clusters, but the inferred number K is not biologically interesting, as it was determined purely by the sampling scheme. All that can usefully be said in such a situation is that the migration rates between the sampling locations are not high enough to make the population act as a single unstructured population.

In summary, we find that the method described here can produce highly accurate clustering and sensible choices of K, both for simulated data and for real data from humans and from the Taita thrush.

In the latter example, we find it particularly encouraging that using a relatively small number of loci (seven) we can obtain a very strong signal of population structure and assign individuals appropriately.

We thank Peter Galbusera and Lynn Jorde for allowing us to use their data, Augie Kong for a helpful discussion, Daniel Falush for suggesting comparison with neighbor-joining trees, Steve Brooks and Trevor Sweeting for helpful discussions on inferring K, and Eric Anderson for his extensive comments on an earlier version of the manuscript.

This work was supported by National Institutes of Health grant GM19634 and by a Hitchings-Elion fellowship from Burroughs-Wellcome Fund. The work was initiated while the authors were resident at the Isaac Newton Institute for Mathematical Sciences, Cambridge, UK.

This is often surprisingly straightforward using standard methods devised for this purpose, such as the Metropolis-Hastings algorithm (e. This can be formalized and shown to be true provided the Markov chain satisfies certain technical conditions (ergodicity) that hold for the Markov chains considered in this article.

In general it is very difficult to know how large m and c should be. The values required to obtain reliable results depend heavily on the amount of correlation between successive states of the Markov chain. Substantial differences among the results obtained for the different runs indicate that m and c are too small.
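The multiple-run check described above can be illustrated with a minimal sketch. It is not the sampler used in this article: the target is a toy standard normal density, the proposal is a simple random-walk Metropolis step, and the names `log_target`, `run_chain`, and `compare_runs` are hypothetical. Here m is taken to be the number of initial iterations discarded and c the spacing between retained draws.

```python
import math
import random


def log_target(x):
    """Unnormalized log density of a toy standard normal target (assumption)."""
    return -0.5 * x * x


def run_chain(m, c, n_keep, step, seed, x0):
    """Random-walk Metropolis: discard m iterations, then keep every c-th state."""
    rng = random.Random(seed)
    x = x0
    kept = []
    for t in range(1, m + c * n_keep + 1):
        proposal = x + rng.uniform(-step, step)
        log_accept = log_target(proposal) - log_target(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            x = proposal
        if t > m and (t - m) % c == 0:
            kept.append(x)
    return kept


def compare_runs(m, c, n_keep=2000, step=2.0, starts=(-10.0, 0.0, 10.0)):
    """Estimate E[x^2] from several runs with different seeds and start points."""
    estimates = []
    for i, x0 in enumerate(starts):
        draws = run_chain(m, c, n_keep, step, seed=i, x0=x0)
        estimates.append(sum(d * d for d in draws) / len(draws))
    return estimates


def spread(values):
    """Range of the per-run estimates, used as a crude agreement measure."""
    return max(values) - min(values)


if __name__ == "__main__":
    for m, c, n_keep in [(0, 1, 20), (500, 5, 2000)]:
        est = compare_runs(m, c, n_keep)
        print(f"m={m} c={c} n_keep={n_keep} spread={spread(est):.3f}")
```

In this toy setting, runs started far from the bulk of the target with no burn-in and few draws give visibly different estimates, while longer runs agree closely. A large spread across runs, relative to the scale of the quantity being estimated, would suggest that m and c are too small.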

It is then necessary either to increase m and c or (if this makes the method computationally infeasible) to construct a Markov chain with better mixing properties. We now provide further details regarding the approach to choosing K (see Inference for the number of populations).

However, our own implementation of versions of this approach has turned out to be computationally infeasible, due to the very high-dimensional parameter space of our problem. An alternative interpretation of this method is that model selection is based on penalizing the mean of the Bayesian deviance by a quarter of its variance. Note that Equation A8 makes an implicit assumption that an equal fraction of the sample is drawn from each population.
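The deviance-penalty interpretation mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the article's implementation: the function names and the toy deviance values are hypothetical, the deviances are assumed to be values of the Bayesian deviance evaluated at retained MCMC draws for each candidate K, and the sample variance is used for the penalty. Halving and negating the criterion to place it on a log-probability scale is our reading of the interpretation and is not stated in this section.

```python
import statistics


def penalized_deviance(deviances):
    """Mean of the sampled deviances plus a quarter of their sample variance."""
    mu = statistics.mean(deviances)
    var = statistics.variance(deviances)
    return mu + var / 4.0


def approx_log_evidence(deviances):
    """Assumed log-scale form: -(mean / 2) - (variance / 8)."""
    return -0.5 * penalized_deviance(deviances)


def choose_k(deviances_by_k):
    """Return the K whose sampled deviances give the smallest penalized value."""
    return min(deviances_by_k, key=lambda k: penalized_deviance(deviances_by_k[k]))


if __name__ == "__main__":
    # Toy deviance samples (hypothetical values, not results from the article).
    samples = {
        1: [100.0, 102.0, 104.0],
        2: [90.0, 92.0, 94.0],
        3: [88.0, 96.0, 104.0],
    }
    for k, d in samples.items():
        print(k, penalized_deviance(d), approx_log_evidence(d))
    print("selected K:", choose_k(samples))
```

The toy values are chosen so that K = 3 reaches a lower minimum deviance than K = 2 but is penalized for its larger variance, which is the trade-off the criterion is meant to capture.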



