Genetic diversity in India and the inference of Eurasian population expansion

Background Genetic studies of populations from the Indian subcontinent are of great interest because of India's large population size, complex demographic history, and unique social structure. Despite recent large-scale efforts in discovering human genetic variation, India's vast reservoir of genetic diversity remains largely unexplored. Results To analyze an unbiased sample of genetic diversity in India and to investigate human migration history in Eurasia, we resequenced one 100-kb ENCODE region in 92 samples collected from three castes and one tribal group from the state of Andhra Pradesh in south India. Analyses of the four Indian populations, along with eight HapMap populations (692 samples), showed that 30% of all SNPs in the south Indian populations are not seen in HapMap populations. Several Indian populations, such as the Yadava, Mala/Madiga, and Irula, have nucleotide diversity levels as high as those of HapMap African populations. Using unbiased allele-frequency spectra, we investigated the expansion of human populations into Eurasia. The divergence time estimates among the major population groups suggest that Eurasian populations in this study diverged from Africans during the same time frame (approximately 90 to 110 thousand years ago). The divergence among different Eurasian populations occurred more than 40,000 years after their divergence with Africans. Conclusions Our results show that Indian populations harbor large amounts of genetic variation that have not been surveyed adequately by public SNP discovery efforts. Our data also support a delayed expansion hypothesis in which an ancestral Eurasian founding population remained isolated long after the out-of-Africa diaspora, before expanding throughout Eurasia.


Comparison between sequence and SNP microarray data
To quantify the difference between resequencing data and SNP microarray data, we compared the distribution of derived alleles obtained by resequencing with SNP genotypes obtained by microarray genotyping (Supplemental Figure 4). For the ENCODE sequence data, the number of polymorphic sites and the allele frequency distribution were calculated using the HapMap YRI (60), HapMap CEU (60), combined randomly selected HapMap CHB/JPT (60), and South Indians (60; Brahmin, Mala, Madiga, Irula). To obtain comparative data from microarrays, a contiguous set of SNPs on chromosome 12, equal in number to that found by sequencing, was selected randomly from the Affymetrix 250K NspI microarray genotypes [3] for each population (1000 replicates).
For both the sequence and the microarray data, the number of polymorphic sites is higher in Africans than in non-Africans. Consistent with ascertainment strategies, however, low-frequency polymorphisms (< 0.2) are significantly under-represented and high-frequency polymorphisms are over-represented in the microarray data for all groups (Supplemental Figure 4). These results demonstrate the necessity of full sequence data sets to accurately assess genetic variation in any major population.
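The microarray resampling step can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names, the input format (a position-ordered list of allele frequencies for one population), and the fixed random seed are all assumptions made for illustration. Each replicate draws a random run of adjacent SNPs of the requested size and records the fraction of polymorphic sites with frequency below 0.2.

```python
import random


def low_freq_fraction(freqs, threshold=0.2):
    """Fraction of polymorphic sites with allele frequency below `threshold`.

    Sites fixed at 0 or 1 are not polymorphic in this population and are skipped.
    """
    polymorphic = [f for f in freqs if 0.0 < f < 1.0]
    if not polymorphic:
        return 0.0
    return sum(1 for f in polymorphic if f < threshold) / len(polymorphic)


def sample_contiguous(freqs, n_snps, rng):
    """Return a random run of `n_snps` adjacent SNPs.

    `freqs` is assumed to be ordered by chromosomal position.
    """
    if n_snps > len(freqs):
        raise ValueError("window is larger than the number of available SNPs")
    start = rng.randrange(len(freqs) - n_snps + 1)
    return freqs[start:start + n_snps]


def replicate_low_freq_fractions(array_freqs, n_snps, replicates=1000, seed=0):
    """Low-frequency fraction for each of `replicates` random contiguous windows."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the sketch reproducible
    return [
        low_freq_fraction(sample_contiguous(array_freqs, n_snps, rng))
        for _ in range(replicates)
    ]
```

The resulting list of replicate values could then be compared with the single low-frequency fraction computed from the sequence data for the same population.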

Comparison of three-population and four-population out-of-Africa models for the ∂a∂i analysis
We compared three general three-population models, each with a different set of parameters. The maximum-likelihood values of each model for each of the three-population datasets are shown in Supplemental Figure 5. The likelihood ratio tests demonstrate that models allowing exponential growth in the two Eurasian continental groups fit significantly better than the models with constant population size in both the Africa-East Asia-Europe (p=0.004) and Africa-India-Europe (p=0.021) datasets. Adding migration rate estimates (ooa_mig, 11 parameters) among populations does not significantly improve the model fit (p>0.7) compared to the model without migration (ooa_simple, 7 parameters). We then compared three general four-population models, each with a different set of parameters. The maximum-likelihood values for each of the four-population models are shown in Supplemental Figure 6. As with the three-population models, models allowing exponential growth in the Eurasian continental groups are significantly more likely than models with constant population size (p<0.01). Of the two models allowing exponential growth, adding migration rate estimates (ooa_fourpop_growth_mig, 13 parameters) among groups does not significantly improve the model fit (p>0.85) compared to the model without migration (ooa_fourpop_growth, 9 parameters). Therefore, in the final analysis we estimated the parameters using the three-population ooa_growth model and the four-population ooa_fourpop_growth model in the interest of minimizing the number of parameters estimated and improving the speed of computation.
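The nested-model comparisons above can be sketched as a standard likelihood ratio test. This is a minimal illustration and not the study's code: it assumes the test statistic follows a chi-square distribution with degrees of freedom equal to the difference in parameter counts, and the function names and the log-likelihood values in the usage note are hypothetical. To stay within the Python standard library, the chi-square tail probability uses the closed form that holds for even degrees of freedom, which covers the two migration comparisons (11 vs. 7 and 13 vs. 9 parameters, both 4 degrees of freedom).

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even `df`.

    For df = 2k the tail is exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form holds only for positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0   # running value of (x/2)^i / i!, starting at i = 0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(ll_simple, ll_complex, k_simple, k_complex):
    """Return (statistic, df, p_value) for two nested models.

    `ll_*` are maximized log-likelihoods and `k_*` are parameter counts;
    the complex model must contain the simple one as a special case.
    """
    stat = 2.0 * (ll_complex - ll_simple)
    df = k_complex - k_simple
    return stat, df, chi2_sf_even_df(stat, df)
```

For example, with hypothetical log-likelihoods of -1000.0 for a 7-parameter model and -999.0 for an 11-parameter model, `likelihood_ratio_test(-1000.0, -999.0, 7, 11)` gives a statistic of 2.0 on 4 degrees of freedom, so the four extra parameters would not be judged a significant improvement. If the likelihoods are composite likelihoods computed from linked sites, the chi-square approximation may need adjustment.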

∂a∂i analysis at the population level
Because of the limited sample size in individual populations, we performed two-population split-with-migration analysis at the population level (Supplemental Figure 9). The results from the two-population model showed that the pattern observed in the analyses of continental groups remained largely the same (Supplemental Table 4). The CIs around the estimates are generally larger, indicating a loss of power due to the smaller sample sizes of the populations compared to the continental groups. In general, Indian populations have the shortest divergence times from the HapMap European populations, especially HapMap TSI. With the exception of CHD, there is little migration between Indian populations and HapMap non-Indian populations. It is noteworthy that Eurasian populations in general have a shorter divergence time from HapMap LWK (from East Africa) than from HapMap YRI (from West Africa). This result might reflect significant population variation within Africa before the out-of-Africa migration. HapMap GIH diverged from south Indian populations between 1.2 kya (Irula) and 15.3 kya (Mala/Madiga), and there is no substantial estimated migration after the divergence (Supplemental Table 4). We were unable to confidently estimate the population relationships among South Indian populations, probably due both to the lack of power in our dataset and to the closely shared history and high level of migration among these populations.

Supplemental Tables
Supplemental Table 1