Human genomic regions with exceptionally high or low levels of population differentiation identified from 911 whole-genome sequences

Background Population differentiation has proved to be effective for identifying loci under geographically-localized positive selection, and has the potential to identify loci subject to balancing selection. We have previously investigated the pattern of genetic differentiation among human populations at 36.8 million genomic variants to identify sites in the genome showing high frequency differences. Here, we extend this dataset to include additional variants, survey sites with low levels of differentiation, and evaluate the extent to which highly differentiated sites are likely to result from selective or other processes. Results We demonstrate that while sites of low differentiation represent sampling effects rather than balancing selection, sites showing extremely high population differentiation are enriched for positive selection events and that one half may be the result of classic selective sweeps. Among these, we rediscover known examples, where we actually identify the established functional SNP, and discover novel examples including the genes ABCA12, CALD1 and ZNF804, which we speculate may be linked to adaptations in skin, calcium metabolism and defense, respectively. Conclusions We have identified known and many novel candidate regions for geographically restricted positive selection, and suggest several directions for further research.

and the frequencies of classes of functionally-related Single Nucleotide Polymorphisms (SNPs) [6], and specific examples of variants showing both frequency differences between populations and relevant functional effects have been reported. Examples of this second class are the A allele at rs1426654 within SLC24A5 contributing to light skin color in Europeans [7], and the T allele at rs4988235 near the lactase gene (LCT) leading to the ability to digest lactose as an adult in Western Asia and Europe [8]. In contrast, rs1129740 at DQA1 in the HLA locus shows similar frequencies of the two alleles (coding for Cys and Tyr respectively) in many populations, linked to both increased susceptibility and protection against conditions such as hepatitis, leprosy and AIDS, possibly as a result of balancing selection [5]. Of course, apparent allele frequency patterns of these kinds are not necessarily the result of adaptation: they can also arise as a consequence of genetic drift, background selection or genotyping error [9][10][11], and the relative contributions of these different processes remain to be determined.
Genetic investigations of the selective forces that have shaped the human genome have mostly been based on limited genetic information: either genotyping of known variable sites using SNP arrays, or studies focused on regions of prior interest. The ability to sequence whole human genomes on a population scale has allowed analyses to be performed using more complete datasets less biased by prior expectations [12,13]. Within the 1000 Genomes Project, we have previously identified, validated and reported SNPs with unusually high levels of population differentiation between continents, and also between populations within continents (HighD sites) [13]. These analyses revealed that there were no variants in the genome that were fixed for one allele in one continent/population and for the other allele in another continent/population; it also re-discovered the known examples mentioned above, and identified many additional highly differentiated alleles. We have also explored the functional annotations of these HighD sites, showing that they are enriched for categories such as non-synonymous variants, DNase hypersensitive sites and site-specific transcription factor binding sites, supporting the idea that they are functionally important [14].
In the current study, we have extended these analyses in three ways. First, we have included additional types of variant from the same set of samples (indel calls and large 5 deletions from the integrated callset [13] and additional indel calls from the exome dataset); second, we have explored the properties of the most extremely low-differentiated variants (LowD sites); and third, we have further analyzed the most highly differentiated variants to investigate the extent to which they are likely to result from selection of different kinds, or other processes. In this way, we have identified many novel candidates for geographically restricted positive selection, and suggest several directions for further research.

Identification of regions of unusually high and low differentiation among human populations
We systematically investigated the pattern of genomic differentiation among 911 individuals from the 1000 Genomes Project. We considered 36.8 million genomic variants (35,587,323 SNPs; 1,244,127 small insertion/deletions, INDELs; 13,110 large deletions, structural variants or SVs) from the 1000 Genomes Phase I integrated dataset [13] and 7,210 additional previously-unreported high quality exomic INDELs from the same samples (see Materials and Methods). We restricted our analysis to the 911 individuals belonging to the three major continental groups; samples with small sample size (IBS) and known extensive admixture (CLM, MXL, PUR) were excluded from this analysis (Supplementary Table 1).
In order to find genomic sites with high differentiation between populations, we calculated derived allele frequency differences (ΔDAF) at all variants for pairs of continents and pairs of populations within continents. This measure, ΔDAF, is highly correlated with F ST (Spearman r 0.93 -0.95, all p-values <10e-16 in a set of 5,000 random sites from chromosome 20; Supplementary Figure 1). We performed extensive simulations to demonstrate that, like F ST , ΔDAF has >98% power to detect partial and complete selective sweeps under the range of conditions examined, and performs better than XP-EHH for partial sweeps (Supplementary Figure 2). For each pair of populations/continents, genomic ΔDAFs were ranked and a rank p-value was assigned. This analysis showed that sites with high differentiation were often strongly clustered in the genome (Figure 1), and in some cases a single differentiation event dominated the distribution ( Figure 1B). Thus a list of the most highly differentiated sites 6 based on a simple threshold would include redundant variants resulting from the same biological event, and miss other events. Assuming that one variant in each cluster drove the differentiation, and the other variants in the cluster reflect the limited recombination during differentiation, we devised the following strategy to capture the key variants. We subdivided the 1% most extremely differentiated variants into non-overlapping windows, exploring different window sizes, picked the most highly differentiated variant in each window, and retained it if it exceeded a specified ΔDAF value, again exploring a range of values (Supplementary Figure 3). As a result, we chose a window size of 5,000 variants and ΔDAF thresholds of >0.7 and >0.25 between and within continents, respectively. These choices provided a balance between stringency (enriching for the most highly differentiated variants) and number of sites (retaining sufficient for the group analyses described below).
We applied a similar strategy to the exomic INDELs, but because of the smaller number of variants we only applied the ΔDAF threshold criteria without further filtering by window size. After removing variants with low genotype quality, we identified 1,137 unique variants, roughly corresponding to one HighD site every 2.  Table 2).
We also identified regions of low differentiation among continents, and among populations within continents. In addition to the factors taken into account for HighD sites, we considered two additional features. Rare sites inevitably show low differentiation, but this is a consequence of their rarity rather than of possible balancing selection, and such sites were thus excluded. We also observed a class of sites with allele frequency 0.5 in all populations, which are likely to represent miscalled paralogous variants. For these reasons, we restricted our analysis to approximately two million sites (1,835,365 SNPs;172,191 INDELs;285 SVs) with 0.40≤DAF≤0.60 (in the entire sample of 911), excluding sites with 0.45<DAF<0.55. As a measure of low differentiation, we calculated the coefficient of variation of derived allele frequencies among populations (cvDAF). We ranked cvDAF genome-wide and again assigned an empirical p-value, with the lowest p-value corresponding to the smallest cvDAF. As with HighD sites, LowD sites also tended to cluster in the genome (Supplementary Figure 3 were replicated in independent samples, the low levels of population differentiation measured by the cvDAF value were specific to the Phase I samples and are not a more general feature of the populations that the samples are derived from. We have therefore not analyzed the LowD sites further, and concentrate below on the HighD sites.

Do HighD sites result from selection or drift?
We expect the HighD sites to result from a combination of drift and positive selection, and wished to evaluate the relative contributions of these two evolutionary forces. We therefore estimated the number of HighD sites expected under neutrality in betweencontinent comparisons using sequence data simulated under three established demographic models for these populations [16,17]. We anticipated that the number of HighD sites under neutrality would be smaller than the observed number, and the difference would provide an estimate of the number resulting from non-neutral processes. In contrast, the opposite We therefore turned to alternative sources of insights. Published studies have listed genes showing evidence of positive selection in genome-wide scans; although these lists differ between studies, a core set of genes detected by multiple scans has been compiled [18][19][20].
Genes harboring HighD sites were significantly enriched in this core set (34% observed vs 8% on average from 100 control gene sets; one sided t-test p-value <2.2e-16; Figure 3).
Moreover, HighD genes with the highest ΔDAF values were more strongly enriched (37% for the 4th quartile), and restricting the analysis to protein-coding genes increased the overlap to 48% (average from 100 control sets of randomly selected genes is 13%; one sided t-test p-value <2.2e-16). The range of ΔDAF values of the HighD sites in overlapping genes was nevertheless 0.72-0.98 in between-continent comparisons and 0.25-0.63 in within-continent comparisons, suggesting that genes exhibiting the full range of ΔDAF values considered here may, according to other criteria, be positively selected, with support for selection in more than a third. We next looked in our own data for specific evidence of positive selection in the regions around the HighD sites (both genic and non-genic) identified through pairwise continental comparisons. We assumed that the selected allele was the derived one and that it had been selected in the population where its frequency was higher. We calculated haplotype-based statistics (iHS [21] and XP-EHH [22]) which are sensitive to recent positive selection, to 9 assess the evidence for selection at both HighD sites and a set of randomly selected genomic variable sites matched for allele frequency and distance from genes (Supplementary Figure 9). HighD sites were found to have higher iHS and XP-EHH values (i.e. to be under selection) specifically in the population with the highest DAF ( Figure 4). A similar analysis using alternative tests for selection (a combined p-value based on Tajima's D, Fay and Wu's H and a composite likelihood ratio test [23]) also provided support for positive selection in the population with the highest DAF, as well as for weaker selection in some additional populations (Supplementary Figure 10).
Since in a few cases specific functional targets of positive selection have been identified, we further investigated the relationship between these known targets and the variant identified by the HighD analysis. In all three examples from comparisons between continents, the precise functional target was identified correctly. These were rs1426654 in SLC24A5 and rs16891982 in SLC45A2, both associated with light skin pigmentation in individuals with European ancestry [7,24] and rs2814778 in DARC [25] (Figure 5a) encoding the Duffy blood group antigen O allele and associated with malaria resistance in Africans.
Similarly, in comparisons between populations within continents, rs4988235 in MCM6 and in the promoter region of LCT, responsible for lactase persistence [8] and rs12913832 in HERC2 associated with blue eye color [26,27] were identified correctly. We therefore conclude from these analyses that a substantial proportion of the HighD sites arose by positive selection rather than by other mechanisms, and that where it could be tested, we pick out the functional variant.

Do HighD selected sites represent cases of hard or soft sweeps?
With this knowledge, we could investigate how often positive selection in humans, ascertained in this way, acted via a classic selective sweep compared with acting on standing variation. We took two approaches to this question, first further exploring some of the relevant properties of hard sweeps by simulations, and second comparing with accepted examples of hard sweeps. We expected that a hard sweep would lead to the selected variant often lying on a single haplotype (plus its descendants resulting from mutation and recombination), while under neutrality or a soft sweep it would be more likely to lie on multiple diverse haplotype. We therefore extracted from the simulations presented above 10 (to explore the power of the ΔDAF statistic to detect selection) the 2kb haplotype carrying the derived allele at each selected HighD SNP, or a frequency-matched control. To provide an estimate of how different the haplotypes surrounding each target variant are from each other, we calculated the average pairwise Levenshtein distance [28] Figure 15). Thus, we estimated that the true fraction of sites with features similar to reference sites is 30% (i.e. 66% minus 36%) and this is significantly more than in

HighD sites are enriched in genes and other functional elements
Having established the likelihood of positive selection acting on a proportion of HighD sites, probably via both hard and soft sweeps, we turned to an investigation of the probable functional targets of this selection. We first considered HighD site association with annotated genes coding for either proteins or RNAs. There was a strong enrichment for genic sites compared to frequency-matched controls (odds ratio = 6.9, Fisher exact test pvalue <1e-10). We did not find enrichment for HighD sites with reported eQTLs [29], but have previously reported their enrichment in nonsynonymous SNPs and UTRs among coding sequences, and DNase hypersensitive sites and binding sites of some classes of transcription factors among non-coding regulatory regions [14]. These findings suggest that changes in both amino acid sequences and regulatory elements may have been selected.
In order to determine whether or not specific classes of genes were enriched for HighD sites, two approaches were used. Functional annotation clustering was carried out using the Database for Annotation, Visualization, and Integrated Discovery (DAVID v.6.7, [30]). Using the highest classification stringency with a Bonferroni correction for multiple comparisons, no significant enrichment of any Gene Ontology (GO) term associated with any biological process, cellular compartment, or molecular function was found. However, using medium classification stringency, a significant enrichment was observed for transcription factor binding sites at the continental and all population levels. In addition, we analyzed the same sites by Ingenuity Pathway Analysis. Here also, no significant associations were observed after applying a Bonferroni correction for multiple comparisons. Thus HighD sites, and by implication the local adaptations that underlie a proportion of them, involve diverse biological pathways, none of which predominates at our level of sensitivity.

Novel insights into individual genomic candidates for population-or continent-specific positive selection
At the continental level, 58% (65% when restricting to protein coding) of the HighD variants 12 are in genes previously reported to be under positive selection. Within continents, 24% (30% when restricting to protein coding) of the HighD genes belong to lists of genes putatively under positive selection. We have thus discovered a large number of HighD sites that appear to represent novel candidates for selection. Confirmation of this possibility would require functional investigations, which are beyond the scope of the current study.
However, we note that while this manuscript was under review, one of our candidates, rs1871534 in ZIP4, was the subject of such a detailed study, and strong functional support for positive selection acting on altered zinc transport was obtained [31]. Encouraged by this, we now present a number of other candidates that illustrate a range of features.
Among the within-continent comparisons, the C→T intronic polymorphism rs77943343 is which, however, appears to reflect an independent mutation whose derived allele has frequency 0.024 in AFR and 0.002 in EUR but is absent in ASN. rs77943343 lies in a candidate enhancer region and in a site for histone H3 acetylation in skeletal muscle myoblasts, 497 bp downstream of an inactive promoter and 11.5 kb upstream of the active promoter (11.6 kb from the gene's first codon) and thus might be relevant to the regulation of this gene (Supplementary Figure 16). CALD1 is implicated in Ca ++ -dependent smooth muscle contraction. Knockout of CALD1 paralogs in zebrafish results in altered intestinal peristalsis [32] and in humans CALD1 has been associated with gastric cancer [33] and endometriosis [34]. Network analysis of haplotypes surrounding rs77943343 in the ASN populations shows the increased frequency of a core haplotype specific to the JPT and six rare single-step derivatives also carrying the derived allele, all also present in the JPT and five specific to this population (Supplementary Figure 16). We speculate that the derived allele could have improved the efficiency of the use of Ca ++ in conditions of low dietary calcium intake, such as the Japanese diet.
We observed two other examples of HighD sites located in genes related to calcium 13 metabolism, both with high DAF in the LWK: rs7818866 in VDAC3 and rs6578434 in STIM1 (DAF=0.67 and 0.63, respectively, compared with DAF <5% in other populations). VDAC3 is a voltage-dependent anion channel type 3, essential for sperm mobility, while STIM1 generates the Ca ++ ions in oocytes during fertilization and is essential for this process.  (Figure 7). ABCA12 is a member of the ABC family of proteins that transport molecules across cell membranes. Its expression is regulated by ultraviolet light [35] [35] and loss-of-function mutations in ABCA12 cause the severe skin disorder autosomal recessive congenital ichthyosis (OMIM: 607800; [36,37]). Abca12(-/-) mice closely reproduce the human disease phenotype, and provide evidence that ABCA12 plays important roles in lung surfactant and skin barrier functions [38]. Knockdown of the zebrafish ortholog similarly resulted in a phenotype including altered keratinocyte morphology [39]. Other members of the ABC family have also been identified in previous scans for positive selection [40]. The HighD SNP in ABCA12 lies near a cluster of DNase I hypersensitive sites and a candidate enhancer specific to epithelial cells and epidermal keratinocytes (Figure 7). We suggest that this gene has been positively selected for altered expression as part of the adaptation of the skin in populations outside Africa, perhaps as a result of altered exposure to UV radiation.  [41]. An intronic polymorphism 14 (rs1344706) in this gene has been associated with schizophrenia [42] and a third variant, rs4667001 in the fourth exon, changes both an amino acid and mRNA levels [43]. The ACA insertion is in strong LD with rs4667001 (r^2 =0.96) but less so with rs1344706 (r^2= 0.45).
Because of the threonine insertion, the protein has an additional site for post-translational modifications such as glycosylation and phosphorylation. Phosphorylation of other proteins (e.g. the deubiquitinating enzyme OTUB1) has been demonstrated to regulate susceptibility to pathogens of the Yersinia family [44], of which some members probably evolved in China [45], and thus we speculate that the insertion may have been selected in relation to pathogen resistance.

Discussion
Positive selection acts through several mechanisms, and the predominant ones in human populations remain to be clarified. Population differentiation is one of the most straightforward ways to identify a subset of variants experiencing this form of selection, and indeed some of the resulting phenotypes have been recognized and studied by physical anthropologists for over a century, while the genes and specific variants that underlie them are now being identified by geneticists [5]. Whole-genome sequences from population samples provide a powerful starting resource for this, and are now available from the 1000 Genomes Project [12,13]. In the current study, we have examined a near-complete catalog of the accessible sites in the genome that represent extreme differentiation between and within the African, European and East Asian population groups participating in the project. This study is more comprehensive than previous surveys of highly-differentiated variants, which in large-scale studies have been limited to examining the SNPs included on genotyping arrays [11,46] or SNPs in protein-coding genes discovered by resequencing three populations [12]. Our strategy of choosing the highest ΔDAF value from a fixed window expands the range of signals we can discover, but will nevertheless exclude weaker but still potentially interesting selection signals from the windows with a strong signal, and future studies could use more sophisticated approaches to identifying peak signals. A level of population differentiation higher than that at the DARC gene has sometimes been used as a criterion, and here we observe novel ΔDAF values greater than DARC at three positions: rs6014096 in DOCK5, rs1596930 in EXOC6B and rs12903208 in PRTG (Figure 5b) , all in introns. In addition, genes such as CALD1, ABCA12 and ZNF804A provide intriguing examples for follow-up in model organisms, the last also illustrating the importance of considering variants other than SNPs.
Our data throw light on two topics of current debate about selection in humans. First, some but not all studies [18] have found more evidence for recent positive selection outside Africa than inside. It has been difficult to interpret the results of tests that incorporate haplotype structure, because recombination differs between populations, with lower levels of linkage disequilibrium and different PRDM9 alleles and recombination hotspots in Africa [47]. Similarly, SNP ascertainment in African populations has been less thorough than in European populations, and so analyses based on known SNPs have been biased against discovering highly differentiated sites in Africans. The identification of HighD sites from full sequence data, however, is unaffected by recombination or ascertainment, and the lower 16 number of HighD sites in Africa (25 vs 110 each in EUR and ASN) supports the hypothesis of less positive selection of this type, despite the larger effective population size which should make selection more efficient.
Second, our identification of a large sample of likely positive selection events allows us to consider the relative importance of classic sweeps compared with selection on standing variation in humans. To do this, we need to know the proportion of HighD sites that result from positive selection rather than drift, and the proportion that result from classic sweeps.
If we took the proportion of HighD sites that overlap genes with published evidence of positive selection as a minimum estimate of the proportion of selective events (58% for continental comparisons), and the proportion of inter-continental HighD sites with low Levenshtein distances (see Results) as a minimum estimate of the proportion of classic sweeps (30%), we would estimate about one-half. Both of these proportions are likely to be under-estimates -DARC would be excluded from the classic sweep numbers, for exampleand thus this estimate is highly uncertain, but in contrast to some other analyses [48], emphasizes an important role for classic sweeps.

Conclusions
This study has established a comprehensive catalog of most of the variants, including SNPs, INDELS and SVs, that are highly differentiated between the major populations of sub-Saharan Africa, Europe and East Asia. Remarkably, this simple approach, when applied to whole-genome sequences from large population samples, usually seems to lead directly to the functional variant responsible for the differentiated phenotype. Several of the most highly differentiated variants and their associated phenotypes were known long before this work, testifying to the high visibility of the phenotypes, but the remaining catalog should be a rich source of starting points for investigations of phenotypes which should be equally important and fascinating.

Methods
We used genotype calls from two sources: the 1000 Genomes Phase I integrated callset [13] and 7,210 additional previously-unreported high quality exonic INDELs from the same samples. In each case, we restricted our analyses to 911 individuals from ten populations with ancestry from Africa, Europe or East Asia (Supplementary Table 1). Allele frequencies were calculated using vcffixup from vcflib and used to identify HighD sites. We determined distance from the nearest annotated gene using information from Ensembl v72 and subsequently generated sets of sites used in various analyses as matched random genomic sites by matching allele frequency and distance from gene start site (Supplementary Figure   10), taking into account the population to which the HighD site was assigned. Occurrence of HighD and matched sites among eQTLs was estimated using the GeneVar database [29].
Weir and Cockerham's F ST [49] was calculated using vcftools [50]. F ST values were used to evaluate correlation with ΔDAF values for the same variants in chosen regions.
Using COSI, we simulated the evolution of 600 kb regions under a published demographic model [16] for three continental populations, namely AFR, EUR and ASN. We generated 3,000 replicates of neutral evolution (i.e. without any selected variant) as well as 3 x 100 replicates of 30 selective sweep scenarios as described [20], The selective coefficient was set to drive the selected allele up to frequency of 0.2, 0.4, 0.6, 0.8 or 1. For each replicate, we ran iHS [21], XP-EHH [22], F ST [49] and ΔDAF [13] using a pipeline described previously [51]. For cross-population statistics, we considered all the six pairwise comparisons. We then inferred the sensitivity (true positive rate) of those four methods by (i) calculating the 95th percentile of the score distributions obtained across the 3000 neutral replicates, hence inferring the threshold corresponding to a 5% false discovery rate, and (ii) counting the proportion of replicates under a given sweep scenario with a score for the selected variant above this threshold. For each population, we performed the sensitivity analysis for five different sweep scenarios by grouping the replicates where the 18 final allele frequency of the selected variant was the same, as well an "overall" sweep scenario grouping all the replicates with selection in a given population.

Genotype concordance of HighD and LowD sites between Phase I data and publicly available
Complete Genomics data was calculated as the proportion of concordant calls among 112 overlapping individuals averaged across loci. Allele frequency concordance was calculated between Phase I EUR, AFR, ASN populations and three non-overlapping sets of individuals from the same continents from the HapMap3 study [15] [15]). Because overlap of sites on the chip with HighD and LowD sites was poor (100 HighD and 29 LowD sites) we also included comparisons at sites in high LD (r^2 >0.8) with HighD and LowD sites (255 and 202 sites, respectively). LD was calculated using Vcftools [50]. Concordance was assessed using a Spearman correlation coefficient.
In order to estimate the expectation for number of HighD sites under neutrality, we simulated 100 replicates of 500 Mb DNA sequence data (subdivided into 10 chromosomes of the same size) from 911 diploid genomes from three populations according to two published demographic models [16,17] for the three continental populations AFR, ASN and EUR using MaCS [52]; we included variable recombination rates by incorporating information from random regions from HapMap Phase3 recombination maps [15]. Because all models gave led to similar conclusions we only report results from one [17]. We estimated the number of HighD sites in 500 Mb of simulated data as for the Phase 1 data, and then scaled this number to the size of the accessible genome (2,526,390,487 base pairs We summarized from the literature a list of 3,467 genes that have been previously identified in in genomic scans for positive selection [18][19][20] and compiled the occurrence of non-redundant genes hosting HighD sites (HighD-genes; n=542) in it. To obtain estimates of random expectation we calculated the occurrence of 100 sets of control genes in the list of positively selected genes. Control genes were randomly selected to match the number of HighD-genes or fractions of this set based on ΔDAF quartiles (n=247 and n=146 for 3 rd and 19 4th quartiles, respectively). A one-sided t-test was used to assess the significance of the differences in overlap.
We estimated genome-wide statistics informative about selection from sequence data of the Phase I data set. These statistics were based on 10 kb windows and include three allele frequency spectrum-based tests, Tajima's D, Fay and Wu's H, and Nielsen et al.'s composite likelihood ratio (CLR), calculated and combined to give a single p-value as described previously [23]. We also estimated the haplotype-based statistics iHS [21] and XP-EHH [22] using Selscan [53] at sites for which recombination maps and ancestral allele information were available. In order to make the continent and population level analyses comparable, in each group we restricted the analysis to a set of 30 randomly-chosen individuals. The iHS and XP-EHH scores obtained for each SNP in each population/continent were divided into allele frequency classes and, within each class, normalized following standard procedures [21,40]. Similarly, we calculated iHS and XP-EHH for a set of genomic sizes matched to HighD sites for allele frequency in the combined sample and distance from gene start site. A two-sample Kolmogorov-Smirnov test was used to evaluate differences between the HighD and matched site iHS distributions.
The Database for Annotation, Visualization, and Integrated Discovery (DAVID v.6.7) was used for functional characterization of the highly differentiated variants that lay within genes. The Functional Annotation Clustering option was used by adding Panther and Reactome to Pathways, Panther to Protein Domains and Reactome and UCSC TFBS to Protein Interactions to the default settings. In another approach, Ingenuity Pathway Analysis (IPA, Ingenuity Systems, Redwood City, CA) was carried out for the same set of variants.
Ensembl gene identifiers were uploaded and core analyses was carried out for the differentiated variants at the continental (CON) and population levels (AFR, ASN and EUR). The analyses generated networks based on their connectivity in the Ingenuity Knowledge Base (IKB), which includes experimental data from human, mouse and rat models. The core analyses selected only those interactions which have been experimentally observed and included pathways based on 1) Diseases and disorders, 2) Molecular and cellular functions and 3) Physiological system development and function. The significance of the association between the dataset and the pathways was tabulated by estimating a ratio 20 between the number of genes from the dataset that met the expression value cut off that map to the pathway, and the total number of molecules present in the pathway. A conservative Bonferroni p-value threshold was used to account for multiple testing.
Median-Joining networks for haplotype visualization and analysis were generated using Network 4.6.1.1 [54]. Since this version of the software can display only 100 chromosomes per circle, we selected each time 300 random chromosomes. The CALD1 and ABCA12 haplotypes were based on all the SNPs in D'=1 within 10 kb of the HighD site in the population with the highest DAF; for the networks in Figure 7 and Supplementary Figure 17, haplotypes were derived from polymorphisms within 2kb surrounding the HighD site.     to 100 control sets of randomly-selected genes is reported. "with highDeltaDAF" refers to HighD sites whose ΔDAF is ranked in the fourth quartile. values of the two statistics relative to randomly selected genomic variable sites matched for allele frequency and distance from gene are also shown (indicated as "matched" or "m_ ", gray lines). p-values refer to two-sample Kolmogorov-Smirnov tests between iHS or XP-EHH distributions in HighD and matched sites. In (b), for every population XP-EHH was calculated using as reference the two others (two shades of pink).