Human genomic regions with exceptionally high levels of population differentiation identified from 911 whole-genome sequences
© Colonna et al.; licensee BioMed Central Ltd. 2014
Received: 3 May 2014
Accepted: 30 June 2014
Published: 30 June 2014
Population differentiation has proved to be effective for identifying loci under geographically localized positive selection, and has the potential to identify loci subject to balancing selection. We have previously investigated the pattern of genetic differentiation among human populations at 36.8 million genomic variants to identify sites in the genome showing high frequency differences. Here, we extend this dataset to include additional variants, survey sites with low levels of differentiation, and evaluate the extent to which highly differentiated sites are likely to result from selective or other processes.
We demonstrate that while sites with low differentiation represent sampling effects rather than balancing selection, sites showing extremely high population differentiation are enriched for positive selection events and that one half may be the result of classic selective sweeps. Among these, we rediscover known examples, where we actually identify the established functional SNP, and discover novel examples including the genes ABCA12, CALD1 and ZNF804, which we speculate may be linked to adaptations in skin, calcium metabolism and defense, respectively.
We identify known and many novel candidate regions for geographically restricted positive selection, and suggest several directions for further research.
The environment, acting through selective processes, is one of the main forces that shapes the genomes of species on an evolutionary timescale . A population carries a certain level of genetic variation, known as ‘standing variation’; most of these variants are neutral and have no appreciable effect on fitness, but some do influence fitness. When the environment changes, natural selection acts on these non-neutral variants (and indirectly on linked variants), and over subsequent generations their frequencies may change in response. Genetic influences on fitness are typically small, and many variants usually affect the same trait, so adaptation is generally the result of small changes in frequency of multiple alleles: a ‘soft sweep’ . Occasionally, a single new mutation has a major advantageous effect and can dominate the selective response, leading to a ‘hard sweep’ . In addition, sexual selection provides an additional force on the genome , with consequences for patterns of genetic variation in the population that are difficult to distinguish from those resulting from natural selection, although hard sweeps based on the perceived attractiveness of a new observable variant may be more frequent. In other circumstances, natural selection may favor more than one allele at a particular site in the genome - for example, if the heterozygote has the highest fitness, or the rarer allele is generally at an advantage - and this can lead to similar allele frequencies in different populations, sometimes maintained for long periods . Thus, allele frequency comparisons between different populations can potentially be informative about several forms of selection.
Humans have a worldwide distribution, but this is, evolutionarily, recent. Human populations outside Africa originated from migration out of Africa 50 to 70 thousand years ago, with low levels of admixture with archaic hominins, that reached all inhabitable continents by about 15 thousand years ago . Our ancestors have thus been exposed to a diverse range of environments over this time period, experiencing different temperatures, altitudes, ultra-violet radiation levels, foods and pathogens; and with the development of farming and livestock domestication in some places within the last 10 thousand years, additional changes in lifestyles. The common variants in our genomes are mostly older than the out-of-Africa expansion, and thus their frequencies are potentially informative about adaptations to these environmental changes. Indeed, both general correlations between climatic variables and the frequencies of classes of functionally related single nucleotide polymorphisms (SNPs) , and specific examples of variants showing both frequency differences between populations and relevant functional effects have been reported. Examples of this second class are the A allele at rs1426654 within SLC24A5 contributing to light skin color in Europeans , and the T allele at rs4988235 near the lactase gene (LCT) leading to the ability to digest lactose as an adult in Western Asia and Europe . In contrast, rs1129740 at DQA1 in the HLA locus shows similar frequencies of the two alleles (coding for cysteine and tyrosine, respectively) in many populations, linked to both increased susceptibility and protection against conditions such as hepatitis, leprosy and AIDS, possibly as a result of balancing selection . Of course, apparent allele frequency patterns of these kinds are not necessarily the result of adaptation: they can also arise as a consequence of genetic drift, background selection or genotyping error [9–11], and the relative contributions of these different processes remain to be determined.
Genetic investigations of the selective forces that have shaped the human genome have mostly been based on limited genetic information: either genotyping of known variable sites using SNP arrays, or studies focused on regions of prior interest. The ability to sequence whole human genomes on a population scale has allowed analyses to be performed using more complete datasets less biased by prior expectations [12, 13]. Within the 1000 Genomes Project, we have previously identified, validated and reported SNPs with unusually high levels of population differentiation between continents, and also between populations within continents (HighD sites) . These analyses revealed that no variants in the genome were fixed for one allele in one continent/population and for the other allele in another continent/population; they also re-discovered the known examples mentioned above, and identified many additional highly differentiated alleles. We have also explored the functional annotations of these HighD sites, showing that they are enriched for categories such as non-synonymous variants, DNase hypersensitive sites and site-specific transcription factor binding sites, supporting the idea that they are functionally important .
In the current study, we have extended these analyses in three ways. First, we have included additional types of variant from the same set of samples (indel calls and large deletions from the integrated callset  and additional indel calls from the exome dataset); second, we have explored the properties of the most extremely low-differentiated variants (LowD sites); and third, we have further analyzed the most highly differentiated variants to investigate the extent to which they are likely to result from selection of different kinds, or other processes. In this way, we have identified many novel candidates for geographically restricted positive selection, and suggest several directions for further research.
Identification of regions of unusually high and low differentiation among human populations
We systematically investigated the pattern of genomic differentiation among 911 individuals from the 1000 Genomes Project. We considered 36.8 million genomic variants (35,587,323 SNPs; 1,244,127 small insertion/deletions (INDELs); 13,110 large deletions or structural variants (SVs) from the 1000 Genomes phase I integrated dataset  and 7,210 additional previously unreported high quality exomic INDELs from the same samples (Materials and methods). We restricted our analysis to the 911 individuals belonging to the three major continental groups; populations with small sample size (IBS) and known extensive admixture (CLM, MXL, PUR) were excluded from this analysis (Table S1 in Additional file 1).
We also identified regions of low differentiation among continents, and among populations within continents. In addition to the factors taken into account for HighD sites, we considered two additional features. Rare sites inevitably show low differentiation, but this is a consequence of their rarity rather than of possible balancing selection, and such sites were therefore excluded. We also observed a class of sites with allele frequency 0.5 in all populations, which are likely to represent miscalled paralogous variants. For these reasons, we restricted our analysis to approximately two million sites (1,835,365 SNPs, 172,191 INDELs, 285 SVs) with 0.40 ≤ DAF ≤ 0.60 (in the entire sample of 911), excluding sites with 0.45 < DAF < 0.55. As a measure of low differentiation, we calculated the coefficient of variation of derived allele frequencies among populations (cvDAF). We ranked cvDAF genome-wide and again assigned an empirical P-value, with the lowest P-value corresponding to the smallest cvDAF. As with HighD sites, LowD sites also tended to cluster in the genome (Figure S4 in Additional file 2) and thus we explored a range of combinations of cvDAF and window size (Figure S5 in Additional file 2) to retain variants with the lowest cvDAF, conditioning on cvDAF <0.01 in non-overlapping windows of 100 variants. We identified 236 variants that met these criteria (213 SNPs and 23 INDELs; Table S3 in Additional file 1).
Most HighD variants are SNPs (85%) and the INDEL/SNP ratio in HighD sites with good genotype quality (genotype imputation quality from MaCH/Thunder >0.8) [16, 17] is similar to that in matched controls (0.10 in HighD, 0.12 in controls; Fisher exact test P-value = 0.25). We validated the HighD and LowD sites with several approaches, in addition to the Sequenom genotyping reported previously . First, we observed high genotype concordance at 808 sites that were independently genotyped in 112 individuals in both phase I and complete genomics  datasets (0.99 for one lowD INDEL; medians of 0.98 for 629 HighD SNPs, 0.94 for 8 HighD INDELs, and 0.99 for 170 lowD SNPs). Second, we compared allele frequencies between the phase I populations and 255 independent individuals from the same populations genotyped on a different platform . While we observed a high correlation of allele frequencies and ΔDAF values for HighD sites (r = 0.82 to 0.97, P-value <2.0 × 10-10), the correlation of the coefficient of variation for LowD sites was poor (r = 0.13, P-value = 0.278), despite a good correlation in allele frequencies (r = 0.94 to 0.98 and r = 0.77 to 0.96 in HighD and lowD sites, respectively; Figure S6 in Additional file 2). Thus, while both sets of genotypes are reliable and the high levels of population differentiation measured by the ΔDAF value were replicated in independent samples, the low levels of population differentiation measured by the cvDAF value were specific to the phase I samples and are not a more general feature of the populations that the samples are derived from. We have therefore not analyzed the LowD sites further, and concentrate below on the HighD sites.
Do HighD sites result from selection or drift?
Do HighD selected sites represent cases of hard or soft sweeps?
HighD sites are enriched in genes and other functional elements
Having established the likelihood of positive selection acting on a proportion of HighD sites, probably via both hard and soft sweeps, we turned to an investigation of the probable functional targets of this selection. We first considered HighD site association with annotated genes coding for either proteins or RNAs. There was a strong enrichment for genic sites compared to frequency-matched controls (odds ratio = 6.9, Fisher exact test P-value <1 × 10-10). We did not find enrichment for HighD sites with reported expression quantitative trait loci , but have previously reported their enrichment in nonsynonymous SNPs and UTRs among coding sequences, and DNase hypersensitive sites and binding sites of some classes of transcription factors among non-coding regulatory regions . These findings suggest that changes in both amino acid sequences and regulatory elements may have been selected.
In order to determine whether or not specific classes of genes were enriched for HighD sites, two approaches were used. Functional annotation clustering was carried out using the Database for Annotation, Visualization, and Integrated Discovery (DAVID v.6.7) . Using the highest classification stringency with a Bonferroni correction for multiple comparisons, no significant enrichment of any Gene Ontology term associated with any biological process, cellular compartment, or molecular function was found. However, using medium classification stringency, a significant enrichment was observed for transcription factor binding sites at the continental and all population levels. In addition, we analyzed the same sites by Ingenuity Pathway Analysis. Here, also, no significant associations were observed after applying a Bonferroni correction for multiple comparisons. Thus, HighD sites, and by implication the local adaptations that underlie a proportion of them, involve diverse biological pathways, none of which predominates at our level of sensitivity.
Novel insights into individual genomic candidates for population- or continent-specific positive selection
At the continental level, 58% (65% when restricting to protein coding) of the HighD variants are in genes previously reported to be under positive selection. Within continents, 24% (30% when restricting to protein coding) of the HighD genes belong to lists of genes putatively under positive selection. We have thus discovered a large number of HighD sites that appear to represent novel candidates for selection. Confirmation of this possibility would require functional investigations, which are beyond the scope of the current study. However, we note that while this manuscript was under review, one of our candidates, rs1871534 in ZIP4, was the subject of such a detailed study, and strong functional support for positive selection acting on altered zinc transport was obtained . Encouraged by this, we now present a number of other candidates that illustrate a range of features.
Among the within-continent comparisons, the C > T intronic polymorphism rs77943343 is located in the caldesmon 1 (CALD1) gene, which encodes a calmodulin- and actin-binding protein regulating smooth muscle and non-muscle contraction. Almost half of the JPT sample carries the derived allele at rs77943343 (DAF = 0.49), which is less frequent in the other Asian (CHB = 0.15, CHS = 0.18) and continental populations in general (AFR = 0.18, EUR = 0.05). This variant lies immediately adjacent to another polymorphism, rs77994671, which, however, appears to reflect an independent mutation whose derived allele has frequency 0.024 in AFR and 0.002 in EUR but is absent in ASN. rs77943343 lies in a candidate enhancer region and in a site for histone H3 acetylation in skeletal muscle myoblasts, 497 bp downstream of an inactive promoter and 11.5 kb upstream of the active promoter (11.6 kb from the gene’s first codon) and thus might be relevant to the regulation of this gene (Figure S16 in Additional file 2). CALD1 is implicated in Ca++-dependent smooth muscle contraction. Knockout of CALD1 paralogs in zebrafish results in altered intestinal peristalsis and in humans CALD1 has been associated with gastric cancer  and endometriosis . Network analysis of haplotypes surrounding rs77943343 in the ASN populations shows the increased frequency of a core haplotype specific to the JPT population and six rare single-step derivatives also carrying the derived allele, all also present in the JPT population, and five specific to this population (Figure S16 in Additional file 2). We speculate that the derived allele could have improved the efficiency of the use of Ca++ in conditions of low dietary calcium intake, such as the Japanese diet [37, 38].
We observed two other examples of HighD sites located in genes related to calcium metabolism, both with high DAF in the LWK: rs7818866 in VDAC3 and rs6578434 in STIM1 (DAF = 0.67 and 0.63, respectively, compared with DAF <5% in other populations). VDAC3 is a voltage-dependent anion channel type 3, essential for sperm mobility, while STIM1 generates the Ca++ ions in oocytes during fertilization and is essential for this process. These three examples together highlight the potential importance of calcium metabolism in human adaptation.
Second, the derived allele at an exomic INDEL (a novel variant; chromosome 2, position 185802211; Table S2 in Additional file 1) is an insertion of 3 bp (ACA), which adds a threonine residue at amino acid position 697, within the first exon of the zinc finger protein 804A gene (ZNF804A). This variant is present at high frequency in the ASN population (DAF = 0.83 versus 0.58 and 0.07 in EUR and AFR, respectively). ZNF804A acts as a transcription factor and regulates the transcription of genes related to schizophrenia . An intronic polymorphism (rs1344706) in this gene has been associated with schizophrenia  and a third variant, rs4667001 in the fourth exon, changes both an amino acid and mRNA levels . The ACA insertion is in strong linkage disequilibrium (LD) with rs4667001 (r2 = 0.96) but less so with rs1344706 (r2 = 0.45). Because of the threonine insertion, the protein has an additional site for post-translational modifications such as glycosylation and phosphorylation. Phosphorylation of other proteins (for example, the deubiquitinating enzyme OTUB1) has been demonstrated to regulate susceptibility to pathogens of the Yersinia family , of which some members probably evolved in China , and thus we speculate that the insertion may have been selected in relation to pathogen resistance.
Positive selection acts through several mechanisms, and the predominant ones in human populations remain to be clarified. Population differentiation is one of the most straightforward ways to identify a subset of variants experiencing this form of selection, and indeed some of the resulting phenotypes have been recognized and studied by physical anthropologists for over a century, while the genes and specific variants that underlie them are now being identified by geneticists . Whole-genome sequences from population samples provide a powerful starting resource for this, and are now available from the 1000 Genomes Project [12, 13]. In the current study, we have examined a near-complete catalog of the accessible sites in the genome that represent extreme differentiation between and within the African, European and East Asian population groups participating in the project. In doing this, we have rediscovered known examples of classic selective sweeps, and also encountered a large number of novel sites, and have thus been able to both characterize the general features of this mode of selection and obtain new insights into some specific examples.While our validation studies confirm that the vast majority of the HighD sites we have listed are indeed highly differentiated between continents or populations, rather than being artifacts arising from genotyping errors, it is more difficult to assess the proportions that result from random genetic drift compared to positive selection. A major underlying limitation is the paucity of positively selected variants supported by direct functional evidence. HighD sites are undoubtedly enriched for positive selection (Figures 3 and 4), and overlap with genes previously reported as positively selected. This 34% overlap, divided into 58% for comparisons between continents and 24% for comparisons between populations, provides one estimate of the proportion that are selected, albeit one that ignores the novel events. The lower proportion in population comparisons could reflect, among other factors, the higher noise contributed by drift and the greater difficulty of detecting selection by other approaches at less complete sweeps.
This study is more comprehensive than previous surveys of highly differentiated variants, which in large-scale studies have been limited to examining the SNPs included on genotyping arrays [11, 50] or SNPs in protein-coding genes discovered by resequencing three populations . Our strategy of choosing the highest ΔDAF value from a fixed window expands the range of signals we can discover, but will nevertheless exclude weaker but still potentially interesting selection signals from the windows with a strong signal, and future studies could use more sophisticated approaches to identifying peak signals. A level of population differentiation higher than that at the DARC gene has sometimes been used as a criterion, and here we observe novel ΔDAF values greater than DARC at three positions: rs6014096 in DOCK5, rs1596930 in EXOC6B and rs12903208 in PRTG (Figure 5b), all in introns. In addition, HighD sites in genes such as CALD1, ABCA12 and ZNF804A provide intriguing examples for follow-up in model organisms, the last also illustrating the importance of considering variants other than SNPs.
Our data throw light on two topics of current debate about selection in humans. First, some but not all studies  have found more evidence for recent positive selection outside Africa than inside. It has been difficult to interpret the results of tests that incorporate haplotype structure, because recombination differs between populations, with lower levels of LD and different PRDM9 alleles and recombination hotspots in Africa . Similarly, SNP ascertainment in African populations has been less thorough than in European populations, and so analyses based on known SNPs have been biased against discovering highly differentiated sites in Africans. The identification of HighD sites from full sequence data, however, is unaffected by recombination or ascertainment, and the lower number of HighD sites in Africa (25 versus 110 each in EUR and ASN) supports the hypothesis of less positive selection of this type, despite the larger effective population size, which should make selection more efficient.
Second, our identification of a large sample of likely positive selection events allows us to consider the relative importance of classic sweeps compared with selection on standing variation in humans. To do this, we need to know the proportion of HighD sites that result from positive selection rather than drift, and the proportion that result from classic sweeps. If we took the proportion of HighD sites that overlap genes with published evidence of positive selection as a minimum estimate of the proportion of selective events (58% for continental comparisons), and the proportion of inter-continental HighD sites with low Levenshtein distances (see Results) as a minimum estimate of the proportion of classic sweeps (30%), we would estimate about one-half. Both of these proportions are likely to be under-estimates - DARC would be excluded from the classic sweep numbers, for example - and thus this estimate is highly uncertain, but in contrast to some other analyses , emphasizes an important role for classic sweeps.
This study has established a comprehensive catalog of most of the variants, including SNPs, INDELS and SVs, that are highly differentiated between the major populations of sub-Saharan Africa, Europe and East Asia. Remarkably, this simple approach, when applied to whole-genome sequences from large population samples, usually seems to lead directly to the functional variant responsible for the differentiated phenotype. Several of the most highly differentiated variants and their associated phenotypes were known long before this work, testifying to the high visibility of the phenotypes, but the remaining catalog should be a rich source of starting points for investigations of phenotypes that should be equally important and fascinating.
Materials and methods
We used genotype calls from two sources: the 1000 Genomes Phase I integrated callset  and 7,210 additional, previously unreported high quality exonic INDELs from the same samples. In each case, we restricted our analyses to 911 individuals from 10 populations with ancestry from Africa, Europe or East Asia (Table S1 in Additional file 1). Allele frequencies were calculated using vcffixup from Vcflib  and used to identify HighD sites. We determined distance from the nearest annotated gene using information from Ensembl v72 and subsequently generated sets of sites used in various analyses as matched random genomic sites by matching allele frequency and distance from gene start site (Figure S10 in Additional file 2), taking into account the population to which the HighD site was assigned. Occurrence of HighD and matched sites among expression quantitative trait loci was estimated using the GeneVar database .
Using COSI , we simulated the evolution of 600 kb regions under a published demographic model  for three continental populations, namely AFR, EUR and ASN. We generated 3,000 replicates of neutral evolution (that is, without any selected variant) as well as 3 × 100 replicates of 30 selective sweep scenarios as described , within each of the three populations. Each of these selective scenarios has a selective sweep acting on a new variant located at the center of the simulated region, that is, at position 300,000 bp. The selective sweep lasts from 200 to 1,200 generations and always ends 401 generations ago. The selective coefficient was set to drive the selected allele up to a frequency of 0.2, 0.4, 0.6, 0.8 or 1. For each replicate, we ran iHS , XP-EHH , FST and ΔDAF  using a pipeline described previously . For cross-population statistics, we considered all the six pairwise comparisons. We then inferred the sensitivity (true positive rate) of those four methods by (i) calculating the 95th percentile of the score distributions obtained across the 3,000 neutral replicates, hence inferring the threshold corresponding to a 5% false discovery rate, and (ii) counting the proportion of replicates under a given sweep scenario with a score for the selected variant above this threshold. For each population, we performed the sensitivity analysis for five different sweep scenarios by grouping the replicates where the final allele frequency of the selected variant was the same, as well as an ‘overall’ sweep scenario grouping all the replicates with selection in a given population.
Genotype concordance of HighD and LowD sites between phase I data and publicly available complete genomics data was calculated as the proportion of concordant calls among 112 overlapping individuals averaged across loci. Allele frequency concordance was calculated between phase I EUR, AFR, and ASN populations and three non-overlapping sets of individuals from the same continents from the HapMap3 study  (total of 255 individuals: 24 EUR (18 CEU, 6 TSI); 166 ASN (101 CHD, 31 JPT, 34 CHB); 65 AFR (22 LWK, 43 YRI)) genotyped on a different platform (Illumina BeadArray genotypes at 1,456,587 sites ). Because overlap of sites on the chip with HighD and LowD sites was poor (100 HighD and 29 LowD sites), we also included comparisons at sites in high LD (r2 > 0.8) with HighD and LowD sites (255 and 202 sites, respectively). LD was calculated using Vcftools . Concordance was assessed using a Spearman correlation coefficient.
In order to estimate the expectation for number of HighD sites under neutrality, we simulated 100 replicates of 500 Mb DNA sequence data (subdivided into 10 chromosomes of the same size) from 911 diploid genomes from three populations according to two published demographic models [20, 21] for the three continental populations AFR, ASN and EUR using MaCS ; we included variable recombination rates by incorporating information from random regions from HapMap phase 3 recombination maps . Because all models led to similar conclusions, we only report results from . We estimated the number of HighD sites in 500 Mb of simulated data as for the phase 1 data, and then scaled this number to the size of the accessible genome (2,526,390,487 base pairs ).
We summarized from the literature a list of 3,467 genes that have been previously identified in genomic scans for positive selection [22–24] and compiled the occurrence of non-redundant genes hosting HighD sites (HighD-genes; n = 542) in it. To obtain estimates of random expectation we calculated the occurrence of 100 sets of control genes in the list of positively selected genes. Control genes were randomly selected to match the number of HighD-genes or fractions of this set based on ΔDAF quartiles (n = 247 and n = 146 for third and fourth quartiles, respectively). A one-sided t-test was used to assess the significance of the differences in overlap.
We estimated genome-wide statistics informative about selection from sequence data of the phase I data set. These statistics were based on 10 kb windows and include three allele frequency spectrum-based tests, Tajima’s D, Fay and Wu’s H, and Nielsen et al.’s composite likelihood ratio, calculated and combined to give a single P-value as described previously . We also estimated the haplotype-based statistics iHS and XP-EHH  using Selscan  at sites for which recombination maps and ancestral allele information were available. In order to make the continent and population level analyses comparable, in each group we restricted the analysis to a set of 30 randomly chosen individuals. The iHS and XP-EHH scores obtained for each SNP in each population/continent were divided into allele frequency classes and, within each class, normalized following standard procedures [25, 44]. Similarly, we calculated iHS and XP-EHH for a set of genomic sizes matched to HighD sites for allele frequency in the combined sample and distance from gene start site. A two-sample Kolmogorov-Smirnov test was used to evaluate differences between the HighD and matched site iHS distributions.
The Database for Annotation, Visualization, and Integrated Discovery (DAVID v.6.7 [33, 60]) was used for functional characterization of the highly differentiated variants that lay within genes. The Functional Annotation Clustering option was used by adding Panther and Reactome to Pathways, Panther to Protein Domains and Reactome and UCSC TFBS to Protein Interactions to the default settings. In another approach, Ingenuity Pathway Analysis (IPA, Ingenuity Systems, Redwood City, CA, USA) was carried out for the same set of variants. Ensembl gene identifiers were uploaded and core analyses was carried out for the differentiated variants at the continental (CON) and population levels (AFR, ASN and EUR). The analyses generated networks based on their connectivity in the Ingenuity Knowledge Base (IKB), which includes experimental data from human, mouse and rat models. The core analyses selected only those interactions that have been experimentally observed and included pathways based on (i) diseases and disorders, (ii) molecular and cellular functions and (iii) physiological system development and function. The significance of the association between the dataset and the pathways was tabulated by estimating a ratio between the number of genes from the dataset that met the expression value cutoff and map to the pathway, and the total number of molecules present in the pathway. A conservative Bonferroni P-value threshold was used to account for multiple testing.
Median-joining networks for haplotype visualization and analysis were generated using Network 18.104.22.168 . Since this version of the software can display only 100 chromosomes per circle, we selected each time 300 random chromosomes. The CALD1 and ABCA12 haplotypes were based on all the SNPs in D’ = 1 within 10 kb of the HighD site in the population with the highest DAF; for the networks in Figure 7 and Figure S16 in Additional file 2, haplotypes were derived from polymorphisms within 2 kb surrounding the HighD site.
Derived allele frequency
We thank Danny Challis, Fuli Yu, Donna Muzny and Richard Gibbs for contributing the exomic indel data set and Goo Jun for helping with the Complete Genomics data set. We thank Tsun-Po Yang for assistance with Genevar. This work was supported by The Wellcome Trust (098051), an Italian National Research Council (CNR) short-term mobility fellowship from the 2013 program to VC, and an EMBO Short Term Fellowship ASTF 324–2010 to VC. A full list of participants and institutions is available in Additional file 3.
- Darwin C: The Origin of Species. 1859Google Scholar
- Pritchard JK, Pickrell JK, Coop G: The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Curr Biol. 2010, 20: R208-215.PubMedPubMed CentralView ArticleGoogle Scholar
- Darwin C: The Descent of Man and Selection in Relation to Sex. 1871View ArticleGoogle Scholar
- Charlesworth D: Balancing selection and its effects on sequences in nearby genome regions. PLoS Genet. 2006, 2: e64-PubMedPubMed CentralView ArticleGoogle Scholar
- Jobling MA, Hollox EJ, Hurles ME, Kivisild T, Tyler-Smith C: Human Evolutionary Genetics. 2013, Garland Science: Abingdon, UK, 2Google Scholar
- Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A: Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 2011, 7: e1001375-PubMedPubMed CentralView ArticleGoogle Scholar
- Lamason RL, Mohideen MA, Mest JR, Wong AC, Norton HL, Aros MC, Jurynec MJ, Mao X, Humphreville VR, Humbert JE, Sinha S, Moore JL, Jagadeeswaran P, Zhao W, Ning G, Makalowska I, McKeigue PM, O'donnell D, Kittles R, Parra EJ, Mangini NJ, Grunwald DJ, Shriver MD, Canfield VA, Cheng KC: SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Science. 2005, 310: 1782-1786.PubMedView ArticleGoogle Scholar
- Enattah NS, Sahi T, Savilahti E, Terwilliger JD, Peltonen L, Jarvela I: Identification of a variant associated with adult-type hypolactasia. Nat Genet. 2002, 30: 233-237.PubMedView ArticleGoogle Scholar
- Charlesworth B, Morgan MT, Charlesworth D: The effect of deleterious mutations on neutral molecular variation. Genetics. 1993, 134: 1289-1303.PubMedPubMed CentralGoogle Scholar
- McVicker G, Gordon D, Davis C, Green P: Widespread genomic signatures of natural selection in hominid evolution. PLoS Genet. 2009, 5: e1000471-PubMedPubMed CentralView ArticleGoogle Scholar
- Xue Y, Zhang X, Huang N, Daly A, Gillson CJ, Macarthur DG, Yngvadottir B, Nica AC, Woodwark C, Chen Y, Conrad DF, Ayub Q, Mehdi SQ, Li P, Tyler-Smith C: Population differentiation as an indicator of recent positive selection in humans: an empirical evaluation. Genetics. 2009, 183: 1065-1077.PubMedPubMed CentralView ArticleGoogle Scholar
- The 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature. 2010, 467: 1061-1073.PubMed CentralView ArticleGoogle Scholar
- The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491: 56-65.PubMed CentralView ArticleGoogle Scholar
- Khurana E, Fu Y, Colonna V, Mu XJ, Kang HM, Lappalainen T, Sboner A, Lochovsky L, Chen J, Harmanci A, Das J, Abyzov A, Balasubramanian S, Beal K, Chakravarty D, Challis D, Chen Y, Clarke D, Clarke L, Cunningham F, Evani US, Flicek P, Fragoza R, Garrison E, Gibbs R, Gümüs ZH, Herrero J, Kitabayashi N, Kong Y, Lage K, et al: Integrative annotation of variants from 1092 humans: application to cancer genomics. Science. 2013, 342: 1235587-PubMedPubMed CentralView ArticleGoogle Scholar
- Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, Xie X, Byrne EH, McCarroll SA, Gaudet R, Schaffner SF, Lander ES, International HapMap C, Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, Pasternak S, Wheeler DA, Willis TD, Yu F, Yang H, Zeng C, Gao Y: Genome-wide detection and characterization of positive selection in human populations. Nature. 2007, 449: 913-918.PubMedPubMed CentralView ArticleGoogle Scholar
- Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR: Low-coverage sequencing: implications for design of complex trait association studies. Genome Res. 2011, 21: 940-951.PubMedPubMed CentralView ArticleGoogle Scholar
- Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR: MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010, 34: 816-834.PubMedPubMed CentralView ArticleGoogle Scholar
- Complete Genomics. [http://www.completegenomics.com/]
- Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Bonnen PE, de Bakker PI, Deloukas P, Gabriel SB, Peltonen L, Bonnen PE, Gibbs RA, de Bakker PI, Deloukas P, Gabriel SB, Gwilliam R, Hunt S, Inouye M, Jia X, Palotie A, Parkin M, Whittaker P, Yu F, Chang K, Hawes A, Lewis LR, Ren Y, Wheeler D, Gibbs RA, Muzny DM, Barnes C, Darvishi K, Hurles M, et al: Integrating common and rare genetic variation in diverse human populations. Nature. 2010, 467: 52-58.PubMedView ArticleGoogle Scholar
- Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D: Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005, 15: 1576-1583.PubMedPubMed CentralView ArticleGoogle Scholar
- Gravel S, Henn BM, Gutenkunst RN, Indap AR, Marth GT, Clark AG, Yu F, Gibbs RA, Bustamante CD: Demographic history and rare allele sharing among human populations. Proc Natl Acad Sci U S A. 2011, 108: 11983-11988.PubMedPubMed CentralView ArticleGoogle Scholar
- Akey JM: Constructing genomic maps of positive selection in humans: where do we go from here?. Genome Res. 2009, 19: 711-722.PubMedPubMed CentralView ArticleGoogle Scholar
- Grossman SR, Andersen KG, Shlyakhter I, Tabrizi S, Winnicki S, Yen A, Park DJ, Griesemer D, Karlsson EK, Wong SH, Cabili M, Adegbola RA, Bamezai RN, Hill AV, Vannberg FO, Rinn JL, Lander ES, Schaffner SF, Sabeti PC, 1000 Genomes Project: Identifying recent adaptations in large-scale genomic data. Cell. 2013, 152: 703-713.PubMedPubMed CentralView ArticleGoogle Scholar
- Grossman SR, Shlyakhter I, Karlsson EK, Byrne EH, Morales S, Frieden G, Hostetter E, Angelino E, Garber M, Zuk O, Lander ES, Schaffner SF, Sabeti PC: A composite of multiple signals distinguishes causal variants in regions of positive selection. Science. 2010, 327: 883-886.PubMedView ArticleGoogle Scholar
- Voight BF, Kudaravalli S, Wen X, Pritchard JK: A map of recent positive selection in the human genome. PLoS Biol. 2006, 4: e72-PubMedPubMed CentralView ArticleGoogle Scholar
- Ayub Q, Yngvadottir B, Chen Y, Xue Y, Hu M, Vernes SC, Fisher SE, Tyler-Smith C: FOXP2 targets show evidence of positive selection in European populations. Am J Hum Genet. 2013, 92: 696-706.PubMedPubMed CentralView ArticleGoogle Scholar
- Soejima M, Koda Y: Population differences of two coding SNPs in pigmentation-related genes SLC24A5 and SLC45A2. Int J Legal Med. 2007, 121: 36-39.PubMedView ArticleGoogle Scholar
- Reich D, Nalls MA, Kao WH, Akylbekova EL, Tandon A, Patterson N, Mullikin J, Hsueh WC, Cheng CY, Coresh J, Boerwinkle E, Li M, Waliszewska A, Neubauer J, Li R, Leak TS, Ekunwe L, Files JC, Hardy CL, Zmuda JM, Taylor HA, Ziv E, Harris TB, Wilson JG: Reduced neutrophil count in people of African descent is due to a regulatory variant in the Duffy antigen receptor for chemokines gene. PLoS Genet. 2009, 5: e1000360-PubMedPubMed CentralView ArticleGoogle Scholar
- Eiberg H, Troelsen J, Nielsen M, Mikkelsen A, Mengel-From J, Kjaer KW, Hansen L: Blue eye color in humans may be caused by a perfectly associated founder mutation in a regulatory element located within the HERC2 gene inhibiting OCA2 expression. Hum Genet. 2008, 123: 177-187.PubMedView ArticleGoogle Scholar
- Sturm RA, Duffy DL, Zhao ZZ, Leite FP, Stark MS, Hayward NK, Martin NG, Montgomery GW: A single SNP in an evolutionary conserved region within intron 86 of the HERC2 gene determines human blue-brown eye color. Am J Hum Genet. 2008, 82: 424-431.PubMedPubMed CentralView ArticleGoogle Scholar
- Levenshtein VI: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady. 1966, 10:Google Scholar
- Yang TP, Beazley C, Montgomery SB, Dimas AS, Gutierrez-Arcelus M, Stranger BE, Deloukas P, Dermitzakis ET: Genevar: a database and Java application for the analysis and visualization of SNP-gene associations in eQTL studies. Bioinformatics. 2010, 26: 2474-2476.PubMedPubMed CentralView ArticleGoogle Scholar
- Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4: 44-57.View ArticleGoogle Scholar
- Engelken J, Carnero-Montoro E, Pybus M, Andrews GK, Lalueza-Fox C, Comas D, Sekler I, de la Rasilla M, Rosas A, Stoneking M, Valverde MA, Vicente R, Bosch E: Extreme population differences in the human zinc transporter ZIP4 (SLC39A4) are explained by positive selection in Sub-Saharan Africa. PLoS Genet. 2014, 10: e1004128-PubMedPubMed CentralView ArticleGoogle Scholar
- Abrams J, Davuluri G, Seiler C, Pack M: Smooth muscle caldesmon modulates peristalsis in the wild type and non-innervated zebrafish intestine. Neurogastroenterol Motil. 2012, 24: 288-299.PubMedPubMed CentralView ArticleGoogle Scholar
- Hou Q, Tan HT, Lim KH, Lim TK, Khoo A, Tan IB, Yeoh KG, Chung MC: Identification and functional validation of caldesmon as a potential gastric cancer metastasis-associated protein. J Proteome Res. 2013, 12: 980-990.PubMedView ArticleGoogle Scholar
- Ueno K, Nakamura K, Nishiwaki T, Saito T, Okuda Y, Yamamoto M: Intakes of calcium and other nutrients related to bone health in Japanese female college students: a study using the duplicate portion sampling method. Tohoku J Exp Med. 2005, 206: 319-326.PubMedView ArticleGoogle Scholar
- Lau EM, Woo J, Lam V, Hong A: Milk supplementation of the diet of postmenopausal Chinese women on a low calcium intake retards bone loss. J Bone Miner Res. 2001, 16: 1704-1709.PubMedView ArticleGoogle Scholar
- Markó L, Paragh G, Ugocsai P, Boettcher A, Vogt T, Schling P, Balogh A, Tarabin V, Orso E, Wikonkal N, Mandl J, Remenyik E, Schmitz G: Keratinocyte ATP binding cassette transporter expression is regulated by ultraviolet light. J Photochem Photobiol B. 2012, 116: 79-88.PubMedView ArticleGoogle Scholar
- Rajpopat S, Moss C, Mellerio J, Vahlquist A, Ganemo A, Hellstrom-Pigg M, Ilchyshyn A, Burrows N, Lestringant G, Taylor A, Kennedy C, Paige D, Harper J, Glover M, Fleckman P, Everman D, Fouani M, Kayserili H, Purvis D, Hobson E, Chu C, Mein C, Kelsell D, O'Toole E: Harlequin ichthyosis: a review of clinical and molecular findings in 45 cases. Arch Dermatol. 2011, 147: 681-686.PubMedView ArticleGoogle Scholar
- Scott CA, Rajpopat S, Di WL: Harlequin ichthyosis: ABCA12 mutations underlie defective lipid transport, reduced protease regulation and skin-barrier dysfunction. Cell Tissue Res. 2013, 351: 281-288.PubMedView ArticleGoogle Scholar
- Yanagi T, Akiyama M, Nishihara H, Sakai K, Nishie W, Tanaka S, Shimizu H: Harlequin ichthyosis model mouse reveals alveolar collapse and severe fetal skin barrier defects. Hum Mol Genet. 2008, 17: 3075-3083.PubMedView ArticleGoogle Scholar
- Li Q, Frank M, Akiyama M, Shimizu H, Ho SY, Thisse C, Thisse B, Sprecher E, Uitto J: Abca12-mediated lipid transport and Snap29-dependent trafficking of lamellar granules are crucial for epidermal morphogenesis in a zebrafish model of ichthyosis. Dis Model Mech. 2011, 4: 777-785.PubMedPubMed CentralView ArticleGoogle Scholar
- Pickrell JK, Coop G, Novembre J, Kudaravalli S, Li JZ, Absher D, Srinivasan BS, Barsh GS, Myers RM, Feldman MW, Pritchard JK: Signals of recent positive selection in a worldwide sample of human populations. Genome Res. 2009, 19: 826-837.PubMedPubMed CentralView ArticleGoogle Scholar
- Girgenti MJ, LoTurco JJ, Maher BJ: ZNF804a regulates expression of the schizophrenia-associated genes PRSS16, COMT, PDE4B, and DRD2. PLoS One. 2012, 7: e32404-PubMedPubMed CentralView ArticleGoogle Scholar
- O'Donovan MC, Craddock N, Norton N, Williams H, Peirce T, Moskvina V, Nikolov I, Hamshere M, Carroll L, Georgieva L, Dwyer S, Holmans P, Marchini JL, Spencer CC, Howie B, Leung HT, Hartmann AM, Möller HJ, Morris DW, Shi Y, Feng G, Hoffmann P, Propping P, Vasilescu C, Maier W, Rietschel M, Zammit S, Schumacher J, Quinn EM, Schulze TG, et al: Identification of loci associated with schizophrenia by genome-wide association and follow-up. Nat Genet. 2008, 40: 1053-1055.PubMedView ArticleGoogle Scholar
- Williams HJ, Norton N, Dwyer S, Moskvina V, Nikolov I, Carroll L, Georgieva L, Williams NM, Morris DW, Quinn EM, Giegling I, Ikeda M, Wood J, Lencz T, Hultman C, Lichtenstein P, Thiselton D, Maher BS, Malhotra AK, Riley B, Kendler KS, Gill M, Sullivan P, Sklar P, Purcell S, Nimgaonkar VL, Kirov G, Holmans P, Corvin A, Molecular Genetics of Schizophrenia Collaboration (MGS) International Schizophrenia Consortium (ISC), SGENE-plus, GROUP, et al: Fine mapping of ZNF804A and genome-wide significant evidence for its involvement in schizophrenia and bipolar disorder. Mol Psychiatry. 2011, 16: 429-441.PubMedPubMed CentralView ArticleGoogle Scholar
- Edelmann MJ, Kramer HB, Altun M, Kessler BM: Post-translational modification of the deubiquitinating enzyme otubain 1 modulates active RhoA levels and susceptibility to Yersinia invasion. FEBS J. 2010, 277: 2515-2530.PubMedView ArticleGoogle Scholar
- Morelli G, Song Y, Mazzoni CJ, Eppinger M, Roumagnac P, Wagner DM, Feldkamp M, Kusecek B, Vogler AJ, Li Y, Cui Y, Thomson NR, Jombart T, Leblois R, Lichtner P, Rahalison L, Petersen JM, Balloux F, Keim P, Wirth T, Ravel J, Yang R, Carniel E, Achtman M: Yersinia pestis genome sequencing identifies patterns of global phylogenetic diversity. Nat Genet. 2010, 42: 1140-1143.PubMedPubMed CentralView ArticleGoogle Scholar
- The International HapMap Consortium: A haplotype map of the human genome. Nature. 2005, 437: 1299-1320.PubMed CentralView ArticleGoogle Scholar
- Kong A, Thorleifsson G, Gudbjartsson DF, Masson G, Sigurdsson A, Jonasdottir A, Walters GB, Gylfason A, Kristinsson KT, Gudjonsson SA, Frigge ML, Helgason A, Thorsteinsdottir U, Stefansson K: Fine-scale recombination rate differences between sexes, populations and individuals. Nature. 2010, 467: 1099-1103.PubMedView ArticleGoogle Scholar
- Hernandez RD, Kelley JL, Elyashiv E, Melton SC, Auton A, McVean G, Sella G, Przeworski M: Classic selective sweeps were rare in recent human evolution. Science. 2011, 331: 920-924.PubMedPubMed CentralView ArticleGoogle Scholar
- The 1000 Genomes Project. [http://www.1000genomes.org/]
- Vcflib. [https://github.com/ekg/vcflib/]
- Weir BS, Cockerham CC: Estimating F-Statistics for the analysis of population structure. Evolution. 1984, 38: 1358-1370.View ArticleGoogle Scholar
- Pybus M, Dall'Olio GM, Luisi P, Uzkudun M, Carreno-Torres A, Pavlidis P, Laayouni H, Bertranpetit J, Engelken J: 1000 Genomes Selection Browser 1.0: a genome browser dedicated to signatures of natural selection in modern humans. Nucleic Acids Res. 2014, 42: D903-909.PubMedPubMed CentralView ArticleGoogle Scholar
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R, 1000 Genomes Project Analysis Group: The variant call format and VCFtools. Bioinformatics. 2011, 27: 2156-2158.PubMedPubMed CentralView ArticleGoogle Scholar
- Chen GK, Marjoram P, Wall JD: Fast and flexible simulation of DNA sequence data. Genome Res. 2009, 19: 136-142.PubMedPubMed CentralView ArticleGoogle Scholar
- Szpiech ZA, Hernandez RD: Selscan: an efficient multi-threaded program to perform EHH-based scans for positive selection. arXiv. 2014, 14036854v2-Google Scholar
- da Huang W, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37: 1-13.PubMedView ArticleGoogle Scholar
- Bandelt HJ, Forster P, Röhl A: Median-joining networks for inferring intraspecific phylogenies. Mol Biol Evol. 1999, 16: 37-48.PubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.