- Open Access
Genetic disease risks can be misestimated across global populations
© The Author(s). 2018
- Received: 11 May 2018
- Accepted: 9 October 2018
- Published: 14 November 2018
Accurate assessment of health disparities requires unbiased knowledge of genetic risks in different populations. Unfortunately, most genome-wide association studies use genotyping arrays and European samples. Here, we integrate whole genome sequence data from global populations, results from thousands of genome-wide association studies (GWAS), and extensive computer simulations to identify how genetic disease risks can be misestimated.
In contrast to null expectations, we find that risk allele frequencies at known disease loci are significantly different for African populations compared to other continents. Strikingly, ancestral risk alleles are found at 9.51% higher frequency in Africa, and derived risk alleles are found at 5.40% lower frequency in Africa. By simulating GWAS with different study populations, we find that non-African cohorts yield disease associations that have biased allele frequencies and that African cohorts yield disease associations that are relatively free of bias. We also find empirical evidence that genotyping arrays and SNP ascertainment bias contribute to continental differences in risk allele frequencies. Because of these causes, polygenic risk scores can be grossly misestimated for individuals of African descent. Importantly, continental differences in risk allele frequencies are only moderately reduced if GWAS use whole genome sequences and hundreds of thousands of cases and controls. Finally, comparisons between uncorrected and corrected genetic risk scores reveal the benefits of considering whether risk alleles are ancestral or derived.
Our results imply that caution must be taken when extrapolating GWAS results from one population to predict disease risks in another population.
- Ascertainment bias
- Genetic risk scores
- Genetic epidemiology
- Genome-wide association studies
- Global health
- Health disparities
- Population genetics
In the past decade, over 3300 genome-wide association studies (GWAS) have successfully identified more than 58,000 genetic associations with common diseases and other traits [1, 2]. However, the vast majority of published GWAS have used samples of European ancestry [3, 4], and a looming challenge is to be able to generalize GWAS results across populations [5–11]. An additional complication is that existing GWAS use genotyping arrays, as opposed to whole genome sequencing (WGS). Each disease-associated locus has risk and protective alleles. Results from GWAS can be combined to generate polygenic risk scores to predict individual risks of disease [12–14]. These polygenic risk scores quantify hereditary disease burdens by summing the number of risk alleles in each individual’s genome and sometimes weighting SNPs by effect size. The “missing heritability” problem hampers genetic risk scores, as many causal variants remain undiscovered [16, 17]. Diseases can also have different genetic architectures in different populations. Because of these issues, genetic predictions of disease risk are not always accurate, and it is important to be able to distinguish between situations where genetic risks actually differ between populations and when genetic predictions of differences in disease risks are spurious.
Although health disparities are often due to access to healthcare and socio-economic factors [19, 20], genetic differences in disease risks arise when allele frequencies at disease-associated loci differ across populations. Populations that share recent ancestry have similar allele frequencies and hereditary disease risks, while populations that diverged in the deep past can have large allele frequency differences at disease-associated loci [21, 22]. These differences are magnified by population bottlenecks and founder effects; examples include elevated risks of cystic fibrosis among the Québécois and cardiovascular disease among the descendants of the HMS Bounty mutineers. However, many common diseases are polygenic [25, 26], and allele frequency differences at individual loci tend to average out. Because of this, the overall burden of hereditary disease is expected to be similar across the globe, with the possible exception of reduced genetic load in African populations. For polygenic diseases, the null expectation is that individuals from different populations will have similar counts of risk alleles.
The genetic ancestry of study participants can cause hereditary disease risks to be misestimated. Indeed, genetic risk scores generated from different study cohorts have been shown to vary across populations. As of 2016, the ancestry of 81% of all GWAS samples was European and 14% was Asian, and this is likely to cause the set of known disease associations to be enriched for alleles that are polymorphic or intermediate frequency in Europe or Asia, but not Africa. Inequity in genetic studies parallels what is observed in social science research; most samples are from Western, educated, industrialized, rich and democratic (WEIRD) societies [29, 30]. For disease associations to be detected, loci need to be polymorphic in the study population. Because of this, disease loci with allele frequencies that are zero or one in European populations are likely to be missed (i.e., the “known unknowns”), and some of these disease loci will have intermediate frequencies in other populations. Disease associations found in one population can over- or underestimate genetic disease risks in other populations. One partial solution to this problem is to perform multiethnic GWAS that include individuals from multiple populations.
Commonly used genotyping arrays can also cause predictions of hereditary disease risks to be misestimated. One issue is that SNPs on genotyping arrays tend to have large minor allele frequencies [33–35]. These older SNPs often have large allele frequency differences between populations [36, 37]. Systematic biases can also arise because commercially available genotyping arrays tend to use SNPs that were originally ascertained in European populations. This SNP ascertainment bias can be particularly problematic if it yields disease loci with risk allele frequencies that are high for one population and low for another population.
At present, it is unknown how much the set of known disease associations hinders precision medicine and personal genomics. To bridge this knowledge gap, we integrated whole genome sequence data from global populations with results from thousands of GWAS and ran extensive computer simulations. These analyses (1) revealed novel empirical patterns at disease-associated loci, (2) identified multiple causes of how disease risks can be misestimated in global populations, and (3) examined different solutions to this problem (including alternative GWAS study designs and building genetic risk scores that correct for major sources of bias).
African risk allele frequencies differ from other continents
Disease categories that have a larger proportion of ancestral alleles tend to have elevated risk allele frequencies in Africa (Fig. 2b). After binning GWAS loci by disease category, we find that the differences in the mean frequency of risk alleles between African and non-African populations are highly correlated with the proportion of risk alleles that are ancestral (r² = 0.842). Accurate estimation of genetic disease risks across global populations may hinge upon knowledge of whether risk-increasing alleles are ancestral or derived.
Ancestral and derived alleles yield different patterns of genetic disease risk
The joint site frequency spectrum (SFS) enables the frequencies of individual risk alleles to be compared between African and non-African populations. Similar numbers of disease associations are found above and below the diagonal in Fig. 3b. However, conditioning on whether risk alleles are ancestral or derived reveals a striking pattern: 69.2% of ancestral risk alleles are found at higher frequency in African populations (red dots below the diagonal), and 64.5% of derived risk alleles are found at higher frequency in non-African populations (blue dots above the diagonal). The magnitudes of allele frequency differences between populations also vary for ancestral and derived risk alleles. We find that ancestral risk alleles are found at much higher frequencies in Africa, and derived risk alleles are found at moderately lower frequencies in Africa (Fig. 3c). Specifically, the mean difference in ancestral risk allele frequencies between African and pooled non-African populations is +9.51%, and the mean difference in derived risk allele frequencies between African and pooled non-African populations is −5.40% (p value < 2.2 × 10⁻¹⁶ for both comparisons, Wilcoxon signed-rank tests). The overall continental difference in risk allele frequencies of +1.15% arises because 44% of presently known disease-associated loci have ancestral risk alleles.
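The overall figure can be checked as a weighted average of the two signed differences. The sketch below uses only the rounded values quoted in the text; the variable names are illustrative and not taken from the authors' scripts. With these rounded inputs, the result is roughly +1.16%, close to the reported +1.15% (the small gap presumably reflects rounding of the inputs).

```python
# Weighted average of the continental differences (Africa minus non-Africa),
# using the rounded percentages quoted in the text.
p_ancestral = 0.44        # share of known disease loci with ancestral risk alleles
diff_ancestral = 9.51     # mean difference for ancestral risk alleles (%)
diff_derived = -5.40      # mean difference for derived risk alleles (%)

overall = p_ancestral * diff_ancestral + (1 - p_ancestral) * diff_derived
print(f"Overall difference: {overall:+.2f}%")
```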
Derived allele frequencies serve as proxies for SNP age, and we find that older disease-associated loci are more likely to have large differences in continental allele frequencies. For each 20% DAF bin (pooled data), we calculated the difference in risk allele frequencies between African and non-African populations. In sharp contrast to other DAF bins, published disease loci with DAF ≤ 0.2 exhibit only a small amount of bias (Fig. 3d). This pattern occurs regardless of whether risk alleles are ancestral or derived. Note that SNPs with DAF ≤ 0.2 tend to be younger than 125,000 years old, assuming an effective population size of 10,000 individuals and generation times of 25 years.
Choice of study population contributes to misestimates of genetic disease risk
Most disease associations have been discovered in study cohorts with European ancestry, and this can bias the estimation of genetic disease risks in diverse global populations. Empirical data reveal the effects of GWAS study populations; many disease-associated alleles segregate at intermediate frequencies in non-African populations but are found at extremely low or high frequencies in Africa (compare the vertical and horizontal borders of Fig. 3b). This occurs because statistical power is maximized at intermediate frequencies, and most disease-associated loci have been discovered in non-African populations. Existing GWAS have discovered relatively few disease alleles that segregate only in African populations.
Table 1 Differences in allele frequencies between African and European populations for different genotyping technologies
Columns: allele frequency difference between Africa and Europe for ancestral risk alleles (%) and for derived risk alleles (%)
Rows: NHGRI-EBI GWAS Catalog (empirical); Affymetrix Genome-Wide Human SNP Array 6.0 (simulated)*; Illumina Omni 5M microarray (simulated)*; Whole genome sequences (simulated)*
We also examined the effects of genotype-by-environment (GxE) interactions by allowing effect sizes to vary by population in our GWAS simulations. In general, results from these simulations mirror the results of other simulations; ancestral risk allele frequencies are higher in African populations than non-African populations, and derived risk allele frequencies are lower in African populations than non-African populations (Additional file 3: Figure S1). Compared to African study cohorts, European study cohorts magnify these allele frequency differences between populations. Choice of study cohort imposes a filter on effect sizes, as SNPs with very small effect sizes do not yield detectable associations (compare gray pre-GWAS effect sizes to red and blue post-GWAS effect sizes in Additional file 3: Figures S1-S3). Large effect sizes enable high-frequency ancestral alleles and low-frequency derived alleles to be detected in a GWAS. The results described above are also robust to systematic biases in effect sizes, i.e., scenarios where pre-GWAS European effect sizes tend to be larger than African effect sizes or vice versa (Additional file 3: Figures S2 and S3).
Genotyping arrays and SNP ascertainment bias cause disease risks to be misestimated
Many commonly used genotyping arrays contain SNPs that were ascertained in a relatively small number of European individuals. This ascertainment bias results in allele frequency distributions that vary by genotyping platform. Compared to WGS data, derived allele frequencies are higher for SNPs on the Affymetrix Genome-Wide Human SNP Array 6.0 and the Illumina Omni 5M microarray. SNPs on genotyping arrays also exhibit continental biases (Fig. 3a). Specifically, we find that derived allele frequencies in African populations are markedly lower than derived allele frequencies in non-African populations (p value < 2.2 × 10⁻¹⁶ for both arrays, Wilcoxon signed-rank tests).
The joint SFS of non-African and African populations further reveals the effects of SNP ascertainment bias. Examining WGS data, we find that similar numbers of SNPs have elevated derived allele frequencies in non-African and African populations (Additional file 3: Figure S4a). By contrast, the Affymetrix Genome-Wide Human SNP Array 6.0 and the Illumina Omni 5M microarray are enriched for SNPs with higher derived allele frequencies outside of Africa (i.e., SNPs above the diagonal in Additional file 3: Figure S4b and Additional file 3: Figure S4c). Importantly, this pattern mirrors what is seen for empirical GWAS data (Additional file 3: Figure S4d), which suggests that genotyping arrays contribute to continental differences in risk allele frequencies at known disease-associated loci.
Because many disease associations involve imputed SNPs, we also tested whether continental differences in risk allele frequencies persist for disease-associated loci that are not on the Affymetrix Genome-Wide Human SNP Array 6.0. For this empirical set of disease-associated loci, we find that sites with ancestral risk alleles have higher allele frequencies in Africa (+8.63% on average) and that SNPs with derived risk alleles have lower allele frequencies in Africa (−4.83% on average). This suggests that biases persist even for imputed SNPs.
Continental differences in allele frequencies persist even if whole genome sequencing and large sample sizes are used
Simulations of GWAS results were used to infer the extent to which misestimates of disease risks depend upon genotyping technology (Table 1). Here, simulations assume European ancestry for each study cohort and sample sizes of 3500 cases and 3500 controls. We find that different genotyping arrays yield similar results: the Affymetrix Genome-Wide Human SNP Array 6.0 and the Illumina Omni 5M microarray yield ancestral risk allele frequencies that are 10.7% and 11.0% higher in Africa and derived risk alleles that are 8.0% and 8.2% higher in Europe, respectively. Somewhat surprisingly, continental differences in allele frequencies also occur for GWAS simulations that use WGS data. Focusing on WGS GWAS simulations, ancestral risk allele frequencies are 9.7% higher in Africa, and derived risk alleles are 7.2% higher in Europe. These patterns arise because of our choice of study cohort and because sample sizes of 3500 cases and 3500 controls have relatively little power to detect rare disease alleles.
Correcting for ancestral and derived risk alleles leads to improved genetic risk scores
GRS corrections reduce some, but not all, of the population-level differences in predicted disease risks. Here, we compensate for continental differences in ancestral and derived risk allele frequencies by generating corrected GRS for African genomes. We find that African individuals have corrected GRS that are similar to other populations for metabolic (p value = 0.8080), morphological (p value = 0.0671), and neurological (p value = 0.7116, Mann-Whitney U tests) disease risks. By contrast, African individuals have corrected GRS that are different than other populations for GI or liver, cancer, miscellaneous, and cardiovascular disease risks (p value < 2.2 × 10⁻¹⁶ for each disease category, Mann-Whitney U tests). Corrections involve a leftward shift in the GRS of African genomes, the magnitude of which depends on the proportion of ancestral risk alleles for each disease category (compare the size of arrows in Fig. 6). We observe three different outcomes: minimal effects, over-correction, and reduction of bias. Cardiovascular risk predictions for African genomes were largely unchanged (i.e., GRS still appear to underestimate the risks of cardiovascular disease in individuals of African descent). Two disease categories (GI or liver and miscellaneous diseases) have corrected GRS distributions that differ more between African and non-African populations than uncorrected GRS distributions. The remaining four disease categories (metabolic, morphological, cancer, and neurological diseases) have corrected GRS distributions that overlap heavily with other populations. Although the correction method used here alleviates some forms of bias, our results suggest that GRS can be further improved by considering additional parameters.
The biased set of disease associations that are presently known leads to misestimates of hereditary disease risks. Specifically, African populations tend to have higher frequencies of ancestral risk alleles and lower frequencies of derived risk alleles at existing GWAS loci. Considering the magnitude of these differences and the proportion of disease-associated alleles that are ancestral, as opposed to derived, yields risk allele frequencies that are 1.15% higher in Africa. Elevated risk allele frequencies in African populations are the opposite of what one expects to see given human demographic history. Due to population bottlenecks, non-African populations are expected to have greater amounts of genetic load. This discrepancy arises because GWAS rely on European study cohorts and data from genotyping arrays. Systematic allele frequency biases can be mistaken for directional selection, hindering tests of polygenic selection acting on GWAS traits. Continental differences in allele frequencies also have important ramifications for precision medicine and personal genomics; disease risks are likely to be misestimated if GWAS results are naively used to calculate genetic risk scores (Fig. 6). This can obscure the existing health disparities that are due to socio-cultural factors including access to medical care [46, 47]. High-risk individuals may have genetic profiles that lull them into a false sense of security, and low-risk individuals may have genetic risk profiles that lead to an undue amount of worry.
Here, we are concerned with the limitations of using disease associations discovered in one population to predict disease risks in another population, as opposed to whether GWAS findings can be successfully replicated across multiple populations. The effects of different study cohorts are asymmetric. Non-African GWAS results can be used to predict disease risks in other non-African populations, but these disease associations generalize poorly to African populations (Fig. 4). By contrast, African GWAS results can be used to predict disease risks in a relatively unbiased way across all global populations. This asymmetry arises as a by-product of demographic history and the out-of-Africa migration (Fig. 1) and because GWAS use arrays that suffer from SNP ascertainment bias (Fig. 3a). Our results suggest that there may be additional benefits to including a large number of African individuals in multiethnic GWAS. We note that difficulties can arise when transferring GWAS results from one non-African population to another non-African population. This is due to both the existence of private risk alleles and divergence times that can exceed 30,000 years. Regardless of the study cohort used to generate genetic risk scores, it is impossible to fully correct for missing risk alleles from understudied populations. Problems generalizing GWAS results cannot be solved by only using WGS and large sample sizes (Fig. 5). Furthermore, many variants discovered by WGS are rare and population-specific. That said, genetic risk scores generated from WGS data are expected to be less biased than genetic risk scores generated from array data, especially when sample sizes are large.
Although this paper focuses on risk allele frequency differences across populations, we note that many disease loci remain undetected, and this also contributes to misestimates of disease risks. These missing disease loci are particularly important when risk alleles are population-specific. This underscores the need for genetic epidemiology studies to include samples from a diverse set of populations.
Our study demonstrates the benefits of adopting an evolutionary perspective towards health and disease [48, 49]. Important empirical patterns would not have been noticed without considering ancestral vs. derived states of alleles. Continental differences in allele frequencies also depend upon SNP age. An evolutionary perspective is also valuable for understanding how genetic disease risks can be misestimated across populations. Specifically, we find that it matters whether populations have experienced a history of bottlenecks and founder effects. Knowing whether individual disease loci have experienced a history of natural selection can lead to additional insights [42, 50, 51].
Recently, Martin et al. found that polygenic risk scores yield inaccurate predictions of height and schizophrenia and that GRS for type II diabetes depend upon the choice of study cohort. Using coalescent simulations, they also found that the proportion of heritability that can be explained decreases with distance to the GWAS study population. Using complementary approaches, our study resulted in novel discoveries. We find that ancestral and derived states of risk alleles play a central role in the estimation of genetic disease risks across multiple populations, something missed by prior studies that examine the generalizability of GWAS results. We also find that important asymmetries exist when extrapolating the results between African and non-African populations and that population bottlenecks play a key role (i.e., generalizability of results depends on much more than the evolutionary distance between populations). By explicitly testing the effects of different genotyping technologies and sample sizes, we were able to discover that WGS of hundreds of thousands of cases and controls still yields biased GWAS results. Martin et al. also advocate mean-centering GRS for each population, but this solution can be problematic if hereditary disease risks actually differ between populations.
Our GRS calculations illustrate how misestimation of genetic risks can obscure whether there are any real differences in disease risks across populations (Fig. 6). Two types of error are possible: (1) The underlying risk of a particular disease may actually be the same for different populations, yet GRS distributions show little overlap. (2) The underlying risk of a particular disease may actually differ for populations, yet GRS distributions show extensive overlap. Accurate GRS corrections are needed to exclude either of these two possibilities. Environmental effects and genotype-by-environment interactions also contribute to disease phenotypes. Studies of immigrants, admixed families, and adopted individuals may prove to be particularly informative with respect to genetics and health inequities [53–56]. PCA information can be used to improve GRS for admixed genomes. Corrected GRS for admixed genomes may also benefit from local ancestry painting tools like RFMix or ELAI.
Going forward, multiple approaches can be used to extend the benefits of precision medicine and personal genomics to a wide range of global populations. One option is to assume that disease associations can be generalized across populations without any complications. However, this approach is flawed because only a biased set of disease loci is known at present. A second option is to require that genetic risk scores only use disease associations discovered in the same population (i.e., avoid generalizing results across populations). However, this is infeasible from a logistical standpoint, as it would require repeating every GWAS in every global population. A third option is to use whole genome sequencing and large African study cohorts to generate sets of disease-associated loci that generalize across populations with relatively little bias. On a more practical side, genetic risk scores can be generated that correct for existing biases. This requires understanding how risk allele frequencies differ between populations (as shown here) and leveraging linkage disequilibrium information to infer the effect sizes of risk alleles in non-study populations [60, 61]. Finally, we note that the gold standard for evaluating genetic risk scores involves testing how well they predict disease phenotypes in diverse populations, something that requires individual-level phenotype data. Only by understanding population genetics and the effects of SNP ascertainment bias can accurate predictive models of genetic disease risks be built.
Population genetic data
Allele frequencies were obtained for each of the five continental populations of the 1000 Genomes Project: Africa (AFR), Americas (AMR), East Asia (EAS), Europe (EUR), and South Asia (SAS). These frequencies were used to generate risk allele frequencies and derived allele frequencies at disease-associated loci from the NHGRI-EBI GWAS Catalog and simulated datasets. Ancestral and derived states in phase 3 1000 Genomes Project VCF files were used (these ancestral states were inferred via the EPO pipeline from Ensembl). We found that derived allele frequencies in all populations were elevated for large chunks of chromosome 8, which is indicative of misidentified ancestral states. To compensate for this, we masked SNPs found in the chr8: 89,000,000–146,364,022 region (hg19). Individuals in phase 3 of the 1000 Genomes Project were genotyped using WGS. Allele frequencies of SNPs on the Affymetrix Genome-Wide Human SNP Array 6.0 and the Illumina Omni 5M microarray were found by merging data from the 1000 Genomes Project with lists of SNP IDs obtained from the Affymetrix and Illumina websites.
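The chromosome 8 mask described above amounts to a simple coordinate filter. The following is a minimal sketch, not the authors' code: the function name is hypothetical, and treating both boundary positions as inclusive is an assumption.

```python
# Region of chromosome 8 (hg19) masked because of suspect ancestral states.
MASK_CHROM = "8"
MASK_START = 89_000_000
MASK_END = 146_364_022


def is_masked(chrom, pos):
    """Return True if a SNP falls inside the masked chr8 region.

    Accepts chromosome labels with or without a 'chr' prefix.
    Both boundaries are treated as inclusive (an assumption).
    """
    label = str(chrom)
    if label.startswith("chr"):
        label = label[3:]
    return label == MASK_CHROM and MASK_START <= pos <= MASK_END


# Example: keep only SNPs outside the masked region.
snps = [("chr8", 100_000_000), ("chr8", 50_000_000), ("chr1", 100_000_000)]
kept = [snp for snp in snps if not is_masked(*snp)]
```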
Identification of disease-associated variants
Using the NHGRI-EBI GWAS Catalog, Berens et al. generated a curated set of 3180 disease-associated loci. This involved filtering out SNPs that were not associated with a disease, eliminating SNPs lacking risk allele or odds ratio information, and LD-pruning. Here, we further constrained the set of disease-associated loci from Berens et al. by requiring knowledge of whether risk alleles are ancestral or derived. After excluding 144 SNPs with unknown ancestral states, we were left with a focal set of 3036 disease-associated loci (Additional file 4: Table S3). We classified these 3036 disease-associated loci into 7 non-overlapping categories: gastrointestinal/liver, metabolic, morphological, cancer, neurological, miscellaneous, and cardiovascular. Wilcoxon signed-rank tests were used to compare disease allele frequencies between African and non-African populations. Disease-associated loci were binned by DAF, averaging across all 1000 Genomes populations. Allele ages were estimated as per Eq. 4 in  (assuming N = 10,000 and a generation time of 25 years).
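The relationship between derived allele frequency (DAF), risk allele frequency, and DAF bins can be sketched as follows. This is an illustrative example rather than the authors' pipeline: the function names are hypothetical, the pooled DAF is assumed to be an unweighted mean over the five continental populations, and bins are assumed to be closed on the right (so that DAF ≤ 0.2 falls in the first bin).

```python
from statistics import mean

# Upper edges of the five 20% DAF bins.
BIN_EDGES = [0.2, 0.4, 0.6, 0.8, 1.0]


def risk_allele_frequency(daf, risk_is_derived):
    """Risk allele frequency is the DAF if the risk allele is derived,
    and 1 - DAF if the risk allele is ancestral."""
    return daf if risk_is_derived else 1.0 - daf


def pooled_daf(daf_by_population):
    """Unweighted mean DAF across populations (an assumption)."""
    return mean(daf_by_population.values())


def daf_bin(daf):
    """Index (0-4) of the 20% bin containing this DAF; right-closed bins."""
    for index, upper in enumerate(BIN_EDGES):
        if daf <= upper:
            return index
    raise ValueError("DAF must be between 0 and 1")


# Hypothetical locus with an ancestral risk allele.
locus = {"AFR": 0.10, "AMR": 0.30, "EAS": 0.35, "EUR": 0.30, "SAS": 0.25}
afr_risk_freq = risk_allele_frequency(locus["AFR"], risk_is_derived=False)
locus_bin = daf_bin(pooled_daf(locus))
```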
Computer simulations were used to test whether SNP ascertainment bias alone can produce what appears to be genetic differences in disease risks across populations. The goal here was to generate simulated datasets comparable to the set of 3036 disease-associated loci from the NHGRI-EBI GWAS Catalog. These simulations assume that the underlying risks of disease are the same across the globe. Two general types of simulations were run: simulations with ancestral risk alleles and simulations with derived risk alleles. Simulations involved randomly drawing a test SNP from a list of known genetic variants ascertained via WGS or found on commercial genotyping arrays. Conditioning on whether risk alleles are ancestral or derived, the risk allele frequency of the test SNP was found in the study population. We then used a Perl script based on the GAS/CaTS power calculator to determine the probability of detecting a successful genetic association at the test SNP. The GAS power calculator leverages information about the number of cases and controls, p value threshold, disease model, prevalence, disease allele frequency, and genotype relative risk (http://csg.sph.umich.edu/abecasis/cats/gas_power_calculator/). For each test SNP, we generated a uniformly distributed random number between 0 and 1. The test SNP was retained if the random number was less than the power to successfully detect a genetic association, and the test SNP was rejected if the random number was greater than the probability of detection. This process was repeated until a set of 3036 successful disease associations was detected. At each of these 3036 SNPs, we obtained simulated risk allele frequencies for five populations in the 1000 Genomes Project dataset (AFR, AMR, EAS, EUR, SAS).
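The accept/reject step described above can be sketched in a few lines. This is a simplified illustration, not the authors' Perl script: `toy_power` is a hypothetical stand-in for the GAS power calculation (it merely peaks at intermediate frequencies), SNPs are assumed to be drawn with replacement, and the `max_draws` guard is added only to keep the sketch from looping forever.

```python
import random


def toy_power(risk_allele_freq):
    """Hypothetical stand-in for a power calculator: power is highest
    at intermediate risk allele frequencies and zero at 0 or 1."""
    return 4.0 * risk_allele_freq * (1.0 - risk_allele_freq)


def simulate_gwas_hits(snps, power_fn, n_hits, rng, max_draws=10_000_000):
    """Draw test SNPs at random and retain each one with probability
    equal to the power to detect an association at that SNP."""
    hits = []
    for _ in range(max_draws):
        if len(hits) == n_hits:
            break
        snp = rng.choice(snps)                 # random test SNP
        power = power_fn(snp["study_freq"])    # power in the study population
        if rng.random() < power:               # uniform draw on [0, 1)
            hits.append(snp)
    return hits


# Toy example with three hypothetical SNPs.
rng = random.Random(2018)
candidates = [
    {"id": "snp_a", "study_freq": 0.01},
    {"id": "snp_b", "study_freq": 0.50},
    {"id": "snp_c", "study_freq": 0.95},
]
detected = simulate_gwas_hits(candidates, toy_power, n_hits=100, rng=rng)
```

Because retention probability equals power, SNPs at intermediate frequencies in the study population are over-represented among the simulated associations, which is the filter the simulations are designed to capture.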
Our default parameters were as follows: genotyping technology = Affymetrix Genome-Wide Human SNP Array 6.0, study population = Europe (EUR), sample size = 3500 cases and 3500 controls, genetic model = additive, p value threshold = 10⁻⁵, prevalence = 0.1, and genotype relative risk = 1.211. These parameter values were chosen to be representative of the empirical data found in the NHGRI-EBI GWAS Catalog.
Our default model was modified to test which aspects of SNP ascertainment bias contribute the most to continental differences in risk allele frequencies. This involved varying the following simulation parameters: genotyping technology, sample size, mode of inheritance, and the p value threshold required for association detection. To examine the effects of different study populations, simulated risk allele frequencies were chosen from one of the five different populations (AFR, AMR, EAS, EUR, or SAS) or from an equal mixture of all five populations (MIX). The effects of different sample sizes were simulated by varying the number of cases and controls from three to six on a log10 scale at intervals of 0.1 (i.e., between 1000 and 1,000,000 cases and controls). The effects of different genotyping technologies were simulated by drawing random SNPs from either the Affymetrix Genome-Wide Human SNP Array 6.0, the Illumina Omni 5M microarray, or WGS data from the 1000 Genomes Project. Three genetic modes of inheritance were simulated: dominant, additive, and recessive. Two different p value thresholds were simulated: 1 × 10⁻⁵ and 5 × 10⁻⁸.
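As an illustration of the sample size grid, exponents from 3 to 6 in steps of 0.1 give 31 values spanning 1000 to 1,000,000. The sketch below is an assumption about how such a grid might be built; in particular, rounding to whole individuals is not stated in the text.

```python
# Sample sizes (cases = controls) on a log10 scale from 3 to 6 in steps of 0.1.
# Exponents are built from integers to avoid accumulating floating-point error.
exponents = [3 + i / 10 for i in range(31)]
sample_sizes = [round(10 ** e) for e in exponents]

print(len(sample_sizes), sample_sizes[0], sample_sizes[-1])
```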
We also simulated the results of GWAS when effect sizes vary between populations. Simulations examined three different effect size distributions (symmetric, larger effect sizes in Europe, and larger effect sizes in Africa), two different types of risk alleles (ancestral and derived), and two different study cohorts (European and African). In each simulation run, 3036 disease-associated loci were obtained using the power calculator described above.
Simulations were repeated 1000 times per combination of parameters. Symmetric effect sizes were generated by drawing locus-specific genotype relative risks for each test SNP from a gamma distribution (shape = 1.24, scale = 0.85). These parameter values were chosen to give a distribution of effect sizes that is comparable to loci in the NHGRI-EBI GWAS Catalog. We allowed genotype relative risks for each test SNP to vary by population by adding random noise (normally distributed, mean = 0, standard deviation = 0.5). Simulated genotype relative risks < 1 were set equal to 1. Larger European effect sizes were generated by drawing locus-specific genotype relative risks from a gamma distribution that was shifted 0.5 upwards (Additional file 3: Figure S3). Larger African effect sizes were generated by drawing locus-specific genotype relative risks from a gamma distribution shifted 0.5 to the right (Additional file 3: Figure S4). A representative dataset from GWAS simulations is included in Additional file 5: Table S4.
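A literal reading of the effect size model above can be sketched as follows. This is an interpretation, not the authors' code: the function name and the `shifts` argument are hypothetical, the noise is assumed to be drawn independently for each population, and the 0.5 shift is assumed to be a simple additive offset. Note that Python's `random.gammavariate(alpha, beta)` takes shape and scale in that order.

```python
import random

GAMMA_SHAPE = 1.24
GAMMA_SCALE = 0.85
NOISE_SD = 0.5


def draw_population_grrs(rng, populations=("EUR", "AFR"), shifts=None):
    """Draw one locus-specific genotype relative risk (GRR) and return
    a population-specific GRR for each population.

    shifts: optional dict of additive offsets per population, e.g.
            {"EUR": 0.5} for the 'larger European effect sizes' scenario.
    """
    shifts = shifts or {}
    base = rng.gammavariate(GAMMA_SHAPE, GAMMA_SCALE)   # locus-specific GRR
    grrs = {}
    for pop in populations:
        noisy = base + shifts.get(pop, 0.0) + rng.gauss(0.0, NOISE_SD)
        grrs[pop] = max(1.0, noisy)                     # GRR < 1 is set to 1
    return grrs


rng = random.Random(42)
symmetric = [draw_population_grrs(rng) for _ in range(1000)]
larger_eur = [draw_population_grrs(rng, shifts={"EUR": 0.5}) for _ in range(1000)]
```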
Genetic risk scores (GRS) for 2504 individuals were built using genotypes at a curated set of 3036 disease-associated loci from the NHGRI-EBI GWAS Catalog. Note that genetic risk scores are sometimes called polygenic risk scores (PRS). For each disease locus, we counted whether an individual has 0, 1, or 2 copies of the risk allele. Because each disease category includes a heterogeneous set of diseases and phenotypes, we did not incorporate odds ratio and/or effect size information into our GRS calculations. Counts of risk alleles were then summed across all loci that belong to a particular disease category, yielding a raw GRS for each individual. Standardized GRS values were calculated for each combination of individual and disease category by finding the mean and standard deviation of raw GRS values across all 2504 individuals in our global dataset. Given our empirical results (Fig. 3c), diploid African genomes tend to have 0.1902 (2 × 9.51%) additional copies of each ancestral risk allele and 0.1082 (2 × 5.41%) fewer copies of each derived risk allele compared to non-African genomes. Because of this, our correction method considered the state of the risk alleles (ancestral or derived). Uncorrected African GRS use counts of 0, 1, or 2 risk alleles at each disease locus. Corrected African GRS use counts of −0.1902, 0.8098, and 1.8098 “effective risk alleles” for ancestral alleles and 0.1082, 1.1082, and 2.1082 “effective risk alleles” for derived alleles. The same mapping of raw GRS to standardized GRS was used for uncorrected and corrected African GRS.
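The unweighted score, its ancestral/derived correction, and the standardization step can be illustrated with a short Python sketch. The function names and the toy inputs are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not specify which was used.

```python
from statistics import mean, pstdev

ANCESTRAL_SHIFT = -0.1902  # 2 x 9.51%, applied per ancestral risk allele locus
DERIVED_SHIFT = 0.1082     # 2 x 5.41%, applied per derived risk allele locus


def raw_grs(counts):
    """Unweighted GRS: sum of risk allele counts (0, 1, or 2) across loci."""
    return sum(counts)


def corrected_grs(counts, states):
    """Corrected African GRS using 'effective risk allele' counts.

    counts: risk allele counts (0, 1, or 2) per locus.
    states: "ancestral" or "derived" per locus, describing the risk allele.
    """
    total = 0.0
    for count, state in zip(counts, states):
        shift = ANCESTRAL_SHIFT if state == "ancestral" else DERIVED_SHIFT
        total += count + shift
    return total


def standardize(score, reference_raw_scores):
    """Map a score to a z-score using the mean and SD of raw GRS values.

    The same reference (raw GRS across all individuals) is used for both
    uncorrected and corrected scores, as described in the text.
    """
    mu = mean(reference_raw_scores)
    sd = pstdev(reference_raw_scores)  # assumption: population SD
    return (score - mu) / sd


if __name__ == "__main__":
    # Toy data for illustration only; not values from the study.
    counts = [0, 1, 2]
    states = ["ancestral", "ancestral", "derived"]
    reference = [1, 2, 3, 4, 5]
    print("raw:", raw_grs(counts))
    print("corrected:", corrected_grs(counts, states))
    print("standardized raw:", standardize(raw_grs(counts), reference))
```

Because the correction is a per-locus constant, the corrected score differs from the raw score by an offset that depends only on how many loci in the disease category have ancestral versus derived risk alleles.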
We thank A. Martin, U. Martinez-Marigorta, M. Quiver, and C. Simonti for the helpful discussions during the writing of this paper.
This work was supported by NIH/NCI grant U01CA184374 and start-up funds from Georgia Institute of Technology.
Availability of data and materials
Global allele frequencies are publicly available from the 1000 Genomes Project website: http://www.internationalgenome.org/data . Disease associations are publicly available from the NHGRI-EBI GWAS Catalog: https://www.ebi.ac.uk/gwas/ . R and Perl scripts used in GWAS simulations are available at https://github.com/LachanceLab/AscertainmentBias_GWAS .
MSK performed the statistical analyses and GWAS simulations, generated the genetic risk scores, and wrote the manuscript. KPP analyzed the population genetic data. AKT generated the initial set of genetic risk scores. AJB curated the set of disease-associated loci. JL conceived and supervised the study, interpreted the results, and wrote the manuscript. All authors have read and approved the final manuscript.
Consent for publication
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.