Cancer eQTLs can be determined from heterogeneous tumor gene expression data by modeling variation in tumor purity

Expression quantitative trait loci (eQTLs) identified using tumor gene expression data could affect gene expression in cancer cells, tumor-associated normal cells, or both. Here, we demonstrate a method to identify eQTLs affecting expression in cancer cells by modeling the statistical interaction between genotype and tumor purity. Only one-third of breast cancer risk variants, identified as eQTLs from a conventional analysis, could be confidently attributed to cancer cells. The remaining variants could affect cells of the tumor microenvironment, such as immune cells and fibroblasts. Deconvolution of tumor eQTLs will help determine how inherited polymorphisms influence cancer risk, development, and treatment response.


20
Expression quantitative trait loci (eQTLs) identified using tumor gene expression data could affect gene expression 21 in cancer cells, tumor-associated normal cells, or both. Here, we demonstrate a method to identify eQTLs affecting 22 expression in cancer cells by modeling the statistical interaction between genotype and tumor purity. Only one-third 23 of breast cancer risk variants, identified as eQTLs from a conventional analysis, could be confidently attributed to 24 cancer cells. The remaining variants could affect cells of the tumor microenvironment, such as immune cells and 25 fibroblasts. Deconvolution of tumor eQTLs will help determine how inherited polymorphisms influence cancer 26 risk, development, and treatment response. Expression quantitative trait loci (eQTLs) have been mapped in many tumor types, including high-profile studies in 29 glioma [1], colon[2], breast[3] and prostate cancer [4]. These studies measured genome-wide gene expression in 30 tumors and identified associations between these gene expression levels and common inherited (germline) genetic 31 variants (e.g. single nucleotide polymorphisms (SNPs)) profiled in the same patients. These results have been very 32 widely applied: For example, the majority of inherited cancer risk variants implicated by GWAS[5] are in non-coding 33 likely-regulatory[6,7] regions of the genome. Thus, to identify genes regulated by these variants, eQTLs identified 34 from tumor tissue[2,3] (and sometimes normal tissue [8]) are typically interrogated-facilitating rational functional 35 follow-up studies [9]. Indeed, inherited genetic variation is associated with the development of specific somatic 36 mutation profiles in cancers, and functional work demonstrated that this can be caused by germline mediated 37 changes in gene expression in cancer cells [10]. Additionally, cancer eQTLs have been extensively studied in the 38 context of pharmacogenomics, for example, inherited variants affect the expression levels of membrane 39 pump/transporter genes modulating chemotherapeutic response [11]. Notably, inherited variants associated with 40 chemotherapeutic response in cell lines are also enriched for eQTLs [12]. Putative drug target genes with existing 41 evidence of disease relevance from genetic association studies are also more likely to be successful in the drug 42 development pipeline; however, this is critically dependent on correctly assigning variants to the genes they 43 regulate [13]. These examples, pertaining to cancer risk, development, and treatment, include only a small subset of 44 applications of cancer eQTL profiles. 45 46 However, previous cancer eQTL studies quantified cancer gene expression by extracting RNA from tumor biopsies, 47 which are not a pure sample of cancer cells; instead, these are a heterogeneous mixture of, for example, cancer cells, 48 tumor-infiltrating immune cells, supporting tissue (stroma) and normal epithelial cells from the surrounding tissue. 49 Therefore, the expression profiles obtained reflect both cancer and non-cancer cells. Hence, eQTLs identified this 50 way could arise from cancer cells, tumor-associated normal cells, or both.

79
A conventional tumor eQTL mapping strategy will recover eQTLs from both cancer cells and tumor-80 associated normal cells in simulated data 81 To establish whether eQTLs in tumor-associated normal cells may indeed influence eQTL profiles recovered from 82 bulk tumor expression data, we first created a simulated dataset where underlying cancer/normal eQTL profiles 83 were known a priori. Simulations consisted of expression levels of 600 genes in pure "cancer" samples and in pure 84 "normal" tissue samples. These were then combined to simulate a "bulk tumor" expression dataset, consisting of 85 1,000 samples. Six classes of eQTLs were created, each represented by 100 genes; these were (1) genes with eQTLs 86 in cancer and normal, but with different effects in the two cell types (2) genes with eQTLs in cancer only (3) genes 87 with eQTLs in normal only (4) genes with no eQTL in either cell type (5) genes with the same eQTL in both cell 88 types and (6) genes with similar eQTLs in both cell types. Because the purpose of these simulations was to study the 89 performance of this model in real cancer data, the parameters, such as sample size, expression levels, effect sizes 90 and proportions of cancer/normal cells, were chosen to resemble the TCGA breast cancer cohort (see Methods).

92
We applied the current standard eQTL mapping strategy to these simulated data, where the expression levels from 93 bulk tumors were treated as representative of cancer itself (henceforth referred to as the "conventional model"; see 94 Methods). Importantly, the assumption here is that the goal is to identify eQTLs influencing gene expression in 95 cancer cells, therefore true simulated cancer eQTLs were treated as the ground-truth for all statistical measures of 96 performance reported in this and the next section. By comparing the results obtained from the model to the true 97 known cancer eQTLs created as part of the simulation, this approach achieved reasonable sensitivity and specificity 98 (79.5% and 80.3% respectively). However, there was a clear influence of the simulated eQTLs in the normal cells on 99 the recovered effects from bulk tumor expression (Pearson's correlation (r) = 0.9, P = 1.3 × 10 -38 between simulated 00 effect size of eQTLs with an effect in normal but not cancer cells and their estimated effect size from the 01 conventional model; Fig. 1(a)). Furthermore, while we expected a false discovery rate (FDR; estimated using the 02 Benjamini and Hochberg approach) of 5%, the true FDR was 11.1%, when the known simulated set of cancer 03 eQTLs was treated as the ground-truth. Most (37 of 40) of these false discoveries were falsely attributed 04 associations resulting from eQTLs in normal cells (Supplementary Table 1).

06
Cancer eQTLs can be accurately identified from bulk tumor expression data by modeling the interaction 07 of tumor purity and genotype in simulated data 08 To recover cancer eQTLs from bulk tumor expression data, we have built upon (see Methods) a previous study to 09 identify eQTLs with different effects in human neutrophils and lymphocytes using whole blood expression data [17]. 10 Like conventional eQTL mapping, our new approach involves fitting a linear regression model of gene expression 11 level against genotype: However, in addition to genotype, the estimated proportion of tumor-associated normal cells 12 (tumor purity) is included as a covariate, as well as the interaction between the estimated tumor purity and genotype 13 (henceforth referred to as the "interaction model"; see Methods). Critically, the estimate of the main effect 14 associated with this interaction term allows the eQTL to be assigned to cancer, not the interaction term itself (see  The interaction model recovered simulated cancer eQTLs with a sensitivity and specificity of 58.3% and 96.1% between cancer and normal cell types. However, the true FDR dropped to 3.3%, below the expected rate of 5%.

24
Only two "normal only" (group 3; see Methods) eQTLs were misattributed to cancer and the influence of normal 25 cells observed for the conventional model was eliminated ( Fig. 1(b); Supplementary Table 2). To further illustrate 26 the utility of the model, a normal-driven eQTL analyzed with a conventional model is shown in Fig. 1(c), along with 27 the capacity of the interaction model to extrapolate the correct effect size in cancer cells, deducing that this signal 28 was driven by samples with large quantities of tumor-associated normal cells ( Fig. 1(d)).

30
In cancer eQTL mapping, the assumption has been implicit that the eQTLs identified from tumor samples affect 31 gene expression in cancer cells. However, the pervasive genomic aberrations and dysregulation of key master 32 regulators that occurs in cancer cells[18] could obscure or eliminate associations between germline polymorphisms 33 and gene expression, either by increasing transcriptional noise or by disrupting the regulatory landscape. Thus, the 34 inherited genetic influence on gene expression could be far greater in normal cells than in cells that have undergone 35 neoplastic transformation. To assess the plausibility that eQTLs previously discovered from tumor expression data 36 could be largely driven by normal cells, we included an additional 500 genes with "normal only" eQTLs in our 37 simulated dataset. Again, assuming the objective is to identify eQTLs that affect gene expression in cancer cells, a 38 conventional model applied to bulk tumor expression data performs very poorly. Using an FDR threshold of 5% 39 we in fact observed a rate of false discovery rising to 46% of significant associations (Supplementary Table 3). Of  While no simulated dataset can capture the full complexity of in vivo biology, these analyses suggest that (i) it is 55 plausible that many, if not most eQTLs identified from tumor expression data using conventional approaches 56 actually affect gene expression in normal cells, not in cancer cells and (ii) using the parameters of the TCGA breast 57 cancer data, modeling the interaction of tumor purity and genotype performs well at correctly attributing true cancer 58 eQTLs. Below, we perform a case-study using an integrative analysis of real data from TCGA breast cancer, breast 59 cancer GWAS results, and samples from the Genotype-tissue Expression (GTEx) project. To test the utility of the interaction model on real data, we conducted cis-eQTL mapping in TCGA breast cancer 63 samples, where both germline genotype and bulk tumor RNA-Seq data were available (n = 894). We also applied a 64 conventional model to bulk tumor expression data (see Methods). We focused on breast cancer as it has the largest 65 available sample size, and is reasonably representative of tumor types with high normal cell contamination ( Fig.   66 2(a)). We estimated tumor purity using a consensus approach that combined the estimates from copy number  We evaluated 3,602,220 associations between tag SNPs and the expression levels of genes within 500 kilobases of 74 each tag SNP. The data were filtered and preprocessed based on the recent guidelines of GTEx, including steps to 75 control for population structure, unmeasured confounders, and expression heterogeneity (see Methods). We 76 identified 57,189 significant cis-eQTL associations (FDR < 0.05; Fig. 3(a)) using the conventional model. However, 77 using the interaction model, just 8,833 eQTLs could be confidently attributed to cancer cells (FDR < 0.05; Fig.   78 3(a)). Of the 8,833 associations attributed to cancer cells, 7,542 were also identified by the conventional model and see Methods). When we randomly permuted the tumor purity estimates, the number of eQTLs that could be 82 attributed to cancer cells was just 239 (Fig. 3(a)). We show a specific example in Figs. 3(b-e) to illustrate the process 83 of attributing eQTLs to the affected cell type. In this example, the association between SNP rs6458012 and the 84 expression of MDGA1 in breast tumors (P = 1.5 × 10 -29 ; Fig. 3 conventional and interaction models in real data. However, we can assess whether the interaction model eliminates 99 associations for likely immune and stromal cell-specific eQTLs. To do this, we used GTEx data to define a set of 00 eQTLs that were likely to be misattributed; i.e. they were more likely to have arisen in immune and stromal cells, 01 rather than from breast cancer cells. We defined this set as cis-eQTLs identified in lymphoblastoid cell lines (LCLs) 02 or transformed fibroblasts in GTEx (FDR < 0.05), which were not even nominally significant (P > 0.05) in GTEx 03 breast tissue. We reasoned that LCLs and fibroblasts provide a good proxy for tumor-associated immune and 04 stromal cells, while the regulatory landscape of breast cancer cells is likely to maintain a similarity to breast, the 05 tissue from which they developed. These criteria yielded a set of 47,196 eQTLs shared between GTEx and TCGA 06 that had a higher likelihood of being misattributed if identified as cancer eQTLs. Of the 57,189 significant 07 associations from the conventional model, 5,440 were among this set defined as likely arising in associated-normal 08 cells. For 8,833 associations from the interaction model, this number was reduced to 572. This is a significant 09 reduction in the proportion of these likely misattributed eQTLs ( Fig. 4(a & b), P = 8.1 × 10 -22 from Fisher's exact 10 test, odds ratio 1.51). Thus, consistent with our simulations, there is convincing evidence in real data that the use of 11 the interaction model reduces the misattribution of eQTLs from tumor-associated normal cells. Furthermore, we 12 also mapped breast cancer eQTLs using only 10% of the TCGA breast cancer samples that had the highest 13 estimated cancer cell content (all > 88.6% purity; median = 91.2%, n = 89). As expected, the eQTL effects 14 estimated from this high purity subset were (globally) much more similar to those estimated by the interaction  20 We also expect that genes whose regulation is disrupted following tumorigenesis would be more likely to be  Then, to determine whether these changes were biologically meaningful we assessed these genes for enrichment of 28 Gene Ontology (GO) biological process (see Methods). Indeed, the most strongly enriched processes included included DNA repair and cell cycle, key process influencing breast cancer susceptibility and progression. Some of 31 this dysregulation may be attributable to increased expression heterogeneity or different expression levels among 32 these genes in cancer and understanding the mechanisms by which normal regulation of these genes is disrupted 33 will represent a starting point for future mechanistic studies.

35
Validation of TCGA breast cancer findings in the METABRIC dataset 36 Next, we sought to replicate our results using an additional 997 breast tumor expression profiles and genotypes 37 generated by the METABRIC consortium [22]. Although this is the most suitable validation cohort available, there 38 are some limitations to this dataset; for example, the genotypes were generated from (less reliable) tumor tissue (see 39 Methods), and expression was estimated using a microarray platform, which is likely less sensitive than the RNA-seq  to the TCGA breast tumor expression data (arising from 16 of the 81 risk SNPs that could be mapped to one-or-60 more genes; see Methods). However, 9 of these eQTLs were not even nominally significant (P > 0.05) when 61 extrapolated to cancer cells using the interaction model, suggesting these are strong candidates for eQTLs arising 62 from normal cells. Indeed, all of these 9 associations were significant in at least one of fibroblast, breast or LCLs in 63 GTEx, in all cases with the same directionality as the eQTL effect estimated from bulk tumor expression using the 64 conventional model (Fig. 5(a & b); Supplementary Table 8).   We have demonstrated an improved eQTL mapping strategy for cancer, which uses tumor purity estimates and bulk 91 tumor gene expression data to identify eQTLs that can be confidently attributed to cancer cells. In breast cancer, 92 the result is that most bulk tumor eQTLs cannot be confidently attributed to cancer cells, once the possibility of 93 these eQTLs arising from tumor-associated normal cells is appropriately modeled.  In the future, one approach to cancer eQTL mapping will likely be to apply single-cell gene expression methods to 14 tumors-directly measuring gene expression in cancer and tumor-associated normal cells. For many cancer types 15 this should be possible, but currently, single-cell expression datasets are not on a scale required to map eQTL 16 profiles. For the foreseeable future, sample sizes available for gene expression in bulk tumors will remain orders-of-  for each cell type. One would also need to be sufficiently confident that the cell type proportions were being 30 accurately estimated, which becomes more difficult given more similar expression profiles in less distinct subtypes.  Another assumption that our model makes, is that the presence/absence of normal cells does not itself affect genes where this is not true, it may be difficult or impossible to separate the eQTL profiles of tumor-resident cancer 42 and normal cells using any method, including single-cell RNA-seq. 43 44 Additionally, our model, or any such model, cannot prove a non-association. It is incorrect to conclude that tumor 45 eQTLs that cannot be attributed to cancer cells are definitely not eQTLs in cancer cells, or are certainly eQTLs in 46 tumor-associated normal cells. The correct interpretation is that there is no statistical evidence for this eQTL in 47 cancer cells at the current sample size and given factors such as the accuracy with which the data were measured. 48 Notably, cancer eQTLs identified by the interaction model may still be eQTLs in other tumor-associated normal 49 cell types and these should not be interpreted as exclusively-cancer eQTLs.

51
We have elucidated a major shortcoming of current eQTL mapping strategies in cancer, in that eQTLs identified 52 from tumor expression data could arise from either cancer or tumor-associated normal cells. We have also proposed 53 a solution, which allows us to recover eQTL profiles for constituent cell types using expression data collected in a 54 mixture of cell types. We have applied this solution to breast cancer, where we showed that most eQTLs discovered 55 in tumors cannot be confidently attributed to cancer cells, once the possibility of these signals arising in tumor-56 associated normal cells is appropriately modeled. Overall, this work will improve the understanding of gene 57 regulation in cancer, including studying inherited cancer risk variants, disease development, and drug response. This 58 study also provides improved theoretical groundwork for deconvolution of eQTLs effects in other mixtures of cell 59 types, including normal human tissues.

61
Simulating bulk tumor expression data as a product of underlying "cancer" and "normal" expression data 62 63 We simulated cancer and normal gene expression datasets for 600 genes in 1000 samples-the approximate number  Simulated purity estimates were derived from 1,000 randomly chosen consensus purity estimates [15] in real TCGA 85 breast cancer samples. When recovering the cancer eQTLs using the interaction model, noise was added to the 86 purity estimates, to simulate the fact that in real data these estimates will be imprecise: For each sample, noise was 87 added by randomly sampling a normal distribution with mean 0 and standard deviation 0.1; the resulting values were 88 then quantile normalized to the original purity estimates, thus preserving the distribution of the data precisely 89 (Supplementary Figure 16). For Fig. 1(e), the standard deviation of the noise generating normal distribution was  were pre-processed and filtered primarily using the guidelines of GTEx: Expression data were quantile normalized.

05
The expression of each gene was then mapped to a standard normal distribution, with mean 0 and standard 06 deviation of 1. Genes not expressed in at least 75% of samples were removed. SNPs with a minor allele frequency 07 (MAF) of less than 5% were removed. Males, as well as Y chromosome SNPs and genes, were removed. We 08 estimated ancestry using the first 3 principal component of the genotype matrix. To account for expression 09 heterogeneity and unmeasured confounders, we also estimated 35 PEER factors [35]. The filtering steps yielded [EQ1] 16 Where y is the gene expression value; x is the genotype encoded as 0, 1 or 2; a is the 3 genotype principal 17 components used to estimate ancestry; b is the 35 PEER factors and ɛ is the residual error term. For each model, a 18 P-value for the eQTL was calculated by a t-test on the β1 term. The model to identify cancer eQTLs is similar to the model described above but also includes tumor purity, [EQ2]

29
The terms are as in EQ1, but with the addition of c, which represents the CPE estimate of tumor purity (0 < c < 1) 30 and c × x the interaction of tumor purity and genotype. Critically, tumor purity is encoded such that 0 represents 31 100% cancer cells and 1 represents 100% normal cells, meaning that the β1 term will have extrapolated an effect size 32 at 100% cancer cells. As above, the P-value for each eQTL was calculated by a t-test on the β1 term. A similar model proportion (term c in EQ2, but not bounded by 0 and 1); here, we use actual estimates of the cell type proportion, 37 bounded by 0 and 1-in this case the proportion of tumor-associated normal cells. The consequence of this is that 38 the main effect β1 now represents an extrapolated estimate of the eQTL effect size at 0% tumor-associated normal 39 cells, equivalent to 100% cancer cells. Thus, we recover cancer eQTLs by testing this main effect, rather than the 40 interaction term, which is actually a measure of how the magnitude of an eQTL differs between the two cell types 41 (as previously described in Westra et al.). Note 1: We also fit these models with gene copy number and methylation The SE terms refer to the standard error estimates associated with the eQTL effect (βTCGA and βGTEx) in TCGA and 56 GTEx respectively. P-values were calculated from these Z-scores using the probability density function for a normal 57 distribution.

59
Gene set analysis of differential eQTLs between TCGA and GTEx using GOseq 60 61 Gene set analysis, which was used to identify differentially enriched biological processes between GTEx and TCGA 62 eQTLs, was performed using the GOseq[38] package in R. We considered a gene differentially regulated if it had at 63 least one associated eQTL that was significantly different (calculated using EQ3, FDR < 0.05) between TCGA 64 breast cancer and GTEx breast tissue. All genes expressed in both TCGA breast cancer and GTEx normal breast 65 tissue were used as the background list. The GOseq approach allowed us to use a six-knot monotonic spline 66 function to control for the increased probability of a gene appearing in the foreground list (i.e. differentially Equilibrium (HWE) P < 1 × 10 -6 using Plink [42]. The input data were further validated and QC'ed by the server, The METABRIC data were obtained with permission from the European Genotype Archive. Raw Affymetrix 82 Genome-Wide Human SNP Array 6.0 CEL files were obtained from archive EGAD00010000164. We called 83 genotypes from these files using the Birdseed v2 algorithm under the default configuration implemented in 84 Affymetrix Genotyping Console. Notably, these data were measured from tumor tissue and are thus less reliable 85 than genotypes called from blood (as in TCGA); however, the METABRIC authors have previously used these 86 genotypes for eQTL mapping, and demonstrated that the results were reasonably consistent with those obtained 87 from genotypes generated from matched normal tissue [22]. We filtered SNPs with >.05 missing genotypes, MAF 88 <0.05 and only retained SNPs also included in the final TCGA analysis. The METABRIC "discovery" (n = 997) 89 normalized gene expression data were obtained from archive EGAD00010000210. We retained genes also included 90 in the TCGA analysis and mapped each gene to a normal distribution with mean 0 and standard deviation 1.

91
Covariates for expression heterogeneity and population structure were estimated and SNPs were mapped to genes 92 as in the TCGA analysis above. Note that the PEER algorithm did not converge on our METABRIC expression we also mapped these estimates to the same quantiles of the TCGA CPE data. Similarly to TCGA, eQTLs were 99 then mapped using the "conventional" and "interaction" models in EQ1 & EQ2.  (a) Scatterplot of the eQTL effect size recovered from a conventional analysis of bulk tumor expression data (y-axis) 12 against the known normal eQTL effect size created by simulation (x-axis) for the 100 eQTLs that were simulated to 13 have an effect in normal cells, but not cancer. Points are colored red if the conventional model identified these as 14 significant at FDR < 0.05. The eQTL effects recovered by the conventional model (y-axis) are heavily influenced by 15 the eQTL effects in tumor-associated normal cells.

16
(b) Scatterplot of the estimated cancer eQTL effect size recovered by the interaction model (y-axis) plotted against the 17 known normal eQTL effect size created by simulation (x-axis) for the same 100 eQTLs as Fig. 1(a)  tumor, obtained by the conventional model, is shown in green. Whiskers represent 95% confidence intervals. The 28 interaction model has not misattributed this eQTL to cancer cells.

29
(e) The change in the sensitivity, specificity, and FDR achieved by the interaction model as the level of noise with which 30 the proportion of cancer cells is measured changes. The "Pearson correlation" on the x-axis is the correlation between 31 the known simulated proportions and those "measured" as more noise is added (see Methods). The dashed red line is 32 at 0.05, the rate at which the FDR was controlled for these tests using the Benjamini and Hochberg method. The 33 FDR is well controlled by the interaction model, even when the correlation between the real and measured (noise 34 added) proportions approaches 0.5. Note: if the cancer cell proportions are completely randomized, the true FDR is 35 22% (at the 5% threshold). Again, when calculating these true FDRs, the known simulated set of cancer eQTLs were 36 treated as the ground truth.   represent 95% confidence intervals. The association from the conventional model applied to TCGA breast cancer 59 bulk tumors is shown in green (corresponding to Fig. 3(b)). Shown in black are the effect sizes and confidence 60 intervals for the association of rs6458012 and MDGA1 when TCGA breast samples are divided into five equally 61 sized bins, based on each sample's estimated proportion of cancer cells. The estimated effect size decreases as the 62 proportion of cancer cells decreases. The extrapolated effect size in cancer cells, estimated by the interaction 63 model, is shown in red; this association is not statistically significant, illustrated by the 95% confidence interval 64 overlapping the grey dashed line, which represents an effect size of 0. This suggests the association recovered by 65 the conventional model did not arise in cancer cells. 66 d) A strip chart showing the association of rs6458012 and the expression MDGA1 in fibroblasts from GTEx. These 67 are associated (P = 6.3 × 10 -13 ) with the same directionality as identified in TCGA breast tumors (Fig. 3(b)). 68 e) A strip chart showing the association of rs6458012 and the expression MDGA1 in LCLs from GTEx. These are 69 associated (P = 1.4 × 10 -10 ) with the same directionality as identified in TCGA breast tumors (Fig. 3(b)).  model, or the interaction (cancer) model, which were also significant (FDR < 0.05) in GTEx LCL or fibroblast data, 75 but not GTEx breast data (P > 0.05).

76
(b) The proportions of eQTLs identified by the conventional (bulk tumor) or interaction (cancer) model that are likely 77 to be derived from normal cells, based on Fig. 4(a).

78
(c) Bar graph of the lowest 8 P-values of 3,679 GO biological processes tested; P-values are for the enrichment of 79 genes whose eQTL profile changes between TCGA breast cancer (identified by the interaction model) and normal 80 breast tissue in GTEx. 81 82  The code to reproduce the results in this paper are on GitHub: https://github.com/paulgeeleher/cancerEqtls and 06 Open Science Framework: https://osf.io/z7uyp/ 07 Note that permission must be obtained from TCGA and METABRIC to obtain access to germline genotype 08 information. Competing interests 15 16 The authors declare that they have no competing interests.  Author contributions 26 PG and RSH conceived the study. PG developed the statistical methods, performed the analysis and drafted the 27 paper. CS provided support/insight in statistical methods, deconvolution approaches and analysis. JF performed 28 exploratory initial analysis. AN and ANB provided analytical support. FW and ZZ performed genotype imputations.

29
RG and ZZ provided computational resources and support. RSH supervised the study. All authors approved/edited 30 the final manuscript.