GxEsum
Following Ni et al. [9], RNM can be written as:
$$ \mathbf{y}=\mathbf{b}+\mathbf{g}+\boldsymbol{\uptau} =\mathbf{b}+{\mathbf{g}}_{\mathbf{0}}+{\mathbf{g}}_{\mathbf{1}}\times \mathbf{E}+{\boldsymbol{\uptau}}_{\mathbf{0}}+{\boldsymbol{\uptau}}_{\mathbf{1}}\times \mathbf{E} $$
where y is the N vector of phenotypic observations; b is a vector of fixed effects; g is the N individual genetic effects, which can be decomposed into the first and second order of genetic random regression coefficients, g0 and g1; τ is the residual effects, decomposed into the first and second order of residual random regression coefficients, τ0 and τ1; and E is the N vector of environmental variable. Note that E can be also any covariate variable (e.g., smoking, alcohol intake frequency).
Assuming that the phenotypes (y) are pre-adjusted for the main genetic effects (g0), environmental or covariate variable (E), and other fixed effects (b), the model can be rewritten as:
$$ \mathbf{y}={\mathbf{g}}_{\mathbf{1}}\times \mathbf{E}+{\boldsymbol{\uptau}}_{\mathbf{0}}+{\boldsymbol{\uptau}}_{\mathbf{1}}\times \mathbf{E}=\mathbf{X}{\boldsymbol{\upbeta}}_{\mathbf{1}}\times \mathbf{E}+{\boldsymbol{\uptau}}_{\mathbf{0}}+{\boldsymbol{\uptau}}_{\mathbf{1}}\times \mathbf{E} $$
where X is an N x M standardized genotype matrix for M SNPs, and β1 is an M vector of SNP interaction effects modulated by the environment (i.e., GxE SNP effects). It is noted that τ0 is the residual effects that are consistent across the environment whereas τ1 captures the heterogeneous residual effects across the environment (i.e., RxE).
Following Bulik-Sullivan et al. [10], assuming \( \mathbbm{E}\left[{\mathbf{g}}_{\mathbf{1}}\right]=\mathbbm{E}\left[{\boldsymbol{\upbeta}}_{\mathbf{1}}\right]=0 \), the expected chi-square statistics of variant j for the GxE is:
$$ \mathbbm{E}\left[{\chi}_j^2\right]=N\bullet \mathrm{Var}\Big(\hat{\beta_{1j}\Big)} $$
(1)
Using the law of total variance, \( \mathrm{Var}\Big[\hat{\beta_{1j}\Big]} \) can be obtained as:
$$ \mathrm{Var}\Big(\hat{\beta_{1j}\Big)}=\mathbbm{E}\left[\mathrm{Var}\left(\hat{\beta_{1j}}|\mathbf{EX}\right)\right]+\mathrm{Var}\left[\mathbbm{E}\left(\hat{\beta_{1j}}|\mathbf{EX}\right)\right] $$
$$ =\mathbbm{E}\left[\mathrm{Var}\left(\hat{\beta_{1j}}|\mathbf{EX}\right)\right] $$
where EX is an N x M matrix with each column having the Hadamard product between E and Xj (standardized genotypes at the jth SNP), and the conditional expectation of \( \hat{\beta_{1j}} \) is \( \mathbbm{E}\left(\hat{\beta_{1j}}|\mathbf{EX}\right)=0 \).
Noting that the least-square estimate of \( \hat{\beta_{1j}} \) can be obtained as \( \hat{\beta_{1j}}=\left(\mathbf{E}{\mathbf{X}}_j\right)^{\prime}\mathbf{y}/N \), \( \mathrm{Var}\left(\hat{\beta_{1j}}|\mathbf{EX}\right) \) can be rearranged as:
$$ \mathrm{VAR}\left(\hat{\beta_{1j}}|\mathbf{EX}\right)=\mathrm{Var}\left[\left(\mathbf{E}{\mathbf{X}}_j\right)^{\prime}\mathbf{y}/N|\mathbf{EX}\right] $$
$$ =\frac{1}{N^2}\mathrm{Var}\left[{\left(\mathbf{E}{\mathbf{X}}_j\right)}^{\prime}\mathbf{y}|\mathbf{EX}\right] $$
$$ =\frac{1}{N^2}{\left(\mathbf{E}{\mathbf{X}}_j\right)}^{\prime}\mathrm{Var}\left(\mathbf{y}|\mathbf{EX}\right)\left(\mathbf{E}{\mathbf{X}}_j\right) $$
$$ =\frac{1}{N^2}\left(\mathbf{E}{\mathbf{X}}_j\right)^{\prime}\mathrm{Var}\left[\left(\mathbf{E}\mathbf{X}\right){\boldsymbol{\upbeta}}_{\mathbf{1}}+{\boldsymbol{\uptau}}_{\mathbf{0}}+\mathbf{E}{\boldsymbol{\uptau}}_{\mathbf{1}}|\mathbf{EX}\right]\left(\mathbf{E}{\mathbf{X}}_j\right) $$
$$ =\frac{1}{N^2}{\left(\mathbf{E}{\mathbf{X}}_j\right)}^{\prime}\mathrm{Var}\left[\left(\mathbf{E}\mathbf{X}\right){\boldsymbol{\upbeta}}_{\mathbf{1}}|\mathbf{EX}\right]\left(\mathbf{E}{\mathbf{X}}_j\right) $$
$$ +\frac{1}{N^2}{\left(\mathbf{E}{\mathbf{X}}_j\right)}^{\prime}\left(\mathbf{E}{\mathbf{X}}_j\right)\mathrm{Var}\left({\boldsymbol{\uptau}}_{\mathbf{0}}\right)+\frac{1}{N^2}\left(\mathbf{E}{\mathbf{X}}_j\right)^{\prime}\left(\mathbf{E}\right)\left(\mathbf{E}\right)^{\prime}\left(\mathbf{E}{\mathbf{X}}_j\right)\mathrm{Var}\left({\boldsymbol{\uptau}}_{\mathbf{1}}\right) $$
$$ =\frac{1}{N^2}\left(\mathbf{E}{\mathbf{X}}_j\right)^{\prime}\left(\mathbf{E}\mathbf{X}\right){\left(\mathbf{E}\mathbf{X}\right)}^{\prime}\left(\mathbf{E}{\mathbf{X}}_j\right)\mathrm{Var}\left({\boldsymbol{\upbeta}}_{\mathbf{1}}|\mathbf{EX}\right) $$
$$ +\frac{1}{N^2}\left[N\mathrm{Var}\left({\boldsymbol{\uptau}}_{\mathbf{0}}\right)+\left(\mathbf{E}{\mathbf{X}}_j\right)^{\prime}\left(\mathbf{E}\right)\left(\mathbf{E}\right)^{\prime}\left(\mathbf{E}{\mathbf{X}}_j\right)\mathrm{Var}\left({\boldsymbol{\uptau}}_{\mathbf{1}}\right)\right] $$
$$ =\frac{1}{N^2}\frac{h_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}}{M}{\left(\mathbf{E}{\mathbf{X}}_j\right)}^{\prime}\left(\mathbf{E}\mathbf{X}\right){\left(\mathbf{E}\mathbf{X}\right)}^{\prime}\left(\mathbf{E}{\mathbf{X}}_j\right) $$
$$ +\frac{1}{N^2}\left[N\left(1-{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}-{h}_{{\boldsymbol{\tau}}_{\mathbf{1}}}^{\mathbf{2}}\right)+{h}_{{\boldsymbol{\tau}}_{\mathbf{1}}}^{\mathbf{2}}\left(\mathbf{E}{\mathbf{X}}_j\right)^{\prime}\left(\mathbf{E}\right)\left(\mathbf{E}\right)^{\prime}\left(\mathbf{E}{\mathbf{X}}_j\right)\right] $$
where \( {h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}} \) and \( {h}_{{\boldsymbol{\tau}}_{\mathbf{1}}}^{\mathbf{2}} \) are the proportion of phenotypic variance explained by GxE and RxE, respectively.
Therefore, \( \mathbbm{E}\left[{\chi}_j^2\right] \) in Eq. (1) can be written as:
$$ \mathbbm{E}\left[{\chi}_j^2\right]=N\bullet \mathrm{Var}\left(\hat{\beta_{1j}}|\mathbf{EX}\right) $$
$$ =\frac{1}{N}\left[\frac{h_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}}{M}{\left(\mathbf{E}{\mathbf{X}}_j\right)}^{\prime}\left(\mathbf{E}\mathbf{X}\right){\left(\mathbf{E}\mathbf{X}\right)}^{\prime}\left(\mathbf{E}{\mathbf{X}}_j\right)+N\left(1-{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}-{h}_{{\boldsymbol{\tau}}_{\mathbf{1}}}^{\mathbf{2}}\right)+{h}_{{\boldsymbol{\tau}}_{\mathbf{1}}}^{\mathbf{2}}\left(\mathbf{E}{\mathbf{X}}_j\right)^{\prime}\left(\mathbf{E}\right)\left(\mathbf{E}\right)^{\prime}\left(\mathbf{E}{\mathbf{X}}_j\right)\right] $$
(2)
According to Bulik-Sullivan et al. [10], the products of the standardized genotypes at variant j and other variants can be expressed as a function of LD scores, i.e.:
$$ \frac{1}{N^2}{\left({\mathbf{X}}_{\mathbf{j}}\right)}^{\prime}\left(\mathbf{X}\right){\left(\mathbf{X}\right)}^{\prime}\left({\mathbf{X}}_{\mathbf{j}}\right)=\frac{1}{N^2}\left({\mathrm{N}}^2+\left[\mathrm{N}\ast \left(1-\tilde{r}_{j}^2\right)+\tilde{r}_{j}^2\ast {\mathrm{N}}^2\right]\ast \left(\mathrm{M}-1\right)\right) $$
$$ =\mathbf{1}+\left[\frac{1-\tilde{r}_{j}^2}{\mathrm{N}}+\tilde{r}_{j}^2\right]\ast \left(\mathrm{M}-1\right) $$
$$ ={\ell}_j+\left[\frac{\left(\mathrm{M}-1\right)\left(1-\tilde{r}_{j}^2\right)}{\mathrm{N}}\right] $$
$$ \approx {\ell}_j+\frac{M-{\ell}_j}{N} $$
where \( \tilde{r}_{j}^2 \) is defined as the expected sample correlation between genotypes at the jth variant and the other (M-1) variants, and ℓj = 1+ \( \tilde{r}_{j}^2\left(M-1\right) \) is the LD scores of the jth SNP.
According to the central moment theory of standard normal distribution of three independent random variables (X1, X2 and E), each with an N vector, useful equations are:
$$ \mathbbm{E}\left[\left({\mathbf{X}}_{\mathbf{1}}\right)^{\prime}\left({\mathbf{X}}_{\mathbf{1}}\right)\right]=N $$
$$ \mathbbm{E}\left[\left({\mathbf{X}}_{\mathbf{1}}\right)^{\prime}\left({\mathbf{X}}_{\mathbf{2}}\right)\left({\mathbf{X}}_{\mathbf{2}}\right)^{\prime}\left({\mathbf{X}}_{\mathbf{1}}\right)\right]=N $$
$$ \mathbbm{E}\left[\left({\mathbf{X}}_{\mathbf{1}}\right)^{\prime}\left({\mathbf{X}}_{\mathbf{1}}\right)\left({\mathbf{X}}_{\mathbf{1}}\right)^{\prime}\left({\mathbf{X}}_{\mathbf{1}}\right)\right]={N}^2 $$
$$ \mathbbm{E}\left[\left(\mathbf{E}{\mathbf{X}}_{\mathbf{1}}\right)^{\prime}\left(\mathbf{E}\right)\left(\mathbf{E}\right)^{\prime}\left(\mathbf{E}{\mathbf{X}}_{\mathbf{1}}\right)\right]=\mathbbm{E}\left[\left(\mathbf{E}{\mathbf{X}}_{\mathbf{1}}\right)^{\prime}\left(\mathbf{E}{\mathbf{X}}_{\mathbf{2}}\right)\left(\mathbf{E}{\mathbf{X}}_{\mathbf{2}}\right)^{\prime}\left(\mathbf{E}{\mathbf{X}}_{\mathbf{1}}\right)\right]=3N $$
and
$$ \mathbbm{E}\left[\left(\mathbf{E}{\mathbf{X}}_{\mathbf{1}}\right)^{\prime}\left(\mathbf{E}{\mathbf{X}}_{\mathbf{1}}\right)\left(\mathbf{E}{\mathbf{X}}_{\mathbf{1}}\right)^{\prime}\left(\mathbf{E}{\mathbf{X}}_{\mathbf{1}}\right)\right]={N}^2. $$
Therefore, assuming that E and Xj have a negligible correlation for a polygenic trait (i.e., a tiny proportion of the phenotypic variance of E can be explained by a single SNP, Xj), the term (EXj)′(EX)(EX)′(EXj) can be expressed as a function of LD scores as:
$$ \frac{1}{N^2}{\left(\mathbf{E}{\mathbf{X}}_j\right)}^{\prime}\left(\mathbf{E}\mathbf{X}\right){\left(\mathbf{E}\mathbf{X}\right)}^{\prime}\left(\mathbf{E}{\mathbf{X}}_j\right)=\frac{1}{N^2}\left({N}^2+\left[3N\left(1-\tilde{r}_{j}^2\right)+\tilde{r}_{j}^2\ast {N}^2\right]\ast \left(M-1\right)\right) $$
$$ =1+\left[\frac{3\left(1-\tilde{r}_{j}^2\right)}{N}+\tilde{r}_{j}^2\right]\ast \left(M-1\right) $$
$$ ={\ell}_j+\frac{3\left(M-{\ell}_j\right)}{N} $$
Thus, a part in Eq. (2) can be rearranged as:
$$ \frac{1}{N}\left[\frac{h_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}}{M}{\left(\mathbf{E}{\mathbf{X}}_j\right)}^{\prime}\left(\mathbf{E}\mathbf{X}\right){\left(\mathbf{E}\mathbf{X}\right)}^{\prime}\left(\mathbf{E}{\mathbf{X}}_j\right)+N\left(1-{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}\right)\right] $$
$$ =\frac{1}{N}\left[\frac{N^2{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}}{M}\left({\ell}_j+\frac{3\left(M-{\ell}_j\right)}{N}\right)+N\left(1-{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}\right)\right] $$
$$ =\frac{N{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}}{M}\left({\ell}_j+\frac{3\left(M-{\ell}_j\right)}{N}\right)+1-{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}} $$
$$ =\frac{\left(N-3\right){h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}}{M}{\ell}_j+1+2{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}} $$
$$ =\frac{N\left(1-3/N\right){h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}}{M}{\ell}_j+1+2{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}} $$
$$ =\frac{N{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}}{M}{\ell}_j+1+2{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}} $$
(3)
The term, 1 − 3/N, in Eq. (3) can be approximated as 1 in the analysis using biobank scale data, which contains over 105 samples.
The remaining part in Eq. (2) can be rearranged as:
$$ \frac{1}{N}\left[N\left(-{h}_{{\boldsymbol{\tau}}_{\mathbf{1}}}^{\mathbf{2}}\right)+{h}_{{\boldsymbol{\tau}}_{\mathbf{1}}}^{\mathbf{2}}\left(\mathbf{E}{\mathbf{X}}_j\right)^{\prime}\left(\mathbf{E}\right)\left(\mathbf{E}\right)^{\prime}\left(\mathbf{E}{\mathbf{X}}_j\right)\right]=2{h}_{{\boldsymbol{\tau}}_{\mathbf{1}}}^{\mathbf{2}} $$
where \( \mathbbm{E}\left[\left(\mathbf{E}{\mathbf{X}}_{\mathbf{1}}\right)^{\prime}\left(\mathbf{E}\right)\left(\mathbf{E}\right)^{\prime}\left(\mathbf{E}{\mathbf{X}}_{\mathbf{1}}\right)\right]=3N \) according to the central moment theory of standard normal distribution (see above), assuming that E and each column of X have a negligible correlation, which satisfies if E is an environmental variable or a polygenic trait.
Therefore,
$$ \mathbbm{E}\left[{\chi}_j^2\right]=N\bullet \mathrm{Var}\left(\hat{\beta_{1j}}|\mathbf{EX}\right)=\frac{N{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}}{M}{\ell}_j+1+2{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}+2{h}_{{\boldsymbol{\tau}}_{\mathbf{1}}}^{\mathbf{2}} $$
(4)
where \( 1+2{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}+2{h}_{{\boldsymbol{\tau}}_{\mathbf{1}}}^{\mathbf{2}} \) can be obtained as the intercept of the outcome by fitting to the proposed model (GxEsum). It is noted Eq. (4) is valid when Xj is not strictly normally distributed, i.e., the centered and standardized genotypes of jth SNP, which is already shown in Bulik-Sullivan et al. [10]. When the environmental variable (E) is non-normal, the general form of the fourth central moment term can be expressed as \( \mathbbm{E}\left[\left(\mathbf{E}{\mathbf{X}}_{\mathbf{1}}\right)^{\prime}\left(\mathbf{E}\right)\left(\mathbf{E}\right)^{\prime}\left(\mathbf{E}{\mathbf{X}}_{\mathbf{1}}\right)\right]= \)kurtosis∙N where kurtosis is the kurtosis of E. And only the intercept part of the Eq. (4) is slightly modified as:
$$ \mathbbm{E}\left[{\chi}_j^2\right]=N\bullet \mathrm{Var}\left(\hat{\beta_{1j}}|\mathbf{EX}\right)=\frac{N{h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}}{M}{\ell}_j+1+\left({\mathrm{k}}_{\mathrm{urtosis}}-1\right){h}_{{\boldsymbol{g}}_{\mathbf{1}}}^{\mathbf{2}}+\left({\mathrm{k}}_{\mathrm{urtosis}}-1\right){h}_{{\boldsymbol{\tau}}_{\mathbf{1}}}^{\mathbf{2}} $$
(5)
Equations (4) and (5) are verified using simulations (see Additional file 1: Note S3 and Table S2).
To validate the proposed model in general, we used comprehensive phenotypic simulations that were based on real genotype data (see Additional file 1: Note S4).
Real data
UK Biobank data were used, which contains 0.5 million individuals aged between 40 and 69 years. The data consists of health-related information for each participant who was recruited in 2006–2010, and their imputed genomic data (~ 92 million SNPs) has been distributed through European Genome-phenome Archive. A stringent quality control process for individuals was set as follows: (1) who were reported as non-white British, (2) who were having mismatched gender between the reported and the inferred by the genotypic data, (3) who were having missing rate over 0.05, and (4) who were having putative sex chromosome aneuploidy. In addition, only HapMap3 SNPs were used which were passed from the stringent quality controls for SNPs. The filter for SNPs is set as follows: (1) which were having INFO score less than 0.6, (2) which were having a MAF less than 1%, (3) which were having Hardy-Weinberg Equilibrium (HWE) P-value less than 1E−4, and (4) one of which from the duplicated SNPs. From those passing the tough procedures, we additionally excluded one of pair of samples who were having a genomic relationship higher than 0.05. After quality control, 288,837 individuals and 1,133,273 SNPs remained. We estimated LD scores using the genotypic data of UK Biobank after these quality control processes.
Among the trait phenotypes available in the UK biobank, we arbitrarily selected BMI (a quantitative trait) and hypertension and type 2 diabetes (binary disease traits) and tested if the genetic effects of the complex traits were significantly modulated by an environmental variable, i.e., NEU, ALC, PA, or age (for testing BMI); BMI, WHR, or BFP (for hypertension); and BMI, diastolic BP, or systolic BP (for type 2 diabetes). The number of cases for hypertension and type 2 diabetes was 134,499 (population prevalence is 0.51) and 11,694 (population prevalence is 0.04), respectively. The phenotypes of the main trait were adjusted for potential confounders such as age, gender, year of birth, assessment center, Townsend Deprivation Index, genetic batch, household income, educational qualification [38], the first 10 principal components, and the environmental variable. For any phenotypic missing value for each variable, we used the mean of the phenotypes of the variable, i.e., phenotypic imputation with the mean. A better phenotypic imputation method [27] can be used, which is likely to improve the significance of GxE. Further details of the variables used in this study are in Additional file 1: Note S4.
In GWAS, we used a linear model for quantitative traits as well as for binary responses. The use of a linear model applied to binary responses is because it has been reported that a logistic regression may generate biased estimates in some instances [39], and our simulations (Additional file 1: Note S2) were based on a probit model (i.e., a linear transformation of the inverse standard normal distribution) that can be well approximated by a linear model [40].