Modelling local gene networks increases power to detect trans-acting genetic effects on gene expression

Expression quantitative trait loci (eQTL) mapping is a widely used tool to study the genetics of gene expression. Confounding factors and the burden of multiple testing limit the ability to map distal trans eQTLs, which is important to understand downstream genetic effects on genes and pathways. We propose a two-stage linear mixed model that first learns local directed gene-regulatory networks to then condition on the expression levels of selected genes. We show that this covariate selection approach controls for confounding factors and regulatory context, thereby increasing eQTL detection power and improving the consistency between studies. GNet-LMM is available at: https://github.com/PMBio/GNetLMM. Electronic supplementary material The online version of this article (doi:10.1186/s13059-016-0895-2) contains supplementary material, which is available to authorized users.

We first studied a small gene-gene network to show the benefits of our model. The network consists of four genes A, B, C and D, with the following topology: Gene A has a cis-anchor explaining between 10% and 20% of the variance and each regulating gene explains between 10% and 20% of the variance. The remaining variance goes into i.i.d. noise. We are interested in finding trans associations and for simplicity here concentrate on the association between SNP A and gene C.
Next, we keep all but one parameter fixed, and vary

Confounding
Next, we studied a small gene-gene network in which conditioning on the wrong gene leads to false positives. The networks consists of three genes A, B and C, and two confounding factors H A , H B : Gene A and Gene B have the same cis anchor explaining in total between 10% and 20% of the variance andand the confounding effects explain together 50% to 70% of the gene √ï s variability.. The remaining variability goes into i.i.d noise. We are interested in testing if GNet-LMM calls V-structures that lead to false positives between SNP A and Gene C. We keep all parameter fixed and • use the same cis SNPs for gene A and gene B to test for synthetic associations • use different cis SNPs for gene A and gene B to examine the effects of confounding in general Each setting is repeated 10000 times.

Parallel
In the following, we study a small gene-gene network in which a SNP has a direct and an indirect (cis-mediated) effect on the target gene. The network consists of three genes A,B and C having the following structure: Gene A has a cis anchor explaining between 10% and 20% of the variance and each regulating gene explains between 10% and 20% of the variance. The remaining variance goes into i.i.d. noise. We are interested if GNet-LMM improves power compared to a standard LMM depending depending if the association is cis mediated or not. For this, we introduce an additional parameter α, where (1 − α) 2 % of the variance explained by SNP A goes to Gene A and α 2 goes directly to gene C. We let α vary in {0.2, 0.4, 0.6, 0.8, 1.0} and repeat each setting 1000 times.

Trans
Finally, we study a small gene-gene network in which a SNP effect is mediated by a cis and by a trans effect on gene. The network consists of four genes A,B, C and D having the following structure: Gene A has a cis anchor and Gene D has a trans anchor with the same SNP. In total, the SNP explains between 10% and 20% of the variance and each regulating gene explains between 10% and 20% of the variance of gene C. The remaining variance goes into i.i.d. noise. We are interested if GNet-LMM improves power compared to a standard LMM depending if the association is cis mediated or not. For this, we introduce a new parameter α, where (1 − α) 2 % of the variance explained by SNP A is mediated by Gene A and α 2 is mediated by Gene D. We let α vary in {0.2, 0.4, 0.6, 0.8, 1.0} and repeat each setting 1000 times.

Power Simulations
Each gene is a linear function of all in-coming edges, the cis-SNP and noise: Y i,cis = X i,cis w i,cis (8) The matrix A g is the adjacency matrix where A g [i, j] = 1 iff gene j causes gene i. The simulated networks contained 100 genes. The gene weights are drawn from a mixture of two normal distribution, where the mixture coefficient is 0.5 and the means are ±1 and the standard error from each component is 0.1. By using this prior, we ensure that all in-coming edges have roughly the same impact.
Sparse Network We draw a directed edge between gene i and gene j with probability 5%. In addition, we require that i < j to ensure that there are no loops in the network and allow for no more than 5 in-coming edges.
Star Network Select the first 9 genes as hubs. Each hub regulates between 20% and 50% of the genes. No hub is regulated by another hub.

Confounding Factors
Y i,cis = X i,cis w i,cis (10) We first simulate genes according the gene network. Afterwords, we add confounding to the genes.This should resemble technical confounding that is not propagated by the gene-gene network. We again use a mixture of two normal distribution for the weights of the confounding factors.
We vary the following parameters: • the variance of the SNP in σ 2 SN P ∈ {0.0, 0.05, 0.10, 0.15, 0.20} • the variance of the network σ 2 network ∈ {0.0, 0.3, 0.5, 0.7, 0.8, 0.9} • the ratio between the confounding factors and the gene network α ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0} • the expected number of confounders per gene in {0, 0.5, 1, 2, 3} • the number of confounders {0, 1, 2, 3, 4, 5}  Figure S1: Power comparison across different settings for the basic simulation. Compared are a standard linear mixed model (LMM), GNet-LMM with cis and trans anchors as well as an LMM when conditioning on true exogenous genes (ideal-LMM) and an LMM when conditioning on co-regulated genes downstream of the focal gene C. (a) the topology of the simulated network (see main text, Methods). Considered are alternative simulation parameters, varying (b) the proportion of variance explained by SNP A on the mediating (cis) gene A, (c) the effect of the edge effect of Gene A on Gene C, (d) the variance explained by Gene B on Gene C. Across the different scenarios, the performance of GNet-LMM is close to the theoretical limit when conditioning on true exogenous genes (ideal-LMM).

Latent Factor Analyser
↵ Let Y be the gene expression matrix. It has N rows and T columns. Each column Y :,t corresponds to one gene, each row Y n to one sample. The genes can be represented as a linear function of the transcription factors, where is a matrix of dimension N ⇥ K and corresponds to the unobserved expression values of the transcription factors. X is a matrix of size T ⇥ K and corresponds to the weights of the transcription factors. The indicator matrix I trans is of size T ⇥ K and maps the genes to the transcription factors, having a 1 if the transcription factor regulates the gene and 0 otherwise. Each gene has its own noise-level Z :,t ⇠ N 0, ⌧ Yt 1 I N ⇥N , and Gaussian priors are put over the transcription factor weights X t ⇠ N 0, ⌧ Xt 1 I K⇥K . The transcription factors are partly explained by genetics and partly by environmental contributions: where G 2 R N ⇥F are the SNPs, W 2 R K⇥F are the weights having a Gaussian prior W k ⇠ N (0, ⌧ 1 W k I F ⇥F ) and E is the environmental contribution, also drawn from a Gaussian distribution E :,k ⇠ N 0, ⌧ k 1 I N ⇥N . The indicator matrix I cis 2 {0, 1} K⇥F can be used to specify which SNPs control the transcription factors. For convenience, we use a Gamma distribution (⌧ |a, b) as prior for all precision parameters.
The joint distribution over the model parameters and the data is given by Unfortunately, the evidence is not tractable and we have to either use approximation methods or MCMC sampling for computing the evidence or the posterior over the parameters.

Latent Factor Analyser
↵ Let Y be the gene expression matrix. It has N rows and T columns. Each column Y :,t corresponds to one gene, each row Y n to one sample. The genes can be represented as a linear function of the transcription factors, where is a matrix of dimension N ⇥ K and corresponds to the unobserved expression values of the transcription factors. X is a matrix of size T ⇥ K and corresponds to the weights of the transcription factors. The indicator matrix I trans is of size T ⇥ K and maps the genes to the transcription factors, having a 1 if the transcription factor regulates the gene and 0 otherwise. Each gene has its own noise-level Z :,t ⇠ N 0, ⌧ Yt 1 I N ⇥N , and Gaussian priors are put over the transcription factor weights X t ⇠ N 0, ⌧ Xt 1 I K⇥K . The transcription factors are partly explained by genetics and partly by environmental contributions: where G 2 R N ⇥F are the SNPs, W 2 R K⇥F are the weights having a Gaussian prior W k ⇠ N (0, ⌧ 1 W k I F ⇥F ) and E is the environmental contribution, also drawn from a Gaussian distribution E :,k ⇠ N 0, ⌧ k 1 I N ⇥N . The indicator matrix I cis 2 {0, 1} K⇥F can be used to specify which SNPs control the transcription factors. For convenience, we use a Gamma distribution (⌧ |a, b) as prior for all precision parameters.
The joint distribution over the model parameters and the data is given by Unfortunately, the evidence is not tractable and we have to either use approximation methods or MCMC sampling for computing the evidence or the posterior over the parameters. = G I cis where G 2 R N ⇥F are the SNPs, W 2 R K⇥F a N (0, ⌧ 1 W k I F ⇥F ) and E is the environmental distribution E :,k ⇠ N 0, ⌧ k 1 I N ⇥N . The in to specify which SNPs control the transcriptio distribution (⌧ |a, b) as prior for all precision The joint distribution over the model param Unfortunately, the evidence is not tractable and we have to either use a for computing the evidence or the posterior ov

Latent Factor Analyser
↵ Let Y be the gene expression matrix. It has N rows and T colu corresponds to one gene, each row Y n to one sample. The genes c linear function of the transcription factors, where is a matrix of dimension N ⇥ K and corresponds to the values of the transcription factors. X is a matrix of size T ⇥ K a weights of the transcription factors. The indicator matrix I trans is o the genes to the transcription factors, having a 1 if the transcripti gene and 0 otherwise. Each gene has its own noise-level Z :,t ⇠ N Gaussian priors are put over the transcription factor weights X t ⇠ The transcription factors are partly explained by genetics and p contributions: where G 2 R N ⇥F are the SNPs, W 2 R K⇥F are the weights having N (0, ⌧ 1 W k I F ⇥F ) and E is the environmental contribution, also dr distribution E :,k ⇠ N 0, ⌧ k 1 I N ⇥N . The indicator matrix I cis 2 to specify which SNPs control the transcription factors. For conveni distribution (⌧ |a, b) as prior for all precision parameters.
The joint distribution over the model parameters and the data is is not tractable and we have to either use approximation method for computing the evidence or the posterior over the parameters.   Figure S4: Graphical Model of the data generation process for simulations that include confounding effects. In a first step, the latent gene expression levels are simulated as function of the in-coming (latent) genes, cis-SNPs and noise. In a second step, technical noise is added to a subset of the genes leading to confounding that is not propagated along the genetic network.  and GNetLMM [trans]. In contrast, for tests without cis eQTLs, GNETLmm[cis] reverts to a standard LMM whereas GNETLmm[trans] still achieves a greater enrichment than other methods. Overall, there appears to be a stronger enrichment of pathway-consistent associations among tests that do have cis anchors, which is consistent with the idea that genuine trans associations tend to be cis mediated. 1.0e-06 1.0e-08 1.0e-10 1.0e-12 1.0e-14 Figure S13: Number of retrieved cis associations on the Cardiogenics dataset. Shown is the number of cis-hits found for varying cutoff values. Across all thresholds, factor model based approach find more hits than standard approaches. As expected, GNet-LMM performs equivalently to the standard LMM.   Figure S17: Validation of genes that are associated with the LYZ region on the Cardiogenics dataset using an independent Monocyte eQTL study. Shown is the receiver operating characteristics (ROC) for eQTLs discovered in the Cardiogenics dataset using different methods, when using eQTL genes discovered in an independent Monocyte eQTL dataset (Fairfax et al., 2014) as ground truth set (standard linear mixed model; P < 0.01, Bonferroni adjusted across tests). The ROC curve is truncated at an FPR of 5%. The replication rate was best for GNet-LMM, whereas other methods resulted in a lower replication rate than obtained using a standard linear model.

11
(a) ↵ cis = 10 4 (b) ↵ cis = 10 5 (c) ↵ cis = 10 7 (d) ↵ cis = 10 8 Supplementary Figure 11 cisThresh Performance of GNet-LMM for varying cis-thresholds on the mouse dataset. Shown are the paired p-values (log-transformed) between cis-LMM, GNet-LMM using the default p-value threshold (↵ cis = 10 6 ) and other ↵ cis threshold choices. The changing method is plotted on the x-axis, GNet-LMM on the y-axis in the top row, and cis-LMM on the y-axis in the bottom row. We chose ↵ cis = 10 6 since this roughly corresponds to the Bonferroni corrected p-value of 0.05. If a more stringent threshold is chosen (c,d), less anchors are determined leading to a small power loss. However, the di↵erences are negligible when comparing to a standard cis-LMM. Shown are the paired p-values (log-transformed) between the GNet-LMM using the standard threshold (↵ cis = 10 6 ) and other choices (a d). We chose ↵ cis = 10 6 since this roughly corresponds to the Bonferroni corrected p-value of 0.05. If a more stringent threshold is chosen (c,d), less anchors are determined leading to a small power loss. However, the di↵erences are negligible when comparing to a standard cis-LMM (e h). The changing method is plotted on the x-axis, GNet-LMM on the y-axis (top row), cis-LMM on the y-axis (bottom row).

11
Supplementary Figure 15 Performance of GNet-LMM for varying cis-thresholds on the mouse dataset.
Shown are the paired p-values (log-transformed) between the GNet-LMM using the standard threshold (↵ cis = 10 6 ) and other choices (a d). We chose ↵ cis = 10 6 since this roughly corresponds to the Bonferroni corrected p-value of 0.05. If a more stringent threshold is chosen (c,d), less anchors are determined leading to a small power loss. However, the di↵erences are negligible when comparing to a standard cis-LMM (e h). The changing method is plotted on the x-axis, GNet-LMM on the y-axis (top row), cis-LMM on the y-axis (bottom row).

11
(a) ↵ ind = 0.01 Supplementary Figure 12 indThresh Performance of GNet-LMM for varying ind -thresholds on mouse dataset. Shown are the paired p-values (log-transformed) between cis-LMM, GNet-LMM u the default p-value threshold (↵ ind = 0.1) and other ↵ ind threshold choices. The changing metho plotted on the x-axis, GNet-LMM on the y-axis in the top row, and cis-LMM on the y-axis in the bot row. If a more stringent threshold is chosen (d), less anchors are determined leading to a small po loss. However, the di↵erences are negligible when comparing to a standard cis-LMM. Suppl Shown (↵ cis = Bonfer determ standa row), c the default p-value threshold (↵ cis = 10 6 ) and other ↵ cis threshold choices. The changing method is plotted on the x-axis, GNet-LMM on the y-axis in the top row, and cis-LMM on the y-axis in the bottom row. We chose ↵ cis = 10 6 since this roughly corresponds to the Bonferroni corrected p-value of 0.05. If a more stringent threshold is chosen (c,d), less anchors are determined leading to a small power loss. However, the di↵erences are negligible when comparing to a standard cis-LMM. If a more stringent threshold is chosen (c,d), less anchors are determined leading to a small power loss.
However, the di↵erences are negligible when comparing to a standard cis-LMM. Supplementary Figure 11 cisThresh Performance of GNet-LMM for varying cis-thresholds on mouse dataset. Shown are the paired p-values (log-transformed) between cis-LMM, GNet-LMM u the default p-value threshold (↵ cis = 10 6 ) and other ↵ cis threshold choices. The changing metho plotted on the x-axis, GNet-LMM on the y-axis in the top row, and cis-LMM on the y-axis in the bot row. We chose ↵ cis = 10 6 since this roughly corresponds to the Bonferroni corrected p-value of 0 If a more stringent threshold is chosen (c,d), less anchors are determined leading to a small power l However, the di↵erences are negligible when comparing to a standard cis-LMM. Shown are the paired p-values (log-transformed) between the GNet-LMM using the standard threshold (↵ cis = 10 6 ) and other choices (a d). We chose ↵ cis = 10 6 since this roughly corresponds to the Bonferroni corrected p-value of 0.05. If a more stringent threshold is chosen (c,d), less anchors are determined leading to a small power loss. However, the di↵erences are negligible when comparing to a standard cis-LMM (e h). The changing method is plotted on the x-axis, GNet-LMM on the y-axis (top row), cis-LMM on the y-axis (bottom row).

11
Supplementary Figure 15 Performance of GNet-LMM for varying cis-thresholds on the mouse dataset.
Shown are the paired p-values (log-transformed) between the GNet-LMM using the standard threshold (↵ cis = 10 6 ) and other choices (a d). We chose ↵ cis = 10 6 since this roughly corresponds to the Bonferroni corrected p-value of 0.05. If a more stringent threshold is chosen (c,d), less anchors are determined leading to a small power loss. However, the di↵erences are negligible when comparing to a standard cis-LMM (e h). The changing method is plotted on the x-axis, GNet-LMM on the y-axis (top row), cis-LMM on the y-axis (bottom row).

11
(a) ↵ ind = 0.01 Supplementary Figure 12 indThresh Performance of GNet-LMM for varying ind -thresholds on mouse dataset. Shown are the paired p-values (log-transformed) between cis-LMM, GNet-LMM u the default p-value threshold (↵ ind = 0.1) and other ↵ ind threshold choices. The changing metho plotted on the x-axis, GNet-LMM on the y-axis in the top row, and cis-LMM on the y-axis in the bot row. If a more stringent threshold is chosen (d), less anchors are determined leading to a small po loss. However, the di↵erences are negligible when comparing to a standard cis-LMM.  Figure S22: Distribution of the number of genes in the conditioning set for the Cardiogenic dataset. Shown is the distribution of the number of genes per unique conditioning set on the Cardiogenics dataset. Most conditioning sets contain only a handful of genes, making the inference amenable for low-rank tricks.