Tejaas: reverse regression increases power for detecting trans-eQTLs

Trans-acting expression quantitative trait loci (trans-eQTLs) account for ≥70% expression heritability and could therefore facilitate uncovering mechanisms underlying the origination of complex diseases. Identifying trans-eQTLs is challenging because of small effect sizes, tissue specificity, and a severe multiple-testing burden. Tejaas predicts trans-eQTLs by performing L2-regularized “reverse” multiple regression of each SNP on all genes, aggregating evidence from many small trans-effects while being unaffected by the strong expression correlations. Combined with a novel unsupervised k-nearest neighbor method to remove confounders, Tejaas predicts 18851 unique trans-eQTLs across 49 tissues from GTEx. They are enriched in open chromatin, enhancers, and other regulatory regions. Many overlap with disease-associated SNPs, pointing to tissue-specific transcriptional regulation mechanisms. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-021-02361-8).


Contents
List of Tables S1 Summary of number of trans-eQTLs before and after cross-mappability lter . . 35 S2 GWAS enrichment of trans-eQTLs for di erent tissues and disease categories . . 41

Forward regression
A trans-eQTL is generally expected to in uence the expression levels of tens to hundreds of genes and we take advantage of this signature to increase the sensitivity to detect them. Brynedal et al. proposed a method called cross-phenotype meta-analysis (CPMA) to analyze the p-value distribution of pairwise associations of a candidate SNP with all measured genes [1]. We follow the same idea but use a di erent statistic for forward regression (FR) to nd the trans-eQTLs.

. Notation
We use bold capital cases for matrices. X is the × matrix of genotypes for SNPs and samples. Y is the × matrix of gene expression levels for genes and samples. For the FR analysis, both X and Y are centered and normalized. We use bold small cases for vectors. The rows of X and Y are denoted as x and y respectively for every SNP ∈ {1, . . . , } and for every gene ∈ {1, . . . , }. The columns of X and Y are denoted as x and y respectively for every sample ∈ {1, . . . , }. Both x and y are vectors of size , while x and y are of sizes and respectively.

. FR-score
For each SNP x , we calculated the p-values of association with y for all the ∈ {1, . . . , } genes independently. Under the null hypothesis that the SNP is not a trans-eQTL, these p-values will be independent and identically distributed (iid) with a uniform probability density function, If SNP is a trans-eQTL, then more genes than expected by chance will have low p-values for association with the SNP, leading to a higher density of p-values near zero. We de ned the FRscore (q fwd ) as a statistic that estimates the di erence between the observed p-value distribution from the data and the uniform distribution expected by chance. We sort the p-values in increasing order and the th smallest value is called the th order statistic, and is denoted as ( ) . If is uniformly distributed ( ∈ (0, 1)), then ( ) will be a Beta-distributed random variable, ( ) ∼ Beta ( , + 1 − ) . ( For any random variable ∼ Beta ( , ), where denotes the digamma function. Using the identities in Eq. 3 and noting that = and = + 1 − , we obtained the expectation of ln ( ) as, If the candidate SNP is a trans-eQTL and there is an enrichment of p-values near zero, then the observed values of ln ( ) will be lower than that expected by chance when is small. Therefore, the cumulative sum of E ln ( ) − ln ( ) over will increase monotonically, pass through a maximum and then decrease to an asymptotic value of zero. Hence, we de ned the FR-score as, Intuitively, if the candidate SNP is not a trans-eQTL, =1 E ln ( ) − ln ( ) will uctuate around zero. Hence, q fwd will remain close to 0 for these SNPs. However, a trans-eQTL will be associated with many genes and the FR-score will be high (q fwd 0) because there will be many genes with a lower p-value than expected by chance. Of course, there will be other genes which are not associated with the trans-eQTL and the p-values for association with these genes will not contribute to the q fwd . Therefore, it would be su cient to calculate the q fwd from the rst genes with lowest p-values, instead of all genes: .

Null model and p-values for FR-score
We need to de ne a null distribution to evaluate the signi cance of any observed q fwd . Let q null fwd be an iid sample from the null distribution. Since it is analytically intractable to de ne a probability density of q null fwd , we de ne an empirical null distribution. The empirical cumulative distribution function (ECDF) of q null fwd can be used to ascertain signi cance of any observed q fwd to obtain a p-value, denoted as q fwd , One possible way to obtain a null model is by calculating q null fwd for a large number of null SNPs assuming they are not trans-eQTLs. Following our hypothesis in Eq. (1), the p-values for association of each of these null SNPs with the genes can be sampled each time from Unif (0, 1) distribution. However, the iid assumption of p-values breaks down in real data because the gene expressions are strongly correlated. Therefore, the p-values for association between the SNP and the genes are strongly correlated and cannot be sampled from the Unif (0, 1) distribution.
To account for the correlation, we randomized the sample labels of X by permuting the columns of the real genotype matrix -thereby removing any association with the gene expression Y but retaining the correlation between the gene expression levels. We then calculated q null fwd for all SNPs using the 'permuted' genotype and the gene expression. We used this empirical null distribution to calculate q fwd from Eq. 7 .
The estimate of q fwd gets increasingly noisy with higher values of q fwd (low p-values) because of lack of sampling points in that region. In that regime it is better to rely on a parametric t of the exponentially falling part of the cumulative distribution. We sorted the q null fwd in an increasing order. We set the limit beyond which we will use the exponential extrapolation at top = min{100, 0.1 }, where is the total number of SNPs being used for the calibration of the null model. This ensures that the extrapolation is applied at most to the top 10% of points, and, if more data are available, only to the top 100. Let us denote the q null fwd at this cut-o point as . For any observed q fwd ≥ , we use a maximum-likelihood estimate based on the top data points and is given by, where null jpa, is the th value in the ordered sequence of q null fwd . Note that for q fwd = this equation yields = top / , the same as the empirical value.

. Comparison with CPMA
Brynedal et al.used a similar idea for nding trans-eQTLs by testing each SNP for enrichment of association − ln( ) values across all genes with a null hypothesis that − ln( ) should be exponentially distributed, with a decay parameter = 1. Under the joint alternative hypothesis, a subset of association statistics are non null and ≠ 1. They compared the evidence for these hypotheses as a likelihood ratio test for CPMA, with the statistic CPMA de ned as, whereˆ is the observed exponential parameter from the data. They accounted for the extensive correlation among the gene expression levels by testing the signi cance empirically using a simulated gene expression matrix with the same covariance as the real gene expression. To de ne high-con dence trans-eQTLs, they combined empirical CPMA statistic from multiple data sets using sample size weighted meta-analysis. FR measures the signi cance of the enrichment in the p-value distribution near zero, and CPMA measures the signi cance of the enrichment in the − ln( ) distribution. Theoretically, both should give similar ranking of SNPs for being trans-eQTLs. Since there are no currently available software for CPMA, we implemented FR-score in Tejaas and compared the performance with RR-score.

Reverse regression
In this section, we discuss the reverse regression (RR) model and introduce the RR-score, denoted as q rev for nding the trans-eQTLs. We also describe a null model and explain how to obtain signi cance p-values for the q rev of a candidate SNP, denoted as q rev . We also discuss the numerical implementation of RR-score in Tejaas and several other options used in Tejaas.

. Notation
We use the same notations from Sec. 1.1. The genotype vector x for every th SNP is centered but not normalized to allow for null model calculation. Y is centered and normalized.

. Model description
We model x with a univariate normal distribution whose mean depends linearly on the gene expression through a column vector of regression coe cients (∈ R ), where the variance of the th SNP is given as 2 . For the rest of the manuscript, we drop the subscript for ease of reading although we note that all calculations are done for the th SNP. The log likelihood of this regression task is where is the minor allele count of the th SNP in th sample. The number of samples will usually be on the order of a hundred to a few thousand, much smaller than the number of explanatory variables ≈ 20 000. Therefore, simple maximization of the likelihood would lead to a dramatically overtrained which would perfectly predict x on the training data but which would achieve very bad performance on unseen data. The solution is to de ne a prior on . We assume that every th element ( ∈ {1, . . . , }) of is sampled from a normal distribution with variance 2 , ∼ N | 0, 2 (13) and maximize the log posterior probability, .

Bayesian model comparison
Let H 1 be the trans-eQTL model which allows ≠ 0 and H 0 be the null model for which = 0. According to Bayes' theorem, The probability for the model H 1 is a monotonically increasing function of the likelihood ratio, where the second line is obtained by using the model de ned in Eq. 11 and the prior for de ned in Eq. 13. We then de ned := YY T + 2 / 2 I and used the technique of quadratic complementation to obtain, Using Eqs. 15 and 17, we obtained the probability of the trans-eQTL model, Thus, it is possible to obtain the probability of each SNP being a trans-eQTL but the calculation is computationally expensive and requires the prior probabilities (H 0 ) and (H 1 ). These prior probabilities can be set to a constant.

. RR-score: Definition
Motivated by the probability obtained in Eq. 18, we de ned our test statistic RR-score, denoted q rev , as follows: Suppose we have obtained a score q rev for the th SNP and we would like to know how signi cant this score is. In order to rank it with respect to all the other SNPs in the genome, we need to calculate a null distribution of q null rev and from it the p-value of our observed score q rev . Note that the statistic q rev is not proportional to the probability given by Eq. 18 and hence cannot be directly used for ranking the SNPs. However, in Sec. 2.6, we derive a p-value to ascertain the signi cance of the q rev of each SNP.

. Numerical calculation
The RR-score de ned in Eq. (19) involves computing the inverse of a × matrix W. To reduce the complexity, we rst perform a singular value decomposition of Y T , with two orthogonal matrices, U ∈ R × and V ∈ R × , and a diagonal matrix S ∈ R × that whose all elements are zero except for the = min{ , } singular values on its diagonal. Substituting the SVD into W gives This allows us to expand the matrix W using the singular values and the eigenvectors u , Therefore, we can write the RR-score as

.6 Null model
To obtain a p-value for the q rev score of SNP with column x in the genotype matrix X, we need a null distribution of q rev scores. The p-value is then the probability mass of the null distribution with values higher than the actually obtained q rev score. We can obtain such null distribution by permuting the genotype entries in x while keeping the rows of the gene expression matrix Y unpermuted. The distribution of the resulting q null rev scores can di er between SNPs depending on their minor allele frequency (MAF) and the variance of the genotype ( 2 ).
In principle we could derive the distribution of q null rev empirically by permuting the elements of x a large number of times. However, to obtain p-values of 5 × 10 −8 or below, corresponding to genome-wide signi cance, we would need to draw at least 2 × 10 7 permuted samples and compute their q rev scores. This would be too time consuming. Instead, we derive in Appendix 1 analytical expressions for the expectation value := q null rev and variance 2 := Var q null rev of q null rev = x T Wx under the permutation null model for any symmetric matrix W and any centered vector x. In Fig. S1a and b, we show that our analytical calculation of and match those obtained from the empirical permutation of x. With this empirical permutation, we could verify a p-value up to 1 × 10 −12 is accurately approximated by our normal approximation. In our subsequent analyses of GTEx, we found that 2781 association tests (out of 49 × 8048655 association tests) were beyond the empirically veri ed range.
Normal approximation. Under the null model assumption that the SNP is not a trans-eQTL, the u T x are distributed as the projections of a unit vector with random direction onto Cartesian coordinates given by the eigenvectors u . The u T x will be normally distributed and u T x 2 will have a chi-square distribution. In the limit of 1, the sum over 2 / 2 + 2 / 2 u T x 2 will have a normal distribution according to the central limit theorem. Therefore, we approximate q null rev by N , 2 . In Fig. S1c, we show the QQ plot which numerically validates the approximation. Finally, the p-value of q rev for a candidate SNP is where Φ( ) denotes the cumulative normal distribution for a random variable .
Constraint on covariate correction. Generally, the gene expression matrix Y should have rank (Y) = = min{ , }. This means that the SVD in Eq. (21) should have singular values .
To obtain q rev , the sum in Eq. (24) runs over components of u T x 2 . If known covariates are now corrected from Y using linear regression (see CCLM below, Sec. 3.1), then of these singular values become zero for the residual expression Y . This is equivalent to subtracting components from the sum in Eq. (24). When the subtracted part is not normally distributed, as would happen if is too small for the central limit theorem to be valid, then q null rev is not normally distributed and the p-values are not accurate. If is large, then the subtracted part and consequently, q null rev , remains normally distributed. In practical situations, only a few known covariates are used to correct the gene expression and in this limit of small , the null model is not Gaussian.
In Fig. S2, we show that q null rev is normally distributed for a given SNP and a gene expression matrix with = 581 (from adipose subcutaneous tissue of GTEx). Setting the rst seven singular (c) QQ plot of the empirical null distribution of q rev for four representative SNPs with minor allele frequencies of . , . , . and . . On the x-axis, we plot the theoretical quantiles of N ( , ) and on the y-axis, we plot the observed quantiles of (q revq ) / q . With a sampling size of e empirical permutations, the maximum observed quantiles correspond to p-values of . e-, 6. e-, . e-and . -respectively for the four SNPs.
For all plots in this figure, we used the gene expression of adipose subcutaneous tissue from GTEx. For all q rev , we used = . .
values to zero (equivalent to correcting seven covariates) breaks the normal distribution of q null rev . However, if 200 singular values are set to zero, then the assumption of normal distribution of q null rev becomes valid again. This illustrates the problem of using CCLM with few confounders in combination with Tejaas.

. Comparison of RR-score with sequence kernel association test (SKAT)
The aggregation of weak signals from many covariates in the RR-score of Tejaas is reminiscent of methods developed for nding rare genetic variants associated with a trait that aggregate the e ects of many rare variants in an entire locus or gene, such as the burden test [2] or the sequence kernel association test (SKAT) [3,4]. Their test statistics take a similar form to that of Tejaas. All three can be written as (20)), R = 11 T for the burden test and R = I for SKATor, in the weighted version with diagonal weight matrix Π = diag( 1 , . . . , ), R = Π 11 T Π and R = Π 2 , respectively (where 1 T (1, . . . , 1)). For → 0, RR-score of Tejaas tends towards the unweighted SKAT statistic. However, burden and SKAT are classical hypothesis tests, testing whether to reject the null hypothesis = 0. In contrast, Tejaas uses Bayesian model comparison, comparing the null model = 0 with the alternative trans-eQTL model ≠ 0 (Eq. (15)) while integrating out the unknown e ect strengths (Eq. (17)). Since the trans-eQTL model includes the expected scale of e ect sizes via its prior ( ) = N | 0, 2 , the test statistic of Tejaas depends on this scale. As can be seen in Eq. (23), the multiplication with R leads to a saturation of the e ects of principle components of Y with singular values larger than / .
Furthermore, whereas the burden test and SKAT commonly assume the response variable under the null hypothesis to be identically and independently normally distributed, this assumption would be unsuitable for centered minor allele frequencies . Instead, Tejaas assumes that patient indices are randomly permuted under the null model (Sec. 2.6).
Whereas the SKAT statistic was devised because it was deemed useful to distinguish positives from negatives while being easy to compute, we used the Bayesian approach to derive a test statistic that should be reasonably good in distinguishing positives from negatives. Indeed, despite the similarity with the SKAT statistic, the equivalent TEJAAS kernel contains a correction R for the correlation between covariates, which does not appear in SKAT and related methods. This correction has its origin in the non-null model, for which regression coe cients deviating substantially from 0 are considered.
Once derived, we strove to make the TEJAAS statistic fast to compute by means of analytically calculating the rst and second moments (Appendix 1). Its time complexity is dominated by the inversion of the matrix 2 / 2 I + Y T Y, which takes ( min{ , }), while p-values are computed in ( 2 ) (Appendix 1). In comparison, SKAT scores are computed in ( ), while the computation of the p-values require the inversion of an × matrix, taking time ( 3 ).

.8 Choice of the regularizer
The standard deviation of the normal prior is not learned from the data, but is chosen empirically. It is important to choose such that it avoids over tting and also restricting the regression coe cients too much.. In the limit of large , the normal approximation to q rev will break down. To see this, we rst observe that from Eq. (24) Therefore, as 2 / 2 becomes smaller than most squared singular values { 2 : 1 ≤ ≤ }, q rev will approach 1. The distribution of q rev under the permutation null model will therefore become extremely skewed with an abrupt drop towards 1. Such a distribution will not be approximated well by a Gaussian, but luckily this situation corresponds to the irrelevant regime of extreme over tting. This can be seen by noting that the prediction of x at the maximum likelihood solution which means that the regression results in a perfect prediction on the training samples, in other words, it results in complete over tting.
At the other extreme, when gets so small that 2 / 2 max{ 2 : 1 ≤ ≤ }, we can approximate where is the × sample covariance matrix. Since our gene expression matrix Y is normalized, the diagonal elements of should be equal to 1, whereas the o -diagonal elements will be more or less randomly distributed around 0. Intuitively, we expect that in this regime q rev will be will be too restrictive for the model and lead to false signals even with chance correlations of a single gene with a genotype.
Non-Gaussian parameter. From Eq. (25), the q null rev should have a normal distribution and the standardized q null rev should have a standard normal distribution, The kurtosis of a standard normal distribution is 3. We use this property to de ne a non-Gaussian parameter ( ) to measure the deviation of the distribution from a standard normal distribution, where ( ; ) = q null rev − / . Given a gene expression matrix, we simulated genotypes for 5000 SNPs and calculated ( ) for di erent values of . We recommend choosing opt such that opt ≥ arg min ( ). Ideally, we also want a high value for the standard deviation of to ensure a broad distribution of q null rev .

. Target genes
There are two separate tasks in the analysis of trans-eQTLs: (i) identifying the trans-eQTL SNPs, and (ii) identifying their target genes. However, the second task is subordinate in the sense that if trans eQTLs cannot be predicted correctly there is no use in predicting target genes. Conversely, if the trans-eQTLs can be predicted correctly then some complementary method can be used for predicting their target genes. Reverse Regression o ers evidence for the presence of a trans-eQTL without identifying which genes are targeted. Hence, we do not get the set of target genes from Reverse Regression. In fact, multiple regression with 2 penalty (ridge regression) is not optimal for variable selection.
Therefore, in Tejaas, we included a single SNP-gene pairwise regression to nd a set of candidate genes as targets for each predicted trans-eQTL. Under the hood, our software performs the following tasks: • perform reverse regression to rank all trans-eQTL SNPs and select the top trans-eQTLs below a cuto (default 0.001), can be speci ed by --psnpthres.
• perform single SNP-gene linear regression for all preselected trans-eQTL SNPs and estimate the Benjamini-Hochberg (BH) false discovery rate for the association of all genes with the candidate trans-eQTL SNP.
The output of Tejaas contains both the trans-eQTL SNPs ranked by p-values and all SNP-gene pairs below a p-value cuto ( --pgenethres, default 0.01) with their corresponding BH adjusted p-values at 50% FDR (--fdrgenethres).
As discussed below (Sec. 3), reverse regression works better with KNN correction while pairwise regression (for target gene discovery) works better with conventional covariate-corrected gene expression. To get the best of both worlds, our software accepts two separate input options for gene expression les: (1) --gx to provide the raw gene expression le for KNN correction and trans-eQTL discovery, and (2) --gxcorr to provide the covariate-corrected gene expression le for single SNP-gene pair regression to discover target genes. .

Cis-masking
In Tejaas, we are interested to nd long-range SNP-gene interactions. However, the strong e ect size of the cis-eQTLs can sometimes lead to high q rev . To avoid wrongly identifying cis-eQTLs as trans-eQTLs, we introduced an option in our software to exclude all genes in the vicinity of each SNP. We call this procedure 'cis-masking'. It can be invoked with the ag --cismask. The width of the vicinity can be speci ed using the option --window.
To implement cis-masking, we calculated the SVD of the expression matrix (see Eq. (21)) after removing (masking out) the genes that occur within ±1Mb (or any distance speci ed by --window) from the SNP position. Doing this for every SNP would be computationally expensive as it would involve ∼ 10 7 matrix inversions (once for each SVD). But since many SNPs generally have identical genes in the vicinity, we group them and calculate the SVD once for each group, which brings down the number of SVD calculations to ∼ 10 4 .

. Computational requirements
To determine run times and memory requirements of Tejaas, we used the adipose subcutaneous tissue expression from GTEx, which contains a total of 15673 genes for 581 samples. We used the GTEx genotype for chromosome 1. All tests were run on a node with 2× Intel Xeon CPU E5-2640 v3 @ 2.60GHz with 8 physical cores each and 128Gb of RAM. We subsampled the expression matrix as well as the number of SNPs used to run Tejaas to evaluate di erent possible scenarios.
The most expensive step for Tejaas is the SVD decomposition of the gene expression matrix. The number of times the SVD decomposition is performed depends directly on the number of SNPs selected. While using --cis-masking (Sec. 2.10), di erent SNPs will have di erent cis-genes that are required to be masked out and hence, a new SVD decomposition is required. Run times increase with the number of SNPs (Fig. S3a, left panel), since the number of cismasks needed (top x-axis) is proportional to the number of SNPs (bottom x-axis). Memory usage increases with the number of SNPs and samples, since the dosage matrix is being held in memory.
We implemented Tejaas with an option to run Tejaas on a multicore server using MPI parallelization. Fig. S3b shows how run times decrease with the number of cores used. Total memory usage, as expected, increases with the number of cores. In the designed parallelization scheme, each core will compute the SVD decomposition for a given cismask. With an increase in the number of cores, the expression matrix needs to be copied to each of them for processing and hence the memory usage scales proportionally.

Confounder correction
Gene expression measurements are notorious for being dominated by strong confounding e ects that can arise from technical details of RNA recovery, conservation, and sequencing or from environmental and biological factors such as age, gender, diseases, nutrition and lifestyle, drug regimes etc. The subtle e ects of trans-eQTLs are at risk of being drowned out by the strong systematic noise from confounding e ects.
In this section, we rst discuss the standard confounder correction method for eQTL analyses. Then, we introduce our K-nearest neighbor correction that is specially suitable for use with Tejaas.

. Residuals from multiple linear regression
Given a matrix C ∈ R × of covariates and samples, a multiple linear regression is performed for the th gene expression y , where ∈ R is a vector of e ect sizes of the covariates for gene . The estimate of the e ect sizes from the multiple linear regression is then used to calculate the expression residuals y = y − C which are linearly uncorrelated with the covariates. The residual y is used as the corrected gene expression for downstream analyses. In this manuscript, we denote this correction as 'CCLM'. This is a simple and e ective method to correct gene expression confounders and is routinely used to discover biologically signi cant eQTLs. CCLM can be used for known covariates such as age, gender, etc. as well as other hidden covariates inferred using PEER [5] or other similar methods. However, if rank (Y) = , then it can be shown that for the residual matrix, rank (Y ) ≤ − . Therefore, as noted in Sec. 2.6, this may compromise the normality of the distribution of q null rev . Hence, the CCLM correction is problematic to use with Tejaas.

. K-nearest neighbor (KNN) correction
For the KNN correction, we assume that confounding e ects dominate the gene expression. We expect that the expression levels of the genes of nearest neighbors (NN ) of each sample will be a ected by the same dominant confounder variables. In other words, if the samples are close to one another in the expression space, we expect them to be close to one another in the confounder space. Hence, we can correct at least a good part of the confounding e ects by centering the expression y and genotype x of sample using the samples whose gene expressions are the most similar to that of sample , The nearest neighbors NN are calculated using the denoised Euclidean distances between gene expression vectors. To reduce noise in the distance calculation, we remove all but the leading principle components of the singular value decomposition of the expression matrix Y. After subtracting the mean gene expressions and genotype vectors of the nearest neighbors, we center the resulting corrected gene expression and genotype vectors again. Since KNN does not reduce the rank of the gene expression matrix, it works well with Tejaas. As shown in the QQ plots of Fig. S4, the empirical null distribution indeed follows the normal distribution. The choice of should be such that it captures the locally varying e ects of the confounders. As shown later in Fig. S10, too small a value of would lead to excessive statistical noise, while increasing too much will remove too little of the confounding e ects as these start to average out within the neighbors of each data point.
In Fig. S5, we show how KNN correction removes the strong dissimilarity between samples in the data. Before KNN correction (Fig. S5a), the samples are strongly dissimilar. We could observe several sample groups which are neighbors in the expression space, probably due to external confounders. Although the samples were clustered in the expression space, we observed distinctive patches in the genotype space. This happens because several confounders (such as population substructure) a ect both genotype and gene expression, leading to unwanted correlation between the rst principal component of the genotype and the expression levels. We could correct for these confounders by centering both x and y in the KNN method (Fig. S5b).
The KNN correction has several bene ts in comparison to CCLM and other linear corrections. First, it can correct out non-linear confounding e ects, so, in contrast to the CCLM correction, it should also work when the confounding e ects are not well approximated by linear, additive e ects. Second, since it is non-parametric (except for ), over tting does not typically occur [6,7]. Third and most importantly, it does not require the confounders to be known. Therefore, it can be used for removing hidden confounders, which is often the case with real data.

Trans-eQTL simulation
This section describes details of a simulation study that evaluates the power of our method to nd trans e ects in gene expression data. We followed the strategy of Hore et al. for the simulation, as described in the Supplementary section of their article [8]. We discuss their simulation details below, partly verbatim. Any di erence with their strategy is explicitly noted. We also compare the

. Data simulation
Simulated data consisted of genotype and gene expression for = 450 individuals to closely resemble the sample size of the Genotype Tissue Expression (GTEx) project [9][10][11]. We simulated the gene expression data for = 12 639 genes, containing non-genetic signals (background correlation estimated from the GTEx tissues and confounding factors) and genetic signals (cis and trans e ects). Following the strategy of Hore et al. [8], we assumed that each gene contained only one SNP; this simpli ed case is equivalent to assuming that there is at most one cis-eQTL for each gene.

. . Genotypes
We used the real genotype data from the GTEx individuals, while Hore et al.used simulated genotypes. After pre-ltering of the GTEx genotype, we randomly sampled = 12 639 SNPs (corresponding to = 12 639 genes) from the genotype data of 450 individuals. From the SNPs sampled, we randomly selected cis = 800 SNPs to be cis-eQTLs. From these cis-eQTLs, we selected a subset trans = 30 SNPs to be trans-eQTLs. These trans-eQTLs were originally cis-eQTLs associated with a nearby gene, which in turn regulated the expression of other distant genes. All SNPs were sampled independently from the genotype. Let X ∈ R × be the sampled matrix of genotypes.

. . Gene expression
Let Y ∈ R × be the matrix of simulated gene expression data. The data was simulated in stages. First, we generated the non-genetic component of all genes using background correlation (noise) and confounding factors. Second, we added the cis e ects and nally, we added the trans component on the target genes of the trans-eQTLs.
Noise. Hore et al.used heteroscedastic background noise. However, to replicate the underlying correlation of the gene expressions in the GTEx samples, we created a Gaussian noise with a covariance matrix obtained from the gene expressions in the artery aorta tissue of GTEx. Let Y as be the observed gene expression in GTEx samples. We decomposed the covariance matrix of Y as such that Cov We de ned the noise as where Z is a × matrix, with every row sampled from a normal distribution with zero mean and unit variance. Note that Cov (Y noise ) = Cov (Y as ) and therefore Y noise retains the background correlation structure of the adipose subcutaneous tissue.
Confounding factors. We generated = 10 confounding factors, of which pop = 3 were considered to be due to population substructure. Let W be the × matrix of confounding factors. The three population substructure confounders (W pop ) were set to be the rst, second and third principal components of the corresponding genotype matrix. The remaining ind = − pop independent confounders (W ind ) were sampled from a standard normal distribution, Each confounding factor was assumed to be a ecting only a few target genes. Hence, the e ect size for the th confounding factor on the th gene was sampled with a sparsity , to give us the coe cient matrix of size × . We used = 1.0 unless otherwise mentioned. Finally, we obtained the e ect of confounding factors on the gene expression using an additive model, Cis e ects. In this simulation framework, every gene spatially overlaps with a corresponding SNP in the genotype, i.e., we have (= ) SNP-gene pairs. If the SNP in the th position has a cis e ect ( ∈ cis ), then the expression of the gene is modi ed with a strength , where the direction of the e ect size is random, i.e., sgn ( ) is assigned a value of −1 or +1 randomly. The strength of the cis e ect, i.e., | | depends on whether the SNP also has a trans e ect or not and is given by if SNP also acts as a trans-eQTL (∈ trans ) Gamma (4, 0.1) , otherwise.
For all the remaining SNP-gene pairs ( ∉ cis ) with no cis e ects, Y cis = 0.
Combining noise, confounding factors and cis e ects. We obtained a temporary set of expression levels for all genes by additively combining the noise, the e ect of confounding factors and the e ect of cis-eQTLs, Note that the Y cis has non-zero values only for the targets of cis = 800 cis-eQTLs, and has zero values for all remaining genes.
Trans e ects. For simulating the trans e ects, Hore et al.assumed that the trans-eQTLs regulated a nearby gene (via cis e ect) and this cis target gene was a transcription factor (TF) and regulated multiple target genes downstream (excluding other TF genes; could be any other gene including other cis target genes). This ensured that the trans-eQTLs were indirectly associated with the target genes with practically low e ect sizes. Let trans be the number of target genes regulated by the TF and the target genes of every TF were selected at random. Let be the set of TFs which regulate the expression of the th gene and the trans e ect on this gene was simulated as, ∼ Gamma trans , 0.02 .
was the relative e ect of TF on the gene . If a gene was not a target for any of the trans = 30 TFs, then Y trans = 0.
Combining all of the data. Finally, the contributions from the trans-eQTLs were incorporated to create a nal set of simulated expression levels, and was used for subsequent analyses.
. . Choice of simulation parameters As described above, there are several parameters which can be tuned in the simulation to generate a wide range of confounding e ects, cis e ects and trans e ects. Three parameters determine the strength of a simulated trans-e ect: 1. E ect of the cis-eQTL on the TF ( for ∈ cis ).
2. E ect of the TF on the target genes. This is determined by trans and the mean e ect of the TFs on the target genes is given by = 0.02 trans 3. The number of target genes ( trans ).
Hore et al. [8] selected these parameters with reference to the KLF14 trans signal [12] so that the signal strengths were similar to the real data. For our simulation, we chose the same values for these parameters, i.e., = 0.6 for ∈ cis , trans = 20 and trans = 150. We also explored several other choices of parameters by reducing the strength of trans e ects, particularly by tuning trans ( = 5, 10 and 20) and trans (= 50 and 100) as shown in Fig. 2 of the main manuscript. In order to get an idea of how realistic the simulations are, we looked at the e ect sizes of the trans-eQTLs and the estimated expression heritabilities contributed by the trans-eQTLs.
E ect sizes. The expression of the th gene is given by Hence the e ect of an individual SNP on gene in an additive model, will be given by = and all other terms will be included in the error term . Simple linear regression (methods like MatrixEQTL) calculates an estimate for the coe cient. The standard error for the estimated coe cient will be given by where the residual can be obtained as 2 = Var Y − 2 Var X . For comparison across multiple SNPs and studies the e ect size is scaled by the standard error to obtain the score, Since we planted the trans-eQTLs in simulations, we know the "true" values of the coe cients, i.e., = , and can calculate the scores for all SNP-gene pairs where > 0. We show the distribution of the | | scores in Fig. S6a and compare it with the | | scores of the most signi cant eQTLs predicted by GTEx [13]. It should be noted that the predicted cis-eQTLs and trans-eQTLs from GTEx do not represent the background distribution because there could be many other cis or trans-eQTLs which might have not been predicted. However, the distributions give an idea of the e ect sizes that are required to be considered signi cant by simple regression. The distribution of cis-eQTL | | scores overlap with that of predicted cis-eQTLs in GTEx, indicating that the cis-eQTL e ect sizes are realistic and are expected to be predicted by standard eQTL mapping. The e ect size of the simulated trans-eQTLs are much weaker compared to the 156 signi cant trans-eQTL and target gene pairs discovered by the GTEx consortium. Consistent with our expectations, (i) the trans-eQTLs have weaker e ect sizes compared to cis-eQTLs and (ii) they cannot be discovered by standard eQTL mapping with the given sample size. However, it remains unknown how much weaker the trans-eQTLs really are. Heritability. The expression heritability (narrow-sense) and their cis and trans components for the th gene can be calculated as where the denominator Var Y depends non-trivially on the correlated noise and other confounding e ects: where we have used We can assume that the last term Cov Y cis , Y trans = 0 because if gene is targeted by both a cis-eQTL (X ) and a trans-eQTL (X , ∈ ), then the trans-eQTL must be distally separated from the cis-eQTL and therefore, probably not correlated so that Cov X , X = 0. It follows that the last term in the numerator of Eq. (53) can also be assumed to be zero. To obtain the cis and trans heritability, we empirically calculated Var Y from the simulated expression data and calculated the numerators in Eq. (52) and (53) from the input e ect sizes and sampled genotypes for the simulation. The mean cis and trans heritability over all genes can be obtained as, where cis are the genes targeted by the cis-eQTLs and trans are the genes targeted by the trans-eQTLs. The proportion of trans-heritability is given by The mean cis heritability across all simulations was found to be ℎ 2 cis = 3.8 × 10 −4 . The distribution of trans over the 20 simulation replicates for each set of simulation parameters used in Fig. 2 (main text) is shown in Fig. S6b. In real data, trans is expected to be ≥ 0.7 from recent estimates [14] and the proportion of trans-heritability in our simulations is lower than these recent estimates. Since the trans e ect sizes used in the simulations are close to our expectation in real data, the number of TFs and / or target genes used in our simulation are possibly lower than what would be expected in real data. With fewer target genes it is harder to predict the trans-eQTLs using FR and RR of Tejaas, while simple regression methods like MatrixEQTL does not depend on the number of target genes.
Judging by the e ect sizes and the heritability estimates, the parameters in our simulations represent a conservative estimate of our expectations in reality.

. ROC pAUC analysis
We ranked the trans-eQTL predictions from each tool by descending score and computed the cumulative number of true and false positives (TP, FP) up to each score. We obtained the true positive rate (TPR) as the proportion of positives (with respect to total ground truth positives) correctly identi ed at that threshold; and the false positive rate (FPR) as the proportion of false positives (with respect to total ground truth negatives) identi ed at that threshold. This is often summarized as, where FN and TN denotes the false negative and the true negative respectively. The Receiver Operating Characteristic (ROC) curve shows the TPR against FPR at various thresholds (Fig. S7).
Traditionally, ranking performance is measured by the full area under the ROC curve (AUC). However, AUC summarizes the entire ROC curve, including regions that are not relevant to predicting trans-eQTLs (e.g., regions with low levels of speci city). In order to alleviate this de ciency while bene ting from some of the advantageous properties of the AUC, we used a partial area under the ROC curve (pAUC) which summarizes a portion of the curve over the pre-speci ed range of interest, FPR ≤ 0.1 (Fig. S7). Specifying the range of interest as FPR ≤ 0.1 is arbitrary. To give a concrete example, we have 12639 SNPs of which 30 are trans-eQTLs. FPR = 0.1 corresponds to a threshold at which top SNPs are selected such that there are 1261 false positives (10% of total 12609 true negatives) among those selected SNPs. For the three methods compared here, suppose the number of predicted trans-eQTLs are Tejaas , FR and MEQTL for FPR=0.1. The higher the number of true positives at that speci city ( = 1 -FPR = 0.9), the better is the method. However, the threshold p-value or FDR required for choosing Tejaas , FR and MEQTL are di erent.

. Methods compared
We compared Tejaas RR-score with two methods: (1) MatrixEQTL [15] as a representative for current standard eQTL pipelines, and (2) Tejaas FR-score, as an alternative for CPMA [1] which uses the same underlying property of trans-eQTLs that they target multiple genes simultaneously.
MatrixEQTL. We used the R package for MatrixEQTL with default options. The covariate correction was done separately but used the same linear model correction implemented in Matrix-EQTL. Since we were not interested in the cis-eQTLs, we used a cis-window of 0Kb.

FR (∼CPMA).
Since CPMA is not implemented as a software, we implemented the FR-score (q fwd ) within Tejaas as an alternative. Internally, the method runs over the data twice: (1) First, create an empirical null model from the data and (2) then calculate q rev as well as an empirical p-value using the previously generated null model. This is the same procedure as CPMA, as described in their manuscript [1].

. Supplementary simulation results
Selecting the regularizer. We plotted the non-Gaussian parameter ( ) and the standard deviation of at di erent values of in Fig. S8a. As noted above in Sec. 2.8, the best choice of should be greater than arg min ( ). Ideally, we also want a high value for the standard deviation of to ensure a broad distribution of q null rev . Therefore, we chose = 0.2. To validate our choice, we looked at the ranking of the trans-eQTLs at di erent values of in Fig. S8b. Each ROC curve was obtained by averaging over 20 simulations using default simulation parameters (see above) and KNN correction with 30 neighbors. In the inset, we show the partial areas under the ROC curves (pAUC) where the false positive rate (FPR) ≤ 0.1. We found that Tejaas works best at this choice of . The rest of the simulations were performed using = 0.2.
Note on inverse normal transform. Current GTEx pipeline uses a two-step approach for normalizing the read counts of the genes: (1) read counts are normalized using TMM (Trimmed Mean of M-values [16]) and ltered for thresholds, (2) expression values for each gene are then inverse normal transformed across samples. The gene expression generated in our simulations are not read counts, but equivalent to the TMM values. Hence we skipped the rst step. We did perform the second step of inverse normal transformation before the confounder corrections.
As expected, we found that CCLM correction (Sec. 3.1) bene ts signi cantly by using the inverse normal transformation (not shown). For the KNN correction (Sec. 3.2), however, we found that the inverse normal transformation reduces the accuracy of Tejaas (Fig. S9). It might be because the  neighbor information gets skewed with this transformation. Hence, we performed KNN correction without the inverse normal transformation both for the simulations and the GTEx application.
Number of neighbors for KNN correction. Another important aspect of Tejaas is to choose the number of neighbors for the unsupervised KNN correction. To ascertain the robustness of the KNN correction, we performed simulations with di erent number of confounding factors, = 10, 20 and 30. We then compared the ranking of Tejaas using di erent number of nearest neighbors, as shown in Fig. S10. The rankings were compared using the pAUC where the false positive rate (FPR) ≤ 0.1. We found that the choice of KNN neighbors does not depend on the number of confounding factors, and we obtained the best accuracy with 30 nearest neighbors.

Ranking of trans-eQTLs.
Complementary to the partial area under ROC curves (Fig. 2 of main text), we also show the proportion of true positives (on average) included in the top ranked trans-eQTL SNPs in Fig. S12.
Target gene discovery. We investigated whether the increased accuracy of Tejaas to detect trans-eQTL SNPs also increases the accuracy of predicting trans target genes. As discussed in Sec. 2.9, Tejaas calculates single SNP-gene pairwise regression to nd the set of candidate target genes. However, unlike MatrixEQTL, the trans-eQTL SNPs are preselected using reverse regression. The simulations provide us the ground truth to compare the performance of the di erent methods in predicting correct SNP-gene pairs. The SNP-gene pairs reported by Tejaas, FR and MatrixEQTL were ranked by their p-values. Each true positive is a correctly predicted pair of trans-eQTL SNP and target gene, each false positive is a wrongly predicted pair. From the ROC curve, we calculated the pAUC up to a FPR of 0.1. The mean pAUC averaged over 20 simulations for each method and di erent simulation settings are shown in Fig. S13. For Tejaas and FR, we can limit the number of SNP-gene pairs to test by selecting trans-eQTL SNPs with RR or FR ≤ cuto . We tested three increasingly signi cant p-value cuto s, namely 1 × 10 −2 , 1 × 10 −3 and 1 × 10 −4 , and evaluated the sensitivity and speci city to discover the simulated true SNP-gene pairs. MatrixEQTL only tests single SNP-gene pair associations and therefore, we cannot perform the pre-ltering step.

Methods
Tejaas achieves higher accuracy for nding true SNP-gene pairs compared to FR and Matrix-EQTL over a wide range of simulation settings. All methods use the same association test for predicting SNP-gene pairs and the only di erence between the three methods is the preselection of trans-eQTL SNPs. Hence, the improvement of Tejaas over MatrixEQTL must be due to accurate preselection of trans-eQTL SNPs (see Fig. 2 of main text). MatrixEQTL needs to test a large number of comparisons ( × ), and cannot bene t from the preselection step. FR performs poorly due to its poor sensitivity to predict trans-eQTLs in the preselection step (see Fig. 2 of main text). This analysis quantitatively shows that an increased accuracy to detect trans-eQTL SNPs can improve the prediction of trans-eQTL target genes, even when using simple regression methods.

GTEx analysis
To illustrate Tejaas in a real data set, we analyzed trans-eQTLs across 49 human tissues using data from the Genotype Tissue Expression version 8 (GTEx v8) project [9][10][11]. In this section, we explain the preprocessing steps used for the GTEx analysis and report supporting results related to our analysis.

RNA-Seq expression data
We downloaded the phased RNA-seq read count expression matrix from the dbGaP portal ( lename: phASER_GTEx_v8_matrix.txt.gz). The phASER processed expression uses read-backed mapping for each RNA-seq sample, using the genotype from the same individual as reference to map and phase RNA-seq reads. For each gene, two read count values are reported, one for each expressed allele. We added up both count values and calculated TPMs (Transcripts Per Million). For quality control, we retained genes with expression values > 0.1 and more than 6 mapped reads in at least 20% of the samples.

. Covariate correction
Tejaas uses the TPMs (or TMMs) for KNN correction to remove confounders and subsequent trans-eQTL discovery. However, as discussed in Sec. 2.9, the explicit covariate-corrected gene expression is required for nding target genes of the trans-eQTLs. We downloaded the covariate les from the GTEx portal [17] ( lename: GTEx_Analysis_v8_eQTL_covariates.tar.gz). The covariates include the rst 5 principal components of the genotype (see Supplementary Material of [13] for details), donor sex, WGS sequencing platform (HiSeq 2000 or HiSeq X) and WGS library construction protocol (PCR-based or PCR-free). Additionally, from phenotype les available in dbGaP, we included donor age and post mortem interval in minutes ('TRISCHD') as covariates.
We inverse normal transformed the TPMs (or TMMs) and used CCLM (Sec. 3.1) to remove the contribution of all the covariates and used the residuals for target gene discovery.  One possible way to check the e cacy of the unsupervised KNN correction is to check whether it can be explained by the confounding e ects of known covariates and if so, to what extent. To test this, we checked the correlation of the 'KNN confounder' with known technical and biological covariates of GTEx samples and subjects. For each tissue , the KNN confounder (C knn ∈ R × ) is simply the term we subtracted from the expression values of all the genes in Eq. (32),
Categorical covariates with ≤ 4 categories were binarized or converted to integers otherwise. Values with 'Not Reported' or 'Unknown' status were considered as missing values and were not considered in the analysis. For each covariate, we selected only those with at least 50 observations in any given tissue. We then performed a simple linear regression (SLR) to explain C knn with C gtex as predictors using Python's sklearn.linear_model.LinearRegression, From the tted models, we obtained the coe cient of determination 2 for every covariate in tissue . In other words, the 2 corresponds to the proportion of the variance of C knn that can be explained by the th known covariate in tissue . The resulting 2 values are reported in Fig. S14 (bottom panel). Indeed, the variance of KNN correction are explained to di erent extents by several known covariates. For example, the rst principal component of the genotype (PC1) and the sample's reported race both explain a signi cant proportion of the variance of C knn . The next four principal components, PC2 to PC5 do not contribute to explaining the variance of C knn . Interestingly, sample covariates related to library size and quality as well as various rates for mapped reads explain di erent proportions of the variance of C knn in di erent tissues, indicating technical confounders in many GTEx tissues. Next, we wanted to check the total proportion of the variance of C knn explained by known covariates, which cannot be done using the SLR set up described above. For each tissue , we selected those covariates with 2 ≥ 0.05 in the SLR set up. We then performed a multiple linear regression (MLR), where is the number of covariates retained in tissue . The percentage of total explained variance of C knn for each tissue is shown in the top panel of Fig. S14. On average across all tissues, the known covariates that we used explained 39.9% of the variance of KNN correction.

. Tejaas regularizer selection for GTEx
As in simulations (Sec. 4.4), we selected the regularizer for the GTEx tissues using the non-Gaussian parameter (Sec. 2.8). From Fig. S15, we found that we could use = 0.1 for most of the tissues in GTEx v8. However, there were four tissues for which ( ) at = 0.1 deviated strongly from the minimum. Hence for these four tissues, namely heart atrial appendage, pancreas, spleen and whole blood, we chose = 0.006. We used a binomial test to calculate the p-values for the enrichment . If is the number of trans-eQTLs in the tissue, then the probability of nding of them with a feature is, and ( > ) gives us the p-value for the tissue-GWAS pair.

. Effect of cis-masking
We applied Tejaas on 38 GTEx tissues (except brain tissues and bladder) with and without cismasking (Sec. 2.10) to check the e ect of cis-masking option of Tejaas for discovering trans-eQTLs in the GTEx data. We found that ∼ 97% of the trans-eQTLs discovered with cis-masking (shown with green in Fig. S17a) are also discovered without cis-masking (shown with red in Fig. S17a). However, if no cis-masking is used, Tejaas discovers ∼ 11% more trans-eQTLs compared to that with cis-masking. This increase in signi cant trans-eQTLs might be due to strong cis-eQTLs, as envisioned in Sec. 2.10. We can understand the e ect of cis-masking by looking at how many of our trans-eQTLs are also reported as cis-eQTLs by the GTEx Consortium [13]. We de ned the proportion of cis-eQTLs 0.0 0. 5   among the trans-eQTLs compared to the proportion expected by chance as cis-eQTL enrichment ( cis-eQTL ). In Fig. S17b, we compare the log 2 ( cis-eQTL ) for trans-eQTLs discovered with and without cis-masking in all the 38 tissues. The mean log 2 enrichment of cis-eQTLs across all tissues (shown in inset of Fig. S17b) is 0.10 with cis-masking and increases to 0.77 without cis-masking. This increase comes from the extra ∼ 11% SNPs discovered without cis-masking, indicating that cis-masking is crucial to avoid strong cis-eQTLs being falsely reported as trans-eQTLs by Tejaas.

Effect of cross-mappable genes
False trans-eQTL signals could arise from multi-mapped reads within genes with high sequence similarity, where reads from a given gene are mapped to one or more other genes elsewhere in the genome and showing an association resembling a long range regulatory e ect. This problem was explored by Saha and Battle [18] and can be mitigated using cross-mappability scores. We obtained pre-computed cross-mappability scores from [18], with settings of k-mer length 75 for exons and 36 for UTRs. In Tejaas we extended the cis-masking approach to exclude all genes that cross-map (cross-mappability score > 1) with any of the cis-genes listed in a given cis-masking group and we call this the cross-mappability lter. We applied our trans-eQTL discovery pipeline on all non-brain tissues of GTEx with the cross-  Table S | Summary of number of trans-eQTLs before and after cross-mappability filter. We report the two sets of tissues, Set contains non-brain tissues for which we used = 0.1 in our analysis, and Set contains tissues for which we used = 0.006. We report the total number of lead trans-eQTLs after LD pruning for the different analyses. Numbers in bracket are the number of unique trans-eQTLs in that given set. mappability lter. The cross-mappability lter masks out a large number of genes, often more than thousands. For example, the average number of cis-genes masked in whole blood tissue is ∼ 9, while the average number of cross-mapped genes masked for whole blood tissue becomes ∼ 3508. About 34% of trans-eQTLs were ltered out for tissues with = 0.1 (Table S1 and Fig. S18a), indicating that they might be false signals from cross-mapped genes. However, we unexpectedly observed an increase in the number of predicted trans-eQTLs in the set of four tissues (whole blood, heart atrial appendage, spleen and pancreas) where we used = 0.006. This is because removing the large number of cross-mapped genes breaks down the normality assumption of the null model for these tissues, which are very sensitive to the choice of , as can be seen from their non-Gaussian parameter (Fig. S15). For the other tissues, the choice of is relatively stable and removing the cross-mapped genes does not a ect the choice. For our assumptions to hold, we need to optimize for every SNP in these tissues, to consider every set of masked genes when using the cross-mappability lter. This optimization is time consuming. However, with an approximate choice of = 0.008 obtained by optimizing ( ) after removing a random set of 3000 genes from the expression data, the results are comparable to those found in other tissues (Table S1 and Fig. S18b). We compared the enrichment in functional regions and cis-eQTLs before and after crossmappability lter. Enrichment of trans-eQTLs in DHS regions and cis-eQTLs does not change signi cantly ( > 0.1) with and without the cross-mappability lter (Fig. S19). Despite this, we do observe a slight decrease in the enrichment of cis-eQTLs for many tissues, consistent with the false positive trans-signal that the cross-mappability lter is meant to remove.

. Effect of GTEx populations in predicted trans-eQTLs
To assess the e ect of population di erentiation in GTEx, we calculated xation index values ( values) as in [19] on GTEx genotype data. We compared the distribution of values for all GTEx SNPs with the set of trans-eQTLs predicted with Tejaas (Fig. S20, left panel). The average value for the predicted trans-eQTLs in each tissue is shown on the right panel of Fig. S20. High values are indicative of allele frequency di erences between subpopulations present in the data. Population substructure could give spurious indications of eQTL associations [20]. To rule out confounding e ects of population di erentiation from our enrichment analysis, we sampled null sets with the same distribution as that of the predicted trans-eQTLs. We re-calculated DHS functional enrichments (Fig. S21) and GWAS enrichemnts (data not shown). However, we did not observe any signi cant deviation from the previously calculated enrichments. .

Regulatory element enrichment in brain tissues
Brain tissues in GTEx have generally lower number of RNA-seq samples compared to other tissues. This is one of the reasons why they display lower number of predicted trans-eQTLs with Tejaas.  Other unknown confounders in the expression measurements could also a ect Tejaas predictions, as it has been reported that brain tissues have markedly distinct expression signatures in GTEx that separates them from the rest of the tissues. This could be due to the post mortem nature of the samples. We report DHS, raQTL and functional region enrichments for trans-eQTLs predicted in brain tissues in Fig. S22. Due to the low number of predictions, the reported enrichments are not reliable.

Performance of Tejaas on null data
On null data, Tejaas does not show any spurious association (Fig. S23). We used the same gene expression as used for trans-eQTL discovery to check if the correlation of the gene expression can lead to false discovery. We removed any possible trans-eQTL signal by permuting the sample labels of the genotype. We found no signi cant association in any GTEx tissue. .

Replication in eQTLGen
We downloaded the list of signi cant trans-eQTLs discovered in the eQTLGen study from https: //www.eqtlgen.org/trans-eqtls.html (version from 2018-09-04). It contains 59 786 SNP-gene pairs and a total of 3 853 unique trans-eQTLs in whole blood. In Appendix 3, we report the overlap between trans-eQTL from eQTLGen with the lead trans-eQTLs discovered with Tejaas in all GTEx tissues. Enrichments were calculated as in Sec. 5.7. Replication across all GTEx tissues is low, with whole blood showing the highest number of replicated variants. Although we don't expect to replicate other studies that used single SNP-gene pairs association methods, it is reassuring to see that Tejaas can replicate trans-eQTLs discovered in the same tissue type. Nonetheless, we must highlight that the eQTLGen meta-analysis study includes GTEx whole blood expression.

GWAS analysis
We investigated the overlap of the novel trans-eQTLs discovered by Tejaas with GWAS variants of complex traits to nd transcriptional regulatory mechanisms through which SNPs a ect complex  diseases. In this section, we discuss the details of our analyses and report supplementary results.

Data source for GWAS summary statistics
We used two di erent libraries of GWAS-associated SNPs: 1. GWAS Catalog. Data was obtained from the EBI website [22]. We used version 'e98_r2020-03-08' which contains 4493 studies with a total of 179 364 associations [23,24].
2. Complex trait GWAS. A set of 87 GWAS compiled by by Barbeira et al. [25]. Summary statistics from these GWAS were harmonized and imputed to GTEx v8 variants with MAF > 0.01 using only European samples. Of these, 86 traits had at least one genome-wide signi cant SNP ( < 5 × 10 −8 ) among the GTEx v8 variants, which were tested for trans-eQTLs. These 86 traits were broadly classi ed into 11 disease categories.

Calculation of GWAS enrichment
We used two di erent strategies for calculating the enrichment of trans-eQTL SNPs in GWAS for the two di erent datasets. Complex trait GWAS. In this case, we wanted to compare the GWAS enrichment of trans-eQTLs with that of GTEx cis-eQTLs with varying cuto of GWAS p-values. Hence, we plotted the fraction of cis-eQTLs, the fraction of trans-eQTLs and the fraction of total tested SNPs that overlap with SNPs associated with a complex disease (or disease category), as shown in the representative example of Fig. S24a. We used di erent cuto p-values for selecting the GWAS associated SNPs. Each disease category is a collection of several disease phenotypes. To obtain category-wise p-values for each SNP, we assigned the minimum GWAS p-value from the phenotypes that constituted the category. Finally, we calculated the GWAS enrichment of eQTLs (cis and trans respectively), GWAS = fraction of eQTLs overlapping with GWAS SNPs fraction of tested SNPs overlapping with GWAS SNPs (73) This is shown in the representative example of Fig. S24b. We used all reported cis-eQTLs and all predicted trans-eQTLs without pruning for LD regions to calculate the overlaps and enrichments. P-value calculation. We used a binomial test to calculate the p-values for the enrichment of each tissue-GWAS pair. If is the number of trans-eQTLs in the tissue, then the probability of nding of them in a GWAS is given by, and ( > ) gives us the p-value for the tissue-GWAS pair.

Supplementary results
Complementary to the Fig. 5b in our main text, we report all pairs of tissues and disease categories, which showed signi cant enrichment ( < 0.05) for trans-eQTLs to overlap with GWAS-associated SNPs, selected at a nominal cuto of < 1 × 10 −6 (Table S2). In Fig. S25, we report the GWAS enrichment for every tissue-study pair for the 87 disease phenotypes, at a nominal GWAS cuto of ≤ 1 × 10 −6 . Traits without any signi cant GWASassociated SNPs are not shown. Similar to disease categories in Fig. 5 in the main text, the GWAS enrichments of trans-eQTLs are speci c to certain tissue-disease pairs. Some of these pairs suggest biological relationship, such as the enrichment of thyroid trans-eQTLs in hypothyroidism or the enrichment of adrenal gland trans-eQTLs in hypertension. There are many other such interesting physiological links, which merits a thorough analysis in the future. rewrite q rev as, The rst sum over can be expressed in terms of simpler sums, where we have de ned the following moments of W, Inserting this into Eq. (80) yields

Variance of RR-score
We will determine the variance with the equation We will now derive expressions for the sums of elements of W in equation (90)