 Method
 Open access
 Published:
SURGE: uncovering contextspecific geneticregulation of gene expression from singlecell RNA sequencing using latentfactor models
Genome Biology volume 25, Article number: 28 (2024)
Abstract
Genetic regulation of gene expression is a complex process, with genetic effects known to vary across cellular contexts such as cell types and environmental conditions. We developed SURGE, a method for unsupervised discovery of contextspecific expression quantitative trait loci (eQTLs) from singlecell transcriptomic data. This allows discovery of the contexts or cell types modulating genetic regulation without prior knowledge. Applied to peripheral blood singlecell eQTL data, SURGE contexts capture continuous representations of distinct cell types and groupings of biologically related cell types. We demonstrate the diseaserelevance of SURGE contextspecific eQTLs using colocalization analysis and stratified LDscore regression.
Background
A complete, mechanistic understanding of the genetic basis of complex traits could provide insights into the biological basis of human health and disease. A powerful approach to filling in the missing links between genetics and complex traits is to use molecular measurements, such as gene expression levels, as an intermediate phenotype. Genetic variants significantly associated with gene expression are known as expression quantitative trait loci (eQTLs) [1,2,3,4,5]. Although eQTL studies have now been performed in large cohorts and numerous tissues [5, 6], characterizing the impact of regulatory genetic variants is far from complete. This complexity arises in part because the effects of genetic variation on gene expression vary considerably between different cellular contexts, such as cell types, developmental stage, or condition (Fig. 1A) [7,8,9,10,11,12,13,14].
Indeed, eQTLs from adult bulk tissue samples fail to explain the majority of known disease loci [11, 15,16,17]. It is therefore critical to identify eQTLs from diverse contexts in order to more fully characterize the molecular mechanisms underlying disease associated loci. Recent work has shown singlecell RNA sequencing (scRNAseq) provides unique data to uncover cell type and contextspecific eQTLs; such higherresolution data will naturally better reflect diverse cell types and cellular states, many of which would not be detectable from bulk RNAseq [9, 10, 12,13,14, 18, 19].
However, the relevant contexts, such as cell type or state, that actually modulate genetic effects may not be known a priori, for example, genetic regulatory effects that are only present in a rare cell type, during intermediate stages of cellular differentiation [8, 9] or in response to environmental stimuli [7] that may not already be known to be disease relevant. Furthermore, an individual cell may be defined by multiple, overlapping contexts, such as both cell type and a perturbation response affecting partially overlapping sets of cells [9, 20, 21]. Contexts, such as differentiation progress or time, may manifest as continuous effects rather than discrete clusters. We developed SURGE (Singlecell Unsupervised Regulation of Gene Expression), a novel probabilistic model that uses matrix factorization to learn a continuous representation of the cellular contexts that modulate genetic effects. This includes the extent of relevance of each context to each cell or sample, and the corresponding eQTL effect sizes specific to each learned context, allowing for discovery of contextspecific eQTLs without prespecifying subsets of cells or samples.
First, we evaluate the statistical power of SURGE to identify latent contexts that modulate genetic effects on gene expression using simulated data. Next in a proofofconcept experiment, we apply SURGE to bulk gene expression measurements from ten GTEx version 8 tissues [5] to uncover the relevant contexts underlying eQTL regulatory patterns in bulk RNAseq data. We then use SURGE to identify contextspecific eQTLs in a singlecell data set consisting of 1.2 million peripheral blood mononuclear cells (PBMC) spanning 224 genotyped individuals [18]. Finally, we demonstrate the diseaserelevance of SURGE contextspecific eQTLs using colocalization analysis and stratified LDscore regression (SLDSC) [22, 23].
Results
A standard approach to identify contextspecific eQTLs is to quantify the effect of the interaction between genotype and a prespecified cellular context on gene expression levels using a linear model (interactioneQTLs) [8, 9, 20, 24]. However, this approach requires prespecifying which contexts, such as known cell types, to test for interaction, therefore inhibiting eQTL discovery in previously unstudied cellular contexts or uncharacterized cell types. Relatedly, others have tested the interaction between genotype and reduceddimension expression features (such as gene expression principal components or MOFA factors [25]) on gene expression [13, 21], but this imposes the limiting assumption that all contexts that modulate the genetic regulation of gene expression can be explained by reduceddimension features of gene expression and limits eQTL discovery to contexts with large effects on broad gene expression patterns.
To address these issues, we developed SURGE, which uses a matrix factorization approach to uncover contextspecific eQTLs without requiring prespecification of the contexts of interest. SURGE achieves this goal by leveraging information across genomewide variantgene pairs to jointly learn both a continuous representation of the latent cellular contexts defining each measurement (henceforth referred to as SURGE latent contexts) and the interaction eQTL effect sizes corresponding to each SURGE latent context (Fig. 1B; see the “Methods” section). Importantly, SURGE allows for any individual measurement (such as a single cell) to be defined by multiple, overlapping contexts. From an alternative but equivalent lens, SURGE discovers the latent contexts whose linear interaction with genotype explains the most variation in gene expression levels. From this perspective, SURGE enables unsupervised discovery of the principal axes of genetic regulation of gene expression defining an eQTL data set. To more accurately infer SURGE latent contexts and increase power to detect contextspecific eQTLs, SURGE jointly controls for the effects of known covariates as well as sample repeat structure induced by assaying multiple measurements (such as many cells) from the same individual on gene expression when inferring latent contexts (see the “Methods” section). Finally, SURGE automatically selects the number of relevant latent contexts by placing Automatic Relevance Determination prior distributions [26] on the inferred latent contexts (see the “Methods” section). The user simply would initialize the number of latent contexts to be large and greater than the likely number of underlying latent contexts present in the eQTL data set, and SURGE will prune unnecessary contexts during optimization (see the “Methods” section).
SURGE can be applied any eQTL data set using publicly available software (https://github.com/BennyStrobes/surge) and can be scaled to tens of thousands of measurements (Additional file 1: Fig. S1). After inference, SURGE latent contexts can be associated with available measurements and annotations to help interpret their biological meaning. In addition, latent contexts can be used identify SURGE interaction eQTLs or variants whose effect on genes expression significantly changes as a function of the SURGE latent context (see the “Methods” section).
Recently, there have been two methods proposed to identify contexts related to genetic regulation of gene expression from eQTL data sets [27,28,29]. SURGE is unique from these methods in that it identifies contexts whose linear interaction with genotype explain the most variation in singlecell gene expression levels. Vochteloo et al. [27] identifies contexts using an iterative algorithm such that the interaction between the context and genotype maximize expression variation in only the previous iteration’s genomewide significant contextinteraction eQTLs. Contexts identified by this approach may not directly correspond to contexts whose interaction with genotype maximally explains variation in gene expression. Furthermore, this approach does not model samplerepeat structure and was not developed for application on singlecell eQTL data. Gewirtz et al. [28, 29] propose a method to identify shared latent topics present in both expression and genotype data. Topics identified by this approach will not directly correspond to contexts whose interaction with genotype maximally explains variation in gene expression. Therefore, the goals of each method are distinct, and SURGE uniquely identifies contexts where interaction between genotype and context drive variation in gene expression.
We utilize a simulation framework to statistically quantify SURGE’s ability to accurately infer the latent contexts that alter genetic regulation of gene expression (see the “Methods” section). As expected, reconstruction of the simulated contexts depends on the sample size of the eQTL data set as well as the true effect size and number of contextspecific eQTLs present in the simulated eQTL data set (Fig. 1C, Additional file 1: Fig. S2). However, given a realistic eQTL data set containing 100 modest effect contextspecific eQTLs (simulated realistic interaction variance 0.25 [8]; see the “Methods” section) and sample size (n = 250), SURGE accurately infers the simulated latent contexts (Fig. 1C, Additional file 1: Fig. S2) as well as the number of simulated latent contexts (Fig. 1D, Additional file 1: Fig. S3).
As a proof of concept in real sequencing data, we apply SURGE to model RNA sequencing samples from 10 GTEx version 8 tissues (4169 individualtissue pairs; Adrenal Gland, ColonSigmoid, Esophagus Mucosa, MuscleSkeletal, Pituitary, Skin [not sun exposed suprapubic], Skin [sun exposed lower leg], Small Intestine terminal ileum, Stomach, and Thyroid), selected to be largely diverse with a small number of related tissues. SURGE identifies 15 latent contexts, resulting in hundreds of genes with at least one SURGE interaction eQTL (eFDR < = 0.05, see the “Methods” section) (Additional file 1: Figs. S4, S5, S6, and S7, Tables S1, S2). In this dataset, each RNA sample was extracted from a specific tissue, and while tissue identity information is not provided to SURGE, all 15 of the SURGE latent contexts are associated with differences in tissue type between the samples (Additional file 1: Table S3) with more than 30% of the variation in 7 of the top 8 SURGE latent contexts (latent contexts ordered by PVE, see the “Methods” section) explained by tissue identity (Fig. 2A, Additional file 1: Table S3). SURGE latent context 1, for example, isolates RNA samples from MuscleSkeletal tissue (p < = 2.2 e − 16, Wilcoxon rank sum test); RNA samples derived from MuscleSkeletal tissue have an average latent context 1 value of − 1.82 (sdev 0.342), while RNA samples from other tissues have an average latent context 1 value of − 0.011 (sdev 0.456). Furthermore, SURGE latent context 4 and 7 cluster samples according to their known ancestry (Additional file 1: Table S4); samples from African Ancestry donors are strongly loaded on both latent context 4 and 7 (Fig. 2B, Additional file 1: Fig. S8).
Next, we intersect the learned SURGE latent contexts with previously computed computational estimates of each RNA sample’s cell type composition according to xCell (xCell infers cell type enrichment scores that reflect cell type composition based on external cell typespecific gene expression data) [24, 30]. We find that the SURGE latent contexts are not simply identifying differences in tissue identity between the samples but learning differences in cell type composition of samples both across tissues and within a single tissue (Fig. 2C, Fig. 2D, Additional file 1: Fig. S9S11). SURGE latent context 2, for example, is highly correlated with epithelial cell enrichment score across samples from all ten tissues (Fig. 2C; Spearman’s rho 0.724, p < = 2.2 e − 16). Moreover, many of the SURGE latent contexts capture complex multicell type composition continuums, not simply the change in proportions of a single cell type (Fig. 2D, Additional file 1: Fig. S9, Table S5). SURGE identifies latent contexts underlying cell type composition continuums even when applied to RNA samples from only a single tissue (see the “Methods” section, Additional file 1: Fig. S12), demonstrating the importance of cell type composition differences across samples extracted from the same tissue. Importantly, we observe greater power to detect contextspecific eQTLs with SURGE latent contexts than with the previously studied approach [30] of testing genetic interactions with cell type enrichment score estimates from xCell (see the “Methods” section, Additional file 1: Fig. S13). In summary, SURGE identifies tissue type, cell type, and ancestry as the primary axes of genetic regulation of gene expression within GTEx eQTL data.
Next, we apply SURGE to a recently generated singlecell eQTL data set consisting of 1.2 million PBMCs from 224 genotyped individuals [18]. One hundred fortyone of these individuals have systemic lulus erythematosus (SLE), while the remainder are healthy. To mitigate the sparsity characteristic of 10X sequencing data, we aggregate cell level expression data across highly correlated cells to generate 22,774 pseudocells (see the “Methods” section, Additional file 1: Fig. S14) [21, 31], aggregating on average 22 cells per pseudocell. Here, SURGE identifies 6 latent contexts, resulting in hundreds of genes with at least one SURGE interaction eQTL (eFDR < 0.1, see the “Methods” section) (Additional file 1: Figs. S15, S16, S17, Tables S6, S7).
SURGE latent contexts 1, 2, and 4 capture continuous representations of distinct blood cell types while integrating biologically related cell types along a gradient within a single latent context (Fig. 3A, Additional file 1: Figs. S18S20, Tables S8, S9). SURGE latent context 2, for example, is strongly loaded on natural killer (NK) cells (p < = 2.2 e − 16, Wilcoxon rank sum test), while still identifying fineresolution differences distinguishing bright NK cells from dim NK cells (Additional file 1: Fig. S18; p < = 2.2 e − 16, Wilcoxon rank sum test). Additionally, SURGE latent context 1 identifies subtle differences isolating monocytes derived from healthy individuals from monocytes derived from SLE individuals (Additional file 1: Fig. S21; p < = 2e − 16, Wilcoxon rank sum test). Interestingly, SURGE latent contexts 3, 5, and 6 are only modestly explained by known cell types (Additional file 1: Table S8, Table S9) and are not explained by broad expression trends related to cell types defining the top gene expression principal components (Additional file 1: Figs. S22S23). SURGE latent contexts 5 and 6 instead show strong correlation with expression of genes involved in specific biological processes (see the “Methods” section, Additional file 1: Tables S10, S11). For example, SURGE latent context 4 is correlated with genes that are extremely enriched in the Hallmark interferongamma response (odds ratio: 28.52, p < 4.2e − 10) [32]. The interferon gamma response is a wellstudied immunerelated pathway shown to be involved in regulating SLE [18, 33, 34].
Finally, we evaluate the relationship between SURGE interaction eQTLs and diseaseassociated loci across diverse traits with genomewide association studies (GWAS) available. Using coloc [22], we identify hundreds of colocalizations between SURGE interaction eQTLs and GWAS loci (Fig. 3B, C, Additional file 1: Fig. S24). For example, a SURGE context 4 interaction eQTL for BTN3A2 colocalized with a GWAS signal for SLE (Fig. 3B). We identify significantly more trait colocalizations with SURGE interaction eQTLs relative to using standard eQTLs (Fig. 3C; p < 0.0026, paired Wilcoxon rank sum test across 14 independent traits). In addition, we compared the number of trait colocalizations identified using SURGE interaction eQTLs with expression PC interaction eQTLs or significant interactions between gene expression principal components (PCs) and genotype on gene expression (see the “Methods” section). Despite both methods identifying comparable numbers of interaction eQTLs (Additional file 1: Fig. S17), SURGE interaction eQTLs colocalized (PPH4 > 0.95) with 2.4 more loci per trait on average across 15 independent traits (Fig. 3C; p < 0.017, paired Wilcoxon rank sum test). SURGE identified 66 trait colocalizations (PPH4 > 0.95) that could not be replicated (PPH4 > 0.1) by the expression PC interaction eQTL analysis, whereas the expression PC interaction eQTL only identified 31 trait colocalizations not replicated by SURGE interaction eQTLs. Notably, 50 of the 66 trait colocalizations unique to SURGE were discovered in the SURGE latent contexts not well explained by expression PCs (latent contexts 3, 5, and 6; Additional file 1: Figs. S22, S25) demonstrating the importance of not relying exclusively on expression PCs for interaction eQTL calling.
Next, we assess how eQTL enrichment in complex trait and disease heritability varied along the SURGE latent contexts using SLDSC [23, 35]. Briefly, we used SURGE to estimate eQTL effect sizes at multiple positions along each SURGE latent context and then use SLDSC to quantify the heritability enrichment of eQTLs identified at each position (see the “Methods” section). We note that this approach is not limited to SURGE interaction eQTLs and could be applied to any analysis that infers interaction eQTLs. We observed that eQTL enrichment in complex trait and disease heritability significantly varies along the SURGE latent contexts for many diseases and complex traits (Fig. 3D, E, Additional file 1: Fig. S26). For example, predicted eQTL effects in cells negatively loaded on SURGE latent context 4 (corresponding to a B cell continuum) are approximately four times more enriched in celiac disease heritability than static eQTLs (Fig. 3E). This result coincides with the previously reported role of B cells in celiac disease [36, 37]. Ultimately, this analysis highlights the importance of assessing eQTLs in diseaserelevant contexts as well as SURGE’s capacity for identifying diseaserelevant contexts.
Discussion
Here, we presented SURGE, a novel probabilistic model that identifies contextspecific eQTLs from singlecell data without prespecifying context, such as cell types or subsets of samples. SURGE leverages information from variantgene pairs across the entire genome to learn a continuous representation of the cellular contexts defining each measurement and the corresponding eQTL effect sizes specific to each learned context. Importantly, SURGE allows for unsupervised discovery of the principal axes of genetic regulation of gene expression within an eQTL data set, identifying latent contexts associated with cell type, tissue type, and ancestry when applied to GTEx tissue samples and highly resolved blood cell types and gene programs when applied to bloodderived single cells. We demonstrated that eQTL enrichment in complex trait and disease heritability significantly varied along the SURGE latent contexts and, ultimately, SURGE identified many traitrelevant loci that could not be detected through traditional eQTL approaches.
Although SURGE allows for identification of contextspecific eQTLs without prespecifying contexts, it has several limitations. First, SURGE is limited to discovery of interaction eQTLs with small effect sizes or those from contexts whose effects explain a large fraction of gene expression variation aggregated across many genes in the genome (Fig. 1C) and will be underpowered to detect interaction eQTLs from contexts that interact with the genetic regulation gene expression in a small number of genes. Second and like many unsupervised algorithms, it can be challenging to interpret the biological meaning of some SURGE latent contexts, particularly those that capture signals independent of observed measurement annotations such as cell type or sample covariates. Third, SURGE latent context inference and SURGE interaction eQTL calling is limited to the most significant variant for each eGene, hindering discovery of multiple independent context specific eQTLs regulating a single gene. Fourth, SURGE makes the simplifying assumption that the residual variance in gene expression is distributed normally. Recent work has demonstrated the benefits of nonnormal distributions in singlecell interaction eQTL calling [13]. We leave the extension of SURGE with alternative residual distributions to future work. Fifth, SURGE cannot be reasonably scaled to more than 50,000 expression samples (Additional file 1: Fig. S1). This does not present a concern for existing eQTL data sets; however, SURGE optimization could be extended in the future, if needed, to scale to arbitrarily large eQTL data sets using parallel programming or stochastic optimization [25].
In conclusion, SURGE provides a statistically principled approach to uncover the dominant axes of genetic regulation of gene expression in an eQTL data set. It will become increasingly useful as large singlecell eQTL data sets containing cells spanning increasingly diverse cellular contexts are generated.
Methods
SURGE model overview
The SURGE model is defined according to the following probability distributions:
Here, \(n\) indexes RNA samples, \(t\) indexes representative variantgene pairs being tested for eQTL analysis, and \(i\) indexes individuals. We use the notation \(n\in i\) to represent the instance where RNA sample \(n\) is drawn from the individual\(i\). \({y}_{nt}\) is the observed standardized gene expression (mean 0 and variance 1 for each test\(t\)) level of the gene corresponding to test \(t\) in sample\(n\). We assume the gene expression data has been properly normalized prior to standardization. \({G}_{nt}\) is the observed, standardized (described in more detail below) genotype of the variant corresponding to test \(t\) in sample\(n\). \({X}_{nl}\) is the observed value of covariate \(l\) for sample \(n\). SURGE infers the values of:

\({F}_{t}\): the eQTL effect size of test \(t\) that is shared across samples

\({V}_{kt}\): the eQTL effect size of test \(t\) for latent context \(k\)

\({U}_{nk}\): the latent context value of sample \(n\) on factor \(k\)

\({\mu }_{t}\): the intercept of each test

\({W}_{lt}\): The effect size of covariate \(l\) on the gene corresponding to test \(t\)

\({\alpha }_{it}\): the random effect intercept for each individual for each test

\({\gamma }_{k}^{2}\): The variance of the values in latent context \(k\)

\({\psi }_{t}^{2}\): The variance of intercept corresponding to each individual in test \(t\)

\({\sigma }_{t}^{2}\): The residual variance in gene expression levels in test \(t\)
\({a}_{0}\), and \({\beta }_{0}\) are model hyperparameters set to provide noninformative priors while stabilizing optimization. In practice, we set \({\alpha }_{0}\) to 1e^{−3} and \({\beta }_{0}\) to 1e^{−3}.
To standardize the genotype of the variant corresponding to test \(t\), we center the genotype vector to have mean 0 across samples and then we scale the genotype vector for test \(t\) (\({G}_{*t}\)) by the standard deviation of \({Y}_{*t}/{G}_{*t}\). This scaling encourages the lowdimensional factorization (\(UV\)) to explain variance equally across tests instead of preferentially explaining variance in tests with small variance in \({Y}_{*t}/{G}_{*t}\).
It is worth highlighting that a meanzero gaussian prior is placed on \({U}_{nk}\) in order to produce interpretable assignments of samples to factors. The level of regularization of that prior is learned separately for each latent context (\({\gamma }_{k}^{2}\)), allowing SURGE to zeroout (\({\gamma }_{k}^{2}\) approaches 0) irrelevant contexts and automatically learn the effective number of latent contexts. This approach has been used by others for inference of the number of effective components in more traditional matrix factorizations [25, 38] and is similar to Automatic Relevance Determination [26].
SURGE optimization
We approximate the posterior distribution of all latent variables [\(Z=\)(\({F}_{t}\), \({V}_{kt}\), \({U}_{nk}\), \({\mu }_{t}\), \({W}_{lt}\), \({\alpha }_{it}\), \({\gamma }_{k}^{2}\), \({\psi }_{t}^{2}\), \({\sigma }_{t}^{2}\))] using meanfield variational inference [39]. The goal of variational inference is to minimize the KLdivergence between \(q(Z)\) and \(p(ZY,G,X)\), which can be written as \(KL(q(Z)p(ZY,G,X)\). Here, \(q(Z)\) is a simple, tractable distribution that is used to approximate \(p\left(ZY,G,X\right)\). We used the “meanfield approximation” for \(q(Z)\) such that all latent variables are independent of one another. More specifically:
where \(N\left(x\right\mu , {\sigma }^{2})\) is a univariate normal distribution parameterized by mean \(\mu\) and variance \({\sigma }^{2}\), and \(G(X\alpha , \beta )\) is a univariate gamma distribution parameterized by \(\alpha\) and \(\beta\).
It can be shown that minimizing the KLdivergence \(KL(q(Z)p(ZY,G,X)\) is equivalent to maximizing the evidence lower bound (ELBO):
Therefore, we will frame SURGE optimization from the perspective of maximizing the ELBO with respect to the parameters defining \(q(Z)\) or the variational parameters. Noteworthy is \(p(G,Y, X, Z)\) is explicitly defined in the “SURGE model overview” section and can be easily computed. The approach we take to maximize the ELBO is through coordinate ascent [39], iteratively updating the variational distribution each latent variable, while holding the variational distributions of all other latent variables fixed. Accordingly, the ELBO is guaranteed to monotonically increase after each variational update. In the case of the SURGE model, each update is available in closed form (see the supplement materials).
Optimization of variational parameters is performed as follows: we randomly initialize all variational parameters (see the “Random initialization for SURGE optimization” section) and then iteratively loop through all latent variables in \(Z\) and update the variational parameters corresponding to that latent variable until we reach convergence.
To assess convergence, we assess the change in ELBO from one iteration to the next. We consider the model converged when the change in ELBO is less than 1e − 2.
Random initialization for SURGE optimization
It is important to note that meanfield variational inference is not guaranteed to converge to the global optima of the ELBO. To mitigate the effects of local optima, we recommend optimizing multiple models with different random initializations and using the parameters learned from the model that achieves the largest ELBO.
Proportion variance explained of SURGE latent contexts
Following the approach taken by [40], we define the “proportion variance explained” (PVE) of the \({k}^{th}\) latent context as:
As stated in [40], this approach is a measure of the amount of signal in data set that is identified by the \({k}^{th}\) latent context. However, the name “proportion variance explained” should be considered loosely as the factors are not orthogonal.
We also consider the “Proportion of SURGEmediated variance explained” (PSMVE) for the \({k}^{th}\) latent context as:
PSMVE quantifies the fraction of total variance in gene expression that is mediated through contextspecific eQTL effects attributed to each latent context.
Removing irrelevant latent contexts
Upon model convergence, we remove latent contexts with PVE \(\le 1{e}^{5}\).
Simulation experiments
To assess SURGE’s ability to accurately capture contexts underlying contextspecific eQTLs, we performed the following simulation experiment.
We randomly generated genotype and expression matrices across 1000 variantgene pairs and \(N\) RNA samples. For each simulated variantgene pair, we simulated the genotype vector (\(G\)) across the \(N\) samples according to the following probability distributions:
Then, we simulated the expression vector (\(y\)) across the \(N\) samples using that variantgene pair’s simulated genotype vector according to the following probability distributions:
Using this simulation, we can evaluate SURGE’s ability to recapture the simulated latent contexts (U) (Fig. 1C, Additional file 1: Fig. S2) as a function of the simulation hyperparameters:

The number of latent contexts (K)

The sample size (\(N\))

The strength of the interaction terms (\(\gamma\))

The fraction of tests that are contextspecific eQTLs for a particular context (\(p\))
We can also access SURGE’s ability to accurately estimates the number of relevant contexts (K) (Fig. 1D, Additional file 1: Fig. S3) by only retaining contexts with PVE \(>1{e}^{5}.\)
Selection of representative variantgene pairs used for optimization
SURGE optimization (i.e., learning the SURGE latent contexts) requires an input expression matrix and genotype matrix. As specified above, both matrices should be of dimension \(N\) X \(T\), where \(N\) is the number of RNA samples and \(T\) is the number of genomewide representative variant gene pairs. We desire each variantgene pair to be independent of one another because we want SURGE to infer eQTL patterns that are persistent across the genome, not specific to a single gene or variant.
To encourage the expression and genotype data to consist of independent variantgene pairs, we limit there to be a single variantgene pair selected for each gene and limit there to be a single variantgene pair selected for each variant.
Furthermore, it has been shown that contextspecific eQTLs are more likely to be standard eQTLs than not. We therefor limit the representative variantgene pairs used for SURGE optimization to those that are standard eQTLs within the data set (more details presented below). For computational efficiency, we recommend using a maximum of 2000 genomewide representative variantgene pairs for SURGE optimization.
SURGE interactioneQTLs
SURGE optimization on a subset of genomewide representative variantgene pairs will result in approximations to the posterior distributions of the SURGE latent contexts (U) as well as eQTL effect sizes for each of the SURGE latent contexts for only the representative variant gene pairs (V). It is of interest, however, to call interaction eQTLs with respect to each of the SURGE latent contexts for all variant genepairs, not just the subset of representative variantgene pairs used for SURGE optimization.
Therefore, to identify SURGE interactioneQTL for an arbitrary variantgene pair, we treat the expected value of the inferred posterior distribution on the SURGE latent contexts ( \(\widehat{U}\): dim NXK) as observed and optimize the following linear mixed model for each variantgene pair. The linear mixed model is as follows:
Here:

\({y}_{n}\) is the observed expression level of the gene corresponding to the variantgene pair in sample \(n\)

\({g}_{n}\) is the observed genotype of the variant corresponding to the variantgene pair in sample \(n\)

\({X}_{nl}\) is the observed value of covariate \(l\) in sample \(n\)

\(\mu\) is the intercept

\({\alpha }_{i}\) is the random effect intercept for individual \(i\). We use the notation \(n\in i\) to represent the case where sample \(n\) is drawn from individual \(i\)

\({W}_{l}\) is the fixed effect for covariate \(l\)

\({\beta }_{g}\) is the fixed effect for genotype

\({\beta }_{k}\) is the fixed effect of the \({k}^{th}\) latent context

\({\beta }_{gxk}\) is the fixed effect of the interaction between the \({k}^{th}\) latent context and genotype
We use the R package “lme4” to quantify the significance of all K interaction terms: \({\beta }_{gx1}, \dots , {\beta }_{gxk}, \dots , {\beta }_{gxK}\). Intuitively, if the \({k}^{th}\) interaction term (\({\beta }_{gxk}\)) is significant, it implies that the eQTL effect size of this variantpairs significantly changes along latent context \(k\).
Calibration of SURGE interaction eQTLs using permutation analysis
Pvalues resulting from SURGE interactioneQTL analysis are potentially inflated due to SURGE interaction eQTLs being identified from the same data used to learn the SURGE latent contexts. This statistical phenomenon is known as “doubledipping,” and there exist wellstudied approaches to ensure statistical calibration in the presence of “doubledipping” [41,42,43,44]. We used the empirical false discovery rate (eFDR) [45] to generate wellcalibrated SURGE interaction eQTLs. eFDR establishes significance by comparing the observed SURGE interaction eQTL pvalues with an empirical null distribution of pvalues.
In more detail, we use a conservative, permutation analysis to generate an empirical null distribution of genelevel pvalues that can be used to calibrate the observed genelevel pvalues. The permutation analysis consisted of (1) permuting genotype of each individual, (2) reoptimizing SURGE latent contexts (U) using the permuted genotype data, and (3) calling SURGE interactioneQTLs with the permuted genotype data and the SURGE latent contexts learned using the permuted genotype data.
To permute genotype, we generated a single permutation of genotype that was used across all analyzed variants to ensure we did not break the correlation structure across variantgene pairs. In addition, we only permuted genotype across individuals, not RNA samples, to ensure we preserved sample repeat structure. This means that multiple RNA samples from the same individual will always have the same genotype values in the permutation run. Lastly, similar to previous permutation experiments on linearinteraction effects [8, 46], we only permuted the genotype variable in the interaction term while leaving the fixed effect of genotype unpermuted (for both SURGE optimization and SURGE interactioneQTL calling).
Given both the observed and permuted SURGE interaction eQTL pvalues, we generated genelevel pvalues using Bonferroni correction for both observed and permuted interaction eQTLs. We then evaluated genomewide significance of the observed genelevel pvalues using empirical FDR (eFDR) [45] calibrated with the permuted genelevel pvalues. This approach was performed independently for each SURGE latent context. We will refer to this approach as the “percontext eFDR.”
In the real (unpermuted) data, we only called SURGE interactioneQTLs for SURGE latent contexts with PVE > 1e^{−5}. Unsurprisingly, permuted SURGE latent contexts consistently explained less PVE than unpermuted SURGE latent contexts (Additional file 1: Fig. S4, Fig. S15); there existed zero permuted SURGE latent contexts explaining PVE > 1e^{−5} across all experiments. Therefore, if Z SURGE latent contexts have PVE > 1e^{−5}, we selected the top Z permuted SURGE latent contexts to be used in the permuted SURGE interaction eQTL analysis.
We consider an alternative multiple testing procedure where the eFDR is computed jointly across SURGE interaction eQTLs in all SURGE latent contexts instead of computing eFDR in each context independently (percontext eFDR). Specifically, we evaluated genomewide significance of the observed (gene, SURGE context) pair p values using eFDR calibrated using the permuted (gene, SURGE permuted context) pair p values. We will refer to this approach as the “allcontext eFDR.”
However, in concordance with previous multiple testingcorrection procedures for multicontext eQTL calling [5, 24], we recommend using the percontext eFDR.
Expression PC interactioneQTLs
We benchmarked the SURGE interaction eQTLs against expression PC interaction eQTLs. Intuitively, expression PC interaction eQTLs identify statistically significant interactions between a gene expression principal component and the genotype of a particular snp on the expression levels of a particular gene. We set the number of expression PCs equal to the number of SURGE latent contexts.
To identify expression PC interactioneQTLs for an arbitrary variantgene pair, we first use PCA to calculate the expression principal components (E: dim NXK). Then, we optimize the following linear mixed model for each variantgene pair. The linear mixed model is as follows:
Here:

\({y}_{n}\) is the observed expression level of the gene corresponding to the variantgene pair in sample \(n\)

\({g}_{n}\) is the observed genotype of the variant corresponding to the variantgene pair in sample \(n\)

\({X}_{nl}\) is the observed value of covariate \(l\) in sample \(n\)

\(\mu\) is the intercept

\({\alpha }_{i}\) is the random effect intercept for individual \(i\). We use the notation \(n\in i\) to represent the case where sample \(n\) is drawn from individual \(i\)

\({W}_{l}\) is the fixed effect for covariate \(l\)

\({\beta }_{g}\) is the fixed effect for genotype

\({\beta }_{k}\) is the fixed effect of the \({k}^{th}\) expression PC

\({\beta }_{gxk}\) is the fixed effect of the interaction between the \({k}^{th}\) expression PC and genotype
We use the R package “lme4” to quantify the significance of all K interaction terms: \({\beta }_{gx1}, \dots , {\beta }_{gxk}, \dots , {\beta }_{gxK}\). Intuitively, if the \({k}^{th}\) interaction term (\({\beta }_{gxk}\)) is significant, it implies that the eQTL effect size of this variantpairs significantly changes along expression PC \(k\).
Expression PC interaction eQTLs are closely related to CellRegMap interaction eQTLs [21], except CellRegMap uses MOFA factors [25] instead of gene expression PCs, and CellRegMap treats the interaction eQTL effect sizes as random effects instead of fixed effects. It was shown in Appendix Figure S5 of the CellRegMap manuscript [21] that MOFA factors were highly correlated with gene expression PCs, and CellRegMap interaction eQTL pvalues were highly concordant regardless of whether gene expression PCs or MOFA factors were used. Ultimately, we used our own implementation of expression PC interaction eQTL calling simply to minimize modeling assumption differences with SURGE and allow a focused comparison between the use of SURGE latent contexts and expression PCs in interaction eQTL calling.
Application of SURGE to GTEx samples from 10 tissues: expression quantification
To normalize expression from samples from 10 GTEx tissues (Adrenal gland, Colonsigmoid, EsophagusMucosa, MuscleSkeletal, Pituitary, Skinnotsunexposed, Skinsunexposed, smallintestineterminalileum, Stomach, Thyroid), we concatenated logTPM expression measurements across all samples used in the GTEx v8 eQTL analysis for one of those tissues [5]. We also limited to genes that were tested for eQTLs in the GTEx v8 analysis [5] in all 10 tissues. Next, we quantile normalized this matrix to ensure each sample had an equivalent distribution across genes and then standardized each gene (mean 0 and standard deviation 1). We excluded RNA samples that were outliers (Zscore > = 4) according to Mahalanobis distance computed on 80 expression PCs.
Application of SURGE to GTEx samples from 10 tissues: standard eQTL calling
We first tested for standard eQTLs, or association between genotype and the concatenated (across tissues) expression vector described in the “Application of SURGE to GTEx samples from 10 tissues: expression quantification” section. For this analysis, we limited to genes that passed filters described in the “Application of SURGE to GTEx samples from 10 tissues: expression quantification” section. We then limited to variants with MAF > = 0.05 that were less than 50 KB from the transcription start site of a gene. We controlled for the effects of 80 expression PCs and 4 genotype PCs (as recommended by [5] given the sample size). We assessed genomewide significance according to a genelevel Bonferroni correction, followed by a genomewide Benjamini–Hochberg correction.
Application of SURGE to GTEx samples from 10 tissues: SURGE optimization
To select a subset of variantgene pairs to be used for SURGE model optimization, we first limited to variantgene pairs that were standard eQTLs (FDR < = 0.05; see the “Application of SURGE to GTEx samples from 10 tissues: standard eQTL calling” section). This was done to ensure a higher fraction of the variantgene pairs used for SURGE optimization were contextspecific eQTLs as it is known standard eQTLs are more likely to be contextspecific eQTLs than variantgene pairs that are not standard eQTLs. Furthermore, we limited to the most significant variant per gene among the 2000 most significant genes and removed a variantgene pair if the variant was already in the training set for its association with a more significant gene. This yielded 1996 genomewide representative variantgene pairs used for SURGE optimization. We then ran SURGE under default parameter settings over these representative variantgene pairs. We included 80 expression PCs and 4 genotype PCs as covariates in SURGE optimization. The converged SURGE model resulted in 15 latent contexts with PVE \(>1{e}^{5}\) and hundreds of genomewide significant SURGE interaction eQTLs (eFDR < = 0.05) (Additional file 1: Fig. S5, Table S1).
We did not control for a random effects term related to sample repeat structure. While this analysis contained multiple RNAseq samples from the same individual, there were only 5.1 samples per individual on average. This was too few repeated measurements to accurately estimate an additional variance parameter. As a secondary analysis, however, we ran SURGE on this data using a random effect intercept. There are high levels of correlation between the identified SURGE latent contexts between the two models (Additional file 1: Fig. S7).
Application of SURGE to GTEx samples from a single tissue
To run SURGE on GTEx samples from a single GTEx tissue, we took a very similar approach to that described in the “Application of SURGE to GTEx samples from 10 tissues: expression quantification,” “Application of SURGE to GTEx samples from 10 tissues: standard eQTL calling,” and “Application of SURGE to GTEx samples from 10 tissues: SURGE optimization” sections. The only difference is that we now limit to samples from the tissue of interest. Furthermore, we now only control for 60 expression PCs and 2 genotype PCs during standard eQTL calling and SURGE optimization. The converged model resulted in 1 latent contexts with PVE \(>1{e}^{5}\) and 1287 genomewide significant SURGE interaction eQTLs (eFDR < = 0.05).
Application of SURGE to PBMC singlecell eQTL data: pseudocell expression quantification
We imported raw, unnormalized UMI counts from [18]. We used SCRAN [47] to generate lognormalized counts for each cell. We removed genes that were expressed in fewer than 0.5% of cells. We then limited to the top 6000 highly variable genes via the Scanpy function “highly_variable_genes” [48]. We then removed the effects of sequencing batch using Combat [49] as implemented in Scanpy. We then scaled each gene to have mean 0 and variance 1, with a maximum absolute value of 10 to mitigate outlier effects as implemented by “scanpy.pp.scale.”
Next, we sought to generate pseudocells that represented groupings of highly correlated cells within an individual. We first removed individuals from this analysis with fewer than 2500 cells. Next, we performed Leiden clustering as implemented by Scanpy [50] independently in each individual using all default parameters, except we used a finegrained cluster resolution of 10. Here, each Leiden cluster corresponds to a pseudocell. We took the average expression across all cells assigned to the pseudocell to estimate the expression profile of the pseudocell. Finally, we standardized each gene (across pseudocells) to have mean 0 and standard deviation 1, again capping the absolute value of standardized scores to be 10 to mitigate outlier effects. We excluded RNA pseudocells that were outliers (Zscore > = 4) according to Mahalanobis distance computed on 30 expression PCs.
Application of SURGE to PBMC singlecell eQTL data: standard eQTL calling
We first tested for standard eQTLs or association between genotype and the expression vector across pseudocells described in the “Application of SURGE to Yelab generated singlecell eQTL data: pseudocell expression quantification” section. For this analysis, we limited to genes that passed filters described in the “Application of SURGE to Yelab generated singlecell eQTL data: pseudocell expression quantification” section. We then limited to variants with MAF > = 0.05 that were less than 200 KB from the transcription start site of a gene. We controlled for the effects of 30 expression PCs and 2 genotype PCs. We controlled for samplerepeat structure stemming from multiple pseudocells originating from the same individual using a random effects intercept for each individual. We assessed genomewide significance according to a genelevel Bonferroni correction, followed by a genomewide Benjamini–Hochberg correction.
Application of SURGE to PBMC singlecell eQTL data: SURGE optimization
To select a subset of variantgene pairs to be used for SURGE model optimization, we first limited to variantgene pairs that were standard eQTLs (FDR < = 0.05; see the “Application of SURGE to Yelab generated singlecell eQTL data: standard eQTL calling” section). This was done to ensure a higher fraction of the variantgene pairs used for SURGE optimization were contextspecific eQTLs as it is known standard eQTLs are more likely to be contextspecific eQTLs than variantgene pairs that are not standard eQTLs. Furthermore, we limited to the most significant variant per gene among the 2000 most significant genes and removed a variantgene pair if the variant was already in the training set for its association with a more significant gene. We than ran SURGE under default parameter settings over these representative variantgene pairs. We included 30 expression PCs and 2 genotype PCs as covariates in SURGE as well as a random effect intercept term for each individual. The converged SURGE model resulted in 6 latent contexts with PVE \(>1{e}^{5}\) and hundreds of genomewide significant SURGE interaction eQTLs (eFDR < = 0.1) (Additional file 1: Fig. S16, Table S6).
Gene set enrichment analysis
We tested enrichment of genes whose expression levels was highly correlated with SURGE latent contexts (identified when SURGE was applied to singlecell PBMC data) within known gene sets. Specifically for each SURGE latent context, we identified the 50 genes whose expression levels across pseudocells were most strongly correlated (absolute value of correlation coefficient) with the SURGE latent context. We then tested gene set enrichment of these 50 genes relative to all genes that passed filters described in the “Application of SURGE to PBMC singlecell eQTL data: pseudocell expression quantification” section. We tested enrichment of these strongly correlated genes in both the Hallmark gene set and the MSigDB Biological Process gene set [32] (Additional file 1: Table S10, Table S11).
Application of stratified LD score regression (SLDSC)
Recall, SURGE interaction eQTLs for a specific variantgene pair can be identified by evaluating the following likelihood (see the “Methods” section “Surge interaction eQTLs” for more details):
Upon maximizing this likelihood (assume \(\widehat{{\beta }_{g}}\) and \(\widehat{{\beta }_{gxk}}\) are the estimated values of \({\beta }_{g}\) and \({\beta }_{gxk}\) that maximize the likelihood), we can estimate the expected eQTL effect size for the variantgene pair for a particular value of a latent context of U using the following function:
Here, \({\beta }^{*}\) is the expected eQTL effect size for the particular variantgene pair when the \({k}^{th}\) latent context value of \(U\) is equal to \({U}_{k}^{*}\). Ultimately, this enables us to compute the expected eQTL effect size for all variantgene pairs when the \({k}^{th}\) latent context value of U is equal to \({U}_{k}^{*}\).
We use the above expectation to assess how eQTL enrichment in complex trait and disease heritability varied along the SURGE latent contexts. Specifically, for each SURGE latent context, we generated 200 equally spaced positions along the range of SURGE latent context values. For each of those 200 positions, we computed the expected eQTL effect sizes (using the above expectation of \({\beta }^{*}\)) for all variantgene pairs. We then used the squared expected eQTL effect size across variantgene pairs as annotation in SLDSC [23, 35] along with all BaselineLD v2.2 annotations excluding four QTL related annotations (“GTEx_eQTL_MaxCPP,” “BLUEPRINT_H3K27acQTL_MaxCPP,” “BLUEPRINT_H3K4me1QTL_MaxCPP,” “BLUEPRINT_DNA_methylation_MaxCPP”). If a given variant mapped to multiple genes, we used the sum of squared expected eQTL effect sizes across genes as the annotation similar to [16]. This analysis was done for each of the 200 equally spaced positions for each of the 5 SURGE latent contexts identified when SURGE was run on the singlecell PBMC eQTL data.
Availability of data and materials
SURGE software is available on GitHub at https://github.com/BennyStrobes/surge under an MIT license. The GTEx v8 data [5] can be downloaded from the dbGaP website under phs000424.v8.p2 and on the GTEx portal (http://gtexportal.org/). PBMC singlecell eQTL [18] expression data are available in the Human Cell Atlas Data Coordination Platform and at GEO accession number GSE174188. PBMC singlecell eQTL [18] genotype data are available at dbGaP accession number phs002812.v1.p1. Additionally, the SURGE source code used in this study is also accessible on Zenodo (https://doi.org/https://doi.org/10.5281/zenodo.10383060) [51].
References
Nica AC, Dermitzakis ET. Expression quantitative trait loci: present and future. Philos Trans R Soc Lond B Biol Sci. 2013;368:20120362.
Lappalainen T, The Geuvadis Consortium, Sammeth M, Friedländer MR, ‘tHoen PAC, Monlong J, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501:506–11.
Battle A, Mostafavi S, Zhu X, Potash JB, Weissman MM, McCormick C, et al. Characterizing the genetic basis of transcriptome diversity through RNAsequencing of 922 individuals. Genome Res. 2014;24:14–24.
Kerimov N, Hayhurst JD, Peikova K, Manning JR, Walter P, Kolberg L, et al. A compendium of uniformly processed human gene expression and splicing quantitative trait loci. Nat Genet. 2021;53:1290–9.
The GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369:1318–30.
Võsa U, Claringbould A, Westra HJ, Bonder MJ, Deelen P, Zeng B, et al. Largescale cis and transeQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat Genet. 2021;53:1300–10.
Knowles DA, Burrows CK, Blischak JD, Patterson KM, Serie DJ, Norton N, et al. Determining the genetic basis of anthracyclinecardiotoxicity by molecular response QTL mapping in induced cardiomyocytes. Elife. 2018;7: e33480.
Strober BJ, Elorbany R, Rhodes K, Krishnan N, Tayeb K, Battle A, et al. Dynamic genetic regulation of gene expression during cellular differentiation. Science. 2019;364:1287–90.
Cuomo ASE, Seaton DD, McCarthy DJ, Martinez I, Bonder MJ, GarciaBernardo J, et al. Singlecell RNAsequencing of differentiating iPS cells reveals dynamic genetic effects on gene expression. Nat Commun. 2020;11:810. https://doi.org/10.1038/s4146702014457z.
Jerber J, Seaton DD, Cuomo ASE, Kumasaka N, Haldane J, Steer J, et al. Populationscale singlecell RNAseq profiling across dopaminergic neuron differentiation. Nat Genet. 2021;53:304–12.
Umans BD, Battle A, Gilad Y. Where are the diseaseassociated eQTLs? Trends Genet. 2021;37:109–24.
Elorbany R, Popp JM, Rhodes K, Strober BJ, Barr K, Qi G, et al. Singlecell sequencing reveals lineagespecific dynamic genetic regulation of gene expression during human cardiomyocyte differentiation. PLoS Genet. 2022;18:e1009666.
Nathan A, Asgari S, Ishigaki K, Valencia C, Amariuta T, Luo Y, et al. Singlecell eQTL models reveal dynamic T cell state dependence of disease loci. Nature. 2022;606:120–8.
Yazar S, AlquiciraHernandez J, Wing K, Senabouth A, Gordon MG, Andersen S, et al. Singlecell eQTL mapping identifies cell typespecific genetic control of autoimmune disease. Science. 2022;376:eabf3041.
Chun S, Casparino A, Patsopoulos NA, CroteauChonka DC, Raby BA, De Jager PL, et al. Limited statistical evidence for shared genetic effects of eQTLs and autoimmunediseaseassociated loci in three major immunecell types. Nat Genet. 2017;49:600–5.
Yao DW, O’Connor LJ, Price AL, Gusev A. Quantifying genetic effects on disease mediated by assayed gene expression levels. Nat Genet. 2020;52:626–33.
Mostafavi H, Spence JP, Naqvi S, Pritchard JK. Limited overlap of eQTLs and GWAS hits due to systematic differences in discovery. bioRxiv. 2022;2022.05.07.491045. https://doi.org/10.1101/2022.05.07.491045
Perez RK, Gordon MG, Subramaniam M, Kim MC, Hartoularos GC, Targ S, et al. Singlecell RNAseq reveals cell typespecific molecular and genetic associations to lupus. Science. 2022;376:eabf1970.
van der Wijst MGP, Brugge H, de Vries DH, Deelen P, Swertz MA, LifeLines Cohort Study, et al. Singlecell RNA sequencing identifies celltypespecific ciseQTLs and coexpression QTLs. Nat Genet. 2018;50:493–7.
Findley AS, Monziani A, Richards AL, Rhodes K, Ward MC, Kalita CA, et al. Functional dynamic genetic effects on gene regulation are specific to particular cell types and environmental conditions. Elife. 2021;10:e67077. https://doi.org/10.7554/eLife.67077.
Cuomo ASE, Heinen T, Vagiaki D, Horta D, Marioni JC, Stegle O. Cell RegMap: a statistical framework for mapping contextspecific regulatory variants using scRNAseq. Mol Syst Biol. 2022;18:e10663.
Giambartolomei C, Vukcevic D, Schadt EE, Franke L, Hingorani AD, Wallace C, et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLos Genet. 2014;10:e1004383.
Finucane HK, BulikSullivan B, Gusev A, Trynka G, Reshef Y, Loh PR, et al. Partitioning heritability by functional annotation using genomewide association summary statistics. Nat Genet. 2015;47:1228–35.
KimHellmuth S, Aguet F, Oliva M, MuñozAguirre M, Kasela S, Wucher V, et al. Cell typespecific genetic regulation of gene expression across human tissues. Science. 2020;369:eaaz8528. https://doi.org/10.1126/science.aaz8528.
Argelaguet R, Arnol D, Bredikhin D, Deloro Y, Velten B, Marioni JC, et al. MOFA+: a statistical framework for comprehensive integration of multimodal singlecell data. Genome Biol. 2020;21:111.
Wipf D, Nagarajan S. A new view of automatic relevance determination. Available: https://papers.nips.cc/paper/2007/file/9c01802ddb981e6bcfbec0f0516b8e35Paper.pdf.Cited 22 Nov 2022
Vochteloo M, Deelen P, Vink B, Tsai EA, Runz H, AndreuSánchez S, et al. Unbiased identification of unknown cellular and environmental factors that mediate eQTLs using principal interaction component analysis. bioRxiv. 2022. https://doi.org/10.1101/2022.07.28.501849
Gewirtz AD, Townes FW, Engelhardt BE. Telescoping bimodal latent Dirichlet allocation to identify expression QTLs across tissues. Life Sci Alliance. 2022;5:e202101297. https://doi.org/10.26508/lsa.202101297.
Gewirtz ADH, Townes FW, Engelhardt BE. Expression QTLs in singlecell sequencing data. bioRxiv. 2022. https://doi.org/10.1101/2022.08.14.503915
Aran D, Hu Z, Butte AJ. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 2017;18:220.
Baran Y, Bercovich A, SebePedros A, Lubling Y, Giladi A, Chomsky E, et al. MetaCell: analysis of singlecell RNAseq data using Knn graph partitions. Genome Biol. 2019;20:206.
Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 2015;1:417–25.
Theofilopoulos AN, Koundouris S, Kono DH, Lawson BR. The role of IFNgamma in systemic lupus erythematosus: a challenge to the Th1/Th2 paradigm in autoimmunity. Arthritis Res. 2001;3:136–41.
Schroder K, Hertzog PJ, Ravasi T, Hume DA. Interferongamma: an overview of signals, mechanisms and functions. J Leukoc Biol. 2004;75:163–89.
Gazal S, Finucane HK, Furlotte NA, Loh PR, Palamara PF, Liu X, et al. Linkage disequilibriumdependent architecture of human complex traits shows action of negative selection. Nat Genet. 2017;49:1421–7.
du Pré MF, Sollid LM. Tcell and Bcell immunity in celiac disease. Best Pract Res Clin Gastroenterol. 2015;29:413–23.
Jagadeesh KA, Dey KK, Montoro DT, Mohan R, Gazal S, Engreitz JM, et al. Identifying diseasecritical cell types and cellular processes by integrating singlecell RNAsequencing and human genetics. Nat Genet. 2022;54:1479–92.
Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC, et al. MultiOmics Factor Analysis—a framework for unsupervised integration of multiomics data sets. Mol Syst Biol. 2018;14: e8124.
Blei DM, Kucukelbir A, McAuliffe JD. Variational inference: a review for statisticians. J Am Stat Assoc. 2017;112:859–77.
Wang W, Stephens M. Empirical Bayes matrix factorization. arXiv [stat.ME]. 2018. Available: http://arxiv.org/abs/1802.06931
Chung NC, Storey JD. Statistical significance of variables driving systematic variation in highdimensional data. Bioinformatics. 2015;31:545–54.
Chung NC. Statistical significance of cluster membership for unsupervised evaluation of cell identities. Bioinformatics. 2020;36:3107–14.
Chen YT, Witten DM. Selective inference for kmeans clustering. arXiv [stat.ME]. 2022. Available: http://arxiv.org/abs/2203.15267
Neufeld A, Gao LL, Popp J, Battle A, Witten D. Inference after latent variable estimation for singlecell RNA sequencing data. arXiv [stat.ME]. 2022. Available: http://arxiv.org/abs/2207.00554
Gamazon ER, Huang RS, Dolan ME, Cox NJ, Im HK. Integrative genomics: quantifying significance of phenotypegenotype relationships from multiple sources of highthroughput data. Front Genet. 2012;3:202.
Knowles DA, Davis JR, Edgington H, Raj A, Favé MJ, Zhu X, et al. Allelespecific expression reveals interactions between genetic variation and environment. Nat Methods. 2017;14:699–702.
Lun ATL, Bach K, Marioni JC. Pooling across cells to normalize singlecell RNA sequencing data with many zero counts. Genome Biol. 2016;17:75.
Wolf FA, Angerer P, Theis FJ. SCANPY: largescale singlecell gene expression data analysis. Genome Biol. 2018;19:15. https://doi.org/10.1186/s1305901713820.
Zhang Y, Parmigiani G, Johnson WE. ComBatseq: batch effect adjustment for RNAseq count data. NAR Genom Bioinform. 2020;2:lqaa078.
Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing wellconnected communities. Sci Rep. 2019;9:5233.
Strober BJ, Tayeb K, Popp J, Qi G, Gordon M, Perez R, Ye C, Battle A. SURGE: uncovering contextspecific geneticregulation of gene expression from singlecell RNAsequencing using latentfactor models. https://github.com/bennystrobes/surgehttps://doi.org/10.5281/zenodo.10383060 (2023).
Acknowledgements
We thank Alkes L. Price for insights into the SLDSC analysis and for providing comments on the manuscript.
Peer review information
Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Review history
The review history is available as Additional file 2.
Funding
AB was supported by NIH/NIGMS R35GM139580, NIH/NIDDK R01DK122586, and the Chan Zuckerberg Initiative. BJS was funded in part by U01 HG012009.
Author information
Authors and Affiliations
Contributions
AB and BJS proposed the idea for the project. BJS developed the SURGE model and performed the analysis. BJS and AB wrote the manuscript. KT aided in deriving the variational updates for SURGE. JP provided review of SURGE code. JP, GQ, and AB suggested relevant downstream analysis to run based on SURGE output. MGG, RP, and CJY provided guidance on analysis of singlecell eQTL data.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
AB is a shareholder of Alphabet, Inc., and a consultant for Third Rock Ventures.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
Supplementary methods. Fig. S1. Computational runtime of SURGE latent context inference and SURGE interaction eQTL calling. Fig. S2. Evaluation of SURGE’s ability to recapture simulated latent contexts simulations. Fig. S3. Evaluation of SURGE’s ability to identify correct number of simulated latent contexts in simulations. Fig. S4. Proportion of expression variance explained by SURGE latent contexts when SURGE was applied to samples concatenated across 10 GTEx v8 tissues. Fig. S5. QQ plot for SURGE interaction eQTLs identified when SURGE was applied to samples concatenated across 10 GTEx v8 tissues. Fig. S6. QQ plot for SURGE interaction eQTLs relative to expression PC interaction eQTLs when SURGE was applied to samples concatenated across 10 GTEx v8 tissues. Fig. S7. Comparison of inferred SURGE latent contexts with and without random effects included when SURGE was applied to samples concatenated across 10 GTEx v8 tissues. Fig. S8. SURGE latent context 4 and SURGE latent context 7 are explained by genotype PC1 when SURGE was applied to samples concatenated across 10 GTEx v8 tissues. Fig. S9. Relationship between SURGE latent contexts and xCell cell type enrichment score when SURGE was applied to samples concatenated across 10 GTEx v8 tissues. Fig. S10. Relationship between SURGE latent context 5 and Epithelial cell type enrichment score, and SURGE latent context 6 and Neuron cell type enrichment score. Fig. S11. Correlation between SURGE latent contexts and gene expression principal components and genotype principal components when SURGE was applied to samples concatenated across 10 GTEx v8 tissues. Fig. S12. Relationship between SURGE latent context 1 and xCell cell type enrichment score when SURGE was applied to samples from ColonSigmoid GTEx v8 tissue. Fig. S13. P values of SURGE latent context 1 interaction eQTLs when SURGE was applied to samples from only ColonSigmoid GTEx v8 tissue compared to pvalues of interaction eQTLs using xCell cell type enrichment score. Fig. S14. Distribution of cells per pseudocell and distribution of pseudocells per individual. Fig. S15. Proportion of expression variance explained by SURGE latent contexts when SURGE was applied to PBMC pseudocells. Fig. S16. QQ plot for SURGE interaction eQTLs identified when SURGE was applied to PBMC pseudocells. Fig. S17. QQ plot for SURGE interaction eQTLs relative to expression PC interaction eQTLs when SURGE was applied to PBMC pseudocells. Fig. S18. SURGE latent contexts capture known cell type when SURGE was applied to PBMC pseudocells. Fig. S19. UMAPprojected SURGE latent context loadings of pseudocells colored by cell type labels when SURGE was applied to PBMC pseudocells. Fig. S20. UMAPprojected SURGE latent context loadings of pseudocells colored by cell type marker genes when SURGE was applied to PBMC pseudocells. Fig. S21. SURGE latent context 1 captures differences in disease status of individuals when SURGE was applied to PBMC pseudocells. Fig. S22. Correlation between SURGE latent contexts and gene expression principal components when SURGE was applied to PBMC pseudocells. Fig. S23. Variance explained of SURGE latent contexts by various pseudocell sample characteristics when SURGE was applied to PBMC pseudocells. Fig. S24. Number of colocalizations identified between 15 GWAS studies and various types of eQTLs called on PBMC pseudocells. Fig. S25. Number of trait colocalizations with SURGE interaction eQTLs stratified by SURGE latent context when SURGE was applied to PBMC pseudocells. Fig. S26. Complex tratit SLDSC enrichment along eQTLs of SURGE latent contexts. Table S1. Number of genes with a genomewide significant SURGE interaction eQTL according to per “per context empirical FDR” when SURGE was applied to samples concatenated across 10 GTEx v8 tissues. Table S2. Number of genes with a genomewide significant SURGE interaction eQTL according to per “all context empirical FDR” when SURGE was applied to samples concatenated across 10 GTEx v8 tissues. Table S3. Association significance between tissue identity and SURGE latent context when SURGE was applied to samples concatenated across 10 GTEx v8 tissues. Table S4. Association significance between known ancestry and SURGE latent context when SURGE was applied to samples concatenated across 10 GTEx v8 tissues. Table S5. Pvalue of association between SURGE latent context and xCell cell type enrichment scores when SURGE was applied to samples concatenated across 10 GTEx v8 tissues. Table S6. Number of genes with a genomewide significant SURGE interaction eQTL according to per “per context empirical FDR” when SURGE was applied to PBMC pseudocells. Table S7. Number of genes with a genomewide significant SURGE interaction eQTL according to per “all context empirical FDR” when SURGE was applied to PBMC pseudocells. Table S8. Association significance between known cell type and SURGE latent context when SURGE was applied to PBMC pseudocells. Table S9. Association significance between known finegrained cell type and SURGE latent context when SURGE was applied to PBMC pseudocells. Table S10. Hallmark gene set enrichment analysis of genes strongly correlated with SURGE latent contexts when SURGE was applied to PBMC pseudocells. Table S11. MSigDB Biological Process gene set enrichment analysis of genes strongly correlated with SURGE latent contexts when SURGE was applied to PBMC pseudocells.
Additional file 2.
Review history.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Strober, B.J., Tayeb, K., Popp, J. et al. SURGE: uncovering contextspecific geneticregulation of gene expression from singlecell RNA sequencing using latentfactor models. Genome Biol 25, 28 (2024). https://doi.org/10.1186/s1305902303152z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1305902303152z