f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq

Single-cell RNA-sequencing (scRNA-seq) allows studying heterogeneity in gene expression in large cell populations. Such heterogeneity can arise from technical or biological factors, making it difficult to decompose the underlying sources of variation. We here describe f-scLVM (factorial single-cell latent variable model), a method based on factor analysis that uses pathway annotations to guide the inference of interpretable factors underpinning the heterogeneity. Our model jointly estimates the relevance of individual factors, refines gene set annotations, and infers factors without annotation. In applications to multiple scRNA-seq datasets, we find that f-scLVM robustly decomposes scRNA-seq datasets into interpretable components, thereby facilitating the identification of novel subpopulations. Electronic supplementary material The online version of this article (doi:10.1186/s13059-017-1334-8) contains supplementary material, which is available to authorized users.


The f-scLVM model
We here derive f-scLVM starting from the perspective of conventional factor analysis. Let Y be the N × G matrix of log-count expression levels for G genes observed in each of N samples (cells). We start with an additive linear model that factorizes the expression matrix into the sum of known covariates, annotated factors and unannotated factors:

$$\mathbf{Y} = \sum_{c=1}^{C} \mathbf{u}_c \mathbf{V}_c^{T} + \sum_{a=1}^{A} \mathbf{r}_a \mathbf{R}_a^{T} + \sum_{h=1}^{H} \mathbf{s}_h \mathbf{Q}_h^{T} + \boldsymbol{\psi}. \quad (1)$$

Here, the vectors u_c, r_a and s_h denote known cell covariates as well as the factor states of annotated and unannotated factors, and V_c, R_a and Q_h are the corresponding regulatory weights of a given factor on all genes. To simplify the derivation, we collapse the factors and weights, defining X = [u_1, ..., u_C, r_1, ..., r_A, s_1, ..., s_H] and a corresponding concatenated weight matrix W, resulting in

$$\mathbf{Y} = \mathbf{X}\mathbf{W}^{T} + \boldsymbol{\psi}. \quad (2)$$

Here, W denotes a G × K weight matrix that determines the regulatory effect of each factor k ∈ (1, ..., K) on gene g ∈ (1, ..., G). The N × K matrix X denotes the activity of each of the K = C + A + H factors in each sample, and ψ is residual noise.
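The collapsed model Y = XW^T + ψ can be sketched generatively. The following is an illustrative NumPy sketch with arbitrary dimensions and random parameter values (not the fitted model), showing how the three factor types stack into a single activation matrix X:

```python
import numpy as np

rng = np.random.default_rng(0)
N, G = 50, 200            # cells, genes
C, A, H = 1, 3, 2         # known covariates, annotated factors, unannotated factors
K = C + A + H

# Factor activations X (N x K) and weights W (G x K); values are illustrative.
X = rng.standard_normal((N, K))
W = 0.5 * rng.standard_normal((G, K))

# Heteroscedastic residuals: per-gene precisions tau_g, psi ~ N(0, diag(tau^-1)).
tau = rng.gamma(2.0, 1.0, size=G)
psi = rng.standard_normal((N, G)) / np.sqrt(tau)

Y = X @ W.T + psi         # N x G expression matrix
```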
We start by assuming Gaussian distributed residuals, as in conventional factor analysis [18] (see Section 1.3, where we discuss generalizations to count-based and dropout noise models):

$$\boldsymbol{\psi} \sim \mathcal{N}\left(\mathbf{0},\, \mathrm{diag}(\boldsymbol{\tau}^{-1})\right). \quad (3)$$

Here, diag(τ^{-1}) denotes the diagonal covariance matrix formed from the inverse noise precisions for each dimension (gene), τ = (τ_1, ..., τ_G). Together with Eq. (1), this noise model implies a Gaussian marginal likelihood of the form

$$P(\mathbf{Y} \mid \mathbf{X}, \mathbf{W}, \boldsymbol{\tau}) = \prod_{n=1}^{N} \prod_{g=1}^{G} \mathcal{N}\left(y_{n,g} \,\middle|\, \mathbf{x}_{n,:}\, \mathbf{w}_{g,:}^{T},\, 1/\tau_g\right). \quad (4)$$

We introduce a conjugate prior on the noise precisions,

$$P(\tau_g) = \mathrm{Gamma}(\tau_g \mid a_\tau, b_\tau), \quad (5)$$

where Gamma denotes the gamma distribution. The prior on the factor activities X is an independent normal distribution with unit variance,

$$P(\mathbf{X}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_{n,k} \mid 0, 1). \quad (6)$$
Depending on the specific choice of the prior distribution for the weight matrix W, different factor models can be derived, including independent component analysis or conventional factor analysis [18]. We employ a structured sparsity prior that jointly models gene set annotations.

Modeling annotated and unannotated factors using a structured sparsity prior
An important difference between f-scLVM and conventional factor analysis is the two-level regularization on W, inducing structured sparsity on the weights and thereby interpretability of the corresponding factors. Specifically, we first use a gene-level sparsity prior on the elements of individual columns of W [16,21]. As a second level of sparseness, we employ a relevance prior on the level of factors, corresponding to columns of W, thereby deactivating factors that are unused [15].
We start by describing the structured sparsity prior for annotated factors.

Modeling annotated factors
Sparseness of the factor weights is encouraged via a spike-and-slab prior:

$$P(w_{g,k} \mid z_{g,k}) = \begin{cases} \mathcal{N}(w_{g,k} \mid 0,\, 1/\alpha_k) & \text{if } z_{g,k} = 1 \\ \delta_0(w_{g,k}) & \text{otherwise.} \end{cases} \quad (7)$$
Here, δ_0(w_{g,k}) denotes the Dirac delta function centered on zero (inactive links) and 1/α_k is the prior variance of the weights for active links (factor-specific; see also Section 1.1.3). The indicator variable z_{g,k} determines whether factor k has a regulatory effect on gene g (z_{g,k} = 1) or not (z_{g,k} = 0).
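A single column of W under this prior can be sketched by first drawing the indicators and then the active weights; this is a minimal illustration with hypothetical values for π and α_k:

```python
import numpy as np

rng = np.random.default_rng(1)
G = 5000
pi = 0.5          # prior inclusion probability (uninformative, as for annotated factors)
alpha_k = 4.0     # factor-specific precision; 1/alpha_k is the slab variance

z = rng.random(G) < pi                                            # z_gk ~ Bernoulli(pi)
w = np.where(z, rng.normal(0.0, 1.0 / np.sqrt(alpha_k), G), 0.0)  # slab, or exact zero
```

Inactive links carry exactly zero weight, which is what distinguishes the spike-and-slab prior from a purely shrinkage-based (e.g. Laplace) prior.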
To achieve identifiability of the fitted factors as pathways, we link the binary indicators z_{g,k} to binary gene set annotations by explicitly modelling the annotations as observed data:

$$P(I^{n}_{g,k} = 1 \mid z_{g,k}) = \begin{cases} 1 - \mathrm{FNR} & \text{if } z_{g,k} = 1 \\ \mathrm{FPR} & \text{if } z_{g,k} = 0. \end{cases} \quad (8)$$

Here, the observed binary indicator I^n_{g,k} determines whether gene g is annotated to a given pathway (factor) k in the annotation. The annotations are replicated such that for each sample n a complete annotation is available. Technically, this approach is equivalent to scaling the likelihood component of the annotation with the number of cells. Since the likelihood component that links the indicator z_{g,k} to the observed expression data (Eq. (7)) scales with the number of cells, this ensures that the relative contribution of the annotations is independent of dataset size. The rate parameter FPR corresponds to the probability of false-positive assignments in the annotation, and FNR denotes the probability of false-negative assignments.
Finally, for annotated factors the indicator variables z_{g,k} are a priori Bernoulli distributed:

$$P(z_{g,k} = 1) = \mathrm{Bernoulli}(z_{g,k} \mid \pi). \quad (9)$$

For annotated factors the sparseness structure is determined by the annotation, and hence we choose an uninformative prior, π = 0.5. The joint probability of all observed and unobserved variables then follows as

$$P(\mathbf{Y}, \mathbf{I}, \mathbf{X}, \mathbf{W}, \mathbf{Z}, \boldsymbol{\tau}, \boldsymbol{\alpha}) = P(\mathbf{Y} \mid \mathbf{X}, \mathbf{W}, \boldsymbol{\tau})\, P(\mathbf{I} \mid \mathbf{Z})\, P(\mathbf{W} \mid \mathbf{Z}, \boldsymbol{\alpha})\, P(\mathbf{Z})\, P(\mathbf{X})\, P(\boldsymbol{\tau})\, P(\boldsymbol{\alpha}). \quad (10)$$

Here, I^n_{g,k} is identical for all cell copies, i.e. I^n_{g,k} = I_{g,k} ∀ n. A graphical model representation of the full model is shown in Figure M1.
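The Bernoulli annotation likelihood with false-positive and false-negative rates can be sketched as a small function; `log_p_annotation` is a hypothetical helper, and the FPR/FNR values are those used in the experiments (1 − FPR = 99%, FNR = 0.1%):

```python
import numpy as np

FPR, FNR = 0.01, 0.001    # hyperparameter values used in the experiments

def log_p_annotation(I, z):
    """Log-likelihood log P(I | z) of the Bernoulli annotation model:
    an annotated gene (I=1) is seen with probability 1-FNR if z=1
    (true positive), and with probability FPR if z=0 (false positive)."""
    I, z = np.asarray(I), np.asarray(z)
    p_one = np.where(z == 1, 1.0 - FNR, FPR)   # P(I = 1 | z)
    return np.where(I == 1, np.log(p_one), np.log(1.0 - p_one))
```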

Modelling unannotated factors and known covariates
The effect of factors that are not included in the pathway annotation is modeled within the same framework. For unannotated factors there exists no prior information, which means that the likelihood component for the annotation prior (Eq. (8)) is omitted. The prior probability π of a regulatory effect for unannotated factors determines the expected sparseness level (see Eq. (9)). In the experiments, we consider two different types of factors. For sparse factors we set π = 0.1, which corresponds to the belief that 10% of the genes are regulated by these factors. Additionally, we model dense factors with π = 0.99. Sparse factors tend to explain biological variation that is not well captured by the pre-annotated gene sets, whereas dense factors frequently correspond to confounding factors. These principles of sparse versus dense effects are similar to ideas that have previously been considered in population genomics [14,22].
For details and guidelines on how to set model parameters for learning unannotated factors, see Section 3.3.
Finally, cell covariates (e.g. the number of expressed genes [7], size factors, etc.) are treated analogously to dense factors. Importantly, however, their factor states x_k are observed and do not need to be inferred during training.
Supp. Meth. Figure M1: Graphical model representation of f-scLVM. Circled variables are random and unobserved. Double-circled variables denote observed data, including gene expression profiles y_{n,g} and the annotation data I^n_{g,k}. Statistical dependencies between variables are indicated by arrows. The filled circles linking w_{g,k} and z_{g,k} denote the sparsity likelihood. For simplicity we have omitted an explicit representation of unannotated factors and cell covariates, for which no prior annotations are provided.

Automatic relevance determination for identifying relevant factors
A conventional spike and slab prior (Eq. (7)) is based on the assumption that each factor explains the same prior variance, although the set of active genes may differ between factors.
In particular for large annotations with hundreds of factors this assumption is likely false, as only a subset of the pathways will be active in a given dataset. To model differences in overall regulatory importance across factors, we use a second level of regularization based on the automatic relevance determination (ARD) prior [15]. The ARD prior has been shown to be effective for shrinking factors with low relevance, for example in the context of conventional FA models (e.g. [22,6]). This factor-level regularization is achieved by placing a hierarchical Gamma prior on the precision parameters of the regulatory prior for each factor k (see Eq. (7)):

$$P(\alpha_k) = \mathrm{Gamma}(\alpha_k \mid a_\alpha, b_\alpha). \quad (11)$$

For factors that do not explain variation in the data, the precision α_k will be large, which corresponds to a small prior variance of the corresponding factor weights. The posterior distribution over the relevance parameters α_k can also be used to deduce the importance of individual factors; see Section 3.6.
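The relevance readout can be sketched directly from the posterior Gamma parameters of each α_k; the parameter values below are hypothetical, chosen so that the three factors span a wide relevance range:

```python
import numpy as np

# Hypothetical posterior Gamma parameters (a_hat, b_hat) of Q(alpha_k) for three factors.
a_hat = np.array([10.0, 10.0, 10.0])
b_hat = np.array([5000.0, 50.0, 0.5])

alpha_mean = a_hat / b_hat        # posterior mean precision <alpha_k>
relevance = 1.0 / alpha_mean      # expected variance explained by each factor

# A large precision (small 1/alpha_k) shrinks the factor's weights towards zero,
# effectively switching the factor off.
```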

Parameter inference
Closed-form inference in sparse factor analysis is not tractable; in general, computationally expensive Monte Carlo simulations or other approximate inference schemes are required. To enable the application of f-scLVM to large datasets with up to hundreds of thousands of cells and genome-wide expression counts, we use an efficient deterministic approximation instead of Monte Carlo schemes: a fully factorized variational Bayesian approximation that scales linearly in the number of cells and genes, rendering applications to larger datasets feasible.
Briefly, in variational Bayesian inference, the intractable posterior distribution of the latent (unobserved) model parameters P(H | D) is approximated by a simpler (partially) factorized form Q(H) = ∏_i Q(H_i | θ_i). Here, D denotes the observed data and θ_i are variational parameters that parametrize the individual variational factors Q(H_i | θ_i).
The objective of variational Bayesian inference is to determine variational parameters θ_i such that the Kullback-Leibler (KL) divergence between the true posterior P(H | D) and the variational approximation Q(H) is minimized. The use of the KL divergence as a measure of distributional distance lends itself to an iterative algorithm that updates the variational parameters of individual factors sequentially. Under this approximation, the log marginal likelihood is bounded by

$$\log P(\mathcal{D}) \geq \mathbb{E}_{Q}\left[\log P(\mathcal{D}, \mathcal{H})\right] - \mathbb{E}_{Q}\left[\log Q(\mathcal{H})\right] =: \mathcal{F}. \quad (12)$$

This algorithm is guaranteed not to increase the KL divergence in any iteration and generalizes the widely used Expectation Maximization algorithm. For a comprehensive overview of variational Bayesian approximate inference, see for example [5]. In each iteration, the parameters of individual variational factors θ_i are updated in turn, given the current state of all other factors [11,2,3]. For any chosen factorization, it can be shown that the optimal update for each factor is obtained from the average log likelihood under all other Q-distributions:

$$\log Q(\mathcal{H}_i) = \left\langle \log P(\mathcal{D}, \mathcal{H}) \right\rangle_{\prod_{j \neq i} Q(\mathcal{H}_j)} + \mathrm{const.} \quad (13)$$

These updates of the individual Q-distributions are performed sequentially until convergence is reached. If the chosen factorization matches the prior factorization of the model, the step in Eq. (13) corresponds to updating the variational parameters θ_i, while the functional form of each approximate Q-distribution remains in the same class as the corresponding prior distribution. For brevity, we will in the following omit the explicit dependency of each variational factor on its parameters θ_i.
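The coordinate updates of Eq. (13) are easiest to see on a toy problem. The sketch below is not the f-scLVM update scheme; it is a standard mean-field example for a Gaussian with unknown mean and precision (conjugate Normal-Gamma priors), where the two Q-factors are updated in turn using expectations under each other:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(3.0, 1.0, size=500)
N, xbar = len(x), x.mean()

# Broad priors: mu ~ N(mu0, (lam0 * tau)^-1), tau ~ Gamma(a0, b0).
mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3

E_tau = 1.0                       # initial guess for <tau>
for _ in range(50):
    # Update Q(mu) = N(mu_N, 1/lam_N) given the current <tau>.
    lam_N = (lam0 + N) * E_tau
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    # Update Q(tau) = Gamma(a_N, b_N) given <mu> and <mu^2>.
    a_N = a0 + 0.5 * (N + 1)
    b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + N / lam_N
                      + lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N))
    E_tau = a_N / b_N
```

Each sweep leaves the Q-distributions in the same family as the priors, exactly as described for the matched factorization above.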
Variational factorization of the model The first step in deriving a variational inference algorithm for f-scLVM is to re-parameterize the model without the Dirac function (Eq. (7)). To this end, the elements w_{g,k} are modeled as an (element-wise) product of a Bernoulli random variable z_{g,k} and a Gaussian random variable w̃_{g,k} [25],

$$w_{g,k} = \tilde{w}_{g,k}\, z_{g,k}. \quad (14)$$

The joint prior distribution of these two new random variables follows as

$$P(\tilde{w}_{g,k}, z_{g,k}) = \mathcal{N}(\tilde{w}_{g,k} \mid 0,\, 1/\alpha_k)\, \mathrm{Bernoulli}(z_{g,k} \mid \rho_{g,k}). \quad (15)$$

This re-parameterization then allows us to define a suitable factorization over the unobserved variables W̃, Z, X, τ and α. Variational inference is most efficient if the Q distribution is factorized, which implies independence assumptions on the approximate posterior. Such approximations are generally problematic for strongly coupled parameters. In f-scLVM this concern applies in particular to the regulatory weights W̃ and the binary indicator variables Z. While a factorizing assumption, i.e. Q(W̃, Z) = Q(W̃)Q(Z), has been considered elsewhere [25,24], such an approach leads to poor convergence, as the true posterior factor with 2^K modes is approximated by a single unimodal factor [13]. In other words, two highly coupled random variables (W̃ and Z) are inferred assuming approximate independence, which leads to poor results. To circumvent this, we here adapt the scheme proposed by Lázaro-Gredilla & Titsias (2011) [13], who derived an efficient and accurate variational inference scheme for the spike-and-slab prior, albeit in the context of Multi-Task and Multiple Kernel Learning. Briefly, the key idea is to treat each pair {w̃_{g,k}, z_{g,k}} as a single unit, choosing a joint factorization of the form Q(W̃, Z) = ∏_k ∏_g Q(w̃_{g,k}, z_{g,k}). This approach yields an approximate marginal distribution with 2^K components, which better captures the multimodal posterior distribution of W. For all remaining model parameters we choose a fully factorized variational distribution, which yields an overall linear runtime complexity for f-scLVM.
The full approximate variational distribution follows as

$$Q(\tilde{\mathbf{W}}, \mathbf{Z}, \mathbf{X}, \boldsymbol{\alpha}, \boldsymbol{\tau}) = \prod_{g,k} Q(\tilde{w}_{g,k}, z_{g,k}) \prod_{n,k} Q(x_{n,k}) \prod_{k} Q(\alpha_k) \prod_{g} Q(\tau_g). \quad (16)$$

The corresponding variational lower bound F of this model can be written as

$$\mathcal{F} = \left\langle \log P(\mathbf{Y} \mid \mathbf{X}, \tilde{\mathbf{W}}, \mathbf{Z}, \boldsymbol{\tau}) \right\rangle_{Q} + \left\langle \log P(\mathbf{I} \mid \mathbf{Z}) \right\rangle_{Q} + \left\langle \log P(\tilde{\mathbf{W}}, \mathbf{Z} \mid \boldsymbol{\alpha}) \right\rangle_{Q} + \left\langle \log P(\mathbf{X}) \right\rangle_{Q} + \left\langle \log P(\boldsymbol{\alpha}) \right\rangle_{Q} + \left\langle \log P(\boldsymbol{\tau}) \right\rangle_{Q} - \left\langle \log Q \right\rangle_{Q}, \quad (17)$$

with ⟨·⟩_Q denoting the expectation under the Q-distribution.

Variational update equations
Sequential update equations for the individual factors are calculated as expectations under all remaining factors, using Eq. (13). We start with the joint variational distribution Q(w̃_{g,k}, z_{g,k}), which we rewrite by explicitly conditioning on the binary indicator z_{g,k}:

$$Q(\tilde{w}_{g,k}, z_{g,k}) = Q(\tilde{w}_{g,k} \mid z_{g,k})\, Q(z_{g,k}). \quad (18)$$

This allows decomposing Q(w̃_{g,k}, z_{g,k}) into Q(z_{g,k}) and the Q distribution of the corresponding weight, conditional on z_{g,k}. Both Q distributions retain the functional form of their respective priors (i.e. Bernoulli and Normal; Eqs. (14)-(15)). The Q distribution of the weight conditioned on z_{g,k} follows as

$$Q(z_{g,k} = 1) = \gamma_{g,k}, \qquad Q(\tilde{w}_{g,k} \mid z_{g,k} = 1) = \mathcal{N}\left(\tilde{w}_{g,k} \,\middle|\, \mu_{\tilde{w}_{g,k}},\, \sigma^2_{\tilde{w}_{g,k}}\right),$$

with variational parameters (γ_{g,k}, μ_{w̃_{g,k}}, σ_{w̃_{g,k}}). Furthermore, this decomposition allows us to rewrite the second term in Eq. (17) as ⟨log P(I | Z)⟩_{Q(W̃,Z)} = ⟨log P(I | Z)⟩_{Q(Z)}. Update equations for these variational parameters are given below, where ⟨·⟩ denotes the expectation under all remaining Q distributions.
Taken together, this means that we can update Q(w̃_{g,k}, z_{g,k}) jointly using the decomposition above. Consequently, the expectation of w̃_{g,k} z_{g,k} under its Q distribution can simply be written as

$$\left\langle \tilde{w}_{g,k}\, z_{g,k} \right\rangle_{Q} = \gamma_{g,k}\, \mu_{\tilde{w}_{g,k}}.$$
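The posterior-mean weight ⟨w̃ z⟩ = γμ is what enters all downstream reconstructions; a minimal sketch with hypothetical variational parameters:

```python
import numpy as np

# Hypothetical variational parameters of Q(w_gk, z_gk) for three gene-factor pairs.
gamma = np.array([0.95, 0.10, 0.50])   # gamma_gk = Q(z_gk = 1)
mu_w  = np.array([1.20, -0.80, 0.30])  # mean of Q(w_gk | z_gk = 1)

E_wz = gamma * mu_w                    # <w_gk * z_gk>_Q = gamma_gk * mu_gk
```

A gene with a confident inclusion (γ close to 1) contributes nearly its full slab mean, while an unlikely inclusion is shrunk towards zero.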
For the remaining variational factors, we can use standard update equations for a conventional variational factor analysis model, e.g. [9,22]. The approximate posterior distribution of the factor activations X (c.f. Eq. (6)) follows as

$$Q(x_{n,k}) = \mathcal{N}\left(x_{n,k} \,\middle|\, \mu_{x_{n,k}},\, \sigma^2_{x_{n,k}}\right),$$

with variational parameters (μ_{x_{n,k}}, σ²_{x_{n,k}}) and corresponding standard updates. Similarly, the Q-distribution of the ARD factor relevance parameters α has the same functional form as the Gamma prior (Eq. (11)),

$$Q(\alpha_k) = \mathrm{Gamma}\left(\alpha_k \,\middle|\, \hat{a}_{\alpha_k}, \hat{b}_{\alpha_k}\right),$$

with variational parameters (â_{α_k}, b̂_{α_k}). Likewise, the Q-distribution of the noise precisions τ can be written as

$$Q(\tau_g) = \mathrm{Gamma}\left(\tau_g \,\middle|\, \hat{a}_{\tau_g}, \hat{b}_{\tau_g}\right),$$

with variational parameters (â_{τ_g}, b̂_{τ_g}) and the associated standard conjugate updates.

Non-Gaussian noise models for (low-coverage) sequence data
The model presented so far assumes Gaussian distributed residuals. To appropriately account for zero inflation, a consequence of dropout effects in sparsely sequenced single-cell data, f-scLVM can also be used in conjunction with a Hurdle noise model, which explicitly accounts for dropout. This is achieved by introducing a separate Bernoulli observation model for the subset of observations with zero counts in the expression matrix. For all remaining observations, the standard Gaussian noise model on a logarithmic scale is retained (see also Section 3). More formally, we introduce the matrix factorization model on latent variables F = XW^T = [f_{n,g}], which is coupled to the observed expression data Y using the likelihood model

$$P(y_{n,g} \mid f_{n,g}) = \begin{cases} 1/(1 + \exp(f_{n,g})) & \text{if } y_{n,g} = 0 \\ \mathcal{N}\left(y_{n,g} \mid f_{n,g},\, 1/\kappa_g\right) & \text{otherwise.} \end{cases}$$
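The two-branch likelihood can be evaluated element-wise; `hurdle_loglik` is a hypothetical helper sketching this, with a Bernoulli term for zeros and a Gaussian term (precision κ_g) for non-zero values:

```python
import numpy as np

def hurdle_loglik(y, f, kappa):
    """Element-wise log-likelihood of the Hurdle model: a Bernoulli dropout
    term 1/(1+exp(f)) for zero observations and a Gaussian N(y | f, 1/kappa)
    for non-zero (log-scale) observations."""
    y, f, kappa = np.broadcast_arrays(np.asarray(y, float), f, kappa)
    ll = np.empty(y.shape)
    zero = (y == 0)
    ll[zero] = -np.log1p(np.exp(f[zero]))          # log of 1/(1+exp(f))
    r = y[~zero] - f[~zero]
    k = kappa[~zero]
    ll[~zero] = 0.5 * (np.log(k) - np.log(2.0 * np.pi) - k * r ** 2)
    return ll
```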
To achieve efficient inference in conjunction with this non-Gaussian likelihood model, we adapt prior work by Seeger et al. [20], who proposed using local variational bounds for non-Gaussian likelihood functions in a different context. Briefly, additional variational parameters Ξ = ⟨X⟩_Q ⟨W⟩_Q^T = [ξ_{n,g}] are introduced, which determine pseudo-observations Ỹ based on the zero-inflated data Y; these pseudo-observations are in turn modelled using a Gaussian noise model with precision κ_g. In the following we outline how the update equations for these pseudo-observations and κ_g can be derived.
Let g(f_{n,g}) = −log P(y_{n,g} | f_{n,g}) denote the negative log likelihood. If g(f_{n,g}) is twice differentiable with second derivative bounded by κ_g, i.e. g''(f_{n,g}) ≤ κ_g ∀ n, g, we can use a Taylor expansion to bound g(f_{n,g}):

$$g(f_{n,g}) \leq \frac{\kappa_g}{2}\left(f_{n,g} - \xi_{n,g}\right)^2 + g'(\xi_{n,g})\left(f_{n,g} - \xi_{n,g}\right) + g(\xi_{n,g}) =: q_{n,g}(f_{n,g}, \xi_{n,g}).$$

For non-zero observations with the Gaussian noise model, this bound is exact, since g''(f_{n,g}) = κ_g for all non-zero observations. For the Bernoulli noise model, the bound holds for κ = 1/4, since g''(f_{n,g}) < 1/4 for all zero observations.
We then further follow [20] and update Ξ = ⟨X⟩_Q ⟨W⟩_Q^T = [ξ_{n,g}], with Q(W) and Q(X) denoting the Q-distributions of weights and latent variables as before. To derive the update equations for Q(W)Q(X), we bring the Taylor bound q_{n,g} of g(f_{n,g}) into quadratic form and note that

$$q_{n,g}(f_{n,g}, \xi_{n,g}) \propto \frac{\kappa_g}{2}\left(f_{n,g} - \left(\xi_{n,g} - g'(\xi_{n,g})/\kappa_g\right)\right)^2 =: -\log \mathcal{N}\left(\tilde{y}_{n,g} \mid f_{n,g},\, 1/\kappa_g\right),$$

with ỹ_{n,g} = ξ_{n,g} − g'(ξ_{n,g})/κ_g. Consequently, for fixed [ξ_{n,g}], the update of Q(X)Q(W) is equivalent to f-scLVM with pseudo-data Ỹ = [ỹ_{n,g}] and noise precision κ_g. When using the dropout noise model, we can thus derive updates for the pseudo-observations as

$$\tilde{y}_{n,g} = \begin{cases} \xi_{n,g} - \dfrac{1}{\kappa_g} \cdot \dfrac{1}{1 + \exp(-\xi_{n,g})} & \text{if } y_{n,g} = 0 \\ y_{n,g} & \text{otherwise,} \end{cases}$$

using g'(ξ) = (1 + exp(−ξ))^{-1} for the Bernoulli component.
Note that the pseudo-observations equal the observations Y for non-zero expression values.
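The pseudo-observation update can be sketched as follows; `pseudo_observations` is a hypothetical helper that assumes g'(ξ) = 1/(1+exp(−ξ)) for the Bernoulli component (the usage example below sets κ = 1/4, the bound derived for zero observations):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pseudo_observations(Y, Xi, kappa):
    """Pseudo-data for the dropout model: zero entries are replaced by
    xi - g'(xi)/kappa with g'(xi) = sigmoid(xi), while non-zero entries
    are kept unchanged."""
    K = np.broadcast_to(kappa, Y.shape)
    return np.where(Y == 0, Xi - sigmoid(Xi) / K, Y)
```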
We update κ_g (which corresponds to τ_g in the case of a Gaussian noise model) using Eqs. (33)-(35), restricted to cells with non-zero expression values; here, N_g, the number of cells with non-zero expression for gene g, replaces N in these updates.
The updates for W̃, X and α are implemented as described in Section 1.2.1, however with the pseudo-observations in place of Y. This allows us to iteratively update the pseudo-observations Ỹ based on Ξ = ⟨X⟩_Q ⟨W⟩_Q^T, and then Q(X) and Q(W) using Ỹ. The variational parameter updates for the ARD prior (identifying relevant factors) and the spike-and-slab prior (regularizing pathway components) are unchanged. This extends the approach suggested in [20], as we allow for different forms of observation noise g_{n,g} as well as a gene-specific precision κ_g, reflecting that the variance differs substantially between genes, and perform inference for κ_g.

Poisson noise model for count data
The analogous modeling strategy can also be employed for count observations using a Poisson noise model. Again, variational parameters ξ_{n,g} are introduced, which determine pseudo-observations Ỹ based on the observed count data Y; these are in turn modeled using a Gaussian noise model. In contrast to the dropout noise model, Y now corresponds to the raw count data. More specifically, we write the Poisson likelihood with link function λ as

$$P(y_{n,g} \mid f_{n,g}) = \frac{\lambda(f_{n,g})^{y_{n,g}}\, e^{-\lambda(f_{n,g})}}{y_{n,g}!}.$$
As before, g(f_{n,g}) = −log P(y_{n,g} | f_{n,g}) needs to be twice differentiable with bounded second derivative. We therefore choose the link function λ_g(f_{n,g}) = log(1 + exp(f_{n,g})), resulting in a gene-specific upper bound on the second derivative,

$$\omega_g = 1/4 + 0.17\, y_{\max_g}, \qquad y_{\max_g} = \max_n(y_{n,g}). \quad (42)$$
Analogously to the dropout noise model, updates for W̃, X and α are performed as described in Section 1.2.1, with pseudo-observations in place of Y. However, unlike for the Hurdle noise model, there is no need to infer ω_g (which plays the role of κ_g), as it is determined by Eq. (42).
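The softplus link and the per-gene curvature bound of Eq. (42) can be sketched on a toy count matrix (values are illustrative):

```python
import numpy as np

def softplus(f):
    """Link function lambda(f) = log(1 + exp(f))."""
    return np.log1p(np.exp(f))

# Per-gene bound on the second derivative of the negative Poisson log-likelihood:
# omega_g = 1/4 + 0.17 * max_n(y_ng), for a small hypothetical count matrix.
y = np.array([[0, 3],
              [7, 1]])
omega = 0.25 + 0.17 * y.max(axis=0)
```

Because ω_g depends only on the observed maximum count per gene, it is computed once up front rather than updated during inference.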

Relationship to existing factor analysis models
The f-scLVM model is related to a number of existing variants of factor analysis, all of which are based on a linear additive model. These methods can be broadly grouped into parametric and non-parametric approaches. Non-parametric methods [12] infer the number of active factors, in principle allowing an infinite number of factors in the model. In contrast, parametric models require the user to specify the number of latent factors before inference. One strategy to mitigate the need to specify the precise number of factors is the use of an ARD prior, which was first applied in the context of probabilistic principal component analysis [4] and later in factor analysis models, including PEER [22]. This approach is also applied in f-scLVM, where a much larger number of annotated and unannotated factors are included in the model and the ARD prior deactivates unused ones. A second aspect of regularization in factor analysis are sparsity priors that encourage element-wise sparseness of the factor loadings. f-scLVM employs a spike-and-slab prior, which has previously been used for this purpose, e.g. [8]. Our model additionally uses prior annotations to inform this sparsity prior, in conjunction with an ARD prior to infer which annotated factors are most relevant. Factor analysis with prior information has been utilized in early methods to reconstruct gene regulatory networks [19]. Additionally, f-scLVM models the annotation as observed data that scales with the size of the expression dataset (Section 1.1.1), rather than using it to define a regulatory prior. This approach yields additional robustness across a wide range of different expression datasets (Fig. M3). A third key aspect of factor analysis models is the noise model employed, ranging from simple homoscedastic Gaussian noise models [18,22] to more complex approaches for modeling non-Gaussian noise, e.g. to account for over-dispersion [17]. To the best of our knowledge, there is currently no existing method that combines non-Gaussian likelihood models with sparse factor analysis. f-scLVM provides flexible likelihood models, modeling the observed data either on a log Gaussian scale, as Poisson counts, or using a Hurdle model.
Finally, factor analysis models use different inference schemes to fit model parameters. Many approaches employ accurate but slow MCMC methods, which tend to scale poorly to larger datasets. f-scLVM employs approximate Bayesian inference to achieve linear runtime complexity, thereby enabling its application to large datasets. For a tabular comparison of alternative methods, see Table 1.

Practical considerations, parameters and implementation
Variational Bayesian inference can be sensitive to implementation details such as parameter initialization and the update order. In the following we describe the specific strategy used in f-scLVM.

Preprocessing
In the experiments we consider normalized single-cell RNA-seq datasets, following the primary analysis used in the respective source publication; see also Online Methods. When applying f-scLVM with the standard Gaussian noise model, we fit the model to log-transformed count values (log(count + 1)) using mean-centred expression values (per gene). When applying f-scLVM with the dropout noise model, we employ the identical strategy, however without mean centring, such that observations with zero counts retain their value. When using f-scLVM in conjunction with the Poisson noise model, no log transformation is performed and the model is applied to the raw count values.
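The three per-noise-model preprocessing variants can be sketched on a toy count matrix (the data values are illustrative):

```python
import numpy as np

counts = np.array([[0.0, 5.0, 12.0],
                   [3.0, 0.0, 7.0]])        # cells x genes, raw counts (toy data)

Y_log = np.log(counts + 1.0)                # log(count + 1)
Y_gauss = Y_log - Y_log.mean(axis=0)        # Gaussian model: per-gene mean centring
Y_dropout = Y_log                           # dropout model: no centring, zeros stay zero
Y_poisson = counts                          # Poisson model: raw counts, no log transform
```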
For each dataset, we reduced gene set annotations and only considered terms with at least 20 (expressed) assigned genes (15 for the more carefully curated MSigDB). Additionally, we reduced the set of genes and considered only expressed genes that were annotated to at least one pathway term.
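The two filtering steps (terms by expressed-gene count, then genes by annotation membership) can be sketched as follows; the annotation matrix and the threshold of 2 are toy values standing in for the 20 (or 15 for MSigDB) used in practice:

```python
import numpy as np

# Toy annotation matrix I (genes x terms) and a mask of expressed genes.
I = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 0],
              [0, 1, 0]], dtype=bool)
expressed = np.array([True, True, True, False])

min_genes = 2                                      # 20 in practice (15 for MSigDB)
term_sizes = (I & expressed[:, None]).sum(axis=0)  # expressed genes per term
I_filtered = I[:, term_sizes >= min_genes]         # drop small terms
keep_genes = expressed & I_filtered.any(axis=1)    # expressed and annotated somewhere
```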

Model initialisation, VB update schedule and convergence
Initialization of variational parameters The variational parameters of the regulatory weights W̃ are initialized randomly by sampling from a unit-variance normal distribution, scaled by 1/√K, with K being the number of factors. The variational parameters of latent factors (columns of X) that correspond to annotated factors are initialized using the first principal component calculated on the prior gene set of the corresponding factor. Dense unannotated factors without pathway information are initialized randomly by sampling from the prior (unit-variance normal). Sparse unannotated factors are initialised using the first principal component of 20 randomly chosen highly variable genes (sampled from the top 100 most variable genes, sorted by variance).
The variational parameters γ_{g,k} of the regulatory sparsity prior are initialized with the prior (Bernoulli with success probability ρ_{g,k}); for sparse unannotated factors, we initialise the γ_{g,k} corresponding to the 20 randomly chosen genes to 0.9. When a non-Gaussian noise model (dropout or Poisson) is used, the pseudo-observations Ỹ are initialized using the observed data Y.
Parameter update schedule The variational schedule updates Q(W̃, Z) first, followed by Q(α), Q(X) and finally Q(τ). For the non-Gaussian noise models, an additional update step for the pseudo-observations Ỹ is included. As the individual factors X_{:,k} should capture variation due to a particular biological process, it is important to minimize the risk of label switching, whereby the factor states do not match the regulatory annotation (see also [10]). This problem is specific to sparse factor models that incorporate prior information, where, unlike in standard FA, the order of the factors is meaningful. To mitigate possible biases, we update the Q distributions of individual factors in a randomized order, using a different permutation in each model iteration. While this approach reduces ordering effects, we observed that the final solution is still affected by the update order of the individual factors in the first iteration. To address this, we use a heuristic to determine the initial update order rather than a random permutation for the first update cycle. This initial order is determined using a pre-training approach over 50 update iterations: factors are ordered by increasing and by decreasing correlation with the first principal component on all annotated genes, and the consensus order after the 50 updates is used as the initial permutation for updating f-scLVM. Empirically, we observed that this heuristic leads to improved convergence and more accurate estimates of the final factor relevance (Fig. M2).

Supp. Meth. Figure M2: Impact of pre-training to determine an initial factor update order. a,b) Comparison of the inferred factor relevance for the cell-cycle staged mESC dataset (see also Supp. Fig. 4), using a bootstrap approach to assess robustness of the factor relevance. Results with pre-training are shown in (a), analogous results without pre-training in (b).
In general, pre-training resulted in reduced variability across bootstrap repeats, with an overall consistent interpretation. c) Comparison of the accuracy of f-scLVM on simulated data with default parameter settings (see Supplementary Table 1), either with or without (Rand.) pre-training (analogous to the results reported in main paper Fig. 2b). The pre-training approach resulted in slightly improved accuracy.
Monitoring convergence Model updates are performed until convergence, which we monitor using the reconstruction error. Alternatively, it is also possible to monitor the variational lower bound on the marginal likelihood (Eq. (12)); however, this would increase the computational cost, as an explicit evaluation of the bound almost doubles the per-iteration compute cost. In practice, we observed that monitoring the reconstruction error is sufficient. We performed up to 2,000 iterations of variational updates, stopping earlier once the change in reconstruction error remained below 10^{-6} for 50 consecutive iterations, whichever occurred first.
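The stopping rule can be sketched as a small predicate; `converged` is a hypothetical helper implementing the "change below tol for 50 consecutive iterations" criterion:

```python
import numpy as np

def converged(errors, tol=1e-6, window=50):
    """True once the reconstruction error has changed by less than tol for
    `window` consecutive iterations."""
    if len(errors) < window + 1:
        return False
    deltas = np.abs(np.diff(np.asarray(errors[-(window + 1):], dtype=float)))
    return bool(np.all(deltas < tol))
```

In a training loop this would be checked after each sweep, alongside the hard cap of 2,000 iterations.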

Learning unannotated factors
f-scLVM can also learn sparse unannotated factors, which can, for example, capture variation between cell types that is not readily reflected in the pathway annotations provided to the model. We model these factors by setting π to 0.1, reflecting our prior belief that roughly 10% of genes are active in a sparse hidden factor. To facilitate efficient learning, we initialize the model by seeding each such factor with 20 highly variable genes, for which we set π to 0.99.

Applying f-scLVM to very large datasets
We implemented a series of additional measures to improve the practical performance and convergence rate of f-scLVM. First, our software implementation makes use of parallel processing capacities if executed using a modern Python interpreter. Second, to facilitate inference on datasets with 10,000 cells or more, as well as when using larger numbers of pathway factors (e.g. > 200 in the REACTOME database), we provide support for a pre-scoring heuristic that reduces the number of factors that need to be fitted jointly. Specifically, we first fit factors independently using an SVD on the prior annotated gene set of each factor. We then retain the 50 terms for which the first eigenvector explains most of the observed variance. While this approach is likely to overestimate the importance of individual factors (cf. Supp. Fig. 1), it is effective for pruning pathways that are highly unlikely to be relevant. Third, the model allows deactivation of individual factors once they have extremely low relevance (using α_k / var(x_k) > 10^{10} as the criterion). This early-stopping approach is motivated by the observation that factors, once deactivated by the ARD prior, are unlikely to be reactivated in later stages of training.
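The pre-scoring heuristic can be sketched as follows; `prescore_terms` is a hypothetical helper that scores each term by the variance fraction of the first singular vector of its (centred) expression submatrix. The usage example builds one coherent and one random gene set and checks that the coherent one ranks first:

```python
import numpy as np

def prescore_terms(Y, gene_sets, n_keep=50):
    """Rank annotation terms by the fraction of variance that the first
    singular vector of the per-gene-centred expression submatrix explains;
    return the indices of the top n_keep terms."""
    scores = []
    for genes in gene_sets:
        Sub = Y[:, genes] - Y[:, genes].mean(axis=0)
        s = np.linalg.svd(Sub, compute_uv=False)
        scores.append(s[0] ** 2 / np.sum(s ** 2))
    order = np.argsort(scores)[::-1]
    return order[:n_keep]

rng = np.random.default_rng(3)
factor = rng.standard_normal(40)
coherent = np.outer(factor, np.ones(3)) + 0.05 * rng.standard_normal((40, 3))
noise = rng.standard_normal((40, 3))
Y = np.hstack([noise, coherent])
top = prescore_terms(Y, [[0, 1, 2], [3, 4, 5]], n_keep=1)
```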

Hyperparameter settings
In the experiments, we considered the following hyperparameters. For the spike-and-slab prior for annotated factors we choose an uninformative prior of π = 0.5. To model the annotation we set 1 − FPR to 99% and FNR to 0.1%, reflecting the belief that annotations are specific but may include genes that are not necessarily relevant in a given study. For the Gamma priors on α_k and τ_g, we chose the hyperparameters a = 10^{-3} and b = 10^{-3}, which correspond to uninformative prior settings.
Scaling the gene set annotations with dataset size An additional parameter is n_eff, the effective number of cells based on which the annotation likelihood is scaled to larger datasets. Technically, the likelihood term for the annotations P(I | Z) (Eq. (8)) is scaled by N/n_eff. This approach is equivalent to modeling a full set of gene set annotations for every n_eff cells in the dataset. In the experiments, we use n_eff = 200, which means that the FNR and FPR settings for the prior annotations are relative to a dataset with 200 cells. Empirically, we confirmed the expected effect of scaling the annotation likelihood with the data likelihood (see also Sec. 1.1.1). When using a fixed annotation prior, the number of false-positive augmentations of gene sets (using the posterior on Z) scaled approximately linearly with the dataset size, reflecting that the inferred factors become increasingly decoupled from the annotation. In contrast, the likelihood scaling yielded robust and accurate results across a wide range of dataset sizes (Fig. M3).
Determining the number of unannotated factors The number of unannotated factors to include in the model is in principle a hyperparameter that needs to be set by the user. First, for unannotated dense factors, we observe that the model is insensitive to the total number of such factors, because the ARD prior deactivates factors that are not needed, leading to results that are robust across a large range (see also Supp. Fig. 3h). In the experiments, we include 3 unannotated dense factors throughout. The number of unannotated sparse factors requires further consideration, however. Because these factors are sparse and there is no prior annotation to constrain the gene sets they affect, we find that the ARD prior regularizes their relevance less effectively. Consequently, sparse unannotated factors should only be activated if they are needed to explain variation in the data. Empirically, we observed that if large numbers of genes are activated in the annotated factors (gene set augmentation of more than 100%; see Sec. 3.6), this indicates that the annotation cannot appropriately capture the variation in the data. This situation applied to the Zeisel dataset as well as the human preimplantation embryo dataset (Supp. analyses Fig. SN1). To address this, we fit 5 sparse hidden factors for these two datasets. Sparse hidden factors tend to capture variation between cell types, which is typically not well captured by annotated factors (Supp. Fig. 6).

Downstream analyses
The trained f-scLVM model can be used for different downstream analyses.
Factor relevance First, the posterior mean of the ARD score (factor-wise precision) α̂_k is used as a measure of the relevance of an individual factor for driving expression variability. The inverse of this ARD score can be interpreted as the expected variance explained by the factor for the subset of genes with a regulatory effect: larger values of 1/α̂_k indicate larger relevance of factor k. When analysing the drivers of variability for a selected subset of cells only, the factor relevance can be mapped onto this subset without the need to recompute the model. This is achieved by re-weighting 1/α̂_k with the relative variance of the corresponding factor k within the subset of cells under consideration. To exclude factors that may be driven by outlying cells, we filtered the reconstructed factors based on the mean absolute deviation (mad) and excluded factors with mad less than 0.4 before calculating the relevance score. For the retina and Zeisel datasets, no such filtering was applied, due to the known presence of very small cell populations.
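The relevance readout with the outlier filter can be sketched as a small function; `factor_relevance` is a hypothetical helper combining 1/⟨α_k⟩ with the mean-absolute-deviation filter described above:

```python
import numpy as np

def factor_relevance(alpha_mean, X, mad_min=0.4):
    """Relevance 1/<alpha_k> per factor; factors whose activations have a
    mean absolute deviation (mad) below mad_min are excluded (set to zero),
    as they may be driven by few outlying cells."""
    mad = np.mean(np.abs(X - X.mean(axis=0)), axis=0)
    rel = 1.0 / np.asarray(alpha_mean, dtype=float)
    rel[mad < mad_min] = 0.0
    return rel
```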
Visualization Second, the posterior distribution over annotated and unannotated factors X :,k can be used to visualize cell states.
Gene set refinement By comparing the posterior distribution over the gene-to-factor assignments z_{g,k} to the prior annotation I_{g,k}, it is possible to identify individual genes that were added to or removed from a given pathway factor k. In practice, we consider a posterior threshold of 0.5 for annotating genes to factors.

Model residuals Finally, the inferred annotated and unannotated factors can also be used to compute residual expression data. Residual data adjusted for the effect of a given factor k are derived as Y_residual = Y − X_{:,k} W_{:,k}^T. When the model was trained using the dropout noise model, the residuals are calculated using the pseudo-observations Ỹ; this approach performs an implicit imputation of zero count values.
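The residual computation is a single rank-one correction; `residuals` is a hypothetical helper sketching it:

```python
import numpy as np

def residuals(Y, X, W, k):
    """Expression data adjusted for the effect of factor k:
    Y_residual = Y - X[:, k] W[:, k]^T (a rank-one correction)."""
    return Y - np.outer(X[:, k], W[:, k])
```

With the dropout noise model, Y would be replaced by the pseudo-observations Ỹ, so that zero counts are implicitly imputed before the correction.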