MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data.

Technological advances have enabled the profiling of multiple molecular layers at single-cell resolution, assaying cells from multiple samples or conditions. Consequently, there is a growing need for computational strategies to analyze data from complex experimental designs that include multiple data modalities and multiple groups of samples. We present Multi-Omics Factor Analysis v2 (MOFA+), a statistical framework for the comprehensive and scalable integration of single-cell multi-modal data. MOFA+ reconstructs a low-dimensional representation of the data using computationally efficient variational inference and supports flexible sparsity constraints, allowing variation to be modelled jointly across multiple sample groups and data modalities.


Overview
Multi-Omics Factor Analysis (MOFA) is a statistical framework aimed at disentangling the sources of variation across data modalities (i.e. a multi-omics dataset). In this new study we improve the first model formulation (introduced in [2]) with the aim of performing integrative analysis of large-scale datasets across multiple data modalities (views) and across multiple conditions or studies (groups). In this appendix we provide a brief introduction to the original MOFA model, followed by a detailed description of the key model innovations.

Mathematical notation
- Matrices are denoted with bold capital letters: W.
- Vectors are denoted with bold lower-case letters: w. If the vector comes from a matrix, a single index indicates the row it comes from. If two indices are used, the first corresponds to the row and the second to the column. The symbol ':' denotes an entire row or column. For instance, w_i refers to the i-th row of the matrix W, whereas w_{:,j} refers to its j-th column.
- Scalars are denoted with non-bold lower-case letters: w. If the scalar comes from a one-dimensional array (a vector), a single subscript indicates its position in the vector. If the scalar comes from a two-dimensional array, two subscripts are used: the first corresponds to the row and the second to the column. For instance, w_{i,j} refers to the value in the i-th row and j-th column of the matrix W, and w_i to the i-th value of the vector w.
In some cases higher-dimensional arrays (tensors) are used, and the use of multiple indices follows the same rationale.
- 0_K is a zero vector of length K.
- I_K is the identity matrix of rank K.
- E_q[x] denotes the expectation of x under the distribution q. When expectations are repeatedly taken with respect to the same distribution, we avoid cluttered notation and write ⟨x⟩ instead.
- N(x | µ, σ²): x follows a univariate normal distribution with mean µ and variance σ².
- G(x | a, b): x follows a gamma distribution with shape a and rate b.
- Beta(x | a, b): x follows a beta distribution with shape parameters a and b.

Multi-Omics Factor Analysis v1
This section is reproduced from [2] with some modifications.
Factor analysis models are probabilistic modelling approaches that aim to reduce the dimensionality of a (large) dataset, generating a compressed and denoised representation of the data that is easier to interpret and visualise. Formally, given a dataset Y ∈ R^{N×D} of N samples and D features, the aim is to explain the dependencies between the observed features by means of a small set of K unobserved (or latent) variables, called factors. The factors capture the global sources of variation in the dataset. Intuitively, this is similar to Principal Component Analysis. MOFA is a multi-view generalisation of traditional factor analysis to M input matrices (or views) Y^m ∈ R^{N×D_m}. Each view consists of non-overlapping features that often represent different assays. However, there is flexibility in the definition of views, depending on the hypothesis of the user. As we demonstrate in the manuscript, one can formulate multi-view problems even from unimodal data. For example, given DNA methylation data, one could define a single view containing all genome-wide CpG measurements, but one could also split this matrix into different views, either by chromosome or by genomic context (i.e. promoters, enhancers, etc.). This formulation could enable the detection of patterns that are shared between, or specific to, different genomic contexts.
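To make the view construction concrete, the splitting of a single feature matrix into views by an annotation can be sketched in a few lines of NumPy. The data and the genomic-context labels below are made up purely for illustration; this is not part of the MOFA+ implementation.

```python
import numpy as np

# Hypothetical toy data: 100 cells x 6 CpG sites, with a made-up
# genomic-context annotation per site (labels are illustrative only).
rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 6))
context = np.array(["promoter", "enhancer", "promoter",
                    "enhancer", "promoter", "enhancer"])

# One view per genomic context: all N samples, but disjoint features.
views = {c: Y[:, context == c] for c in np.unique(context)}
print({name: Ym.shape for name, Ym in views.items()})
```

Each resulting view keeps all N samples but only its own (non-overlapping) features, matching the definition of views above.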
Formally, in MOFA, the input data is factorised using the master equation of latent variable models:

Y^m = Z (W^m)^T + ε^m

where Z ∈ R^{N×K} is a matrix that contains the factor values and W^m ∈ R^{D_m×K} are a set of M weight matrices that relate the high-dimensional feature space to the low-dimensional latent representation. The term ε^m ∈ R^{D_m} captures the residuals, or the noise.
The structure of the MOFA v1 model is depicted in the following figure: (a) The model takes M data matrices as input (Y^1, ..., Y^M), one or more from each data modality, with co-occurrent samples but features that are not necessarily related and can differ in number. MOFA decomposes these matrices into a matrix of factors (Z) and M weight matrices, one for each data modality (W^1, ..., W^M). White cells in the weight matrices correspond to zeros, i.e. inactive features, whereas the cross symbol denotes missing values. (b) The fitted MOFA model can be queried for different downstream analyses, including a variance decomposition to assess the proportion of variance explained by each factor in each data modality.
By default, residuals are assumed to be normally distributed and heteroskedastic:

ε^m_d ~ N(0, 1/τ^m_d)

which results in the following likelihood:

p(y^m_{nd} | w^m_d, z_n, τ^m_d) = N(y^m_{nd} | z_n (w^m_d)^T, 1/τ^m_d)

Non-Gaussian noise models are available by replacing the Gaussian likelihood with other distributions, such as Bernoulli (for binary data) or Poisson (for count data). The mathematical details can be found in [2]. For simplicity, unless otherwise stated, we will always assume Gaussian noise.
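As an illustration, the generative model above can be simulated with a few lines of NumPy. All dimensions and hyperparameters below are arbitrary toy choices, not values used by MOFA+:

```python
import numpy as np

# Toy simulation of the master equation Y^m = Z W^{mT} + eps^m with
# heteroskedastic Gaussian noise (all dimensions are arbitrary choices).
rng = np.random.default_rng(1)
N, K = 200, 3                 # samples and factors
D = [50, 30]                  # features per view (M = 2 views)

Z = rng.normal(size=(N, K))                        # factors
W = [rng.normal(size=(d, K)) for d in D]           # weights, one per view
tau = [rng.gamma(2.0, 1.0, size=d) for d in D]     # per-feature precisions

# Residual standard deviation of feature d is 1/sqrt(tau_d).
Y = [Z @ Wm.T + rng.normal(scale=1.0 / np.sqrt(tm), size=(N, tm.size))
     for Wm, tm in zip(W, tau)]
print([Ym.shape for Ym in Y])  # [(200, 50), (200, 30)]
```

Note that the views share the factor matrix Z (the common latent space) but have their own weights and noise precisions, exactly as in the factorisation above.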

Interpretation of the factors
Factors capture the global sources of variability in the data. They are analogous to the principal components in Principal Component Analysis (PCA). Mathematically, each factor orders cells along a one-dimensional axis centred at zero. Samples with different signs manifest opposite phenotypes along the inferred axis of variation, with a higher absolute value indicating a stronger effect. For example, assume that the k-th factor captures the variability associated with the cell cycle. We would expect cells in mitosis to lie at one end of the factor axis (irrespective of the sign, only the relative positioning being of importance), whereas cells in G1 phase are expected to lie at the other end. Cells with an intermediate phenotype, or with no clear phenotype (e.g. no cell cycle genes profiled), are expected to be located around zero (because of the zero-mean prior distribution).

Interpretation of the weights
The weights provide a score for each feature on each factor, and are interpreted in a similar way as the factors. Genes with no association with the factor are expected to have values close to zero, whereas genes with strong association with the factor are expected to have large absolute values. The sign of the weight indicates the direction of the effect: a positive weight indicates that the feature has higher levels in the cells with positive factor values, and vice-versa. Following the cell cycle example from above, genes that are upregulated in the M phase are expected to have large positive weights, whereas genes that are downregulated in the M phase (or, equivalently, upregulated in the G1 phase) are expected to have negative weights.
The following figure shows a real-case example of a factor capturing the cell cycle effect, together with the corresponding weights. Importantly, the scale of the weights should only be compared within a view, not across views. The scale of the weights adjusts to the distribution of each data modality, so multiplying the values of a data modality by two yields weights scaled by a factor of two. Hence, for simplicity, it is common practice to report weights scaled from -1 to 1 (or from 0 to 1 when using absolute values).
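The per-factor rescaling to [-1, 1] can be sketched as follows. The helper name is ours and the matrix values are illustrative; each view's weight matrix should be scaled independently, since weight scales are only comparable within a view:

```python
import numpy as np

def scale_weights(W):
    """Rescale each factor's weights (columns of W) to the [-1, 1] range.

    Because weight scales are only comparable within a view, each view's
    weight matrix should be passed separately. (Helper name is ours.)
    """
    return W / np.max(np.abs(W), axis=0)

W = np.array([[2.0, -1.0],
              [-4.0, 0.5],
              [1.0, 0.25]])
Ws = scale_weights(W)
print(np.max(np.abs(Ws), axis=0))  # [1. 1.]
```

After scaling, the largest absolute weight of each factor is exactly 1, while signs and relative magnitudes are preserved.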

Interpretation of the noise
The use of a probabilistic framework allows the model to explicitly disentangle the signal (the explained variance) from the noise (the unexplained variance). In the Gaussian framework, large values of τ^m_d indicate that the model predicts the observations for feature d in view m with high accuracy. In contrast, small values of τ^m_d are indicative of low predictive power.

Prior distributions for the factors and the weights
The key determinant of the model is the regularisation imposed by the prior distributions of the factors and the weights. In the first version of MOFA we defined a standard normal prior on the factors:

p(z_{nk}) = N(z_{nk} | 0, 1)

Effectively this assumes, a priori, independent samples centred at zero. This zero-mean prior induces shrinkage, but only to some extent: if the data does not contain meaningful factors, the posterior distributions will also be centred at zero.
In contrast, the weights are assumed to be potentially very sparse, the rationale being that the number of features is very large and that real biological factors are driven by potentially small gene regulatory networks [8]. To achieve this, MOFA encodes two levels of sparsity: (1) a view- and factor-wise sparsity and (2) an individual feature-wise sparsity. The aim of the view- and factor-wise sparsity is to identify which factors are active in which views, such that the weight vector w^m_{:,k} is shrunk to zero if factor k does not explain any variation in view m. In addition, we place a second layer of sparsity which encourages inactive weights for each individual feature.
Mathematically, we express this as a combination of an Automatic Relevance Determination (ARD) prior [14] for the view- and factor-wise sparsity and a spike-and-slab prior [16] for the feature-wise sparsity:

p(w^m_{kd}) = (1 − θ^m_k) δ_0(w^m_{kd}) + θ^m_k N(w^m_{kd} | 0, 1/α^m_k)

However, the spike-and-slab prior contains a Dirac delta function, which makes the inference procedure troublesome. To solve this, we introduce a re-parametrisation of the weights w as a product of a Gaussian random variable ŵ and a Bernoulli random variable s [24], resulting in the following prior for every single weight w^m_{kd} = ŵ^m_{kd} s^m_{kd}:

p(ŵ^m_{kd}, s^m_{kd}) = N(ŵ^m_{kd} | 0, 1/α^m_k) Bern(s^m_{kd} | θ^m_k)

In this formulation α^m_k controls the strength of factor k in view m and θ^m_k controls the fraction of non-zero (active) weights (i.e. the sparsity level) of factor k in view m. Finally, we allow the model to learn the levels of sparsity and introduce the following hierarchical priors for θ and α:

p(θ^m_k) = Beta(θ^m_k | a^θ_0, b^θ_0),   p(α^m_k) = G(α^m_k | a^α_0, b^α_0)

with hyper-parameters a^θ_0 = b^θ_0 = 1 and a^α_0 = b^α_0 = 1e-3 to obtain uninformative priors. A posterior value of θ^m_k close to 0 implies that most of the weights of factor k in view m are shrunk to 0 (a sparse factor). In contrast, a value of θ^m_k close to 1 implies that most of the weights are non-zero (a non-sparse factor). A small value of α^m_k implies that factor k is active in view m, whereas a large value of α^m_k implies that factor k is inactive in view m.
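A quick way to build intuition for the reparametrised spike-and-slab prior is to sample from it. The hyperparameter values below are purely illustrative:

```python
import numpy as np

# Sampling from the reparametrised spike-and-slab prior w = w_hat * s,
# with w_hat ~ N(0, 1/alpha) and s ~ Bernoulli(theta).
# The hyperparameter values below are purely illustrative.
rng = np.random.default_rng(2)
D = 10_000
alpha, theta = 4.0, 0.2

w_hat = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=D)  # the "slab"
s = rng.binomial(1, theta, size=D)                     # the "spike"
w = w_hat * s

# theta controls the fraction of non-zero (active) weights.
print(np.mean(w != 0))
```

As described above, theta sets the sparsity level (here roughly 20% of weights are active) while alpha sets the scale of the active weights.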
This completes the definition of the original MOFA model. The use of a view-wise sparsity prior on the weights relies on the assumption that the features are not independent but structured into different views. Some factors may explain variability in only subsets of views, resulting in a structured sparsity as illustrated in Figure 1. Following the same logic, the integration of multiple conditions or studies requires breaking the assumption of independent samples and introducing a prior that captures the existence of different groups, such that some factors are allowed to be active in only subsets of groups. For example, consider the case where we try to integrate two single-cell multi-omics experiments where the first one contains a lineage event that is not present in the second one. In this scenario, one expects to learn a factor that captures such variability, but this factor should only be active in the first dataset. A real-case example is shown in the following figure. In addition, if the data has a large number of (noisy) cells, one can expect each factor to be active in only a small fraction of cells.
To formalise the assumptions above, we mirror the double sparsity prior used for the weights:

p(ẑ^g_{nk}, s^g_{nk}) = N(ẑ^g_{nk} | 0, 1/α^g_k) Bern(s^g_{nk} | θ^g_k)

where g is the index of the sample groups.
In addition, to account for the fact that different sample groups may exhibit different noise levels, we use the following noise model, where ε^{m,g}_d denotes the residual for a particular view, feature and sample group:

ε^{m,g}_d ~ N(0, 1/τ^{gm}_d)

Effectively, this symmetric sparsity allows the model to integrate multiple views (at the feature level) as well as multiple groups (at the sample level). The graphical model is shown in Figure 4.

Solving the rotational invariance problem
Conventional factor analysis is invariant to rotations in the latent space [26]. To demonstrate this property, let us apply an arbitrary rotation to the weights and the factors, specified by a rotation matrix R ∈ R^{K×K}:

Z̃ = Z R,   W̃^m = W^m R

First, note that the model likelihood is unchanged by this rotation, irrespective of the prior distributions used, since Z̃ (W̃^m)^T = Z R R^T (W^m)^T = Z (W^m)^T.
However, the prior distributions of the factors and the weights are only invariant to rotations when using isotropic normal priors:

p(w̃^m_d) = N(R^T w^m_d | 0, σ² I_K) = N(w^m_d | 0, σ² I_K) = p(w^m_d)

where we have used the property R^T = R^{−1} that applies to rotation matrices. The same derivation follows for the factors Z.
In practice, this property renders conventional Factor Analysis unidentifiable, hence limiting its interpretation and applicability. Sparsity assumptions, however, partially address the rotational invariance problem [10].
It is important to remark that the model is nonetheless invariant to permutations of the factors. This implies that, under different initial conditions, the order of the factors is not necessarily the same across independent model fittings. To address this, we manually sort the factors a posteriori based on the total variance explained.
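The a posteriori sorting by variance explained can be sketched as follows. This is our simplified illustration (measuring each factor's contribution as the sum of squares of its rank-one reconstruction), not the MOFA+ implementation:

```python
import numpy as np

def sort_factors(Z, W):
    """Order factors by total variance explained (a simplified sketch).

    The contribution of factor k is measured as the sum of squares of its
    rank-one reconstruction Z[:, k] W_m[:, k]^T, accumulated over views.
    """
    K = Z.shape[1]
    r2 = np.zeros(K)
    for Wm in W:
        for k in range(K):
            r2[k] += np.sum(np.outer(Z[:, k], Wm[:, k]) ** 2)
    order = np.argsort(r2)[::-1]
    return Z[:, order], [Wm[:, order] for Wm in W], order

# Toy example: the second factor has a much larger contribution.
Z = np.ones((50, 2))
Z[:, 0] *= 0.1
W = [np.ones((5, 2))]
Zs, Ws, order = sort_factors(Z, W)
print(order)  # [1 0]
```

Because the ELBO is unchanged under permutations, this sorting is purely cosmetic: it fixes a canonical order without affecting the fit.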

Scaling up using stochastic variational inference with natural gradients
The size of biological datasets is rapidly increasing, particularly in the field of single-cell sequencing, with some studies reporting more than a million cells [23,6].
In the original MOFA model, inference was performed using variational Bayes [2,3,11]. While this framework is typically faster than sampling-based Monte Carlo approaches, it becomes prohibitively slow with very large datasets, hence motivating the development of a more efficient inference scheme. For this purpose, we derived the stochastic version of the variational inference algorithm developed by [9,7,1] for our specific probabilistic model ( Figure 4).
In the first two subsections, we briefly review the variational inference algorithm and its stochastic version as proposed by [9]. In the next two subsections, we show how to derive the variational inference algorithm and its stochastic version for the specific MOFA+ probabilistic model.

Review section: Variational Inference (VI)

a) Main principles
First, we briefly introduce the variational Bayes framework and show how to turn it into an optimisation problem that can be solved via coordinate or gradient ascent methods. For a detailed mathematical introduction to variational methods we refer the reader to [4,17,5,25,3].
Let us define a generative model specified by some observations Y and some latent variables X.
The central aim of Bayesian inference is to obtain the posterior distribution of the latent variables given the observations:

p(X | Y) = p(Y, X) / ∫_X p(Y, X) dX

For most realistic models, the integral in the denominator is intractable and one has to resort to approximations.
In variational inference the true (but intractable) posterior distribution p(X|Y) is approximated by a simpler (variational) distribution q(X). This distribution is chosen as the closest approximation to the true posterior among a predefined family of distributions that are tractable to compute. The distance between the true posterior distribution and an approximate posterior distribution (belonging to the chosen family) is measured by the KL divergence:

KL(q(X) || p(X|Y)) = − ∫ q(X) log [ p(X|Y) / q(X) ] dX

Note that, if we allowed any possible choice of q(X), the minimum of this function would occur when q(X) equals the true posterior distribution p(X|Y). Since the true posterior is intractable to compute, we instead look for the distribution minimising the KL divergence within a restricted family of approximate distributions.
Doing some calculus it can be shown (see [4,17]) that the KL divergence KL(q(X)||p(X|Y)) is the difference between the log marginal probability of the observations log p(Y) and a term L(X) typically called the Evidence Lower Bound (ELBO):

KL(q(X) || p(X|Y)) = log p(Y) − L(X)

The first term on the right-hand side is a constant. Hence, as illustrated in Figure 5, minimising the KL divergence is equivalent to maximising

L(X) = E_q[log p(Y, X)] − E_q[log q(X)]

The first term is the expectation of the log joint probability distribution with respect to the variational distribution. The second term is the entropy of the variational distribution. Importantly, given a simple parametric form of q(X), each of the terms in Equation (14) can be computed in closed form.
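As a sanity check, the decomposition log p(Y) = L + KL(q || posterior) can be verified numerically on a toy conjugate model. All distributions below are illustrative choices (this is not the MOFA model), and integrals are approximated on a grid:

```python
import numpy as np

# Numerical check of log p(Y) = ELBO + KL(q || posterior) on a toy model:
# x ~ N(0, 1), y | x ~ N(x, 1), observed y = 2 (all values illustrative).
y = 2.0
grid = np.linspace(-10.0, 10.0, 20001)
dx = grid[1] - grid[0]

def log_norm(x, mu, var):
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mu) ** 2 / (2.0 * var)

log_joint = log_norm(grid, 0.0, 1.0) + log_norm(y, grid, 1.0)
log_evidence = log_norm(y, 0.0, 2.0)      # marginal is N(y | 0, 2)
log_post = log_joint - log_evidence       # exact posterior is N(1, 0.5)

q = np.exp(log_norm(grid, 0.5, 1.0))      # an arbitrary variational q(x)
elbo = np.sum(q * (log_joint - np.log(q))) * dx
kl = np.sum(q * (np.log(q) - log_post)) * dx

print(abs(elbo + kl - log_evidence) < 1e-6)  # True: the identity holds
```

Since log p(Y) is fixed by the data, any increase in the ELBO must come from a decrease in the KL term, which is exactly why maximising the ELBO tightens the approximation.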
On some occasions, we will use the following form of the ELBO:

L(X) = E_q[log p(Y | X)] − KL(q(X) || p(X))

where the first term is the expectation of the log likelihood and the second term is the KL divergence between the variational distribution and the prior of each hidden variable.
In conclusion, variational learning involves minimising the KL divergence between q(X) and p(X|Y) by instead maximising L(X); Figure 5 summarises this general picture. To define the family of distributions in which to search for the approximate posterior q(X), two approaches are possible: non-parametric or free-form [4], and parametric or fixed-form [25,5]. We will present both approaches under the mean-field assumption, which states that the approximate posterior factorises over M disjoint groups of variables [21]:

q(X) = ∏_{i=1}^{M} q_i(x_i)

b) Free-form mean-field variational inference
A common approach is to make no parametric assumptions on the form of the factors q_i. Using calculus of variations (see [4,17]), it can be shown that the ELBO is maximised with respect to q_i when

log q*_i(x_i) = E_{−i}[log p(Y, X)] + const

where E_{−i} denotes an expectation with respect to the q distributions over all variables x_j except x_i. The additive constant is set by normalising the distribution q*_i(x_i); see [4,17,26] for mathematical details.

c) Fixed-form mean-field variational inference
An alternative and straightforward choice is to directly define the distribution q(X) to be of the same form as the prior distribution p(X), with (variational) parameters Θ. Thus, each factor q i belongs to the same family as the prior over the variables x i , and can be parametrized with its own parameters θ i to be tuned.
Importantly, this approach introduces parametric assumptions which only match the free-form mean-field derivation when using conjugate priors. For generic models with arbitrary families of distributions, however, no closed-form variational distributions can be obtained via the free-form mean-field approach [25,5].
In the fixed-form approach, the ELBO can be maximised by optimising the parameters Θ via conventional numeric optimisation methods. Hence, while the parametric assumption certainly limits the flexibility of variational distributions, the advantage of this formulation is that it opens up the possibility to use (potentially fast) gradient-based methods for the inference procedure [9,18,12].

Review section: Stochastic Variational Inference (SVI)
We have seen in the previous section that mean-field variational inference amounts to a coordinate ascent procedure to maximise the evidence lower bound.
A faster alternative for maximising the ELBO with respect to the variational distribution q(X) is a stochastic gradient ascent procedure. In the context of variational inference, this approach was proposed by [9] under the name Stochastic Variational Inference (SVI).
In the following, we first briefly review the two ingredients required to understand SVI, stochastic gradient ascent and natural gradients, and then introduce the SVI algorithm.

a) Stochastic gradient ascent
Gradient ascent is a common first-order optimisation algorithm for finding the maximum of a function [4,17]. It works iteratively by taking steps proportional to the gradient of the function evaluated at each iteration. Formally, for a differentiable function F(x), the iterative scheme of gradient ascent is

x^{(t+1)} = x^{(t)} + ρ^{(t)} ∇F(x^{(t)})

At each iteration, the gradient ∇F is re-evaluated and a step is taken in its direction. The step size is controlled by ρ^{(t)}, a parameter called the learning rate, which is typically adjusted at each iteration t [19].
Gradient ascent is appealing because of its simplicity, but it becomes prohibitively slow with large datasets, mainly because of the computational cost (both in terms of time and memory) associated with the iterative calculation of gradients [22].
Assuming the existence of redundancy in the dataset, a fast approximation ∇̃F of the gradient can be calculated using a random subset of the data (a minibatch). Formally, as in standard gradient ascent, the iterative training schedule proceeds by taking steps of size ρ^{(t)} in the direction of the approximate gradient:

x^{(t+1)} = x^{(t)} + ρ^{(t)} ∇̃F(x^{(t)})

The step size ρ^{(t)} can again be adjusted at each iteration t. When the series ρ^{(t)} satisfies the Robbins-Monro conditions, Σ_t ρ^{(t)} = ∞ and Σ_t (ρ^{(t)})² < ∞, F is guaranteed to converge to a local maximum [20]. If F is not convex, the algorithm is sensitive to the initialisation x^{(t=0)}.
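The minibatch scheme with a Robbins-Monro step size can be illustrated on a toy objective. All values below are arbitrary; the objective F(x) = -(1/2) Σ_n (x − y_n)² is maximised at the sample mean of y:

```python
import numpy as np

# Minibatch gradient ascent on F(x) = -(1/2) * sum_n (x - y_n)^2, whose
# maximum is the sample mean of y. The step sizes rho_t = 1/(N*t) satisfy
# the Robbins-Monro conditions (sum rho_t diverges, sum rho_t^2 converges).
rng = np.random.default_rng(3)
y = rng.normal(5.0, 1.0, size=10000)
N, S = len(y), 100

x = 0.0
for t in range(1, 2001):
    batch = rng.choice(y, size=S, replace=False)
    grad_hat = (N / S) * np.sum(batch - x)   # unbiased estimate of dF/dx
    x += grad_hat / (N * t)                  # step size rho_t = 1/(N*t)

print(abs(x - y.mean()) < 0.1)
```

Each iteration touches only S of the N samples, yet the decaying step sizes average out the minibatch noise and the iterates converge to the full-data optimum.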

b) Natural gradient ascent
Let us consider a model with a single hidden variable x and corresponding variational parameter θ; the objective function is the ELBO L(θ). To find the direction of steepest ascent, one would need to search over all possible directions d within an infinitesimally small distance h of θ, and select the direction with the largest increase in L:

d* = lim_{h→0} argmax_{d : ||d|| ≤ 1} L(θ + h d)

where h represents an infinitesimally small positive step in the space of θ. Notice that the neighbourhood of θ is measured in terms of its Euclidean norm, and the direction of steepest ascent is hence dependent on the Euclidean geometry of the θ space. This is the key problematic issue when working with probability distributions: a small step from θ^{(t)} to θ^{(t+1)} does not guarantee an equivalently small change from L(θ^{(t)}) to L(θ^{(t+1)}). As an example, consider four random variables ψ_1, ψ_2, ψ_3 and ψ_4. Using the Euclidean metric, the distance between ψ_1 and ψ_2 is the same as the distance between ψ_3 and ψ_4. However, the distance in distribution space (measured for example by the KL divergence) is clearly much larger between ψ_1 and ψ_2 than between ψ_3 and ψ_4 (Figure 6).

Figure 6: Illustration of the problem of using Euclidean distances to measure distances between parameters of distributions. In both plots, the red and blue distributions are separated by the same Euclidean distance of 10. Yet, the distance in probability space between the two distributions is intuitively much higher in the right plot.
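The contrast between Euclidean and KL distances can be made concrete for Gaussians with equal variances, where the KL divergence has a simple closed form. The variance values below are illustrative:

```python
def kl_gauss(mu1, mu2, var):
    """KL divergence KL(N(mu1, var) || N(mu2, var)) for equal variances."""
    return (mu1 - mu2) ** 2 / (2.0 * var)

# Two pairs of distributions whose means are separated by the same
# Euclidean distance of 10 (the variance values are illustrative).
wide = kl_gauss(0.0, 10.0, var=100.0)   # broad distributions: small KL
narrow = kl_gauss(0.0, 10.0, var=0.1)   # narrow distributions: huge KL
print(wide, narrow)  # same Euclidean distance, very different KL
```

The same 10-unit shift in parameter space is nearly invisible for the broad pair but enormous for the narrow pair, which is exactly why Euclidean steps in θ are a poor guide for optimising over distributions.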
Hence, rather than using the Euclidean distance, it is more appropriate to use the KL divergence as the distance metric. The direction of steepest ascent measured by the KL divergence is called the natural gradient [1,15]. By introducing Lagrange multipliers and Taylor expansions, one can solve the optimisation problem to obtain the direction of steepest natural ascent (see [1,13]). The solution corresponds to the standard (Euclidean) gradient pre-multiplied by the inverse of the Fisher information matrix of q(x|θ):

∇̂L(θ) = F(θ)^{−1} ∇L(θ)

where F(θ) is defined as

F(θ) = E_q[ (∇_θ log q(x|θ)) (∇_θ log q(x|θ))^T ]

Effectively, the pre-multiplication by F^{−1} takes into account the local curvature of q(x|θ) in distribution space. Importantly, when q(x|θ) belongs to the exponential family, the Fisher information matrix is simply the Hessian of the log normaliser.
In conclusion, while the standard gradient points to the direction of steepest ascent in Euclidean space, the natural gradient points to the direction of steepest ascent in a space where distances are defined by the KL divergence [13,1,9].

c) SVI is a stochastic natural gradient ascent
We can finally introduce the stochastic variational inference (SVI) algorithm, which may indeed be seen as a stochastic natural gradient ascent. This approach was proposed by [9] to solve the fixed-form mean-field variational inference introduced in the previous section. The stochastic approach is most beneficial when one dimension of the matrix of observed variables is much larger than the others; in our case, this is N, the number of samples (or cells).
In this section we summarise the key principles needed to understand the SVI algorithm; for a complete mathematical derivation we refer the reader to [9]. As a starting point, we classify the variables of the probabilistic model into four different types:
• observations (Y): N vectors y_n, each containing the observed variables for the n-th sample.
• local (hidden) variables (Z): N vectors z_n, each containing the K hidden variables associated with sample n.
• global (hidden) variables (β): one vector containing all B hidden variables not indexed by n.
• parameters (α): a vector containing all fixed parameters of the global variables.
The distinction between local and global variables lies in the conditional dependencies. Given the global variables β, the n-th local variable z_n is conditionally independent of any other observation y_j or local variable z_j (where j ≠ n):

p(y_n, z_n | y_j, z_j, β, α) = p(y_n, z_n | β, α)

Under the assumption that the complete conditionals in the model are in the exponential family (which notably implies that p(β) and p(y_n, z_n | β) are also in the exponential family), [9] showed that:
• The update equations of the classic variational inference algorithm (which is a coordinate ascent using the classic gradient of the ELBO) are also satisfied when following the natural gradient of the ELBO.
• The ELBO decomposes into a global term and a sum of local terms (each local term specific to a particular sample), and so does its natural gradient, which makes the stochastic gradient algorithm applicable to the natural gradient.
• The natural gradient of the ELBO is cheaper to compute than the classic gradient.
To illustrate the key points above, let us introduce a general formula for the ELBO (the objective function) in terms of the four types of variables defined above. For this purpose, we adopt the following notation:
• the prior distributions are members of the exponential family:

p(β | α) = h(β) exp( α^T t(β) − a(α) )

• the complete conditionals are members of the exponential family:

p(β | Y, Z, α) = h(β) exp( η_g(Y, Z, α)^T t(β) − a(η_g(Y, Z, α)) )
p(z_{nk} | y_n, z_{n,−k}, β) = h(z_{nk}) exp( η_l(y_n, z_{n,−k}, β)^T t(z_{nk}) − a(η_l(y_n, z_{n,−k}, β)) )

• the variational distributions in the fixed-form mean-field assumption belong to the same exponential family as their priors:

q(β) = h(β) exp( λ^T t(β) − a(λ) ),   q(z_{nk}) = h(z_{nk}) exp( φ_{nk}^T t(z_{nk}) − a(φ_{nk}) )

where λ are the parameters governing the global variables and φ_{nk} are the parameters governing the k-th local variable for the n-th sample.
Following this, the ELBO (Equation (14)) factorises as:

L = E_q[log p(β)] − E_q[log q(β)] + Σ_{n=1}^{N} ( E_q[log p(y_n, z_n | β)] − E_q[log q(z_n)] )

Notice the existence of global terms (which do not depend on n) and local terms (which do depend on n). We can further derive the corresponding noisy estimate by using a randomly sampled minibatch B of size S:

L_S = E_q[log p(β)] − E_q[log q(β)] + (N/S) Σ_{n∈B} ( E_q[log p(y_n, z_n | β)] − E_q[log q(z_n)] )

where the factor N/S ensures that the estimate is unbiased.
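The role of the N/S factor can be checked empirically with toy numbers: the scaled minibatch sum is an unbiased estimator of the full-data sum of local terms.

```python
import numpy as np

# Check that the scaled minibatch sum (N/S) * sum_{n in B} f_n is an
# unbiased estimate of the full-data sum (toy numbers throughout).
rng = np.random.default_rng(4)
f = rng.normal(size=1000)    # one local ELBO term per sample, say
N, S = f.size, 50

full = f.sum()
estimates = [(N / S) * f[rng.choice(N, size=S, replace=False)].sum()
             for _ in range(50000)]

# Individual estimates are noisy, but their average matches the full sum.
print(abs(np.mean(estimates) - full))
```

Individual minibatch estimates fluctuate considerably (the smaller S, the noisier), but their expectation equals the full sum, which is the property the stochastic gradient relies on.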
To derive the variational updates, we need to compute the natural gradient of L_S with respect to the global and the local parameters in turn.
Taking the natural gradient of L_S with respect to the local parameter φ_{nk} leads to:

∇̂_{φ_{nk}} L_S = E_q[η_l(y_n, z_{n,−k}, β)] − φ_{nk}

Similarly, taking the natural gradient of L_S with respect to the global parameter λ leads to:

∇̂_λ L_S = E_q[η_g(Y_B, Z_B, α)] − λ

where the local terms of the minibatch B are rescaled by N/S. If we had taken the natural gradient of L instead of L_S, we would have obtained:

∇̂_λ L = E_q[η_g(Y, Z, α)] − λ

which we can compare to the equation obtained using the standard (Euclidean) gradient:

∇_λ L = ∇²_λ a(λ) ( E_q[η_g(Y, Z, α)] − λ )

The standard gradient requires the Hessian of the log normaliser (i.e. the Fisher information matrix) to be explicitly computed at each iteration. Remarkably, in the natural gradient this term cancels out (see Equation (21)), which leads to a much cheaper computation.

d) SVI algorithm vs. VI algorithm
We can notice that the SVI algorithm (with natural gradients) is equivalent to the non-stochastic VI algorithm when using a step size of 1 and when each minibatch is equal to the full dataset:

Algorithm 1 Mean-field variational inference
1: Initialise the global parameters λ^{(t=0)} randomly
2: repeat
3:   for each local variational parameter φ_{nk} do
4:     update φ_{nk} following the classic VI update equation
5:   end for
6:   for each global variational parameter λ do
7:     update λ following the classic VI update equation
8:   end for
9: until ELBO convergence

Algorithm 2 Stochastic mean-field variational inference
1: Initialise the global parameters λ^{(t=0)} randomly
2: Initialise the step size ρ^{(t=0)}
3: repeat
4:   sample a minibatch B of samples of size S
5:   for each local variational parameter φ_{nk} such that n is in batch B do
6:     update φ_{nk} following the classic VI update equation
7:   end for
8:   for each global variational parameter λ do
9:     λ^{(t+1)} = (1 − ρ^{(t)}) λ^{(t)} + ρ^{(t)} λ̂^{(t+1)}
10:  end for
11: until ELBO convergence

The stochastic variational inference algorithm has two critical hyperparameters:
• The batch size S controls the number of samples used to compute the gradients at each iteration. A trade-off exists: larger batch sizes lead to slower computation of the gradient but to a less noisy estimate of the gradient.
• The learning rate ρ (t) controls the step size in the direction of the natural gradient, with high learning rates leading to higher step sizes. Also, in the natural gradient setting, the learning rate controls how much memory from previous iterations is translated to the current updates.
The particular case of a constant ρ = 1 yields no memory from previous iterations, which recovers the classic (non-stochastic) variational inference update. To ensure proper convergence, the learning rate is usually decayed during training according to a pre-defined function. How to adapt the learning rate is an active area of research [19].
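The memory interpretation of the global update λ^{(t+1)} = (1 − ρ^{(t)}) λ^{(t)} + ρ^{(t)} λ̂ can be illustrated with a toy scalar example; all values below are made up:

```python
import numpy as np

# The SVI global update lambda <- (1 - rho) * lambda + rho * lambda_hat
# is a convex combination of the old estimate and a noisy per-batch
# estimate: rho = 1 keeps no memory, while a decaying rho averages the
# minibatch noise away. All numbers below are illustrative.
rng = np.random.default_rng(5)

lam = 0.0
for t in range(1, 201):
    lam_hat = 10.0 + rng.normal(0.0, 1.0)   # noisy per-batch estimate
    rho = 1.0 / t                            # decaying learning rate
    lam = (1.0 - rho) * lam + rho * lam_hat

print(abs(lam - 10.0) < 0.5)
```

With ρ^{(t)} = 1/t the update is exactly a running average of the per-batch estimates, so the iterate settles near the noise-free target even though every individual estimate is noisy.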

Deriving the variational inference algorithm
We applied the variational inference principles introduced above: we looked for the approximate distribution closest to the true posterior (in terms of KL divergence) within the following family of factorised distributions:

q(X) = ∏_{g,n,k} q(ẑ^g_{nk}, s^g_{nk}) ∏_{g,k} q(α^g_k) q(θ^g_k) ∏_{m,k,d} q(ŵ^m_{kd}, s^m_{kd}) ∏_{m,k} q(α^m_k) q(θ^m_k) ∏_{g,m,d} q(τ^{gm}_d)

Inspired by [24], we did not look for a fully factorised distribution as in the standard mean-field approach, since we can hardly assume ŵ^m_k and s^m_k to be independent in the posterior (θ_k and α_k correspond respectively to the amount and the variability of the non-zero weights for factor k).
Since the posterior is fully factorised in terms of the variables {(ẑ^g_{nk}, s^g_{nk}), α^g_k, θ^g_k, (ŵ^m_{kd}, s^m_{kd}), α^m_k, θ^m_k, τ^{gm}_d}, we could derive from equation 17 the update equations for the parameters of the following factors: q(ẑ^g_{nk}), q(ẑ^g_{nk} | s^g_{nk} = 0), q(ẑ^g_{nk} | s^g_{nk} = 1), q(α^g_k), q(θ^g_k), q(ŵ^m_{kd} | s^m_{kd} = 0), q(ŵ^m_{kd} | s^m_{kd} = 1), q(α^m_k), q(θ^m_k), and q(τ^{gm}_d). To derive the update equations for the parameters of q(s^g_{nk}) (and similarly for q(s^m_{kd})), we adapted the reasoning of [4] to show that, when the functional L(q) is maximised, the following equation holds (where we denote X' = X \ {ẑ^g_{nk}, s^g_{nk}}), which slightly differs from equation 17 (up to an additive constant):

log q(s^g_{nk}) = E_{X', ẑ^g_{nk}|s^g_{nk}} [ log ( p(Y, X) / q(ẑ^g_{nk} | s^g_{nk}) ) ]
               = E_{X', ẑ^g_{nk}|s^g_{nk}} [log p(Y | X)] + E_{X', ẑ^g_{nk}|s^g_{nk}} [log p(ẑ^g_{nk} | α^g_k)] + E_{X', ẑ^g_{nk}|s^g_{nk}} [log p(s^g_{nk})] − E_{X', ẑ^g_{nk}|s^g_{nk}} [log q(ẑ^g_{nk} | s^g_{nk})]

Below we give the explicit update equations for every hidden variable of the MOFA+ model, which are applied at each iteration of the classic variational inference algorithm.

ARD precision of the factors
For every group g and factor k:

Prior distribution:

p(α^g_k) = G(α^g_k | a^α_0, b^α_0)

Variational distribution:

q(α^g_k) = G(α^g_k | â^g_k, b̂^g_k), with â^g_k = a^α_0 + N_g/2 and b̂^g_k = b^α_0 + (1/2) Σ_{n=1}^{N_g} ⟨(ẑ^g_{nk})²⟩

Sparsity parameter of the factors
For every group g and factor k:

Prior distribution:

p(θ^g_k) = Beta(θ^g_k | a^θ_0, b^θ_0)

Variational distribution:

q(θ^g_k) = Beta(θ^g_k | â^g_k, b̂^g_k), with â^g_k = a^θ_0 + Σ_{n=1}^{N_g} ⟨s^g_{nk}⟩ and b̂^g_k = b^θ_0 + N_g − Σ_{n=1}^{N_g} ⟨s^g_{nk}⟩

ARD precision of the weights
For every view m and factor k:

Prior distribution:

p(α^m_k) = G(α^m_k | a^α_0, b^α_0)

Variational distribution:

q(α^m_k) = G(α^m_k | â^m_k, b̂^m_k), with â^m_k = a^α_0 + D_m/2 and b̂^m_k = b^α_0 + (1/2) Σ_{d=1}^{D_m} ⟨(ŵ^m_{kd})²⟩

Sparsity parameter of the weights
For every view m and factor k:

Prior distribution:

p(θ^m_k) = Beta(θ^m_k | a^θ_0, b^θ_0)

Variational distribution:

q(θ^m_k) = Beta(θ^m_k | â^m_k, b̂^m_k), with â^m_k = a^θ_0 + Σ_{d=1}^{D_m} ⟨s^m_{kd}⟩ and b̂^m_k = b^θ_0 + D_m − Σ_{d=1}^{D_m} ⟨s^m_{kd}⟩

Noise (Gaussian)
For every view m, group g and feature d:

Prior distribution:

p(τ^{gm}_d) = G(τ^{gm}_d | a^τ_0, b^τ_0)

Variational distribution:

q(τ^{gm}_d) = G(τ^{gm}_d | â^{gm}_d, b̂^{gm}_d), with â^{gm}_d = a^τ_0 + N_g/2 and b̂^{gm}_d = b^τ_0 + (1/2) Σ_{n=1}^{N_g} ⟨(y^{gm}_{nd} − z^g_n (w^m_d)^T)²⟩

Computing the Evidence Lower Bound

Although computing the ELBO is not necessary to estimate the posterior distribution of the parameters, it is used to monitor the convergence of the algorithm. As shown in Equation (15), the ELBO can be decomposed into a sum of two terms: (1) the expected log likelihood under the current estimate of the posterior distribution of the parameters and (2) the KL divergence between the prior and the variational distributions of the parameters.

Log likelihood term: assuming a Gaussian likelihood, the expected log likelihood is computed in closed form from the moments of the variational distributions.

KL divergence terms: note that KL(q(X) || p(X)) = E_q[ln q(X)] − E_q[ln p(X)]. Below we write the analytical form of these two expectations.
The expectations E_q[ln p(·)] and E_q[ln q(·)] are computed separately for each group of hidden variables: the sparse and non-sparse weights, the sparse factors, the ARD precisions of the weights and of the factors, and the sparsity parameters of the weights and of the factors. For example, for the ARD precision of the factors:

E_q[ln q(α)] = Σ_k [ â^α_k ln b̂^α_k + (â^α_k − 1) ⟨ln α_k⟩ − b̂^α_k ⟨α_k⟩ − ln Γ(â^α_k) ]

and for the sparsity parameters of the factors:

E_q[ln p(θ)] = Σ_{g=1}^{G} Σ_{k=1}^{K} Σ_{n=1}^{N_g} (a_0 − 1) ln(π^g_{n,k}) + (b_0 − 1) ln(1 − π^g_{n,k}) − ln B(a_0, b_0)

E_q[ln q(θ)] = Σ_{g=1}^{G} Σ_{k=1}^{K} Σ_{n=1}^{N_g} (a^g_{k,n} − 1) ln(π^g_{n,k}) + (b^g_{k,n} − 1) ln(1 − π^g_{n,k}) − ln B(a^g_{k,n}, b^g_{k,n})

Deriving the stochastic variational inference algorithm

a) Choice of the local and global variables

The local dimension we chose for applying the stochastic variational inference algorithm is the axis of the samples. Indeed, in large single-cell datasets there are often more single cells than measured variables: N = Σ_{g=1}^{G} N_g is often larger than D = Σ_{m=1}^{M} D_m.
Consequently, the only local hidden variables are the factor matrices Z^g = (z^g_{nk}), 1 ≤ n ≤ N_g, 1 ≤ k ≤ K, which contain the factor values for each sample n belonging to group g ∈ [1, G]. When adding the spike-and-slab prior over the factor matrices Z^g (sparse factors), the only local hidden variables are the two matrices Ẑ^g and S^g (whose element-wise product gives Z^g) for each g ∈ [1, G].
All other hidden variables are global: τ^{gm}, Ŵ^m and S^m (whose element-wise product gives W^m), α^m and θ^m, as well as α^g and θ^g when adding the spike-and-slab prior over the factor matrices Z^g.
Hence, we chose to apply the following SVI algorithm to speed up the inference of the MOFA model on datasets with a large number of samples N:

Algorithm 3 Stochastic variational inference for MOFA+
1: Initialise the global parameters λ^{(t=0)} randomly
2: Initialise the step size ρ^{(t=0)}
3: repeat
4:   sample a minibatch B of samples of size S
5:   for each local variational parameter φ^g_{nk} of the nodes {ẑ^g_{nk}, s^g_{nk}} such that n is in batch B do
6:     φ^{(t+1)}_{nk} is the updated parameter following the classic VI update equation
7:   end for
8:   for each global variational parameter λ of the nodes {τ^{gm}, Ŵ^m, S^m, α^m, θ^m, α^g, θ^g} do
9:     λ^{(t+1)} = (1 − ρ^{(t)}) λ^{(t)} + ρ^{(t)} λ^{(t+1)}_B, where λ^{(t+1)}_B is the updated parameter following the classic VI update equation, but considering the selected batch B repeated N/S times instead of the full dataset
10:  end for
11: until ELBO convergence

b) Choosing the hyperparameters

Batch size: Using small batch sizes for inference on large datasets is often advantageous: dividing the batch size by n divides the time per iteration by n while multiplying the number of iterations required to converge by less than n. This can be explained by a faster improvement of the global parameter estimates during the very first iterations: one does not need to see the observations for every cell in the dataset to improve the initial random guess of the global variables. We performed our experiments with batch sizes varying from 10% to 50% of the full dataset.
Learning rate: We used the following function to iteratively decrease the learning rate:

ρ^{(t)} = (t + τ)^{−κ}

This function has two hyperparameters:
• The forgetting rate κ: controls the decay of the learning rate, with large κ leading to faster decays.
In the natural gradient setting, κ also controls how quickly information from previous iterations is forgotten.
• The delay τ : determines the initial learning rate. In the natural gradient setting, the larger τ the more early iterations are down-weighted. Figure 7 shows the effect on the learning rate of varying the two hyperparameters.
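A quick sketch of this decaying schedule (the helper function and the hyperparameter values are ours, chosen for illustration):

```python
import numpy as np

def learning_rate(t, kappa, tau):
    """Decaying learning rate rho(t) = (t + tau)^(-kappa).

    The forgetting rate kappa controls the decay (larger kappa decays
    faster); the delay tau down-weights early iterations (larger tau
    gives smaller initial steps).
    """
    return (t + tau) ** (-kappa)

t = np.arange(1, 6)
print(learning_rate(t, kappa=0.75, tau=1.0))
print(learning_rate(t, kappa=0.75, tau=10.0))  # smaller initial steps
```

Note that for kappa in (0.5, 1] the schedule satisfies the Robbins-Monro conditions discussed earlier, which is what guarantees convergence of the stochastic updates.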