Model description
In a previous study, we introduced Multi-Omics Factor Analysis (MOFA), a statistical framework for the integrative analysis of multi-omics data from a common set of samples [25]. Building on the Bayesian Group Factor Analysis framework, MOFA infers a low-dimensional representation of the data in terms of a small number of (latent) factors that capture the global sources of variability. Notably, MOFA employs Automatic Relevance Determination (ARD), a hierarchical prior structure that facilitates untangling variation that is shared across multiple modalities from variability that is present in a single modality. In addition, the sparsity assumptions on the weights facilitate the association of molecular features with each factor. Intuitively, MOFA can be viewed as a statistically rigorous generalization of (sparse) principal component analysis (PCA) to multi-omics data.
While the model is applicable to single-cell assays, MOFA and related factor models have critical limitations, including their scalability and the lack of ability to account for side information about the structure between cells. In particular, these models do not provide a principled approach for integrating multiple sample groups and data modalities within the same inference framework.
Here, we propose MOFA+, a model extension addressing these challenges by (i) developing a stochastic variational inference framework amenable to GPU computations, enabling the analysis of datasets with potentially millions of cells and (ii) incorporating priors for flexible, structure regularization, thus enabling joint modelling of multiple groups and data modalities.
Briefly, the inputs to MOFA+ are multiple datasets where features have been aggregated into non-overlapping sets of modalities (also called views) and where cells have been aggregated into non-overlapping sets of groups (Fig. 1a). Data modalities typically correspond to different omics (i.e., RNA expression, DNA methylation, and chromatin accessibility), and groups to different experiments, batches, or conditions. During model training, MOFA+ infers K latent factors with associated feature weight matrices (per data modality) that explain the major axes of variation across the datasets. As in MOFA v1, MOFA+ employs ARD priors to account for structure between views of the data, combined with sparsity-inducing priors to encourage interpretable solutions. However, MOFA+ employs an extended group-wise prior hierarchy, such that the ARD prior does not only act on model weights but also on the factor activities. This strategy enables the simultaneous integration of multiple data modalities and samples groups. Note that if using a single group, the generative model of MOFA+ reduces to the previous MOFA model (but with faster inference). After training, the model output enables a wide range of downstream analyses (Fig. 1b), including variance decomposition, inspection of feature weights, inference of differentiation trajectories, and clustering, among others.
For technical details and mathematical derivations, we refer the reader to “Methods” and the Additional file 2: Supplementary Methods. Guidelines for the selection of group views, data preprocessing and normalization, determination of the number of factors, interpretation of the factor values and the weights are provided in “Methods”. A technical comparison with other factor analysis models is provided in Additional file 3: Table S1.
Model validation using simulated data
Initially, we validated the new features of MOFA+ using simulated data drawn from its generative model. We considered data representing a range of dataset sizes with differing numbers of data modalities and sample groups.
First, to assess the utility of stochastic variational inference, we trained models either using conventional (deterministic) variational inference (VI) or using stochastic variational inference (SVI). Across a wide range of training hyperparameters (see “Methods”), we observed that SVI yields Evidence Lower Bounds (i.e., the objective function of variational inference) that are consistent with those obtained from conventional variational inference as employed in MOFA (Additional file 1: Fig. S1). However, the GPU-accelerated SVI implementation in MOFA+ achieved up to a ~ 20-fold increase in speed compared to VI, with the most dramatic speedups observed for large datasets. This inference scheme facilitates the application of MOFA+ to datasets comprising hundreds of thousands of cells using commodity hardware (Additional file 1: Fig. S2).
Next, we assessed the group-wise ARD priors, by assessing to what extent it facilitates the identification of factors with simultaneous differential activity between groups and data modalities. Indeed, when simulating data where factors explain differing amounts of variance across groups and across data modalities, MOFA+ was able to more accurately reconstruct the true factor activity patterns than MOFA v1 or conventional Bayesian Factor Analysis (Additional file 1: Fig. S3).
Integration of a heterogeneous time-course single-cell RNA-seq dataset
To illustrate the ability of MOFA+ to model data with samples that exhibit an explicit group structure, we considered a time-course scRNA-seq dataset, consisting of 16,152 cells that were isolated from multiple mouse embryos at embryonic days E6.5, E7.0, and E7.25 (two biological replicates per stage). In this dataset, individual embryos are expected to exhibit transcriptional differences at different stages and even between embryos from the same stage due to variation in the rate of the developmental progression. As a proof of principle, we used MOFA+ to disentangle stage-specific variation from variation that is shared across all stages. For this purpose, we considered the six batches of cells (two replicates for each of the three embryonic stages) as different groups in the MOFA+ model.
MOFA+ identified 7 factors that explain at least 1% of variance, which collectively explain between 35 and 55% of the total transcriptional cell-to-cell variance per embryo (Additional file 1: Fig. S4). Some factors recapitulate the existence of post-implantation developmental cell types, including extra-embryonic (ExE) cell types (Factor 1 and Factor 2, respectively) and the transition of epiblast cells to nascent mesoderm via a primitive streak transcriptional state (Factor 4; Fig. 2b, c and Additional file 1: Fig. S5). Consistently, the top weights for these factors are enriched for lineage-specific gene expression markers, including Ttr and Apoa1 for ExE endoderm, Rhox5 and Bex3 for ExE ectoderm, and Mesp1 and Phlda2 for nascent mesoderm [32]. Other factors captured technical variation due to metabolic stress that affects all batches in a similar fashion (Factor 3, Additional file 1: Fig. S6).
When inspecting the factor activity across developmental stages, we observed that the percentage of variance explained by Factor 1 is not correlated with developmental progression, indicating that commitment to ExE endoderm fate occurs early in the embryo and that the proportion of this cell type remains relatively constant from E6.5 to E7.25. In contrast, the amount of variance explained by Factor 4 increases over time (Fig. 2d), consistent with a higher proportion of cells committing to mesoderm after ingression through the primitive streak.
Altogether, this application shows how MOFA+ can identify biologically relevant structure in scRNA-seq datasets with multiple groups. Interpretability is achieved at the expense of reduced information content per factor (due to the linearity assumption of the model). Nevertheless, the MOFA+ factors can also be used as input for other methods that infer non-linear manifolds that discriminate cell types (Fig. 2e) and enable the reconstruction of pseudotime trajectories [33, 34].
Identification of context-dependent methylation signatures associated with cellular diversity in the mammalian cortex
As a second use case, we applied MOFA+ to investigate variation in epigenetic signatures between populations of neurons. This use case illustrates how a multi-group and multi-modal structure can be defined from seemingly uni-modal data, which allows for testing specific biological hypotheses.
We analyzed 3069 cells isolated from the frontal cortex of young adult mice, where DNA methylation was profiled using single-cell bisulfite sequencing [7]. Recent studies have demonstrated that neurons contain significant levels of non-CpG methylation (mCH), an epigenetic mark that has been historically dismissed as a methodological artifact of incomplete bisulfite conversion [35,36,37,38].
Here we used MOFA+ to dissect the degree of coordination between mCH and mCG signatures in different regions of the brain. As input data we quantified mCH and mCG levels at gene bodies, promoters and putative enhancer elements (“Methods”). Each combination of genomic and sequence context (e.g., mCG at enhancer elements) was defined as a separate data modality. To explore the influence of the neuron’s location, we grouped cells according to their cortical layer: Deep, Middle, or Superficial (Additional file 1: Fig. S7). Low coverage of DNA methylation per cell results in large amounts of missing values, which hampers the use of conventional dimensionality reduction techniques such as PCA or NMF [33, 34, 39]. By contrast, the probabilistic framework underlying MOFA+ naturally accounts for missing values [25].
MOFA+ identified 5 factors with a minimum variance explained of 1% (Methods; Additional file 1: Fig. S8). Factor 1, the major source of variation, is linked to the division between inhibitory and excitatory neurons. This factor shows significant mCG activity across all cortical layers, primarily associated with coordinated changes in enhancer elements, but to some extent also gene bodies (Fig. 3a,b). Consistently, the top weights in mCG gene body are enriched for genes whose RNA expression has been shown to discriminate between the two classes of neurons, including Neurod6 and Nrgn [7]. In addition, this analysis identified novel genes with differential gene body mCG levels that may have yet unknown roles in defining the epigenetic landscape of neuronal diversity, including Vsig2, Taar3, and Cort (Additional file 1: Fig. S9).
Factor 2 captures genome-wide differences in global mCH levels (R = 0.99), which is moderately correlated with differences in global mCG levels (R = 0.32) (Additional file 1: Fig. S10). Factor 3 captures heterogeneity linked to the increased cellular diversity along cortical depth, with the Deep layer displaying significantly more diversity of excitatory cell types than the Superficial layer (Fig. 3a,c). Again, we observed that the MOFA+ factors can be used as input to infer non-linear manifolds and reveal the existence of subpopulations of both excitatory and inhibitory cell types (Fig. 3d). Notably, t-SNE representation inferred using MOFA+ factors were substantially better at discriminating subpopulations than the conventional approach of using principal component analysis (Additional file 1: Fig. S11).
Interestingly, in addition to the dominant mCG signal, MOFA+ connected Factor 1 and Factor 3 to variation in mCH, which suggests a putative role of mCH in cellular diversity. We hypothesize that this can be supported if the genomic regions that show mCH signatures are different from the ones marked by the conventional mCG signatures. To investigate this, we correlated the mCH and mCG feature weights for each factor and genomic context. In all cases, we observe a strong positive dependency (Fig. 3e and Additional file 1: Fig. S12), indicating that mCH and mCG signatures are spatially correlated and target similar loci.
Taken together, our results support the hypothesis that mCH and mCG tag the same genomic loci and are associated with the same sources of variation, suggesting that the presence of mCH may be the result of non-specific de novo methylation as a by-product of the establishment of mCG [35].
MOFA+ reveals molecular signatures of lineage commitment during mammalian embryogenesis
As a final use case, we applied MOFA to a complex dataset with multiple sample groups and modalities. Briefly, scNMT-seq was used to jointly assay RNA expression, DNA methylation, and chromatin accessibility in 1828 cells collected across three stages of mouse development [40]. MOFA+ provides a principled approach for delineating coordinated variation between the transcriptome and the epigenome, and for assigning specific covariance patterns to developmental stages.
As input to the model, we quantified DNA methylation and chromatin accessibility at two sets of regulatory elements: gene promoters and enhancer elements (defined as distal H3K27ac sites [40,41,42]). RNA expression was quantified for protein-coding genes. After data processing (“Methods”), separate data modalities were defined for the RNA expression and for each combination of genomic context and epigenetic readout (five data modalities in total). Sample groups were defined by considering cells across the developmental stages (E5.5, E6.5, and E7.5), reflecting the underlying experimental design (Additional file 1: Fig. S13). Notably, the epigenetic readouts are extremely sparse, with, on average, only 18% and 10% of cells having recorded data at a gene promoter for DNA methylation and chromatin accessibility, respectively. In this context, methods that pool information across cells and features are essential for robust inference.
MOFA+ identified 10 factors that explain at least 1% of variation in gene expression (Additional file 1: Fig. S14). Factor 1 captures the formation of ExE endoderm, a cell type that is present across all stages (Fig. 4a), in agreement with our previous results using the independently generated transcriptomic atlas of mouse gastrulation (Fig. 2). MOFA+ links Factor 1 to changes across all molecular layers. Notably, the distribution of weights for DNA methylation is skewed towards negative values (at both enhancers and promoters), indicating that ExE endoderm cells are characterized by a state of global demethylation, consistent with previous studies [43].
The following factors captured the molecular variation associated with the emergence of the primary germ layers at E7.5: mesoderm (Factor 2, Fig. 4b), and embryonic endoderm (Factor 4, Additional file 1: Fig. S15). Again, for both factors, MOFA+ connected the transcriptome variation to changes in DNA methylation and chromatin accessibility. Yet, in striking contrast to Factor 1, the variance decomposition analysis and the distribution of weights indicate that the epigenetic dynamics are primarily associated with enhancer elements. In contrast, little coordinated variation is observed in promoters (Fig. 4b), even for genes that show strong differential expression between germ layers (Additional file 1: Fig. S16). These results are in agreement with other studies that have identified distal regulatory elements as a major target of epigenetic modifications during embryogenesis [44,45,46].
The remaining factors capture variation that is mostly driven by the RNA expression, whose etiology can be related to the existence of morphogenic gradients (Factor 8, Additional file 1: Fig. S17), the emergence of other cellular subpopulations during gastrulation (Factor 7, Additional file 1: Fig. S18) and cell cycle (Factor 6, Additional file 1: Fig. S19).
In conclusion, the MOFA+ output suggests that independent cell fate commitment events undergo different modes of epigenetic variation. While some lineages manifest global changes in the epigenetic landscape (ExE endoderm, Factor 1), other cell types are associated with the emergence of local epigenetic patterns that are driven by specific genomic contexts (embryonic endoderm and mesoderm, Factors 2 and 4).