Scalable latent-factor models applied to single-cell RNA-seq data separate biological drivers from confounding effects

Single-cell RNA-sequencing (scRNA-seq) allows heterogeneity in gene expression levels to be studied in large populations of cells. Such heterogeneity can arise from both technical and biological factors, thus making decomposing sources of variation extremely difficult. We here describe a computationally efficient model that uses prior pathway annotation to guide inference of the biological drivers underpinning the heterogeneity. Moreover, we jointly update and improve gene set annotation and infer factors explaining variability that fall outside the existing annotation. We validate our method using simulations, which demonstrate both its accuracy and its ability to scale to large datasets with up to 100,000 cells. Moreover, through applications to real data we show that our model can robustly decompose scRNA-seq datasets into interpretable components and facilitate the identification of novel sub-populations.


Introduction
Single-cell RNA-sequencing (scRNA-seq) is an established tool for assaying variability in gene expression levels between cells drawn from a population.
Cell-to-cell differences in gene expression can be driven by both observed and unmeasured factors, including technical effects such as batch, or biological drivers including cell type specific features, such as stage of T cell differentiation 1 . Importantly, such technical and biological factors can act upon the same genes 2, 3 , meaning that they should be modeled jointly to fully understand heterogeneity in scRNA-seq data.
Well-established approaches exist for handling observed factors, such as batch or experimental covariates 4,5 . Additionally, methods based on singularvalue decomposition (SVD) and linear mixed models have been developed in order to capture unwanted variability due to unmeasured factors, first for conventional ensemble RNA-profiling experiments [6][7][8] and more recently for scRNA-seq 2 . Furthermore, by using informative marker gene sets, methods based on SVD and regression have also been employed to reconstruct cell states, such as the cell cycle or differentiation stages 2,9 . More recently, Fan et al. 10 introduced PAGODA, a PCA-based method that tests for coordinated overdispersion of sets of genes. However, these existing methods do not model errors in how gene sets are defined, and, more importantly, they fit individual processes independent from each other and do not explicitly account for either the presence of additional unannotated biological factors or confounding sources of variation. Finally, existing factor methods were motivated by relatively small single-cell RNA-seq datasets. Thanks to recent technological advances, it is now possible to generate single-cell RNA-seq datasets containing hundreds of thousands of cells, which requires computationally more efficient methods.

Results
To address this, we here propose a factorial single-cell latent variable model (f-scLVM) that can capture three sources of variation: i) variation in expression attributable to pre-annotated gene sets; ii) variation attributable to sparse, putatively biologically meaningful, but unannotated gene sets; and iii) variation explained by confounding factors that are expected to affect the expression profile of the majority of genes.
If a factor explains variation in the data, we assume that the expression levels of all genes assigned to it co-vary in a consistent manner, thereby allowing the activity of this factor to be inferred from the data. For annotated factors, we incorporate prior annotations derived from publicly available resources such as MSigDB 11 or REACTOME 12 into our model, thereby assigning particular biologically related sets of genes to the same factor. The assignment of genes to these sets is refined in a data-driven manner, assuming that only a small number of changes occur (i.e., that the initial annotation is reasonably accurate). For unannotated, but biologically meaningful gene sets, we assume that they contain a small number of genes.
Finally, to infer confounding factors we build on the assumption that such factors will have global effects on the expression of large numbers of genes, similar to principles applied in population genomics 6,7 .
As well as updating the assignment of gene sets, our model also infers, for a given data set, which factors are most relevant. Inference of model parameters, including gene assignments and factor weights, are determined using computationally efficient variational Bayesian inference, which scales linearly in the number of cells and genes. The posterior distributions over model parameters facilitate a wide range of downstream analyses, including identification of biological drivers, data visualization, the refinement of gene sets and the estimation of residual dataset (Fig. 1a).
As a first illustration, we applied f-scLVM to a dataset of 182 mouse embryonic stem cell (mESC) transcriptomes, where each cell was experimentally staged according to its position within the cell cycle 2 .
Consequently, across the entire population, we expect that the cell cycle is the major source of variation. Indeed, when applying f-scLVM using 44 core molecular pathways derived from MSigDB 11 , the method robustly identified four factors, including G2/M checkpoint and P53 pathway (Supp. Fig. 1).
These two factors could be used to stratify the cells by their position in the cell cycle (Fig. 1b). Other methods, including PAGODA, inferred a larger number of collinear and partially redundant factors that less accurately discriminated cells by cell cycle stage (Supp. Fig. 2), underscoring the importance of jointly modeling all annotated and unannotated factors. Furthermore, unlike existing methodology, f-scLVM allowed for data driven refinement of gene set annotations (Fig. 1c). We observed that the model modified the G2/M checkpoint factor by adding two genes, Anln and Kif20a, both of which are well-characterized cell cycle regulators 13,14 . Similarly, the model identified Ptp4a3, a known target of P53 15 , as an additional member of the P53 pathway factor.
Given these promising results, we next used simulated data to assess how robustly our model can identify relevant factors and complete gene set annotations. Over a wide range of simulations, where we varied the number of annotated factors and simulated confounders, the degree of overlap between gene sets, the number of cells and the size of the annotated gene sets, we consistently observed that f-scLVM more accurately identified the true simulated drivers than other factor analysis methods ( Fig. 2a and Supp. Fig.   3a-e). Our method showed the greatest improvement in performance over existing methods when multiple unannotated factors were simulated and when the gene sets explaining the most variance in the data contained a substantial number of overlapping genes (Fig. 2b). We also confirmed that f-scLVM was robust to extremely sparse datasets, typical of droplet-based approaches (Supp. Fig. 3f,g).
Finally, we considered datasets with simulated errors in the gene set annotation to assess the model's ability to adjust gene sets, a feature that is unique to f-scLVM (Methods). We observed that the model accurately identified genes that should be excluded from and added to gene sets ( Fig. 2c and Supp. Fig. 4). Unsurprisingly, as the fraction of errors in the gene set annotation increased, the ability of the model to recover the true sets declined -however, in the more realistic setting where only a small fraction of genes were poorly annotated (1-10%), our model performed extremely well.
Having evaluated our approach, we next applied f-scLVM to a population of 3,005 neuronal cells 16 . Using gene sets derived from REACTOME pathways as annotation, our model supported the importance of a set of factors similar to those identified by methods such as PAGODA (Fig. 3a), but with important differences (e.g. Innate Immune System; Supp. Fig. 5). Additionally, our model suggested a refined annotation for some of the most important gene sets identified (Fig. 3a), with, on average, 10% of genes being added and 3% of genes being removed for the top 20 annotated factors. Furthermore, the model identified unannotated factors with a high relevance score (Fig. 3b), demonstrating the importance of modelling such factors. We observed that many of these unannotated factors were sparse and captured differences between cell types that are not readily reflected in the pathway annotations (Supp. Fig. 5,6a-c).
The top ranked annotated processes separated the cells into well-defined groups, with the endothelial-mural cells being stratified into two populations by the muscle contraction factor (Fig. 3c). Notably, our model augmented the corresponding gene set by activating several genes that have previously been implicated in muscle contraction but that were not present in the pre-defined REACTOME gene set (Fig. 3d). Among the 13 identified genes were several known markers of vascular smooth muscle cells, including Rgs4, Mtfge8, and Notch3 17-20 . A second major driver identified by f-scLVM was the innate immune system factor, which, in particular, separated microglia (nervous system immune cells) from the remaining cell types (Fig. 3c). Moreover, similar to the muscle contraction factor, the gene set was also augmented with meaningful genes (Supp. Fig. 6d). Finally, given the increasing trend to generate scRNA-seq datasets containing tens to hundreds of thousands of cells, we contrasted the computational efficiency of our model with a variety of factor analysis models (Fig. 4a, Supp. To illustrate this, we applied f-scLVM to expression profiles from 49,300 retinal cells profiled using Drop-seq 21 . Again, our model identified biologically plausible processes, including GPCR signaling and transmission across chemical synapses, as explaining variability in the data (Fig. 4b). As previously, unannotated factors explained substantial variation in the data ( Fig. 4c). In this case, this variation could be attributed to a single factor, which affected a large number of genes (Supp. Fig. 8a), suggesting that it may explain confounding effects. Indeed, this factor was correlated with the cellular detection rate (Supp. Fig 8f), a known confounding feature of scRNAseq datasets 3 .
Given this, we considered the residual dataset generated after regressing out the effect of this factor (Methods, Supp. Table 2). When focusing on a set of six related and well defined cell types identified in the primary analysis -Müller glia, astrocytes, fibroblasts, vascular epithelium, pericytes and microglia -the residual data revealed two sub populations of astrocytes (Fig. 4d,e), as well as two subpopulations of microglia (Supp. Fig. 8b-d) that are not observed in the unadjusted data. In total, 1,024 genes were differentially expressed between the identified astrocyte populations (FDR<10%; Supp. Table 3) and these were enriched for processes related to immune response and activation of astrocytes, such as inflammatory response, BMP signaling pathway, and cellular response to BMP (Supp. Table 4), and included known genes related to reactive/inflammatory processes in astrocytes such as Ccl2 22 . In addition BMP signalling is known to activate distinct downstream transcription factors, including Stat3 and Smad5, which show the expected pattern of behavior between the two newly-identified populations 23 (Fig. 4f).
Taken together, these results show that f-scLVM can be used to infer biological and confounding factors from large datasets.
More broadly, we considered a range of available datasets and investigated the nature of unannotated factors inferred by our model. We observed, as above, that these factors were often associated with technical experimental features that have previously been suggested to underpin variability in scRNA-seq data, including the number of expressed genes and sequencing depth (Supp. Fig. 9). However, these associations were often weak, suggesting that the inferred hidden factors help to capture additional unwanted variation that cannot be assigned to measured covariates.

Discussion
Herein, we have proposed a scalable factor analysis approach to comprehensively model the sources of single-cell transcriptome variability. Unique to our model is the ability to jointly infer both annotated and unannotated factors and to augment predefined gene sets in a data driven manner. Additionally, f-scLVM is computationally efficient, allowing analysis of very large datasets containing hundreds of thousands of cells.
We have validated our model using simulations as well as using real data where the sources of transcriptome variability are well understood.
Subsequently we have applied the model to a range of different studies, demonstrating its ability to infer drivers of transcriptome variation, adjust gene sets to discover new marker genes and account for hidden confounding factors in the data.
Of course, our model is not free of limitations. A general challenge for any method is to reliably differentiate confounding factors from biological signal. f-scLVM addresses this through specific assumptions on the effect of these factors (sparse versus dense) in conjunction with leveraging gene set annotation from pathway databases. However, there is no silver bullet solution and it is hence necessary to interpret the model results and, in particular, the unannotated factors in the context of a given dataset.
A second potential caveat is the lack of accurate gene set annotation, which will necessarily impact the quality of the results. To mitigate this challenge, f-

Competing interests
The authors declare that no competing interests exist.

Code availability
An open source implementation of f-scLVM for reviewing purposes is available at: https://github.com/PMBio/f-scLVM.

The factorial single-cell latent variable model (f-scLVM)
f-scLVM is based on a variant of matrix factorization, decomposing the observed gene matrix into a sum of sum of contributions from C measured covariates, A annotated factors, whose inference is guided by pathway gene sets, and H additional unannotated factors: (1) using matrix notation, collapsing the factors into a factor activation matrix is then inferred from the observed expression data and the prior annotation jointly. This approach allows gene set annotations to be incorporated as well as data-driven augmentation. Once trained, the marginal posterior estimates of V,S can be used to identify genes that are added to or removed from a particular process.

Identifying expression drivers using automatic relevance determination
In addition to sparseness of regulatory effects, f-scLVM employs a second level of regularization using automatic relevance determination (ARD) 25 . The ARD prior is widely used in (probabilistic) matrix factorization models to deactivate factors that are not needed. This is achieved by placing a hierarchical prior on the precision of the normal priors for active links (Eqn. (2)) S ~ Γ( , ). The parameter S will be large for factors with low relevance, which corresponds to low prior variance, thereby driving the regulatory weights to zero. The prior variance 1/ S can also be interpreted as a measure of the regulatory impact of a factor and corresponds to the expected variance explained by the factor, for the subset of genes with a regulatory effect (see downstream analyses).

Modeling unannotated factors
In approach is analogous to the dropout model, however assuming the following likelihood model: with link function hV = log (1 + ƒ "… ) and h,V now denoting raw count values.

Parameter inference
Closed-form inference in sparse factor analysis is not tractable. In order to achieve scalability to large numbers of cells and genes, we employ deterministic approximate Bayesian inference based on variational methods 27

Downstream analysis
The fitted f-scLVM model allows for a range of different downstream analyses.
Factor relevance: The relevance of annotated pathways factors can be deduced from the ARD variance 1/ S , which corresponds to the expected explained variance of factor k for the subset of genes with a regulatory effect (e.g. Fig. 3a,b).
Visualization: The posterior distribution over the inferred factors allows for the visualization of cell states (e.g. Fig. 3c). This is possible both for annotated factors and for unannotated factors, where in particular sparse unannotated factors frequently tend to capture additional structure between cell types (Supp. Fig. 5,6). Gene set refinement: By comparing the posterior distribution on the indicator variables V,S with the prior gene set annotations V,S , it is possible to identify individual genes that were added to or removed from a pathway factor during inference (e.g. Fig. 3a,d). We use the posterior threshold of 0.5 to identify genet set augmentations in the annotation.

Relationship to other factor analysis models
Several existing factor analysis methods are related to f-scLVM. First, factor analysis with dense unannotated factors is used to adjust for unwanted variation in bulk datasets, including SVA 6 , RUV 8   Methods Sec 2.

Implementation of comparison partners
We compared the performance of f-scLVM to a number of alternative factor models. First, we ran PAGODA using the scde R package 10  For runtime assessments, we additionally considered two approaches to account for unwanted variation. SVA timings were reported using the R implementation of the SVA package 30 , considering a fixed number of surrogate variables that correspond to the true number of simulated factors.
RUV runtime results were obtained using the RUV2 function from the R implementation, which estimates and adjusts for unwanted variation using control genes. Runtimes estimates were obtained using the time module in python (time() function) and the proc.time() function in R; all simulations were run on 8 cores of an Intel Xeon 2.60GHz CPU.

Diagnostics and f-scLVM parameter settings
By default f-scLVM is fitted using annotated factors guided by gene set annotations and additional dense unannotated factors that capture unwanted variation. However, for some datasets this set of factors may not be sufficient to explain the observed heterogeneity, e.g. because potential differences between cell types may not be well reflected by the provided annotations. In this case, it is advised to infer an additional set of sparse unannotated factors.  Fig. 3h).

Simulation study
We simulated gene expression matrices based on a linear additive model, an  Table 1 for full details of simulation parameters).
Each simulated dataset consisted of between 20 -500 synthetic cells and 6,000 genes. Gene set sizes were determined by sampling from REACTOME pathways (considering 421 pathways with 20 to 933 genes). When ranking active pathways, we compiled an annotation consisting of the true drivers and an additional set of 15 non-active pathways as a negative set, and provided it to each considered method. Confounding factors, if simulated, were generated analogously to the approach described in 31 , assuming broad effects affecting between 400 and 3,000 randomly selected genes. The annotations of simulated pathways were generated sequentially, ordering pathways by decreasing size and drawing genes with a selected overlap to already existing pathways. Pathways were simulated to have overlapping gene sets, between 0.0 (no overlap) and 0.7 (70% of the gens overlap). To test for the impact of the size of gene sets, we additionally considered sampling REACTOME pathways with between 20-50, 50-100, and 100-200 genes. To assess the robustness of f-scLVM to incorrect gene set annotations, we simulated between 1% and 50% of false negative and between 1% and 10% false positive genes in the gene set annotations of individual factors. We also considered more challenging miss-annotations by introducing gene-swaps between pairs of active factors (for between 1% and 25% of all genes). Factor activations as well as non-zero regulatory weights were drawn assuming a unit variance normal distribution. Residual noise was simulated as normally distributed with standard deviation 0.1. When dropout was simulated, we considered two alternative dropout mechanisms. First, we model a threshold effect by setting all values less than a given threshold to 0; this reflects a limit of detection where small numbers of molecules cannot be detected reliably. Second, we considered modeling the probability of dropout events as a function of the true expression level, assuming an exponential relationship 29 : ••'' = exp (− hV ‰ ), with hV being the latent expression level introduced above and the exponential decay parameter. Both dropout processes are simulated, where each setting is parameterized by and the threshold value, which corresponds to the lower limit of detection. For each simulation setting, 50 independent datasets were generated.
To assess the performance of f-scLVM and alternative methods, we consider the receiver operator characteristics (ROC) for identifying the true simulated drivers. For f-scLVM the factor relevance was used to rank factors. Analogous metrics were derived for all alternative methods; see Implementation details of alternative methods.
Additionally, we assessed the ability of f-scLVM to augment corrupted gene set annotations (Fig. 2c, Supp. Fig. 4). We evaluated the ability of the model to correct the false positive and false negative annotation separately.

Staged mouse embryonic stem cells
The set of 182 mESCs staged for the cell cycle have previously been Invitrogen). Cells in all three cell cycle stages were profiled using the Fluidigm C1 system. We followed the pre-processing and normalization approach as previously described 9 and considered log-transformed and size-factor adjusted (geometric library size on endogenous genes) gene expression counts for 6,635 variable genes for analysis. Additional results show in Supp.

Zeisel et al. dataset
We analyzed log-transformed gene expression values of 3,005 single neurons sequenced using a protocol with unique molecular identifier 16 . We followed the preprocessing and filtering steps from the primary publication, resulting in 7,097 variable genes. f-scLVM was applied using 161 annotations from the REACTOME database (after filtering, Supp. Methods), providing a high resolution annotation. Following model diagnostics steps (see Diagnostics and f-scLVM parameter settings), an additional 5 sparse unannotated factors were added and fit jointly with the remaining factors (Supp. Fig. 5). Residual datasets were generated by regressing out the effect of the most relevant unannotated factor (Supp. Fig. 6e-h).

49,300 Retina cells
We considered the normalized, log-transformed expression values of 49,300 retina cells as described in 21 . We considered all expressed genes, using the dropout noise model in f-scLVM to account for low sequence coverage. We considered gene sets from the REACTOME database. Because of the size of the data, we used factor pre-screening to reduce the set of factors before training, retaining 50 gene sets (Supp. Methods). To generate expression values corrected for confounding factors, we considered residual gene expression profiles, regressing out the effect of the most relevant unannotated dense factor (Supp. Table 2). Visualizations of corrected and raw expression values of six related cell types identified in the primary publication (Müller glia, astrocytes, fibroblasts, vascular epithelium, pericytes and microglia) were obtained using t-SNE.