Single-cell binary factor analysis (scBFA) model
scBFA is available as an R package on Bioconductor at https://bioconductor.org/packages/devel/bioc/html/scBFA.html, as well as on GitHub (https://github.com/quon-titative-biology/scBFA). The main function to run scBFA is scBFA().
In our notation below, matrices are represented by upper case bold letters, vectors by lower case bold letters, and numeric constants as upper case non-bold letters. Square brackets also indicate a matrix, though represented as a series of column vectors. A matrix subscript with round brackets indicates the index of the corresponding column vector.
The schematic of our single-cell binary factor analysis (scBFA) model is shown in Additional file 1: Figure S28. The input data to scBFA consists of two matrices, O and X. O is a matrix of counts, consisting of G features (genes in the case of scRNA-seq data, or loci in the case of scATAC-seq data) measured in each of N samples (cells). From the input data O, we compute a matrix B, where Bij represents the detection pattern observed for cell i (i = 1, …, N) and feature j (j = 1, …, G). For scRNA-seq inputs, Bij = 1 when Oij ≥ 1, otherwise Bij = 0. Therefore, Bij = 1 indicates that at least one read (or UMI) maps to gene j in cell i and therefore suggests gene expression. Similarly, for scATAC-seq input data, Bij = 1 when Oij ≥ 1, in other words, when at least one read maps to locus j in cell i (and therefore suggests locus accessibility), otherwise Bij = 0. scBFA is adapted from a generalized linear model framework and is therefore capable of adjusting for batch effects and other nuisance cell-level covariates. Input X = [x1, x2, …, xN]T is a N × C cell-level covariate matrix that enables correction for C observed nuisance factors such as batch effects or other cell-specific quality control measurements. If there are no such cell-level covariates that need to be adjusted for, X is the null matrix by default.
The intuition behind scBFA is that it performs dimensionality reduction to explain the high-dimensional detection pattern matrix B by estimating two lower-dimensional matrices: a N × K embedding matrix Z = [ z1, z2, … , zN]T,and a K × G loading matrix A = [a1, a2, … , aG]. Here, K is the number of latent dimensions used to approximate Bij, where where K ≪ G. ui and vi and represent the ith cell-level intercept and jth feature-specific intercept, respectively. u is therefore a vector of length N, and v is a vector of length G. For scRNA-seq, for example, we expect u and v will implicitly model the variation of gene expression caused by library size. µij is the mean of the Bernoulli distribution governing whether feature j is detected in cell i or not.
Formally, scBFA is defined by the following model:
$$ \mathrm{logit}\left({\mu}_{ij}\right)={\boldsymbol{x}}_{\boldsymbol{i}}^{\boldsymbol{T}}{\boldsymbol{\beta}}_{\boldsymbol{j}}+{\boldsymbol{z}}_{\boldsymbol{i}}^{\boldsymbol{T}}{\boldsymbol{a}}_{\boldsymbol{j}}+{u}_i+{v}_j $$
$$ P\left(\boldsymbol{B};\boldsymbol{A},\boldsymbol{Z},\boldsymbol{\beta}, \boldsymbol{X},\boldsymbol{u},\boldsymbol{v}\right)=\prod \limits_{i,j}\mathrm{Bernoulli}\left({B}_{ij}|{\mu}_{ij},\boldsymbol{A},\boldsymbol{Z},\boldsymbol{\beta}, \boldsymbol{X},\boldsymbol{u},\boldsymbol{v}\right) $$
We train the scBFA model by optimizing the following penalized likelihood function:
$$ f\left(\boldsymbol{B},\boldsymbol{A},\boldsymbol{Z},\boldsymbol{\beta}, \boldsymbol{X},\boldsymbol{u},\boldsymbol{v}\right)=\left[{\sum}_{i,j}\ln\ P\left({B}_{ij};\boldsymbol{A},\boldsymbol{Z},\boldsymbol{\beta}, \boldsymbol{X},\boldsymbol{u},\boldsymbol{v}\right)\right]-{\epsilon}_1{\left\Vert \boldsymbol{A}\right\Vert}_2^2-{\upepsilon}_2{\left\Vert \boldsymbol{Z}\right\Vert}_2^2-{\upepsilon}_3{\left\Vert \boldsymbol{\beta} \right\Vert}_2^2 $$
Here, ϵ1, ϵ2, and ϵ3 are the tunable parameters that control the regularization of the model parameters, where by default \( {\upepsilon}_1=\frac{\epsilon_0}{G},{\epsilon}_2=\frac{\upepsilon_0}{N} \), \( {\epsilon}_3=\frac{\epsilon_0}{G} \), and ϵ0 = max {N, G}. The optimization is carried out using the L-BFGS-B optimization routine available in the R optim() function. After completing the optimization, we orthogonalize Z and A using the orthogonalizeTraceNorm() function available in the ZINB-WaVE [8] package.
Binary PCA model and calculation of the gene detection pattern matrix
Binary PCA describes our fast approximation to scBFA by simply running PCA, with the exception of converting the input count matrix into a detection matrix by converting non-zero values to one. We implemented Binary PCA through the addition of a single line of R code. Suppose that countMatrix is the name of the matrix in R that stores, for example, the UMI counts for each gene in each cell. To run Binary PCA, we first convert the countMatrix into the gene detection pattern matrix before running PCA via the R command:
countMatrix[which(countMatrix>0)] <- 1
We then call PCA using the following command in R:
PCA_results <- prcomp(countMatrix, center=TRUE, scale.=FALSE);
Note that typically, scale is set to TRUE when calling PCA. For Binary PCA, we set it to FALSE because the variance in gene detection is potentially associated with cell types (e.g., genes with higher detection variance are more likely to be marker genes, and therefore should contribute more to the embedding).
Execution of scRNA-seq analysis methods
We compared scBFA against scVI [12], SAVER [13], sctransform [14], scrna2019 [15], PCA, ZINB-WaVE [8], and scImpute [9]. These seven methods were selected to represent diverse classes of approaches to scRNA-seq data analysis, including dimensionality reduction methods (PCA, ZINB-WaVE, scVI), preprocessing approaches that can be applied before dimensionality reduction (sctransform, scrna2019), and imputation methods that can be applied before dimensionality reduction (SAVER, scImpute). Of the dimensionality reduction methods, PCA was chosen because of its implementation in popular packages such as Seurat [42], and scVI [12] is a leading deep learning-based dimensionality reduction method. ZINB-WaVE was chosen specifically because it is a recently developed method, and scBFA is designed as a gene detection-based analog of ZINB-WaVE; therefore, comparing scBFA with ZINB-WaVE is the fairest comparison of gene detection versus quantification-based approaches.
We ran most of the scRNA-seq analysis methods with their default parameter settings, with the exception of scVI and scrna2019.
scVI requires specification of a learning rate and the number of iterations before convergence. During training, we found that scVI performance was heavily influenced by these two parameters. We therefore performed an unbiased grid search by setting the number of iterations to be either 2000 or 4000 and setting the learning rate to be either 1e−2, 1e−3, or 1e−4. We then trained the model with all 6 possible combinations of learning rate and number of iterations, and selected the combination of parameters that gave the lowest loss on the hold-out set. The loss value is provided by scVI during training. During training, the size of the training set is fixed to be 75% of the entire dataset, and the remaining parameters are fixed at their default values. We repeated the above parameter search for the same number of factors as the other methods for all scRNA-seq datasets. For the simulated datasets, given the large number of scenarios tested, we fixed the learning rate to be 0.001 and number of iterations to be 2000.
scrna2019 is a method developed to perform feature selection and GLM-based factor analysis on scRNA-seq [15]. The scrna2019 R package (obtained on May 6, 2019, from https://github.com/willtownes/scrna2019) offers both a GLM factor analysis model and a corresponding deviance score approximation. We used the deviance score approximation instead of the GLM framework for our experiments because several benchmarks required batch effect correction, which should be addressed using the deviance score approximation as per scrna2019’s authors’ recommendations [15]. Also, at the time of the writing of this paper, the GLM implementation produced errors for three of our datasets that prevented us from completing our experiments.
Execution of scATAC-seq analysis methods
We compared scBFA against PCA, Binary PCA, Scasat [34], Destin [35], scABC [36], chromVAR [37], and SCRAT [38]. Scasat and Destin are scATAC-seq analysis tools primarily designed to identify cell types and differential accessibility analysis. Both methods treat dimensionality reduction as a prior step before further clustering distinct cell types. Scasat’s embedding space is learned by performing multidimensional scaling (MDS) on a cell-cell Jaccard similarity matrix computed from a binarized chromatin accessibility matrix. Destin developed a weighted principal component analysis approach using distance to transcription start sites and reference regulomic data. scABC is an unsupervised clustering tool of single-cell epigenetic data and performs multi-stage clustering based on the input chromatin accessibility matrix directly. chromVAR aggregates motif position weight matrices (PWM) and chromatin accessibility to uncover cell populations and identify motifs associated with cell type-specific variation. SCRAT summarizes several distinct regulatory genomic data (including prior established gene sets and transcription factor binding motif sites, among others) to identify distinct cell populations from single-cell genomic data.
For SCRAT, we used the regulatory activity feature list provided by SCRATsummary() as the default input features. In addition, we also input the BED files corresponding to the scATAC-seq data as a custom feature as suggested by the SCRAT authors [38], which we found to improve the performance.
Quantifying the effect of imputation on scBFA
We compared the performance of scBFA before and after imputation on our 14 benchmark datasets under HVG selection. We tested two state-of-the-art imputation methods, SAVER [13] and scImpute [9]. SAVER estimates library size-normalized posterior means of gene expression levels (\( {\hat{\lambda}}_{ij} \)), which are inappropriate for input into scBFA because they are not sparse. We therefore sampled counts from SAVER’s generative model as follows:
$$ {O}_{ij}\sim \mathrm{Poisson}\left({s}_i{\hat{\lambda}}_{ij}\right) $$
where \( {\hat{\lambda}}_{ij} \) is SAVER’s imputed expression level and si is the library size for cell i divided by the mean library size across cells [13]. We generated five separate count matrices Oij based on the SAVER estimates \( {\hat{\lambda}}_{ij} \). For scImpute, we used its imputed gene counts matrix directly as input for scBFA.
Selection of representative datasets to measure gene detection rates
We obtained a total of 36 scRNA-seq datasets from which we calculated gene detection rates as a function of the number of cells in each dataset (Additional file 1: Figure S1). We obtained these datasets from two sources, the conquer database [43] and the Gene Expression Omnibus (GEO) [44]. For GEO, we used the search term “((‘single cell rna-seq’ OR ‘single cell transcriptomic’ OR ‘10X’ OR ‘single cell transcriptome’) AND Expression profiling by high throughput sequencing[DataSet Type]) AND (Homo sapiens[Organism] OR Mus musculus[Organism]),” sorted all datasets by size, then selected a similar number of datasets from both the top and bottom of the list (Additional file 1: Table S7).
Computing mean and dispersion curves
We use the DESeq2 [28] package to obtain gene-specific dispersion estimates for each dataset, where dispersion is measured across all cells in a dataset. Within the DESeq2 pipeline, gene-wise dispersions are first estimated, a trend line is fit to the gene-wise dispersion estimates, and finally, shrinkage is applied to the gene-wise dispersion estimates (MAP estimates). In Fig. 3b, we extracted the fitted gene-wise dispersion estimates from the trend line (second step), and we fit these dispersion estimates by local linear regression (LOESS) using the gene-wise mean of transcripts per million (TPM) across all cells as the explanatory variable. To address the border effect of LOESS, we removed the top and bottom 2.5% of genes as ranked by TPM. Note that using the MAP dispersion estimates (final DESeq2 step) or the fitted dispersion estimates from the trended fit (second step) does not materially change our conclusion. The exception is for the dataset PBMC where there are 455 genes with its MAP dispersion estimates staying at their initialized value of 1e−8 during optimization, while their fitted dispersion estimates are substantially different. We therefore chose the fitted dispersion estimates to generate Fig. 3b.
Benchmarking dimensionality reduction methods for scRNA-seq
We evaluated each dimensionality reduction method by how well their low dimensional embeddings discriminate experimentally defined cell types. For each dataset and method tested, we first performed dimensionality reduction on the entire dataset to obtain an embedding matrix representing each cell in K dimensions (the matrix Z described in the scBFA methods section). We then performed fivefold cross-validation in which we trained a non-regularized multi-level logistic classifier on the training embeddings from each method using the a priori known cell type labels, then used the model to predict cell type labels for the test embeddings. For every prediction, using the known cell type labels, we computed a confusion matrix and the corresponding Matthews’ correlation coefficient (MCC) as a measure of classification accuracy. MCC was calculated using the R package mltools. We repeated the fivefold cross-validation a total of 15 times and reported the mean classification accuracy as the final accuracy.
$$ \mathrm{MCC}=\frac{TP\ast TN- FP\ast FN}{\left(\sqrt{TP+ FP}\right)\ast \left(\sqrt{TP+ FN}\right)\ast \left(\sqrt{TN+ FP}\right)\ast \left(\sqrt{TN+ FN}\right)\ } $$
In our analysis of the ERCC dataset, we used a different evaluation metric because each “cell” represents technical replicates of the spiked-in RNA diluted at a constant ratio (10:1). Under the assumption that the only variation between “cells” is due to technical factors, we therefore used averaged within-group sum of squares (AWSS) to measure how the low-dimensional embedding learned by each method captured such homogeneity. Given an N by K embedding matrix Z, AWSS was calculated as follows:
$$ {\displaystyle \begin{array}{c}W=\left(Z-\overline{Z}\right)\ \\ {}\mathrm{AWSS}=\frac{\mathrm{trace}\left({W}^TW\right)}{N-1}\\ {}\kern1.75em \end{array}} $$
Here, \( \overline{Z} \) is an N by K matrix for which every row is the column mean of Z.
Benchmarking cell type identification methods for scATAC-seq
We benchmarked scBFA against existing scATAC-seq analysis tools by evaluating their ability to correctly cluster cell types. We used a different evaluation scheme from that used for the scRNA-seq experiments because one of the existing methods (scABC) does not produce low-dimensional embeddings and instead outputs cluster labels. The methods Scasat and Destin both provide cluster labels directly from their analysis pipeline. For scBFA, PCA, Binary PCA, chromVAR, and SCRAT, we clustered cells based on the learned embedding matrices using R’s built-in hierarchical clustering function hclust() with Wald’s distance. We compared the accuracy of the clustering results from each method using the metrics normalized mutual information (NMI) and Adjusted Rand Index (ARI), computed using the R package aricode.
Simulation of scRNA-seq data
A variation of the ZINB-WaVE model was used to simulate scRNA-seq datasets and is defined as follows (Additional file 1: Figure S29):
$$ {\boldsymbol{z}}_{\boldsymbol{\mu} \left(\boldsymbol{i}\right)}\sim N\left({\hat{\boldsymbol{z}}}_i,{\sigma}_{\mu}^2{\boldsymbol{I}}_{\boldsymbol{K}}\right) $$
$$ {\boldsymbol{z}}_{\boldsymbol{\pi} \left(\boldsymbol{i}\right)}\sim N\left({\hat{\boldsymbol{z}}}_{\boldsymbol{i}},{\sigma}_{\boldsymbol{\pi}}^2{\boldsymbol{I}}_{\boldsymbol{K}}\right) $$
$$ \mathrm{logit}\left({\pi}_{ij}\right)=\left({\boldsymbol{z}}_{\boldsymbol{\pi} \left(\boldsymbol{i}\right)}^{\boldsymbol{T}}{\hat{\boldsymbol{a}}}_{\boldsymbol{\pi} \left(\boldsymbol{j}\right)}+{\hat{u}}_{\pi (i)}+{\hat{v}}_{\pi (j)}-\delta \right) $$
$$ \log \left({\mu}_{ij}\right)=\left({\boldsymbol{z}}_{\boldsymbol{\mu} \left(\boldsymbol{i}\right)}^{\boldsymbol{T}}{\hat{\boldsymbol{a}}}_{\boldsymbol{\mu} \left(\boldsymbol{j}\right)}+{\hat{\mu}}_{\mu (i)}+{\hat{v}}_{\mu (j)}\right) $$
Πij ∼ Bernoulli(πij)
$$ {O}_{ij}\left\{\begin{array}{c}=0\kern6em \mathrm{if}\ {\Pi}_{ij}=1\\ {}\sim \mathrm{NB}\left({\mu}_{ij},r\right)\kern2.25em \mathrm{if}\ {\Pi}_{ij}=0\end{array}\right. $$
To keep the consistency of the notation, the parameters we used above \( \left\{\hat{\boldsymbol{Z}},{\hat{\boldsymbol{A}}}_{\boldsymbol{\mu}},{\hat{\boldsymbol{A}}}_{\boldsymbol{\pi}},{\hat{\boldsymbol{u}}}_{\boldsymbol{\mu}},{\hat{\boldsymbol{u}}}_{\boldsymbol{\pi}},{\hat{\boldsymbol{v}}}_{\boldsymbol{\mu}},{\hat{\boldsymbol{v}}}_{\boldsymbol{\pi}}\right\} \) respectively correspond to the parameters \( \left\{\hat{\boldsymbol{W}},{\hat{\boldsymbol{\alpha}}}_{\boldsymbol{\mu}},{\hat{\boldsymbol{\alpha}}}_{\boldsymbol{\pi}},{\hat{\boldsymbol{\gamma}}}_{\boldsymbol{\pi}},{\hat{\boldsymbol{\gamma}}}_{\boldsymbol{\mu}},{\hat{\boldsymbol{\beta}}}_{\boldsymbol{\mu}},{\hat{\boldsymbol{\beta}}}_{\boldsymbol{\pi}}\right\} \) used in the original ZINB-WaVE paper. In the first step of our simulations, all parameters with a hat accent are set a priori by fitting the ZINB-WaVE model using its R package [8] on a single scRNA-seq dataset in order to use realistic parameters for our simulation. The remaining parameters \( \left\{\delta, {\sigma}_{\mu}^2,{\sigma}_{\boldsymbol{\pi}}^2,r\right\} \) are then systematically varied in our simulations to determine their effect on downstream dimensionality reduction methods. Oij denotes the gene counts for cell i and feature j. As is described in the original ZINB-WaVE paper, \( \hat{\boldsymbol{Z}} = {\left[{\hat{\boldsymbol{z}}}_{\mathbf{1}},{\hat{\boldsymbol{z}}}_{\mathbf{2}},\dots, {\hat{\boldsymbol{z}}}_{\boldsymbol{N}}\right]}^{\boldsymbol{T}} \)is a N × K embedding matrix, while \( {\hat{\boldsymbol{A}}}_{\boldsymbol{\mu}}=\left[{\hat{\boldsymbol{a}}}_{\boldsymbol{\mu} \left(\mathbf{1}\right)},{\hat{\boldsymbol{a}}}_{\boldsymbol{\mu} \left(\mathbf{2}\right)},\dots, {\hat{\boldsymbol{a}}}_{\boldsymbol{\mu} \left(\boldsymbol{G}\right)}\right] \) and \( {\hat{\boldsymbol{A}}}_{\boldsymbol{\pi}}=\left[{\hat{\boldsymbol{a}}}_{\boldsymbol{\pi} \left(\mathbf{1}\right)},{\hat{\boldsymbol{a}}}_{\boldsymbol{\pi} \left(\mathbf{2}\right)},\dots, {\hat{\boldsymbol{a}}}_{\boldsymbol{\pi} \left(\boldsymbol{G}\right)}\right] \) are the corresponding K × G regression coefficient matrices for the negative binomial and Bernoulli distributions governing the gene count and detection components, respectively. The output of the Bernoulli distribution is the latent variable Πij, which decides whether a gene is detected (in which case the observed value Oij is sampled from a negative binomial distribution), or not detected. \( {\hat{\boldsymbol{u}}}_{\boldsymbol{\mu}} \) and \( {\hat{\boldsymbol{u}}}_{\boldsymbol{\pi}} \) are N × 1 cell-specific intercepts for the count matrix and detection matrix, respectively. Similarly, \( {\hat{\boldsymbol{v}}}_{\boldsymbol{\mu}} \) and \( {\hat{\boldsymbol{v}}}_{\boldsymbol{\pi}} \) are G × 1 gene-specific intercepts for the count matrix and detection matrix, respectively. The number of latent dimensions K used to generate the gene expression values was fixed at 5, and we used a total of 2000 highly variable genes as in the original dataset. The LPS dataset does not provide any cell-level covariates, so in these simulations, there are no cell- or gene-wise covariate matrices. For quality control purposes, we filtered out genes that are expressed in fewer than 1% of the cells and then filtered out cells in which less than 1% of genes are expressed.
The distinction between our simulation framework and ZINB-WaVE is that ZINB-WaVE maintains the same cell embedding space \( \hat{\boldsymbol{Z}} \) across both the gene detection and count spaces. In contrast, our framework relaxes this constraint by introducing individual embeddings Zπ and Zμ that are close to \( \hat{\boldsymbol{Z}} \). Formally, Zπ and Zμ are N × K embedding matrices for the gene detection and count spaces, respectively. Each row i of Zπ and Zπ are sampled from respective K-multivariate Gaussian distributions with the same mean defined by \( {\hat{\boldsymbol{z}}}_i \) and spherical variance parameters \( {\sigma}_{\boldsymbol{\pi}}^2 \) and \( {\sigma}_{\mu}^2 \), respectively.
In our simulations, we varied the simulation parameters \( \left\{\delta, {\sigma}_{\mu}^2,{\sigma}_{\boldsymbol{\pi}}^2,r\right\} \) as follows. To influence the total number of gene counts detected (total detection rate), we set δ ∈ {−2, −0.5, 1, 2.5, 4}. To influence the variance in the gene detection and count embedding spaces, we set \( {\sigma}_{\pi}^2,{\sigma}_{\mu}^2\in \left\{0.1,0.5,1,2,3\right\} \). Finally, we varied the common gene dispersion parameter r ∈ {0.5,1, 5}. In total, the number of unique parameter settings we used to simulate scRNA-seq data is 5 × 5 × 5 × 3 = 375. For each of those scenarios, we simulated 3 replicates, resulted in 375 × 3 = 1125 datasets.
Quality control of scRNA-seq data
For each scRNA-seq dataset tested, we performed a standardized quality control process. We first removed cells for which mitochondrial genes accounted for over 50% of the total observed counts. Then, we filtered out genes that are expressed in fewer than 1% of cells and removed cells whose library size (total read or UMI count) was less than one-eighth quantile of all cell library sizes. One exception is the MEM-T cell dataset, where we removed an extra 361 cells from the batch labeled “subject16” to remove batches that were confounded with cell types.
Preprocessing of scATAC-seq data
We followed the scATAC-seq pipeline for processing and aligning reads used by the Destin method [35], obtained from GitHub at https://github.com/urrutiag/destin on April 22, 2019. This preprocessing pipeline yielded 2779, 576, and 960 BAM files for GSE96769 [45], GSE74310 [46], and GSE107816 [47], respectively. These BAM files form the initial input of Destin, scABC, and SCRAT. The input chromatin accessibility matrix for chromVAR and Scasat was then obtained from Destin’s preprocessing pipeline directly.
For GSE96769, we only kept cells and genomic loci that are used in the original paper’s analysis. The indices for genome loci and cells that passed quality control are supplied in the supplementary files of the original paper. Beyond that, we selected a subset of frozen cells from five patients, excluding patient BM0106, and a subset of pDC cells from patient BM1137 to keep as many samples as possible while removing the part of the batches confounded with cell types. This enabled us to construct a design matrix to correct for patient-specific effects. We furthermore excluded cells that are labeled as unknown by the original author.
For each scATAC-seq dataset tested, we only kept genomic loci that are accessible in at least 1% of cells and then removed cells with a total number of accessible sites that deviates more than 3 standard errors to the mean (in either direction) across all cells. The number of retained cells used as input in our downstream analysis was 1358 for GSE96769, 572 for GSE74310, and 929 for GSE108716. We found that SCRAT and chromVAR’s preprocessing pipeline generated NA values, and so for these tools, we filtered out additional cells. For SCRAT, this yielded 1375 cells for GSE96769, 534 cells for GSE74310, and 815 cells for GSE108716. For chromVAR, this yielded 1358 cells for GSE96769, 529 cells for GSE74310, and 811 cells for GSE108716.
Defining cell type labels in benchmark datasets
Most benchmark datasets used in our analyses were selected because the cell types were already defined in the original study by either known experimental condition or via cell surface markers. However, for the PBMC dataset, Stoeckius et al. [48] collected single-cell antibody-derived tag (ADT) data as well as scRNA-seq using CITE-seq [48]. ADTs can be viewed as a digital readout of cell surface protein abundance. We defined the cell types within this dataset by performing Louvain clustering on the Jaccard similarity matrix constructed based on the normalized ADT levels, similar to Stoeckius et al. [48]. Louvain clustering was performed using the “cluster_louvain” function implemented in the igraph R package. Clustering identified 10 cell types automatically. Note that the quality control standard for this dataset is different compared to the other scRNA-seq datasets used in our analysis, as cells were required to pass both scRNA-seq-specific filters (minimum of 800 reads) and ADT-specific filters (minimum of 50 ADT counts).
Normalization of scRNA-seq data
For each method, we also normalized cells to control for differences in library size. For PCA, we normalize the counts by setting \( {\overset{\sim }{O}}_{ij}=\log \left(\frac{O_{ij}}{c_i}+1\right) \), where \( {\overset{\sim }{O}}_{ij} \) is the normalized gene count for cell i and gene j, Oij is the original gene count for cell i and gene j, and \( {c}_i={\sum}_j{O}_{ij} \) is library size for cell i. ZINB-WaVE directly accounts for library size via their cell-specific intercept. For scImpute, we used the total number of imputed counts per cell as their corresponding library size and normalized in the same way as PCA. For scBFA, we estimated the feature-specific intercepts and cell-specific intercepts to implicitly model the effect of library size. SAVER uses the library size divided by the median library size across all cells to adjust for cell size. sctransform uses the log of the library size in its model. scrna2019 outputs a transformed deviance score matrix that does not depend on library size as input.
Normalization of scATAC-seq data
For PCA, we performed a log transformation \( {\overset{\sim }{O}}_{ij}=\log \left({O}_{ij}+1\right) \) to adjust the counts within scATAC-seq, where Oij is the original read count for cell i and locus j. For scBFA, Scasat, Destin, scABC, Binary PCA, chromVAR, and SCRAT, no extra normalization was applied.
Gene selection in scRNA-seq data
Highly variable genes (HVG) selection was performed to identify the most overdispersed genes, that is, genes that exhibit more variance than expected based on their mean. The HVG selection was performed using the FindVariableFeatures command implemented in Seurat 3.0. By default, Seurat selected the top 2000 genes. Highly expressed gene (HEG) selection was performed to identify the genes that exhibit the highest variance across cells, regardless of their mean, and is therefore expected to capture genes with higher mean expression compared to HVGs. To identify HEGs, we calculated the gene-specific variance in the gene count space and select the top 2000 genes to make the set size comparable to HVGs.
The gene detection rate (the average fraction of cells in which a gene is detected as expressed) and gene-wise dispersion of each dataset calculated in Fig. 3b is based on these 2000 most variant genes under both the HVG and HEG selection criteria. For the timing experiment, we only selected the top 1000 genes under the HVG criterion for computational speed.
Batch effect correction
For both scRNA-seq and scATAC-seq datasets, we performed two types of batch effect correction, depending on how the cell types are distributed across the batches in the dataset. For datasets where all cell types are represented in all batches (e.g., replicates, patients), such as the HSC dataset, we used those cell-level covariates to define the N × C design matrix X (see the scBFA model details above). For ZINB-WaVE, scBFA, and scVI, we regressed X out within the model structure. Since PCA does not offer a framework to regress out nuisance factors, we first regressed X directly from the normalized counts \( {\overset{\sim }{O}}_{ij} \) using a linear model. We then applied PCA on the residual matrix and obtained the corresponding embeddings and factor loading matrix. For Binary PCA, scImpute, SAVER, sctransform, and scrna2019, we also regressed out X from the binary entries and imputed values separately, then used the residual matrix in the same way as for PCA.
For other datasets (MEM-T, Pancreatic, MGE in scRNA-seq, and GSE96769 and GSE74310 in scATAC-seq), some batches were missing a subset of cell types, resulting in a design matrix X that cannot be directly used to estimate all batch effects. In this scenario, our strategy for modifying the dataset to address batch effects is as follows. Note that we use the same parametrization used to define the scBFA model earlier, except that we define a new observation matrix M as a N × G matrix, where observations can either correspond to measured expressed levels O, inferred binary detection pattern B, or imputed read counts. Except for minor differences in parameterization, the GLM-based dimensionality reduction methods scBFA and ZINB-WaVE can be summarized in the following framework, where g is the link function, P is a probability measure, and μ is the expectation over the probability measure. In the case of ZINB-WaVE, P is a zero-inflated negative binomial distribution. In the case of scBFA, P is a Bernoulli distribution.
$$ g\left({\mu}_{ij}\right)=\left({\boldsymbol{x}}_{\boldsymbol{i}}^{\boldsymbol{T}}{\boldsymbol{\beta}}_{\boldsymbol{j}}+{\boldsymbol{z}}_{\boldsymbol{i}}^{\boldsymbol{T}}{\boldsymbol{a}}_{\boldsymbol{j}}+{u}_i+{v}_j\right) $$
$$ {M}_{ij}\sim P\left({\mu}_{ij}\right) $$
We first identify the largest subset of cell types that are represented in all batches within a given dataset. Define Msub as the submatrix of N′ observations (N′ < N) corresponding to this subset of cell types and similarly define the submatrices Xsub, Zsub, Asub and usub, where i′ = 1, …, N′. We ran each dimensionality reduction method once to obtain an estimate of \( \hat{\boldsymbol{\beta}} \) by optimizing the likelihood of the following model:
$$ {\boldsymbol{M}}_{\mathbf{sub}\left({i}^{\prime },j\right)}\sim P\left({g}^{-1}\left({\boldsymbol{x}}_{\mathbf{sub}\left({\boldsymbol{i}}^{\prime}\right)}^{\boldsymbol{T}}{\boldsymbol{\beta}}_{\boldsymbol{j}}+{\boldsymbol{z}}_{\mathbf{sub}\left({\boldsymbol{i}}^{\prime}\right)}^{\boldsymbol{T}}{\boldsymbol{a}}_{\boldsymbol{j}}+{u}_{\mathrm{sub}\left({i}^{\prime}\right)}+{v}_j\right)\right) $$
\( \hat{\boldsymbol{\beta}} \) learns the variance induced by different batches only. Then, we use \( \hat{\boldsymbol{\beta}} \) as a plug-in estimate of β, and performed each dimensionality method on the full dataset to obtain estimates of all other parameters. Note since both ZINB-WaVE and scBFA regularize their coefficient matrix β, Xsub and X are both standardized. For PCA, scImpute, SAVER, sctransform, scrna2019, SCRAT, and chromVAR, we used a similar strategy to obtain \( \hat{\boldsymbol{\beta}} \) by using linear regression to regress out Xsub from the observation matrix corresponding to the largest subset of cell types represented in all batches. Then, we calculated the residuals Rij = Mij − XβT on the full dataset with \( \boldsymbol{\beta} =\hat{\boldsymbol{\beta}} \) fixed and performed PCA on the residual matrix R. For scVI, we were unable to modify the model framework to adjust for batch effects when they were confounded with cell types, as was the case in MEM-T, Pancreatic, and MGE. Therefore, we measured the scVI performance when we did not correct for batch effect, as well as when we performed naïve batch effect correction ignoring the confounding, and then reported the best performance for scVI.
Scasat handles batch effects through the removal of batch-specific loci. However, for datasets GSE96769 and GSE74310, the batch effect is confounded with cell types. Therefore, we ran Scasat without batch effect correction because batch-specific loci would be indistinguishable from cell type-specific loci. Destin and scABC cannot adjust batch effect on their own, and so we ran them without batch effect correction on these two datasets.
Identification of marker genes
We evaluated the extent to which the inferred dimensions for each method recover known marker genes (Fig. 5). For each method, we first obtained the K × G factor loading matrix indicating which genes are contributing to each of the K factors. Then, for every loading matrix and given number of factors, we ranked the absolute value of each gene in each factor and calculated the area under the receiver-operator curve (AUROC) to measure the extent to which the known marker genes contribute more to a factor than expected by chance.
Note that ZINB-WaVE has two loading matrices corresponding to the gene detection and gene count components, respectively, and therefore appears twice in Fig. 5. In ZINB-WaVE, πij models whether a gene has been detected or not, and μij models the mean for the read counts under negative binomial distribution. As in the previous section, we used the parameters \( \left\{\hat{\boldsymbol{Z}},{\hat{\boldsymbol{A}}}_{\boldsymbol{\mu}},{\hat{\boldsymbol{A}}}_{\boldsymbol{\pi}},{\hat{\boldsymbol{u}}}_{\boldsymbol{\mu}},{\hat{\boldsymbol{u}}}_{\boldsymbol{\pi}},{\hat{\boldsymbol{v}}}_{\boldsymbol{\mu}},{\hat{\boldsymbol{v}}}_{\boldsymbol{\pi}}\right\} \) to replace the parameters \( \left\{\hat{\boldsymbol{W}},{\hat{\boldsymbol{\alpha}}}_{\boldsymbol{\mu}},{\hat{\boldsymbol{\alpha}}}_{\boldsymbol{\pi}},{\hat{\boldsymbol{\gamma}}}_{\boldsymbol{\pi}},{\hat{\boldsymbol{\gamma}}}_{\boldsymbol{\mu}},{\hat{\boldsymbol{\beta}}}_{\boldsymbol{\mu}},{\hat{\boldsymbol{\beta}}}_{\boldsymbol{\pi}}\right\} \) used y:
$$ \kern0.5em \mathrm{logit}\left({\pi}_{ij}\right)=\left({\boldsymbol{z}}_{\boldsymbol{\pi} \left(\boldsymbol{i}\right)}^{\boldsymbol{T}}{\boldsymbol{a}}_{\boldsymbol{\pi} \left(\boldsymbol{j}\right)}+{\boldsymbol{x}}_{\boldsymbol{i}}^{\boldsymbol{T}}{\boldsymbol{\beta}}_{\boldsymbol{\pi} \left(\boldsymbol{j}\right)}+{u}_{\pi (i)}+{v}_{\pi (j)}\right) $$
$$ \kern0.5em \log \left({\mu}_{ij}\right)=\left({\boldsymbol{z}}_{\boldsymbol{\mu} \left(\boldsymbol{i}\right)}^{\boldsymbol{T}}{\boldsymbol{a}}_{\boldsymbol{\mu} \left(\boldsymbol{j}\right)}+{\boldsymbol{x}}_{\boldsymbol{i}}^{\boldsymbol{T}}{\boldsymbol{\beta}}_{\boldsymbol{\mu} \left(\boldsymbol{j}\right)}+{u}_{\mu (i)}+{v}_{\mu (j)}\right) $$
The loading matrix aπ that models the gene detection component (π) is denoted as ZINB-WaVEdropout, and the loading matrix aμ that models gene counts is denoted ZINB-WaVEmean.
Trajectory inference
To evaluate the performance gains of scBFA in the context of trajectory inference, we used a recently published platform, dynverse, for which “gold standard” scRNA-seq datasets were already obtained and preprocessed, and scripts and performance metrics have already been defined to evaluate trajectory inference [32]. Gold standard datasets refer to those datasets in which experimental (non-computational) methods were used to annotate a dataset with trajectory information such as cell type clusters (milestones) and connections between cell type clusters (milestone networks). From the 27 datasets available on June 12, 2019, that met this gold standard status and were real (not synthetic), we filtered out datasets that had less than 170 cells, yielding a total of 20 benchmark datasets. As with our previous experiments, we used the HVG selection criterion to identify the top 2000 varying genes for dimensionality reduction.
Our strategy for benchmarking trajectory inference was to identify an existing, top-performing trajectory inference method that also uses dimensionality reduction in its pipeline then replace that dimensionality reduction step with one of the methods we tested in our study. The dynverse paper identified Slingshot as a top performer [32]. To evaluate scBFA and the other methods, we substituted the PCA step of Slingshot with each dimensionality reduction method (scBFA, ZINB-WaVE, PCA, scImpute, SAVER, scrna2019, sctransform, scVI) and used dynverse to measure the performance of each modified version of Slingshot. The number of input latent dimensions was set to 10. Because the Slingshot implementation throws NA in cases where it is uncertain of the assignment of cells to a particular lineage, we removed two datasets from further evaluation because the number of NAs produced prevented calculation of the performance metrics (germline-human-both_guo.rds, mESC-differentiation_hayashi.rds). For each of the 18 benchmarks (Additional file 1: Table S8), we used dynverse to compute three performance metrics with respect to the experimentally gathered trajectory information: F1milestones, F1branches, and NMSElm. F1milestones measures the similarity between clustering membership of two trajectories. F1branches compares the similarity between the assignment of two branches. NMSElm is a measurement of how well the position of a cell in the inferred trajectory predicts the position of the cell in the ground truth trajectory under linear regression. Larger values of F1milestones, F1branches, and NMSElm correspond to better performance. We obtained F1milestones, F1branches, and NMSElm via dynverse’s calculate_mapping and calculate_position_predict functions within the dyneval package and converted raw values to ranks for Fig. 6. The wrapper function to obtain the results from Slingshot is adapted from the internal function https://github.com/dynverse/ti_slingshot/blob/master/package/R/ti_slingshot.R.
Visualization
After we obtain the embedding matrix from every method, we use the t-distributed stochastic embedding [49] method to project the embedding matrix onto two dimensions for visualization as a scatterplot. In all visualizations, the number of factors used as input to t-SNE in each visualization is 10.
Timing experiments
In the timing experiment (Additional file 1: Figure S27), we randomly subsampled 1k, 10k, 50k, and 100k cells from the 1.3 million 10× brain cell dataset from E18 mice and recorded the single-core execution time (in seconds) of all methods (PCA, ZINB-WaVE, scImpute, SAVER, sctransform, scrna2019, and Binary PCA) on the same machine. Due to the non-convex nature of ZINB-WaVE’s objective function and different optimization scheme, we cannot strictly match the convergence criterion of ZINB-WaVE to scBFA. Therefore, we use the same number of iterations for each method that was used to generate the results in Fig. 1. Because scImpute requires specification of the number of cell clusters, we set the number of cell clusters to seven, similar to a previous study [50] that used seven as an underestimate of the true number of cell types.