Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove this technical variability, and nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use statistical count models for these tasks: Hafemeister and Satija (Genome Biol 20:296, 2019) recommend using Pearson residuals from negative binomial regression, while Townes et al. (Genome Biol 20:295, 2019) recommend fitting a generalized PCA model. Here, we investigate the connection between these approaches theoretically and empirically, and compare their effects on downstream processing.

Results

We show that the model of Hafemeister and Satija produces noisy parameter estimates because it is overspecified, which is why the original paper employs post hoc smoothing. When specified more parsimoniously, it has a simple analytic solution equivalent to the rank-one Poisson GLM-PCA of Townes et al. Further, our analysis indicates that per-gene overdispersion estimates in Hafemeister and Satija are biased, and that the data are in fact consistent with the overdispersion parameter being independent of gene expression. We then use negative control data without biological variability to estimate the technical overdispersion of UMI counts, and find that across several different experimental protocols, the data are close to Poisson and suggest very moderate overdispersion. Finally, we perform a benchmark to compare the performance of Pearson residuals, variance-stabilizing transformations, and GLM-PCA on scRNA-seq datasets with known ground truth.

Conclusions

We demonstrate that analytic Pearson residuals strongly outperform other methods for identifying biologically variable genes, and capture more of the biologically meaningful variation when used for dimensionality reduction.

Introduction

The standard preprocessing pipeline for single-cell RNA-seq data includes sequencing depth normalization followed by log-transformation [1, 2]. The normalization aims to remove technical variability associated with cell-to-cell differences in sequencing depth, whereas the log-transformation is supposed to make the variance of gene counts approximately independent of the mean expression. Two recent papers argue that neither step works very well in practice [3, 4]. Instead, both papers suggest to model UMI (unique molecular identifier) data with count models, explicitly accounting for the cell-to-cell variation in sequencing depth (defined here as the total UMI count per cell). Hafemeister and Satija [3] use a negative binomial (NB) regression model (scTransform package in R), while Townes et al. [4] propose Poisson generalized principal component analysis (GLM-PCA). These two models are seemingly very different.

Here, we show that the model used by Hafemeister and Satija [3] has a too flexible parametrization, resulting in noisy parameter estimates. As a consequence, the original paper employs post hoc smoothing to correct for that. We show that a more parsimonious model produces stable estimates even without smoothing and is equivalent to a special case of GLM-PCA. We then demonstrate that the estimates of gene-specific overdispersion in the original paper are strongly biased and further argue that UMI data do not require gene-specific overdispersion parameters to account for technical noise. Rather, the technical variability is consistent with the same overdispersion parameter shared between all genes. We use available negative control datasets to estimate this technical overdispersion. Furthermore, we compare Pearson residuals, GLM-PCA, and variance-stabilizing transformations for highly variable gene selection and as data transformation for downstream processing.

A common modeling assumption for UMI or read count data without biological variability is that each gene g takes up a certain fraction p_{g} of the total amount n_{c} of counts in cell c [4, 6–10]. The observed UMI counts X_{cg} are then modeled as Poisson or negative binomial (NB) [11] samples with expected value μ_{cg}=p_{g}n_{c} without zero-inflation [10, 12]:

The Poisson model has a maximum likelihood solution (see “Methods”) that can be written in closed form as \(\hat {n}_{c} = \sum _{g} X_{{cg}}\) (sequencing depths), \(\hat {p}_{g} = \sum _{c} X_{{cg}} / \sum _{c} \hat {n}_{c}\), or, put together,

where \(\mu _{{cg}}+\mu _{{cg}}^{2}/\theta \) is the NB variance and θ→∞ gives the Poisson limit. The variance of Pearson residuals is, up to a constant, equal to the Pearson χ^{2} goodness-of-fit statistic [13] and quantifies how much each gene deviates from this constant-expression model. As pointed out by Aedin Culhane [14], singular value decomposition of the Pearson residuals under the Poisson model is known as correspondence analysis [15–18], a method with a longstanding history [19].

Hafemeister and Satija [3] suggested using Pearson residuals from a related NB regression model for highly variable gene (HVG) selection and also as a data transformation for downstream processing. In parallel, Townes et al. [4] suggested using deviance residuals (see “Methods”) from the same Poisson model as above for HVG selection and also for PCA as an approximation to their GLM-PCA. In the next sections, we discuss the relationships between these approaches.

The regression model in scTransform is overspecified

Hafemeister and Satija [3] used the 33k PBMC (peripheral blood mononuclear cells, an immune cell class that features several distinct subpopulations) dataset from 10X Genomics in their work on normalization of UMI datasets. For each gene g in this dataset, the authors fit an independent NB regression

Here, θ_{g} is the gene-specific overdispersion parameter, \(\hat {n}_{c}\) are observed sequencing depths as defined above, and β_{0g} and β_{1g} are the gene-specific intercept and slope. The natural logarithm follows from the logarithmic link function that is used in NB regression by default. The original paper estimates β_{0g} and β_{1g} using Poisson regression and then uses the obtained estimates to find the maximum likelihood estimate of θ_{g}. The resulting estimates for each gene are shown in Fig. 1a–c, reproducing Figure 2A from Hafemeister and Satija [3].

The authors observed that the estimates \(\hat {\beta }_{0g}\) and \(\hat {\beta }_{1g}\) were unstable and showed high variance for genes with low average expression (Fig. 1a–b). They addressed this with a “regularization” procedure that re-set all estimates to the local kernel average estimate for a given expression level. This is similar to some approaches to bulk RNA-seq analysis [6, 7] but with post hoc correction instead of Bayesian shrinkage. This kernel smoothing resulted in an approximately linear increase of the intercept with the logarithm of the average gene expression (Fig. 1a) and an approximately constant slope value of \(\hat {\beta }_{1g}\approx 2.3\) (Fig. 1b). The nature of these dependencies was left unexplained. Moreover, we found that \(\hat {\beta }_{0g}\) and \(\hat {\beta }_{1g}\) were strongly correlated (ρ=−0.91), especially for weakly expressed genes (Fig. 1d). Together, these clear symptoms of overfitting suggest that the regression model was overspecified.

Indeed, the theory calls for a less flexible model.As explained above, a common modeling assumption (Eq. 2) is that μ_{cg}=p_{g}n_{c}, or equivalently

We see that under this assumption, the slope β_{1g} does not need to be fit at all and should be fixed to 1, if ln(n_{c}) is used as predictor. Not only does this suggest an alternative, simpler parametrization of the model, but it also explains why Hafemeister and Satija [3] found that \(\hat {\beta }_{1g} \approx 2.3\): they used log10(n_{c})= ln(n_{c})/ ln(10) instead of ln(n_{c}) as predictor, and so obtained ln(10)≈2.3 as the average slope.

Under the assumption of Eq. 7, a Poisson or NB regression model should be specified using ln(n_{c}) as predictor with a fixed slope of 1, a so-called offset (Eqs. 5 and 7). This way, the resulting model has only one free parameter and is not overspecified. Moreover, the Poisson offset model is equivalent to Eqs. 1 and 2 and so, as explained above, has an analytic solution

which forms a straight line when plotted against the log-transformed average gene expression \(\frac {1}{n}\sum _{c} X_{{cg}}\) (Fig. 1e). This provides an explanation for the linear trend in \(\hat {\beta }_{0g}\) in the original two-parameter model (Fig. 1a).

In practice, our one-parameter offset model and the original two-parameter model after smoothing arrive at qualitatively similar results (Fig. 1h). However, we argue that the one-parameter model is more appealing from a theoretical perspective, has an analytic solution, and does not require post hoc averaging of the coefficients across genes.

The offset regression model is equivalent to the rank-one GLM-PCA

The offset regression model turns out to be a special case of GLM-PCA [4]. There, the UMI counts are modeled as

assuming k+1 latent factors, with U and V playing the role of principal components and corresponding eigenvectors in standard PCA. Importantly, the first latent factor is constrained to U_{c0}=1 for all cells c, such that V_{0g} can be interpreted as gene-specific intercepts. If the data are modeled without any further latent factors, Eq. 10 reduces to

$$ \ln(\mu_{{cg}}) = V_{0g} + \ln(n_{c}), $$

(11)

which is identical to Eq. 7 with V_{0g}=β_{0g}. This shows that the proposed one-parameter offset regression model is exactly equivalent to the intercept-only rank-one GLM-PCA.

Overdispersion estimates in scTransform are biased

After discussing the overparametrization of the systematic component of the scTransform model, we now turn to the NB noise model employed by Hafemeister and Satija [3]. The \(\hat {\theta }_{g}\) estimates in the original paper are monotonically increasing with the average gene expression, both before and after kernel smoothing (Fig. 1c). This suggests that there is a biologically meaningful relationship between the expression strength and the overdispersion parameter θ_{g}. However, this conclusion is in fact unsupported by the data.

To demonstrate this, we simulated a dataset with NB-distributed counts \(\widetilde X_{{cg}} \sim \text {NB}(\mu _{{cg}}, \theta =10)\) with μ_{cg} given by Eq. 3 using X_{cg} of the PBMC dataset. Applying the original estimation procedure to this simulated dataset showed the same positive correlation of \(\hat {\theta }_{g}\) with the average expression as in real data (Fig. 1f), strongly suggesting that it does not represent an underlying technical or biological cause, but only the estimation bias. Low-expressed genes had a larger bias and only for genes with the highest average expression was the true θ=10 estimated correctly.

Moreover, the \(\hat {\theta }_{g}\) estimates strongly depended on the exact details of the estimation procedure. Using the theta.ml() R function with its default 10 iterations, as Hafemeister and Satija [3] did, led to multiple convergence warnings for the simulated data in Fig. 1f. Increasing this maximum number of iterations to 100 eliminated most convergence warnings, but caused 49.9% of the estimates to diverge to infinity or above 10^{10} (Fig. 1g). These instabilities are likely due to shallow maxima in the NB likelihood w.r.t. θ [20].

The above arguments show that the overdispersion parameter estimates in Hafemeister and Satija [3] for genes with low expression were strongly biased. In practice, however, the predicted variance μ+μ^{2}/θ is only weakly affected by the exact value of θ for low expression means μ, and so the bias reported here does not substantially affect the Pearson residuals (see below). Also, many of the weakly expressed genes may be filtered out during preprocessing in actual applications. We note that large errors in NB overdispersion parameter estimates have been extensively described in other fields, with simulation studies showing that estimation bias occurs especially for low NB means, small sample sizes, and large true values of θ [21–23], i.e., for samples that are close to the Poisson distribution. Note also that post hoc smoothing [3] can reduce the variance of the \(\hat {\theta }_{g}\) estimates, but does not reduce the bias.

Negative control datasets suggest low overdispersion

To avoid noisy and biased estimates, we suggest to use one common θ value shared between all genes. Of course, any given dataset would be better fit using gene-specific values θ_{g}. However, our goal is not the best possible fit: We want the model to account only for technical variability, but not biological variability, e.g., between cell types; this kind of variability should manifest itself as high residual variance.

Rather than estimating the θ value from a biologically heterogeneous dataset such as PBMC, we think it is more appropriate to estimate the technical overdispersion using negative control datasets, collected without any biological variability [12]. We analyzed several such datasets spanning different droplet- and plate-based sequencing protocols (10x Genomics, inDrop, MicrowellSeq) and compared the \(\hat {\theta }_{g}\) estimates to the estimates obtained using simulated NB data with various known values of θ∈{10,100,1000,∞}. For the simulations, we used the empirically observed sample sizes and sequencing depths. We found that across different protocols, negative control data were consistent with overdispersion θ≈100 or larger (Additional file 1: Figure S1). The plateau at θ≈10 in the PBMC data visible in Fig. 1c could reflect biological and not technical variability. At the same time, negative control data did not exactly conform to the Poisson model (θ=∞), but likely overdispersion parameter values (θ≈100) were large enough to make the Poisson model acceptable in practice [10, 24, 25]. A parallel work reached the same conclusion [26].

Both Hafemeister and Satija [3] and Townes et al. [4] suggested to use Pearson/deviance residuals based on models that only account for technical variability, in order to identify biologically variable genes. Indeed, genes showing biological variability should have higher variance than predicted by such a model. As explained above, Pearson residuals in the model given by Eqs. 1 and 2 (or, equivalently, offset regression model or rank-one GLM-PCA) can be conveniently written in closed form:

For most genes in the PBMC data, the variance of the Pearson residuals was close to 1, indicating that this model predicted the variance of the data correctly and suggesting that most genes did not show biological variability (Fig. 1h). Using θ=100 led to several high-expression genes selected as biologically variable that would not be selected with a lower θ (e.g., Malat1), but overall, using θ=10,θ=100, or even the Poisson model with θ=∞ led to only minor differences (Additional file 1: Figure S2a–c). Using analytic Pearson residuals for HVG selection yielded a very similar result compared to using Pearson residuals from the smoothed regression presented in Hafemeister and Satija [3], with almost the same set of genes identified as biologically variable (Fig. 1h, Additional file 1: Figure S2e). This suggests that our model is sufficient to identify biologically relevant genes.

It is instructive to compare the variance of Pearson residuals to the variance that one gets after explicit sequencing depth normalization followed by a variance-stabilizing transformation. For Poisson data, the square root transformation \(\sqrt {x}\) is approximately variance-stabilizing, and several modifications exist in the literature [27], such as the Anscombe transformation \(2\sqrt {x+3/8}\) [28] and the Freeman-Tukey transformation \(\sqrt {x}+\sqrt {x+1}\) [29]. Normalizing UMI counts X_{cg} by sequencing depths n_{c} (and multiplying the result by the median sequencing depth 〈n_{c}〉 across all cells; “median normalization”) followed by one of the square-root transformations has been advocated for UMI data processing [30, 31].

Comparing the gene variances after the square-root transformation (Fig. 2a) with those of Pearson residuals (Fig. 2b) in the PBMC dataset showed that the square-root transformation is not sufficient for variance stabilization. Particularly affected are low-expression genes that have variance close to zero after the square-root transform [32]. For example, platelet markers genes such as Tubb1 have low average expression (because platelets are a rare population in the PBMC dataset) and do not show high variance after any kind of square-root transform (another example was given by the B-cell marker Cd79a). At the same time, Pearson residuals correctly indicate that these genes have high variance and are biologically meaningful (Fig. 2c). For the genes with higher average expression, some differentially expressed genes like the monocyte marker Lyz or the abovementioned Malat1 showed high variance in both approaches. However, the selection based on the square-root transform also included high-expression genes like Fos, which showed noisy and biologically unspecific expression patterns (Fig. 2c). Similar patterns were observed in the full-retina dataset [33] (Additional file 1: Figure S3).

The gene with the highest average expression in the PBMC dataset, Malat1, showed clear signs of biologically meaningful variability, e.g., it is not expressed in platelets (Fig. 2c). While this gene is selected as biologically variable based on Pearson residuals with θ≈100 as we propose (Fig. 2b), it was not selected by Hafemeister and Satija [3] who effectively used θ≈10 (Fig. 1c, h, Additional file 1: Figure S2). This again suggests that θ≈100 is more appropriate than θ≈10 to model technical variability of UMI counts.

Pearson residuals may even be “too sensitive” in that genes that are only expressed in a handful of cells may get very large residual variance. Hafemeister and Satija [3] suggested clipping residuals to \([-\sqrt {n}, \sqrt {n}]\). We found that this step avoids large residual variance in very weakly expressed genes (Additional file 1: Figure S2d, see “Methods” for more details). The variance of unclipped Pearson residuals under the Poisson model (θ=∞) was very similar to the Fano factor of counts after median normalization (Additional file 1: Figure S2f) and less useful for HVG selection compared to the clipped residuals.

Lastly, gene selection by the widely-used log(1+x)-transform as well as by the variance of deviance residuals as suggested by Townes et al. [4] led to very similar results as described above for the square-root transform: many biologically meaningful genes were not selected, as all three methods overly favored high-expression genes (Additional file 1: Figure S2g–i). In conclusion, neither of these transformations is sufficiently variance-stabilizing. In practice, many existing HVG selection methods take the mean-variance relationship into account when performing the selection (e.g., seurat and seurat_v3 methods [34, 35] as implemented in Scanpy [5]). We benchmarked their performance in the next section.

Analytic Pearson residuals separate cell types better than other methods

Next, we studied the effect of different normalization approaches on PCA representations and t-SNE embeddings. The first approach is median normalization, followed by the square-root transform [30, 31]. We used 50 principal components of the resulting data matrix to construct a t-SNE embedding. The second approach is computing Pearson residuals according to Eq. 12 with θ=100, followed by PCA reduction to 50 components. The third approach is computing 50 components of negative binomial GLM-PCA with θ=100 [4]. We used the same initialization to construct all t-SNE embeddings to ease the visual comparison [36].

We applied these methods to the full PBMC dataset (Additional file 1: Figure S4), three retinal datasets [33, 37, 38] (Fig. 3), and a large organogenesis dataset with n=2 million cells [39] (Fig. 4). For smaller datasets, the resulting embeddings were mostly similar, suggesting comparable performance between methods. Hafemeister and Satija [3] argued that using Pearson residuals reduces the amount of variance in the embedding explained by the sequencing depth variation, compared to sequencing depth normalization and log-transformation. We argue that this effect was mostly due to the large factor that the authors used for re-scaling the counts after normalization (Additional file 1: Figure S5): large scale factors and/or small pseudocounts (ε in log(x+ε)) are known to introduce spurious variation into the distribution of normalized counts [4, 41]. For the PBMC dataset, all three t-SNE embeddings showed similar amount of sequencing depth variation across the embedding space (Additional file 1: Figure S4g–i). Performing the embeddings on 1,000 genes with the largest Pearson residual variance did not noticeably affect the embedding quality (Additional file 1: Figure S4d–f).

However, on closer inspection, embeddings based on Pearson residuals consistently outperformed the other two. For example, while the Pearson residual embeddings clearly separated fine cell types in the full-retina dataset [33], the square-root embedding mixed some of them (we observed the same when using the log-transform). For the same dataset, GLM-PCA embedding did not fully separate some of the biologically distinct cell types. Furthermore, GLM-PCA embeddings often featured Gaussian-shaped blobs with no internal structure (Fig. 3), suggesting that some fine manifold structure was lost, possibly due to convergence difficulties.

Embedding the organogenesis dataset [39] using Pearson residuals uncovered a strong and surprising batch artifact: hitherto unnoticed, several genes were highly expressed exclusively in small subsets of cells, with each subset coming from a single embryo. These subsets appeared as isolated islands in the t-SNE embedding (Fig. 4), allowing us to uncover and remove this batch effect (Additional file 1: Figure S6), leading to the final, biologically interpretable embedding (Fig. 4). In contrast, embeddings based on log-transform or GLM-PCA did not show this batch artifact at all. GLM-PCA took days to converge (Table 1) and could recover only the coarse structure of the data. Interestingly, the final embedding based on Pearson residuals was broadly similar to the embedding obtained after log-transform and standardization of each gene, as expected given that Pearson residuals stabilize the variance by construction (Fig. 4). Together, these qualitative observations suggest that analytic Pearson residuals can represent small, distinct subpopulations in large datasets better than other methods.

To quantify the performance of dimensionality reduction methods, we performed a systematic benchmark using the Zhengmix8eq dataset with known ground truth labels [42] (Fig. 5). This dataset consists of PBMC cells FACS-sorted into eight different cell types with eight types occurring in roughly equal proportions. To make the setup more challenging, we added 10 pseudo-genes expressed only in a group of 50 cells, effectively creating a ninth, rare, cell type (see “Methods”). We used six methods to select 2000 HVGs (and additionally omitted HVG selection) and ten methods for data transformation and dimensionality reduction to 50 dimensions. We assessed the resulting (6+1)·10=70 pipelines using kNN classification of cell types. We used the macro F1 score (harmonic mean between precision and recall, averaged across classes) because this metric fairly averages classifier performance across classes of unequal size. Together, the F1 score of the kNN classifier quantifies how well each pipeline separated cell types in the 50-dimensional representation (Fig. 5c). We did not include approaches that use depth normalization with inferred size factors [44] in this comparison.

The pipeline that used analytic Pearson residuals for both gene selection and data transformation outperformed all other pipelines with respect to cell type classification performance. In contrast, popular methods for HVG selection (e.g., seurat_v3 as implemented in Scanpy [5, 35]) combined with log or square-root transformations after depth normalization performed worse and in particular were often unable to separate the rare cell type (Fig. 5a,b; see Additional file 1: Figure S7 for additional embeddings). The performance of GLM-PCA was also poor, likely due to convergence issues (with 15-dimensional, and not 50-dimensional, output spaces, GLM-PCA performed on par with Pearson residuals; data not shown), in agreement with what we reported above for the retinal datasets. Finally, deviance residuals [4] were clearly outperformed by Pearson residuals both as gene selection criterion and as data transformation. This is due to the reduced sensitivity of deviance residuals to low- or medium-expression genes (Additional file 1: Figure S2i). Note that in terms of the overall classification accuracy no pipeline outperformed Pearson residuals but many pipelines performed similarly well; this is because overall accuracy is not sensitive to the rare cell type, unlike the macro F1 score.

For this dataset, not using gene selection at all performed similarly well to HVG selection using Pearson residuals (Fig. 5c), but in general HVG selection is a recommended step in scRNA-seq data analysis [1, 2] and here Pearson residuals performed the best. Also, log-transformed counts that were standardized performed similarly well to Pearson residuals (Fig. 5c), in agreement with the above observations on the organogenesis dataset (Fig. 4b). Nevertheless, the same organogenesis example showed that Pearson residuals can be more sensitive (Fig. 4b, d).

Analytic Pearson residuals are fast to compute

The studied normalization pipelines differ in both space and time complexity. UMI count data are typically very sparse (e.g., in the PBMC dataset, 95% of entries are zero) and can be efficiently stored as a sparse matrix object. Sequencing depth normalization and square-root or log-transformation do not affect the zeros, preserving the sparsity of the matrix, and PCA can be run directly on a sparse matrix. In contrast, Pearson residuals form a dense matrix without any zeros and so can take a large amount of memory to store (4.5 Gb for the PBMC dataset). For large datasets this can become prohibitive (but note that a smart implementation may be able to avoid storing a dense matrix in memory [45]). In contrast, GLM-PCA can be run directly on a sparse matrix but takes a long time to converge (Table 1), becoming prohibitively slow for bigger datasets.

Computational complexity can be greatly reduced if gene selection is performed in advance. After selecting 1000 genes, Pearson residuals do not require a lot of memory (0.3 Gb for the PBMC dataset) and so can be conveniently used. Note that the Pearson residual variance can be computed per gene, without storing the entire residual matrix in memory. GLM-PCA, however, remained slow even after gene selection (4 h vs. 4 s for Pearson residuals for the PBMC dataset; 2 days vs. 4 minutes for the organogenesis dataset; Table 1).

Discussion

We reviewed and contrasted different methods for normalization of UMI count data. We showed that without post hoc smoothing, the negative binomial regression model of Hafemeister and Satija [3] exhibits high variance in its parameter estimates because it is overspecified, which is why it had to be smoothed in the first place. We argued that instead of smoothing an overspecified model, one should resort to a more parsimonious and theoretically motivated model specification involving an offset term. This made the model equivalent to the rank-one GLM-PCA of Townes et al. [4] and yielded a simple analytic solution, closely related to correspondence analysis [17]. Further, we showed that the estimates of per-gene overdispersion parameter θ_{g} in the original paper exhibit substantial and systematic bias. We used negative control datasets from different experimental protocols to show that UMI counts have low overdispersion and technical variation is well described by θ≈100 shared across genes.

We found that the approach developed by Hafemeister and Satija [3] and implemented in the R package scTransform in practice yields Pearson residuals that are often similar to our analytic Pearson residuals with fixed overdispersion parameter (Additional file 1: Figure S2e). We argue that our model with its analytic solution is attractive for reasons of parsimony, theoretical simplicity, and computational speed. Moreover, it provides an explanation for the linear trends in the smoothed estimates in the original paper. We have integrated Pearson residuals into upcoming Scanpy 1.9 [5].

Following our manuscript, scTransform was updated to scTransform v2 and now uses the offset model formulation [46]. At the same time, the authors argue that the dependence of the overdispersion parameter θ_{g} on the gene expression strength is not entirely explained by the estimation bias. To reduce the bias, scTransform v2 uses glmGamPoi [47] to estimate the offsets β_{0g} and the overdispersion parameters θ_{g} (which are then smoothed). The authors also refer to the bulk RNA-seq literature, where it has been observed that the overdispersion parameter grows monotonically with gene expression [6, 48, 49]. Given the difficulties with estimating overdispersion for low expression means (see above), we believe that this question requires further investigation. However, as argued above, whether θ is assumed to be constant or is allowed to vary between genes, has very little effect on the resulting Pearson residuals.

A parallel publication [50] suggested a Bayesian procedure named Sanity for estimating expression strength underlying the observed UMI counts, based on Poisson likelihood and Bayesian shrinkage. Importantly, Pearson residuals are not aiming at estimating the underlying expression strength; rather, they quantify how strongly each observed UMI count deviates from the null model of constant expression across cells. These two approaches can have opposite effects on markers genes of rare cell types: the Bayesian procedure shrinks their expression towards zero whereas our approach yields large Pearson residuals. We argued here that this emphasis on rare cell types is useful for many downstream tasks, but if the interest lies in true expression, approaches like Sanity may be more appropriate. Future work should perform comprehensive benchmarks on a variety of tasks [51].

On the practical side, we showed that Pearson residuals outperform other methods for selecting biologically variable genes. They are also better than other preprocessing methods for downstream analysis: in a systematic benchmarking effort, we demonstrated that Pearson residuals provide a good basis for general-purpose dimensionality reduction and for constructing 2D embeddings of single-cell UMI data. In particular, they are well suited for identifying rare cell types and their genetic markers. Applying gene selection prior to dimensionality reduction reduces the computational cost of using Pearson residuals down to negligible. We conclude that analytic Pearson residuals provide a theory-based, fast, and convenient method for normalization of UMI datasets.

Methods

Mathematical details

Analytic solution

The log-likelihood for the model defined in Eqs. 1 and 2

where we used the Poisson density p(x)=e^{x}e^{−μ}/x!. Taking partial derivatives with respect to n_{c} and p_{g} and setting them to zero, one obtains

This is a family of solutions. Setting \(\sum _{g} \hat {p}_{g} = 1\), we obtain Eq. 3 and the formulas for \(\hat {n}_{c}\) and \(\hat {p}_{g}\) given in the “Analytic Pearson residuals” section.

This derivation does not generalize to the negative binomial model with density

This does not have an analytic maximum likelihood solution. However, for large θ values Eq. 3 can be taken as an approximate solution.

Deviance residuals

Deviance is defined as the doubled difference between the log-likelihood of the saturated model and the log-likelihood of the actual model. The saturated model, in our case, is a full rank model with \(\hat {\mu }_{{cg}}^{*}=X_{{cg}}\). For the Poisson model, the deviance can therefore be obtained from Eq. 14 and is equal to

where the terms with \(\hat {\mu }_{{cg}} = X_{{cg}}\) are taken to be zero.

Deviance residuals are defined as square roots of the respective deviance terms, such that the sum of squared deviance residuals is equal to the deviance (note that for the Gaussian case this already holds true for the raw residuals, because the saturated model has zero log-likelihood, and the deviance is simply the squared error). It follows that for the Poisson model deviance residuals [4] are given by

It is easy to verify that this formula reduces to the Poisson case when θ→∞. When computing deviance residuals, we estimated \(\hat {\mu }_{{cg}}\) using Eq. 3.

Clipping Pearson residuals

Clipping Pearson residuals to \(\pm \sqrt {n}\) as suggested by Hafemeister and Satija [3] is needed to avoid large residual variance in rarely expressed genes (Additional file 1: Figure S2d). The intuition behind this heuristic is as follows. Consider a UMI dataset with n cells containing a biologically distinct rare population P of size m≪n. Let this population have a marker gene with expression following Poisson(λ) for the cells from P, and zero expression for all n−m remaining cells. For simplicity we assume the Poisson model here, and further assume that all cells have the same sequencing depth.

The expected average expression of this gene is λm/n and so the expected Pearson residual value for this gene for the cells from P is \((\lambda -\lambda m/n)/\sqrt {\lambda m/n} = (n-m)\sqrt {\lambda /(nm)} \approx \sqrt {\lambda n/m}\).

With the clipping threshold \(\sqrt {n}\), clipping will happen whenever λ>m, i.e., when the population P is either very small or has very large UMI counts. For example, a population of 10 cells having a marker gene with the within-population mean expression of 20 UMIs, will result in clipped residuals, as if the within-population mean expression were ∼ 10 UMIs. This may have a large effect on the leading principal components (even PC1) if the data contain a very small number of cells with strong marker gene expression.

Pearson residuals of biologically variable genes

It is instructive to observe the effect Pearson residuals have on genes that have the same variance of log-expression but different expression means. Consider a gene that has expression μ in half of the cells and is upregulated by a factor of two in the other half of the cells. Then its expression mean is 1.5μ, and the Pearson residuals are close to \(\pm 0.5\mu /\sqrt {1.5\mu } \approx 0.4 \sqrt {\mu }\), i.e. the variance of Pearson residuals grows linearly with μ. This makes sense because for higher-expressed genes there is more statistical certainty about over-Poisson variability, but at the same time highlights that Pearson residuals do not aim to estimate the underlying (log-)expression, unlike, e.g., Sanity [50].

Experimental details

Analyzed datasets and preprocessing

Used datasets are listed in Table 2. For the organogenesis dataset and the FACS-sorted PBMC dataset, we applied no further filtering. In all remaining datasets, we excluded genes that were expressed in fewer than 5 cells, following Hafemeister and Satija [3]. The data were downloaded following links in original publications in form of UMI count tables. Direct links to all data sources are given in our Github repository https://github.com/berenslab/umi-normalization.

HVG selection

For gene selection using sqrt(CPMedian), Pearson residuals, and deviance residuals, we applied the respective data transformation and used the variance after transformation as selection criterion. For Seurat and Seurat_v3 methods, we used the respective Scanpy implementations. In brief, these two methods regress out the mean-variance relationship, and return an estimate of the “excess” variance for each gene [34, 35]. For scTransform, we used the corresponding R package [3]. The Fano factor was computed after normalizing by sequencing depth and scaling by median sequencing depth.

Data transformation and dimensionality reduction

We used the following abbreviations to denote data transformations: sqrt(CPMedian) — normalization by sequencing depth, followed by scaling by the median depth across all cells (“counts per median”), followed by the square-root transform; log(CPMedian + 1) — normalization by sequencing depth, followed by scaling by the median depth across all cells, followed by log(x+1) transform; log(CPMedian + 1) + standardization — same as log(CPMedian + 1), but followed by centering each gene at mean zero and scaling it to unit variance; log(CPM + 1) — normalization by sequencing depth, followed by scaling by one million (“counts per million”), followed by log(x+1) transform. Pearson residuals were computed with Eq. 12 and then clipped to \(\pm \sqrt {n}\). Deviance residuals were computed with Eq. 20.

All of these methods were typically followed by dimensionality reduction by PCA to 50 dimensions using the Scanpy implementation [5], unless otherwise stated.

Further, we used three variants of GLM-PCA to transform raw counts and reduce dimensionality down to 50 in a joint step: Poisson GLM-PCA, negative binomial GLM-PCA with estimation of single overdispersion parameter θ shared across genes, and negative binomial GLM-PCA with fixed shared θ. In Townes et al. [4], the authors only used the former two methods. Whenever possible, we used the glmpca-py implementation with default settings. When we reduced the PBMC dataset to 1 000 genes for Additional file 1: Figure S4f, GLM-PCA did not converge with default penalty 1, so we increased it to 5, following the tuning procedure used in the authors’ R implementation. Similarly, negative binomial GLM-PCA with estimation of θ did not converge on the benchmark dataset (Fig. 5) when we used gene selection by either Deviance residuals (θ=100) or Pearson residuals (θ=10). For these two cases, we had to increase the penalty to 10. On the organogenesis dataset, the Python implementation did not converge within reasonable time, so for this dataset, we resorted to the R implementation. It uses a different optimization method and employs stochastic minibatches. All reported GLM-PCA results for this dataset are for batchsize 10,000, as batchsizes 100 and 1,000 (default) resulted in considerably longer runtimes. Because the R implementation does not support NB GLM-PCA with fixed theta, for this dataset, we used GLM-PCA with jointly fit \(\hat {\theta }\).

Unless otherwise stated, all residuals and GLM-PCA with fixed θ used θ=100. Whenever gene selection was performed prior to a data transformation that required sequencing depths, we computed those depths using the sum over selected genes only.

Benchmarking cell type separation with kNN classification

We used the Zhengmix8eq dataset with known ground truth labels obtained by FACS-sorting of eight PBMC cell types [42, 43]. There were 400–600 cells in each cell type. We created a ninth, artificial population from 50 randomly selected B-cells (marked blue in Fig. 5a–b). To mimic a separate cell type, we added 10 pseudo marker that had zero expression everywhere apart from those 50 cells. For those 50 cells, UMI values were simulated as Poisson(n_{i}p), where n_{i} is the sequencing depth of the ith selected cell (range: 452–9,697), and expression fraction p was set to 0.001.

We then applied the 70 normalization pipelines shown in Fig. 5c to this dataset. Each pipeline either used one of the six methods to select 2000 HVGs or proceeded without HVG selection, followed by one of the ten methods for data transformation and dimensionality reduction to 50 dimensions. To assess cell type separation in this output space, we used a kNN classifier with a leave-one-out cross-validation procedure: For each cell, we trained a kNN classifier on the remaining n−1 cells. This resulted in a class prediction for each cell based on the majority vote of its k=15 neighboring cells. We quantified the performance of this prediction by computing the macro F1 score (harmonic mean between precision and recall, averaged across classes to counteract class imbalance). We used the sklearn implementations for kNN classification and the F1 score [55].

Measuring runtimes

All runtimes given in Table 1 are wall times from running the code in a Docker container with an Ubuntu 18 system on a machine with 256 GB RAM and 2×24 CPUs at 2.1 Ghz (Xeon Silver 4116 Dodecacore). The Docker container was restricted to use at most 30 CPU threads. To reduce overhead, we did not use Scanpy for timing experiments, and instead used numpy for basic computations and sklearn for PCA with default settings. Note that we used different implementations of GLM-PCA for the PBMC and organogenesis dataset (see above for details).

t-SNE embeddings

All t-SNE embeddings were made following recommendations from a recent paper [36] using the FIt-SNE implementation [56]. We used the PCA (or, when applicable, GLM-PCA) representation of the data as input. We used default FIt-SNE parameters, including automatically chosen learning rate. For initialization, we used the first two principal components of the data, scaled such that PC1 had standard deviation 0.0001 (as is default in FIt-SNE). The initialization was shared among all embeddings shown in the same figure, i.e., PCs of one data representation were used to initialize all other embeddings as well. For all datasets apart from the organogenesis one, we used perplexity combination of 30 and n/100, where n is the sample size [36]. For the organogenesis dataset embeddings, we used perplexity 30 and exaggeration 4 [40].

All datasets used in this work are publicly available and listed in Table 2. Detailed download instructions can be found at our Github repository.

References

Luecken MD, C Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019; 15(6):e8746. https://doi.org/10.15252/msb.20188746.

Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 2019; 20:296. https://doi.org/10.1186/s13059-019-1874-1.

Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15(12):1–21. https://doi.org/10.1186/s13059-014-0550-8.

Eling N, Richard AC, Richardson S, Marioni JC, Vallejos CA. Correcting the mean-variance dependency for differential variability testing using single-cell RNA sequencing data. Cell Syst. 2018; 7(3):284–94. https://doi.org/10.1016/j.cels.2018.06.011.

Svensson V, Gayoso A, Yosef N, Pachter L. Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics. 2020; 36(11):3418–3421. https://doi.org/10.1093/bioinformatics/btaa169.

Grün D, Kester L, Van Oudenaarden A. Validation of noise models for single-cell transcriptomics. Nat Methods. 2014; 11(6):637–40. https://doi.org/10.1038/nmeth.2930.

Hill MO. Correspondence analysis: a neglected multivariate method. J R Stat Soc Ser C (Appl Stat). 1974; 23(3):340–54. https://doi.org/10.2307/2347127.

Holmes S. Multivariate data analysis: the French way. In: Probability and statistics: essays in honor of David A. Freedman. Institute of Mathematical Statistics: 2008. p. 219–33. https://doi.org/10.1214/193940307000000455.

Willson L, Folks J, Young J. Complete sufficiency and maximum likelihood estimation for the two-parameter negative binomial distribution. Metrika. 1986; 33(1):349–62. https://doi.org/10.1007/BF01894768.

Clark SJ, Perry JN. Estimation of the negative binomial parameter κ by maximum quasi-likelihood. Biometrics. 1989; 45(1):306–16. https://doi.org/10.2307/2532055.

Lord D. Modeling motor vehicle crashes using Poisson-gamma models: Examining the effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter. Accid Anal Prev. 2006; 38(4):751–66. https://doi.org/10.1016/j.aap.2006.02.001.

Lord D, Miranda-Moreno LF. Effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter of Poisson-gamma models for modeling motor vehicle crashes: a bayesian perspective. Saf Sci. 2008; 46(5):751–70. https://doi.org/10.1016/j.ssci.2007.03.005.

Wang J, Huang M, Torre E, Dueck H, Shaffer S, Murray J, Raj A, Li M, Zhang NR. Gene expression distribution deconvolution in single-cell RNA sequencing. Proc Natl Acad Sci. 2018; 115(28):6437–46. https://doi.org/10.1073/pnas.1721085115.

Lopez-Delisle L, Delisle J-B. baredsc: Bayesian approach to retrieve expression distribution of single-cell. bioRxiv. 2021. https://doi.org/10.1101/2021.05.26.445740.

Bar-Lev SK, Enis P. On the classical choice of variance stabilizing transformations and an application for a Poisson variate. Biometrika. 1988; 75(4):803–4. https://doi.org/10.1093/biomet/75.4.803.

Freeman MF, Tukey JW. Transformations related to the angular and the square root. Ann Math Stat. 1950; 21(4):607–11. https://doi.org/10.1214/aoms/1177729756.

Wagner F. Straightforward clustering of single-cell RNA-Seq data with t-SNE and DBSCAN. BioRxiv. 2019. https://doi.org/10.1101/770388.

Wagner F. Monet: an open-source Python package for analyzing and integrating scRNA-Seq data using PCA-based latent spaces. bioRxiv. 2020. https://doi.org/10.1101/2020.06.08.140673.

Warton DI. Why you cannot transform your way out of trouble for small counts. Biometrics. 2018; 74(1):362–8. https://doi.org/10.1111/biom.12728.

Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck III WM, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive integration of single-cell data. Cell. 2019; 177(7):1888–902. https://doi.org/10.1016/j.cell.2019.05.031.

Tran NM, Shekhar K, Whitney IE, Jacobi A, Benhar I, Hong G, Yan W, Adiconis X, Arnold ME, Lee JM, et al.Single-cell profiles of retinal ganglion cells differing in resilience to injury reveal neuroprotective genes. Neuron. 2019; 104(6):1039–55. https://doi.org/10.1016%2Fj.neuron.2019.11.006.

Böhm JN, Berens P, Kobak D. A Unifying Perspective on Neighbor Embeddings along the Attraction-Repulsion Spectrum. arXiv. 2020. https://arxiv.org/abs/2007.08902.

Lun A. Overcoming systematic errors caused by log-transformation of normalized single-cell RNA sequencing data. BioRxiv. 2018. https://doi.org/10.1101/404962.

Duò A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research. 2018; 7:1141. https://doi.org/10.12688/f1000research.15666.2.

Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al.Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017; 8(1):1–12. https://doi.org/10.1038/ncomms14049.

Lun AT, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016; 17(1):1–14. https://doi.org/10.1186/s13059-016-0947-7.

Irizarry R. smallcount: R package with methods for small counts stored in a sparse matrix.[Internet]. [place unknown]: GitHub; 2021 [updated 2021 Jul 13; cited 2021 Jul 31]. https://github.com/rafalab/smallcount.

Ahlmann-Eltze C, Huber W. glmGamPoi: fitting Gamma-Poisson generalized linear models on single cell count data. Bioinformatics. 2020; 36(24):5701–2. https://doi.org/10.1093/bioinformatics/btaa1009.

Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014; 15(2):1–17. https://doi.org/10.1186/gb-2014-15-2-r29.

Breda J, Zavolan M, van Nimwegen E. Bayesian inference of gene expression states from single-cell RNA-seq data. Nat Biotechnol. 2021; 39:1008–16. https://doi.org/10.1038/s41587-021-00875-x.

Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015; 161(5):1187–201. https://doi.org/10.1016/j.cell.2015.04.044.

Han X, Wang R, Zhou Y, Fei L, Sun H, Lai S, Saadatpour A, Zhou Z, Chen H, Ye F, et al.Mapping the mouse cell atlas by microwell-seq. Cell. 2018; 172(5):1091–107. https://doi.org/10.1016/j.cell.2018.02.001.

Linderman GC, Rachh M, Hoskins JG, Steinerberger S, Kluger Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat Methods. 2019; 16(3):243–5. https://doi.org/10.1038/s41592-018-0308-4.

We thank Christoph Hafemeister, Rahul Satija, William Townes, Florian Wagner, Constantin Ahlmann-Eltze, and Erik van Nimwegen for discussions and helpful comments, and the Scanpy team for support.

Funding

This work was supported by the Deutsche Forschungsgemeinschaft (BE5601/4-1, BE5601/6-1, and EXC 2064 ML, project number 390727645), by the German Ministry of Education and Research (01GQ1601 and 01IS18039A), and by the National Institute Of Mental Health of the National Institutes of Health under Award Number U19MH114830. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

University of Tübingen, Institute for Ophthalmic Research, Tübingen, Germany

Jan Lause, Philipp Berens & Dmitry Kobak

University of Tübingen, Institute for Bioinformatics and Medical Informatics, Tübingen, Germany

Philipp Berens

University of Tübingen, Bernstein Center for Computational Neuroscience, Tübingen, Germany

Philipp Berens

University of Tübingen, Center for Integrative Neuroscience, Tübingen, Germany

JL and DK planned the study. JL performed computational experiments. JL, PB, and DK discussed the results. JL wrote the initial paper draft. JL, PB, and DK edited the paper. DK supervised the project. All authors read and approved the final manuscript.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.