### Multinomial model for scRNA-Seq

Let *y*_{ij} be the observed UMI counts for cell or droplet *i* and gene or spike-in *j*. Let \(n_{i}=\sum _{j} y_{ij}\) be the total UMIs in cell *i*, and *π*_{ij} be the unknown true relative abundance of gene *j* in cell *i*. The random vector \(\vec {y}_{i} = (y_{i1},\ldots,y_{iJ})^{\top }\) with constraint \(\sum _{j} y_{ij}=n_{i}\) follows a multinomial distribution with density function:

$$f(\vec{y}_{i}) = \binom{n_{i}}{y_{i1},\ldots,y_{iJ}} \prod_{j} \pi_{ij}^{y_{ij}} $$

Focusing on a single gene *j* at a time, the marginal distribution of *y*_{ij} is binomial with parameters *n*_{i} and *π*_{ij}. The marginal mean is E[*y*_{ij}]=*n*_{i}*π*_{ij}=*μ*_{ij}, the marginal variance is \(\text {var}[y_{ij}] = n_{i} \pi _{ij}(1-\pi _{ij}) = \mu _{ij}-\frac {1}{n_{i}}\mu _{ij}^{2}\), and the marginal probability of a zero count is \((1-\pi _{ij})^{n_{i}} = \left (1-\frac {\mu _{ij}}{n_{i}}\right)^{n_{i}}\). The correlation between two genes *j*,*k* is:

$$\text{cor}[y_{ij},y_{ik}] = \frac{-\sqrt{\pi_{ij}\pi_{ik}}}{\sqrt{(1-\pi_{ij})(1-\pi_{ik})}} $$

The correlation is induced by the constraint that counts sum to *n*_{i}. As an extreme example, if there are only two genes (*J*=2), increasing the count of the first gene automatically reduces the count of the second gene, since they must add up to *n*_{i} under multinomial sampling. This means that when *J*=2, there is perfect anti-correlation between the gene counts, which has nothing to do with biology. More generally, when either *J* or *n*_{i} is small, gene counts will be negatively correlated independently of biological gene-gene correlations, and it is not possible to analyze the data on a gene-by-gene basis (for example, by ranking and filtering genes for feature selection). Rather, comparisons are only possible between pairwise ratios of gene expression values [49]. Yet this type of analysis is difficult to interpret and computationally expensive for large numbers of genes (i.e., in high dimensions). Fortunately, under certain assumptions, more tractable approximations may be substituted for the true multinomial distribution.
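To illustrate, the correlation formula can be evaluated directly. A minimal Python sketch (the function name is illustrative, not from any published package):

```python
import math

def multinomial_cor(pi_j, pi_k):
    """Correlation between counts of genes j and k induced purely by
    the multinomial sum-to-n constraint; note it does not depend on n_i."""
    return -math.sqrt(pi_j * pi_k) / math.sqrt((1 - pi_j) * (1 - pi_k))

# Two genes: perfect anti-correlation regardless of the abundances
print(multinomial_cor(0.3, 0.7))      # approximately -1
# Many genes, each with small abundance: the induced correlation is negligible
print(multinomial_cor(1e-4, 2e-4))    # approximately -1.4e-4
```

With only two genes the counts are perfectly anti-correlated, whereas with thousands of genes of small relative abundance the induced correlation is negligible, which motivates the approximations discussed next.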

First, note that if correlation is ignored, the multinomial may be approximated by *J* independent binomial distributions. Intuitively, this approximation will be reasonable if all *π*_{ij} are very small, which is likely to be satisfied for scRNA-Seq if the number of genes *J* is large and no single gene constitutes the majority of mRNAs in the cell. If *n*_{i} is large and *π*_{ij} is small, each binomial distribution can be further approximated by a Poisson with mean *n*_{i}*π*_{ij}. Alternatively, the multinomial can be constructed by drawing *J* independent Poisson random variables and conditioning on their sum. If *J* and *n*_{i} are large, the difference between the conditional (multinomial) distribution and the independent Poissons becomes negligible. Since in practice *n*_{i} is large, the Poisson approximation to the multinomial may be reasonable [50–53].
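The quality of the Poisson approximation for a single gene can be checked numerically. The sketch below (illustrative parameter values, stdlib only) compares the binomial and Poisson probability mass functions for a large total count and a small relative abundance:

```python
import math

def binom_pmf(y, n, p):
    """Binomial probability of y successes in n trials."""
    return math.comb(n, y) * p**y * (1 - p) ** (n - y)

def pois_pmf(y, mu):
    """Poisson probability of count y with mean mu."""
    return math.exp(-mu) * mu**y / math.factorial(y)

n_i, pi_ij = 10_000, 2e-4     # large n_i, small pi_ij, so mu = 2
mu = n_i * pi_ij
for y in range(6):
    print(y, binom_pmf(y, n_i, pi_ij), pois_pmf(y, mu))
```

The pointwise differences are on the order of 10^{−5}, consistent with the claim that the approximation is accurate when *n*_{i} is large and *π*_{ij} is small.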

The multinomial model does not account for biological variability. As a result, an overdispersed version of the multinomial model may be necessary. This can be accommodated with the Dirichlet-multinomial distribution. Let \(\vec {y}_{i}\) be distributed as a multinomial conditional on the relative abundance parameter vector \(\vec {\pi }_{i}=(\pi _{i1},\ldots,\pi _{iJ})^{\top }\). If \(\vec {\pi }_{i}\) is itself a random variable with symmetric Dirichlet distribution having shape parameter *α*, the marginal distribution of \(\vec {y}_{i}\) is Dirichlet-multinomial. This distribution can itself be approximated by independent negative binomials. First, note that a symmetric Dirichlet random vector can be constructed by drawing *J* independent gamma variates with shape parameter *α* and dividing by their sum. Suppose (as above) we approximate the conditional multinomial distribution of \(\vec {y}_{i}\) such that *y*_{ij} follows an approximate Poisson distribution with mean *n*_{i}*π*_{ij}. Let *λ*_{ij} be a collection of non-negative random variables such that \(\pi _{ij}=\frac {\lambda _{ij}}{\sum _{j} \lambda _{ij}}\). We require that \(\vec {\pi }_{i}\) follows a symmetric Dirichlet, which is accomplished by having *λ*_{ij} follow independent gamma distributions with shape *α* and mean *n*_{i}/*J*. This implies \(\sum _{j} \lambda _{ij}\) follows a gamma with shape *J**α* and mean *n*_{i}. As *J*→*∞*, this distribution converges to a point mass at *n*_{i}, so for large *J* (satisfied by scRNA-Seq), \(\sum _{j} \lambda _{ij}\approx n_{i}\). This implies that *y*_{ij} approximately follows a conditional Poisson distribution with mean *λ*_{ij}, where *λ*_{ij} is itself a gamma random variable with mean *n*_{i}/*J* and shape *α*. If we then integrate out *λ*_{ij} we obtain the marginal distribution of *y*_{ij} as negative binomial with shape *α* and mean *n*_{i}/*J*. 
Hence a negative binomial model for count data may be regarded as an approximation to an overdispersed Dirichlet-multinomial model.
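The gamma-Poisson mixture argument can be verified numerically for a single gene: integrating the Poisson likelihood over a gamma mixing distribution reproduces the negative binomial probability mass function. The sketch below (illustrative parameter values; shape/mean parameterization) uses midpoint quadrature:

```python
import math

def nb_pmf(y, alpha, mu):
    """Negative binomial with shape alpha and mean mu."""
    p = alpha / (alpha + mu)
    return (math.gamma(y + alpha) / (math.gamma(alpha) * math.factorial(y))
            * p**alpha * (1 - p) ** y)

def gamma_pdf(lam, alpha, mu):
    theta = mu / alpha   # scale chosen so that the mean is mu
    return lam ** (alpha - 1) * math.exp(-lam / theta) / (math.gamma(alpha) * theta**alpha)

def pois_pmf(y, lam):
    return math.exp(-lam) * lam**y / math.factorial(y)

def mixture_pmf(y, alpha, mu, grid=20_000, upper=60.0):
    """Midpoint quadrature of Poisson(y | lambda) over Gamma(lambda; alpha, mu/alpha)."""
    h = upper / grid
    return sum(pois_pmf(y, (k + 0.5) * h) * gamma_pdf((k + 0.5) * h, alpha, mu)
               for k in range(grid)) * h

alpha, mu = 2.0, 3.0
for y in range(5):
    print(y, nb_pmf(y, alpha, mu), mixture_pmf(y, alpha, mu))
```

The mixture probabilities agree with the closed-form negative binomial to within the quadrature error.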

Parameter estimation with multinomial models (and their binomial or Poisson approximations) is straightforward. First, suppose we observe replicate samples \(\vec {y}_{i}\), *i*=1,…,*I* from the same underlying population of molecules, where the relative abundance of gene *j* is *π*_{j}. This is a null model because it assumes each gene has a constant expected expression level, and there is no biological variation across samples. Regardless of whether one assumes a multinomial, binomial, or Poisson model, the maximum likelihood estimator (MLE) of *π*_{j} is \(\hat {\pi }_{j} = \frac {\sum _{i} y_{ij}}{\sum _{i} n_{i}}\) where *n*_{i} is the total count of sample *i*. In the more realistic case that relative abundances *π*_{ij} of genes vary across samples, the MLE is \(\hat {\pi }_{ij}=\frac {y_{ij}}{n_{i}}\).

An alternative to the MLE is the maximum a posteriori (MAP) estimator. Suppose a symmetric Dirichlet prior with concentration parameter *α*_{i} is combined with the multinomial likelihood for cell *i*. The MAP estimator for *π*_{ij} is given by:

$$\tilde{\pi}_{ij}=\frac{\alpha_{i}+y_{ij}}{J\alpha_{i}+n_{i}} = w_{i}\frac{1}{J}+(1-w_{i})\hat{\pi}_{ij} $$

where *w*_{i}=*J**α*_{i}/(*J**α*_{i}+*n*_{i}), showing that the MAP is a weighted average of the prior mean that all genes are equally expressed (1/*J*) and the MLE (\(\hat {\pi }_{ij}\)). Compared to the MLE, the MAP biases the estimate toward the prior where all genes have the same expression. Larger values of *α*_{i} introduce more bias, while *α*_{i}→0 leads to the MLE. If *α*_{i}>0, the smallest possible value of \(\tilde {\pi }_{ij}\) is *α*_{i}/(*J**α*_{i}+*n*_{i}) rather than the zero obtained with the MLE. When there are many zeros in the data, the MAP can stabilize relative abundance estimates at the cost of introducing bias.
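As an illustration of the shrinkage formula, the two forms of the MAP estimator can be compared directly (a minimal sketch with arbitrary illustrative values):

```python
def map_estimate(y_ij, n_i, J, alpha_i):
    """MAP relative abundance under a symmetric Dirichlet(alpha_i) prior."""
    return (alpha_i + y_ij) / (J * alpha_i + n_i)

def map_weighted_form(y_ij, n_i, J, alpha_i):
    """Same estimate written as a weighted average of the prior mean 1/J and the MLE."""
    w = J * alpha_i / (J * alpha_i + n_i)
    return w * (1.0 / J) + (1 - w) * (y_ij / n_i)

# the two forms agree; a zero count is shrunk to a small positive value
print(map_estimate(3, 1000, 500, 0.5))
print(map_weighted_form(3, 1000, 500, 0.5))
print(map_estimate(0, 1000, 500, 0.5))   # positive, unlike the MLE y/n = 0
```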

### Mathematics of distortion from log-normalizing UMIs

Suppose the true counts in cell *i* are given by *x*_{ij} for genes *j*=1,…,*J*. Some of these may be zero, if a gene is not turned on in the cell. Knowing *x*_{ij} is equivalent to knowing the total number of transcripts \(t_{i}=\sum _{j} x_{ij}\) and the relative proportions of each gene *π*_{ij}, since *x*_{ij}=*t*_{i}*π*_{ij}. The total number of UMI counts \(n_{i}=\sum _{j} y_{ij}\) does not estimate *t*_{i}, since only a fraction of the transcripts in a cell are captured and sequenced. However, under multinomial sampling, the UMI relative abundances \(\hat {\pi }_{ij}=\frac {y_{ij}}{n_{i}}\) are MLEs for the true proportions *π*_{ij}. Note that it is possible that \(\hat {\pi }_{ij}=0\) even though *π*_{ij}>0. Because \(\sum _{j} \hat {\pi }_{ij}=1\) regardless of *n*_{i}, the use of multinomial MLEs is equivalent to the widespread practice of normalizing each cell by the total counts. Furthermore, the use of size factors *s*_{i}=*n*_{i}/*m* leads to \(\hat {\pi }_{ij} \times m\) (if *m*=10^{6}, this is CPM).

Traditional bulk RNA-Seq experiments measured gene expression in read counts of many cells per sample rather than UMI counts of single cells. Gene counts from bulk RNA-Seq could thus range over several orders of magnitude. To facilitate comparison of these large numbers, many bulk RNA-Seq methods have relied on a logarithm transformation. This enables interpretation of differences in normalized counts as fold changes on a relative scale. Also, for count data, the variance of each gene is a function of its mean, and log transformation can help to prevent highly expressed outlier genes from overwhelming downstream analyses. Prior to the use of UMIs, scRNA-Seq experiments also produced read counts with wide-ranging values, and a log transform was again employed. However, with single-cell data, more than 90% of the genes might be observed as exact zeros, and log(0)=−*∞*, which is not useful for data analysis. UMI data also contain large numbers of zeros, but do not contain very large counts, since PCR duplicates have been removed. Nevertheless, log transformation has been commonly used with UMI data as well.

The current standard is to transform the UMI counts as \(\log _{2}(c+\hat {\pi }_{ij} \times m)\) where *c* is a pseudocount to avoid taking the log of zero, and typically *c*=1. As before, *m* is some constant such as 10^{6} for CPM (see also [54] for an alternative). Finally, the data are centered and scaled so that the mean of each gene across cells is 0, and the standard deviation is 1. This standardization of the data causes any subsequent computation of distances or dimension reduction to be invariant to constant additive or multiplicative scaling. For example, under Manhattan distance, *d*(*x*+*c*,*y*+*c*)=|*x*+*c*−(*y*+*c*)|=|*x*−*y*|=*d*(*x*,*y*). In particular, using size factors such as CPM instead of relative abundances leads to a rescaling of the pseudocount, and use of any pseudocount is equivalent to replacing the MLE with the MAP estimator. Let *k*=*c*/*m* and *α*_{i}=*k**n*_{i}. Then, the weight term in the MAP formula becomes *w*_{i}=*J**k*/(1+*J**k*)=*w* which is constant across all cells *i*. Furthermore *J**k*=*w*/(1−*w*), showing that:

$${}{\begin{aligned} \log_{2}(c+\hat{\pi}_{ij} \times m) &= \log_{2}(k+\hat{\pi}_{ij}) + \log_{2}(m)\\ &= \log_{2}\left(\frac{w}{1-w}\frac{1}{J}+\hat{\pi}_{ij}\right)+\log_{2}(m)\\ &= \log_{2}\left(w\frac{1}{J}+(1-w)\hat{\pi}_{ij}\right)-\log_{2}(1-w)+\log_{2}(m)\\ &= \log_{2}(\tilde{\pi}_{ij})+C \end{aligned}} $$

where *C* is a global constant that does not vary across cells or genes. For illustration, if *c*=1 and *m*=10^{6}, this is equivalent to assuming a prior where all genes are equally expressed and for cell *i*, a weight of *w*=*J*/(10^{6}+*J*) is given to the prior relative to the MLE. Since the number of genes *J* is on the order of 10^{4}, we have *w*≈.01. The prior sample size for cell *i* is *J**α*_{i}=10^{−6}*J**n*_{i}≈.01×*n*_{i} where *n*_{i} is the data sample size. The standard transformation is therefore equivalent to using a weak prior to obtain a MAP estimate of the relative abundances, then log transforming before dimension reduction.
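The algebraic identity above is exact and can be checked numerically (illustrative values of *J*, *n*_{i}, *c*, and *m*):

```python
import math

def log_cpm(y, n_i, c=1.0, m=1e6):
    """Standard transformation log2(c + pi_hat * m)."""
    return math.log2(c + (y / n_i) * m)

def log_map_plus_const(y, n_i, J, c=1.0, m=1e6):
    """Equivalent form log2(pi_tilde) + C with k = c/m and alpha_i = k * n_i."""
    k = c / m
    alpha_i = k * n_i
    w = J * k / (1 + J * k)                # weight on the uniform prior
    C = math.log2(m) - math.log2(1 - w)    # global constant
    pi_map = (alpha_i + y) / (J * alpha_i + n_i)
    return math.log2(pi_map) + C

J, n_i = 20_000, 50_000
for y in (0, 1, 5, 100):
    print(log_cpm(y, n_i), log_map_plus_const(y, n_i, J))
```

Both forms agree to floating-point precision for any count *y*.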

In most scRNA-Seq datasets, the total number of UMIs *n*_{i} for some cells may be substantially less than the constant *m*. For these cells, the size factors *s*_{i}=*n*_{i}/*m* are less than 1. Therefore, after normalization (dividing by the size factor), the counts are scaled up to match the target size of *m*. Due to the discreteness of counts, this introduces a bias after log transformation if the pseudocount is small (or equivalently, if *m* is large). For example, let *c*=1 and *m*=10^{6} (CPM). If *n*_{i}=10^{4} for a particular cell, we have *s*_{i}=.01. A raw count of *y*_{ij}=1 for this cell is normalized to 1/.01=100 and transformed to log2(1+100)≈6.7. For this cell, on the log scale, there cannot be any values between 0 and 6.7, because fractional UMI counts cannot be observed and log2(1+0)=0. Small pseudocounts and small size factors combined with the log transform arbitrarily exaggerate the difference between a zero count and a small nonzero count. As previously shown, this scenario is equivalent to using MAP estimation of *π*_{ij} with a weak prior. To combat this distortion, one may attempt to strengthen the prior to regularize \(\tilde {\pi }_{ij}\) estimation at the cost of additional bias, as advocated by [21]. An extreme case occurs when *c*=1 and *m*=1. Here, the prior sample size is *J**n*_{i}, so almost all the weight is on the prior. The transform is then \(\log _{2}(1+\hat {\pi }_{ij})\). But this function is approximately linear on the domain \(0\leq \hat {\pi }_{ij}\leq 1\). After centering and scaling, a linear transformation is vacuous.
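The size of this distortion gap is easy to compute directly (a sketch using the same illustrative numbers as the example above):

```python
import math

def log_cpm(y, n_i, c=1.0, m=1e6):
    return math.log2(c + (y / n_i) * m)

# cell with n_i = 10^4 (size factor 0.01): a single UMI jumps to about 6.7
print(log_cpm(0, 10_000))       # 0.0
print(log_cpm(1, 10_000))       # log2(101), about 6.66; nothing can lie in between
# cell with n_i = 10^6: the same jump is only log2(2) = 1.0
print(log_cpm(1, 1_000_000))
```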

To summarize, log transformation with a weak prior (small size factor, such as CPM) introduces strong artificial distortion between zeros and nonzeros, while log transformation with a strong prior (large size factor) is roughly equivalent to not log transforming the data.

### Generalized PCA

PCA minimizes the mean squared error (MSE) between the data and a low-rank representation, or embedding. Let *y*_{ij} be the raw counts and *z*_{ij} be the normalized and transformed version of *y*_{ij} such as centered and scaled log-CPM (*z*-scores). The PCA objective function is:

$$\min_{u,v} \sum_{i,j}(z_{ij}-\vec{u}_{i}'\vec{v}_{j})^{2} $$

where \(\vec {u}_{i},\vec {v}_{j}\in \mathbb {R}^{L}\) for *i*=1,…,*I*, *j*=1,…,*J*. The \(\vec {u}_{i}\) are called factors or principal components, and the \(\vec {v}_{j}\) are called loadings. The number of latent dimensions *L* controls the complexity of the model. Minimization of the MSE is equivalent to minimizing the Euclidean distance metric between the embedding and the data. It is also equivalent to maximizing the likelihood of a Gaussian model:

$$z_{ij}\sim\mathcal{N}\left(\vec{u}_{i}'\vec{v}_{j},\sigma^{2}\right) $$
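For concreteness, the MSE objective can be minimized by alternating least squares. The sketch below fits a rank-1 model (*L*=1) in plain Python on already-centered data; it is only an illustration, not the implementation used for any analysis in this manuscript:

```python
def rank1_pca(Z, iters=50):
    """Alternating least squares for min_{u,v} sum_ij (z_ij - u_i * v_j)^2."""
    I, J = len(Z), len(Z[0])
    v = [1.0] * J
    u = [0.0] * I
    for _ in range(iters):
        # holding v fixed, the optimal u_i is a least-squares coefficient
        vv = sum(x * x for x in v)
        u = [sum(Z[i][j] * v[j] for j in range(J)) / vv for i in range(I)]
        # symmetrically for v with u fixed
        uu = sum(x * x for x in u)
        v = [sum(Z[i][j] * u[i] for i in range(I)) / uu for j in range(J)]
    return u, v

# exactly rank-1 input: the embedding reproduces the data and the MSE is ~0
Z = [[a * b for b in (4.0, 5.0, 6.0)] for a in (1.0, 2.0, 3.0)]
u, v = rank1_pca(Z)
mse = sum((Z[i][j] - u[i] * v[j]) ** 2 for i in range(3) for j in range(3)) / 9
print(mse)
```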

If we replace the Gaussian model with a Poisson, which approximates the multinomial, we can directly model the UMI counts as:

$$y_{ij}\sim \text{Poi}\left(n_{i}\exp\{\vec{u}_{i}'\vec{v}_{j}\}\right) $$

or alternatively, in the case of overdispersion, we may approximate the Dirichlet-multinomial using a negative binomial likelihood:

$$y_{ij}\sim NB\left(n_{i}\exp\{\vec{u}_{i}'\vec{v}_{j}\};~\phi_{j}\right) $$

We define the *linear predictor* as \(\eta _{ij} = \log n_{i} + \vec {u}_{i}'\vec {v}_{j}\). It is clear that the mean \(\mu _{ij}=e^{\eta _{ij}}\) appears in both the Poisson and negative binomial model statements, showing that the latent factors interact with the data only through the mean. We may then estimate \(\vec {u}_{i}\) and \(\vec {v}_{j}\) (and *ϕ*_{j}) by maximizing the likelihood (in practice, adding a small L2 penalty to large parameter values improves numerical stability). A link function must be used since \(\vec {u}_{i}\) and \(\vec {v}_{j}\) are real valued whereas the mean of a Poisson or negative binomial must be positive. The total UMI count *n*_{i} is used as an offset since no normalization has taken place; alternative size factors *s*_{i} such as those from scran [20] could be used in place of *n*_{i}. If the first element of each \(\vec {u}_{i}\) is constrained to equal 1, this induces a gene-specific intercept term in the first position of each \(\vec {v}_{j}\), which is analogous to centering. Otherwise, the model is very similar to that of PCA; it is simply optimizing a different objective function. Unfortunately, MLEs for \(\vec {u}_{i}\) and \(\vec {v}_{j}\) cannot be expressed in closed form, so an iterative Fisher scoring procedure is necessary. We refer to this model as GLM-PCA [55]. Just as PCA minimizes MSE, GLM-PCA minimizes a generalization of MSE called the *deviance* [56]. While generalized PCA was originally proposed by [31] (see also [57] and [58]), our implementation is novel in that it allows for intercept terms, offsets, overdispersion, and non-canonical link functions. We also use a blockwise update for optimization, which we found to be more numerically stable than that of [31]; we iterate over latent dimensions *l* rather than rows or columns. This technique is inspired by non-negative matrix factorization algorithms such as hierarchical alternating least squares and rank-one residue iteration; see [59] for a review.

As an illustration, consider GLM-PCA with the Poisson approximation to a multinomial likelihood. The objective function to be minimized is simply the overall deviance:

$$\begin{array}{*{20}l} D &= 2\sum_{i,j} \left[y_{ij}\log\left(\frac{y_{ij}}{\mu_{ij}}\right)-(y_{ij}-\mu_{ij})\right]\\ \log\mu_{ij} &= \eta_{ij} = \log s_{i} + \vec{u}_{i}'\vec{v}_{j} = \log s_{i} + v_{j1} + \sum_{l=2}^{L} u_{il}v_{jl} \end{array} $$

where *s*_{i} is a fixed size factor such as the total number of UMIs (*n*_{i}). The optimization proceeds by taking derivatives with respect to the unknown parameters: *v*_{j1} is a gene-specific intercept term, and the remaining *u*_{il} and *v*_{jl} are the latent factors.
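A toy version of this optimization can be sketched with plain gradient descent (in place of the blockwise Fisher scoring, intercept constraint, and penalty described above); all data values below are illustrative, and the updates use the fact that the derivative of the deviance with respect to *η*_{ij} is proportional to \(\mu _{ij}-y_{ij}\):

```python
import math, random

def poisson_deviance(Y, U, V, offsets):
    """Overall Poisson deviance of a factor model with fixed offsets log n_i."""
    d = 0.0
    for i, row in enumerate(Y):
        for j, y in enumerate(row):
            mu = math.exp(offsets[i] + sum(u * v for u, v in zip(U[i], V[j])))
            d += (y * math.log(y / mu) if y > 0 else 0.0) - (y - mu)
    return 2.0 * d

def toy_glm_pca(Y, L=2, steps=300, lr=1e-3, seed=0):
    """Gradient descent on the Poisson deviance; not the published algorithm."""
    rng = random.Random(seed)
    I, J = len(Y), len(Y[0])
    offsets = [math.log(sum(row)) for row in Y]   # log n_i offset
    U = [[rng.gauss(0.0, 0.1) for _ in range(L)] for _ in range(I)]
    V = [[rng.gauss(0.0, 0.1) for _ in range(L)] for _ in range(J)]
    for _ in range(steps):
        # residuals mu - y drive the gradient in eta
        R = [[math.exp(offsets[i] + sum(u * v for u, v in zip(U[i], V[j]))) - Y[i][j]
              for j in range(J)] for i in range(I)]
        gU = [[sum(R[i][j] * V[j][l] for j in range(J)) for l in range(L)] for i in range(I)]
        gV = [[sum(R[i][j] * U[i][l] for i in range(I)) for l in range(L)] for j in range(J)]
        for i in range(I):
            for l in range(L):
                U[i][l] -= lr * gU[i][l]
        for j in range(J):
            for l in range(L):
                V[j][l] -= lr * gV[j][l]
    return U, V, offsets

Y = [[5, 0, 2, 1], [4, 1, 3, 0], [0, 6, 1, 7], [1, 5, 0, 8]]
U0, V0, off = toy_glm_pca(Y, steps=0)   # random initialization only
U, V, off = toy_glm_pca(Y, steps=300)
print(poisson_deviance(Y, U0, V0, off), poisson_deviance(Y, U, V, off))
```

The deviance decreases as the factors are fitted, mirroring how PCA decreases the MSE.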

The GLM-PCA method is most concordant with the data-generating mechanism, since all aspects of the pipeline are integrated into a coherent model rather than being handled through sequential normalizations and transformations. The interpretation of the \(\vec {u}_{i}\) and \(\vec {v}_{j}\) vectors is the same as in PCA. For example, suppose we set the number of latent dimensions to 2 (i.e., *L*=3 to account for the intercept). We can plot *u*_{i2} on the horizontal axis and *u*_{i3} on the vertical axis for each cell *i* to visualize the relationships between cells such as gradients or clusters. In this way, the \(\vec {u}_{i}\) and \(\vec {v}_{j}\) capture biological variability such as differentially expressed genes.

### Residuals and *z*-scores

Just as mean squared error can be computed by taking the sum of squared residuals under a Gaussian likelihood, the deviance is equal to the sum of squared *deviance residuals* [56]. Since deviance residuals are not well-defined for the multinomial distribution, we adopt the binomial approximation. The deviance residual for gene *j* in cell *i* is given by:

$${}r^{(d)}_{ij}=\text{sign}(y_{ij}-\hat{\mu}_{ij})\sqrt{2y_{ij}\log\frac{y_{ij}}{\hat{\mu}_{ij}} + 2(n_{i}-y_{ij})\log\frac{n_{i}-y_{ij}}{n_{i}-\hat{\mu}_{ij}}} $$

where under the null model of constant gene expression across cells, \(\hat {\mu }_{ij}=n_{i}\hat {\pi }_{j}\). The deviance residuals are the result of regressing away this null model. An alternative to deviance residuals is the Pearson residual, which is simply the difference in observed and expected values scaled by an estimate of the standard deviation. For the binomial, this is:

$$r^{(p)}_{ij}=\frac{y_{ij}-\hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}-\frac{1}{n_{i}}\hat{\mu}_{ij}^{2}}} $$

According to the theory of generalized linear models (GLM), both types of residuals approximately follow a normal distribution with mean zero if the null model is correct [56]. Deviance residuals tend to be more symmetric than Pearson residuals. In practice, the residuals may not have mean exactly equal to zero, and may be standardized by dividing by their gene-specific standard deviation, just as in the Gaussian case. Recently, Pearson residuals based on a negative binomial null model have also been independently proposed as the sctransform method [60].
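Both residual types can be computed directly from a count matrix. The sketch below (plain Python, illustrative data; not the implementation used for the analyses) regresses away the null model \(\hat {\mu }_{ij}=n_{i}\hat {\pi }_{j}\):

```python
import math

def null_residuals(Y):
    """Deviance and Pearson residuals under the binomial null model
    mu_ij = n_i * pi_hat_j (constant relative abundance per gene)."""
    I, J = len(Y), len(Y[0])
    n = [sum(row) for row in Y]
    total = sum(n)
    pi = [sum(Y[i][j] for i in range(I)) / total for j in range(J)]
    dev, pear = [], []
    for i in range(I):
        drow, prow = [], []
        for j in range(J):
            y, mu = Y[i][j], n[i] * pi[j]
            # log terms vanish when y == 0 or y == n_i (0 * log 0 convention)
            t1 = y * math.log(y / mu) if y > 0 else 0.0
            t2 = (n[i] - y) * math.log((n[i] - y) / (n[i] - mu)) if y < n[i] else 0.0
            drow.append(math.copysign(math.sqrt(2.0 * (t1 + t2)), y - mu))
            prow.append((y - mu) / math.sqrt(mu - mu**2 / n[i]))
        dev.append(drow)
        pear.append(prow)
    return dev, pear

dev, pear = null_residuals([[2, 8], [3, 7]])
print(dev[0][0], pear[0][0])
```

For counts close to their expected values, the two residual types are numerically similar, as in this small example.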

The *z*-score is simply the Pearson residual where we replace the multinomial likelihood with a Gaussian (normal) likelihood and use normalized values instead of raw UMI counts. Let *q*_{ij} be the normalized (possibly log-transformed) expression of gene *j* in cell *i* without centering and scaling. The null model is that the expression of the gene is constant across all cells:

$$q_{ij}\sim\mathcal{N}\left(\mu_{j},~\sigma^{2}_{j}\right) $$

The MLEs are \(\hat {\mu }_{j} = \frac {1}{I}\sum _{i} q_{ij}\), \(\hat {\sigma }^{2}_{j} = \frac {1}{I}\sum _{i} (q_{ij}-\hat {\mu }_{j})^{2}\), and the *z*-scores equal the Pearson residuals \(z_{ij}=(q_{ij}-\hat {\mu }_{j})/\hat {\sigma }_{j}\).
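A direct sketch of the *z*-score computation (rows are cells, columns are genes; note the MLE variance divides by *I*):

```python
import math

def z_scores(Q):
    """Gene-wise z-scores of a cells-by-genes matrix of normalized values:
    Pearson residuals under a Gaussian null with constant mean per gene."""
    I, J = len(Q), len(Q[0])
    means = [sum(Q[i][j] for i in range(I)) / I for j in range(J)]
    sds = [math.sqrt(sum((Q[i][j] - means[j]) ** 2 for i in range(I)) / I)
           for j in range(J)]
    return [[(Q[i][j] - means[j]) / sds[j] for j in range(J)] for i in range(I)]

# each column is centered and scaled; the result is invariant to per-gene scaling
Z = z_scores([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(Z)
```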

We compared the accuracy of the residual approximations by simulating 150 cells in 3 clusters of 50 cells each with 5000 genes, of which 500 were differentially expressed across clusters (informative genes). We also created 2 batches, batch 1 with total counts of 1000 and batch 2 with total counts of 2000. Each cluster had an equal number of cells in the 2 batches. We then ran GLM-PCA on the raw counts, PCA on \(\log _{2}(1+\text {CPM})\), PCA on deviance residuals, and PCA on Pearson residuals with *L*=2 dimensions.

### Feature selection using deviance

Genes with constant expression across cells are not informative. Such genes may be described by the multinomial null model where *π*_{ij}=*π*_{j}. Goodness of fit to a multinomial distribution can be quantified using deviance, which is twice the difference in log-likelihoods comparing a saturated model to a fitted model. The multinomial deviance is a joint deviance across all genes, and for this reason is not helpful for screening informative genes. Instead, one may use the binomial deviance as an approximation:

$$D_{j} = 2\sum_{i}\left[y_{ij}\log\frac{y_{ij}}{n_{i}\hat{\pi}_{j}} + (n_{i}-y_{ij})\log\frac{(n_{i}-y_{ij})}{n_{i}(1-\hat{\pi}_{j})}\right] $$

A large deviance value indicates the model in question provides a poor fit. Those genes with biological variation across cells will be poorly fit by the null model and will have the largest deviances. By ranking genes according to their deviances, one may thus obtain highly deviant genes as an alternative to highly variable or highly expressed genes.
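This screening is straightforward to implement. A minimal sketch on illustrative data (rows are cells, columns are genes; the middle gene has constant relative abundance, so its deviance is exactly zero):

```python
import math

def binomial_deviance(Y):
    """Per-gene binomial deviance against the constant-abundance null model."""
    I, J = len(Y), len(Y[0])
    n = [sum(row) for row in Y]
    total = sum(n)
    D = []
    for j in range(J):
        pi = sum(Y[i][j] for i in range(I)) / total
        d = 0.0
        for i in range(I):
            y = Y[i][j]
            if y > 0:            # 0 * log 0 terms drop out
                d += y * math.log(y / (n[i] * pi))
            if y < n[i]:
                d += (n[i] - y) * math.log((n[i] - y) / (n[i] * (1 - pi)))
        D.append(2.0 * d)
    return D

# the first and third genes vary across cells and rank above the constant gene
Y = [[10, 5, 5], [10, 5, 5], [0, 5, 15], [0, 5, 15]]
print(binomial_deviance(Y))
```

Sorting genes by this statistic in decreasing order yields the highly deviant genes.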

### Systematic comparison of methods

We considered combinations of the following methods and parameter settings, following [15]. Italics indicate methods proposed in this manuscript. Feature selection: highly expressed genes, highly variable genes, and *highly deviant genes*. We did not compare against genes with high dropout rates because [15] found this method to have poor downstream clustering performance for UMI counts and it is not as widely used in the literature. The numbers of genes retained are 60, 300, and 1500. Normalization, transformation, and dimension reduction: PCA on log-CPM *z*-scores, ZINB-WAVE [28], *PCA on deviance residuals*, *PCA on Pearson residuals*, and *GLM-PCA*. The numbers of latent dimensions are 10 and 30. Clustering algorithms are *k*-means [61] and Seurat [17]. The number of clusters is all values from 2 to 10, inclusive. Seurat resolutions are 0.05, 0.1, 0.2, 0.5, 0.8, 1, 1.2, 1.5, and 2.