Fig. 1 | Genome Biology

Fig. 1

From: Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data

Regression model of Hafemeister and Satija [3] compared to the offset model. Each dot corresponds to a model fit to the counts of a single gene in the 33k PBMC dataset (10x Genomics, n = 33,148 cells). Following Hafemeister and Satija [3], we included only the 16,809 genes that were detected in at least five cells. Color denotes the local point density from low (blue) to high (yellow). Expression mean was computed as \(\frac{1}{n}\sum_{c} X_{cg}\). (a) Intercept estimates \(\hat{\beta}_{0g}\) in the original regression model. Dashed line: analytic solution for \(\hat{\beta}_{0g}\) in the offset model we propose. (b) Slope estimates \(\hat{\beta}_{1g}\). Dashed line: \(\beta_{1g} = \ln(10) \approx 2.3\). (c) Overdispersion estimates \(\hat{\theta}_{g}\). (d) Relationship between slope and intercept estimates (\(\rho = -0.91\)). (e) Intercept estimates in the offset model, where the slope coefficient is fixed to 1. Dashed line shows the analytic solution, which is a linear function of gene mean. (f) Overdispersion estimates \(\hat{\theta}_{g}\) on simulated data with true \(\theta = 10\) (dashed line) for all genes. (g) Overdispersion estimates \(\hat{\theta}_{g}\) on the same simulated data as in (f), but now with 100 instead of 10 iterations in the theta.ml() optimizer (R, MASS package). Cases for which the optimization diverged to infinity or resulted in spuriously large estimates (\(\hat{\theta}_{g} > 10^{6}\)) are shown at \(\hat{\theta}_{g} = \infty\) with some jitter. Dashed line: true value \(\theta = 10\). (h) Variance of Pearson residuals in the offset model. The residuals were computed analytically, assuming \(\theta = 100\) for all genes. Following Hafemeister and Satija [3], we clipped the residuals to a maximum value of \(\sqrt{n}\). Dashed line indicates unit variance. Red dots show the genes identified in the original paper as most variable.
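The residual computation described for the last panel can be illustrated with a minimal NumPy sketch. The caption does not spell out the formula, so the following are assumptions rather than the authors' code: the expected count is taken as \(\mu_{cg} = (\sum_{g} X_{cg})(\sum_{c} X_{cg}) / \sum_{c,g} X_{cg}\), the residual as \((X_{cg} - \mu_{cg}) / \sqrt{\mu_{cg} + \mu_{cg}^{2}/\theta}\), and the clipping is applied symmetrically at \(\pm\sqrt{n}\). The function name `analytic_pearson_residuals` and the toy Poisson data are hypothetical.

```python
import numpy as np

def analytic_pearson_residuals(counts, theta=100.0):
    """Sketch of clipped Pearson residuals for a cells-by-genes count matrix.

    Assumes every gene has a non-zero total count (the caption keeps only
    genes detected in at least five cells), so that mu > 0 everywhere.
    """
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]

    # Assumed offset-model expectation: outer product of per-cell and
    # per-gene totals, divided by the grand total.
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    mu = cell_totals * gene_totals / counts.sum()

    # Negative binomial variance with a shared overdispersion theta.
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)

    # Clip to [-sqrt(n), sqrt(n)]; the symmetric lower bound is an assumption.
    bound = np.sqrt(n_cells)
    return np.clip(residuals, -bound, bound)

# Hypothetical toy data: 200 cells, 50 genes, pure Poisson counts.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=2.0, size=(200, 50))
residuals = analytic_pearson_residuals(toy_counts, theta=100.0)

# Per-gene residual variance, the quantity plotted against gene mean.
residual_variance = residuals.var(axis=0)
```

On data without biological variability, such as the toy matrix above, the per-gene residual variance should stay close to the unit-variance reference line, and genes that rise well above it are candidates for being highly variable.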
