Comparison and evaluation of statistical error models for scRNA-seq

Background Heterogeneity in single-cell RNA-seq (scRNA-seq) data is driven by multiple sources, including biological variation in cellular state as well as technical variation introduced during experimental processing. Deconvolving these effects is a key challenge for preprocessing workflows. Recent work has demonstrated the importance and utility of count models for scRNA-seq analysis, but there is a lack of consensus on which statistical distributions and parameter settings are appropriate. Results Here, we analyze 59 scRNA-seq datasets that span a wide range of technologies, systems, and sequencing depths in order to evaluate the performance of different error models. We find that while a Poisson error model appears appropriate for sparse datasets, we observe clear evidence of overdispersion for genes with sufficient sequencing depth in all biological systems, necessitating the use of a negative binomial model. Moreover, we find that the degree of overdispersion varies widely across datasets, systems, and gene abundances, and argues for a data-driven approach for parameter estimation. Conclusions Based on these analyses, we provide a set of recommendations for modeling variation in scRNA-seq data, particularly when using generalized linear models or likelihood-based approaches for preprocessing and downstream analysis. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-021-02584-9).

: Relationship between gene abundance and the variance of Pearson residuals. Same as Figure 2D but additionally showing results for sctransform (v2 regularization).     Figure 2C, but with additional datasets 41-50. Also shown are results from sctransform (v2 regularization). 8 Gene mean

Residual variance
Figure S10: Evaluating variance stabilization for different error models. Same as in Figure 2C, but with additional datasets 51-59. Also shown are results from sctransform (v2 regularization). 9 Overdispersion was estimated using all cells after accounting for library size using a negative-binomial GLM. The red curve indicates a LOESS fit. All datasets exhibit a relationship between gene mean and θ [Datasets Overdispersion was estimated using all cells after accounting for library size using a negative-binomial GLM.
The red curve indicates a LOESS fit. All datasets exhibit a relationship between gene mean and θ [Datasets Overdispersion was estimated using all cells after accounting for library size using a negative-binomial GLM.
The red curve indicates a LOESS fit. All datasets exhibit a relationship between gene mean and θ [Datasets Using mean counts distribution from PBMCs profiled using Smart-seq3, synthetic counts matrices were generated using a fixed θ = {0.001, 0.01, 0.1, 1, 10, 100}. There is a bias in estimated θ from all the three methods: MLE (glmGamPoi (1)), SCT (sctransform) and SCT2 (sctransform, v2 regularization). The bias arises from difficulty in estimating the true θ when µ < 1 and µ < θ . We note that the variance of the NB model is given by µ gc + µ 2 gc /θ g . The second term approaches 0 for small values of µ, which is where we observe this bias. Therefore, the bias in parameter estimation has minimal impact on both the expected NB variance, and the final Pearson residuals (2).  Figure S16: Effect of 'upsampling' on µ − θ relationship Relationship between gene mean and dispersion observed in PBMC Smart-seq3 dataset, simulated dataset with different true dispersions (θ sim = 10 and 100) and Metacells (M20, M14, M12). The simulated datasets were generated using mean counts from the observed PBMC Smart-seq3 dataset, but by 'upsampling' the means to be 500 times larger. Metacells were generated using MetaCell (3) using different parameters of K for the KNN graph. M20, M14, and M12 represents 20, 14, and 12 metacells constructed using K = 200, 300, and 400 respectively.       (ChromiumV3)   CD74  CST3  GNLY  S100A9  PPBP  PF4  IGKC  IGLC2  IGLC3 Figure S25: Benchmarking differential expression analysis. Number of differential expression genes identified between two groups of biological identical HEK293 cells profiled using Quartz-Seq2 and Drop-seq. Datasets were obtained from Mereu et al. (9). Column titles show the DE testing method.   Table S3: Proportion of highly variable genes overlapping with marker genes. Values in columns θ = 100, θ = 10, sctransform v1, and sctransform v2 indicate the median cumulative proportion of genes up to a given rank that overlap with the marker genes in the system (over multiple technologies). Bone Marrow and Fetus, represent datasets with only one technology.

System
Variable feature rank θ = 100 θ = 10 sctransform v1 sctransform v2 Table S4: Number of highly variable genes overlapping with marker genes. Values in columns θ = 100, θ = 10, sctransform v1, and sctransform v2 indicate the median cumulative number of genes up to a given rank that overlap with the marker genes in the system (over multiple technologies). Bone Marrow and Fetus, represent datasets with only one technology.