Statistics or biology: the zero-inflation controversy about scRNA-seq data

Researchers view vast zeros in single-cell RNA-seq data differently: some regard zeros as biological signals representing no or low gene expression, while others regard zeros as missing data to be corrected. To help address the controversy, here we discuss the sources of biological and non-biological zeros; introduce five mechanisms of adding non-biological zeros in computational benchmarking; evaluate the impacts of non-biological zeros on data analysis; benchmark three input data types: observed counts, imputed counts, and binarized counts; discuss the open questions regarding non-biological zeros; and advocate the importance of transparent analysis. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-022-02601-5).

specific %) truncates the lowest non-zero counts of each gene based on the gene's specific masking proportion determined by the empirical formula.
The second categorization is regarding whether the masking proportion is gene-specific. Two schemes mask the same expected proportion 100p% of non-zero counts for all genes: random mask (all genes) and quantile mask (per-gene, same %). Three schemes use gene-specific masking proportions: quantile mask (all genes), random mask (per-gene, specific %), and quantile mask (per-gene, specific %). Specifically, although quantile mask (all genes) does not use the empirical formula to determine gene-specific masking proportions as in random mask (per-gene, specific %) and quantile mask (per-gene, specific %), it still truncates different proportions of non-zero counts for different genes. The reason is that its truncation threshold is set to the p-th quantile of all genes' non-zero counts, and different genes have different numbers of non-zero counts below that threshold. It is also worth noting that we do not include random mask (per-gene, same %) because it is theoretically equivalent to random mask (all genes)-both schemes are expected to randomly mask 100p% of every gene's non-zero counts (Additional File 1: Fig. S1a).
Note that random masking aims to reflect the random nature of sampling zeros. In a sequencing experiment, allocation of reads to genes is essentially random sampling from a multinomial distribution, whose probabilities are the proportions of genes in terms of cDNA copy numbers in the sequencing library. Due to the randomness of sampling, for two genes with moderately different non-zero proportions, it is possible that, in one experiment, the gene with the larger proportion receives a zero read count, i.e., a sampling zero, while the gene with the smaller proportion receives a non-zero read count. The magnitude of the randomness depends on the sequencing depth. For every gene, the standard deviation of its count over its expected count is equal to a large constant depending on its proportion (i.e., (1 − q i )/q i , where q i is the proportion of gene i) multiplied by the inverse of the square root of the sequencing depth (i.e., 1/ √ N , where N is the sequencing depth). Hence, the smaller the sequencing depth, the larger the standard deviation of every gene's count in relation to its expected count, the more likely that genes receive sampling zeros irrespective of their proportions. Moreover, the expected number of sampling where I is the number of genes) decreases as the sequencing depth increases. In contrast, quantile masking aims to reflect gene proportions in the sequencing library and technical zeros, i.e., zero counts due to zero proportions without randomness. Quantile masking also reflects the fact that, despite of randomness, a gene with a small proportion is more likely to receive a sampling zero than a gene with a much larger proportion does.
Hence, for Drop-seq and 10x Genomics, since they sequence many cells, per-cell sequencing depth is low and thus randomness is influential, random masking better represents the occurrence of non-biological zeros, sampling zeros in particular, than quantile masking does. For Smart-seq2, since per-cell sequencing depth is high and thus randomness is negligible, quantile masking better resembles the generation mechanism of non-biological zeros, technical zeros in particular, than random masking does.  Figure S1: Five masking schemes for introducing non-biological zeros. (a) A tree diagram illustrating the design of the five masking schemes. From the top, the first division is about whether masking is independent of or completely dependent on count values, with the former as random masking and the latter as quantile masking. The second division is about whether masking is performed across all genes (with the same masking proportion) or within each gene (i.e., pergene). If the latter, the third division is regarding whether the masking proportion is the same for all genes or specific to each gene depending on the gene's mean non-zero expression level. Note that random masking across all genes is equivalent to random masking per-gene with the same masking proportion (shown by the double arrow on the left). (b) A toy example illustration of the five masking schemes. The topleft plot shows the expression counts of three genes in four cells without zero-inflation; the other five plots show the expression counts after the five masking schemes are applied with the same masking proportion p = 0.5 (i.e., 50% of the non-zero gene expression counts are masked as zeros). (c) Technical explanation of each masking scheme. In the notations, p denotes the overall masking proportion across all genes, and p i is the masking proportion of gene i.   total zero proportion masking proportion Figure S2: Effects of non-biological zeros on DE gene identification in terms of precision and recall. We introduce a varying number of non-biological zeros, which correspond to masking proportions 0.1-0.9, into the simulated (a) Smart-seq2, (b) Drop-seq, and (c) 10x Genomics datasets using five masking schemes. The horizontal axes show (top) the total zero proportion (including the zeros before masking and the non-biological zeros introduced by masking) and (bottom) the masking proportion (i.e., the proportion of non-zero counts masked by a masking schemes). After the introduction of non-biological zeros, we apply Monocle 3 and Seurat to each dataset to identify DE genes. We evaluate the accuracy using the precision and recall (given the false discovery rate 5%; defined in Fig. ??d) Figure S3: Evaluation of clustering analysis on observed, binarized and imputed data. We evaluate the clustering analysis on Smart-seq2, Drop-seq, and 10x Genomics data based on observed, binarized and imputed data. We perform this analysis before and after using the five masking schemes (type 1 ZI-type 5 ZI) to introduce non-biological zeros. Besides the bin-Qiu et al. which indicates the clustering algorithm developed specially for binarized data, we use Louvain clustering (in Seurat) on observed, binarized, and imputed data. We use ARI to evaluate the clustering results. Masking schemes

Silhouette score
Raw counts scImpute's imputed counts SAVER's imputed counts MAGIC's imputed counts Dimension reduction Figure S4: Dimension reduction comparison using PCA, TSNE and UMAP. In addition to the original data, we use five masking schemes (Type 1-Type 5) to introduce 50% nonbiological zeros and evaluate the effects on the downstream analyses with different input data. The five masking schemes are random mask (all genes), quantile mask (all genes), random mask (per-gene, specific %), quantile mask (per-gene, same %), and quantile mask (per-gene, specific %), corresponding to type 1-type 5, respectively. Then, we perform scImpute, Saver, or Magic to get the imputed data. Finally, We evaluate the dimension reduction using PCA, t-SNE or UMAP on Smart-seq2, Drop-seq, and 10x Genomics data based on observed and imputed data and use Silhouette score to evaluate the dimension reduction results.  Figure S5: Evaluation of dimension reduction analysis on observed, binarized and imputed data. We evaluate the dimension reduction analysis on Smart-seq2, Drop-seq, and 10x Genomics data based on observed, binarized and imputed data. We perform this analysis before and after using the five masking schemes (type 1 ZI-type 5 ZI) to introduce non-biological zeros. We use UMAP (in Seurat) on observed, binarized, and imputed data to perform dimension reduction. We use Silhouette score to evaluate the dimension reduction results.  Figure S6: UMAP dimesion reduction visualization on observed, binarized and imputed Smart-seq2 data. We perform UMAP (in Seurat) on Smart-seq2's observed, binarized and imputed data. We perform this analysis before and after using the five masking schemes (type 1 ZI-type 5 ZI) to introduce non-biological zeros.  Figure S7: UMAP dimesion reduction visualization on observed, binarized and imputed Drop-seq data. We perform UMAP (in Seurat) on Drop-seq's observed, binarized and imputed data. We perform this analysis before and after using the five masking schemes (type 1 ZI-type 5 ZI) to introduce non-biological zeros.  Figure S8: UMAP dimesion reduction visualization on observed, binarized and imputed 10x Genomics data. We perform UMAP (in Seurat) on 10x Genomics' observed, binarized and imputed data. We perform this analysis before and after using the five masking schemes (type 1 ZI-type 5 ZI) to introduce non-biological zeros.  Figure S9: Evaluation of DE analysis on observed, binarized and imputed data. We evaluate the DE analysis on Smart-seq2, Drop-seq, and 10x Genomics data based on observed, binarized and imputed data. We perform this analysis before and after using the five masking schemes (type 1 ZI-type 5 ZI) to introduce non-biological zeros. We apply two-sample proportion test on binarzied data and MAST (in Seurat) on observed, binarized, and imputed data to perform DE analysis. We use precision (given the false discovery rate 5%) to evaluate the DE results.  Figure S10: Evaluation of DE analysis on observed, binarized and imputed data. We evaluate the DE analysis on Smart-seq2, Drop-seq, and 10x Genomics data based on observed, binarized and imputed data. We perform this analysis before and after using the five masking schemes (type 1 ZI-type 5 ZI) to introduce non-biological zeros. We apply two-sample proportion test on binarzied data and MAST (in Seurat) on observed, binarized, and imputed data to perform DE analysis. We use recall (given the false discovery rate 5%) to evaluate the DE results.  Figure S11: Evaluation of DE analysis on observed, binarized and imputed data. We evaluate the DE analysis on Smart-seq2, Drop-seq, and 10x Genomics data based on observed, binarized and imputed data. We perform this analysis before and after using the five masking schemes (type 1 ZI-type 5 ZI) to introduce non-biological zeros. We apply two-sample proportion test on binarzied data and MAST (in Seurat) on observed, binarized, and imputed data to perform DE analysis. We use F 1 score (given the false discovery rate 5%) to evaluate the DE results.