Fig. 1 | Genome Biology

Fig. 1

From: Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications

Zero inflation results in overestimated dispersion and jeopardizes power to discover differentially expressed genes. a–e Scatterplots of the estimated biological coefficient of variation (BCV, defined as the square root of the negative binomial dispersion parameter ϕ) against average log counts per million (CPM), computed using edgeR. a BCV plot for the real Buettner et al. [7] scRNA-seq dataset subsampled to n=10 cells. b BCV plot for the real Deng et al. [66] scRNA-seq dataset subsampled to n=10 cells. Both panels (a) and (b) show striped patterns in the BCV plot, which significantly distort the mean–variance relationship, as represented by the red curve. c BCV plot for a simulated bulk RNA-seq dataset (n=10), obtained from the Bottomly et al. [67] dataset using the simulation framework of Zhou et al. [57]. Dispersion estimates generally decrease smoothly as gene expression increases. d BCV plot for a simulated zero-inflated bulk RNA-seq dataset, obtained by randomly introducing 5% excess zero counts in the dataset from (c). Zero inflation leads to overestimated dispersion for the genes with excess zeros, resulting in striped patterns, as also observed for the real scRNA-seq data in panels (a) and (b). e BCV plot for the simulated zero-inflated bulk RNA-seq dataset from (d), where excess zeros are downweighted in dispersion estimation (i.e., weights of 0 for excess zeros and 1 otherwise). Downweighting recovers the original mean–variance trend. f True positive rate vs. false discovery proportion for the simulated zero-inflated dataset of (d). The performance of edgeR (red curve) deteriorates in a zero-inflated setting due to overestimation of the dispersion parameter. However, assigning the excess zeros a weight of zero in the dispersion estimation and model fitting results in a dramatic performance boost (orange curve). Hence, downweighting excess zero counts is the key to unlocking bulk RNA-seq tools for zero inflation.
BCV biological coefficient of variation, CPM counts per million, ZI zero inflated
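
The effect described in panels (c) to (e) can be illustrated for a single gene with a minimal Python sketch. It is not the edgeR procedure used for the figure: as a stand-in it assumes a simple weighted method-of-moments estimator based on the negative binomial relation var = μ + ϕμ², and the function name `moment_dispersion`, the simulation parameters (2000 observations, mean 50, ϕ = 0.1), and the random seed are all hypothetical choices made for illustration. As in panel (e), the simulation knows which zeros are excess zeros and gives them weight 0.

```python
import numpy as np

def moment_dispersion(counts, weights=None):
    """Weighted method-of-moments estimate of the NB dispersion phi.

    Uses var = mu + phi * mu^2. Illustrative stand-in only; this is
    not the estimator implemented in edgeR. Assumes 0/1 weights.
    """
    counts = np.asarray(counts, dtype=float)
    if weights is None:
        w = np.ones_like(counts)
    else:
        w = np.asarray(weights, dtype=float)
    mu = np.sum(w * counts) / np.sum(w)
    var = np.sum(w * (counts - mu) ** 2) / (np.sum(w) - 1.0)
    return max((var - mu) / mu ** 2, 0.0)

rng = np.random.default_rng(0)      # hypothetical seed
n_obs, mu_true, phi_true = 2000, 50.0, 0.1   # hypothetical parameters

# NumPy parameterizes the NB by (n, p); n = 1/phi and p = n / (n + mu)
# give mean mu and variance mu + phi * mu^2.
size = 1.0 / phi_true
counts = rng.negative_binomial(size, size / (size + mu_true), n_obs)

# Introduce 5% excess zeros at random, as in panel (d).
excess = rng.random(n_obs) < 0.05
zi_counts = np.where(excess, 0, counts)

# Weight 0 for excess zeros and 1 otherwise, as in panel (e).
weights = np.where(excess, 0.0, 1.0)

phi_clean = moment_dispersion(counts)
phi_zi = moment_dispersion(zi_counts)
phi_weighted = moment_dispersion(zi_counts, weights)

# BCV is the square root of the dispersion.
for label, phi in [("no zero inflation", phi_clean),
                   ("zero inflated, unweighted", phi_zi),
                   ("zero inflated, downweighted", phi_weighted)]:
    print(f"{label}: phi = {phi:.3f}, BCV = {np.sqrt(phi):.3f}")
```

Under these assumptions the unweighted estimate from the zero-inflated counts should come out noticeably above the simulated ϕ, while the downweighted estimate should fall back close to it, mirroring the inflated dispersion in panel (d) and its recovery in panel (e).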
