UMI-count modeling and differential expression analysis for single-cell RNA sequencing

Read counting and unique molecular identifier (UMI) counting are the principal gene expression quantification schemes used in single-cell RNA-sequencing (scRNA-seq) analysis. By using multiple scRNA-seq datasets, we reveal distinct distribution differences between these schemes and conclude that the negative binomial model is a good approximation for UMI counts, even in heterogeneous populations. We further propose a novel differential expression analysis algorithm based on a negative binomial model with independent dispersions in each group (NBID). Our results show that this properly controls the FDR and achieves better power for UMI counts when compared to other recently developed packages for scRNA-seq analysis. Electronic supplementary material The online version of this article (10.1186/s13059-018-1438-9) contains supplementary material, which is available to authorized users.


Dataset from
We kept single cells with quality OK and removed ERCC genes.

Application of other DE Methods
Monocle2: We followed the analysis code in the supplementary file of the Monocle2 paper https://www.nature.com/articles/nmeth.4150#s4. For the UMI count, it requires the full count matrix. It has its own function to estimate the size factors for normalization. SCDE: We followed the link http://hms-dbmi.github.io/scde/diffexp.html. SCDE requires the count matrix as the input.
Seurat: We follow the link http://satijalab.org/seurat/pbmc3k_tutorial_1_4.html. The total UMI is regressed out except for the Poisson or negative binomial model. For Poisson or negative binomial model, the total UMI is incorporated into the DE model.

A statistical model linking the UMI count and read count of single cells
Our proposed model illustrates the relation and difference between the UMI count and read count (Additional file 1: Fig. S12), being motivated by the work of Qiu et al. [8]. Because the efficiency of the reverse transcription from the original mRNA to cDNA in scRNA-seq is low, e.g., less than ~10% [9,10], the cDNA count or UMI count for each cell can be modelled as a sampling process from the pool of mRNA. When the mRNA proportions are the same among cells (i.e., there is no biological variation) and the only technique variation is in the sampling depth, e.g., the total UMI per cell, the gene count follows a multinomial distribution or approximately Poisson distribution given the small gene proportions. This is the simplest distribution with which to model the UMI count when considering different sampling depths.
When there are some moderate heterogeneity, i.e., when the mRNA proportions vary across cells, the negative binomial (NB) model is often used. This is the case if we assume that the mRNA proportion of a given gene in each cell follows a gamma distribution. For the read count, we need to consider the additional amplification process, which is a multiplication of the captured cDNA. In the ideal case, the multiplication factor is a constant across all cells and genes. In practice, however, this multiplication factor is likely to differ among cells and genes, thereby introducing more complexity. The final sequencing read count can be modeled as a multinomial sampling process depending on the final total sequence reads and the proportion of the amplified cDNA.
We can formalize the above model into statistical formulae. We do not consider spike-in RNAs here. Let be the mRNA count, with the subscripts indicating cell and gene , then the total RNA count for the cell is = ∑ . Some mRNAs may be degraded or lost during cell separation or lysis. The available mRNA before reverse transcription can be modeled as follows: where is cell specific and unknown. Because reverse transcription from mRNA to cDNA is a low-efficiency process in single cells, it can be modeled as a sampling process using a multinomial distribution: where is the cDNA count, and = ∑ = , represents the reverse transcription efficiency for cell . Here, we assume that the efficiency is only cell specific but is the same for all genes. Because the cDNA count per gene is usually small and the total cDNA count in a cell is large, can be approximated using a Poisson distribution with mean , where = ∑ , the proportion of each gene transcript. Different cells will exhibit some degree of variation in gene proportions even for homogeneous cell populations. If we assume that follows a gamma distribution parameterized as ( = 1 , = 1 ), where and are the mean proportion and dispersion of gene , respectively, then the cDNA count follows a negative binomial distribution ( , ). Here, we assume that the UMI count catches all available cDNA with sufficient sequencing depth.
For the amplification process, because it is likely to be specific for both cells and genes, we use a general model for the amplified cDNA count as follows: where is the amplification parameter for cell and gene . This could be simplified to a constant or could be only cell or gene specific with more knowledge of the amplification process.
The last step is sequencing. The read count can be modeled as a multinomial sampling process given the total reads: where is the total reads for cell . Because the reverse-transcribed cDNA has many 0s, resulting in many 0s in even after amplification and many zero reads in , modeling will be more complex than modeling directly. For example, when has three distinct values 0, 1, or 2, for a pool of cells, this is likely to result in three clusters for the read counts, corresponding to being 0, 1, or 2, respectively. In this case, even a ZINB model might not model the read count well.
Our model differs from that of Qiu [8] in the following respects. The cDNA count before amplification is modeled as a sampling process, and we explicitly separate the amplification process and the sequencing process. The first difference explains the massive 0 counts in the scRNA-seq data matrix. The second illustrates the complication involved in the amplification process, as well as the advantage of using UMI to avoid this complication.
In the above model we assume that the UMI count captures all the cDNAs. This is probably the case when sufficient sequencing depth is used. Otherwise the UMI count is again a sampling of the cDNA count, which can be combined with the sampling process of reverse transcription with decreased capture efficiency and the subsequent modeling will be the same.                 LGALS1 was detected as a DE gene by NBID, MAST and ROTS. IFNG is detected as a DE gene by NBID and MAST. CXCR5 is detected as a DE gene by NBID only, TOX is not detected as a DE gene by any of the above three methods. The y-axis is the TPM / 100 plus 1 and then converted to log10 scale. The 2nd and 3rd column show DE genes detected only by NBID in 8 different FDR percentile ranges and TPM > 50 in at least one population.