BUTTERFLY: addressing the pooled amplification paradox with unique molecular identifiers in single-cell RNA-seq

The incorporation of unique molecular identifiers (UMIs) in single-cell RNA-seq assays makes possible the identification of duplicated molecules, thereby facilitating the counting of distinct molecules from sequenced reads. However, we show that the naïve removal of duplicates can lead to a bias due to a “pooled amplification paradox,” and we propose an improved quantification method based on unseen species modeling. Our correction called BUTTERFLY uses a zero truncated negative binomial estimator implemented in the kallisto bustools workflow. We demonstrate its efficacy across cell types and genes and show that in some cases it can invert the relative abundance of genes. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-021-02386-z.

Correction evaluation for the PBMC_V3_2 dataset. The data was downsampled to 1/20 for A and to 1/10 for B-F, and the corrected expression using different prediction methods was compared to ground truth (predicting to 10

Supplementary Notes Supplementary Note 1 -Evaluation of Prediction Algorithms
We evaluated 2 different algorithms for prediction -the Preseq DS method based on rational function approximation (RFA), and the zero truncated negative binomial (ZTNB), where the a negative binomial distribution is fitted to the CU histogram and then used for prediction. The Preseq DS method is evaluated with 2 different parameterizations (MT = 2 and MT = 20), corresponding to a different number of copies per UMI at which the CU histogram is truncated. In addition, we evaluated a selection method recommended by Deng et al (12), referred to as "best practice", in which Preseq DS (MT = 2) is chosen for genes where the CV of the counts per UMI is greater than one and ZTNB otherwise.
We downsampled 14 datasets to one tenth of the original amount of reads, and predicted up to the original gene expression. Fig. S7 A shows the log 2 fold change (LFC) between corrected expression and ground truth for genes as a function of the number of UMIs available for the gene in the downsampled dataset. The LFC presented for each method was calculated as a LOESS fit of all genes from all datasets included in this study. Lin. scaled corresponds to the case where all genes are scaled equally (i.e. no advanced prediction is used). Fig. S7 B shows an analogous analysis, with the difference that the data has been CPM-normalized before the LFC is calculated. It is evident that CPM normalization improves the correction performance.
ZTNB and best practice are almost identical, suggesting that the ZTNB is chosen for most genes for the best practice case algorithm, and we see no benefit using the best practice method. In general, ZTNB has better performance for highly expressed genes, while Preseq DS is marginally better for the middle range genes, but the overall performance is similar. Evaluation for each of the 14 datasets individually is available in Fig. S8 --S21. In general, the results are similar across datasets. The individual evaluation of datasets show that Preseq DS is more stable when CU histograms are truncated at 2 as compared to 20; the latter parametrization sometimes gives rise to outlier genes with larger prediction errors (e.g. Fig S19 D). Interestingly, the Preseq DS algorithm (MT = 20) performs best for predicting the total number of molecules in most cases (e.g. Fig. S7 A). We speculate that this method can take advantage of the high number of molecules, and that a negative binomial distribution may not be sufficient to describe the CU histogram of differently amplified pooled genes.

Supplementary Note 2 -Evaluation of Synthetic Data
To investigate the theoretical amplification bias caused by incomplete sequencing of cDNA libraries, we generated simulated data where the number of copies of each molecule was drawn from a negative binomial distribution, with the parameters mean ( ) and size. The size parameter was set to 1 for all simulations, which is a reasonable number compared to observed real values for well amplified genes (where the fit for this parameter is reliable). The mean variable is varied throughout the different simulations. To simulate realistic values for the mean variable in some simulations, we estimated from the observed FSCM of the PBMC_V3_3 dataset. We first calculated a vector f of theoretical values for FSCM from a vector of values (m), as ) where pdf(x, ) is the probability density function for x counts for the negative binomial , distribution. The value of for each gene in the dataset was then estimated by calculating FSCM for the gene and mapping that value to a value using linear interpolation between the vectors f and m.
Count matrices and CU histograms per gene were generated from simulated molecules, where molecules with zero copies were discarded. The simulation was based on 12,000 molecules per cell, which resulted in approximately 7,400 detected molecules per cell on average when using the negative binomial mean per gene as calculated from the PBMC_V3_3 dataset. For all simulations pairs of genes with different properties were generated, where each pair was compared in different ways. In each simulation 5,000 cells and 5,000 gene pairs were used. We then estimated results for two cases: uncorrected and corrected data.
In Fig. S22 A, 11 simulations were run, where of the first gene in each pair was set to 2 3 , and of the second gene was varied between 2 -7 and 2 3 . The average log 2 fold change for the pairs was then calculated for each simulation. All genes were expressed at 1000 CPM. Correction was done using ZTNB.
The data for Fig. S22 B was generated using 6 simulations, where the values of both genes in each pair were randomly selected from the PBMC_V3_3 dataset. 50% of the gene pairs were differentially expressed with a fixed log 2 fold change (LFC) within each simulation, varying from 0.5 to 3. To investigate to what extent the bias affects the ability to identify gene pairs that were differentially expressed, we calculated the LFC of all gene pairs after the bias (and correction) was applied. We then based on the LFC and ground truth (i.e. if gene pairs are differentially expressed or not) calculated the area under the curve (AUC) for the Receiver Operating Characteristic (ROC) curve, which describes the ability to correctly classify gene pairs as differentially expressed. All genes were expressed at 1000 CPM. Correction was done using ZTNB.
For Fig. S22 C, we generated data using 11 simulations. The of the first gene in each pair was randomly selected from the PBMC_V3_3 dataset, while the of the second was set to the that of the first, multiplied with a factor a, which is constant through each simulation. The