Datasets
We analyzed a total of 14 public scRNA-Seq datasets collected using 6 technologies; Drop-Seq, Seq-Well, MARS-seq 2.0, and 10x Chromium version v2, v3, and Next GEM. The datasets were generated from 5 sample types: mouse brain (dataset EVAL), mouse retina (datasets MRET and MRET2), a mix of mouse ES cells and embroynic fibroblasts (MARSSEQ), human lung tumor (dataset LC), and human PBMC (remaining datasets). The datasets, including metadata, are listed in Additional file 1: Table S2, with additional information in Additional file 1: Table S3. The datasets were processed using kallisto [23] and bustools [14], yielding counts and gene association for each unique molecule. Molecules mapped to multiple genes were discarded before analysis.
Pre-processing of sequencing files
The datasets were processed using kallisto [23] version 0.46.2 and a version of bustools [14] specifically developed for this study (the butterfly branch), yielding counts and gene association for each unique molecule. kallisto index files were created for mouse and human using cDNA files from Ensembl (v. 96 for mouse, v. 94 for human).
We developed the new commands collapse and umicorrect in a branch of the bustools code, and modified the commands text and fromtext, to enable production of bus files mapped per gene (BUG-files) for further processing in R. The umicorrect command was implemented as described previously [24], while collapse transforms the BUS records from being mapped to equivalence classes to being merged where appropriate and mapped to genes. Transcripts to genes files, which are required for the collapse command, were generated with the function transcript2gene from the BUSpaRse R package [25], version 0.99.25.
Processing of data in R
All reads mapping to more than one gene were discarded before further processing, as were cells with fewer than 200 UMIs (except for the LC dataset, where cells with fewer than 1000 UMIs were discarded, motivated by the large number of cells compared to the expected number from the authors). Statistical metrics were also calculated for each dataset (Additional file 1: Table S3).
Overview of the BUTTERFLY correction
The BUTTERFLY correction is a method that supports correction of gene abundance estimates in UMI-based single-cell RNA-Seq data using either prediction of unseen species or an algorithm we have named binomial downsampling (see below).
Prediction of unseen species is used for predicting the gene expression given more reads, while binomial downsampling can be used to more accurately remove the amplification bias between two datasets with different average numbers of reads per UMI, given that they have similar amplification patterns across genes. We have evaluated the correction method using three algorithms for prediction (see below), although only the zero-truncated negative binomial (ZTNB) algorithm is implemented in the kallisto bustools workflow. The other algorithms are accessible through the R code provided with this publication.
Correction for unseen molecules
To correct gene abundance estimates, the number of unseen molecules are predicted for each gene, assuming that the same gene behaves similarly across cells within a cell population, which if not explicitly differently specified in the text corresponds to all cells within a dataset. For each gene, all molecules across all cells within the population are pooled and used to calculate a copies per UMI (CU) histogram. The CU histogram is then in turn used as input to a prediction algorithm. Prediction was done using the Good-Toulmin estimator as well as the Daley-Smith (DS, based on rational functions approximation) and zero truncated negative binomial (ZTNB) included in the PreseqR package. We implemented the Good-Toulmin estimator in R and used the functions ds.rSAC and a modified version of ztnb.rSAC from PreseqR (where the modified version has larger error tolerances, which speeds up computation time considerably while producing very similar results). The gene expression was predicted per gene, pooling the UMIs from all cells in the dataset. Histograms over the number of copies per UMI (CU histograms) were constructed per gene and used as input to the prediction algorithms. The predicted number of UMIs per gene was then used to calculate the gene expression in counts per million (CPM). We used ZTNB for prediction except for Fig. 4C, Fig. 5F, and H, where the prediction is based on the Daley-Smith (DS) algorithm (MT = 2, described below).
The Good-Toulmin estimator
The Good-Toulmin estimator seeks to estimate the number of new molecules (U) discovered by multiplying the number of total reads with a factor t + 1:
$$ U=-{\sum}_{i=1}^{\infty }{\left(-t\right)}^ih(i) $$
h(i) here corresponds to the number of molecules with i copies, i.e., the value at i in the CU histogram. As can be seen in Additional file 1: Fig. S8–S21, the estimator performs very well for t ≤ 1, but is highly unstable for predictions outside that range. The Good-Toulmin estimator does not depend on any assumptions about the distribution of the data.
The ZTNB estimator
The ZTNB method is based on using an expectation-maximization (EM) algorithm for fitting a negative binomial curve to the histogram of number of copies per UMI, where the number of molecules with zero counts is unknown. It is then assumed that the size parameter of the negative binomial remains the same as the number of reads increase and that the mean is proportional to the number of reads. The number of molecules with at least one copy can then be estimated from the probability density function of the negative binomial.
Once the predicted gene expression is estimated, the UMI counts in the counts matrix are scaled by a factor mg, calculated for each gene g as
$$ {m}_g=\frac{T}{P}{c}_g, $$
where T is the total number of UMIs in the count matrix, P is the total number of predicted UMIs across all genes, and cg is the number of predicted UMIs for gene g.
The EM algorithm used in Preseq for fitting the negative binomial curve to the histogram is an iterative method with two steps in each iteration: the E step and the M step. The goal is to estimate the parameters mean (μ) and size(s), while also taking the unseen molecules into account. In the E step, the number of molecules with zero counts, z, can be estimated by using the probability density function (pdf(n,μ,s), where n is the copies per molecule) of the negative binomial at zero counts, using the current value of μ and s. First, the total number of molecules L is estimated:
$$ L=\frac{N}{1- pdf\left(0,\mu, s\right)} $$
where N is the total number of observed molecules. Then, the number of molecules with zero counts is calculated:
$$ z=L\ pdf\left(0,\mu, s\right) $$
The CU histogram is now complemented with the zero value, which was previously missing.
A new value for μ can then be estimated, according to
$$ \mu ={\sum}_{i=1}^{\infty }i\ h(i)/L $$
where h(i) is the number of molecules in the CU histogram at i copies. In the M step, a new value of the size parameter is then estimated using an L-BFGS-B algorithm to maximize the log likelihood. The algorithm requires a definition of the log likelihood function (ll(s)) and the first derivative of a function that has its maximum at the same point as the log likelihood (pd(s)):
$$ ll(s)={\sum}_{i=0}^{\infty }h(i)\ \mathit{\ln}\left( pdf\left(i,\mu, s\right)\right) $$
$$ pd(s)=\frac{\sum_{i=0}^{\infty } digamma\left(i+s\right)\ h(i)}{L}- digamma(s)+\mathit{\log}(s)-\mathit{\log}\left(s+\mu \right) $$
The starting values for the iteration are in the Preseq implementation selected as μ = 0.5 and s = 1.
The ZTNB method assumes that the data follows a negative binomial distribution. As indirectly shown in Fig. 5D, this assumption is violated for some genes since the gene is differently amplified in different cell populations, yielding a sum of several count distributions and thereby a less accurate prediction for such genes.
The Preseq Daley-Smith estimator
The Preseq DS estimator seeks to estimate the number of additional molecules (t) found when increasing the number of reads by a factor t + 1. The quantity is estimated as:
$$ \varDelta (t)={\sum}_{j=1}^{\infty }{\left(-1\right)}^{j+1}{\left(t-1\right)}^i{n}_j $$
(t) is however not guaranteed to converge for j > 2. To stabilize the function, an approximation is developed using rational function approximation (RFA), in which a ratio of two polynomials are used to describe the function. The Preseq DS estimator makes no assumptions about the distribution of copies per UMI within a gene. For more details, see [11].
The Preseq DS estimator has a parameter MT that can be used to truncate the CU histograms before prediction, which can stabilize the calculation. Unless otherwise stated, we have used DS = 2 throughout this work.
The FSCM metric
The fraction of single-copy molecules (FSCM) is a metric that to a certain extent describes how many unseen molecules that can be expected in a population of molecules. FSCM is calculated as
$$ FSCM=\frac{h(1)}{\sum_{i=1}^{\infty }h(i)} $$
where h(i) represents the number of molecules with i copies in the CU histogram.
Retrieval of GC content and transcript length
Transcript lengths were retrieved using the GenomicFeatures [26] R package (version 1.36.4) in combination with the biomaRt [27] package (version 2.40.0). We used the biomart ENSEMBL_MART_ENSEMBL (version 103) and the dataset mmusculus_gene_ensembl (version GRCm38.p6). We calculated GC content using the R package BSgenome.Mmusculus.UCSC.mm10 [28] (version 1.4.0), in combination with GenomicFeatures and Biostrings [29].
Correction using pooled data from other datasets
The prediction is dependent on having enough data per gene to build CU histograms, as sampling effects on the histogram leads to unstable prediction. A way to circumvent this issue is to use CU histogram data from other similar datasets in the prediction. The Preseq DS algorithm with CU histograms truncated at 2 copies per molecule is convenient in this regard, since we only need to estimate the fractions of molecules that have one (FSCM) and two (FDCM) copies. These metrics were measured per gene for the datasets PBMC_V3, PBMC_V3_2, PBMC_NG, PBMC_NG_2, PBMC_V2 and EVALPBMC (pool source datasets) and were used to predict the dataset PBMC_V3_3.
Since each dataset has a different degree of saturation (i.e., average counts per UMI), there is a need to normalize FSCM and FDCM between the dataset being predicted and the pool source datasets (see for example Additional file 1: Fig S5 C, where the dataset PBMC_V2 on average has a lower FSCM than LC). We utilized quantile (all quantiles) normalization [30] to adjust the FSCM and FDCM of the pool source datasets to be more similar that of the dataset being predicted.
The pooled FSCM and FDCM metrics for a gene are then calculated as a weighted mean of all datasets, including the dataset being predicted, where the weight is the number of UMIs for the gene per dataset. A third metric, the fraction of molecules that have more than two copies (FMCM), is calculated from FSCM and FDCM, as FMCM = 1 - FSCM - FDCM. The histogram used for prediction is simplified to these three bins and constructed using those metrics scaled with the number of UMIs of the gene in the dataset to be predicted. Prediction is then carried out as described above.
For generating Fig. 4C, where prediction is performed using downsampled data, the pool source datasets were downsampled as well to better match the dataset being predicted regarding degree of saturation. Since much data is lost during downsampling, each dataset was repeatedly downsampled 10 times and added to the data pool, providing a list of in total 60 datasets.
Binomial downsampling
Two datasets with similar gene amplification bias but with different sequencing depth can be batch corrected by predicting one dataset to a similar sequencing depth as the other. It is a less challenging problem to predict the gene expression of the more saturated dataset given less reads than try to predict the gene expression of the less saturated dataset given more reads. A simple method is downsampling. However, downsampling includes random sampling, which introduces a sampling noise and thus gives less accurate results. An alternative method is a procedure that we have termed binomial downsampling.
Binomial downsampling operates on the CU histogram h(i) (i > 0) of a gene in a pool of cells (here always the full dataset), where i represents the observed copies per UMI for the molecules. We seek the expression of the gene at the fraction x (0 < x < 1) of the original reads. The probability p(i,j) that a molecule with i copies has j copies after a regular downsampling can be calculated as
$$ p\left(i,j\right)= dbinom\left(j,i,x\right) $$
where dbinom(j, i, x) is the probability density function for the binomial distribution at j successes, given i trials and the probability x of success in each trial. The downsampled histogram h’(j) can thus be calculated as
$$ h'(j)={\sum}_{i=1}^{\infty }p\left(i,j\right)\ h(i) $$
The value of interest is h’(0), since this is the number of molecules lost in the downsampling. The remaining number of molecules n for the gene is thus the sum of the non-zero part of the histogram:
$$ n={\sum}_{i=1}^{\infty }h^{\prime }(i) $$
To transform this into gene expression, the binomial downsampling is applied to all genes, followed by a CPM normalization of the data. For a gene g with a former gene expression of eb,g and a downsampled gene expression of ea,g (both in CPM), a scale factor fg can be calculated as
$$ {f}_g=\frac{e_{a,g}}{e_{b,g}} $$
To finally correct the count matrix, each element of the gene g in the single-cell count matrix is then multiplied by the scale factor fg.
To determine the downsampling grade x, we selected the x which yielded the highest correlation (CCC) between the batch corrected datasets.
Concordance correlation coefficient and MSE
To compare the similarity in gene expression of two samples, the Pearson correlation is not a suitable metric since it measures linear correlation, and not similarity. We instead used Lin’s concordance correlation coefficient (CCC), which describes the expected perpendicular distance from a 45° line passing through the origin [31]. We used the CCC function in the R package DescTools (version 0.99.36) to calculate the metric, using default parameters. As a complement, we also calculated the mean squared error (MSE) on the same data.
The similarity and MSE were calculated on log-transformed data, where the transformed gene expression li for gene i is calculated as li = log2(ei + 1), where ei is the gene expression of the gene in counts per million (CPM).
Simulated data
Generation of simulated data is described in detail in Additional file 1: Supplementary Note 2. The AUC calculations were conducted using the R package pROC [32], version 1.17.0.1
Single-cell processing
Single-cell processing for Fig. 5 was performed using Seurat [17] v. 3.1.1, following a standard workflow. For Fig. 5A, B cells were filtered keeping cells with more than 200 UMIs and a mitochondrial content of less than 15%. Ten PCs were used in the clustering and UMAP analysis. For Fig. 5C–H and Additional file 1: Fig. S24, cells were filtered keeping cells with more than 200 UMIs and a mitochondrial content of less than 10%. 10 PCs were used in the clustering and UMAP analysis.
Figure details
In Fig. 4C, the correction is based on correcting the UMIs for all genes individually. The prediction range is the same for all genes, scaling up the counts to match the total counts in the full dataset. The predicted UMIs are then CPM normalized.
To avoid the uncertainty in the FSCM calculations that arise from having too few molecules, only genes with at least 200 molecules in both datasets were included in Fig. 2B and Additional file 1: Fig. S5. Similarly, only genes with at least 30 molecules were included in Additional file 1: Fig. S1–S3. The reason for the lower limit in this case is that a lower limit includes more genes, reducing differences between datasets with different total number of reads.
To avoid the uncertainty in prediction from lowly expressed genes, genes with a gene expression lower than 100 CPM were removed before calculating CCC in Fig. 4D, E, leaving in total 1018 genes. CPM was then recalculated based on these genes only to avoid the influence of lowly expressed genes, for which prediction is less accurate. The value of 100 was chosen since the prediction error at that number of molecules is roughly half of that of uncorrected data (Additional file 1: Fig. S7), which in turn is motivated by that the CCC value is affected by prediction errors from 2 predictions.
In Fig. 4E, the PBMC_NG_2 dataset is downsampled to 49 percent of the reads, which is the downsampling grade that maximizes the correlation between the datasets.
In Fig. 5G, H, each gene is scaled to have a total of 1000 counts across all clusters. Only clusters with more than 300 cells are included (in total 7 clusters), and only points for which there are at least 20 UMIs are shown.
In all figures where the gene expression of a dataset is used, the gene expression is calculated as the mean gene expression across all cells.
Implementation
The ZTNB prediction was implemented in bustools with the addition of the ‘predict’ command using the same algorithm as PreSeqR [12, 13], and utilizing the c++ libraries Eigen [33], LBFGSpp [34], and CppOptimizationLibrary [35]. The count command in bustools has been extended with the option “--hist,” which generates CU histograms that serve as input to predict, together with the count matrix. In addition, UMI correction was implemented as the “umicorrect” command, utilizing the method of [24].