 Research
 Open Access
 Published:
A systematic evaluation of singlecell RNAsequencing imputation methods
Genome Biology volume 21, Article number: 218 (2020)
Abstract
Background
The rapid development of singlecell RNAsequencing (scRNAseq) technologies has led to the emergence of many methods for removing systematic technical noises, including imputation methods, which aim to address the increased sparsity observed in singlecell data. Although many imputation methods have been developed, there is no consensus on how methods compare to each other.
Results
Here, we perform a systematic evaluation of 18 scRNAseq imputation methods to assess their accuracy and usability. We benchmark these methods in terms of the similarity between imputed cell profiles and bulk samples and whether these methods recover relevant biological signals or introduce spurious noise in downstream differential expression, unsupervised clustering, and pseudotemporal trajectory analyses, as well as their computational run time, memory usage, and scalability. Methods are evaluated using data from both cell lines and tissues and from both plate and dropletbased singlecell platforms.
Conclusions
We found that the majority of scRNAseq imputation methods outperformed no imputation in recovering gene expression observed in bulk RNAseq. However, the majority of the methods did not improve performance in downstream analyses compared to no imputation, in particular for clustering and trajectory analysis, and thus should be used with caution. In addition, we found substantial variability in the performance of the methods within each evaluation aspect. Overall, MAGIC, kNNsmoothing, and SAVER were found to outperform the other methods most consistently.
Background
Recent advances in highthroughput technologies have been developed to measure gene expression in individual cells [1–5]. In contrast to bulk RNAsequencing (RNAseq), a distinctive feature of data measured using singlecell RNAsequencing (scRNAseq) is the increased sparsity, or fraction of observed “zeros,” where a zero refers to no unique molecular identifiers (UMIs) or reads mapping to a given gene in a cell [6–9]. These observed zeros can be due to biological (relevant or nuisance) fluctuations in the measured trait or technical limitations related to challenges in quantifying small numbers of molecules. Examples of the latter include mRNA degradation during cell lysis or variation by chance of sampling lowly expressed transcripts [10]. The word dropout [6–8] has been previously used to describe both biological and technical observed zeros, but the problem with using this catchall term is it does not distinguish between the types of sparsity [10].
To address the increased sparsity observed in scRNAseq data, recent work has led to the development of “imputation” methods, in a similar spirit to imputing genotype data for genotypes that are missing or not observed. However, one major difference is that in scRNAseq standard transcriptome reference maps such as the Human Cell Atlas [11] or the Tabula Muris Consortium [12] are not yet widely available for all species, tissue types, genders, and so on. Therefore, the majority of imputation methods developed to date do not rely on an external reference map.
These imputation methods can be categorized into three broad approaches [10]. The first group are imputation methods that directly model the sparsity using probabilistic models. These methods may or may not distinguish between biological and technical zeros, but if they do, they typically impute gene expression values for only the latter. A second approach adjusts (usually) all values (zero and nonzero) by smoothing or diffusing the gene expression values in cells with a similar expression profiles identified, for example, using neighbors in graph. The third approach first identifies a latent space representation of the cells, either through lowrank matrixbased methods (capturing linear relationships) or deeplearning methods (capturing nonlinear relationships), and then reconstructs the observed expression matrix from the lowrank or estimated latent spaces, which will no longer be sparse. For the deeplearning approaches, such as variational autoencoders, both the estimated latent spaces and the “imputed” data (i.e., reconstructed expression matrix) can be used for downstream analyses, but otherwise only the imputed data is typically provided for downstream analyses.
Due to these recent and concurrent development, evaluations and comparisons between scRNAseq imputation methods have been limited or restricted to a subset of imputation methods and downstream applications [13–16]. Furthermore, imputation methods can require varying types of raw or processed data as input, may rely on different methodological assumptions, and may be appropriate for only certain scRNAseq experimental protocols, such as UMIbased [1–3] or fulllength [4, 17] transcript methods. Given these differences, the performance of these methods has been shown to vary in the evaluations todate. For example, one study found imputation methods can introduce false signals when identifying differentially expressed genes with modelbased methods outperforming smoothingbased methods, in particular for genes with a small effect size (log2 fold change) [14]. Another study found imputation methods can introduce spurious correlations between imputed expression and total UMI counts [15]. Alternatively, others have shown spurious structural patterns in low dimensional representations of imputed data [14, 18], which we also find in data where we expect no structural patterns in the data, but patterns associated with library size emerge in the imputed data (Fig. 1a, Additional file 1: Figure S1). In contrast, others have found a subset of imputation methods to be helpful to estimate library size factors for normalization of sparse scRNAseq data [16]. Therefore, the answer to the question of which methods can, let alone should, be used for a particular analysis is often unclear.
To address this gap, we performed a systematic benchmark comparison and evaluation of 18 stateoftheart scRNAseq imputation methods (Fig. 1b). Specifically, we evaluated (1) modelbased imputation methods (bayNorm [19], SAVER [20], SAVERX [21], scImpute [22], scRecover [23], VIPER [24]), (2) smoothbased imputation methods (DrImpute [25], MAGIC [26], kNNsmoothing [27]), and (3) data reconstruction methods either using deeplearning methods (AutoImpute [28], DCA [29], DeepImpute [30], SAUCIE [31], scScope [32], scVI [33]) or lowrank matrixbased methods (ALRA [34], mcImpute [35], PBLR [36]). Compared to existing benchmark evaluations of imputation methods, our benchmark is more comprehensive in terms of (i) the number of imputation methods considered (18 imputation methods compared to 3–8 in previous benchmarks [13, 14, 18]), (ii) the types of tissues considered (both cell lines and tissues), (iii) the number of downstream analyses considered (four applications compared to just one, for example, differential expression [14]), and (iv) different types of experimental protocols considered (both droplet and platebased protocols as opposed to just platebased [13]), all of which is described in Additional file 2: Table S1. Throughout, we use teletype (or monospace) font when referring to specific software packages and italicized font when referring to datasets. While many of the imputation methods used raw scRNAseq UMI or read counts as input, a subset of these methods required normalized (or logtransformed normalized) counts. In the latter case, we used the scRNAseq pooling normalization method [37] implemented in the scran [38] R/Bioconductor [39, 40] package, which has been previously shown to outperform other scRNAseq normalization methods in fulllength and UMIbased methods [16, 18]. We also included a baseline comparison of “no imputation,” which is the raw scRNAseq counts that have been adjusted for only library size with scran [37] normalization. While there are more methods available that could be used for imputation, we only included methods in our benchmark that (i) were originally designed (or specified by the authors) to be used as an imputation method, (ii) included software for users to download and run locally, and (iii) did not need external pieces of information (e.g., a network or an external reference map) and used scRNAseq counts (raw or normalized) as input.
In this paper, we first evaluate the performance of the imputation method themselves on their ability to recover true expression values by comparing the similarity between imputed cell profiles and bulk samples in a homogeneous population of cells. Then, we investigate the performance of the imputation methods in downstream analyses including differential expression analysis, unsupervised clustering, and trajectory analysis. In addition to simulated data, we used two types of real singlecell data: cell lines and tissues measured across experimental platforms, including fulllength and UMI methods using platebased and dropletbased protocols. We summarize our results and provide a key set of recommendations for users and investigators to navigate the current space of scRNAseq imputation methods.
Results
Similarity between bulk RNAseq and imputed scRNAseq data
To evaluate an imputation method’s ability to recover biological expression observed in bulk RNAseq data, we assessed the similarity between bulk and imputed scRNAseq data using cell lines (Fig. 2). Here, we focused on cell lines since they are less heterogeneous than tissues and have welldefined bulk expression profiles. The imputed values from the scRNAseq data were evaluated in two settings.
In the first evaluation, we directly compared imputed scRNAseq profiles from cell lines to a bulk RNAseq profile from the same cell lines (Fig. 2a–d). The test data include 10x Genomics UMIbased scRNAseq data for five CellBench [18] cell lines (sc_10x_5cl) and Fluidigm C1 platebased scRNAseq read count data for five ENCODE cell lines (ENCODE_fluidigm_5cl). Using the rankbased Spearman correlation coefficient (SCC) [41] between the imputed scRNAseq profile and bulk profile, the majority of imputation methods (16 out of 18) improved the correlation compared to not imputing. Methods such as SAVER and SAVERX (without pretraining) performed well with 10x Genomics UMI count data (ranked 2 and 3 out of 18), but their performance gain was not as pronounced with read count data from the platebased Fluidigm platform (ranked 12 and 13 out of 18) (Fig. 2d). We note that this difference in performance is expected as SAVER and SAVERX have assumed a negative binomial distribution for UMI data [20, 21]. UMI count data have been shown to follow a negative binomial distribution as opposed to read count data which have been shown to follow a zeroinflated negative binomial distribution [42–45]. Within each platform, we found that the methods scVI, DCA, and MAGIC performed better compared to the other methods, and this was consistent in both the Fluidigm and 10x platforms (Fig. 2d). When comparing across platforms, PBLR and AutoImpute showed significant crossplatform performance differences (twosample ttests at a false discovery rate (FDR) < 0.05 and relative performance change \( \overline {SCC}_{10x}\overline {SCC}_{fluidigm} / max(\overline {SCC}_{10x},\overline {SCC}_{fluidigm}) > 25\%\)). For both methods, the correlation between the imputed singlecell profile and bulk profile was higher in the Fluidigm platform than in the 10x platform (Fig. 2d).
In the second evaluation, we assessed an imputation method’s ability to preserve the difference on the log scale between two cell type profiles (i.e., two cell lines) by comparing the difference in two singlecell cell type profiles to the difference in two bulk cell type profiles (Fig. 2e–h). Compared to Fig. 2a–d, the majority, though a smaller set, of imputation methods (13 out of 18) preserved the cell type difference better than no imputation. The imputation methods MAGIC, DCA, and scVI resulted in the highest correlation using both UMI and nonUMI platebased protocols (Fig. 2h). Similar to Fig. 2d, SAVER and SAVERX (without pretraining) resulted in the much higher correlation using UMI count data than using nonUMI platebased data. When comparing across platforms, 9 out of 18 imputation methods showed significant crossplatform performance differences (twosample ttests FDR < 0.05, relative performance change > 25%) (Fig. 2h). Among these methods, AutoImpute and SAUCIE showed higher correlation between the imputed expression difference and the bulk expression difference in the Fluidigm platform, while the other methods (MAGIC, SAVERX, SAVER, kNNsmoothing, ALRA, scRecover, scScope) showed higher correlation in the 10x platform.
Finally, the performance of some imputation methods were found to be affected by the number of cells in the dataset. For data generated using the same platform, we observed that a smaller cell number is associated with a smaller imputed valuebulk correlation for these methods (Fig. 2d, h: green and brown dots on the right of the heatmaps; and Additional file 3: Table S2). For example, for Fluidigm datasets in Fig. 2d, IMR90 with 23 cells showed smaller correlation compared to the other platebased datasets with more cells. For methods such as scVI, DCA, MAGIC, scImpute, SAUCIE, PBLR, and DeepImpute, the Spearman correlation between the cell number and performance (imputationbulk SCC) across the Fluidigm datasets was higher than 0.6 (Fig. 2d).
Impact of scRNAseq imputation on identifying differentially expressed genes
Next, we evaluated the impact of imputation on the downstream analysis of identifying differentially expressed genes (DEGs). We intentionally designed our evaluation to primarily rely on empirical analyses of real data in order to preserve genegene correlations. In these empirical analyses, the ground truth was not completely known. Thus, our evaluation could not explicitly calculate sensitivity and specificity as previous studies [14, 15]. However, we preferred empirical evaluation over simulation. This is because some imputation methods model the expression levels for one gene based on the expression levels of other genes, such as with SAVER and SAVERX, and simulation or spikein studies in which ground truth is known but the genegene correlation is disrupted would unfairly disfavor such methods. Also, modeling the complex genegene correlation in real data via simulation and spikein studies is difficult. We considered two methods for differential expression (DE) analysis (namely MAST [46] and Wilcoxon ranksum test [47] abbreviated as Wilcoxon), because we wanted to assess whether the performance difference among the imputation methods in this section was consistent regardless of the choice of DE method or whether it was due to the choice of the DE method. While MAST uses a parametric model developed specifically for singlecell data, the Wilcoxon test is a nonparametric test commonly used for both bulk and singlecell data.
In the first analysis, we performed a DE enrichment analysis (Fig. 3a). We treated genes identified as differentially expressed in bulk RNAseq data as a “gold standard” similar to previous studies [48]. Then, we calculated the overlap of DEGs between the bulk data and DEGs identified from scRNAseq data between the same two cell types using MAST and Wilcoxon (Fig. 3b–d, Additional file 1: Figure S2ag). Note that the analyses presented in Fig. 3b–d are different from those presented in Fig. 2e–h. Figure 2e–h only consider the logfold change (LFC) between two cell types. In contrast, in Fig. 3b–d we also estimate the uncertainty of LFC (due to celltocell variability) and take it into account in order to rank genes, similar to what one would do in hypothesis testing.
Using UMIbased 10x Genomics scRNAseq data from five cell lines (sc_10x_5cl), we found that 12 and 10 out of the 18 methods outperformed no imputation using MAST and Wilcoxon, respectively (Additional file 1: Figure S2b, c). Among them, kNNsmoothing, SAVER, and SAVERX (without pretraining) had the highest overlap of singlecell DEGs with bulk DEGs using MAST, but SAVER, kNNsmoothing, and scVI performed the best when using Wilcoxon (Additional file 1: Figure S2b, c). While MAGIC performed well when calculating logfold changes (Fig. 2h), we found that MAGIC’s performance dropped when taking into account the estimates of uncertainty (Additional file 1: Figure S2B,C). Further investigation shows that the estimated genespecific standard deviations from the MAGIC imputed values were much smaller compared to the estimated standard deviations from other imputation methods (Additional file 1: Figure S3). In turn, the estimated standard errors for the logfold changes using MAGIC were small, leading to a wide teststatistic distribution and a p value distribution skewed toward small p values (Additional file 1: Figure S4). This suggests that MAGIC could have inaccurately estimated cell variability, which reduced its gene ranking performance.
Using the platebased read count data from five ENCODE cell lines (ENCODE_fluidigm_5cl), scVI performed best using either MAST or Wilcoxon (Additional file 1: Figure S2de). However, only 7 and 1 out of the 18 methods increased the overlap of DEGs compared to no imputation when using MAST and Wilcoxon tests, respectively.
We also compared methods using the UMIbased scRNAseq bone marrow tissue data (HCA_10x_tissue) in the 10x Genomics platform. Because cells in this dataset were unsorted, we first computationally labeled cell types using bulk RNAseq data (described in “Methods” section). Here, we found that mcImpute and kNNsmoothing were in the top three either using MAST or Wilcoxon (Additional file 1: Figure S2fg). Similar to using the ENCODE_fluidigm_5cl dataset, we found a small number of methods (6 and 3 out of 18 methods using MAST and Wilcoxon, respectively) that had a higher overlap between singlecell and bulk DEGs compared to no imputation.
To summarize this DE enrichment analysis across the three datasets (sc_10x_5cl, ENCODE_fluidigm_5cl, HCA_10x_tissue) and two DE methods (MAST and Wilcoxon), broadly we found kNNsmoothing (a smoothingbased method) and scVI (a deeplearning method) had the highest overlap with bulk DEGs (Fig. 3b–d), and they both most consistently outperformed no imputation (in 5 out of 6 test settings in Additional file 1: Figure S2bg). The modelbased methods SAVER and SAVERX were among the top performers with UMIbased datasets, and they outperformed no imputation in 3 out of the 4 UMIbased test settings (Additional file 1: Figure S2b, c, f, g).
Next, we performed a null DE analysis using cells from a homogeneous population where we expect no DEGs after correction for multiple testing. We selected three cell types, namely the A549 cell line from the sc_10x_5cl dataset, GM12878 from ENCODE_fluidigm_5cl, and the monocytes from HCA_10x_tissue. For each cell type, we randomly split cells into two groups (Fig. 3e) with varying sizes ranging from N = 10 to 500 cells per group, imputed the expression values and identified DEGs (Fig. 3f–h, Additional file 1: Figure S2hn). Broadly, we found the majority of methods perform well in this analysis (specifically most methods are close to the (0,0) coordinate in Fig. 3f–h), but for the methods that identified false positive DEGs, we found no consistency in which types of imputation methods identified false positive DEGs as there were examples in all types of imputation methods (modelbased, smoothingbased, and data reconstruction based) that reported false positive DEGs (Additional file 1: Figure S2in). We also observed that imbalanced group sizes of cells when identifying DEGs resulted in more imputation methods reporting false positives using both MAST and Wilcoxon (Additional file 1: Figure S2ij).
To complement the enrichment and null DE analysis, we also asked if the imputed expression of known celltypespecific marker genes can correctly predict cell type. Using a UMIbased and FACS sorted peripheral blood mononuclear cell (PBMC) dataset [3] (PBMC_10x_tissue) and known PBMC marker genes highly expressed in purified PBMC cell types (CD19 for B cells; CD14 for monocytes; CD34 for CD34 + cells; CD3D for CD4 T helper cells, cytotoxic T cells, memory T cells, naive cytotoxic T cells, naive T cells, regulatory T cells; CD8A for cytotoxic T cells and naive cytotoxic T cells), we evaluated the performance of predicting a cell type (e.g., B cell) based on the expression of a marker gene (e.g., CD19) (Fig. 3i). We estimated the area under the ROC (AUROC) curve where the expression of the marker gene (e.g., CD19 expression) is the predictor and the true cell type (e.g., B cell or not) is the label (for more details see “Methods” section). Among the methods that returned imputed values, we found that 10 out of 11 imputation methods (that returned imputation values) produced higher AUROC compared to no imputation (Fig. 3i). Specifically, SAVERX, MAGIC, and SAVER were the top three methods with this UMIbased scRNAseq data.
We further evaluated the impact of the magnitude of DE (i.e., effect size) on an imputation method, which was previously shown to be important for the performance of imputation methods in the context of identifying DEG [14]. Using the sc_10x_5cl dataset with five cell lines, we compared the cell lines pairwise and stratified genes into high and low LFC using the bulk RNAseq data, where high (or low) LFC genes were defined as the top 10% (or bottom 10%) genes based on absolute values of LFC. Interestingly, when the magnitude of DE was large, the overlap between the bulk and singlecell DEGs for most imputation methods (13 out of 16 methods that returned imputed values) was higher compared to no imputation (Fig. 3j). However, when the magnitude of DE was small, only 2 out of the 16 methods increased the overlap between the bulk and singlecell DEGs compared to no imputation, suggesting that most imputation methods may have smoothed away small differential signals (Fig. 3k). In Fig. 3k, the DEG overlap was highly variable across datasets, with some datasets showing larger overlap compared to other datasets regardless of imputation methods. This is because different datasets have different signal abundance levels, and methods tend to show a larger imputationbulk overlap in datasets with more bulk DE genes compared to datasets with fewer bulk DE genes (Additional file 1: Figure S5a). In real applications, the signal abundance is often determined by the underlying biology rather than determined by users. Thus, the primary goal of this analysis is to compare different imputation methods on the same dataset rather than studying dataset variability. To more clearly show the method difference conditional on the same dataset, we further ranked all methods within each dataset and compared the ranks. The conclusion remained similar: most imputation methods improved analyses for DE genes with large LFC over no imputation but did not improve for genes with small LFC (Additional file 1: Figure S5b, c). The above analyses were performed with MAST, but we found consistent results using the Wilcoxon test (Additional file 1: Figure S5d, e).
Impact of scRNAseq imputation on unsupervised clustering
Unsupervised clustering is another common downstream analysis with scRNAseq data to empirically define groups of cells with similar expression profiles [40]. Here, we assessed the impact of the 18 imputation methods on unsupervised clustering, specifically using kmeans [49] and Louvain clustering [50]. Similar to our motivation for using two DE methods (MAST and Wilcoxon), we considered two unsupervised clustering methods in this section in order to assess whether the performance difference among the imputation methods was due to the choice of the clustering method or if the relative performance of imputation methods was consistent regardless of the choice of clustering method. However, instead of only considering the imputed gene expression profiles from the 18 imputation methods, we also considered the three latent spaces from scVI, scScope, and SAUCIE (for a total of 21 “methods”) for unsupervised clustering and inferring pseudotemporal trajectories. Clustering was performed both on the top principal components of the imputed data from the 18 methods and on the three latent spaces directly provided by scVI, scScope, and SAUCIE. We used four metrics to assess the clustering performance: the median Silhouette index, adjusted Rand index (ARI) [51], entropy of cluster accuracy (H_{acc}), and entropy of cluster purity (H_{pur}). The last three were also used by and described in Tian et al. (2019) [18]. The Silhouette index measures consistency within clusters (or how similar an observation is to its own cluster compared to other clusters). The last three metrics assess the similarity of predicted cluster labels to a known ground truth, and they have been shown to have good correlation with each other [18]. The primary difference between the last two is that H_{acc} measures the diversity (or accuracy) of the true group label within each cluster assigned by the clustering method, while H_{pur} measures the diversity (or purity) of the predicted cluster labels within each of the true groups. We scaled ARI to range between 0 and 1, and Silhouette index ranges between − 1 and 1. In both cases, a higher score represents a better performance. H_{acc} and H_{pur} range between 0 and a number larger than one with a lower score representing a better performance.
We applied each imputation method to seven datasets from CellBench [18] (Additional file 4: Table S3, the “Methods” section), a data compendium consisting of both UMIbased 10x Genomics, Dropseq, and platebased scRNAseq data for benchmarking analysis methods, and then applied kmeans clustering (Fig. 4a–c, Additional file 1: Figure S6). A similar evaluation using the CellBench data was performed in Tian et al. [18] who evaluated three scRNAseq imputation methods (kNNsmoothing [27], DrImpute [25], and SAVER [20]) with five unsupervised clustering methods. Here, we expanded their analysis to include 18 scRNAseq imputation methods and 3 latent space outputs from scVI, scScope, and SAUCIE—for a total of 21 “methods.” Our primary goal here is to evaluate the imputation methods rather than clustering methods.
Broadly, using kmeans clustering 8 out of 21 methods (MAGIC, SAUCIE_latent, SAVERX, mcImpute, SAVER, etc.) improved clustering results compared to no imputation in these data (Fig. 4a). Overall, using the latent spaces for scVI, scScope and SAUCIE was better than using their imputed expression values (e.g., observations sampled from the posterior distribution using scVI). We illustrate how individual cells cluster along the first two principal components using an example dataset from CellBench (sc_celseq2_5cl_p1) with no imputation and with imputation using MAGIC (Fig. 4b). Using Louvain clustering yielded similar results (Fig. 4c, Additional file 1: Figure S7) although the top methods were slightly different (MAGIC, SAVER, SAVERX, ALRA, mcImpute, and bayNorm).
We further compared imputation methods using the scRNAseq dataset of ten sorted PBMC cell types from 10x Genomics [3] (PBMC_10x_tissue) using both kmeans clustering and Louvain clustering. For the imputation methods that successfully returned imputed values, 10 methods (MAGIC, SAUCIE_latent, SAUCIE, SAVERX, SAVER, etc.) outperformed no imputation (Fig. 4d). In Fig. 4e, we show how the sorted PBMC cells cluster along the UMAP components [52] with no imputation and with imputation using MAGIC (other methods see Additional file 1: Figure S8). Again, the results between kmeans and Louvain clustering were similar (Fig. 4f, Additional file 1: Figure S9) though the top methods when using Louvain clustering were slightly different (SAUCIE, SAUCIE_latent, MAGIC, SAVER, SAVERX). We provide UMAP plots of the cells colored by true cell types and the clustering results from both kmeans and Louvain clustering in Additional file 1: Figure S8.
Impact of scRNAseq imputation on inferring pseudotemporal trajectories
We also evaluated the impact of imputation methods on inferring cells’ pseudotemporal trajectories. In contrast to clustering analysis which used data with distinct cell types to evaluate clustering, here we used datasets in which cells had a continuum of transcriptomic profiles (e.g., cell differentiation) to assess if imputation methods could recover continuous biological processes. Analogous to the section above, we applied methods to infer trajectories on the top principal components. We also inferred trajectories using the latent spaces directly provided by scVI, scScope, and SAUCIE, similar to clustering analysis.
First, we applied imputation methods to six RNA mixture and cell mixture datasets from CellBench [18] followed by using two different trajectory analysis methods, Monocle 2 [53] and TSCAN [54]. In these data, the true trajectories of cells were known. They were used to evaluate the impact of imputation methods on the ability to infer trajectories. We considered two trajectory inference methods in order to assess whether the performance difference among the imputation methods was due to the choice of the trajectory inference method or if the relative performance of imputation methods was consistent regardless of the choice of trajectory inference method. A similar evaluation using CellBench data was performed by Tian et al. [18] who evaluated three scRNAseq imputation methods (kNNsmoothing [27], DrImpute [25], and SAVER [20]) with five trajectory inference methods. Here, we expanded their analysis to include not only the imputed gene expression profiles from the 18 imputation methods, but also the three latent spaces from scVI, scScope, and SAUCIE (for a total of 21 “methods”). While we include two trajectory inference methods, we note our primary focus is to evaluate the imputation methods. The performance metrics used in this analysis were (1) the Pearson correlation between cells’ rank order along the inferred trajectory and their rank order along the true trajectory and (2) the proportion of cells for which the inferred branch overlapped (i.e., was consistent) with the branch in the true trajectory. Both of these metrics have been previously described and used to evaluate inferred cell trajectories in [18].
Using the CellBench data, we found the imputation methods kNNsmoothing, SAVER, and ALRA led to both increased correlation (Fig. 5a) and overlap (Fig. 5b) compared to no imputation using the TSCAN trajectory inference. Using Monocle 2 trajectory inference, SAVER, kNNsmoothing, mcImpute, and the latent spaces from SAUCE (SAUCE_latent) increased both the correlation (Additional file 1: Figure S10a) and overlap (Additional file 1: Figure S10b) compared to no imputation. This confirms the variability in imputation methods’ performance (depending on trajectory inference method), in particular for the overlap, that was previously shown [18] (Fig. 5c–d). Finally, analogous to results shown in clustering evaluation, for imputation methods that return latent spaces, using the latent spaces generally led to better performance than using the imputed expression values.
Next, we evaluated the performance of the imputation methods using bone marrow cells from the HCA_10x_tissue [11]. Because these are not cell lines, we first computationally labeled cell types using bulk RNAseq data (see the “Methods” section for details). We used the schematic of hematopoietic stem cell (HSC) differentiation shown in Fig. 1a in Buenrostro et al. [55] to computationally label cell types. The bone marrow contains hematopoietic stem cells (HSCs) differentiating into three major lineages: lymphoid, erythroid, and myeloid cells (Fig. 5e). Here, we compare the estimated pseudotime to the level of differentiation for each pair of bone marrow cells. For instance, for the pair of cells A and B, if cell A is a HSC cell (we assign it as differentiation level 1), cell B is a multipotent progenitor (MPP) cell (we assign it as differentiation level 2), and the inferred pseudotime for cell A and B is t_{A} and t_{B} where t_{A}<t_{B}, then we call it “correctly ordered.” The percentage of the correctly ordered cells averaged across all possible cell pairs serves as the performance metric.
Similar to using the CellBench data, kNNsmoothing most consistently outperformed other imputation methods using this heterogeneous tissue data using either TSCAN or Monocle 2. However, there was variability in the performance of other imputation methods depending on the trajectory analysis method used (Fig. 5f), as previously reported [18]. Figure 5f shows the percentage of correctly ordered cell pairs in the bone marrow dataset using Monocle 2 (xaxis) and TSCAN (yaxis) as the pseudotime reconstruction methods. We found kNNsmoothing and MAGIC slightly performed better than using no imputation. However, the majority (all but two) of methods reported 78%+ correctly ordered cells in this data using either TSCAN or Monocle 2.
Discussion
We have presented a systematic benchmark evaluation comprehensively comparing 18 scRNAseq imputation methods. Our comparison is subject to several limitations. Firstly, the imputation methods were mostly compared with default parameters which may not achieve optimal performance across all datasets. Our work could be further improved with the use of methods such as molecular crossvalidation (MCV) [56]. In addition, we used 72 h as the time limit for convergence for imputation methods, which does not guarantee algorithmic convergence for some methods. In our evaluation of imputation methods on inferring pseudotime with trajectory analysis methods, the cell types of the tissue HCA_10x_tissue cells were computationally annotated. Another limitation is that there are no disease tissues included in this study. In the future, it is worthwhile to continue investigating how conclusions presented in this study may translate to applications in a diseased setting such as cancer tissues. One challenge is that the expression of some genes in diseased cells might be abnormal [57–60], which could lead to the false identification of similar cells and could affect the imputation performance.
An open problem not investigated in our current study is the impact of imputation methods on the RNA velocity analysis [61–63]. Since RNA velocity is estimated by analyzing unspliced and spliced mRNA, it takes into account both spliced and unspliced counts. Existing imputation methods deal with the dropout events by imputing the gene expression values rather than the original reads, and the gene expression is usually quantified on exons only which may not distinguish contributions from the spliced versus unspliced transcripts. Therefore, whether existing imputation methods can also be applied to velocity analysis and to separately impute spliced and unspliced transcripts (including introns) remains an open problem that requires extensive future investigation which is beyond the scope of the present study. In addition to the velocity analysis, evaluating how imputation methods may impact other emerging analyses such as spatial transcriptomics [64–68] also warrants future investigation.
Conclusions
A good imputation method should allow one to accurately recover gene expression. Ideally, it should improve downstream analyses without introducing many artifacts or false signals. Motivated by this, we evaluated the performance of the imputation methods based on the similarity between imputed singlecell profiles and bulk profiles in a homogeneous population of cells, and the impact of the imputation methods on three downstream analyses: differential expression analysis, unsupervised clustering analysis, and pseudotime inference. We conclude by summarizing our results in Fig. 6 which ranked methods based on their average performance. Computational time, memory usage, and scalability were not used to rank the methods. However, they were assessed separately using four datasets of 10^{3},5×10^{3},5×10^{4} and 10^{5} cells, respectively. Specifically, we used (1) computation time (in seconds), (2) memory (in maximum resident set size of all tasks in job, i.e., MaxRSS, returned from sacct), and (3) scalability—regression coefficient in the linear model where the computation time is fitted against the number of cells on the log10scale. The Additional file 1: Figure S11 shows the comparison of time, memory, and scalability. As a method’s performance depends on the analysis task and the experimental platform used to generate the data, Fig. 6a–c provides heatmaps that summarize the performance of different methods in different analysis tasks and for different experimental platforms. Figure 6d also summarizes the methods with performance better than no imputation (methods shown in ranked order) for each analysis task and data platform. It can serve as a map to guide users and investigators to navigate the current space of scRNAseq imputation methods depending on their needs. In addition, since the input data requirements and processing procedures are variable among methods, we also outlined these details in Additional file 5: Table S4 for user’s reference.
Of the methods considered, MAGIC, kNNsmoothing,and SAVER were found to outperform the other methods most consistently (Fig. 6). However, the performance of methods varied across evaluation criteria, experimental protocols, datasets, and downstream analyses. For example, scVI was one of the top performers in terms of the similarity between the imputed singlecell and bulk expression profiles (Fig. 2), but it did not perform among the top in clustering and trajectory analysis (Figs. 4 and 5). SAVERX performed well in UMIbased data, but less well in nonUMI based data (Fig. 6). While MAGIC was one of the top performers overall (Fig. 6), it performed worse than many other methods when identifying differentially expressed genes in hypothesis testing type settings that take into account cell variability (Fig. 3).
In addition, we found that while some imputation methods improve detecting differentially expressed genes or discovering marker genes, they also can introduce false positive signals, sometimes driven by imbalanced cell numbers between groups (e.g., Additional file 1: Figure S2ij). The magnitude (i.e., effect size) of differential expression (i.e., logfold change) plays a role in the performance of the imputation methods. Most imputation methods strengthen large effect sizes compared to no imputation. However, if the original expression difference is small, then most imputation methods may smooth away the small differential signal and hence do not show clear advantage over not imputing (Fig. 3j, k).
One important observation is that, while the majority of imputation methods outperformed no imputation in recovering bulk expression (16 out of 18 methods) and log fold changes of individual genes between cell types without considering cell variability within each cell type (13 out of 18 methods) (Fig. 2), much fewer methods performed better than no imputation for identifying differentially expressed genes after considering cell variability (1–12 out of 18 depending on the test scenario), clustering cells (6–10 out of 21 methods) or inferring pseudotemporal trajectories (4–11 out of 21 methods) (the “Results” section). Thus, the current imputation methods as a whole seem to be most effective for providing a point estimate of the activity of individual genes, and they become less effective when coupled with various downstream analysis tasks. For differential expression analysis, the decreased effectiveness is likely due to inaccurate cell variance characterization after imputation. For clustering and trajectory analysis, the reduced effectiveness is likely because these two analyses attempt to analyze celltocell relationship rather than individual genes. Cell clustering and trajectory analysis are usually conducted by embedding the highdimensional expression vector of each cell into a relatively lowdimensional space. Each dimension in the lowdimensional space combines information from many genes, which increases signaltonoise ratio by diluting technical noise such as observed zeros due to technical variation, even without imputation. Thus, the recovery of celltocell relationship can be influenced less by imputation. By contrast, the measurements of individual genes contain high level of technical noises which can be greatly mitigated by imputation by borrowing information from other genes or cells. Thus, imputation could be more helpful for analyzing individual genes rather than celltocell relationship. An open question to be investigated in the future is whether the improvement on the various downstream analysis tasks by imputation has already reached its upper limit and, if not, how to design new imputation methods to further improve the analysis of celltocell relationship or differential expression that takes into account cell variability.
In terms of computation, MAGIC, DCA, and scVI are among the most efficient methods according to the average of three metrics: time, memory, and scalability. SAUCIE, scScope, DeepImpute, ALRA, and kNNsmoothing exhibit high scalability with increasing numbers of cells in datasets. scImpute and bayNorm are intermediary while the remaining methods do not scale well for large datasets (Additional file 1: Figure S11).
Our systematic benchmark evaluation highlights the advantages and disadvantages of existing imputation methods and that the performance of an imputation method depends on many external factors, such as experimental protocols and analyses usage. We hope that this study can benefit both users and method developers and provide an evaluation standard for future scRNAseq imputation methods.
Methods
All methods were evaluated with default parameters, with the exception of the deeplearningbased methods for which the maximum epoch time was set as 400. We used 72 h as the time limit for convergence for the imputation methods. This did not guarantee algorithmic convergence for some methods. For a description of the data, see the “Data” section and Additional file 4: Table S3. For complete details on the methods used, the input, the output, preprocessing steps, the programming language used, version number, and link to software, see Additional file 5: Table S4.
Overview of scRNAseq imputation methods
The 18 imputation methods are reviewed and summarized in Additional file 5: Table S4. We grouped them into four categories.

1
Modelbased methods: These include 6 methods, namely, bayNorm [19], SAVER [20], SAVERX [21], scImpute [22], scRecover [23], and VIPER [24]. These methods assume that the data follow a specific model. SAVER models the UMI counts using a negative binomial distribution. SAVERX is similar to SAVER but has an additional option to pretrain hyperparameters using existing datasets. scRecover is based on a zeroinflated negative binomial distribution model which tries to adapt to high dropout rates. bayNorm uses a binomial model of mRNA capture for its likelihood function. scImpute assumes the dropout rate of each gene follows a double exponential function. VIPER is based on a sparse nonnegative regression model.

2
Smoothingbased methods: These include 3 methods, namely, DrImpute [25], MAGIC [26], and kNNsmoothing [27]. DrImpute first clusters cells and then averages the expression values from similar cells. MAGIC performs data diffusion on the Markov affinitybased matrix for the imputation of cells. In contrast, kNNsmoothing models technical variance using a Poisson distribution and based on that, kNNsmoothing smooths each cell by its knearest neighbors.

3
Deep learning methods: These include 6 methods, namely, AutoImpute [28], DCA [29], DeepImpute [30], SAUCIE [31], scScope [32], and scVI [33]. In these methods, a latent space is constructed using deep learning models to represent cells by lowdimensional latent variables which are used to reconstruct gene expression. The latent space representation can be used for downstream analyses, such as clustering the cells or inferring pseudotime trajectories on the cells, but not for differential gene expression analysis. DCA is a deep count autoencoder network that uses a negative binomial noise model with or without zeroinflation (depending on the dispersion learned form data) and captures nonlinear genegene dependencies. scVI is based on a hierarchical Bayesian model and applies deep neural networks to specify the conditional distributions of variables where the latent variables are mapped to the zeroinflated negative binomial distribution. AutoImpute applies overcomplete autoencoders and tends to be more conservative by considering the expression values as truly zeros if the genes are silenced across bulk samples. DeepImpute constructs multiple subneural networks to impute sets of target genes using genes highly correlated with the target genes. SAUCIE is a regularized autoencoder that uses the reconstructed signal from autoencoder to denoise and impute the data. ScScope iteratively performs imputation using a recurrent network layer.

4
Lowrank matrix representation methods: These include 3 methods, namely, ALRA [34], mcImpute [35], and PBLR [36]. In these lowrank matrixbased methods, cell profiles are mapped to a lowdimensional linear space for imputation. ALRA uses SVD decomposition followed by a thresholding scheme. mcImpute uses nuclear norm minimization, a matrix completion algorithm. PBLR first groups cells into subpopulations and then runs a bounded lowrank matrix recovery method within each cell subpopulation.
Note that both deeplearningbased methods and lowrankmatrixbased methods use the idea of data reconstruction. Regarding the programming languages, the deeplearningbased methods were all implemented using Python. mcImpute and PBLM were implemented using MATLAB. MAGIC and SAVERX are available in both R and Python. All others are implemented using R. For more details about the methods, see Additional file 5: Table S4.
Data
Single cells and bulk samples from cell lines

1
Single cells and pseudo cells from the CellBench [18] scRNAseq benchmarking dataset

All UMIbased data from five cell lines (HCC827, H1975, H2228, H838, A549) in the CellBench [18] benchmarking dataset (except for cellmix5, a population control) using the CELseq2 protocol (sc_celseq2, sc_celseq2_5cl_p1, sc_celseq2_5cl_p2, sc_celseq2_5cl_p3, cellmix1, cellmix2, cellmix3, cellmix4, RNAmix_celseq2), Dropseq Dolomite protocol (sc_dropseq), the Sortseq protocol (RNAmix_sortseq), and 10x Chromium Genomics protocol (sc_10x, sc_10x_5cl). For a description of the experimental design, GEO accession numbers, protocol parameters, see thesc_mixologyGitHubrepo and Additional file 4: Table S3.

N=10 bulk RNAseq samples from GSE86337[69] (each of the five cell lines have two replicates)


2
Five ENCODE (A549, GM12878, h1hESC, IMR90, K562) cell lines

ENCODE_fluidigm_5cl: The five cell lines correspond to five cell types and contain a total of N = 362 cells all from GSE81861. They were sequenced with the SMARTer fulllength method [17] using the Fluidigm C1 protocol [70]. The cell types include N = 74 A549 cells, N = 96 GM12878 cells (batch 2), N = 96 H1hESC cells (batch 1) referred to as H1 in the manuscript, N = 23 IMR90 cells, and N = 73 K562 cells.

The bulk RNAseq samples from ENCODE [71]: N = 7 A549 samples, N= 11 GM12878 samples, N = 8 H1hESC samples, N = 5 IMR90 samples, and N = 27 K562 samples


3
Jurkat cell lines

10x_293t_jurkat (293T cells): N = 3258 cells measured using UMIs and the dropletbased protocol from 10x Genomics [3] (https://support.10xgenomics.com/singlecellgeneexpression/ datasets/1.1.0/jurkat)


4
HEK293T cell lines

10x_293t_jurkat (jurkat cells): N = 2885 cells measured using UMIs and the dropletbased protocol from 10x Genomics [3] (https://support.10xgenomics.com/singlecellgeneexpression/ datasets/1.1.0/293t)

Single cells and bulk samples from tissues

1
Bone marrow tissue from the Human Cell Atlas [11] (HCA)

HCA_10x_tissue: N = 6939 bone marrow cells from sample MantonBM6 measured using 10x Genomics (https://data.humancellatlas.org/explore/projects/ cc95ff892e684a08a234480eca21ce79)

N= 49 bulk RNAseq samples from 13 cell types (B cell, CD4 T cell, CD8 T cell, CMP, GMP, HSC, MEP, Monocyte, MPP, and NK cells each has 4 samples and each of CLP, Erythroid and LMPP has 3 samples). (GSE74246[73])


2
Sorted peripheral blood mononuclear cell (PBMC) tissue from 10x Genomics (UMI)

PBMC_10x_tissue: N= 59620 sorted cells from 10 cell types (N = 4033 B cells, N = 498 CD14 monocyte cells, N = 9162 CD34 cells, N= 7046 CD4 T helper cells, N = 7555 CD56 NK cells, N= 7631 cytotoxic T cells, N = 6969 memory T cells, N = 5672 naïve cytotoxic cells, N = 4569 naïve T cells, and N = 6485 regulatory T cells) (http://support.10xgenomics.com/singlecell/datasets).

Data processing and imputation
Singlecell RNAseq data
We applied the same quality control (QC) criterion across all singlecell datasets. Cells with at least 500 detected genes were retained. ERCC spikeins were removed. Genes expressed in at least 10% of cells in cell line data and 1% of cells in tissue samples were retained. Mitochondrial genes were removed. We applied these cell and genefiltering steps to all imputation methods and skipped each method’s own gene and cellfiltering (if applicable) to keep the dimension of the imputed values (output) the same across imputation methods. Singlecell profiles from five Fluidigmbased ENCODE cell lines (ENCODE_fluidigm_5cl) were combined into one count matrix which was used as input to each imputation method. A similar procedure was applied to other datasets. Singlecell profiles from UMIbased Jurkat and HEK293T cell lines were also combined into one UMI count matrix (10x_293t_jurkat) which was used as input to each imputation method. Data were normalized by the pooling normalization method [37] implemented in the scran [38] R/Bioconductor [39, 40] package and log2transformed if any imputation method requires normalized counts or logtransformed normalized counts as input. We added postprocessing steps for methods if required. For example, if a method did not normalize nor apply a logtransformation before or during imputation, we applied scran normalization and log2transformation to the output. If a method included normalization but does not logtransform the data during imputation, we applied log2transformation to the output.
Bulk RNAseq data
Read counts were downloaded from GEO (accession numbers GSE129240, GSE86337, and GSE74246). Each sample was normalized by library size and multiplied by a size factor 10^{6}, or count per million (CPM), and then log2transformed with a pseudocount of 1, i.e., log2(CPM+1). For the ENCODE samples, the downloaded data were already normalized (Fragments Per Kilobase of transcript per Million mapped reads, or FPKM) and log2transformed.
Null simulation
To generate Fig. 1a and Additional file 1: Figure S1, we used the 293T cells from the 10x_293t_jurkat dataset with G=18729 genes and C=2885 cells. We applied genelevel quality control by only retaining genes that have expression at least 10% of cells, which kept G=7582 genes. Let X_{gi} represent the observed scRNAseq UMI counts for the g^{th} gene where g∈(1,…,G) and the i^{th} cell where i∈(1,…,C) from a given dataset. We estimated the mean expression level for each gene across the cells: \(\hat {\mu }_{g} = \sum _{i=1}^{C} \frac {X_{gi}}{C}\). Then, for each gene, we simulated scRNAseq UMI counts Y_{gj} for the g^{th} gene and the j^{th} simulated cell from a Poisson distribution with the mean equal to the product of the estimated genelevel mean (\(\hat {\mu }_{g}\)) estimated from X and a cellspecific library size factor (δ_{j}) randomly sampled from a uniform distribution between [a,b]:
where δ_{j}∼U[a,b] for j∈(1,…,C^{′}). For Fig. 1a and Additional file 1: Figure S1, C^{′} = 1000 and the library size factors δ_{j}∼U[0.9,1.1]. We passed the gene expression matrix to each imputation method. Data preprocessing and postprocessing steps followed the “Data processing and imputation” section. Principal component analysis was performed on genes with coefficient of variation (cv) greater than the median cv.
Evaluation of similarity between bulk and imputed scRNAseq data
Correlation of gene expression profiles between bulk and imputed scRNAseq profiles
For a given scRNAseq dataset, we averaged the scran normalized [37] log2transformed scRNAseq counts across cells (referred to as a “pseudobulk”) and calculated the Spearman’s rank correlation coefficient (SCC) (or ρ) between the pseudobulk and the bulk RNAseq profile (averaged across replicate bulk samples) of the same cell type. Next, for each cell in a given scRNAseq dataset, we calculated the SCC between the cell’s imputed scRNAseq profile (e.g., using SAVER) and the averaged bulk RNAseq profile. The median SCC across all cells was then used to evaluate the performance of an imputation method within a dataset. To rank methods across datasets, we first averaged the median SCCs across datasets within the same experimental protocol (e.g., UMIbased or platebased) and then averaged these averages across protocol. In the analyses above, pseudobulk was used as a reference for the approximate upper bound of the singlecell imputation performance since both bulk and pseudobulk profiles try to measure a cell population’s average behavior, and the correlation between a pseudobulk and the corresponding bulk profile is expected to increase as one pools an increasing number of cells to create the pseudobulk sample [74, 75]. Note that unlike singlecell profiles, pseudobulk cannot characterize celltocell variability. Thus, pseudobulk cannot replace singlecell data in studies that require analyzing celltocell variability, such as a rigorous differential expression analysis that aims to detect differences between two cell types that cannot be explained by the background celltocell variation.
To see whether there exist significant platform differences, twosample ttest was conducted to evaluate whether the Spearman’s correlation coefficients (SCC) between the bulk and imputed singlecell data for each method had equal mean or different means across different platforms (e.g., 10x vs. Fluidigm). The p values from all methods were then adjusted for multiple testing using the BenjaminiHochberg false discovery rate (FDR) method [76]. Methods with FDR < 0.05 and relative performance change > 25% were highlighted in Fig. 2d, h. Here, relative performance change is defined as \( \overline {SCC}_{10x}\overline {SCC}_{fluidigm} / max(\overline {SCC}_{10x},\overline {SCC}_{fluidigm})\), where \(\overline {SCC}\) represents mean imputed valuebulk SCC across all datasets within a platform.
To quantify the relationship between the number of cells in a dataset and the imputation performance, we calculated the Spearman correlation between the cell number and a method’s imputation performance (i.e., SCC between the bulk and imputed singlecell data) across datasets for each imputation method within each platform (e.g., 10x or Fluidigm).
Correlation between the bulk and imputed singlecell gene expression log fold changes
This analysis was similar to the above one, but here the SCC was calculated by comparing bulk and singlecell differential expression (DE). For a given pair of cell types, we first averaged the cell profiles from each cell type to form two “pseudobulk” samples using scran normalized [37] and log2transformed scRNAseq profiles. The difference between the two pseudobulk profiles and the difference between the two bulk RNAseq profiles were computed and the SCC between the two differential profiles was computed.
Next, we took one cell from each cell type and calculated the difference in the imputed scRNAseq profile (e.g., using SAVER) between these two cells. The SCC between this difference and the difference between two bulk RNAseq profiles from the same two cell types was then computed. This was repeated for all possible cell pairs. For a given pair of cell types (both within the same experimental protocol), the median SCC across all cell pairs was computed. To rank methods across datasets, we averaged the median SCCs across datasets within each protocol and then averaged these averages across protocols.
Evaluation of imputation to identify differentially expressed genes
Ranking differentially expressed genes (DEGs)
For each of the three imputed scRNAseq datasets (sc_10x_5cl, ENCODE_fluidigm_5cl and HCA_10x_tissue), we identified differentially expressed genes (DEGs) between all pairs of cell types. We considered two methods to identify DEGs from scRNAseq: (1) MAST [46] which models the data using a hurdle model and (2) Wilcoxon ranksum test [47]. For bulk RNAseq samples, DEGs were identified using the limma [77] R/Bioconductor package. We corrected p values for multiple testing using the BenjaminiHochberg (BH) method [76] (p.adjust function in the stats R package) to derive false discovery rate (FDR) (only when a method did not already correct for FDR) and identified genes with FDR smaller than α=0.05. The bulk DEGs were treated as a “gold standard” similar to previous studies [48]. For the scRNAseq data, The singlecell DEGs were ranked by p values or the logscaled expression fold change if there was a tie for p values. For i from 1 to 100, we calculated the proportion of top 10∗i singlecell DEGs that overlap with bulk DEGs. The average of these 100 proportions served as the performance metric.
Null differential analysis
For each dataset in this analysis, we started with a homogeneous population of cells where we expect no DEGs after correction for multiple testing. We used the A549 cells from the sc_10x_5cl dataset (N=1256 cells), the GM12878 cell line from the ENCODE_fluidigm_5cl dataset (N=96 cells), and the cell type with the largest number of cells from the HCA_10x_tissue (N=193 monocytes). For each dataset, we randomly sampled cells into two groups with group size ranging from N = 10 to 500 cells per group, imputed the expression values of these two groups together and identified DEGs using MAST [46] and Wilcoxon ranksum test [47]. The genes with FDR < 0.05 were identified as DEGs.
Predicting cell type using imputed expression of known PBMC marker genes
Using the sorted peripheral blood mononuclear cells (PBMCs) [3] (PBMC_10x_tissue dataset), we assessed the performance of an imputation method on recovering the expression level of known PBMC marker genes. The cell typespecific marker genes used in this analysis were the following: CD19 for B cells; CD14 for monocytes; CD34 for CD34 + cells; CD3D for CD4 T helper cells, cytotoxic T cells, memory T cells, naive cytotoxic T cells, naive T cells, regulatory T cells; CD4 for CD4 T_helper cells, memory T cells, naive T cells, regulatory T cells; CD8A for cytotoxic T cells and naive cytotoxic T cells. We evaluated the performance of predicting a cell type (e.g., B cell) based on the imputed expression of a marker gene (e.g., CD19). For each cell type and marker gene pair, we calculated the area under the receiver operating characteristic (ROC) curve (AUROC) using the performance function in the ROCR R package[78] where the expression of the marker gene is the predictor and the true cell type is the label. Specifically, the cells were first sorted in a descending order according to the imputed values of the marker gene (e.g., CD19). Consider the cell type A. Assume there were K cells in total and b of them were in cell type A. Assume in the top N cells, a of them were in cell type A; in the remaining K−N cells, c of them were not in cell type A. Sensitivity was calculated as a/b and specificity was calculated as c/(K−b). The ROC curve was obtained by plotting sensitivity against 1specificity for different N (N=1,2,...,K).
Types of failures
In the differential expression analysis, there are three types of failures.

1
“ImputationFail”: It means no imputation results were returned after running the imputation methods for 72 h. Without imputed values, one cannot proceed with the differential expression analysis. This type of failure is due to failure of the imputation methods, so we assign a zero score to the method, and the score will be counted towards the method’s overall performance in the evaluation and affect the ranking of the imputation methods.

2
“DifferentialFail”: Here, the imputation methods can impute the gene expression, but the methods for detecting differentially expressed genes failed to run. In this scenario, imputation methods themselves did not fail, but one does not have enough information to compare their imputation performance since differential expression analysis cannot be run. Therefore, this type of failure will not be counted in evaluation, and the test results provided by the other test method will be counted only (in our analysis, Wilcoxon ranksum test can always provide test results).

3
False positives and false negatives: Here, both the imputation methods and the differential expression analysis methods (e.g., MAST) can be run without encountering technical issues. However, one may incorrectly identify differentially expressed genes from the imputed data. This further includes two scenarios: (a) false positives: a nondifferential gene is incorrectly reported as differential; (b) false negatives: a differential gene is incorrectly reported as nondifferential. Null differential analysis (Fig. 3e–h) specifically evaluates false positives by asking each method to detect differential genes in null analyses where no differential genes are expected. The comparisons to ranking differentially expressed genes (Fig. 3a–d), on the other hand, evaluates each method’s ability to put truly differential genes to top rank which depends on both false positives and false negatives. For a given rank (say top N genes), a better method should have fewer false positives among the top N reported genes and fewer false negatives in the remaining genes. This type of failure will be considered in the evaluation of imputation methods’ overall performance.
Evaluation of imputation on unsupervised clustering
We used two sets of datasets for this analysis. The first set is CellBench [18] data, which consists of 7 datasets. Three of these datasets contained three cell lines (datasets: sc_10x, sc_dropseq, sc_celseq2) and four datasets contained five cell lines (datasets: sc_10x_5cl, sc_celseq2_5cl_p1, sc_celseq2_5cl_p2, sc_celseq2_5cl_p3). The second set of data contains sorted PBMCs [3] (N = 59620) with 10 cell types.
Clustering was performed using both kmeans [49] and Louvain clustering [50] where the number of clusters was set to be the number of known cell types known in each dataset. Kmeans clustering was performed using the top 10 PCs. In kmeans, we directly set k to be the number of known cell types, while in Louvain clustering this is achieved by increasing the number of nearest neighbors iteratively until the desired number of clusters is obtained. Louvain clustering was performed by performing feature selection using highly variable genes (HVGs), applying PCA by prcomp(), and then building a shared knearestneighbors (kNN) graph [50] based on the Euclidean distances of the top 10 PCs. An edge was drawn between all pairs of cells sharing at least one neighbor, weighted by the characteristics of the shared nearest neighbors. For this last step, we used the buildSNNGraph function in the scran [38] R/Bioconductor package with the top 10 PCs as input, d=NA and other parameters as default. Clusters were identified by a multilevel modularity optimization algorithm for finding community structure [79] using cluster_louvain in igraph R package[80]. For scVI_latent, scScope_latent, and SAUCIE_latent, we skipped the principal component analysis of the imputed values and replaced the top 10 PCs using the latent space coordinates obtained from scVI, scScope, and SAUCIE.
To evaluate the performance of each method, we used four metrics:

1
Entropy of accuracy (H_{acc}) where 0≤H_{acc}≤ logM. H_{acc} measures the diversity of the groundtruth labels within each predicted cluster group assigned by the clustering algorithm.
$$H_{acc} =  \frac{\sum_{i=1}^{M} \sum_{j=1}^{N_{i}} p_{i}(x_{j}) \log(p_{i}(x_{j}))} {M} $$where M is the total number of predicted clusters from the clustering algorithm, N is the number of groundtruth clusters, and N_{i} is the number of groundtruth clusters in the i^{th} predicted cluster. x_{j} are cells in the j^{th} groundtruth cluster, and p_{i}(x_{j}) are the proportions of cells in the j^{th} groundtruth cluster relative to the total number of cells in the i^{th} predicted cluster. A smaller value of H_{acc} is better as it means the cells in a predicted cluster are homogeneous and from the same group [18]. However, H_{acc} can lead to overclustering, with an extreme case being treating each cell as a cluster (or H_{acc} = 0).

2
Entropy of purity (H_{pur}) where 0≤H_{pur}≤ logN. H_{pur} measures the diversity of the predicted cluster labels within each of the groundtruth groups.
$$H_{pur} =  \frac{\sum_{i = 1}^{N} \sum_{j = 1}^{M_{i}} p_{i}(x_{j}) \log(p_{i}(x_{j}))}{N} $$where N is the total number of groundtruth clusters, M_{i} is the number of predicted clusters in the i^{th} true cluster. x_{j} are cells in the j^{th} predicted cluster, and p_{i}(x_{j}) are the proportions of cells in the j^{th} predicted cluster relative to the total number of cells in the i^{th} groundtruth cluster. A smaller value of H_{pur} is better as it means the cells in the groundtruth groups are homogeneous with the same predicted cluster labels [18]. However H_{pur} can lead to underclustering, with an extreme case being assigning all cells into one predicted cluster so that each of the groundtruth groups has the same predicted cluster label (H_{pur} = 0).

3
Adjusted Rand index [51] (ARI). We used the adjustedRandIndex function in mclust package [81]. The minimum ARI of each dataset was obtained by permuting cells’ groundtruth cell type labels 10^{4} times, recomputing ARI, and averaging the 10^{4} ARIs from random permutations. The maximum has been theoretically proved as 1. We subtract the ARI by the empirical minimum and divided by the distance between the empirical minimum and theoretical maximum.

4
Median Silhouette index S=Median(s(i)) where i is a cell, C_{i} is the set of cells in the same cluster as i, C_{i} is its cardinality (i.e., number of cells in a cluster),
$$s(i) = \begin{cases} 1  a(i) / b(i) & \text{if} a(i) < b(i) \\ 0 & \text{if} a(i) = b(i) \\ b(i) / a(i) 1 & \text{if} a(i) > b(i) \end{cases} $$and \(a(i) = \frac {1}{C_{i} 1} \sum _{k \in C_{i}, k\neq i} d(i,k), b(i) = \min _{i\neq j} \frac {1}{C_{j}} \sum _{k\in C_{j}} d(i,k)\). Here d(i,k) denotes the Euclidean distance between cells i and k. This is implemented by silhouette function in the cluster R package [82]. The range of S is [−1,1].
Identify cell types in HCA_10x_tissue dataset
To computationally identify the cell types in the HCA_10x_tissue dataset, we downloaded bulk RNAseq profiles from 13 primary hematopoietic cell types (GEO accession number GSE74246). The data were normalized and scaled using log2transformed TPM. We averaged bulk profile replicates from the same cell type. We used the bulk profiles to computationally assign a cell type label for each individual cell. Consider the j^{th} bulk cell type profile. We calculated the logfold change (LFC) between the j^{th} bulk cell type profile and each of the other bulk profile and identified the top 100 genes with the largest LFC for each comparison, which were used to define anchor gene sets. Repeating this for all bulk cell types resulted in 13×12=156 anchor gene sets. Next, for each bulk cell type or each single cell, we calculated mean expression of anchor genes in each of these gene sets, yielding a vector of length 156 which we refer to as “anchor profile” of a cell type or a cell. We calculated the Spearman correlation between each single cell anchor profile and each bulk cell type anchor profile. A cell was computationally assigned to the j^{th} cell type if the bulk profile of this cell type had the largest correlation with the single cell and the correlation coefficient was greater than 0.6. If no cell type label was assigned to a cell in this way, the cell’s cell type was labeled as unknown. Next, kmeans clustering was performed on the cells. For each cluster, if at least 70% of cells were inferred as cell type A, then all cells in the cluster were relabeled as cell type A. If a cluster cannot be labeled in this way, then all cells in the cluster were labeled as unknown cell type. The largest cluster obtained in this procedure contained N = 193 cells inferred as monocytes. They were used for the null differential analysis described before.
Evaluation on pseudotime inference
Constructing trajectories
We used the (1) CellBench [18] RNA mixture (RNAmix_celseq2, RNAmix_sortseq) and cell mixture (cellmix1, cellmix2, cellmix3, cellmix4) datasets from three cell lines (H2228, H1975, HCC827) generated by CELseq2 and SORTseq protocols [18] and (2) bone marrow cells from the HCA_10x_tissue [11]. Monocle 2 [53] and TSCAN [54] were used to construct trajectories on the imputed expression values. Monocle 2 uses a DDRTree (Discriminative DRTree) [83] for dimensionality reduction and tree construction. For TSCAN, we calculated the top principal components (PCs) with the prcomp function in the stats R package. We then run kmeans clustering (using the kmeans function in the stats R package) on the top PCs (obtained by TSCAN using elbow method) to obtain predicted cluster labels for each cell. Then, we used the predicted cluster labels and top PCs as input to TSCAN. The number of clusters was chosen to be the smallest one that allowed two branches in the spanning tree to be consistent with the underlying true tree structure.
To identify the root cells or root state when inferring trajectories with TSCAN and Monocle 2 using CellBench data, the cluster with the most H2228 cells was selected. For HCA_10x_tissue data, each cell was assigned a differentiation level, as follows. HSC: level 1; MPP: level 2; LMPP and CMP: level 3; CLP, GMP, and MEP: level 4; B cell, CD4 T cell, CD8 T cell, NK cell, Monocyte, and Erythroid: level 5. Then, the cluster of cells with the smallest averaged differentiation level was set to be root state for the inferred trajectories.
Assessment
For the CellBench [18] data, we used the same performance metrics as in Tian et al.’s work [18]: (1) the Pearson correlation between the inferred trajectory and the rank order of the cells for which we know the true ordering of the cells, and (2) the proportion of cells on an inferred trajectory for which the inferred branch and the true branch were the same.
For the bone marrow cells from the HCA_10x_tissue data, we compared the estimated pseudotime to the level of differentiation for each pair of cells. For instance, for a pair of cells A and B, if cell A is a HSC (differentiation level 1), cell B is a MPP (differentiation level 2), and the inferred pseudotime for cell A and B is t_{A} and t_{B} where t_{A}<t_{B}, then we call it “correctly ordered.” The percentage of the correctly ordered cells averaged across all possible cell pairs served as the performance metric.
Performance of imputation algorithms on time, memory, and scalability
We created four datasets for this analysis. Using the 10x_293t_jurkat dataset, we created two smaller datasets by randomly sampling N = 1000 and N = 5 000 cells (1k_cell, 5k_cell, respectively). Using the HCA_10x_tissue dataset, we created two larger datasets by randomly sampling N = 50,000 and N = 100,000 cells (50k_cell, 100k_cell, respectively). By running the imputation methods on all datasets, we assessed the computational time (in minutes), memory usage (MaxRSS or maximum resident set size of all tasks in a job in gigabytes (GB) returned from the Slurm command sacct), and scalability with respect to cell number (Additional file 1: Figure S11, Additional file 6: Table S5). To describe scalability for each method, a linear model was fit using the lm function from the stats R package where the computation time was the response and the number of cells on the log10scale was the predictor. The coefficient of the cell number represents the scalability of the method. For methods that failed to produce results, the running time was set to be the maximum time (72 h) plus 1 min, and the memory was set to be the maximum memory. The time and memory were linearly scaled to [0,1]. The average scaled time and memory across all datasets were used as the final score as shown in Fig. 6.
Overall performance score
We have assessed the performance of 18 imputation methods in a total of 1100 evaluation settings: 30 in bulk correlation, 1032 in differential analysis (972 in ranking differentially expressed genes, 60 in null differential analysis), 6 in predicting cell types using PBMC data, 16 in clustering analysis, 12 in trajectory analysis, and 4 in efficiency assessment. All the assessment measures were mapped to [ 0,1] by subtracting the theoretical minimum and then dividing by the difference of the theoretical maximum and minimum. Empirical extrema were used when theoretical ones did not exist. Ten thousand permutations was applied to obtain the empirical minimum of ARI for each dataset. H_{acc} and H_{pur} were further subtracted by 1 for the convenience of applying “the higher the score, the better the performance” criterion. Efficiency measures (time, memory, scalability) were scaled with the extrema of all methods’ performance.
Evaluation of imputed values through the similarity to bulk samples, differential analysis, clustering, and pseudotime inference were considered as four main assessment aspects. In each aspect, the scores across datasets were first averaged, and then the mean from different analysis tools (MAST and Wilcoxon; kmeans and Louvain clustering; Monocle 2 and TSCAN) were averaged. Next, for each aspect except for clustering, the mean of multiple assessment statistics (e.g., overlap and correlation in assessing pseudotime inference) was used as the evaluation score. For clustering, since H_{acc},H_{pur} and ARI are highly correlated (Fig. 4a, d), we first computed the average of H_{acc},H_{pur} and ARI, and then averaged it with medianSil and used the final average as the evaluation score. This is to avoid the three highly correlated metrics (H_{acc},H_{pur} and ARI) dominating the evaluation. Finally, the mean evaluation scores of all four assessment aspects was used to derive methods’ overall performance rank.
Availability of data and materials
The data used in this analysis are all publicly available. All data are described in the “Methods” section and Additional file 4: Table S3 with all links or GEO accession numbers. The imputation methods are described in Additional file 5: Table S4. All code to reproduce the presented analyses are available at https://github.com/Winnie09/imputationBenchmark[84]. The version of source code used in this article was deposited in Zenodo with the access code DOI: 10.5281/zenodo.3967825 (https://doi.org/10.5281/zenodo.3967825) [85]. The R package ggplot2 [86] for data visualization was used. All accession numbers are listed in Additional file 4: Table S3, but we list them here too: GSE81861 [17], GSE118767 [18], https://support.10xgenomics.com/singlecellgeneexpression/datasets[3], https://preview.data.humancellatlas.org/[11], GSE86337 [69], GSE129240 [72], and GSE74246 [73]. Other bulk RNAseq samples are from ENCODE [71].
References
 1
Yanai I, Hashimshony T. Celseq2singlecell RNA sequencing by multiplexed linear amplification. Methods Mol Biol. 2019; 1979:45–56.
 2
Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ, Weitz DA, Sanes JR, Shalek AK, Regev A, McCarroll SA. Highly parallel genomewide expression profiling of individual cells using nanoliter droplets. Cell. 2015; 161(5):1202–14. https://doi.org/10.1016/j.cell.2015.05.002.
 3
Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, Gregory MT, Shuga J, Montesclaros L, Underwood JG, Masquelier DA, Nishimura SY, SchnallLevin M, Wyatt PW, Hindson CM, Bharadwaj R, Wong A, Ness KD, Beppu LW, Deeg HJ, McFarland C, Loeb KR, Valente WJ, Ericson NG, Stevens EA, Radich JP, Mikkelsen TS, Hindson BJ, Bielas JH. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017; 8:14049. https://doi.org/10.1038/ncomms14049.
 4
Picelli S, Faridani OR, Björklund AK, Winberg G, Sagasser S, Sandberg R. Fulllength RNAseq from single cells using smartseq2. Nat Protoc. 2014; 9(1):171–81. https://doi.org/10.1038/nprot.2014.006.
 5
Datlinger P, Rendeiro AF, Boenke T, Krausgruber T, Barreca D, Bock C. Ultrahigh throughput singlecell RNA sequencing by combinatorial fluidic indexing. bioRxiv. 2019. https://doi.org/10.1101/2019.12.17.879304. https://www.biorxiv.org/content/early/2019/12/18/2019.12.17.879304.full.pdf.
 6
McDavid A, Finak G, Chattopadyay PK, Dominguez M, Lamoreaux L, Ma SS, Roederer M, Gottardo R. Data exploration, quality control and testing in singlecell qPCRbased gene expression experiments. Bioinformatics. 2013; 29(4):461–7. https://doi.org/10.1093/bioinformatics/bts714.
 7
Hicks SC, Townes FW, Teng M, Irizarry RA. Missing data and technical variability in singlecell RNAsequencing experiments. Biostatistics. 2018; 19(4):562–78. https://doi.org/10.1093/biostatistics/kxx053.
 8
Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to singlecell differential expression analysis. Nat Methods. 2014; 11(7):740–2. https://doi.org/10.1038/nmeth.2967.
 9
Karaayvaz M, Cristea S, Gillespie SM, Patel AP, Mylvaganam R, Luo CC, Specht MC, Bernstein BE, Michor F, Ellisen LW. Unravelling subclonal heterogeneity and aggressive disease states in TNBC through singlecell RNAseq. Nat Commun. 2018; 9(1):3588. https://doi.org/10.1038/s41467018060520.
 10
Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, Vallejos CA, Campbell KR, Beerenwinkel N, Mahfouz A, Pinello L, Skums P, Stamatakis A, Attolini C. S. O., Aparicio S, Baaijens J, Balvert M, Barbanson B. d., Cappuccio A, Corleone G, Dutilh BE, Florescu M, Guryev V, Holmer R, Jahn K, Lobo TJ, Keizer EM, Khatri I, Kielbasa SM, Korbel JO, Kozlov AM, Kuo T. H., Lelieveldt BPF, Mandoiu II, Marioni JC, Marschall T, Mölder F, Niknejad A, Raczkowski L, Reinders M, Ridder J. d., Saliba A. E., Somarakis A, Stegle O, Theis FJ, Yang H, Zelikovsky A, McHardy AC, Raphael BJ, Shah SP, Schönhuth A. Eleven grand challenges in singlecell data science. Genome Biol. 2020; 21(1):31. https://doi.org/10.1186/s1305902019266.
 11
Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, Bodenmiller B, Campbell P, Carninci P, Clatworthy M, et al.Science forum: the human cell atlas. Elife. 2017; 6:27041.
 12
Tabula Muris Consortium, Overall coordination, Logistical coordination, Organ collection and processing, Library preparation and sequencing, Computational data analysis, Cell type annotation, Writing group, Supplemental text writing group, Principal investigators. Singlecell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature. 2018; 562(7727):367–72. https://doi.org/10.1038/s4158601805904.
 13
Zhang L ZS. Comparison of computational methods for imputing singlecell RNAsequencing data. IEEE/ACM Trans Comput Biol Bioinform. 2018; 17:376–89.
 14
Andrews TS, Hemberg M. False signals induced by singlecell imputation. F1000Research. 2019; 7:1740. https://doi.org/10.12688/f1000research.16613.2. Accessed 28 June 2019.
 15
Breda J, Zavolan M, van Nimwegen EJ. Bayesian inference of the gene expression states of single cells from scRNAseq data. bioRxiv. 2019. https://doi.org/10.1101/2019.12.28.889956.
 16
Vieth B, Parekh S, Ziegenhain C, Enard W, Hellmann I. A systematic evaluation of single cell RNAseq analysis pipelines. Nat Commun. 2019; 10(1):4667. https://doi.org/10.1038/s41467019122667.
 17
Li H, Courtois ET, Sengupta D, Tan Y, Chen KH, Goh JJL, Kong SL, Chua C, Hon LK, Tan WS, et al.Reference component analysis of singlecell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet. 2017; 49(5):708.
 18
Tian L, Dong X, Freytag S, Lê Cao KA, Su S, JalalAbadi A, AmannZalcenstein D, Weber TS, Seidi A, Jabbari JS, Naik SH, Ritchie ME. Benchmarking single cell RNAsequencing analysis pipelines using mixture control experiments. Nat Methods. 2019; 16(6):479–87. https://doi.org/10.1038/s4159201904258.
 19
Tang W, Bertaux F, Thomas P, Stefanelli C, Saint M, Marguerat SB, Shahrezaei V. bayNorm: Bayesian gene expression recovery, imputation and normalization for singlecell RNAsequencing data. Bioinformatics. 2020; 36(4):1174–81.
 20
Huang M, Wang J, Torre E, Dueck H, Shaffer S, Bonasio R, Murray JI, Raj A, Li M, Zhang NR. Saver: gene expression recovery for singlecell RNA sequencing. Nat Methods. 2018; 15(7):539.
 21
Wang J, Agarwal D, Huang M, Hu G, Zhou Z, Ye C, Zhang NR. Data denoising with transfer learning in singlecell transcriptomics. Nat Methods. 2019; 16(9):875–8.
 22
Li WV, Li JJ. An accurate and robust imputation method scimpute for singlecell RNAseq data. Nat Commun. 2018; 9(1):997.
 23
Miao Z, Li J, Zhang X. screcover: discriminating true and false zeros in singlecell RNAseq data for imputation. bioRxiv. 2019;:665323.
 24
Chen M, Zhou X. Viper: variabilitypreserving imputation for accurate gene expression recovery in singlecell RNA sequencing studies. Genome Biol. 2018; 19(1):196.
 25
Gong W, Kwak I. Y., Pota P, KoyanoNakagawa N, Garry DJ. Drimpute: imputing dropout events in single cell RNA sequencing data. BMC Bioinformatics. 2018; 19(1):220.
 26
Van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, Burdziak C, Moon KR, Chaffer CL, Pattabiraman D, et al. Recovering gene interactions from singlecell data using data diffusion. Cell. 2018; 174(3):716–29.
 27
Wagner F, Yan Y, Yanai I. Knearest neighbor smoothing for highthroughput singlecell RNAseq data. bioRxiv. 2017;:217737.
 28
Talwar D, Mongia A, Sengupta D, Majumdar A. Autoimpute: autoencoder based imputation of singlecell RNAseq data. Sci Rep. 2018; 8(1):16329.
 29
Eraslan G, Simon LM, Mircea M, Mueller NS, Theis FJ. Singlecell RNAseq denoising using a deep count autoencoder. Nat Commun. 2019; 10(1):390.
 30
Arisdakessian C, Poirion O, Yunits B, Zhu X, Garmire LX. Deepimpute: an accurate, fast, and scalable deep neural network method to impute singlecell RNAseq data. Genome Biol. 2019; 20(1):1–14.
 31
Amodio M, Van Dijk D, Srinivasan K, Chen WS, Mohsen H, Moon KR, Campbell A, Zhao Y, Wang X, Venkataswamy M, et al.Exploring singlecell data with deep multitasking neural networks. Nat Methods. 2019; 7:1–7.
 32
Deng Y, Bao F, Dai Q, Wu LF, Altschuler SJ. Scalable analysis of celltype composition from singlecell transcriptomics using deep recurrent learning. Nat Methods. 2019; 16(4):311.
 33
Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for singlecell transcriptomics. Nat Methods. 2018; 15(12):1053.
 34
Linderman GC, Zhao J, Kluger Y. Zeropreserving imputation of scRNAseq data using lowrank approximation. bioRxiv. 2018;:397588.
 35
Mongia A, Sengupta D, Majumdar A. Mcimpute: matrix completion based imputation for single cell RNAseq data. Front Genet. 2019; 10:9.
 36
Zhang L, Zhang S. Pblr: an accurate single cell RNAseq data imputation tool considering cell heterogeneity and prior expression level of dropouts. bioRxiv. 2018;:379883.
 37
Lun ATL, Bach K, Marioni JC. Pooling across cells to normalize singlecell RNA sequencing data with many zero counts. Genome Biol. 2016; 17:75. https://doi.org/10.1186/s1305901609477.
 38
Lun A, McCarthy D, Marioni J. A stepbystep workflow for lowlevel analysis of singlecell RNAseq data with Bioconductor [version 2; peer review: 3 approved, 2 approved with reservations]. F1000Research. 2016; 5(2122). https://doi.org/10.12688/f1000research.9501.2.
 39
Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oleś AK, Pagès H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L, Morgan M. Orchestrating highthroughput genomic analysis with Bioconductor. Nat Methods. 2015; 12(2):115–21. https://doi.org/10.1038/nmeth.3252.
 40
Amezquita RA, Carey VJ, Carpp LN, Geistlinger L, Lun ATL, Marini F, RueAlbrecht K, Risso D, Soneson C, Waldron L, Pagès H, Smith M, Huber W, Morgan M, Gottardo R, Hicks SC. Orchestrating singlecell analysis with Bioconductor. bioRxiv. 2019:590562. https://doi.org/10.1101/590562.
 41
Spearman C. The proof and measurement of association between two things In: Jenkins JJ, Paterson DG, editors. Studies in individual differences: the search for intelligence. United States: Appleton Century Crofts: 1961.
 42
Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for singlecell RNASeq based on a multinomial model. Genome Biol. 2019; 20(1):295. https://doi.org/10.1186/s1305901918616.
 43
Hafemeister C, Satija R. Normalization and variance stabilization of singlecell RNAseq data using regularized negative binomial regression. Genome Biol. 2019; 20(1):1–15.
 44
Vieth B, Ziegenhain C, Parekh S, Enard W, Hellmann I. powsimr: power analysis for bulk and single cell RNAseq experiments. Bioinformatics. 2017; 33(21):3486–8. https://doi.org/10.1093/bioinformatics/btx435.
 45
Svensson V. Droplet scRNAseq is not zeroinflated. Nat Biotechnol. 2020; 38(2):147–50. https://doi.org/10.1038/s4158701903795.
 46
Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, Slichter CK, Miller HW, McElrath MJ, Prlic M, et al. Mast: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in singlecell RNA sequencing data. Genome Biol. 2015; 16(1):278.
 47
Bauer DF. Constructing confidence sets using rank statistics. J Am Stat Assoc. 1972; 67(339):687–90.
 48
Ye C, Speed TP, Salim A. DECENT: differential expression with capture efficiency adjustmeNT for singlecell RNAseq data. Bioinformatics. 2019. https://doi.org/10.1093/bioinformatics/btz453.
 49
Lloyd S. Least squares quantization in pcm. IEEE Trans Inf Theory. 1982; 28(2):129–37.
 50
Xu C, Su Z. Identification of cell types from singlecell transcriptomes using a novel clustering method. Bioinformatics. 2015; 31(12):1974–80.
 51
Hubert L, Arabie P. Comparing partitions. J Classif. 1985; 2(1):193–218.
 52
Leland McInnes JM, Healy J. UMAP Uniform Manifold Approximation and Projection for Dimension Reduction. 2018. https://arxiv.org/abs/1802.03426. Accessed date: 6 Dec 2018.
 53
Qiu X, Hill A, Packer J, Lin D, Ma Y. A., Trapnell C. Singlecell mRNA quantification and differential analysis with Census. Nat Methods. 2017; 14(3):309–15. https://doi.org/10.1038/nmeth.4150.
 54
Ji Z, Ji H. Tscan: pseudotime reconstruction and evaluation in singlecell RNAseq analysis. Nucleic Acids Res. 2016; 44(13):117.
 55
Buenrostro JD, Corces MR, Lareau CA, Wu B, Schep AN, Aryee MJ, Majeti R, Chang HY, Greenleaf WJ. Integrated singlecell analysis maps the continuous regulatory landscape of human hematopoietic differentiation. Cell. 2018; 173(6):1535–48.
 56
Batson J, Royer L, Webber J. Molecular crossvalidation for singlecell RNAseq. bioRxiv. 2019. https://doi.org/10.1101/786269. http://arxiv.org/abs/https://www.biorxiv.org/content/early/2019/09/30/786269.full.pdf.
 57
Baylin SB, Esteller M, Rountree MR, Bachman KE, Schuebel K, Herman JG. Aberrant patterns of dna methylation, chromatin formation and gene expression in cancer. Hum Mol Genet. 2001; 10(7):687–92.
 58
AbateShen C. Deregulated homeobox gene expression in cancer: cause or consequence?. Nat Rev Cancer. 2002; 2(10):777–85.
 59
Zhang L, Zhou W, Velculescu VE, Kern SE, Hruban RH, Hamilton SR, Vogelstein B, Kinzler KW. Gene expression profiles in normal and cancer cells. Science. 1997; 276(5316):1268–72.
 60
Yoe J, Kim D, Kim S, Lee Y. Capicua restricts cancer stem celllike properties in breast cancer cells. Oncogene. 2020; 39(17):3489–506.
 61
La Manno G, Soldatov R, Zeisel A, Braun E, Hochgerner H, Petukhov V, Lidschreiber K, Kastriti ME, Lönnerberg P, Furlan A, et al. RNA velocity of single cells. Nature. 2018; 560(7719):494–8.
 62
Svensson V, Pachter L. RNA velocity: molecular kinetics from singlecell RNAseq. Mol Cell. 2018; 72(1):7–9.
 63
Gorin G, Svensson V, Pachter L. RNA velocity and protein acceleration from singlecell multiomics experiments. Genome Biol. 2020; 21:Art–No.
 64
Burgess DJ. Spatial transcriptomics coming of age. Nat Rev Genet. 2019; 20(6):317.
 65
10x Genomics. Visium Spatial Gene Expression Solution. https://www.10xgenomics.com/solutions/spatialgeneexpression/. Accessed 23 Jan 2020.
 66
Rodriques SG, Stickels RR, Goeva A, Martin CA, Murray E, Vanderburg CR, Welch J, Chen LM, Chen F, Macosko EZ. Slideseq: a scalable technology for measuring genomewide expression at high spatial resolution. Science. 2019; 363(6434):1463–7.
 67
Vickovic S, Eraslan G, Salmén F, Klughammer J, Stenbeck L, Schapiro D, Äijö T, Bonneau R, Bergenstråhle L, Navarro JF, et al.Highdefinition spatial transcriptomics for in situ tissue profiling. Nat Methods. 2019; 16(10):987–90.
 68
Moncada R, Wagner F, Chiodin M, Devlin JC, Baron M, Hajdu CH, Simeone DM, Yanai I. Integrating singlecell RNAseq with spatial transcriptomics in pancreatic ductal adenocarcinoma using multimodal intersection analysis. bioRxiv. 2019;:254375.
 69
Holik AZ, Law CW, Liu R, Wang Z, Wang W, Ahn J, AsselinLabat ML, Smyth GK, Ritchie ME. RNAseq mixology: designing realistic control experiments to compare protocols and analysis methods. Nucleic Acids Res. 2016; 45(5):30.
 70
Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M, Lönnerberg P, Linnarsson S. Quantitative singlecell RNAseq with unique molecular identifiers. Nat Methods. 2014; 11(2):163.
 71
Consortium EP, et al.The encode (encyclopedia of dna elements) project. Science. 2004; 306(5696):636–40.
 72
Zaitsev K, Bambouskova M, Swain A, Artyomov MN. Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures. Nat Commun. 2019; 10(1):2209.
 73
Corces MR, Buenrostro JD, Wu B, Greenside PG, Chan SM, Koenig JL, Snyder MP, Pritchard JK, Kundaje A, Greenleaf WJ, et al.Lineagespecific and singlecell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nat Genet. 2016; 48(10):1193–203.
 74
Pliner HA, Packer JS, McFalineFigueroa JL, Cusanovich DA, Daza RM, Aghamirzaie D, Srivatsan S, Qiu X, Jackson D, Minkina A, et al. Cicero predicts cisregulatory DNA interactions from singlecell chromatin accessibility data. Mol Cell. 2018; 71(5):858–71.
 75
Zhou W, Ji Z, Fang W, Ji H. Global prediction of chromatin accessibility using smallcellnumber and singlecell RNAseq. Nucleic Acids Res. 2019; 47(19):121.
 76
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995; 57(1):289–300.
 77
Smyth GK. Limma: linear models for microarray data. In: Bioinformatics and Computational Biology Solutions Using R and Bioconductor. China: Springer: 2005. p. 397–20.
 78
Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics. 2005; 21(20):3940–1.
 79
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exp. 2008; 2008(10):10008.
 80
Csardi G, Nepusz T, et al.The igraph software package for complex network research. InterJournal, Complex Syst. 2006; 1695(5):1–9.
 81
Scrucca L, Fop M, Murphy TB, Raftery AE. mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J. 2016; 8(1):205–33.
 82
Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987; 20:53–65.
 83
Mao Q, Wang L, Goodison S, Sun Y. Dimensionality reduction via graph structure learning. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Sydney NSW Australia. New York: Association for Computing Machinery: 2015. p. 765–74.
 84
Hou W, Ji Z, Ji H, Hicks SC. A systematic evaluation of singlecell RNAsequencing imputation methods. 2019. https://github.com/Winnie09/imputationBenchmark. Accessed date: 30 Jul 2020.
 85
Hou W, Ji Z, Ji H, Hicks SC. A systematic evaluation of singlecell RNAsequencing imputation methods. 2020. https://doi.org/10.5281/zenodo.3967825.
 86
Wickham H. Ggplot2: elegant graphics for data analysis. Use R!Switzerland: Springer; 2016.
Acknowledgements
We would like to thank the Maryland Advanced Research Computing Center (MARCC) for providing computing resources.
Review history
The review history is available as Additional file 7.
Funding
This work is supported by the National Institutes of Health grants R01HG010889 and R01HG009518 to HJ and R00HG009007 to SCH. SCH is also supported by CZF2019002443 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. WH and SCH are supported by Alex’s Lemonade Stand Foundation.
Author information
Affiliations
Contributions
Authors’ contributions
All authors conceptualized and contributed to the main aims of the project. WH and ZJ performed the benchmark data analyses. All authors wrote and approved the final manuscript.
Authors’ information
Twitter handles: @HWenpin (Wenpin Hou), ZhichengJi (Zhicheng Ji), @jihk99 (Hongkai Ji), @stephaniehicks (Stephanie C. Hicks)
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Supplementary Figures. Supplementary Figures S1S11. (PDF 17.4 MB).
Additional file 2
Supplementary Table S1. Comparison of our benchmark with other benchmark analyses involving imputation methods. (XLSX 11KB)
Additional file 3
Supplementary Table S2. Table of the Spearman correlation coefficients (SCC) shown in Fig. 2d, h filled circles between the number of cells and the correlation scores. We used the Spearman correlation coefficients (SCC) of an imputation method (values in heatmap Fig. 2d) and the number of cells in the columns as inputs to calculate the Spearman correlation between the number of cells and the scores. (CSV 1 KB)
Additional file 4
Supplementary Table S3. Summary of all datasets used in each evaluation of this benchmark. The table includes the names, protocols, source (GEO accession numbers or links to download) and cell details of each dataset. (XLSX 16 KB)
Additional file 5
Supplementary Table S4. Summary of all scRNAseq imputation methods used in each evaluation of this benchmark. The table includes the name of the method, input, output, preprocessing steps for each method that we applied, the programming language, assumptions about the method, the download date, software version number, and link to software package. (XLSX 15KB)
Additional file 6
Supplementary Table S5. Values of all three efficient measures in time, memory and scalability using all four datasets. The table include the computation time and memory of four datasets with 10^{3},5×10^{3},5×10^{4},10^{5} cells for all imputation methods. Scalability is the coefficient of the cell number of each dataset in the linear model where the number of cells on the log10scale is fitted against the computation time. (CSV 3 KB)
Additional file 7
Review history.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Hou, W., Ji, Z., Ji, H. et al. A systematic evaluation of singlecell RNAsequencing imputation methods. Genome Biol 21, 218 (2020). https://doi.org/10.1186/s1305902002132x
Received:
Accepted:
Published:
Keywords
 Gene expression
 Singlecell RNAsequencing
 Imputation
 Benchmark