Comprehensive benchmarking of 14 methods on ten datasets using five evaluation metrics
We evaluated the performance of batch-correction algorithms in terms of their ability to integrate batches while maintaining cell type separation (Fig. 1). The algorithms tested are available in either the R or Python language environment. For Seurat 2, Harmony, MNN Correct, fastMNN, and limma, the data preprocessing steps of normalization, scaling, and highly variable gene (HVG) selection were performed using the Seurat 2 package. For Seurat 3 batch correction, the corresponding functions in that package were used. For the other methods, their respective recommended preprocessing pipelines were employed.
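The shared preprocessing steps can be sketched as follows. This is a minimal numpy-only illustration of library-size normalization, log transformation, and per-gene scaling; the actual runs used the Seurat packages and each method's recommended pipeline, and `target_sum` is an illustrative parameter, not a value from this study:

```python
import numpy as np

def preprocess(counts, target_sum=1e4):
    """Sketch of standard scRNA-seq preprocessing on a cells x genes
    count matrix: library-size normalize, log-transform, then z-score
    each gene. Packages such as Seurat wrap equivalents of these steps."""
    libsize = counts.sum(axis=1, keepdims=True)
    norm = counts / np.maximum(libsize, 1) * target_sum  # normalize per cell
    logged = np.log1p(norm)                              # log(1 + x) transform
    mu = logged.mean(axis=0)
    sd = logged.std(axis=0)
    return (logged - mu) / np.maximum(sd, 1e-12)         # scale per gene
```

HVG selection (discussed further in the simulation scenario) is typically applied between the log and scaling steps to restrict the matrix to informative genes.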
The algorithms were tested on ten datasets, covering a diverse spectrum of cell types such as dendritic cells, pancreatic cells, retinal cells, and peripheral blood mononuclear cells (PBMCs), with datasets from both human and mouse. The technologies used also span a broad range, including 10x, SMART-seq, Drop-seq, and SMARTer. Dataset details and sources can be found in Additional file 1: Table S1, while the breakdown of cell counts per cell type is in Additional file 2: Table S2. We organized the datasets into five relevant scenarios: identical cell types with different scRNA-seq protocols, non-identical cell types, multiple batches (> 2 batches), big datasets, and simulated data. Scenario 1 consisted of dataset 2 of murine tissues and dataset 5 of human PBMCs. For each dataset, all batches have the same cell types, though the proportions within each batch can vary significantly. The batches were sequenced with different technologies, thus increasing the difficulty of batch correction. Datasets 1, 6, 7, and 10 formed scenario 2, where batches within each dataset have non-identical cell types. The datasets of this scenario encompassed a range of different tissue samples: human dendritic cells, immortalized cell lines, mouse retina cells, and mouse hematopoietic cells. For scenario 3 of multiple batches, we used dataset 4, which contains five batches of human pancreatic cell expression data acquired using four scRNA-seq technologies. The distribution of cell types also differs significantly between batches. Scenario 4 was composed of datasets 8 and 9, where we tested the algorithms on big datasets with more than half a million cells each. In the final scenario, we used the Splatter package [19] to generate simulated datasets with different drop-out rates and unbalanced cell counts across batches. The aim of this scenario was to study the impact of batch correction on downstream differential gene expression analysis.
In particular, we were interested in improving the recovery of differentially expressed genes (DEGs) after batch correction. We quantified changes in the set of detected DEGs in terms of precision and recall, summarized by the F-score. Coverage of the different scenarios by the different datasets can be found in Additional file 3: Table S3.
To evaluate the batch correction results, we employed t-distributed stochastic neighbor embedding (t-SNE) [20] and uniform manifold approximation and projection (UMAP) [21] visualizations in conjunction with the k-nearest neighbor batch-effect test (kBET) [22], local inverse Simpson's index (LISI) [13, 23], average silhouette width (ASW) [24], and adjusted Rand index (ARI) [25] benchmarking metrics. The UMAP plots are shown here in the main text, while the t-SNE plots are in Additional file 4: Figures S1–S11. The kBET metric measures batch mixing at the local level using a predetermined number of nearest neighbors, selected around each data point by distance, to compute the local batch label distribution. A small proportion of local distributions deviating from the global batch label ratio (i.e., a low rejection rate) denotes good batch mixing. Here, we computed the kBET rejection rates for local sample sizes of 5%, 10%, 15%, 20%, and 25% to ensure that the assessment was not subject to the choice of sample size. The LISI metric can be used to measure the local cell type distribution (cLISI) or batch distribution (iLISI), based on local neighbors chosen at a preselected perplexity. Here, we used a perplexity of 40. Using the selected neighbors of a cell, the LISI was then computed on the local cell type and batch labels for the cLISI and iLISI indices, respectively. For comparison between methods, we took the median of the scores computed for all cells in the dataset and scaled it such that 0 and 1 denote the worst and best possible scores, respectively. We also used the ASW metric to assess batch mixing and cell type purity preservation. For ease of comparison, we plotted the scores as 1-ASWbatch and ASWcell type, such that a higher value denotes better performance. Similarly, we computed and plotted the ARI scores in the same fashion, as 1-ARIbatch and ARIcell type.
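As an illustration of how the LISI computation works, the following is a simplified sketch: for each cell, the label distribution among its k nearest neighbors is turned into an inverse Simpson's index, and the median is taken across cells. Note that the published metric weights neighbors with a perplexity-based Gaussian kernel rather than uniformly, so this unweighted version with a fixed `n_neighbors` is only an approximation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def simple_lisi(embedding, labels, n_neighbors=40):
    """Simplified LISI: the inverse Simpson's index of each cell's
    k-nearest-neighbor label distribution, summarized by the median.
    Near 1 = locally one label; near the number of labels = well mixed."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(embedding)
    _, idx = nn.kneighbors(embedding)
    idx = idx[:, 1:]                          # drop each cell itself
    labels = np.asarray(labels)
    scores = np.empty(len(labels))
    for i, neigh in enumerate(idx):
        _, counts = np.unique(labels[neigh], return_counts=True)
        p = counts / counts.sum()
        scores[i] = 1.0 / np.sum(p ** 2)      # inverse Simpson's index
    return np.median(scores)
```

Passing batch labels yields an iLISI-like score (higher is better mixing), while cell type labels yield a cLISI-like score (lower is better purity); rescaling by (score - 1)/(number of labels - 1) maps it onto a 0-to-1 range.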
To compute the ARI scores, k-means clustering was first performed to obtain cluster labels for comparison against batch labels and cell type labels to obtain the ARIbatch and ARIcell type scores respectively. All the batch mixing indices were computed for common cell types only, while all cells were used for cell type purity assessment. The computed values of benchmarking metrics can be found in Additional file 5: Table S4, while the statistical tests for significance are in Additional file 6: Table S5. To summarize these metrics, we summed the ranks of each method across all metrics to obtain a rank sum that was used to sort the methods. The rank sums are summarized in Fig. 21a, with details in Additional file 8: Table S7.
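A minimal sketch of this ARI evaluation and the rank-sum summary is shown below. The score table is a hypothetical example (higher = better for every entry), and ties in ranking are broken arbitrarily here, which the actual benchmark may handle differently:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def ari_scores(embedding, batch, cell_type, n_clusters, seed=0):
    """k-means cluster labels compared against batch and cell type labels,
    reported as 1 - ARI_batch (mixing) and ARI_celltype (purity)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    clusters = km.fit_predict(embedding)
    return (1 - adjusted_rand_score(batch, clusters),
            adjusted_rand_score(cell_type, clusters))

def rank_sum(score_table):
    """score_table: dict metric -> {method: score}, higher = better.
    Rank methods per metric (1 = best) and sum the ranks per method."""
    methods = sorted(next(iter(score_table.values())))
    total = {m: 0 for m in methods}
    for scores in score_table.values():
        order = sorted(methods, key=lambda m: -scores[m])
        for rank, m in enumerate(order, start=1):
            total[m] += rank
    return total
```

Methods with the smallest rank sum are ranked best overall, mirroring the sorting described above.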
Scenario 1: identical cell types, different technologies
In this scenario, we tested the methods against datasets that contain the same cell types across all batches. However, different technologies were used to acquire the scRNA-seq data for each batch. For dataset 2, the visualization plots show that Seurat 2, Seurat 3, Harmony, fastMNN, MNN Correct, scGen, Scanorama, scMerge, and LIGER successfully mixed the common cells (Fig. 2). There was minimal cell type mixing, except for the mixing of NK and T cells, which may be attributed to the gene expression similarities of these cell types [26]. ComBat, limma, MMD-ResNet, ZINB-WaVE, and BBKNN were able to bring similar cell types across batches close, but with little to no mixing.
Comparing the iLISI scores, scMerge was the top method for batch mixing, and LIGER was a close second (p = 0.015) (Fig. 3). All methods gave good cLISI scores (1-cLISI > 0.96), which is congruent with the visualizations. For kBET, Harmony was top for batch mixing, followed by LIGER and scGen (p < 0.001). Using the ASW assessment, Seurat 3 and Harmony were the best methods in balancing between performance in batch and cell type, though all other methods also obtained good scores in batch mixing (1-ASWbatch > 0.9). In the ARI scores for batch mixing, all methods scored greater than 0.9, with Harmony obtaining the best ARI cell type score of 0.67 (p < 0.001) and an ARI batch score of 0.97. In most metrics, Harmony ranked high, and unsurprisingly, it was also the best method based on the rank sum, with MNN Correct and Seurat 3 tied at second place.
In dataset 5, there are two pairs of similar cell types: CD4 and CD8 T cells, and CD14 and FCGR3A monocytes. None of the methods produced distinct clusters of CD14 and FCGR3A, or of CD4 and CD8, in the visualization plots; the FCGR3A cells invariably formed a sub-cluster attached to the CD14 cluster, while CD8 cells formed sub-clusters around CD4 cells (Fig. 4). Seurat 2, Seurat 3, Harmony, fastMNN, and MNN Correct evenly mixed the batches with minimal mixing between CD4 and CD8 sub-clusters. In these cases, some separation of the CD4 and CD8 sub-clusters is visible, especially in the t-SNE plot (Additional file 4: Figure S2). scGen, MMD-ResNet, and LIGER also evenly mixed the batches, but with greater mixing of CD4 and CD8 cells. Scanorama, ZINB-WaVE, and scMerge not only mixed the CD4 and CD8 cells, but also achieved poorer overall batch mixing. Finally, ComBat, limma, and BBKNN brought the batches close but did not mix them.
Using the cLISI metric, most methods had good cell type purity scores of greater than 0.98 (Fig. 5). As the metric only measures local cell purity, the mixing at the edges of cell type-specific sub-clusters was poorly captured. This resulted in methods with high cLISI scores despite the mixing of CD4 and CD8 cells in the visualization plots. In terms of batch mixing (iLISI), LIGER was top (p < 0.001), followed by Seurat 2 and Seurat 3. The computed kBET scores also showed LIGER as the top method with Seurat 2 a near second, while Seurat 3 was third for batch mixing (p < 0.001). In terms of the ASW metrics, the batch mixing scores were greater than 0.95 for all methods, while Harmony and Seurat 3 were top in cell type purity (p = 0.183), followed by MNN Correct. Similarly with ARI, Harmony was the best method in terms of cell type purity, with fastMNN, Seurat 3, and MNN Correct next best (p < 0.13). These four methods also had high ARIbatch scores of greater than 0.97. Using the rank sum, Harmony and Seurat 3 were tied as the best methods overall, with LIGER in third place.
For both datasets, Harmony was the top method, with Seurat 3 tied for second on dataset 2 and tied for first on dataset 5. Based on these results, both methods are highly recommended for datasets with common cell types. Though LIGER was only ranked third for dataset 5 and tied at fourth place with fastMNN for dataset 2, it was a consistent performer and thus also a competitive method worth considering.
Scenario 2: non-identical cell types
Dataset 1 poses an interesting challenge to batch correction algorithms due to the presence of two highly similar cell types in dissimilar batches. Examination of the visualization plots shows that most methods were able to mix both batches together (Fig. 6). limma brought cell clusters of both batches close but did not achieve mixing, while MMD-ResNet and BBKNN did not bring any cell clusters of common type closer. scGen, Harmony, LIGER, and scMerge were able to integrate double negative and pDC cells from batches 1 and 2, while keeping CD141 and CD1C cells in separate clusters, with minimal mixing of CD1C, CD141, and double negative cells. The remaining methods produced higher levels of cell type mixing; MNN Correct, fastMNN, Seurat 3, Seurat 2, and ZINB-WaVE produced single well-mixed clusters of CD141 and CD1C cells, while ComBat and Scanorama brought CD1C and CD141 cells close, which would be hard to distinguish as different cell types in the case of unlabeled experimental data. ZINB-WaVE produced a large loose cluster in both the t-SNE and UMAP plots. While cell labels show segregation into sub-clusters, these sub-clusters cannot be easily discerned visually without labels.
Most batch correction algorithms require at least one identical cell type to be shared between any pair of data batches to guide the data alignment. MNN Correct [5], fastMNN [5, 7], Seurat 3 [12], and Scanorama [9] search for mutual nearest neighbors (MNNs) to find the shared populations between different datasets. When there are sub-populations that are not shared between batches, false matching of MNNs and incorrect alignment can occur, especially if there are cell types that are highly similar. This appears to be the reason why many methods clustered CD1C and CD141 cells together. The same issue has also been reported with Seurat 3 [12], where the researchers identified a small number of incorrect matches.
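To make this failure mode concrete, here is a toy sketch of the MNN search these methods build on (a simplified stand-in, not any package's actual implementation): cells i and j from different batches are paired when each lies among the other's k nearest neighbors, so a cell type absent from one batch can be forced to pair with its closest look-alike in the other.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mutual_nearest_neighbors(batch1, batch2, k=5):
    """Return (i, j) pairs where cell i of batch1 is among the k nearest
    neighbors of cell j of batch2 and vice versa. Pairs joining cells of
    different true types are the false matches discussed above."""
    nn12 = NearestNeighbors(n_neighbors=k).fit(batch2)
    _, idx12 = nn12.kneighbors(batch1)   # for each batch1 cell, its NNs in batch2
    nn21 = NearestNeighbors(n_neighbors=k).fit(batch1)
    _, idx21 = nn21.kneighbors(batch2)   # for each batch2 cell, its NNs in batch1
    pairs = []
    for i, neigh in enumerate(idx12):
        for j in neigh:
            if i in idx21[j]:            # mutual condition
                pairs.append((i, int(j)))
    return pairs
```

When a sub-population exists in only one batch, its cells simply receive no mutual pairs in this sketch; methods that then anchor the alignment on the nearest available matches risk merging highly similar types such as CD1C and CD141.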
In terms of kBET scores, LIGER and Seurat 2 were the best at batch integration for this dataset (Fig. 7). For the iLISI metric, LIGER and Seurat 2 again achieved the highest scores. In terms of cLISI, most methods posted high scores (1-cLISI > 0.96), except for Seurat 2 and ZINB-WaVE. By the ASW metrics, LIGER was the leading method in both cell type purity and batch mixing (p < 0.001). Except for ZINB-WaVE and MMD-ResNet, the other methods gave excellent ASW batch integration scores (1-ASWbatch > 0.95). For the ARI assessment, most methods gave good batch mixing, with the exception of ZINB-WaVE, which was also the worst in terms of cell type purity. Using the rank sum of the metrics, fastMNN emerged as the best method, with LIGER and scMerge ranking second and third, respectively.
Dataset 6 contains only two cell types, with two out of the three batches containing only one cell type that is also only shared with the third batch. The t-SNE and UMAP plots show that scGen, scMerge, and BBKNN were able to produce two large cell type-specific clusters (293T and Jurkat) that were well mixed with cells from their respective batches (Fig. 8). Harmony also mixed the batches well, but with the Jurkat cells divided into two clusters. LIGER also produced two batch mixed clusters, but with some cell type mixing. In Seurat 2’s output, the batches were mixed, but the 293T and Jurkat cell clusters were too closely positioned to be easily separated visually. fastMNN, Scanorama, ZINB-WaVE, and MMD-ResNet batch mixed the 293T cells but not the Jurkat cells, while Seurat 3 only mixed the Jurkat cells. Finally, MNN Correct, ComBat, and limma incorrectly mixed the Jurkat and 293T cells from different batches.
The LISI metrics also indicate that Harmony, scMerge, and scGen were the best methods for this dataset in terms of batch integration and cell type purity (Fig. 9). Similarly, Harmony was top ranked in kBET, followed by scGen and Scanorama, despite the relatively poor batch mixing of Jurkat cells by Scanorama in the visualizations. Using the ASW metrics, Harmony was the leading method, followed by scMerge, scGen, and Scanorama, with Scanorama showing lower batch integration but higher cell type purity. With the ARI metrics, scGen, scMerge, ZINB-WaVE, Harmony, and Scanorama were the top methods. Based on the rank sum of the assessment metrics, Harmony was the top method, followed by Scanorama and scGen. These results are quite consistent across all metrics, which gives us confidence in our assessment of the methods. As BBKNN's output is a graph, the assessment metrics could not be computed for it. However, based on the UMAP visualization, we consider BBKNN to be a competitive method.
For dataset 7 (Fig. 10), the cell counts across cell types are highly uneven, with batch 1 predominantly made up of bipolar cells (88%) and smaller numbers of amacrine, Muller, cone, and rod cells. On the other hand, batch 2 contains more cell types, with 66% being rod cells (see Additional file 2: Table S2) and cell types such as ganglion, vascular endothelium, and horizontal cells that are not found in batch 1. From the visualization plots, the batch effect does not appear to be significant. In particular, the Muller and bipolar cells already show substantial mixing. Surprisingly, ComBat and limma instead separated the shared cell types, forming clusters split by both batch and cell type. ZINB-WaVE, scMerge, and MMD-ResNet clustered most of the cells into a single large cluster, albeit with intra-cluster segregation among the cell types. The remaining methods were largely able to mix the common cells while maintaining cell type purity among clusters, though Seurat 2 separated the Muller cells from both batches.
Using the LISI metrics to assess batch integration and cell type purity, LIGER was top for batch integration (p < 0.001) and the best method overall (Fig. 11). On the other hand, the computed kBET metric shows scMerge as the best for batch integration (p < 0.001). Meanwhile, the ASW metrics show no clearly superior method, with methods showing trade-offs between batch mixing and cell type purity scores. The metric also shows a high batch mixing score for ComBat, despite the lack of batch mixing in the visualization plots. In the case of ARI, trade-offs between batch and cell type metrics can likewise be seen among the methods, without a clearly superior method. Using the rank sum to combine the assessment metric results, LIGER was the top performing method, with MNN Correct second and scMerge third. For this dataset, it is difficult to determine a clearly superior method by visual inspection or evaluation metrics. While LIGER can be concluded to be the best based on the metrics, the visualizations of other methods such as Harmony, Seurat 3, and scGen suggest that these methods were also able to perform batch integration and preserve cell type purity, despite their fairly low rankings in the rank sum.
In dataset 10, batch 1 contains only CMP, GMP, and MEP cells, while batch 2 contains all cell types. Based on the cell type information, we can expect the clustering of cells to reflect their developmental lineage, with GMP and MEP forming their respective clusters that are also close to CMP cells [5]. In the visualization plots, Seurat 2, Seurat 3, Harmony, Scanorama, and LIGER were able to batch mix the GMP and MEP cells, with the expected mixing of other cell types (Fig. 12). fastMNN, MNN Correct, scGen, and BBKNN brought the GMP and MEP cells close with minimal batch mixing. ZINB-WaVE and scMerge moved the cells closer to form a big loose cluster, while ComBat and limma moved the batches closer but achieved no batch mixing. Finally, MMD-ResNet made almost no impact on the distribution of cells.
The computed iLISI metric shows that Seurat 2 was the best for batch mixing, followed closely by LIGER (p = 0.057), though LIGER's output is superior in cell type purity (cLISI) with a p value < 0.001 (Fig. 13). Harmony and Seurat 3 ranked third and fourth, respectively, for iLISI but with increasingly better cLISI scores. A similar trend can be seen in the kBET results, with LIGER top, followed by Seurat 2 and Harmony (p < 0.001). Based on the ASW metrics, Scanorama was the top method for both batch mixing and cell type purity (p < 0.001). The ARI scores gave very different results, with scGen as the top method for cell type purity (p < 0.001) while having a batch mixing score comparable to the other methods with high ARIbatch scores (> 0.9). Using the rank sum to summarize the evaluations, Harmony, Scanorama, and LIGER were the top methods for this dataset. While Harmony did not top any single metric, it was ranked second in three metrics (ASW, ARI, and LISI) and third for kBET; these results highlight Harmony's efficacy on this dataset.
In this scenario, we tested the batch correction methods on four diverse datasets. While no method was the best for all datasets, LIGER was one of the top three methods for three datasets (1, 7, 10), while scMerge was ranked third for datasets 1, 6, and 7. Harmony ranked first for datasets 6 and 10, while Scanorama ranked second for datasets 6 and 10. Based on these results, LIGER was the leading method in this scenario.
Scenario 3: multiple batches
This scenario tested batch correction with multiple batches. Dataset 4 consists of five batches of human pancreatic cells sequenced with four technologies. The t-SNE and UMAP plots show that Seurat 3, Harmony, scGen, and LIGER produced clusters that evenly mixed cells from different batches (Fig. 14). The batch mixing was less even for Seurat 2, fastMNN, Scanorama, ZINB-WaVE, scMerge, and BBKNN. The above methods also mixed the stellate and mesenchymal cells to varying extents, except for scGen, which can be attributed to the supervised nature of the method. Delta and gamma cells were also clustered close together by LIGER and Harmony, though better separation can be seen in Harmony's t-SNE plot (Additional file 4: Figure S7). MNN Correct, ComBat, limma, and MMD-ResNet brought cell-specific clusters from different batches close, but without significant batch mixing. The cell types were also broken up into multiple smaller clusters by these methods.
From the LISI metrics, the cell type purity of the method outputs was high (> 0.98), while Seurat 3 was top in batch integration (p < 0.001) (Fig. 15). Seurat 3 was also second in batch mixing by the kBET metric, on which LIGER was top (p < 0.001). Assessment by ASW showed that ZINB-WaVE ranked top for batch integration (p < 0.001), though the earlier visual analysis shows that it did not mix the batches well. The other methods show poorer ASW batch scores but higher ASW cell type scores, with Scanorama the best (p < 0.001). All methods received high ARI batch integration scores (> 0.85), despite the lack of batch mixing by methods such as ComBat and limma. However, the leading methods (score > 0.99) of Seurat 3, Harmony, scGen, and LIGER were largely the same as those identified with iLISI. For the ARI cell type assessment, ZINB-WaVE received the highest score, though not significantly different (p value > 0.05) from the other methods except MMD-ResNet and limma. Using the rank sum to summarize the metrics, Seurat 3 was the leading method, followed by scGen and scMerge. Returning to the t-SNE and UMAP visualizations, we find them consistent with these rankings. While dataset 6 was analyzed in scenario 2, it also contains more than two batches and is therefore relevant to this scenario. For dataset 6, the top three methods were Harmony, Scanorama, and scGen, with scMerge ranked fourth. Taking the ranking results of both datasets into consideration, for datasets with labeled cell types, scGen is recommended based on its performance. For unlabeled data, Seurat 3 and Harmony are the recommended choices.
Scenario 4: big data
In this scenario, we tested the methods on two large datasets with more than half a million cells each. BBKNN, ComBat, Harmony, LIGER, limma, MMD-ResNet, Scanorama, scGen, Seurat 3, and ZINB-WaVE were able to complete runs on the full datasets. The remaining four methods were unable to complete the batch correction runs due to memory or runtime requirements. fastMNN, scMerge, and Seurat 2 required more than 2.27 TB of memory and caused excessive disk thrashing from virtual memory usage, while MNN Correct was able to run but took too long (> 48 h). The metrics and visualizations shown were computed on the 10% downsampled data; however, our analysis here focuses on the methods that could complete running on the large datasets.
Dataset 8 consists of two batches of murine brain data acquired using different technologies (Fig. 16). The cell numbers are unevenly distributed across cell types, and the bulk of the cells in batch 2 consists of astrocytes, neurons, oligodendrocytes, and polydendrocytes. Only LIGER appears to have maintained relatively good cell type separation while achieving batch mixing. Seurat 3, Harmony, ZINB-WaVE, scGen, and MMD-ResNet produced comparatively less even batch mixing. ComBat, limma, Scanorama, and BBKNN fared even worse, with little to no batch mixing. Employing the LISI metrics, all methods maintained high local cell type purity with good cLISI scores (1-cLISI > 0.8) (Fig. 17). LIGER and Seurat 2 achieved the highest iLISI scores (p value = 0.057), followed by Harmony and Seurat 3 (p < 0.001). Within the LISI metrics, we can see some trade-off between batch integration and cell type purity among LIGER, Harmony, and Seurat 3. Surprisingly, the kBET metric shows very different results, with none of these methods achieving a good score. Instead, fastMNN, scMerge, and MNN Correct were the top three methods by kBET. The ASW metrics also paint a substantially different picture, with ZINB-WaVE being the best in batch mixing, though most methods showed high batch mixing scores (1-ASWbatch > 0.93) as well, while Harmony produced the highest cell type purity (p < 0.001). The ARI results also differ: most methods were able to produce high batch mixing scores of greater than 0.95 (except for limma). Overall, scGen was the best method by ARI, topping batch mixing (p < 0.001) and tying with LIGER for cell type purity (p = 0.34). In terms of ARI cell type purity, scGen and LIGER were followed by Harmony. Combining the rankings across all metrics with the rank sum, Seurat 3 ranked first, followed by scGen and Seurat 2.
Dataset 9 is composed of two data batches, each derived from a different tissue. Due to the lack of available cell type information, we could only assess batch mixing capabilities. From the plots, the raw data already shows substantial mixing. Visually, most methods were able to evenly mix the batches, except for scMerge, limma, and Scanorama (Fig. 18). ComBat improved the mixing slightly, but limma separated the batches instead. Harmony, LIGER, MMD-ResNet, Seurat 3, and ZINB-WaVE produced iLISI scores above 0.45 (p < 0.05), significantly better than the other methods (Fig. 19). In terms of the kBET scores, MMD-ResNet was consistently better than all other methods at all sample sizes (p < 0.001). Batch mixing measured by ASW shows that most methods achieved better mixing than the raw data, except for limma, scMerge, fastMNN, Scanorama, and Seurat 3. The ASWbatch score for Seurat 3 is surprisingly poorer than those of competing methods, given its visualization plots and its rankings by the other metrics. In the ARI batch mixing assessment, most methods obtained high scores (> 0.9). Surprisingly, limma had a high score (> 0.99), comparable to the other leading methods, while its performance measured by the other metrics was poor. Summarizing the various metrics, the computed rank sum showed LIGER as the top method, with ZINB-WaVE ranked second and MMD-ResNet third.
Among the methods that ranked in the top three, LIGER and Seurat 3 ranked top for one dataset each. Visual examination of the plots also supports this result, thus making both methods recommended for large datasets.
Scenario 5: simulation
One of the uses of batch integration is to obtain a corrected gene expression matrix for downstream analysis. This is illustrated by several recent publications that used batch-effect-corrected matrices for analyses such as pseudotime trajectory analysis [9, 18] and DEG analysis [5, 12]. However, these studies compared only a small number of batch-effect removal methods and test cases. Herein, we performed a comprehensive evaluation of eight methods that return batch-corrected expression matrices. We evaluated the methods in terms of detecting differential gene expression, using the DEG analysis workflow shown in Fig. 20a. Six different use cases were designed using the Splatter simulation package [19] in the R language environment to cover different scenarios of cell population sizes and drop-out rates, including small and large drop-out rates, balanced and unbalanced cell counts in two batches, and also the case of small cell numbers in one batch. The simulation details are shown in Fig. 20b. The advantage of employing simulation data generated with Splatter is that the true set of DEGs is known, thus enabling us to study the impact of batch correction on the fraction of true DEGs recovered.
For this study, we only considered the methods that produce a batch-effect-corrected gene expression matrix: Seurat 3, MNN Correct, ComBat, limma, scGen, Scanorama, ZINB-WaVE, and scMerge. The simulated gene expression matrices with batch effects were used as input to the batch-effect correction methods. We tested the batch-effect removal methods with two cases for each expression matrix, using all genes, and only highly variable genes (HVGs). DEG detection was performed on the batch-corrected matrices. We first detected DEGs between cell type “Group 1” versus cell type “Group 2” and computed the F-score of upregulated and downregulated genes in “Group 1.” We then calculated the accuracy metrics of true positive (TP), false positive (FP), false negative (FN), and F-score to examine the accuracy of DEGs obtained from each corrected matrix, as compared to the true DEGs. The F-scores are summarized in Fig. 20c.
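The accuracy computation can be sketched compactly as follows, using a per-gene Wilcoxon rank-sum test as an illustrative DEG caller; the actual workflow is the one shown in Fig. 20a, and the test choice and `alpha` threshold here are assumptions for the sketch:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def deg_f_score(expr, groups, true_deg_mask, alpha=0.05):
    """Call DEGs between two cell groups with a per-gene Wilcoxon
    rank-sum test, then score the detected set against the known truth
    via TP/FP/FN and the F-score (harmonic mean of precision and recall)."""
    g1, g2 = expr[groups == 0], expr[groups == 1]
    pvals = np.array([
        mannwhitneyu(g1[:, j], g2[:, j], alternative="two-sided").pvalue
        for j in range(expr.shape[1])
    ])
    detected = pvals < alpha
    tp = np.sum(detected & true_deg_mask)
    fp = np.sum(detected & ~true_deg_mask)
    fn = np.sum(~detected & true_deg_mask)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)
```

Running this on the raw matrix and on each corrected matrix allows the before/after comparison of F-scores summarized in Fig. 20c.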
As seen in Fig. 20c, the median F-scores of MNN Correct, ComBat, and limma were 0.73, 0.71, and 0.76 for upregulated genes in the case of all genes, and 0.94, 0.91, and 0.94 for upregulated genes in the case of HVGs. Statistically, there was no performance difference between these methods (Wilcoxon p value > 0.05, Additional file 7: Table S6B). We see a similar trend in the F-score results for downregulated genes in the case of all genes. However, if we considered only downregulated HVGs, limma was not able to remove batch effects, resulting in a significantly lower F-score than MNN Correct and ComBat (Wilcoxon p value < 0.01, Fig. 20c, Additional file 7: Table S6E). The median F-scores of ZINB-WaVE and scMerge were 0.71 and 0.70 for upregulated genes (all genes), and 0.96 and 0.9 for upregulated genes (HVGs), with no statistically significant differences (Wilcoxon p value > 0.05, Additional file 7: Table S6B). Finally, in the case of Seurat 3, scGen, and Scanorama, the median F-scores were 0.42, 0.29, and 0.17 for upregulated genes (all genes), and 0.92, 0.88, and 0.22 for upregulated genes (HVGs). Statistical tests showed no significant difference between Seurat 3 and scGen, but there were significant differences between the results of Seurat 3 and Scanorama (Wilcoxon p value < 0.05), and between scGen and Scanorama (Wilcoxon p value < 0.01). In particular, Scanorama’s F-score was lower than that of the raw data, implying that the method removed most of the cell type variation between “Group 1” and “Group 2.” This is a crucial point, as the goal of batch correction is to remove variation due to data acquisition under different conditions and technologies while preserving variation of biological origin. Accordingly, DEG analysis applied to properly corrected data should be more accurate than DEG analysis applied to the raw data.
For most methods, the F-scores of the batch-corrected data were higher than those of the raw data, except for scGen and Scanorama. This shows that the batch-effect removal methods were effective in removing the technical variance between data batches of our simulated data. Based on the F-scores, MNN Correct, ZINB-WaVE, ComBat, and scMerge were the top performing methods (Fig. 20c, Additional file 7: Table S6).
The proportion of correctly detected upregulated genes was higher for the HVG set than for the all-genes set. This is unsurprising, since HVGs are more likely to retain their cell type variation than genes with lower variability. Therefore, applying batch-effect correction to selected HVGs gives better precision, but restricts the analysis to those genes. If it is necessary to perform batch correction on the full gene set, then one should be wary of erroneous conclusions that may arise from the false positives and false negatives.
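As an illustrative sketch of the HVG selection step, dispersion-based ranking is one common criterion; the actual pipelines use their respective packages' own HVG routines, so this numpy-only version is a stand-in:

```python
import numpy as np

def top_hvgs(expr, n_top=2000):
    """Rank genes of a cells x genes matrix by dispersion
    (variance / mean) and return the indices of the top n_top genes."""
    mean = expr.mean(axis=0)
    disp = expr.var(axis=0) / np.maximum(mean, 1e-12)
    return np.argsort(disp)[::-1][:n_top]
```

Restricting the expression matrix to these indices before correction reproduces the HVG case of the comparison above.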
Next, we examined the accuracy metrics of the methods (Additional file 7: Table S6). MNN Correct, ComBat, and limma produced corrected matrices with high TP and low FP counts in the DEG analysis. They were followed by ZINB-WaVE and scMerge, with slightly lower TP. The TP and FP counts were lower still for Seurat 3, while scGen and Scanorama gave the worst results, with the lowest TP and highest FP. Based on these metrics, ComBat, limma, and MNN Correct were the best methods, followed by ZINB-WaVE and scMerge. Conversely, scGen and Scanorama were the worst performing methods, with Scanorama removing biologically important gene expression differences between cell groups.
When creating the test cases, we used different drop-out rates and sample sizes. As seen in Fig. 20 and Additional file 7: Table S6, with the same batch size and different drop-out rates (case 1 compared to 2, and case 3 compared to 4), the TP and FP counts were on average similar between high and low drop-out rates, except for scGen. For scGen, there was a large difference in the FP count between low and high drop-out rates, especially for the downregulated genes (FP = 722 in case 3 vs FP = 2763 in case 4, and FP = 92 in case 5 vs FP = 1926 in case 6). Therefore, the drop-out rate had minimal impact on most methods. We also observed slightly higher TP and FP counts when there was an unbalanced number of cells in the batches (500 cells in batch 1 and 900 cells in batch 2, Fig. 20b), as opposed to balanced cell numbers (500 cells in batch 1 and 450 cells in batch 2, Fig. 20b, Additional file 7: Table S6). However, with a greater imbalance in cell numbers (simulations 5 and 6, with 80 cells in batch 1 and 400 cells in batch 2), the number of TP was instead significantly lower compared to the other simulations. Overall, within the tested range of drop-out rates and cell counts, the drop-out rate had little impact on the results (TP and FP), while the effect of cell count balance was mixed. Another noteworthy observation is that scGen was more prone to FP errors than the other methods.
In conclusion, ComBat, MNN Correct, ZINB-WaVE, and scMerge are the recommended methods to obtain a batch-effect-corrected matrix for downstream analysis.