Datasets used for performance evaluation
For consistency, in this paper, an individual dataset will be referred to as a “batch.” Multiple batches investigating similar biological problems will be referred to as a “dataset.” We applied BERMUDA to simulated datasets, human pancreas cell datasets, and PBMC datasets to assess its performance (Table 1). We used two methods to generate simulated datasets for evaluating the performance of BERMUDA. For the 2D Gaussian dataset, we followed [20] to generate synthetic data with batch effects. We simulated two batches of four different cell types according to different bivariate normal distributions in a two-dimensional biological subspace. The cell population composition of each batch was generated randomly. We then randomly projected the data to a 100-dimensional space to simulate high-dimensional gene expression data. Gene-specific batch effects were generated by adding Gaussian noise to the high-dimensional data. For the Splatter dataset, we used the Splatter [31] package to simulate RNA-seq counts of two batches with four different cell populations. Splatter can directly simulate multiple batches with similar cell type compositions at the same time. We set the proportions of the four simulated cell types to 0.4, 0.3, 0.2, and 0.1.
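As an illustration, a minimal numpy sketch of the 2D Gaussian simulation could look as follows; the variable names, distribution parameters, and noise levels are illustrative assumptions rather than the exact settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_types, latent_dim, n_genes = 4, 2, 100

# Cell-type centers in the two-dimensional biological subspace
centers = rng.normal(scale=5.0, size=(n_types, latent_dim))
# Shared random projection from the biological plane to a 100-dimensional "gene" space
projection = rng.normal(size=(latent_dim, n_genes))

def simulate_batch(n_cells, batch_noise_sd=0.5):
    # Randomly generated cell population composition for this batch
    proportions = rng.dirichlet(np.ones(n_types))
    labels = rng.choice(n_types, size=n_cells, p=proportions)
    latent = centers[labels] + rng.normal(size=(n_cells, latent_dim))
    # Project to high dimensions, then add a gene-specific batch effect
    batch_shift = rng.normal(scale=batch_noise_sd, size=n_genes)
    return latent @ projection + batch_shift, labels

expr1, labels1 = simulate_batch(1000)
expr2, labels2 = simulate_batch(800)
```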
To evaluate whether BERMUDA can remove batch effects in real scRNA-seq data and extract meaningful biological insights, we also applied it to datasets of human pancreas cells and PBMCs. The pancreas dataset was obtained from the Gene Expression Omnibus (GSE85241 for the Muraro batch [32], GSE84133 for the Baron batch [33]) and the European Bioinformatics Institute (E-MTAB-5061 for the Segerstolpe batch [37]). The PBMC dataset was obtained from the 10x Genomics support datasets. To effectively compare the case where all cell types are shared among batches with the case where only a subset of them is shared, we retained only the shared cell types in the pancreas dataset. The details of the datasets are shown in Table 1.
Framework of BERMUDA
BERMUDA is a novel unsupervised framework for batch correction across different batches (Fig. 1a). The workflow of BERMUDA includes the following five steps: (1) preprocessing of scRNA-seq data, (2) clustering of cells in each batch individually, (3) identifying similar cell clusters across different batches, (4) removing batch effects by training an autoencoder (Fig. 1b), and (5) utilizing the batch-corrected codes for further downstream analysis. We introduce each step in detail in the following sections.
Preprocessing
Gene expression levels in each cell were first quantified as transcripts-per-million (TPM) values. We then restricted the analysis to genes identified as highly variable by Seurat v2 [22]. Gene expression values were normalized as
$$ {G}_{ij}={\log}_2\left({\mathrm{TPM}}_{ij}+1\right), $$
(1)
where \( {\mathrm{TPM}}_{ij} \) is the TPM of gene i in cell j. We subsequently standardized the expression levels within each batch as
$$ {G}_{ij}^{\prime }=\frac{G_{ij}-{\overline{G}}_i}{\sigma_i}, $$
(2)
where \( {G}_{ij}^{\prime } \) is the standardized expression level of gene i in cell j, \( {\overline{G}}_i \) is the mean expression level of gene i within the batch, and \( {\sigma}_i \) is the standard deviation of the expression level of gene i within the batch. Then, we linearly scaled the expression level of each gene to [0, 1].
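A minimal sketch of this preprocessing, assuming `tpm` is a genes-by-cells matrix already restricted to the highly variable genes of one batch (and that no gene has zero variance within the batch):

```python
import numpy as np

def preprocess(tpm):
    # Eq. 1: log-transform TPM values
    g = np.log2(tpm + 1.0)
    # Eq. 2: standardize each gene (row) within the batch
    g = (g - g.mean(axis=1, keepdims=True)) / g.std(axis=1, keepdims=True)
    # Linearly scale the expression level of each gene to [0, 1]
    g_min, g_max = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    return (g - g_min) / (g_max - g_min)
```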
Clustering of cells and identifying similar cell clusters
Cell clusters were identified from each batch individually following the pipeline in Seurat v2 [22], which implements a clustering algorithm based on optimizing a modularity function on a k-nearest-neighbor graph. We then used MetaNeighbor [8] to determine the similarity between clusters from different batches based on Spearman correlation. For n batches, where batch i contains \( {c}_i \) clusters, i = 1, 2, …, n, MetaNeighbor produces a similarity score for each pair of cell clusters by calculating the mean area under the receiver operating characteristic curve (AUROC) in cross-validation. We denote \( {M}_{i_1,{j}_1,{i}_2,{j}_2} \) as the similarity score between cluster j1 in batch i1 and cluster j2 in batch i2. Because we were interested in similar clusters across different batches, we set the similarity scores between clusters within the same batch to 0. For each cluster, we considered the most similar cluster in each of the other batches, such that
$$ {M}_{i_a,{j}_a,{i}_b,{j}_b}=\left\{\begin{array}{c}{M}_{i_a,{j}_a,{i}_b,{j}_b},\kern0.5em \mathrm{if}\ {M}_{i_a,{j}_a,{i}_b,{j}_b}=\max \left\{{M}_{i_a,{j}_a,{i}_b,j},j=1,2,\dots, {c}_{i_b}\right\},{i}_a\ne {i}_b\\ {}0,\kern0.5em \mathrm{otherwise}\end{array}\right.. $$
(3)
We further made the similarity scores between two clusters symmetric by modifying \( {M}_{i_a,{j}_a,{i}_b,{j}_b} \) as
$$ {M}_{i_a,{j}_a,{i}_b,{j}_b}=\max \left\{{M}_{i_a,{j}_a,{i}_b,{j}_b},{M}_{i_b,{j}_b,{i}_a,{j}_a}\right\},{i}_a,{i}_b=1,2,\dots, n,{j}_a=1,2,\dots, {c}_{i_a},{j}_b=1,2,\dots, {c}_{i_b}. $$
(4)
Instead of considering only mutual nearest cell clusters, the criterion for identifying similar cell clusters summarized in Eqs. 3 and 4 can accommodate the case where a cell cluster in one batch corresponds to multiple clusters in another batch, which makes BERMUDA more robust to the results of the clustering step. Consider the case where cluster j1 in batch i1 is identified by MetaNeighbor as highly similar to both clusters j2 and j3 in batch i2 (e.g., \( {M}_{i_1,{j}_1,{i}_2,{j}_2}={M}_{i_2,{j}_2,{i}_1,{j}_1}=0.99 \), \( {M}_{i_1,{j}_1,{i}_2,{j}_3}={M}_{i_2,{j}_3,{i}_1,{j}_1}=0.98 \)), which may occur when clusters j2 and j3 in batch i2 were separated from a single, larger cluster by the clustering algorithm. If we only considered mutual nearest clusters, we would have aligned cluster j1 in batch i1 only with cluster j2 in batch i2. With our proposed criterion, however, Eqs. 3 and 4 give \( {M}_{i_1,{j}_1,{i}_2,{j}_3}={M}_{i_2,{j}_3,{i}_1,{j}_1}=0.98 \) (Eq. 3 yields \( {M}_{i_1,{j}_1,{i}_2,{j}_3}=0 \) and \( {M}_{i_2,{j}_3,{i}_1,{j}_1}=0.98 \), and Eq. 4 then sets \( {M}_{i_1,{j}_1,{i}_2,{j}_3}=\max \left({M}_{i_1,{j}_1,{i}_2,{j}_3},{M}_{i_2,{j}_3,{i}_1,{j}_1}\right)=0.98 \)), which faithfully captures the similarity relationship identified.
Finally, we binarized \( {M}_{i_a,{j}_a,{i}_b,{j}_b} \) with a threshold value \( {S}_{thr} \), where
$$ {M}_{i_a,{j}_a,{i}_b,{j}_b}=1\ \mathrm{if}\ {M}_{i_a,{j}_a,{i}_b,{j}_b}\ge {S}_{thr},{i}_a,{i}_b=1,2,\dots, n,{j}_a=1,2,\dots, {c}_{i_a},{j}_b=1,2,\dots, {c}_{i_b}. $$
(5)
This prevents finding a spurious counterpart for a cluster (e.g., cluster j2 in batch i1) when that cluster is specific to only one batch. In this case, the similarity scores between cluster j2 in batch i1 and any other cluster should be lower than \( {S}_{thr} \), which sets \( {M}_{i_1,{j}_2,i,j}=0,\forall i,j \) (Eq. 5). In this way, cluster j2 in batch i1 will not be aligned to other clusters during the training of BERMUDA. \( {S}_{thr} \) was chosen empirically, and we observed that BERMUDA achieved robust and competitive results across different datasets when \( {S}_{thr} \) was between 0.85 and 0.90 (Additional file 1: Figure S8, S9).
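The cluster-matching steps of Eqs. 3–5 can be sketched as follows, assuming `M` is the MetaNeighbor AUROC matrix over all clusters pooled across batches and `batch_of` holds the batch index of each cluster (both hypothetical names):

```python
import numpy as np

def match_clusters(M, batch_of, s_thr=0.90):
    M = M.astype(float).copy()
    M[batch_of[:, None] == batch_of[None, :]] = 0.0   # ignore within-batch pairs

    # Eq. 3: for each cluster, keep only its best match in every other batch
    kept = np.zeros_like(M)
    for a in range(M.shape[0]):
        for b in np.unique(batch_of):
            if b == batch_of[a]:
                continue
            cols = np.flatnonzero(batch_of == b)
            best = cols[np.argmax(M[a, cols])]
            kept[a, best] = M[a, best]

    kept = np.maximum(kept, kept.T)          # Eq. 4: symmetrize with the maximum
    return (kept >= s_thr).astype(float)     # Eq. 5: binarize with threshold S_thr
```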
Batch correction using an autoencoder
BERMUDA uses an autoencoder to project the original uncorrected gene expression profiles to a low-dimensional space and remove the experimental artifacts across different batches (Fig. 1b). An autoencoder can be represented as a function x′ = f(x) = fdecoder(fencoder(x)), where f reconstructs the input gene expression profile x through the neural network. To avoid trivial solutions, autoencoders usually incorporate a bottleneck layer that learns a low-dimensional embedding of the input data called the code, e.g., z = fencoder(x). In BERMUDA, we used an autoencoder with three hidden layers, with a default of 200, 20, and 200 neurons in the respective hidden layers. For the synthetic data generated from bivariate Gaussian distributions, we set the number of neurons in the hidden layers to 20, 2, and 20 to reconstruct the two-dimensional biological plane.
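A minimal PyTorch sketch of such an autoencoder is shown below; the activation functions are assumptions and may differ from the released implementation.

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_genes, hidden_dim=200, code_dim=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, code_dim), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_genes), nn.Sigmoid(),  # inputs are scaled to [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)            # low-dimensional code z = f_encoder(x)
        return self.decoder(z), z      # reconstruction x' = f_decoder(z) and code z
```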
BERMUDA was trained with mini-batch gradient descent, a variant of gradient descent commonly adopted in deep learning. For each iteration in each epoch of training, a “mini-batch” X = {x1, x2, …, xB} containing \( {n}_{mb} \) cells from each cluster (\( B={n}_{mb}\sum \limits_{i=1}^n{c}_i \)) was sampled from the dataset. We used \( {n}_{mb}=50 \) in our experiments; however, we observed that BERMUDA was robust to the choice of \( {n}_{mb} \) and outperformed existing methods under a wide range of \( {n}_{mb} \) values (Additional file 1: Figure S10). The loss was calculated on the entire mini-batch, and the parameters of BERMUDA were then updated using gradient descent. In each epoch, multiple iterations were performed to cover all the cells in the dataset.
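A sketch of assembling one such mini-batch, assuming `X` is the preprocessed expression matrix over all batches and `clusters` is a list of index arrays, one per cluster of every batch (hypothetical names):

```python
import numpy as np

def sample_minibatch(X, clusters, n_mb=50, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Draw n_mb cells from every cluster (with replacement only for small clusters)
    idx = np.concatenate([
        rng.choice(cells, size=n_mb, replace=len(cells) < n_mb)
        for cells in clusters
    ])
    return X[idx], idx
```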
The loss function for training the autoencoder consisted of two parts. The first part is a reconstruction loss between the output layer and the input layer defined by mean squared error
$$ {L}_1\left(\mathbf{X}\right)=\sum \limits_{i=1}^B{\left\Vert {\mathbf{x}}_i-{\mathbf{x}}_i^{\prime}\right\Vert}_2^2, $$
(6)
where \( {\mathbf{x}}_i \) and \( {\mathbf{x}}_i^{\prime } \) are the input and reconstructed expression profiles of the i-th cell in a mini-batch. The second part is a novel maximum mean discrepancy (MMD) based loss [29] that estimates the differences in distributions among similar cell clusters in different batches. MMD is a non-parametric distance estimate between distributions based on the reproducing kernel Hilbert space (RKHS) and has proven highly effective in many deep transfer learning tasks [41,42,43,44]. Since MMD does not require density estimates as an intermediate step and does not assume any parametric density on the data, it can be applied to different domains [45]. MMD is also memory-efficient, fast to compute, and performs well on high-dimensional data with small sample sizes [46, 47]. To accommodate the case where only a subset of the cell population is shared among batches, instead of applying the MMD loss to entire batches, we only considered the loss between pairs of similar cell clusters from different batches. The MMD-based loss is therefore defined as
$$ {L}_2\left(\mathbf{X}\right)={\sum}_{i_a,{i}_b,{j}_a,{j}_b}{M}_{i_a,{j}_a,{i}_b,{j}_b}\bullet MMD\left({\mathbf{Z}}_{i_a,{j}_a},{\mathbf{Z}}_{i_b,{j}_b}\right), $$
(7)
where \( {\mathbf{Z}}_{i,j} \) is the code of the input \( {\mathbf{X}}_{i,j} \), and \( {\mathbf{X}}_{i,j} \) contains the expression profiles of cells from cluster j of batch i in the mini-batch X. MMD(∙) equals zero when the underlying distributions of the observed samples are the same. By minimizing the MMD loss between the distributions of similar clusters, the autoencoder is trained to remove batch effects in the bottleneck layer. In summary, the total loss function on a mini-batch can be written as
$$ L\left(\mathbf{X}\right)={L}_1\left(\mathbf{X}\right)+\lambda {L}_2\left(\mathbf{X}\right), $$
(8)
where λ is a regularization parameter. We followed the strategy introduced by Ganin et al. [48] to gradually increase λ during the training process. The regularization parameter at epoch p is calculated as
$$ {\lambda}_p=\frac{2}{1+{e}^{-10p/{n}_p}}-1, $$
(9)
where \( {n}_p \) is the total number of training epochs. This helps the autoencoder to first focus on finding a proper low-dimensional representation of the original gene expression data and then on aligning the distributions of similar clusters in the low-dimensional space.
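Putting Eqs. 6–9 together, a hedged PyTorch sketch of the loss computation might look as follows. The single-bandwidth Gaussian-kernel MMD estimate is an illustrative choice (the actual implementation may, for example, use a multi-scale kernel), and `codes` and `pairs` are hypothetical containers holding the mini-batch codes per (batch, cluster) and the similar cluster pairs with M = 1.

```python
import math
import torch

def mmd2(za, zb, sigma=1.0):
    # Biased estimate of squared MMD with a Gaussian kernel between two sets of codes
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(za, za).mean() + k(zb, zb).mean() - 2 * k(za, zb).mean()

def regularization(epoch, n_epochs):
    # Eq. 9: lambda grows smoothly from 0 towards 1 over training
    return 2.0 / (1.0 + math.exp(-10.0 * epoch / n_epochs)) - 1.0

def total_loss(x, x_recon, codes, pairs, epoch, n_epochs):
    # Eq. 6: reconstruction loss over the mini-batch
    l1 = ((x - x_recon) ** 2).sum()
    # Eq. 7: MMD loss summed over similar cluster pairs (pairs with M = 1)
    l2 = sum(mmd2(codes[a], codes[b]) for a, b in pairs)
    # Eq. 8: total loss with the schedule from Eq. 9
    return l1 + regularization(epoch, n_epochs) * l2
```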
Performance evaluation
To evaluate the performance of BERMUDA, we examined the outputs when specific cell types were removed from their respective batches. We then used three metrics to compare algorithm performance. First, we used a k-nearest-neighbor based divergence estimation method [49] to evaluate the quality of merging the shared cell population among batches. For n scRNA-seq batches with gene expression profiles X1, X2, …, Xn and their corresponding batch-corrected low-dimensional embeddings Z1, Z2, …, Zn, we define
$$ divergence\_ score=\frac{1}{n_{div}}{\sum}_{i\ne j,s\left({\mathbf{Z}}_i,{\mathbf{Z}}_j\right)\ne \varnothing, s\left({\mathbf{Z}}_j,{\mathbf{Z}}_i\right)\ne \varnothing }D\left(s\left({\mathbf{Z}}_i,{\mathbf{Z}}_j\right),s\left({\mathbf{Z}}_j,{\mathbf{Z}}_i\right)\right), $$
(10)
where s(Zi, Zj) is the cell population in Zi that is shared by Zj, D(Zi, Zj) is the divergence estimate of the two distributions given samples Zi and Zj, and ndiv = ∣ {(i, j)| i ≠ j, s(Zi, Zj) ≠ ∅ , s(Zj, Zi) ≠ ∅ }∣ is the number of pairs of batches with a shared cell population. Since proper removal of batch effects should produce results where the distributions of shared cell populations among different batches are similar, a smaller divergence_score is preferred, indicating that the shared cell populations of different batches are homogeneously mixed. Second, we used entropy to evaluate whether a cell population that only exists in a certain batch remains distinct from other populations after correction. We define
$$ entropy\_ score=\frac{1}{n_{ent}}\sum \limits_{i\ne j,d\left({\mathbf{Z}}_i,{\mathbf{Z}}_j\right)\ne \varnothing}\frac{1}{\left|d\left({\mathbf{Z}}_i,{\mathbf{Z}}_j\right)\right|}\sum \limits_{\mathbf{k}\in d\left({\mathbf{Z}}_i,{\mathbf{Z}}_j\right)\ }E\left(\mathbf{k}\right), $$
(11)
where d(Zi, Zj) is the cell population in Zi that is not shared by Zj, and nent = ∣ {(i, j)| i ≠ j, d(Zi, Zj) ≠ ∅ }∣ is the number of pairs of batches where Zi contains a cell population not shared with Zj. E(k) is the local entropy estimate around cell k, defined as
$$ E\left(\mathbf{k}\right)=-\sum \limits_{i=1}^n{p}_i\log \left({p}_i\right), $$
(12)
where pi is the proportion of cells from batch i among the NN-nearest neighbors of cell k. We chose NN = 100 in our evaluations. When batch effects are removed properly, a cell type that only exists in a batch should not be mixed with cells from other batches. So, a smaller entropy_score is desired, suggesting that biological signals only contained in a subset of batches are properly preserved during correction. Note that when all the batches share the same cell types, we did not calculate entropy_score during evaluation since there is no batch-specific cell population.
The divergence and entropy estimates were calculated for pairs of batches and then averaged to acquire a summary of the batch correction performance among multiple batches. When the dimensionality of the embedding was high, divergence_score and entropy_score were calculated based on the two-dimensional UMAP [30] embeddings of the data to derive robust estimates of divergence and entropy. UMAP is a general dimensionality reduction algorithm that achieves competitive results compared to t-SNE [50] while preserving more of the global structure of the data.
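A hedged sketch of how these two scores could be computed on the 2-D UMAP embeddings is given below. The k-nearest-neighbor KL-divergence estimator shown is one standard choice and may differ in detail from the estimator of [49]; `za`, `zb`, `emb`, `batch_labels`, and `specific_idx` (the indices of batch-specific cells) are hypothetical inputs.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_divergence(za, zb, k=5):
    # k-NN estimate of the KL divergence D(p_a || p_b) from samples za ~ p_a, zb ~ p_b
    n, d = za.shape
    m = zb.shape[0]
    # Distance to the k-th nearest neighbor within za (excluding the point itself)
    rho = NearestNeighbors(n_neighbors=k + 1).fit(za).kneighbors(za)[0][:, -1]
    # Distance to the k-th nearest neighbor in zb
    nu = NearestNeighbors(n_neighbors=k).fit(zb).kneighbors(za)[0][:, -1]
    return d / n * np.sum(np.log(nu / rho)) + np.log(m / (n - 1))

def local_entropy(emb, batch_labels, specific_idx, n_neighbors=100):
    # Eqs. 11-12: mean entropy of batch proportions among the nearest neighbors
    # of cells that belong to a batch-specific population
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(emb)
    _, neighbors = nn.kneighbors(emb[specific_idx])
    scores = []
    for row in neighbors:
        p = np.unique(batch_labels[row], return_counts=True)[1] / n_neighbors
        scores.append(-np.sum(p * np.log(p)))
    return np.mean(scores)
```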
Third, since divergence_score and entropy_score are both designed to evaluate the mixing of cells among batches, we also computed a metric to evaluate the separation of different cell types after batch effects are removed. To this end, we calculated the silhouette coefficient with clusters defined by cell types. For a cell k, let a(k) be the average distance between k and all other cells within the same cluster, and let b(k) be the smallest average distance between k and all the cells in any other cluster. We define the silhouette coefficient of cell k as
$$ S\left(\mathbf{k}\right)=\frac{b\left(\mathbf{k}\right)-a\left(\mathbf{k}\right)}{\max \left\{a\left(\mathbf{k}\right),b\left(\mathbf{k}\right)\right\}}. $$
(13)
The average silhouette coefficient of all the cells from different batches is calculated after batch effect removal, such that
$$ silhouette\_ score=\frac{1}{\sum \limits_{i=1}^n\mid \left\{\mathbf{k}|\mathbf{k}\in {\mathbf{Z}}_i\right\}\mid}\sum \limits_{i=1}^n\sum \limits_{\mathbf{k}\in {\mathbf{Z}}_i}S\left(\mathbf{k}\right). $$
(14)
A larger silhouette_score indicates that the cell type assignment in the aligned dataset is more appropriate, where a cell is close to cells of the same type and distant from cells of different types. S(k) is calculated using Euclidean distance on the two-dimensional UMAP embeddings of the results.
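For example, with scikit-learn (assuming `umap_emb` holds the 2-D UMAP coordinates and `cell_types` the cell type labels of all cells after alignment):

```python
from sklearn.metrics import silhouette_score

def silhouette_on_umap(umap_emb, cell_types):
    # Eqs. 13-14: mean silhouette coefficient over all cells, with clusters
    # defined by cell type and Euclidean distance on the 2-D UMAP embedding
    return silhouette_score(umap_emb, cell_types, metric="euclidean")
```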
Performance comparison with popular batch correction methods
We compared BERMUDA with several existing state-of-the-art batch correction methods for scRNA-seq data, including mnnCorrect [20], BBKNN [24], Seurat v2 (v2.3.4) [22], Seurat v3 (v3.0.0) [23], and scVI [28]. BBKNN and mnnCorrect were applied to log-transformed TPM data of the variable genes. Seurat v2, Seurat v3, and scVI were applied to the datasets following their recommended workflows [22, 23, 28]. Due to the restrictions of these workflows, we did not apply Seurat v2, Seurat v3, and scVI to the Gaussian simulated gene expression data. To demonstrate the necessity of batch correction methods designed for scRNA-seq data, we also compared BERMUDA with batch correction methods developed for microarray and bulk RNA-seq data, such as ComBat [14] and limma [15] (Additional file 1: Figure S3).