### Datasets used for performance evaluation

For consistency, in this paper, an individual dataset will be referred to as a “batch,” and multiple batches investigating similar biological problems will be referred to as a “dataset.” We applied *BERMUDA* to simulated datasets, human pancreas cell datasets, and PBMC datasets to assess its performance (Table 1). We used two methods to generate simulated datasets for evaluating the performance of *BERMUDA*. For the 2D Gaussian dataset, we followed [20] to generate synthetic data with batch effects. We simulated two batches of four different cell types according to different bivariate normal distributions in a two-dimensional biological subspace, with the cell population composition of each batch generated randomly. We then randomly projected the data to a 100-dimensional space to simulate high-dimensional gene expression data, and generated gene-specific batch effects by adding Gaussian noise to the high-dimensional data. For the Splatter dataset, we used the Splatter [31] package to simulate RNA-seq counts of two batches with four different cell populations; Splatter can directly simulate multiple batches with similar cell type compositions. We set the cell population composition to 0.4, 0.3, 0.2, and 0.1 among the four simulated cell types.
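As an illustration, a minimal numpy sketch of the 2D Gaussian simulation might look as follows; all parameter values (cell-type centers, variances, noise scale, cell counts) are illustrative assumptions rather than the exact settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_types, n_genes = 4, 100

# Cell-type centers in the 2D biological subspace.
centers = rng.uniform(-5.0, 5.0, size=(n_types, 2))
# Shared random projection from 2D to the 100-dimensional gene space.
proj = rng.normal(size=(2, n_genes))

def simulate_batch(n_cells, noise_sd=1.0):
    # Random cell population composition for this batch.
    props = rng.dirichlet(np.ones(n_types))
    labels = rng.choice(n_types, size=n_cells, p=props)
    # Bivariate normal around each cell type's center.
    low_dim = centers[labels] + rng.normal(0.0, 0.5, size=(n_cells, 2))
    expr = low_dim @ proj
    # Gene-specific batch effect: one Gaussian offset per gene,
    # shared by all cells in the batch.
    expr += rng.normal(0.0, noise_sd, size=n_genes)
    return expr, labels

batches = [simulate_batch(500) for _ in range(2)]
```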

To evaluate whether *BERMUDA* can remove batch effects in real scRNA-seq data and extract meaningful biological insights, we also applied it to datasets of human pancreas cells and PBMCs. The pancreas dataset was obtained from the Gene Expression Omnibus (GSE85241 for the *Muraro batch* [32], GSE84133 for the *Baron batch* [33]) and the European Bioinformatics Institute (E-MTAB-5061 for the *Segerstolpe batch* [37]). The PBMC dataset was obtained from the 10x Genomics support datasets. To compare the case where all cell types are shared among batches with the case where only a subset is shared, we retained only the shared cell types in the pancreas dataset. The details of the datasets are shown in Table 1.

### Framework of *BERMUDA*

*BERMUDA* is a novel unsupervised framework for batch correction across different batches (Fig. 1a). The workflow of *BERMUDA* includes the following five steps: (1) preprocessing of scRNA-seq data, (2) clustering of cells in each batch individually, (3) identifying similar cell clusters across different batches, (4) removing batch effects by training an autoencoder (Fig. 1b), and (5) utilizing the batch-corrected codes for further downstream analysis. We introduce each step in detail in the following sections.

### Preprocessing

Gene expression levels from each cell were first quantified as transcripts-per-million (TPM). We restricted the analysis to genes that were highly variable according to *Seurat v2* [22]. Gene expression values were then normalized as

$$ {G}_{ij}={\log}_2\left({\mathrm{TPM}}_{ij}+1\right), $$

(1)

where TPM_{ij} is the TPM of gene *i* in cell *j*. We subsequently standardized expression per batch as

$$ {G}_{ij}^{\prime }=\frac{G_{ij}-{\overline{G}}_i}{\sigma_i}, $$

(2)

where *G*′_{ij} is the standardized expression level of gene *i* in cell *j*, \( {\overline{G}}_i \) is the mean expression level of gene *i* within the batch, and *σ*_{i} is the corresponding standard deviation. Then, we linearly scaled the expression level of each gene to [0, 1].
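A minimal sketch of Eqs. 1–2 and the final scaling, assuming `tpm` is a hypothetical cells × genes matrix of TPM values for the highly variable genes of one batch (gene selection with *Seurat v2* is omitted here):

```python
import numpy as np

def preprocess(tpm, eps=1e-8):
    g = np.log2(tpm + 1.0)                            # Eq. 1
    g = (g - g.mean(axis=0)) / (g.std(axis=0) + eps)  # Eq. 2, per gene within the batch
    # Linearly scale each gene to [0, 1].
    gmin, gmax = g.min(axis=0), g.max(axis=0)
    return (g - gmin) / (gmax - gmin + eps)
```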

### Clustering of cells and identifying similar cell clusters

Cell clusters were identified from each batch individually following the pipeline in *Seurat v2* [22], which implements a clustering algorithm based on optimizing a modularity function on a k-nearest-neighbor graph. We then used MetaNeighbor [8] to determine the similarity between clusters from different batches based on Spearman correlation. For *n* batches, each containing *c*_{i} clusters, *i* = 1, 2, …, *n*, MetaNeighbor produces a similarity score for each pair of cell clusters by calculating the mean area under the receiver operating characteristic curve (AUROC) in cross-validation. We denote \( {M}_{i_1,{j}_1,{i}_2,{j}_2} \) as the similarity score between cluster *j*_{1} in batch *i*_{1} and cluster *j*_{2} in batch *i*_{2}. Because we were interested in similar clusters across different batches, we set the similarity score between clusters within the same batch to 0. For each cluster, we considered only the most similar cluster in each of the other batches, such that

$$ {M}_{i_a,{j}_a,{i}_b,{j}_b}=\left\{\begin{array}{cl}{M}_{i_a,{j}_a,{i}_b,{j}_b}, & \mathrm{if}\ {M}_{i_a,{j}_a,{i}_b,{j}_b}=\max \left\{{M}_{i_a,{j}_a,{i}_b,j},\ j=1,2,\dots,{c}_{i_b}\right\},\ {i}_a\ne {i}_b\\ 0, & \mathrm{otherwise}\end{array}\right. $$

(3)

We further made the similarity scores between two clusters symmetric by modifying \( {M}_{i_a,{j}_a,{i}_b,{j}_b} \) as

$$ {M}_{i_a,{j}_a,{i}_b,{j}_b}=\max \left\{{M}_{i_a,{j}_a,{i}_b,{j}_b},{M}_{i_b,{j}_b,{i}_a,{j}_a}\right\},{i}_a,{i}_b=1,2,\dots, n,{j}_a=1,2,\dots, {c}_{i_a},{j}_b=1,2,\dots, {c}_{i_b}. $$

(4)

Instead of considering mutually nearest cell clusters, our criterion for identifying similar cell clusters, summarized in Eqs. 3 and 4, can accommodate the case where a cell cluster in one batch corresponds to multiple clusters in another batch, which makes *BERMUDA* more robust to the outcome of the clustering step. Consider the case where cluster *j*_{1} in batch *i*_{1} is identified by MetaNeighbor as similar to both clusters *j*_{2} and *j*_{3} in batch *i*_{2} with high confidence (e.g., \( {M}_{i_1,{j}_1,{i}_2,{j}_2}={M}_{i_2,{j}_2,{i}_1,{j}_1}=0.99 \), \( {M}_{i_1,{j}_1,{i}_2,{j}_3}={M}_{i_2,{j}_3,{i}_1,{j}_1}=0.98 \)), where clusters *j*_{2} and *j*_{3} in batch *i*_{2} might have been split from a single, larger cluster by the clustering algorithm. If we only considered mutually nearest clusters, we would align cluster *j*_{1} in batch *i*_{1} with cluster *j*_{2} in batch *i*_{2} only. With our proposed criterion, however, we obtain \( {M}_{i_1,{j}_1,{i}_2,{j}_3}={M}_{i_2,{j}_3,{i}_1,{j}_1}=0.98 \) (Eq. 3 gives \( {M}_{i_1,{j}_1,{i}_2,{j}_3}=0 \) and \( {M}_{i_2,{j}_3,{i}_1,{j}_1}=0.98 \); Eq. 4 then gives \( {M}_{i_1,{j}_1,{i}_2,{j}_3}=\max \left({M}_{i_1,{j}_1,{i}_2,{j}_3},{M}_{i_2,{j}_3,{i}_1,{j}_1}\right)=0.98 \)), which faithfully captures the identified similarity relationships.

Finally, we binarized \( {M}_{i_a,{j}_a,{i}_b,{j}_b} \) with a threshold value *S*_{thr}, such that

$$ {M}_{i_a,{j}_a,{i}_b,{j}_b}=\left\{\begin{array}{cl}1, & \mathrm{if}\ {M}_{i_a,{j}_a,{i}_b,{j}_b}\ge {S}_{\mathrm{thr}}\\ 0, & \mathrm{otherwise}\end{array}\right.,\kern1em {i}_a,{i}_b=1,2,\dots,n,\ {j}_a=1,2,\dots,{c}_{i_a},\ {j}_b=1,2,\dots,{c}_{i_b}. $$

(5)

This prevents finding a spurious counterpart for a cluster (e.g., cluster *j*_{2} in batch *i*_{1}) when that cluster is specific to a single batch. In this case, the similarity scores between cluster *j*_{2} in batch *i*_{1} and all other clusters should be lower than *S*_{thr}, which sets \( {M}_{i_1,{j}_2,i,j}=0,\forall i,j \) (Eq. 5). As a result, cluster *j*_{2} in batch *i*_{1} will not be aligned to other clusters during the training of *BERMUDA*. *S*_{thr} was chosen empirically; we observed that *BERMUDA* achieved robust and competitive results across different datasets when *S*_{thr} was between 0.85 and 0.90 (Additional file 1: Figures S8 and S9).
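The matching rules of Eqs. 3–5 reduce to simple matrix operations. The sketch below assumes a hypothetical `sim` matrix of MetaNeighbor AUROC scores over all clusters pooled across batches, and a hypothetical `batch_of` array giving each cluster's batch index:

```python
import numpy as np

def match_clusters(sim, batch_of, s_thr=0.90):
    m = sim.copy()
    batch_of = np.asarray(batch_of)
    # Ignore within-batch similarities.
    m[batch_of[:, None] == batch_of[None, :]] = 0.0
    # Eq. 3: for each cluster, keep only the best match in each other batch.
    keep = np.zeros_like(m, dtype=bool)
    for a in range(m.shape[0]):
        for b in np.unique(batch_of):
            if b == batch_of[a]:
                continue
            cols = np.where(batch_of == b)[0]
            keep[a, cols[np.argmax(m[a, cols])]] = True
    m[~keep] = 0.0
    # Eq. 4: symmetrize with the elementwise maximum.
    m = np.maximum(m, m.T)
    # Eq. 5: binarize with threshold S_thr.
    return (m >= s_thr).astype(float)
```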

### Batch correction using an autoencoder

*BERMUDA* uses an autoencoder to project the original uncorrected gene expression profiles into a low-dimensional space and remove the experimental artifacts across different batches (Fig. 1b). An autoencoder can be represented as a function **x**^{′} = *f*(**x**) = *f*_{decoder}(*f*_{encoder}(**x**)), where *f* reconstructs the input gene expression profile **x** through the neural network. To avoid trivial solutions, autoencoders usually incorporate a bottleneck layer that learns a low-dimensional embedding of the input data called the code, i.e., **z** = *f*_{encoder}(**x**). In *BERMUDA*, we used an autoencoder with three hidden layers, with default numbers of neurons of 200, 20, and 200, respectively. For the synthetic data generated from bivariate Gaussian distributions, we set the numbers of neurons in the hidden layers to 20, 2, and 20 to reconstruct the two-dimensional biological plane.
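A minimal PyTorch sketch of such an autoencoder follows; the choice of activation functions is an assumption and may differ from the released implementation.

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_genes, hidden=200, code_dim=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, hidden), nn.ReLU(),
            nn.Linear(hidden, code_dim), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_genes), nn.Sigmoid(),  # inputs are scaled to [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)            # code: low-dimensional embedding
        return self.decoder(z), z      # reconstruction and code
```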

*BERMUDA* was trained with mini-batch gradient descent, a variation of the gradient descent algorithm widely adopted in deep learning. For each iteration in each epoch, a “mini-batch” **X** = {**x**_{1}, **x**_{2}, …, **x**_{B}} was sampled from the dataset, containing *n*_{mb} cells from each cluster (\( B={n}_{mb}\sum \limits_{i=1}^n{c}_i \)). We used *n*_{mb} = 50 in our experiments; however, *BERMUDA* was robust to the choice of *n*_{mb} and outperformed existing methods under a wide range of *n*_{mb} values (Additional file 1: Figure S10). The loss was calculated on the entire mini-batch, and the parameters of *BERMUDA* were then updated using gradient descent. In each epoch, multiple iterations were performed to cover all the cells in the dataset.
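A minimal sketch of this per-cluster sampling scheme, where `cluster_indices` is a hypothetical list of cell-index arrays, one per (batch, cluster) pair:

```python
import numpy as np

def sample_minibatch(cluster_indices, n_mb=50, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Draw n_mb cells from every cluster; sample with replacement
    # only if a cluster has fewer than n_mb cells.
    picks = [rng.choice(idx, size=n_mb, replace=len(idx) < n_mb)
             for idx in cluster_indices]
    return np.concatenate(picks)  # B = n_mb * (total number of clusters)
```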

The loss function for training the autoencoder consists of two parts. The first part is a reconstruction loss between the output layer and the input layer, defined by the mean squared error

$$ {L}_1\left(\mathbf{X}\right)=\sum \limits_{i=1}^B{\left\Vert {\mathbf{x}}_i-{\mathbf{x}}_i^{\prime}\right\Vert}_2^2, $$

(6)

where **x**_{i} and \( {\mathbf{x}}_i^{\prime } \) are the input and reconstructed expression profiles of the *i*-th cell in a mini-batch. The second part is a novel maximum mean discrepancy (MMD)-based loss [29] that estimates the differences in distributions among similar cell clusters in different batches. MMD is a non-parametric estimate of the distance between distributions based on the reproducing kernel Hilbert space (RKHS) and has proven highly effective in many deep transfer learning tasks [41,42,43,44]. Since MMD does not require density estimates as an intermediate step and does not assume any parametric density on the data, it can be applied to different domains [45]. MMD is also memory-efficient, fast to compute, and performs well on high-dimensional data with small sample sizes [46, 47]. To handle the case where only a subset of the cell population is shared among batches, instead of applying the MMD loss to entire batches, we only considered the loss between pairs of similar cell clusters across batches. The MMD-based loss is defined as

$$ {L}_2\left(\mathbf{X}\right)={\sum}_{i_a,{i}_b,{j}_a,{j}_b}{M}_{i_a,{j}_a,{i}_b,{j}_b}\bullet MMD\left({\mathbf{Z}}_{i_a,{j}_a},{\mathbf{Z}}_{i_b,{j}_b}\right), $$

(7)

where **Z**_{i, j} is the code of the input **X**_{i, j}, and **X**_{i, j} contains the expression profiles of cells from cluster *j* of batch *i* in the mini-batch **X**. *MMD*(∙) equals zero when the underlying distributions of the observed samples are the same. By minimizing the MMD loss between the distributions of similar clusters, the autoencoder is trained to remove batch effects in the bottleneck layer.
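As an illustration, a single-bandwidth Gaussian-kernel MMD estimate and the cluster-pair loss of Eq. 7 could be sketched as follows; the released implementation may use a multi-kernel variant, and `sigma` here is an assumed bandwidth:

```python
import torch

def mmd(z1, z2, sigma=1.0):
    # Biased estimate of squared MMD with a Gaussian (RBF) kernel.
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return (kernel(z1, z1).mean() + kernel(z2, z2).mean()
            - 2 * kernel(z1, z2).mean())

def mmd_loss(codes, pairs):
    # `codes[(i, j)]` holds the mini-batch codes of cluster j in batch i;
    # `pairs` lists the cluster pairs with M = 1 after Eq. 5 (Eq. 7).
    return sum(mmd(codes[a], codes[b]) for a, b in pairs)
```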

In summary, the total loss function on a mini-batch can be written as

$$ L\left(\mathbf{X}\right)={L}_1\left(\mathbf{X}\right)+\lambda {L}_2\left(\mathbf{X}\right), $$

(8)

where *λ* is a regularization parameter. We followed the strategy introduced by Ganin et al. [48] to gradually increase *λ* during the training process. The regularization parameter at epoch *p* is calculated as

$$ {\lambda}_p=\frac{2}{1+{e}^{-10p/{n}_{ep}}}-1, $$

(9)

where *n*_{ep} is the total number of training epochs. This helps the autoencoder first focus on finding a proper low-dimensional representation of the original gene expression data, and then on aligning the distributions of similar clusters in the low-dimensional space.
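Eq. 9 as a small helper (using *n*_{ep} for the total number of epochs):

```python
import math

def lambda_at_epoch(p, n_ep):
    # Eq. 9: increases smoothly from 0 at p = 0 toward 1 at p = n_ep.
    return 2.0 / (1.0 + math.exp(-10.0 * p / n_ep)) - 1.0
```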

### Performance evaluation

To evaluate the performance of *BERMUDA*, we examined the outputs when specific cell types were removed from their respective batches. We then used three metrics to compare algorithm performance. First, we used a k-nearest-neighbor-based divergence estimation method [49] to evaluate the quality of merging shared cell populations among batches. For *n* scRNA-seq batches with gene expression profiles **X**_{1}, **X**_{2}, …, **X**_{n} and their corresponding batch-corrected low-dimensional embeddings **Z**_{1}, **Z**_{2}, …, **Z**_{n}, we define

$$ divergence\_ score=\frac{1}{n_{div}}{\sum}_{i\ne j,s\left({\mathbf{Z}}_i,{\mathbf{Z}}_j\right)\ne \varnothing, s\left({\mathbf{Z}}_j,{\mathbf{Z}}_i\right)\ne \varnothing }D\left(s\left({\mathbf{Z}}_i,{\mathbf{Z}}_j\right),s\left({\mathbf{Z}}_j,{\mathbf{Z}}_i\right)\right), $$

(10)

where *s*(**Z**_{i}, **Z**_{j}) is the cell population in **Z**_{i} that is shared with **Z**_{j}, *D*(**Z**_{i}, **Z**_{j}) is the divergence estimate between the two distributions given samples **Z**_{i} and **Z**_{j}, and *n*_{div} = ∣ {(*i*, *j*)| *i* ≠ *j*, *s*(**Z**_{i}, **Z**_{j}) ≠ ∅ , *s*(**Z**_{j}, **Z**_{i}) ≠ ∅ }∣ is the number of pairs of batches with a shared cell population. Since proper removal of batch effects should produce results where the distributions of shared cell populations among different batches are similar, a smaller *divergence*_*score* is preferred, indicating that the shared cell populations between different batches are homogeneously mixed. Second, we used entropy to evaluate whether a cell population that only exists in a certain batch remains distinct from other populations after correction. We define

$$ entropy\_ score=\frac{1}{n_{ent}}\sum \limits_{i\ne j,d\left({\mathbf{Z}}_i,{\mathbf{Z}}_j\right)\ne \varnothing}\frac{1}{\left|d\left({\mathbf{Z}}_i,{\mathbf{Z}}_j\right)\right|}\sum \limits_{\mathbf{k}\in d\left({\mathbf{Z}}_i,{\mathbf{Z}}_j\right)\ }E\left(\mathbf{k}\right), $$

(11)

where *d*(**Z**_{i}, **Z**_{j}) is the cell population in **Z**_{i} that is not shared with **Z**_{j}, and *n*_{ent} = ∣ {(*i*, *j*)| *i* ≠ *j*, *d*(**Z**_{i}, **Z**_{j}) ≠ ∅ }∣ is the number of pairs of batches where **Z**_{i} contains a cell population distinct from **Z**_{j}. *E*(**k**) is the local entropy estimate around cell **k**, defined as

$$ E\left(\mathbf{k}\right)=-\sum \limits_{i=1}^n{p}_i\log \left({p}_i\right), $$

(12)

where *p*_{i} is the proportion of cells from batch *i* among the *NN* nearest neighbors of cell **k**. We chose *NN* = 100 in our evaluations. When batch effects are removed properly, a cell type that exists in only one batch should not be mixed with cells from other batches. Hence, a smaller *entropy*_*score* is desired, suggesting that biological signals contained in only a subset of batches are properly preserved during correction. Note that when all batches share the same cell types, we did *not* calculate *entropy*_*score* during evaluation since there is no batch-specific cell population.
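A minimal sketch of the local entropy in Eq. 12 using scikit-learn's nearest-neighbor search, where `emb` is a hypothetical two-dimensional embedding and `batch` the batch label of each cell; restricting the computation to batch-specific populations (Eq. 11) is omitted for brevity:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_entropy(emb, batch, nn=100):
    batch = np.asarray(batch)
    nbrs = NearestNeighbors(n_neighbors=nn).fit(emb)
    _, idx = nbrs.kneighbors(emb)  # each cell's nn nearest neighbors
    ent = np.zeros(len(emb))
    for k, neighbors in enumerate(idx):
        _, counts = np.unique(batch[neighbors], return_counts=True)
        p = counts / nn
        ent[k] = -(p * np.log(p)).sum()  # Eq. 12
    return ent
```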

The divergence and entropy estimates were calculated for pairs of batches and then averaged to summarize batch correction performance across multiple batches. When the dimensionality of the embedding was high, *divergence*_*score* and *entropy*_*score* were calculated on the two-dimensional UMAP [30] embeddings of the data to derive robust estimates of divergence and entropy. UMAP is a general dimensionality reduction algorithm that achieves results competitive with t-SNE [50] while preserving more of the global structure of the data.
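For example, with the umap-learn package (default parameters are an assumption, and `codes` is a hypothetical cells × 20 matrix of batch-corrected codes):

```python
import umap

emb2d = umap.UMAP(n_components=2).fit_transform(codes)
```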

Third, since *divergence*_*score* and *entropy*_*score* both evaluate the mixing of cells among batches, we also computed a metric to evaluate the separation of different cell types after batch effects are removed. To this end, we calculated the silhouette coefficient with clusters defined by cell types. For a cell **k**, let *a*(**k**) be the average distance between **k** and all other cells within the same cluster, and *b*(**k**) be the smallest average distance between **k** and all the cells in any other cluster. We define the silhouette coefficient of cell **k** as

$$ S\left(\mathbf{k}\right)=\frac{b\left(\mathbf{k}\right)-a\left(\mathbf{k}\right)}{\max \left\{a\left(\mathbf{k}\right),b\left(\mathbf{k}\right)\right\}}. $$

(13)

The average silhouette coefficient of all the cells from different batches is calculated after batch effect removal, such that

$$ silhouette\_ score=\frac{1}{\sum \limits_{i=1}^n\mid \left\{\mathbf{k}|\mathbf{k}\in {\mathbf{Z}}_i\right\}\mid}\sum \limits_{i=1}^n\sum \limits_{\mathbf{k}\in {\mathbf{Z}}_i}S\left(\mathbf{k}\right). $$

(14)

A larger *silhouette*_*score* indicates that the cell type assignment in the aligned dataset is more appropriate: a cell is close to cells of the same type and distant from cells of different types. *S*(**k**) is calculated using the Euclidean distance on the two-dimensional UMAP embeddings of the results.
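Eqs. 13–14 correspond to the standard silhouette score; for example, with scikit-learn, reusing the hypothetical `emb2d` embedding from above and cell-type labels `cell_type`:

```python
from sklearn.metrics import silhouette_score

sil = silhouette_score(emb2d, cell_type, metric="euclidean")
```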

### Performance comparison with popular batch correction methods

We compared *BERMUDA* with several existing state-of-the-art batch correction methods for scRNA-seq data, including *mnnCorrect* [20], *BBKNN* [24], *Seurat v2* (v2.3.4) [22], *Seurat v3* (v3.0.0) [23], and *scVI* [28]. *BBKNN* and *mnnCorrect* were applied to log-transformed TPM values of the variable genes. *Seurat v2*, *Seurat v3*, and *scVI* were applied following their recommended workflows [22, 23, 28]. Due to restrictions of these workflows, we did not apply *Seurat v2*, *Seurat v3*, or *scVI* to the Gaussian simulated gene expression data. To demonstrate the necessity of batch correction methods designed for scRNA-seq data, we also compared *BERMUDA* with batch correction methods developed for microarray and bulk RNA-seq data, namely *ComBat* [14] and *limma* [15] (Additional file 1: Figure S3).