scMC learns biological variation through the alignment of multiple single-cell genomics datasets

Distinguishing biological from technical variation is crucial when integrating and comparing single-cell genomics datasets across different experiments. Existing methods cannot explicitly distinguish these two sources of variation, and often remove both. Here, we present scMC, an integration method that removes technical variation while preserving intrinsic biological variation. scMC learns biological variation via variance analysis in order to subtract the technical variation, which is inferred in an unsupervised manner. Application of scMC to both simulated and real datasets from single-cell RNA-seq and ATAC-seq experiments demonstrates its capability of detecting context-shared and context-specific biological signals via accurate alignment.

Simulation dataset 2. Compared to dataset 1, dataset 2 has imbalanced cell subpopulation compositions between its two batches. We added two more cell clusters to the first batch of dataset 1 as follows:

Sim2 <- splatSimulate(batchCells = 1000, group.prob = c(0.5, 0.5), method = "groups", verbose = FALSE)

In sum, dataset 2 has two batches, with six cell clusters in one batch and four cell clusters in the other.

Simulation dataset 3. Dataset 3 contains three batches with 1000, 1000, and 2000 cells, respectively. We first generated balanced cell clusters for all batches, and then removed one cluster from the first batch and two clusters from the third batch. The dataset with balanced cell clusters was generated by:

Simulation dataset 5. Dataset 5 consists of 12,097 cells with six batches and seven cell groups, and was simulated with the Splatter package using the code provided at https://github.com/theislab/scib/blob/master/notebooks/data_preprocessing/simulations/sim1.R. Detailed information can be found in Luecken et al. [1].
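The cluster-removal step described for simulation dataset 3 can be illustrated with a small Python sketch. This is not the Splatter code used to generate the data: it only simulates cluster labels, not expression values. The batch sizes come from the text and are assumed here to be the sizes before removal; the number of clusters (four) and the identities of the removed clusters are assumptions made for illustration.

```python
import random
from collections import Counter


def simulate_balanced_labels(batch_sizes, n_clusters, seed=0):
    """Draw a cluster label for every cell, with equal probability per cluster."""
    rng = random.Random(seed)
    return [[rng.randrange(n_clusters) for _ in range(size)] for size in batch_sizes]


def drop_clusters(labels, clusters_to_drop):
    """Remove all cells whose label is in clusters_to_drop."""
    drop = set(clusters_to_drop)
    return [label for label in labels if label not in drop]


batch_sizes = [1000, 1000, 2000]  # from the text (assumed to be pre-removal sizes)
n_clusters = 4                    # assumption: not stated in the text

balanced = simulate_balanced_labels(batch_sizes, n_clusters)

# Assumption: which clusters are removed is hypothetical. The text only says
# one cluster is removed from the first batch and two from the third.
removed = {0: [3], 1: [], 2: [0, 1]}
imbalanced = [drop_clusters(labels, removed[b]) for b, labels in enumerate(balanced)]

# Clusters that remain in each batch after removal.
clusters_per_batch = [sorted(Counter(labels)) for labels in imbalanced]
print(clusters_per_batch)
```

After removal, the second batch still contains every cluster, while the first and third batches share only a subset with it, which is the kind of imbalanced composition an integration method has to handle without merging unrelated clusters.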

Robustness analysis of tuning parameters
In scMC, there are two tuning parameters: λ and T. λ controls the relative contribution of the technical variation when learning correction vectors. As λ increases, the ratio of the technical variation to the total variation stabilizes (see Methods in main text; Additional file 2 Figure S16B, S17B). The default value, λ = 10, usually provides a stable result, and scMC was found to be relatively robust when λ was greater than a certain value (Additional file 2 Figure S16A, S17A). T is a thresholding parameter that determines, based on their similarity, whether cell clusters are shared across different datasets. If T is too small, biological variation may be removed. If T is too large, technical variation might not be completely removed. We found that T larger than 0.5 provides better results, and the default value T = 0.6 was used for all datasets. By visualizing the corrected data in UMAP space for both simulated and real datasets, we found scMC to be relatively robust to T values within certain ranges (Additional file 2 Figure S18-S19).
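The role of T can be illustrated with a minimal Python sketch. It is not the scMC implementation: the similarity values are invented, the function name shared_cluster_pairs is hypothetical, and the use of a strict "greater than" comparison is an assumption. The sketch only shows how a threshold on cluster similarity changes which cluster pairs are treated as shared between two datasets.

```python
def shared_cluster_pairs(similarity, threshold=0.6):
    """Return (i, j) index pairs whose similarity exceeds the threshold T.

    similarity[i][j] is a hypothetical similarity between cluster i of one
    dataset and cluster j of another dataset.
    """
    return [
        (i, j)
        for i, row in enumerate(similarity)
        for j, value in enumerate(row)
        if value > threshold
    ]


# Invented similarities: rows are clusters in dataset A, columns in dataset B.
similarity = [
    [0.92, 0.30, 0.10],
    [0.25, 0.55, 0.15],
    [0.05, 0.18, 0.70],
]

for T in (0.2, 0.6, 0.8):
    print(T, shared_cluster_pairs(similarity, T))
```

With a small T, weakly similar clusters are also declared shared, so genuinely distinct subpopulations risk being aligned and their biological differences removed. With a large T, fewer pairs are declared shared, so some technical variation between truly matching clusters may be left uncorrected.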

Biological function analysis of the identified cell subpopulations in the brain tissue from adult mouse scATAC-seq dataset
Four cell subpopulations in the brain tissue were identified from the scMC-integrated data, using the chromVAR k-mer-transformed scATAC-seq data as input (Additional file 2 Figure S13A). To gain insights into the biological functions of these subpopulations, we proceeded in three steps. First, we identified differential loci of the four cell subpopulations by aggregating the scATAC-seq data of each subpopulation and performing a Wilcoxon rank-sum test on the aggregated data. The data of each subpopulation were aggregated by summing the single-cell chromatin profiles of 10 randomly selected cells from that subpopulation. Second, we identified enriched transcription factors (TFs) in these differential loci using chromVAR [2]. chromVAR calculates bias-corrected deviations in accessibility: for each motif, each cell receives a value that measures how much the accessibility of loci containing that motif differs from the expected accessibility based on the average of all cells. By performing hierarchical clustering of the calculated deviations of the 128 identified TFs, we found that the patterns of these TFs were almost specific to each particular cell subpopulation (Additional file 2 Figure S13B). Third, we used GREAT [3] to detect enriched biological processes of the differential loci (Additional file 2 Figure S13C).
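The aggregation step, in which the profiles of 10 randomly selected cells are summed, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function name aggregate_profiles, the number of aggregated profiles per subpopulation (n_aggregates), and the toy binary accessibility data are all hypothetical, since the text does not specify them.

```python
import random


def aggregate_profiles(profiles, n_cells=10, n_aggregates=5, seed=0):
    """Build pseudo-bulk profiles by summing randomly selected cells.

    profiles: list of per-cell count vectors (one list of ints per cell),
        all belonging to the same cell subpopulation.
    n_cells: cells summed per aggregate (10 in the text).
    n_aggregates: number of aggregates to build (assumption, not in the text).
    """
    rng = random.Random(seed)
    aggregates = []
    for _ in range(n_aggregates):
        # Sample without replacement within one aggregate.
        chosen = rng.sample(profiles, n_cells)
        # Sum the chosen cells locus by locus.
        aggregates.append([sum(locus_counts) for locus_counts in zip(*chosen)])
    return aggregates


# Toy data: 30 cells x 4 loci with binary accessibility (invented values).
toy_rng = random.Random(1)
cells = [[toy_rng.randint(0, 1) for _ in range(4)] for _ in range(30)]

pseudo_bulk = aggregate_profiles(cells)
print(pseudo_bulk)
```

Summing over several cells reduces the sparsity of individual scATAC-seq profiles, so a rank-based test between subpopulations operates on less noisy values than it would on single cells.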