Overview of scAI
To deconvolute heterogeneous single cells from both transcriptomic and epigenomic profiles, we aggregate the sparse/binary epigenomic profile in an unsupervised manner to allow coherent fusion with transcriptomic profile while projecting cells into the same representation space using both the transcriptomic and epigenomic data. Using the normalized scRNA-seq data matrix X1 (p genes in n cells) and the single-cell chromatin accessibility or DNA methylation data matrix X2 (q loci in n cells) as an example, we infer the low-dimensional representations via the following matrix factorization model:
$$ {\min}_{W_1,{W}_2,H,Z\ge 0}\alpha {\left\Vert {X}_1-{W}_1H\right\Vert}_F^2+{\left\Vert {X}_2\left(Z\circ R\right)-{W}_2H\right\Vert}_F^2+\lambda {\left\Vert Z-{H}^TH\right\Vert}_F^2+\gamma \sum \limits_j{\left\Vert {H}_{.j}\right\Vert}_1^2, $$
(1)
where W1 and W2 are the gene loading and locus loading matrices with sizes p × K and q × K (K is the rank), respectively. Each of the K columns is considered as a factor, which often corresponds to a known biological process/signal relating to a particular cell type. \( {W}_1^{ik} \) and \( {W}_2^{ik} \) are the loading values of gene i and locus i in factor k, and the loading values represent the contributions of gene i and locus i in factor k. H is the cell loading matrix with size K × n (H.j is the jth column of H), and the entry Hkj is the loading value of cell j when mapped onto factor k. Z is the cell-cell similarity matrix. R is a binary matrix generated by a binomial distribution with a probability s. α, λ, γ are regularization parameters, and the symbol ∘ represents dot multiplication. The model aims to address two major challenges simultaneously: (i) the extremely sparse and near-binary nature of single-cell epigenomic data and (ii) the integration of this binary epigenomic data with the scRNA-seq data, which are often continuous after being normalized.
Aggregation of epigenomic profiles through iterative refinement in an unsupervised manner
To address the extremely sparse and binary nature of the epigenomic data, we aggregate epigenomic data of similar cells based on the cell-cell similarity matrix Z, which is simultaneously learned from both transcriptomic and epigenomic data iteratively. Epigenomic data can be simply aggregated by X2Z. However, this strategy may lead to over-aggregation, for example, in one subpopulation, similar cells exhibit almost the same aggregated epigenomic signals, which improperly reduces the cellular heterogeneity. To reduce such over-aggregation, a binary matrix R, generated from a binomial distribution with probability s, is utilized for randomly sampling of similar cells. After normalizing H with the sum of each row equaling 1 in each iteration step and Z°R with the sum of each column equaling 1, then the aggregated epigenomic profiles are represented by X2(Z ∘ R). The ith column of X2(Z ∘ R) represents the weighted combination of epigenomic signals from some cells similar to the ith cell. These strategies not only enhance epigenomic signals, but also maintain cellular heterogeneity within and between different subpopulations.
Integration of binary and count-valued data via projection onto the same low-dimensional space
Through aggregation, the extremely sparse and near-binary data matrix X2 is transformed into the signal-enhanced continuous matrix X2(Z ∘ R), allowing coherent fusion with transcriptomic measurements (Fig. 1a). These two matrices are projected onto a common coordinate system represented by the first two terms in the optimization model (Eq. (1)). In this way, cells are mapped onto a K-dimensional space with the cell loading matrix H, and the cell-cell similarity matrix Z is approximated by H′H, as represented by the third term in Eq. (1). The sparseness constraint on each column of H is added by the last term of Eq. (1).
Downstream analysis using the inferred low-dimensional representations
scAI simultaneously decomposes transcriptomic and epigenomic data into multiple biologically relevant factors, which are useful for a variety of downstream analyses (Fig. 1b–d). (1) The cell subpopulations can be identified from the cell loading matrix H using a Leiden community detection method (see the “Methods” section). (2) The genes and loci in the ith factor are ranked based on the loading values in the ith columns of W1 and W2 (see Fig. 1b and the “Methods” section). (3) To simultaneously analyze both gene and loci information associated with cell states, we introduce an integrative visualization method, VscAI. By combining these learned low-rank matrices (W1, W2, H, and Z) with the Sammon mapping [27] (see the “Methods” section), VscAI simultaneously projects genes and loci that separate the cell states into a two-dimensional space alongside the cells (Fig. 1c). (4) Finally, the regulatory relationships between the marker genes and the chromosome regions in each factor or cell subpopulation are inferred by combining the correlation analysis and the nonnegative least square regression modeling of gene expression and chromatin accessibility (see Fig. 1d and the “Methods” section). Overall, these functionalities allow the deconvolution of cellular heterogeneity and reveal regulatory links from transcriptomic and epigenomic layers.
Model validation and comparison using simulated data
To evaluate scAI, we simulated eight single-cell datasets with the sparse count data matrix X1 and the sparse binary data matrix X2 (i.e., paired scRNA-seq and scATAC-seq). To recapitulate the properties of the single-cell multiomics data (e.g., a high abundance of zeros and binary epigenetic data), we generated bulk RNA-seq and DNase-seq profiles from the same sample with MOSim [28]. Then, we added the effects of dropout and binarized the data. A detailed description of the simulation approach and the simulated data are shown in Additional file 1: Supplementary methods (Simulation datasets) and Additional file 2: Table S1. These datasets encompass eight scenarios with different transcriptomic/epigenomic properties: different sparsity levels (dataset 1), different noise levels (dataset 2), missing clusters in the epigenomic profiles (i.e., clusters defined from gene expression do not reflect epigenetic distinctions) (dataset 3), missing clusters in the transcriptomic profiles (i.e., clusters defined from epigenetic profile do not reflect gene expression distinctions) (dataset 4), discrete cell states (dataset 5), a continuous biological process (dataset 6), imbalanced cluster sizes with the same number of clusters defined from both transcriptomic and epigenomic profiles (dataset 7), and imbalanced cluster sizes with missing clusters in the epigenomic profiles (dataset 8).
First, we compared the visualization of cells using the scRNA-seq data, scATAC-seq data, and aggregated scATAC-seq data, respectively (Fig. 2a). Due to the inherent sparsity and noise in the data, the cells were not well separated in the scRNA-seq data and the scATAC-seq data using Uniform Manifold Approximation and Projection [29] (UMAP) (Fig. 2a) and t-SNE (Additional file 2: Figure S1), in particular for datasets 5 and 6. However, the cell subpopulations were clearly distinguishable in the low-dimensional space when using the aggregated scATAC-seq data generated by scAI for all eight different scenarios (Fig. 2a). In addition, the cell subpopulations were well separated when visualized by VscAI, which embedded cells in two dimensions by leveraging the information from both scRNA-seq and scATAC-seq data (Fig. 2b). For dataset 3 and dataset 4, in which one cluster was missing in either the transcriptomic or the epigenomic data alone, scAI was able to reveal all the anticipated clusters. For example, in dataset 4, only four clusters were revealed in the scRNA-seq data, but five clusters were embedded in the scATAC-seq data (the fourth row of Fig. 2a). Without the addition of the scATAC-seq information, four clusters were detected (Additional file 2: Figure S2), whereas the integration of both the scRNA-seq and the scATAC-seq data revealed five clusters. In the first five datasets, the cell states are discrete whereas dataset 6 depicts a continuous transition process at five different time points. The continuous transitions in these five cell states were well characterized by scAI with the aggregated scATAC-seq data but could not be captured by using only the sparse scATAC-seq data with UMAP (the sixth row of Fig. 2a) and t-SNE (Additional file 2: Figure S1). For the datasets 7 and 8 with imbalanced cluster sizes, scAI accurately revealed all the expected clusters. In particular, three cell clusters were observed in the low-dimensional space of both scATAC-seq and aggregated scATAC-seq data in the dataset 8 (the eighth row of Fig. 2a). However, five cell clusters were well distinguished after integrating with scRNA-seq data, as shown in the VscAI space (the eighth row of Fig. 2b).
Next, we used the area under receiver operating characteristic curve (AUC) to quantitively evaluate the accuracy of scAI in reconstructing cell loading matrix H, gene loading matrix W1, and locus loading matrix W2, which were used for identifying cell clusters, factor-specific genes, and loci in the downstream analyses, respectively. scAI was found to perform robustly and accurately with different sparsity levels and noise levels (Fig. 2c). For example, even with the sparsity levels of X1 and X2 at 98% and 99.6% in dataset 1, and 79.4% and 97.5% in dataset 5, scAI was able to reconstruct these loading matrices with high accuracy (Fig. 2c).
Moreover, to study whether stronger noise or the initial data with less discriminative patterns have effects on the performance of scAI, we added stronger noise and sparsity levels, and also made the initial data less discriminative among clusters by increasing the parameter value coph, on the simulation dataset 8. We found that the noise levels and parameter coph values have little effects on the reconstructed loading matrices. The sparsity level affects the performance if it is larger than some threshold (e.g., the sparsity of scRNA-seq and scATAC-seq data is larger than 98.9% and 99.5%, respectively), as shown in Additional file 2: Figure S3.
Finally, we applied MOFA [17], a multiomics data integration model designed for bulk data and single-cell data, to the eight datasets (Fig. 2d, e). MOFA decomposes multiomics data matrices into several weight matrices and one factor matrix using a statistically generalized principal component analysis method. For all the datasets except for dataset 7, the factors learned by MOFA only accounted for the variability of the scRNA-seq data, and could not capture the variance in the scATAC-seq data (Fig. 2d). We compared scAI with MOFA on cell clustering (Fig. 2e), finding MOFA does not perform as well as scAI for these simulation datasets (Fig. 2e).
The analysis on simulation data indicates scAI’s potential in aggregating scATAC-seq data, identifying important genes and loci, and uncovering discrete and continuous cell states in single-cell transcriptomic and epigenomic data with inherently high sparsity and noise levels.
Identifying subpopulations with subtle transcriptomic differences but strong chromatin accessibility differences
To evaluate scAI in capturing cell subpopulations in complex tissues, we analyzed 8837 cells from mammalian kidney using the paired chromatin accessibility and transcriptome data [6]. In a previous study, a semi-supervised clustering method was applied to the scRNA-seq data, and then, aggregated epigenomic profiles were generated based on the identified cell clusters [6]. As such, the cellular heterogeneity induced by epigenetics was unable to be captured in this method.
scAI identified 17 subpopulations with either distinct gene expression or chromatin accessibility profiles with the default resolution parameter equaling 1 (see the “Methods” section; Fig. 3a, b, d; Additional file 1). Compared to the original findings [6], our integrative analysis of transcriptomic and chromatin accessibility profiles indicated that the known cell types such as Collecting Duct Principal Cells (CDPC) were much more heterogeneous. We identified two subpopulations of CDPC (C9 and C12, Additional file 2: Figure S4a) that were captured by factor 2 and factor 8, respectively (Fig. 3c). Gene loading analysis of these two factors revealed that Fxyd4 and Frmpd4 are the specific markers of C9, while Egfem1 and Calb1 are the specific makers of C12 (Fig. 3c, and Additional file 2: Figure S4b and c). Importantly, while some identified subpopulations showed only subtle differences in their transcriptomic profiles, they exhibited distinct patterns in their epigenomic profiles (Fig. 3b, d). For example, C2 and C7 (subpopulations of proximal tubule S3 cells (type 1)), and C8 and C10 (subpopulations of proximal tubule S1/S2 cells) have similar gene expression profiles (Fig. 3b), but, exhibit strong differential accessibility patterns (Fig. 3e). The average signals of each locus across cells in each subpopulation are significantly different (Fig. 3e).
To further characterize these differential accessible loci and identify the specific transcriptional regulatory mechanisms of these epigenetics-induced subpopulations, we performed gene ontology enrichment and motif discovery analysis using GREAT and HOMER, respectively (Fig. 3f). Notably, for the two subpopulations C8 and C10 of proximal tubule S1/S2 cells, the C8-specific accessible loci were related to the chromatin binding and histone deacetylase complex, and were further enriched for binding motifs of MAFB and JUNB, both of which are known regulators of proximal tubule development [30]. Differential accessible loci of C10 were enriched in VEGFR signaling pathway, consistent with the role in the maintenance of tubulointerstitial integrity and the stimulation of proximal tubule cell proliferation [31].
Moreover, we applied chromVAR [32] to analyzing the differential accessible loci between C2 and C7, and C8 and C10, respectively. chromVAR calculates the bias corrected deviations in accessibility. For each motif, there is a value for each cell, which measures how different the accessibility for loci with that motif is from the expected accessibility based on the average of all the cells. By performing hierarchical clustering of the calculated deviations of top 30 most variable TFs, we found that these TFs were divided into 2 clusters, and each TF cluster was specific to 1 particular cell subpopulation, which was found to be consistent with the clustering by scAI (Additional file 2: Figure S5).
Revealing underlying transition dynamics by analyzing transcription and chromatin accessibility simultaneously
Next, we applied scAI to data from lung adenocarcinoma-derived A549 cells after 0, 1, and 3 h of 100 nM dexamethasone (DEX) treatment, including scRNA-seq and scATAC-seq data from 2641 co-assayed cells [6]. scAI revealed two factors, where factor 1 was enriched with cells from 0 h and factor 2 was enriched with cells from 3 h (Fig. 4a). Factor-specific genes and loci were identified by analyzing the gene and locus loading matrices (Fig. 4b). Among them, known markers of glucocorticoid receptor (GR) activation [33,34,35] (e.g., CKB and NKFBIA) were enriched in factor 2, and markers of early events after treatment [36] (e.g., ZSWIM6 and NR3C1) were enriched in factor 1. We collected TFs of these known markers from hTFtarget database (http://bioinfo.life.hust.edu.cn/hTFtarget/). Interestingly, the TF motifs, such as FOXA1 [37], CEBPB [38], CREB1, NR3C1, SP1, and GATA3 [39], also had high enrichment scores in the inferred factors (Fig. 4c), in agreement with that these motifs are key transcriptional factors of GR activation markers [40]. Particularly, CEBPB binding was shown positively associated with early GR binding [41], and GR binds near CREB1 binding sites that makes enhancer chromatin structure more accessible [42]. In the low-dimensional space visualized by VscAI, markers of early events, such as ZSWIM6 and NR3C1, were located near cells from 0 h, while markers of GR activation, such as CKB, NKFBIA, and ABHD12, were located near cells from 3 h (Fig. 4d), providing a direct and intuitive way to interpret the data.
To systematically assess the top ranked genes and loci in the identified factors, we performed pathway enrichment analysis of genes with MSigDB [43] and loci with GREAT [44]. As expected, several processes relevant to GR activation were uncovered, such as the “neurotrophin signaling pathway,” a pathway previously reported to have a direct effect on GR function [45]. The “Fc epsilon RI signaling pathway” was enriched in factor 2 (Additional file 2: Figure S6a), which is in good agreement with that the reduction of Fc epsilon RI levels might be one of the favorable anti-allergic functions of glucocorticoids in mice [46]. Furthermore, processes such as “genes involved in glycogen breakdown (glycogenolysis),” “genes involved in glycerophospholipid biosynthesis,” and “pentose and glucuronate interconversions” were enriched in the nearby genes of the factor-specific loci (Additional file 2: Figure S6b).
While the DEX treatment of A549 cells is known to increase both transcription and promoter accessibility for markers of GR activation [6], little is known on the regulatory relationships. We inferred regulatory links between cis-regulatory elements and target marker genes using perturbation-based correlation analysis and further identified bounded TFs that regulate target marker genes using nonnegative least square regression (see the “Methods” section). To assess the accuracy of the inference, we evaluated whether these regulatory relationships were enriched in an independent database of TF-target relationships for human (hTFtarget, http://bioinfo.life.hust.edu.cn/hTFtarget/) (see the “Methods” section). Encouragingly, high enrichment of the inferred regulatory relationships for the key markers of GR activation was observed (Fig. 4e), and the inferred regulatory relationships were able to be validated using ChIP-seq and DNase-seq data from ENCODE (https://www.encodeproject.org/). For the GR activation marker ABHD12 that was highly enriched in factor 2, we identified distinct regulatory links between factor 1 (enriched with cells from 0 h) and factor 2 (enriched with cells from 3 h). Among its regulators, the glucocorticoid receptor NR3C1 was revealed in factor 2 (Fig. 4f). Visualizing the chromatin signals of ChIP-seq data of NR3C1 and DNase-seq data using WashU Epigenome Browser (https://epigenomegateway.wustl.edu/browser), we found that most cis-regulatory elements are located in the open regions of the DNase-seq data, and that NR3C1 exhibits signals within 50 kb of the transcription start site (TSS) of ABHD12 at 1 and 3 h but no signals at 0 h in the ChIP-seq data. This is consistent with our prediction on the regulation between NR3C1 and ABHD12 existing in factor 2, but not in factor 1.
scAI provides an unsupervised way to aggregate sparse scATAC-seq data from similar cells through iterative refinement, which facilitates and enhances the direct analysis of scATAC-seq data. We next assess the performance of the aggregated scATAC-seq data in comparison with the raw scATAC-seq or scRNA data, in terms of the identification of cell states, the low-dimensional visualization of cells, and the reconstruction of the pseudotemporal dynamics. The previous study [6] identified two clusters that comprised a group of untreated cells and a group of DEX-treated cells, in which treated cells collected from 1 and 3 h form one cluster. Our analysis recovered three cell states, including an early state enriched by cells from 0 h, a transition state enriched by cells from 1 h, and a late state enriched by cells from 3 h (Fig. 4a). Due to the high sparsity (96.8% for scRNA-seq and 99.2% for scATAC-seq) and the near-binary nature of the scATAC-seq data, dimension reduction methods, such as t-SNE, were found to fail to distinguish the different cell states (Additional file 2: Figure S6c). However, scAI uncovered distinct cell subpopulations, as seen in the low-dimensional space, based on the aggregated data (Additional file 2: Figure S6c).
We next study the pseudotemporal dynamics of A549 cells using our previously developed method scEpath [47]. Compared to the trajectory inferred using only the scRNA-seq data, which lacks well-characterized GR activation trends for cells measured at three different time points (Additional file 2: Figure S6d), a clear and consistent trajectory was inferred when using the aggregated scATAC-seq data (Fig. 4g, h). We identified pseudotime-dependent genes and loci that were significantly changed along the inferred trajectories. The pseudotemporal dynamics of these genes along the trajectory inferred using only the scRNA-seq data were found to be discontinuous, in contrast to the aggregated scATAC-seq data obtained from scAI led to continuous trajectory (Fig. 4i). Previously, we used the measure scEnergy to quantify the developmental process [47]. Here, we found no significant differences in the single-cell energies between different time points when only using the scRNA-seq data. However, significantly decreased scEnergy values were seen during treatment according to the aggregated scATAC-seq data (Additional file 2: Figure S6e).
Overall, the aggregated scATAC-seq data by scAI can better characterize the dynamics of DEX treatment, and scAI suggests new mechanisms regarding the GR activation process in DEX-treated A549 cells, including a transition state and differential cis-regulatory relationships.
Uncovering coordinated changes in the transcriptome and DNA methylation along a differentiation trajectory
To study data with simultaneous single-cell methylome and transcriptome sequencing [3, 8, 48], we applied scAI to a dataset obtained from 77 mouse embryonic stem cells (mESCs), including 13 cells cultured in “2i” media and 64 serum-grown cells, which were profiled by parallel single-cell methylation and transcriptome sequencing technique scM&T-seq [3]. The DNA methylation levels were characterized in three different genomic contexts, including CpG islands, promoters, and enhancers, which are usually linked to transcriptional repression [49, 50].
Because DNA methylation data are sparse and binary, direct dimensional reduction may fail to capture cell subpopulations (Fig. 5a). scAI was able to distinguish cell subpopulations after aggregation (Fig. 5a), showing three subpopulations, C1, C2, and C3. Among them, C3 was captured by factor 1 with cells cultured in “2i” media and a few serum-grown cells, while C1 and C2 were captured by factors 2 and 3, respectively, with other serum-grown cells (Additional file 2: Figure S7).
Based on the top gene and locus loadings in each factor, we identified 688, 877 and 422 marker genes and 2164, 953 and 4461 differential methylated loci in C1, C2, and C3, respectively, with distinct gene expression and methylation patterns among these three groups (Fig. 5b). Moreover, methylation levels of loci near marker genes also showed group-specific patterns (Fig. 5b). Several known pluripotency markers (e.g., Essrb, Tcl1, Tbx3, Fbxo15, and Zpf42) exhibited the highest gene enrichment scores in factor 1 but the lowest gene enrichment scores in factors 2 and 3. In contrast, differentiation markers, such as Krt8, Tagln, and Krt19, exhibited higher gene enrichment scores in factor 3 but lower enrichment scores in factors 1 and 2 (Fig. 5c). Factor 2 exhibited an intermediate state with a relatively low expression level of both pluripotency and differentiation markers. Interestingly, several new marker genes of this intermediate state were observed, such as Fgf5, an early differentiation marker involved in neural differentiation in human embryonic stem cells [51]. Factor-specific loci located in the CpG, promoter, and enhancer regions of marker genes are also shown in Fig. 5c. The pluripotency markers Essrb and Tcl1 had higher gene enrichment scores, and their corresponding CpG, promoter, and enhancer regions had higher locus enrichment scores in factor 1. This relationship is consistent with the fact that some DNA methylation located in the CpG, promoter, and enhancer regions exhibit a negative relationship with the expression level of target genes.
A continuous differentiation trajectory, which was characterized by the differentiation of naïve pluripotent cells (NPCs) into primed pluripotent cells and ultimately into differentiated cells (DCs), was observed using VscAI (Fig. 5d). Additionally, the embedded genes and factors showed how specific genes and factors contribute to the differentiation trajectory. For example, pluripotency markers, such as Zpf42, Tex19.1, Fbxo15 Morc1, Jam2, and Esrrb [52, 53], were visually close to factor 1, while differentiation markers, such as Krt19 and Krt8 [54], were close to factor 3 (Fig. 5d). Interestingly, although both pluripotency and differentiation markers were not highly expressed in the early differentiated state in factor 2, some methylated loci of these markers (e.g., CpG regions of Zfp42 and Tex19.1, enhancer region of Jam2 and Tcl1, and promoter region of Anxa3) were enriched in factor 2 (Fig. 5d). These observations might be because their other regions (CpG, enhancer, or promoter) are methylated or DNA methylation is not the main driven force for transcriptional silencing. Overall, scAI shows coordinated changes between transcriptome and DNA methylation along the differentiation process.
Comparison with three multiomics data integration methods
We next compared scAI with three recent single-cell integration methods, MOFA [17], Seurat (version 3) [22], and LIGER [23], on A549 and kidney datasets. Similar to the observations on the simulation datasets (Fig. 2d), MOFA cannot capture the variations in the scATAC-seq data as the variances explained by the learned factors in the scATAC-seq data were nearly zero (Additional file 1: Supplementary methods (Details of data analysis by MOFA) and Additional file 2: Figure S8a-e). While Seurat and LIGER were designed for connecting cells measured in different experiments, we applied them to the two co-assayed single-cell multiomics data to test whether they are able to make links between co-assayed cells. We assessed the comparison using two metrics: (a) entropy of batch mixing and (b) silhouette coefficient. The entropy of batch mixing measures the uniformity of mixing for two samples in the aligned space [55], for which scRNA-seq and scATAC-seq profiles were treated as two batches, and a higher entropy value means better alignment. The silhouette coefficient quantifies the separation between cell groups using distance matrices calculated from a low-dimensional space [55], for which cell group labels were taken from the original study [6] and a higher silhouette coefficient indicates better preservation of the differences and structures between different cell groups.
The t-SNE analysis shows the co-assayed cells were aligned better by LIGER than Seurat when the two methods were applied to A549 dataset (Fig. 6a). This observation is further confirmed by computing the entropy of the batch mixing based on the aligned t-SNE space. We also computed the entropy of perfect alignment (i.e., the t-SNE coordinates of each pair of co-assayed cells are the same), and found that LIGER showed higher entropy value than Seurat, but lower entropy than the perfect alignment (Fig. 6a). In addition, we explored the quality of time point-based grouping of cells on the t-SNE space. Cells from 1 and 3 h were mixed together on the t-SNE space generated by Seurat, while there was a gradual change of cells from 0 to 3 h on the t-SNE space generated by LIGER (Fig. 6b). We also performed t-SNE on the cell loading matrix inferred by scAI (Additional file 2: Figure S8f), and found that scAI was able to capture the gradual change of cells transitioning from 0 to 3 h. Quantitatively, scAI produced significantly higher silhouette coefficients than those from both Seurat and LIGER (Fig. 6b).
In the kidney dataset, by computing the entropy of the batch mixing based on the aligned UMAP space, we observed significantly lower entropy of Seurat and LIGER than that of the perfect alignment (Fig. 6c). We then also calculated the silhouette coefficient using the UMAP space for all three methods (Fig. 6d and Additional file 2: Figure S8f). Again, significantly higher silhouette coefficients were observed in scAI, in comparison with those in Seurat and LIGER (Fig. 6d). Together, these results suggest that integration methods designed for measurements in different cells (e.g., Seurat and LIGER) may not accurately identify correspondences between the co-assayed cells, leading to errors in downstream analysis, and the integration of parallel single-cell omics data needs specialized methods, such as scAI, to deal with the epigenomic data with inherently high sparsity and to better preserve intrinsic differences between cell subpopulations.
Comparison with methods using single omics data
To evaluate the significance of the parallel profiling of multiomics over single omics data, we compared scAI with methods that use only transcriptomic data or only epigenomic data on both simulation and real datasets. Specifically, we compared scAI with two methods designed for only scRNA-seq data, including Seurat and SC3 [56], and two methods designed for only scATAC-seq data, including Signac (https://satijalab.org/signac/) and scABC [57]. On simulation datasets, we evaluated the performance of cell clustering using normalized mutual information (NMI). On real datasets, we compared the clustering based on those four methods with prior labels using UMAP.
On simulation datasets, we observed comparable NMI values between scAI and SC3, but slightly lower values of Seurat (Additional file 2: Figure S9). For the clustering of scATAC-seq data, both Signac and scABC showed significantly lower NMI values compared to those by scAI using both scRNA-seq and scATAC-seq data. On A549 real datasets, by visualizing cells in UMAP, we found that both Seurat and SC3 were unable to detect the transition stage and distinguish cells from 1 and 3 h. Cell clusters identified by Signac and scABC using scATAC-seq data alone were found to be inconsistent with the prior labels (Additional file 2: Figure S10a). On kidney dataset, Seurat was unable to distinguish the DCTC cells and CDPC cells, and Signac and scABC were also producing clusters inconsistent with prior labels (Additional file 2: Figure S10b). On mESC dataset, while both Seurat and SC3 correctly identified the cell subpopulations, clusters identified by Signac and scABC also mixed together in UMAP (Additional file 2: Figure S10c). Overall, scAI is able to consistently identify the expected clusters and also the clusters with subtle transcriptomic differences but strong chromatin accessibility differences (as shown in kidney dataset), showing the importance of integrating parallel single-cell multiomics data.