### scREG: a computational method for single cell regulatory analysis from multiome data

We propose a computational method for integrative analysis of single cell multiome gene expression and chromatin accessibility data. Figure 1 shows the schematic overview of the scREG analysis workflow. Our software scREG takes as input the read count matrices of gene expression and chromatin accessibility measured on the same cells which are the same format as the standard output of 10X genomics CellRanger software. The output of scREG is a lower-dimensional representation of cells, clustering label, 2-D embedding, and subpopulation-specific *cis*-regulatory networks. The R package also has an interactive visualization function to plot the peaks, genes, and *cis*-interaction in a subpopulation-specific manner for a given genomics range. scREG processes the input data in three main steps: joint dimension reduction, cell clustering, and regulatory network inference.

First, we develop a non-negative matrix factorization (NMF)-based optimization model to reduce the dimension of multiome data with m_{1} genes and m_{2} peaks to a common K dimension matrix (default value of K is 100). The gene expression data E matrix and chromatin accessibility data O matrix could be treated as two different modalities and thus those need to be integrated. To capture the cross-modalities information, we define a cell level index *cis*-regulatory potential (noted as *R* matrix, rows represent peak-gene pairs and columns represent cells) to measure the regulatory strength of a peak to a TG in a cell. We project the three different data matrices E, O, and R to a common reduced dimension space by an NMF-based optimization model (see the Joint dimension reduction” section). Specifically, we factorize the three data matrices into products of modality-specific factor profiles (W_{1}, W_{2}, and W_{3} for E, O, and R respectively) and a common low dimension representation of the cells (the H matrix).

Second, we cluster the cells based on the reduced dimension matrix. Specifically, we calculate the similarity between cells based on the reduced dimension H matrix and construct a k-nearest neighbor graph. To detect rare populations, we transfer this graph to a weighted graph where the weight between two nodes is defined as the Jaccard similarity of their neighbors in the k-nearest neighbor graph. The Louvain algorithm [37] was applied on this weighted graph to identify the cell populations (see the “Clustering of cells” sections).

Third, we define a *cis-*regulatory score (CRS) for each peak-gene pair in each cluster by an average of the *cis*-regulatory potential over cells from the same cluster (see the “Reconstruction of the *cis*-regulatory networks” section). We select the top 10,000 peak-gene pairs in each subpopulation to construct the *cis*-regulatory network and identify regulatory elements (REs) as peaks regulating at least one gene in a given subpopulation. To obtain more accurate regulatory information, we merge cells from the same subpopulation and perform peak calling on each subpopulation. We replace the REs with cluster-specific peaks, which are shorter and more accurate than the original REs, to obtain the final subpopulation-specific *cis*-regulatory networks.

### scREG performs cross-modalities dimension reduction by data integration

To assess whether the scREG method can reduce dimensions efficiently, we apply our method to peripheral blood mononuclear cells (PBMC) multiome data from 10X genomics (see the “PBMC 10 K data” section). We use the joint NMF-based optimization model in scREG to reduce the dimension of data into 100 dimensions. To validate the results, we used the cell-type labels that were annotated by the 10X Genomics R&D team as surrogates for ground truths [34] and calculated silhouette index (SI) for each cell based on the reduced dimension matrix. A higher SI value indicates that the cell is more similar to cells sharing its label than those not sharing its label. First, we compare our method with dimension reduction methods based on a single dataset PCA on scRNA-seq, and LSA on scATAC-seq. As the dataset contains 14 cell types, we choose the top 14 dimensions of PCA and LSA. To make the results comparable, we perform PCA on our reduced dimension matrix, which is 100 dimensions, to reduce it to the same dimension as the other two methods. Figure 2A shows that scREG achieves higher SI than PCA RNA, and LSA ATAC on 83.39%, and 83.37% of the cells, respectively.

The increment in SI is cell type-dependent. Figure 2C and Additional file 1: Fig. S1 show the distribution of SI of different methods on different cell types. For CD56 bright NK cells, naive B cells, naïve CD4 T cells, and plasmacytoid DC, the SIs are increased in scREG compared to all the other methods (Fig. 2B). The scREG achieved the best performance on 9 out of 14 cell types. Additional file 1: Fig. S1 shows the comparison of the average SI on all the cell types by different dimension reduction methods, except for the effector CD8 cell, on which all methods generally have poor performance. Our method scREG performs better on most of the cell types, with SI range from 0.2371 to 0.9758, and achieves the highest average SI score across cell types (0.5614).

One alternative way to perform dimension reduction is to perform dimension reduction separately on each type of data first and concatenate them together. We construct a new lower dimension representation by concatenating 7 dimensions of PCA on scRNA-seq and 7 dimensions of LSA on scATAC-seq and compare this with scREG. These newly constructed 14 features perform lightly better than the two methods that perform dimension reduction separately but are inferior to scREG on 83.14% of the cells (Fig. 2A).

To evaluate the importance of the *cis-*regulatory potential, we compare our method with three other methods that do not use cis-regulatory information. First, we build a baseline model of scREG by removing the *cis-*regulatory potential term (the 3rd term in eq. (1) of Methods) from the optimization model and then perform the same optimization and dimension reduction as before. The other two methods for comparison are scAI [16] and MOFA+ [33]. We use these three baseline methods to reduce the dimension to 100 and perform PCA to further reduce it to 14 dimensions. The result shows that including the *cis*-regulatory term improves performance: scREG increases SI on 72.08%, 81.18%, and 62.39% of cells compared to scAI, MOFA+, and scREG baseline, respectively (Fig. 2B). Figure 2B, C and Additional file 1: Fig. S1 show that the baseline methods without cis-regulatory potential are inferior to the scREG but perform better than the other methods that take a single dataset as inputs. This result suggests that we achieve more accurate low dimension representation by integrating both gene expression and chromatin accessibility data, and the accuracy would be further improved if we include the *cis*-regulatory potentials into the formulation of the NMF-based optimization model.

### scREG identifies the cell populations with high accuracy

To evaluate the performance of the cell clustering aspect of our method, we compare the scREG with Cell Ranger ARC V1.0, Seurat V4.0, scAI, MOFA+, and the baseline scREG method. Cell Ranger ARC analyzes the gene expression and chromatin accessibility separately and outputs two clustering labels, while scREG scAI, MOFA+, and Seurat perform a joint analysis of the two types of data and output one clustering label. Figure 3A and Additional file 1: Fig. S2A show the UMAP and inferred clustering labels from each of the five methods. Figure 3B and Additional file 1: Fig. S2B show the corresponding UMAPs colored by the surrogate ground truth labels. From the clustering and Umap visualization, we see the scREG gives a consistent label with the surrogate ground truth while alternative methods have obvious misclassification. The naive CD4 T cells and the naive CD8 T cells are not separated in Cell Ranger RNA-seq clustering but separated in Cell Ranger ATAC-seq clustering and the joint clustering methods scREG and Seurat. The non-classical monocytes and the myeloid DC are not separated in both RNA-seq and ATAC-seq clustering but separated in the three joint clustering methods. Memory B cells and naïve B cells are separated clearly in ATAC-seq clustering, but the boundary is not clear in RNA-seq clustering. In Seurat, the boundary between memory B cells and naïve B cells is shifted so a large proportion of memory B cells are labeled as naïve B cells. In scAI clustering, there is a subpopulation that is a mixture of naive CD4 T cells, naive CD8 T cells, and memory CD4 T cells. Memory B cells and naïve B cells are not separated in MOFA+ clustering. These two cell types are separated clearly in scREG_baseline and scREG. To evaluate the clustering results systematically, we calculate the normalized mutual information (NMI) and adjusted Rand index (ARI) based on the surrogate ground truth. It is seen that scREG achieves the highest NMI and ARI compared to other methods. It is worth to notice that all the clustering methods are compared under their default resolution parameter setting, which is 0.8 for Seurat and 1 for all other methods. As clustering accuracy is highly affected by the resolution provided to Louvain clustering, we also compared the clustering performance of these five methods under different resolutions ranging from 0.2 to 2.0 (Additional file 1: Fig. S3). The scREG performs very robust under different resolution parameters and achieve the highest performance among all the method we compared.

We also use scREG and alternative methods on 10X multiome data from the human cerebellum, mouse E18 brain, and lymph node from B cell lymphoma (Fig. 3C–E). The clusters from scREG are consistent with the known marker genes’ expression (Additional file 1: Fig. S4-S6). As ground truth labels are not available for these data, we use an adjusted internal clustering evaluation to compare the clustering of scREG with Seurat. It is not fair to compare scREG clustering with Seurat clustering by calculating internal clustering evaluation metrics. The reason is that, when ground truth is not available, the internal clustering evaluation metrics calculated on different distance matrices are not comparable. This is an underappreciated statistical issue, so we generate two artificial examples to illustrate this (please see Additional file 2 for more detail). From example 1, we see the silhouette index is increased but the real clustering accuracy is decreased. This is because internal clustering evaluation metrics are designed for the comparison of different clustering methods (or different parameters) based on the same cell-cell distance matrix. Since internal clustering evaluation has to be performed on the same distance matrix, we have to choose a common embedding space to compare two clusterings. Here, we use the top 20 principal components (PCs) of scRNA-seq and scATAC-seq to calculate the cell-cell distance. For a given clustering, we can say it is a good clustering if cells with the same labels are close to each other in both modalities; we say it is a bad clustering if cells with the same labels are not close to each other in any modalities. Example 2 in the Additional file 2 has illustrated this. It is possible that method A is better than method B on one modality but worse in the other modality. In this case, we cannot evaluate these two methods. For each clustering method, we compute four different clustering evaluation metrics based on top 20 PCs of scRNA-seq and scATAC-seq on the three datasets (Additional file 1: Fig. S7-S9). Figure 3F shows the comparison of scREG with Seurat. It shows scREG performs better than Seurat for most cases.

We also tested our method on the bone marrow mononuclear cells data from the NeurIPS competition [38]. which include 22,463 cells with known cell labels. Additional file 1: Fig. S10 shows the clustering and 2D embedding results and comparison with Seurat. The clustering of scREG is more consistent with the ground truth label and achieves 0.7649 in NMI and 0.6549 in ARI, which are higher than Seurat (NMI = 0.7409, ARI = 0.5419). Overall, scREG identifies cell populations with high accuracy on different datasets.

### scREG constructs subpopulation specific cis-regulatory networks

To assess the *cis*-regulatory network inference of our method, we evaluate the predictions in several cell populations where experimental *cis*-regulation is available. First, we download variant-gene links defined by the expression quantitative trait loci (eQTL) of CD14 positive monocyte cells [39] and use them to validate the RE-TG prediction of classical monocyte clusters. For each peak-gene pair in the *cis*-regulatory potential *R* matrix, we have a predicted *cis*-regulatory score in each cluster. Taking the eQTL data as ground truth, we plot the receiver operating characteristic (ROC) curve and precision-recall (PR) curve by sliding the *cis*-regulatory score (Fig. 4A, B). As a baseline method for comparison, we calculate the Pearson correlation coefficient (PCC) between the expression of a gene and the accessibility of peak within 1 Mb of the gene’s transcriptional start site. Our method achieves 0.81 area under the ROC (AUROC) curve and 0.46 area under the PR (AUPR) curve, while the AUROC and AUPR of the baseline method are 0.56 and 0.25, respectively. Our method identified 9067 REs in classical monocyte subpopulations, and 1998 of them overlapped with eQTL variants. For those REs that overlapped with eQTL variants, 53.25% of our predictions are connecting them to the same genes as eQTL (Fig. 4C). If we randomly select genes from a 1 Mb distance, this percentage would be 3.26% (16.33-fold), and it would be increased to 18.31% (2.91-fold) if we force the selected peak-gene to have the same distance distribution with our predictions.

Next, we use promoter capture HiC data to validate our predictions. We downloaded the promoter capture HiC data of 7 primary blood cell types [40] and compare them with the *cis-*regulatory networks for the corresponding cell types that we inferred from the PBMC data. Figure 4D and Additional file 1: Fig. S11 shows the precision of our method is 3-fold higher than the randomly selected peak-genes pairs which have the same distance distribution as our predictions. Additional file 1: Fig. S12 shows our method achieves much higher AUROC and AUPR than PCC. The strong performance in the peak-gene link suggests that scREG is effective in inferring regulatory relations.

The *cis*-regulatory networks are highly cell-type-specific. Additional file 1: Fig. S13 shows the Jaccard similarity of clusters in terms of cis-regulation. The average Jaccard similarity between clusters is 0.4760. Hierarchical clustering analysis shows similar cell types have a similar cis-regulatory network. For example, the average similarity between four T cell clusters is 0.7783, and group to one cluster; the similarity of two B cell subpopulations is 0.79; the similarity of two NK cell subpopulations is 0.79; similarity of two monocytes subpopulation is 0.64.

### scREG shed light on the interpretation of disease-associated loci

The *cis-*regulatory network inferred from single cell multiome data may provide new insights for the interpretation of disease-associated loci. To demonstrate this, we download 376 fine mapped variants with a posterior inclusion probability greater than 0.1 for inflammatory bowel disease (IBD) [41, 42]. These fine-mapped GWAS variants showed high enrichment (odds ratio in the range of 9.56 to 27.45) in REs from subpopulations of PBMC data (Fig. 5A, see the “Enrichment of GWAS variants” section) [43]. As a baseline for comparison, we use all peaks from each subpopulation to do the same analysis. As a result, the enrichment odd ratios from all peaks are 1.14-3.02 fold (on average 1.82-fold) lower than the peaks from our networks. These results show that the context specific regulatory network from scREG could improve the interpretation of the disease-associated loci.

The *cis*-regulatory networks produced by scREG connect 34 fine-mapping variants to 32 genes in 12 cellular contexts. These 32 genes include three transcription factors IRF4, IRF8, and CEBPB. Variants rs913678, rs4811031, and rs6063502 are linked to CEBPB in non-classical monocytes; variant rs6935510 in chr6 is linked to IRF4, and variant rs11640143 in chr16 is linked to IRF8 in plasmacytoid dendritic cells. Our scREG package has an interactive visualization function that takes the genomics region as input and plots the genes, peaks, and interactions in the given range. It includes the genes, the raw peaks from all cells (before clustering), peaks of each cluster from MACS2, and the predicted peak-gene association in each cluster. Figure 5B shows the track of around the variant rs11640143 and IRF8. The variant rs11640143 is in an accessible region in myeloid DC, memory B cell, naïve B cell, and plasmacytoid DC cells (highlighted blue in Fig. 5B). The regulatory element that contains this variant is predicted to regulate IRF8 only in the scREG generated a *cis*-regulatory network of plasmacytoid DC cells. It also shows that the cluster level peaks are narrower than the raw peaks and multiple cluster-level peaks are merged into one raw peak (highlighted yellow in Fig. 5B). The interactive visualization function will help users understand the regulatory relations in a cell type-specific manner.

To further investigate the roles of these transcription factors in IBD, we modify our previous PECA2 method to infer their trans-regulatory target genes (see the “Inference of trans-regulatory targets” section) in each subpopulation. We download the differential expression genes list from IBD patients versus healthy controls study [44] for further analysis. First, we find the TF IRF4 is upregulated in IBD patients compared to healthy controls (two samples two-tail *t*-test, *p*-value = 7.59E−31). Next, we compare differential expressed genes with the target genes of the three TFs (Fig. 5C, D). Target genes of IRF4 and IRF8 in plasmacytoid dendritic cells are 2.88 and 2.80 fold enriched (odds ratio) in the upregulated genes in IBD (Fisher’s exact test, *p*-value 7.97E−32, and 7.76E−31). Target genes’ of CEBPB in non-classical monocytes are 3.17-fold enriched in the IBD upregulated genes (Fisher’s exact test, *p*-value 1.51E−31). Interestingly, downregulated genes in IBD are depleted in the IRF4, IRF8, and CEBPB’s target genes (Odds ratios are 0.45, 0.44, and 0.55, *p*-values are 5.61E−06, 2.03E−06, and 0.0037). Overall, scREG is a useful tool for the interpretation of disease-associated loci.