CMOT: Cross-Modality Optimal Transport for multimodal inference

Multimodal measurements of single-cell sequencing technologies facilitate a comprehensive understanding of specific cellular and molecular mechanisms. However, simultaneous profiling of multiple modalities of single cells is challenging, and data integration remains elusive due to missing modalities and cell–cell correspondences. To address this, we developed a computational approach, Cross-Modality Optimal Transport (CMOT), which aligns cells within available multi-modal data (source) onto a common latent space and infers missing modalities for cells from another modality (target) of mapped source cells. CMOT outperforms existing methods in various applications from developing brain, cancers to immunology, and provides biological interpretations improving cell-type or cancer classifications. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-023-02989-8.

of the data to infer the missing modality. However, they only work with multimodal data that must come from the same cells (i.e., fully corresponding). Alignment-based methods like non-linear manifold alignment [10] have been shown to align multimodalities with partial cell-to-cell correspondence information but have not been extended to crossmodality inference. Machine learning has also emerged to help modality inference. For instance, TotalVI [5] builds a variational autoencoder that infers missing protein profiles from gene expression using CITE-seq data. Polarbear [11] also uses autoencoders; however, trains on both single and multi-modal data to infer each modality. However, such autoencoder-based approaches are unsupervised that learn the latent embeddings that likely lack biological interpretability and lack a mechanism to introduce prior knowledge about underlying data distribution [12]. Moreover, training autoencoders typically requires considerable amounts of data and time with intensive hyperparameter tuning.
Optimal Transport (OT) is an efficient approach that uses prior knowledge about data distribution to find an optimal mapping between the distributions [13]. OT can also work on small datasets with limited parameters. Recently, OT has been applied to singlecell multiomics data for various applications [14][15][16][17]. Schiebinger et al. [14] used OT to model the developmental trajectory of single-cell gene expression through unbalanced optimal transport. Single-cell integrative analysis frameworks like SCOT [15], SCOTv2 [16], and Pamona [17] further extended the original OT problem for multi-omics data alignment. Another work [18] used OT with an additional entropic regularization term to improve the unsupervised clustering of single-cell data to understand cell types and cellular states better. However, OT has not yet been applied for cross-modality inference. Thus, we propose that integrating OT with multimodal data alignment can work for cross-modality inference and address the above limitations of prior works.
Particularly, we developed CMOT (Cross-Modality Optimal Transport), a computational approach to infer missing modalities of single cells. CMOT first aligns the cells with multimodal data (source) if the cells do not have complete correspondence, and then applies OT to map the cells from single modality (target) to the source cells via shared modality. Finally, CMOT uses the k-Nearest-Neighbors (kNN) of source cells to infer missing modality for target cells. Moreover, CMOT does not need paired multimodal data for alignment. We found that not only does CMOT outperform existing state-of-art methods, but its inferred gene expression is biologically interpretable by evaluating on emerging single-cell multi-omics datasets. Finally, CMOT is open source at: https:// github. com/ daife ngwan glab/ CMOT.

Overview of Cross-Modality Optimal Transport
CMOT (Cross-Modality Optimal Transport) is a computational approach for crossmodality inference of single cells (Fig. 1). CMOT accepts available multimodal and single modality datasets as inputs. CMOT does not require that the available multimodalities have complete corresponding information, i.e., allowing a fraction of unmatched cells in the source.
CMOT first aligns a group of cells X and Y (source) within available multi-modal data onto a common latent space (Step A), if the cells across multimodalities do not have complete correspondence. However, this is an optional step if the cells across multimodalities have complete correspondence. In this study, we used Non-linear Manifold Alignment (NMA) [19] to align the unmatched multimodalities. Next, CMOT applies optimal transport to map the cells with a single modality Y (target) to cells in the source from the same modality Y by minimizing their cost of transportation using Wasserstein distance (Step B). This distance can be regularized by prior knowledge (e.g., cell types) or induced cell clusters to improve mapping, and an entropy of transport to speed up OT computations. The optimal transport optimization tries to find an efficient mapping π * between cells of Y and Y that is used to transport cells in Y to the same space as cells in Y . Once transported, CMOT uses k-Nearest-Neighbors to infer the missing modality X for the cells in target Y (Step C). Here, the missing or additional modality X inferred by CMOT has the same number of features as X , and in the same space as X . Details about each step can be found in the "Methods" section.
We benchmarked CMOT with state-of-art methods [5, 7,-9, 11, 20, 21] on large-scale single-cell multi-omics (e.g., scRNA-seq and scATAC-seq (Additional File 1: Fig. S1A, Fig. S2A, Fig. S3A, Fig. S4A)). Also, we applied CMOT to additional omics datasets like protein expression. These datasets span across broad contexts including human and mouse brains, cancers and immunology, showing the generalizability of CMOT. CMOT is a computational approach to infer missing modalities for existing single-cell modalities. It has three main steps: A Alignment (optional), B Optimal Transport, and (C) k-Nearest Neighbors inference. CMOT inputs two multi-modalities X and Y (source), where the cells in X and Y need not be completely corresponding. The cell-to-cell correspondence information between X and Y can be specified through p . CMOT aligns X and Y using non-linear manifold alignment (NMA) onto a common low-dimensional latent space if cells in X, Y do not have complete correspondence. Then, CMOT uses optimal transport (OT) to map the cells in source Y to the cells in target Y , where Y and Y share modalities. CMOT minimizes the cost of transportation by finding the Wasserstein distance between cells in Y and Y which is further regularized by prior knowledge or induced cell clusters and entropy of transport. Finally, CMOT infers the missing modality X for cells in Y using k-Nearest Neighbors (kNN). It calculates a weighted average of the k-nearest mapped cells in Y for every cell in Y , using their values from X , and infers X

Human brain
We first applied CMOT to single-cell human brain data with jointly profiled chromatin accessibility and gene expression by 10xMultiome (scATAC-seq and scRNA-seq of 8981 cells) and inferred gene expression of cells from open chromatin regions (OCRs by peaks from scATAC-seq) [1]. We selected the top 1000 most variable genes & peaks (Additional File 1: Supplementary Methods). We randomly split the cells into 80% training for cross-validation and 20% testing set for evaluation. We split the training set into training and validation to find optimal parameters for the model using 5-fold cross-validation. For the alignment, we set K = 5, and latent dimension d = 20. For optimal transport, we set parameters λ = 200 and η = 1. For kNN modality inference, we set k = 600. Also, we used 10 major brain cell types from the dataset. However, to test CMOT's performance when such cell-type information is absent, we induced cell labels by two major cell clusters. We also tested CMOT's performance for different levels of correspondence: p = 25%, 50%, 75%, 100%.
Next, we evaluated if the CMOT inferred gene expression to classify brain cell types. We used known cell-type marker genes provided along with the dataset and selected the top 8 highly predictive cells from each cell type within our inferred gene expression (see Additional File 2). Due to high imbalance of the number of cells within each cell type, we picked the number of cells based on the size of the smallest cell type. In this case, the cell type EC/Peric. had only 8 cells; therefore, we picked only 8 cells from the other cell types. We then calculated the AUPRC of the respective inferred genes in a one-vs-all manner for each cell type against the rest ("Methods"). CMOT obtains the higher AUPRCs for these genes against a baseline of 0.1 (Fig. 2C) for all cell types. The baseline is defined as the proportion of positives in the data. This suggests that the CMOT inferred expression is capable to distinguish cell types, providing the biological interpretability of the CMOT inference. Looking at individual cells (Fig. 2D), CMOT infers individual cell expression with high Pearson correlation and significance (p < 0.05). Furthermore, we also found the enriched functions and pathways relating to brain development from the top 100 highly predictive genes (Fig. 2E). For results of benchmarking on additional state-of-art methods, see Additional File 1: Fig. S1, Tables S1-S4, and Supplemental Methods.
Furthermore, we benchmarked CMOT with the state-of-art methods on a SNARE-seq dataset [22] consisting of jointly profiled gene expression and chromatin peaks in adult mouse brain (see Additional File 1: Fig. S2, Fig. S3, Tables S5-S8, S9-S10, and Supplemental Methods).

Inferring protein expression from gene expression in peripheral blood mononuclear cells
We applied CMOT to infer protein expression from gene expression of peripheral blood mononuclear cells (PBMCs) using emerging CITE-seq data [5]. We trained CMOT on 6885 cells from PBMC10k, with parameters: K = 5, d = 15, λ = 1e02, η = 1, k = 100, and used the top 200 highly variable genes in the training data to find the k nearest neighbors. We induced cell labels by identifying two clusters using gene expression for the label regularization in optimal transport. We evaluated CMOT, MOFA + , Seurat, and TotalVI's using 3994 cells from a different dataset, PBMC5k. Here we show an independent evaluation of CMOT and other methods on PBMC5k while using PBMC10k as the training data (Additional File 1: Tables S11-S13). Additionally, we also show benchmarking on PBMC10k, by splitting it into 80% training and 20% testing data (see Additional File 1: Fig. S7, Supplementary Methods, and Tables S14-S16).

Inference of gene expression using chromatin accessibility for drug-treated lung cancer cells
Next, we applied CMOT to 100nM dexamethasone (DEX)-treated A549 single-cells from lung adenocarcinoma. The DEX-treated 2641 cells were profiled after 0, 1, and 3 h of treatment for gene expression and open chromatin regions (OCRs) using sci-CAR experiments [2]. We focus on the CMOT's performance for gene expression inference from peak signals of OCRs. We stratified-split the dataset into 80% training and 20% test  Table S13). The intensity represents the protein expression level. r is the Pearson correlation coefficient. p is the correlation p-value cells using the treatment hours. We used the treatment hours as the classes for label regularization in optimal transport for training cells. We trained CMOT with the parameters K = 5, d = 10, λ = 1e02, η = 5e − 3, and k = 500 and used the top 20 highly variable OCRs in scATAC-seq to find the k nearest neighbors. Again, we found that CMOT shows a consistent performance across different cell-to-cell correspondence information (p) with high correlation. CMOT (p = 100%) infers gene expression with a mean Pearson correlation of 0.52, similar to MOFA + and outperforming Seurat (median correlation = 0.5, Wilcoxon p-value < 1.27e − 05) (Fig. 4A).
Moreover, CMOT shows a high gene-wise Pearson correlation outperforming Seurat for 636 versus 547 genes (Fig. 4B, Additional File 1: Table S21). Although MOFA + reports a higher gene-wise Pearson correlation for some genes than CMOT (Fig. 4B, Additional File 1: Table S21), we still see that CMOT's inferred expression shows the transitory trend of key druggable marker genes across drug-treatment hours. Figure 4C shows three key genes, identified as makers of early (ZSWIM6) [26] and late (PER1, BIRC3) [23][24][25] events of treatment. Also, we performed enrichment of the 435 high correlation genes identified by CMOT (versus MOFA + in Fig. 4B) (see Additional File 3 for list of genes). As shown in Fig. 4D, we saw a higher enrichment of terms associated with DEX-treated A549 cells like TGF-beta signaling, along with effects on DEX treatment in general, like Mental disorders as compared to enrichment given by MOFA + (Additional File 1: Fig. S6, Additional File 2). Lastly, we also found that the  Table S21). C CMOT inferred normalized gene expression trend (y-axis) across treatment hours (x-axis). Key genes: PER1 and BIRC3 [23][24][25] are markers for glucocorticoid receptor (GR) activation seen later in treatment (3 h). ZSWIM6 [26] is a key gene of early events of DEX treatment (0 h, 1 h) (see Additional File 2 for top 100 highly predictive genes). D Enriched terms associated with CMOT inferred gene expression using 435 genes with a higher gene-wise Pearson correlation compared to MOFA+ 's 748 genes (B, Additional File 1: Fig. S6B). E The measured (x-axis) versus inferred normalized expression (y-axis) of genes (dots) for three select cells. r is the Pearson correlation coefficient. p is the correlation p-value cell-wise correlations between inferred and measured gene expression are also significantly highly correlated in each treatment hour (Fig. 4E). For results of benchmarking on additional state-of-art methods, see Additional File 1: Fig. S4, Supplemental Tables S17-S20, and Supplemental Methods.

Cross-modality inference between gene expression and chromatin accessibility to distinguish cancer types
Finally, we tested CMOT to see how well it can infer between two modalities, especially for relevantly small datasets. We used a pan-cancer scCAT-seq dataset which jointly profiled 206 single-cell gene expression and chromatin accessibility on OCRs for three cancer cell lines: HCT116, HeLa-S3, and K562 [3]. We stratified split data into 80% training and 20% testing sets using cancer-type information. We induced our cell labels for training cells for label regularization in optimal transport. For gene expression inference from OCR peaks, we identified two clusters in chromatin peaks and vice versa. We trained CMOT with the following parameters for gene expression inference from chromatin peaks: K = 5, d = 10, λ = 5e03, η = 1, k = 40, and used the top 150 highly variable OCRs to find the k nearest neighbors. For inferring gene expression from binarized OCR peaks, we evaluated the inferred expression using the same metrics (cell-wise and genewise Pearson correlation) as above.
CMOT significantly outperforms both MOFA + and Seurat, with a cell-wise mean correlation of 0.67 compared to 0.47 (Wilcoxon p-value < 6.81e − 17) and 0.63 (Wilcoxon p-value < 1.32e − 05), respectively (Fig. 5A, Additional File 1: Tables S22-S25). Moreover, CMOT (p = 100%) yields an improved gene-wise correlation for 6235 genes versus 3764 against Seurat (Wilcoxon p-value < 1.59e − 31), and 8259 versus 1740 against  Tables S22-S25). B Silhouette score (x-axis) across measured and inferred gene expressions (x-axis), and measured chromatin peaks (Additional File 1: Table S26). C PCA of inferred gene expression. D Gene-wise correlation between the inferred and measured expression, comparing CMOT (y-axis) with MOFA + and Seurat (x-axis). Dots: Genes; Numbers: Gene numbers above and below the dotted line. E Peak-wise AUROC, comparing CMOT (y-axis) with MOFA + and Seurat (x-axis). Dots: Peaks; Numbers: Peak numbers above and below the dotted line. P-values are calculated by a one-sided Wilcoxon rank-sum test MOFA + (Wilcoxon p-value = 0) (Fig. 5D). Moreover, CMOT's inference is particularly useful to identify the cancer type specific cell clusters. For instance, we calculated the silhouette score (see the "Methods" section) to see if the cells from the same cancer lines exhibit similar gene expression patterns. CMOT reports a high median silhouette score of 0.74 compared to the measured gene expression (0.25), measured chromatin peaks (0.27), and inferred expressions from Seurat (0.61) and MOFA + (− 0.07) (Fig. 5B, Additional File 1: Table S26). As shown in Fig. 5C, the cancer cells from three cancer cell lines can be separated using CMOT-inferred gene expression, suggesting the capability of CMOT inference to reveal cancer-type-specific expression. Then, we evaluated the CMOT's OCR peaks inference from gene expression. We trained CMOT with the parameters: K = 5, d = 10, λ = 1e03, η = 1, k = 10, and used the top 50 highly variable genes to find the k.nearest neighbors. We also stratified split the data into 80% training and 20% testing sets using cancer-type information. We normalized CMOT's inferred peaks and then binarized them by a cutoff of 0.5, and then calculated the peak-wise area under the receiver operating curve (AUORC) of the inferred binarized peaks relative to the binarized measured profile. We also found that CMOT significantly outperforms both MOFA + and Seurat with Wilcoxon p-values < 9.42e − 77 and 9.48e − 10, respectively, for OCR peak inference from gene expression (Fig. 5E, Additional File 1: Fig. S8, Additional File 1: Table S27-S28).

Discussion
With the advent of new single-cell technologies, data is generated with even greater precision. However, most of these technologies continue to profile a single modality for each cell, creating a need for robust cross-modality inference frameworks to jointly study the underlying that can infer additional or missing modalities for such cells.
In this study, we introduce CMOT as a computational approach that integrates manifold alignment, regularized optimal transport, and k-Nearest Neighbors (kNN) for cross-modality inference. By applying emerging single-cell multimodal data, we demonstrated that CMOT was able to predict multimodal features of single cells such as gene expression, chromatin accessibility, and protein expression. Note that CMOT does not require paired samples for aligning multimodalities as shown by its out performances over state-of-arts in some applications. This attribute is particularly useful since joint multimodal profiling is typically challenging and sometimes costly and single modality data is thus still widely favored. To demonstrate this, we evaluated CMOT and other state-of-art methods on single-profile scRNA-seq and scATAC-seq dataset [1] and found that CMOT outperforms all methods (Additional File 1: Fig. S14). Moreover, CMOT is more computationally efficient and faster than state-of-art methods (Additional File 1: Table S29). In the paper, CMOT primarily used nonlinear manifold alignment (NMA) to align multimodalities for achieving the best inference. However, CMOT is flexible and the user can substitute NMA with their preferred alignment method, e.g., SCOT [15,16], MMD-MA [27], and WNN [8].
Furthermore, the optimal transport step in CMOT leverages the information within shared modalities between source and target cells to compute a mapping matrix through Wasserstein distances between them. These distances quantify and minimize the geometric discrepancy of the distributions and map the two distributions for improving cross-modal inference in CMOT.
Additionally, we see that the overall correlation scores of all methods consistently vary across datasets. We attribute this variation in correlation across datasets to factors like the number of available cells, total features used, and the sparsity of the datasets used in training. We noticed a higher correlation for datasets either with more training cells or a high number of features. However, data sparsity continues to be a challenge with singlecell profiling technologies and therefore affects inference performance for such datasets.
Nonetheless, we also examined CMOT's potential limitations. First, nonlinear manifold alignment comes with a high computational cost as the data size increases. It needs to compute Laplacians and similarity matrices of multimodal inputs which scale quadratically with the datasets and slow CMOT's computations over large datasets. However, this can potentially be sped up by using other alignment methods like SCOT [15,16] or Unioncom [28]. Second, Optimal transport assumes a mass-balancing approach between the source and target distributions, where every s (e.g., a cell) in the source has to map to a point in the target. This is a relatively strong assumption requiring a balanced data distribution between the source and target to fit a conservative transport plan. Given many imbalanced datasets in the real world, this limitation can be improved by recent optimal transport techniques such as SCOTv2 [16] through emerging unbalanced optimal transport approaches [29,30].
Also, CMOT can adapt other optimal transport variants to even transport different modalities between the source and target cells. For instance, Gromov Wasserstein distance can map distributions from different modalities [31,32]. Moreover, CMOT has the potential to work with additional single-cell modalities like the morphology and electrophysiology of single neuronal cells from Patch-seq [33]. In addition to inferring modalities, CMOT has the potential to be extended to infer the sample labels such as phenotypes across modalities, e.g., via label transferring [8]. For example, it can predict cell types or disease states of single cells for the modalities without such information.

Conclusion
In this study, we introduced CMOT as a computational framework that can successfully infer additional or missing modalities for cells with single modalities. We applied CMOT to single-cell datasets of different scales and profiles (e.g., gene expression, peaks, proteins), which can be easily extended to other modalities and applications. CMOT uses the underlying data distributions of multimodalities and uses optimal transport to find efficient mappings between available multimodalities and the target single modality, without requiring prior assumptions about their distributions. Moreover, CMOT's design is computationally efficient.

Cross-Modality Optimal Transport (CMOT) workflow
CMOT workflow for cross-modality inference can be divided into 3 steps including an optional first step: Step A (optional): Alignment to project the cells with available multimodal data (source cells) onto common low-dimensional latent space.
Step B: Optimal transport to map cells with the single modality (target cells) to the aligned source cells from the same modality.
Step C: k-Nearest-Neighbors to infer the missing or unprofiled modality of target cells using another modality of nearest mapped source cells.
We describe each step in detail below by introducing the necessary notations:

Step A: Alignment of source cells with multimodal data
We set this as an optional step if the cells across available multimodalities do not have a complete correspondence. That is, if cells across modalities have none to partial correspondence between them, then CMOT first aligns them. Although users are free to use their choice of alignment in such scenarios, we use Nonlinear Manifold Alignment (NMA) [19].
Alignment is an important step that accounts for when the source cells have partial correspondence. NMA is based on a manifold hypothesis that high dimensional multimodal datasets have similar underlying low dimensional manifolds, and therefore, they can be projected onto a common manifold space that preserves the local geometry of each modality and minimizes the differences between the manifolds of modalities. We define X = {x i } i=1,..,s X and Y = {y j } j=1,..,s Y as two multimodal measurements of s X , s Y source cells in Modalities X , Y respectively, where x i ∈ R d X and y j ∈ R d Y represent the measurements of d X features in i th cell of Modality X , and d Y features in j th cell of Modality Y , respectively. We also define W X ∈ R s X ×s X and W Y ∈ R s Y ×s Y as cell similarity matrices for X and Y , respectively, where each similarity matrix is constructed by connecting a cell with its K nearest neighboring cells within the modality. The partial prior known cell-to-cell correspondence information can be quantified by p (0 < p < 100%) to quantify the partial prior known cell-to-cell correspondence information (for example, p% of paired cells across modalities) and encode this information as a cross-modal similarity matrix W ∈ R s X ×s Y . NMA then learns two mapping functions X and Y that project x i and y j to � X (x i ) ∈ R d and Y y j ∈ R d , respectively onto a common manifold space with dimension d ≪ min(d X , d Y ). The d-dimensional manifold preserves the local geometry of each modality and minimizes the distances between corresponding samples after projection. Solving manifold alignment can be reformulated as manifold co-regularization in reproducing kernel Hilbert spaces. The manifold alignment optimization finds optimal mapping functions * X , * Y by solving the following: , where the first two terms preserve the local geometry within each modality, the similarity matrices W X and W Y model the relationships of the cells in each modality that can be identified by K-nearest neighbor graph, and the third term preserves the correspondence information across X and Y modeled by W . The parameter µ controls the trade-off between conserving the local geometry of each modality and cell-to-cell correspondences across modalities. Here, we set µ to 0.5. This allows equal importance to both preserving the local geometry of each modality as well as cell-to-cell correspondences, thereby eliminating the need for assuming underlying assumptions about data distributions of modalities. We also need to add an additional non-zero constraint to avoid mapping of all cells onto a latent space with dimension zero: ) as diagonal elements, and I is the identity matrix [18,34]. Also, two modalities are not required to have a complete correspondence between the cells. Therefore, W is a binary correspondence matrix between cells of X and Y such that if p=100%, i.e., 100% correspondence across cells in X and Y , W would be an identity matrix. For p<100%, W i,j = 1 if x i and j th cells from Modalities X and Y respectively are the corresponding cells and 0 otherwise. After alignment, the resulting d-dimensional modalities share a common latent space that can easily be compared using Euclidean distances. For instance, for every cell y j ∈ Y , we find an aligned cell x j,a ∈ X by finding the closest cell in X using the Euclidean distance. To implement our alignment step, we used the non-linear manifold module from our published Python package ManiNet-Cluster [34].
Unless otherwise stated, we use the term CMOT for our model trained with full correspondence ( p=100%).

Step B: Optimal Transport to map source and target cells by shared modality
The optimal transport theory [35,36] tries to find the most efficient mapping π * that transports one probability distribution to another with minimum transportation cost. A mapping π ∈ Y , Y represents transport plan to map cells from source ( Y ) and target ( Y ) modalites, where Y , Y contains all probabilistic mappings between the source ( Y ) and target ( Y ) cells. We define Y = { y j } j=1,..,s Y as the target single modality measurement with s Y cells, where y j ∈ R d Y represents the measurement of d Y features in j th cell of Modality Y . The classical OT distance (Wasserstein distance) gives the mappings between two probability distributions as the transportation cost. Let C be the cost matrix where C ∈ R +s Y ×s Y and C i,j represents the pairwise cost of mapping the source cell y i ∈ Y to the target cell y j ∈ Y . For discrete probability distributions like Y and Y over the same metric spaces (i.e., matched features of the shared modality), we define the OT problem as: where the first term computes the Frobenius dot product ., . F between the cost matrix C and π . The set is defined as The second term, also called entropic regularization, calculates the entropy of transportation for π where � s (π ) = − i,j π i, j logπ i, j . Entropic regularization addresses the computational complexity of OT as the sample size increases [37]. The intuition behind this term is to relax the sparsity constraints of the OT problem by increasing its entropy so that π * is denser, as source cells ( Y ) are distributed more towards target cells ( Y ). The resulting formulation is strictly convex and can be solved through Sinkhorn's Algorithm [22]. The parameter weights the entropic regularization. As the parameter increases, the sparsity of π * decreases, giving a smoother transport.
The third term is the label regularizer [37], � c = j c �π I c , j � q p , where I c contains the index of rows in π related to the source cells ( Y ) that belong to class c if we have such prior knowledge, e.g., known cell types. Hence, π I c , j is a vector containing the coefficients of the j th target cell in Y . The norm . q p denotes the l p norm to the power of q (here we set p=1 and q=0.5). The parameter η weights the label regularization. The intuition behind this term is to penalize the mappings that match together samples from different labels. This means that even if we do not have the label information for the target cells ( Y ), we can promote group sparsity within the columns of π such that each target cell is only associated with a class. However, in the absence of such label information, we can compute our own labels through unsupervised clustering techniques like hierarchical clustering to induce cell clusters as labels for source cells in Y . Finally, to map the source cells ( Y ) to the target space ( Y ), we use barycentric mapping using Y (t) = π * Y [37]. Now, we can easily compare Y (t) and Y using euclidean distance. To solve the regularized OT optimization step, we used Domain Adaptation functions (ot.da) provided in the Python package Python optimal transport (POT) [38].

Identify outlier cells in target modality
Some cells from the target modality ( Y ) may have a different distribution (e.g., belonging to a cell type absent in source modality). For such cells, the inference is difficult and may even lead to false predictions. To avoid this, we add an additional mechanism to identify and remove such cells. We use the optimal coupling π * and the cost matrix C calculated in step B, and compute an element-wise dot product P = π *T • C T , where P ∈ R s Y ×s Y . Then, we find any outliers within P using the Isolation Forest (IF) algorithm [39]. The Isolation Forest (IF) algorithm isolates samples by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of the selected features. Since this algorithm is prone to the curse of dimensionality, we first apply principal component analysis (PCA) to P and use components that explain at least 95% variance. We then apply the IF to the lower dimensional P . We tested this mechanism by randomly replacing cells with noise in the DEX-treated A549 dataset (see Additional File 1). Additionally, CMOT provides a warning to users informing them about the percentage of poorly mapped cells that can be removed to avoid incorrect inferences.
Step C:

k-Nearest Neighbors to infer the additional modality of target cells
Finally, we apply k-Nearest Neighbors (kNNs) to infer the missing modality X of target cells in Y . For each target cell y j ∈ Y , we find its kNN in Y (t) using Euclidean distance. Let S j = {c l j : l = 1, 2, ..k} be the set of k nearest neighboring cells of y j in Y (t) , where c l j is a cell in Y (t) . For cells in S j , we use their values from the aligned modality X to define another setQ j = {q l j : l = 1, 2, ...k} , where q l j represents the profile of the cell c l j within the aligned modality X . Finally we calculate the weighted average of the profiles of all cells in Q j to get x j . This is calculated as: where w l j is the weightage given to q l j such that w l j = e − � y j −y l S j � 2 . Thus, we get the corresponding modality X for Y . We used sklearn's [40] nearest neighbor function for kNN implementation.

Partial correspondence in multi-omics data
Joint profiling of single cells is challenging and therefore, it may not always be feasible to get completely corresponding cells across profiled modalities. In such scenarios, there could be partial to no correspondence across cells of multimodalities. For example, a 50% cell-to-cell correspondence between modalities means that only 50% of the cells have been jointly profiled between the modalities. As a result, training on partially corresponding multimodalities for cross-modality inference can lead to misleading or wrong inferences. Therefore, to address this problem, CMOT first aligns such partially corresponding datasets and then performs inference. In this paper, we have used datasets that have a 100% correspondence originally, so that we can validate the inference performance. However, we simulate different levels of cell-to-cell correspondence by setting the p value in non-linear manifold alignment (Methods Step A). In particular, we randomly chose p percent cells for whom correspondence information is assumed available, while the remaining cells are treated as non-corresponding. We report CMOT's performance when trained on p=25%, 50%, 75%, 100% cell-to-cell correspondence levels, and show that CMOT's cross-modality inference performance can beat state-of-theart methods that require 100% cell-to-cell correspondence for training.

Human brain
The human brain dataset was generated by 10xGenomics, containing gene expression and open chromatin regions multiome data from the same cells (8981 cells) profiled from post-conceptual week 21 (PCW21) [1]. We filtered out peaks and genes that occurred in less than 3 cells. For scATAC, we normalized the peaks using term frequency-inverse document frequency (TF-IDF) transformation using RunTFIDF [41] to identify the top 1000 most variable peaks. For scRNA, we performed normalization and variance stabilization using SCTransform [42] and picked the top 1000 most variable genes. The resulting data includes gene expression and chromatin regions of 8981 for 1000 genes and regions respectively.

Peripheral Blood Mononuclear cells
The Peripheral Blood Mononuclear cells (PBMC) dataset [5] was generated by CITEseq, containing genes and proteins from the same cells. This data contains cells from two experiments performed on PBMCs: 6855 cells from PBMC10k and 3994 cells from PBMC5k. We preprocessed multiome data from two experiments independently. For scRNA-seq, we performed normalization and variance stabilization using SCTransform [42] and picked 2960 highly variable genes. We identified the variable genes in PBMC10k first and used them as a reference to subset genes in PBMC5k scRNA data. For protein expression, we performed centered log-ratio (CLR) normalization using Seurat's functions in both PBMC10k and PBMC5k. The resulting PBMC10k data dimension was 6855 by 2960 for gene expression and 6855 by 14 for protein expression. The resulting PBMC10k data includes gene and protein expression data of 3994 cells for 2960 genes and 14 proteins.

Dexamethasone-treated A549 cells
The dexamethasone (DEX)-treated A549 dataset [2] was generated using a sci-CAR experiment for single cells from the A549 lung adenocarcinoma cell line. The data contains jointly profiled 2641 cells after 0, 1, and 3 h of 100 nM DEX treatment for gene expression and open chromatin regions. We used a preprocessed dataset previously used by Jin et al. [43], and filtered out lowly expressed cells by gene expression. We reduced the dataset to 2391 cells. For scRNA, we used all 1183 genes. For scATAC, we picked the top highly variable 1183 peaks. The resulting data includes gene expression and chromatin regions of 2391 cells for 1183 genes and regions respectively.

Pan-cancer cell lines
This dataset contains three cancer cell lines [3]: HCT116, HeLa-S3, and K562, generated by joint profiling of 206 single cells using scCAT-seq containing gene expression and open chromatin regions. We used a preprocessed dataset previously used by Huizing et al. [18], however, we reduced the number of genes and peaks to 10,000 by selecting the most variable features. We binarized the scATAC profile where we set all values greater than 0 to 1 and 0 otherwise. The resulting data includes gene expression and chromatin regions of 206 cells for 10,000 genes and regions respectively.

Runtime evaluations
We compare CMOT's running time with state-of-art methods MOFA + , Seurat, TotalVI, and Polarbear for the best-performing parameters used for cross-modality inference (Table S29). We benchmarked all methods on Intel Xeon Gold 6242R CPU @3.10 GHz × 40 with 251.4GiB RAM and NVIDIA RTX A6000 GPU, Hierarchical Clustering. We induced cell labels for datasets with no prior knowledge (e.g., cell types). We use these labels for label regularization in OT optimization (see Methods and Materials Step B) to improve the mappings between cells in the source ( X ) and target ( Y ) modalities. To induce cell labels, we performed hierarchical clustering of training and validation sets combined using the scikit-learn clustering functions [23].
We trained all methods: Seurat [7,8], MOFA + [9], TotalVI [5], and Polarbear [11] using default parameters for all datasets except DEX-treated A549 [2]. For modality inference in Seurat [7,8], we integrated the training modalities first, and then we inferred the missing modality using FindTransferAnchor and TransferData functions [7] between the integrated training modalities and source test modality. For MOFA + [9], we input the missing modality as NA values and trained the model on the multimodalities. We trained TotalVI [5] autoencoder with default parameters, with latent distribution set to "normal, " on the training set. Finally, we trained Polarbear and Polarbear co-assay models [11] using default parameters on the training set.
To identify the highest performing parameters for Steps B and C of CMOT, we performed 5-fold cross-validation on the training set. We reported the best-performing parameters for each dataset in the Results.
For the DEX-treated A549 dataset, we tuned parameters for all methods (Additional File 1: Fig. S9) and benchmark inference performance of CMOT against state-of-arts (Fig. 4A, Additional File 1: Fig. S4).

Parameter selection
In Step A, we found the optimal alignment by testing different values of d common manifold dimensions and K nearest neighbors for building the similarity within each modality (Additional File 1: Fig. S10). In Step B and Step C, we performed cross-validation to select the regularization coefficients and η for optimal transport, and top highly variable features and k-nearest neighbors for modality inference. In particular, we chose the optimal number of features and k-nearest neighbors based on CMOT's performance saturation (Additional File 1: Fig. S11). Also, for all datasets, applied in this paper, we held out a 20% testing set to report CMOT's performance. For datasets with no prior knowledge (e.g., cell types), we induced cell labels by cell clusters through hierarchical clustering of the training set, when training the final model (see Additional File 1: Supplementary Methods). We split the training data into training and validation sets to select parameters through 5-fold cross-validation (see Additional File 1: Supplementary Methods).

Inference versus measurement
To evaluate CMOT's inferred gene and protein expressions, we calculated Pearson's correlation coefficient between the inferred and measured expression values of each cell (cell-wise). Also, we computed the gene-wise correlation between inferred and measured expression values across cells for each gene [11]. For peak inference in open chromatin regions, we used AUROC to evaluate the quality of CMOT's binarized inferred peaks [11]. We computed peak-wise AUROC between individual inferred peak profiles versus measured profiles. This evaluation also applied to the state-of-art methods that we compared. We reported the number of genes with improved correlation/AUROC w.r.t. these methods along with a one-sided Wilcox rank-sum test p-value for each [11].

Classifying known cell type using inferred expression
For the human brain data with known brain cell type information, we evaluated the CMOT inferred expression of cell-type marker genes for classifying the cell type and calculated the AUPRC of the classification [11]. To this end, given a cell type, we labeled all cells that belong to the cell type as positive and the rest as negative. Specifically, we evaluated the Top 8 marker genes from each cell type, due to disproportionate cell-type distribution within the dataset, using a total of 80 cells. We then defined a baseline = 0.1 for the AUPRC as the ratio of the number of positives versus total cells.

Clustering cancer types using inferred gene expression by silhouette score
For the pan-cancer dataset, we evaluated CMOT to separate the cancer types. In particular, we assessed if CMOT's inferred gene expression data can cluster the cells and cell clusters corresponding to different cancer types [3], using the silhouette score. The silhouette score S(m) of a cell m belonging to the cluster C M is calculated as: is the inter-cluster distance defined as the average distance to the closest cluster of cell m except that which it's a part of (i.e., n ∈ C N ) and e(m) = 1 |C M |−1 d(m, n) is the intra-cluster distance defined as the average distance to all other cells in the cluster to which it's a part of (i.e. n ∈ C M , m � = n ). We calculated the silhouette scores by the Python package Scikit-learn [40].

Gene set enrichment analysis
We used Metascape [44] to perform gene set enrichment analysis for the highly predictive genes by CMOT.