Multimodal synthesis as metric learning
Before the advent of multimodal single-cell experiments, computational analysis has focused on variation within a single modality. In contrast, analysis of simultaneous multimodal single-cell experiments (where two or more modalities are available per cell) critically requires reasoning about information across modalities in a mutually consistent way. Our key intuition is that each modality gives us information about the biological similarity among cells in the dataset, which we can mathematically interpret as a modality-specific distance metric. For example, in RNA-seq data, cells are considered biologically similar if their gene expression profiles are shared; this may be proxied as the Euclidean distance between normalized expression vectors, with shorter distances corresponding to greater similarity.
To synthesize these distance metrics, we draw inspiration from metric learning (Additional file 1: Text S3). Given a reference modality, Schema transforms this modality such that its Euclidean distances agree with a set of supplementary distance metrics from the other modalities, while also limiting the distortion of the original reference modality. Analyses on the transformed data will thus incorporate information from all modalities (Fig. 1). For instance, with RNA-seq data as the reference modality, Schema can transform the data so that it incorporates information from other modalities but limits the distortion from the original data so that the output remains amenable to standard RNA-seq analyses (e.g., cell-type inference, trajectory analysis, and visualization).
In our approach, the researcher starts by designating one of the modalities as the primary (i.e., reference) modality, consisting of observations that are mapped to points in a multi-dimensional space. In the analyses presented here, we typically designate the most informative or high-confidence modality as the primary or the reference modality, with RNA-seq being a frequent choice (Discussion). The coordinates of points in the primary modality are then transformed using information from secondary modalities. Importantly, the transformation’s complexity is constrained by limiting the distortion of the primary modality below a researcher-specified threshold. This acts as a regularization, preventing Schema from overfitting to other modalities and ensuring that the high-confidence information contained in the primary modality is preserved. We found this constraint to be crucial to successful multimodal syntheses. Without it, an unconstrained alignment of modalities using, for instance, canonical correlation analysis (CCA), a common approach in statistics for inferring information from cross-covariance matrices, or autoencoders, a deep learning approach for mapping multiple datasets to a shared latent space [27,28,29,30], is prone to overfitting to sample-specific noise, as we show in our results.
To see how Schema’s transformation synthesizes modalities, consider the case where the primary dataset is gene expression data. While the points close in Euclidean space are likely to be biologically similar cells with shared expression profiles, longer Euclidean distances are less informative. Schema’s constrained optimization framework is designed to preserve the information contained in short-range distances, while allowing secondary modalities to enhance the informativity of longer distances by incorporating, for example, cell-type metadata, differences in spatial density, or developmental relationships. To facilitate the representation of complex relationships between modalities, arbitrary distance metrics and kernels are supported for secondary modalities.
Schema’s measure of inter-modality alignment is based on the Pearson correlation of distances, which is optimized via a quadratic programming algorithm, for which further details are provided in “Methods.” An important advantage of Schema’s algorithm is that it returns coefficients that weight features in the primary dataset based on their agreement with the secondary modalities (for example, weighting genes in a primary RNA-seq dataset that best agree with secondary developmental age information). These feature weights enable greater interpretability into data transformations that is not immediately achievable by more complex, nonlinear transformation approaches [27,28,29,30,31,32,33]. We demonstrate this interpretability throughout our applications of Schema.
Inferring cell types by synthesizing gene expression and chromatin accessibility
We first sought to demonstrate the value of Schema by applying it to the increasingly common and broadly interesting setting in which researchers simultaneously profile the transcriptome and chromatin accessibility of single cells [6]. Focusing on cell type inference, a key analytic step in many single-cell studies, we applied Schema on a dataset of 11,296 mouse kidney cells with simultaneously assayed RNA-seq and ATAC-seq modalities and found that synthesizing the two modalities produces more accurate results than using either modality in isolation (Fig. 2f; Additional file 1: Figure S3).
With RNA-seq as the primary (i.e., reference) dataset and ATAC-seq as the secondary, we applied Schema to compute a transformed dataset in which pairwise RNA-seq distances among cells are better aligned with distances in the ATAC-seq peak counts data while retaining a very high correlation with primary RNA-seq distances (≥ 99%, “Methods”). We then clustered the cells by performing Leiden community detection [34] on the transformed dataset and compared these clustering assignments to the Leiden clusters obtained without Schema transformation. We measured the agreement of these fully automated clusterings with expertly defined ground truth cluster labels (from Cao et al. [6]), quantifying this agreement with the adjusted Rand index (ARI), which has a higher value if there is greater agreement between two sets of labels. Leiden clustering on Schema-transformed data better agrees with the ground truth annotations of cell types (ARI of 0.46) than the corresponding Leiden cluster labels using just RNA-seq or ATAC-seq datasets individually (ARIs of 0.40 and 0.04, respectively, Fig. 2f). Here, Schema facilitated a biologically informative synthesis despite limitations of data quality or sparsity in the ATAC-seq secondary modality. We observed that using only ATAC-seq data to identify cell types leads to poor concordance with ground truth labels (Additional file 1: Figure S3A), likely because of the sparsity of this modality (for example, only 0.28% of the peaks were reported to have non-zero counts, on average); this sparsity was also noted by the original study authors.
To further analyze why combining modalities improves cell type clustering, we obtained Leiden cluster labels using either the RNA-seq or the ATAC-seq modalities individually. We then evaluated these cluster assignments by iterating over subsets of the data, each set covering only a pair of ground truth cell types and used the ARI score to quantify how well the cluster labels distinguished between the two cell types. While RNA-seq clusters have higher ARI scores overall, indicating a greater ability to differentiate cell types, ATAC-seq does display a relative strength in distinguishing proximal tubular (PT) cells from other cell types (Fig. 2a). PT cells are crucial to kidney function, with the specific PT cell sub-types playing distinct roles in, for instance, glucose reabsorption [35]. They are also the most numerous cells in this dataset and many of the misclassifications in the RNA-seq based clustering relate to these cells (Fig. 2b–d). When the two modalities are synthesized with Schema, a significant number of these PT cells are correctly assigned to their ground truth cell types (one-sided binomial test, p = 6.7 × 10− 15), leading to an overall improvement in clustering quality (Fig. 2e). Furthermore, upon analyzing Schema’s feature selection output, we found that the genes it up-weighted in the primary RNA-seq modality were differentially expressed in PT cells (one-sided t-test, FDR q < 0.01 for each of the three PT cell types), thus emphasizing the RNA-seq subspace where support from the secondary modality signal was strongest. These genes (the top hits are Pnisr, Ankrd11, and Kmt2c) are enriched for regulation of macromolecule metabolic process (GO:0060255, FDR q = 0.0103) and regulation of nitrogen compound metabolic process (GO:0051171, FDR q = 0.0133).
Schema’s constrained data synthesis outperforms unconstrained approaches
In general, synthesis of multimodal data can also be done by statistical techniques like canonical correlation analysis (CCA) or deep learning architectures that represent multiple modalities in a shared latent space [27,28,29,30,31,32,33]. A key conceptual advance of Schema over these approaches is its emphasis on limiting the distortion of the high-confidence reference modality, allowing it to extract signal from the lower-confidence secondary modalities without overfitting to their noise and artifacts. Intuitively, the synthesis of two modalities requires the identification of a subspace (or latent space) in each modality that aligns well with the other. Due to noise and artifacts, an unconstrained approach may overfit by identifying a pair of subspaces that align well but are biologically uninformative. In contrast, Schema’s constrained optimization formulation, combined with the use of a high-confidence modality as the primary, ensures that any possible alignment will use only a biologically informative subspace of the primary modality and thus guides the quadratic programming optimizer towards correspondingly informative subspaces in the other modalities. To demonstrate the importance of this constrained approach, we evaluated the performance of CCA and totalVI [30] in integrating the RNA-seq and ATAC-seq modalities (Fig. 2f). We applied CCA to synthesize the two modalities and performed Leiden clustering on the resulting dataset, finding its overlap with the ground truth labels (ARI of 0.31) to be lower than that from Schema’s synthesis (0.46). Indeed, this is a lower ARI than is achievable just with RNA-seq data (0.40), indicating that the CCA-based synthesis may be overfitting to the sparse and noisy ATAC-seq data.
To evaluate an autoencoder-based synthesis of these modalities, we applied scVI [27] and totalVI to compute per-modality and dual-modality latent space representations, respectively (Methods). We performed Leiden clustering in the autoencoder latent spaces and evaluated the clustering’s overlap with ground truth labels. We first verified that the single-modality latent space representations did lead to Leiden clusters of comparable quality as had previously been observed from Leiden clustering on the raw data (ARIs of 0.365 and 0.038 for scVI-generated representations of RNA-seq and ATAC-seq data, respectively). However, the dual-modality shared-space representation from totalVI produced a Leiden clustering (Additional file 1: Figure S3B) that had a low overlap with the ground truth (ARI of 0.0043). We hypothesize that the sparsity and low signal-to-noise ratio here in the ATAC-seq modality led totalVI to a latent space representation that corresponds to low biological-information subspaces of the two modalities, rather than their respective high information subspaces. We note that we were able to achieve better performance with totalVI when applying the same procedure to a synthetic, less-noisy secondary modality consisting of partially randomized RNA-seq observations (Methods).
While these CCA and autoencoder results were likely due to overfitting, the Schema-based synthesis constrains the ATAC-seq modality’s influence, enabling us to extract additional signal provided by ATAC-seq while preserving the rich information provided by the transcriptomic modality. We believe that this regularization offered by Schema’s constrained optimization formulation is a key advantage that will be crucial in multimodal single-cell data synthesis. We also note that Schema offers additional advantages: unlike CCA, it can incorporate more than two modalities simultaneously and, unlike totalVI, its synthesis is interpretable, revealing a more accurate characterization of PT cells.
Schema highlights secondary patterns while preserving primary structure
Another powerful use of Schema is to infuse information from other modalities into RNA-seq data while limiting the data’s distortion so that it remains amenable to a range of standard RNA-seq analyses. Since widely used visualization methods such as UMAP [36] do not allow a researcher to specify aspects of the underlying data that they wish to highlight in the visualization, we sought to apply Schema to improve the informativity of single-cell visualizations. We leveraged Schema to highlight the age-related structure in an RNA-seq dataset of Drosophila melanogaster neurons [3] profiled across a full lifespan, while still preserving most of the original transcriptomic structure. We chose RNA-seq as the primary modality and temporal metadata (cell age) as the secondary modality, configuring Schema to maximize the correlation between distances in the two while constraining the distortions induced by the transformation (Methods). We then visualized the transformed result in two dimensions with UMAP.
While some age-related structure does exist in the original data, Schema-based transformation of the data more clearly displays a cellular trajectory consistent with biological age (Fig. 3). Importantly, revealing this age-related structure required only a limited distortion of the data, corresponding to relatively high values (≥ 0.99) of the minimum correlation constraint (Fig. 3c).
Analysis of Schema’s feature selection indicated an up-weighting of genes differentially expressed at the start or end of the aging process (Fig. 3e), with genes implicated in cell organization/biogenesis [37] (e.g., Rm62, CG5010 and IscU) active at the start while ribosomal genes like Rpl22 and Rpl23A were active at the end. We also confirmed that there was a significant overlap between Schema’s highest-ranked genes and those found by a standard differential expression test between timepoints (one-sided binomial test, FDR q < 10− 21 for the 1-, 30-, and 50-min subsets). To additionally verify that Schema was infusing additional age-related structure into RNA-seq data, we performed a diffusion pseudotime analysis of the original and transformed datasets and found that the Spearman rank correlation between this pseudotime estimate and the ground truth cell age increased from 0.365 in the original data to 0.405 and 0.436 in the transformations corresponding to minimum correlation constraints of 0.999 and 0.99, respectively.
We note that the constrained optimization of Schema was again important to retaining biological signal during the synthesis: in comparison, an unconstrained synthesis by CCA led to a lower pseudotime correlation (0.059) than seen in the original RNA-seq dataset; the corresponding CCA-based UMAP visualization was also less clear in conveying the cellular trajectory (Additional file 1: Figure S6). Schema thus enables visualizations that synthesize biological metadata, while preserving much of the distance-related correlation structure of the original primary dataset. With Schema, researchers can therefore investigate single-cell datasets that exhibit strong latent structure (e.g., due to metadata like age or spatial location), infusing this secondary information into the primary RNA-seq modality. We recommend specifying a high minimum correlation constraint (e.g., 0.99) during the synthesis (Discussion), having observed that only a small transformation of the RNA-seq data is needed to make the latent structure visible.
Spatial density-informed differential expression among cerebellar granule cells
In addition to cell type inference, another important single-cell analysis task that stands to benefit from multimodal synthesis is the identification of differentially expressed marker genes. To perform differential expression analysis with Schema, RNA-seq data should be used as the primary modality, while the distance metrics of the secondary modalities specify how cells should be differentiated from each other. We applied Schema to spatial transcriptomics data, another increasingly important multimodal scenario, here encompassing gene expression, cell-type labels, and spatial location.
We obtained Slide-seq data containing 62,468 transcriptomes that are spatially located in the mouse cerebellum. In the original study, these transcriptomes were assigned to putative cell types (noting that these transcriptomes are not guaranteed to be single cell), and thus cell types are located throughout the tissue [10, 38]. Interestingly, we observed spatial density variation for certain cell types; specifically, transcriptomes corresponding to granule cell types are observed in regions of both high and low spatial density (Fig. 4b in this paper; also Fig. 2b of Rodriques et al. [10]).
Schema’s feature selection capabilities could thus identify genes that are differentially expressed in granule cells in high density areas versus granule cells in low density areas. Schema is well suited to the constrained optimization setting of this problem: we optimize for genes expressed specifically in granule cells and in dense regions, but not all granule cells are in dense regions and not all cells in dense regions are granule cells. We specified RNA-seq data as the primary modality and spatial location and cell-type labels as the secondary modalities. In the spatial location modality, the distance metric was defined such that two cells are similar if their spatial neighborhoods have similar density (Methods).
The densely packed granule cell genes identified by Schema are strongly enriched for GO terms and REACTOME pathways [39] related to signal transmission, e.g., ion-channel transport (REACTOME FDR q = 1.82 × 10− 3), ion transport (GO:0022853, FDR q = 1.8 × 10− 17), and electron transfer (GO:009055, FDR q = 2.87 × 10− 11). This finding suggests potentially greater neurotransmission activity within these cells (Additional file 1: Figures S9–S10, Text S6; Additional file 2: Table S2–S3).
Schema outperforms alternative methods for spatial transcriptomic analysis
We sought to benchmark our method by comparing the robustness of Schema’s results with those based on canonical correlation analysis (CCA) and with two methods specifically intended for spatial transcriptomics, namely SpatialDE [18] and Trendsceek [19].
An important point is that CCA, SpatialDE, and Trendsceek are less general than Schema and therefore require non-trivial modifications to approximately match Schema’s capabilities. CCA is limited in that it can correlate only two datasets at a time, whereas here we seek to synthesize three modalities: gene expression, cell-type labels, and spatial density. We adapted CCA by correlating two modalities at a time and combining the sub-results (Methods). In the case of SpatialDE and Trendsceek, their unsupervised formulation does not allow the researcher to specify the spatial features to pick out (we focus on spatial density variation). To adapt these, we collated their results from separate runs on granule and non-granule cells (Methods). Notably, the ad hoc modifications required to extend existing methods beyond two modalities underscore the benefit of Schema’s general analytic formulation that can be naturally extended to incorporate any number of additional data modalities.
Reasoning that a robust computational approach should return consistent results across biological replicates, we evaluated the stability and quality of each spatial transcriptomic technique by comparing its results on three replicate samples of mouse cerebellum tissue (coronal sections prepared on the same day [10]; pucks 180430_1, 180430_5, 180430_6) (Methods). While both Schema and CCA identify a gene set that ostensibly corresponds to granule cells in dense regions (Fig. 4d; Additional file 1: Figure S4), the gene rankings computed by Schema are more consistently preserved between pairs of replicates than those computed by CCA, with the median Spearman rank correlation between sample pairs being 0.68 (Schema) versus 0.46 (CCA). Likewise, with Schema, 69.1% of enriched GO biological-process terms are observed in all three samples and 78% are in at least two samples. The corresponding numbers for CCA were 35.7% and 59.5%, respectively (FDR q < 0.001 in all cases). We thus find that Schema’s results are substantially more robust across the three replicates. Compared to CCA’s unconstrained synthesis, Schema’s constrained formulation avoids overfitting to sample-specific noise, enhancing its robustness (Methods; Fig. 4e; Additional file 1: Figure S5).
When performing the same gene list robustness analysis with SpatialDE and Trendsceek, while also looking at the stability of their gene rankings specific to the precursor cell type (gray points in Fig. 4e), we found that SpatialDE produced slightly more stable gene rankings than Trendsceek, with median sample-pair correlations of 0.089 and − 0.002, respectively, but these were still lower than those for Schema. We also observed that SpatialDE and Trendsceek had substantially longer running times and we performed our analysis of the two methods on subsets of the overall dataset (see “Schema can scale to massive single-cell datasets” for precise runtime and memory usage). These results demonstrate the robustness and efficiency of Schema’s supervised approach.
Beyond gene expression: Schema reveals CDR3 segments crucial to T cell receptor binding specificity
To further demonstrate the generality of Schema, we applied it to synthesize data modalities beyond gene expression. We integrated single-cell multimodal proteomic and functional data with Schema to better understand how sequence diversity in the hypervariable CDR3 segments of T cell receptors (TCRs) relates to antigen binding specificities [40]. De novo design of TCRs for an antigen of interest remains a pressing biological and therapeutic goal [41, 42], making it valuable to identify the key sequence locations and amino acids that govern the binding characteristics of a CDR3 segment. Towards this end, we analyzed a single-cell dataset that recorded clonotype data for 62,858 T cells and their binding specificities against a panel of 44 ligands [5] and used Schema’s feature selection capabilities to estimate the sequence locations and residues in the CDR3 segments of α and β chains important to binding specificity.
To estimate location-specific selection pressure, we ran Schema with the CDR3 peptide sequence data as the primary modality and the binding specificity information as the secondary modality, performing separate runs for α and β chains. In the primary modality, each feature corresponds to a CDR3 sequence location and we used the Hamming distance metric between observations (i.e., the number of locations at which two sequences differ, see Methods). Schema assigned relatively low feature weights to the location segments 3–9 (in α chain CDR3) and 5–12 (in β chain CDR3), suggesting those regions can tolerate greater sequence variability while preserving binding specificity.
To evaluate these results, we compared them to estimates based on CDR3 sequence motifs sourced from VDJdb [43], a curated database of TCRs with known antigen specificities. In VDJdb, TCR motifs are scored using an adaptation of the relative entropy algorithm [44] by Murugan et al. that assigns a score for each location and amino acid in the motif. We aggregated these scores into a per-location score (Methods), allowing a comparison with Schema’s feature weights (Fig. 5). While the comparison at locations 11–20 is somewhat complicated by VDJdb having fewer long sequences (Methods), there is agreement between Schema and VDJdb estimates on locations 1–10 where both datasets have good coverage (Spearman rank correlations of 0.38 and 0.92 for the α and β chains, respectively; Fig. 5c, d). We note that weight estimation using Schema required only a single multimodal dataset; in contrast, extensive data collection, curation, and algorithmic efforts underlie the VDJdb annotations. The latter covers multiple experimental datasets, including the 10x Genomics dataset [5] we investigated here; we saw similar results when comparing against an older version of VDJdb without this dataset.
Next, we used Schema to investigate the selection pressure on amino acids present in the variability-prone locations identified above (Methods). We first selected a sequence location (e.g., location 4 in α chain CDR3) and constructed a primary modality where each cell was represented by a one-hot encoding of the amino acid at the location (i.e., a 20-dimensional Boolean vector). The secondary modality was binding specificity information, as before. We performed separate Schema runs for each such location of interest on the two chains, computing the final score for each amino acid as the average score across these runs. These scores are in good agreement with the corresponding amino acid scores aggregated from the VDJdb database (Spearman rank correlation = 0.74, two-sided t-test p = 2 × 10− 4). The residue and location preferences estimated here can directly be used in any algorithm for computational design of epitope-specific CDR3 sequences to bias its search towards more functionally plausible candidate sequences.
Schema’s ability to efficiently synthesize arbitrarily many modalities, with their relative importance at the researcher’s discretion, allows information that might otherwise be set aside (e.g., metadata like batch information, cell line, or donor information) to be effectively incorporated, enhancing the robustness and accuracy of an analysis. In Methods, we exemplify this use-case on the TCR dataset by incorporating measurements of cell-surface markers as an additional secondary modality, hypothesizing that cell-surface protein levels should be unrelated to V(D)J recombination variability.
Additional demonstrations
Applying Schema on a mouse gastrulation dataset [45] consisting of 16,152 epiblast cells split over three developmental timepoints and with two replicates at each timepoint, we performed differential expression analysis while simultaneously accounting for batch effects and developmental age, and evaluated its results alongside those from MOFA+, a recently introduced single-cell multimodal analysis technique [22, 46] (Additional file 1: Figure S1, Text S1). We also used Schema to study cell differentiation by synthesizing spliced and unspliced mRNA counts in a dataset of 2930 mouse dentate gyrus cells [47]. As in standard RNA velocity analyses, correlating spliced and unspliced counts in a cell picks up on the time derivative of a cell’s expression state and thus illuminates the cell differentiation process. Schema’s results agree with those from the dedicated RNA velocity tool scVelo [48], and we also demonstrate how Schema can be used to infuse velocity information into a t-SNE visualization (Additional file 1: Figure S2, Text S2).
Schema can scale to massive single-cell datasets
We have designed Schema to process large single-cell datasets efficiently, with modest memory requirements. On average, Schema processes data from a Slide-seq replicate (three modalities, 20,823 transcriptomes × 17,607 genes) in 34 min, requiring less than 5 GB of RAM in the process (Additional file 1: Table S1). The runtime includes the entire set of Schema sub-runs performed over an ensemble of parameters, as well as the time taken for the preprocessing transformation.
Schema’s efficiency stems from our novel mathematical formulation. Deviating from standard metric learning approaches, we formulate the synthesis problem as a quadratic program optimization, which can be solved much faster than the semidefinite program formulations typically seen in these approaches (Additional file 1: Text S3). Additionally, while the full Schema algorithm has quadratic scalability in the number of cells, our formulation allows us to obtain good approximations with provably bounded error using only a logarithmic subsample of the dataset (Additional file 1: Text S5), enabling sublinear scalability in the number of cells that will be crucial as multimodal datasets increase in size. These subsampling techniques can also leverage diversity-preserving data sketching techniques [49, 50] that may empirically lead to greater representation of rare cell types in the Schema analysis.