MultiMAP: dimensionality reduction and integration of multimodal data

Multimodal data is rapidly growing in many fields of science and engineering, including single-cell biology. We introduce MultiMAP, a novel algorithm for dimensionality reduction and integration. MultiMAP can integrate any number of datasets, leverages features not present in all datasets, is not restricted to a linear mapping, allows the user to specify the influence of each dataset, and is extremely scalable to large datasets. We apply MultiMAP to single-cell transcriptomics, chromatin accessibility, methylation, and spatial data and show that it outperforms current approaches. On a new thymus dataset, we use MultiMAP to integrate cells along a temporal trajectory. This enables quantitative comparison of transcription factor expression and binding site accessibility over the course of T cell differentiation, revealing patterns of expression versus binding site opening kinetics.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-021-02565-y.


UMAP
Uniform manifold approximation and projection (UMAP) is a popular approach for dimensionality reduction due to its fast runtime and striking ability to preserve the structure of data [14,56]. The formulation of UMAP is motivated by ideas from Riemannian geometry, algebraic topology, and fuzzy set theory. We discuss UMAP in preparation for the introduction of MultiMAP. To more smoothly transition to MultiMAP, we modify the notation and presentation found in the original UMAP paper [14]. We also expand on the motivation for some of the steps.
The key steps of UMAP are summarized as follows. UMAP takes the data to be distributed uniformly on a manifold. UMAP estimates geodesic distances between data points on this manifold. These distances are used to construct a fuzzy set representation of the structure of the data on the manifold. A fuzzy set representation of the data is also constructed in a low-dimensional space. UMAP then optimizes the layout of the data in the low-dimensional space to minimize the cross entropy between the two representations.
UMAP takes as input a single high-dimensional dataset X = {x_1, ..., x_N ∈ R^D} and returns a low-dimensional embedding Y = {y_1, ..., y_N ∈ R^d}, d < D, where y_i is the projection of x_i. The data is taken to be uniformly distributed on a manifold M with Riemannian metric g and ambient space R^D. The metric g provides a means to calculate geodesic distances between points on M given their coordinates in R^D.
UMAP aims to calculate geodesic distances d_M(p, q) between data points p, q ∈ M. It would be possible to calculate d_M if we knew g, but since we usually do not, UMAP must resort to other means. Since the data is uniformly distributed on M, any ball on the manifold of fixed radius should contain the same number of data points. Conversely, a ball centered on any point p on the manifold containing k data points should have the same radius. If g is locally a constant diagonal matrix within the ball, then it follows that we can calculate geodesic distances between p and its k-nearest data points by normalizing Euclidean distances with respect to the radius of the ball. This is formalized in Lemma 1, adapted from [14].
Lemma 1 Let (M, g) be a Riemannian manifold in an ambient R^D, and let p ∈ M be a point. If g is locally constant about p in an open neighbourhood U such that g is a scalar matrix in ambient coordinates, then in a ball B ⊆ U centered at p, the geodesic distance d_M from p to any point q ∈ B is (1/σ_p) d_{R^D}(p, q), where σ_p is the radius of the ball in the ambient space and d_{R^D} is the distance metric in the ambient space.
Since g is locally a constant diagonal matrix, a point's k-nearest neighbors on M are the same as its k-nearest neighbors in R^D. We can therefore calculate the geodesic distances

    d_M(x_i, x_j) = (1/σ_i) d_{R^D}(x_i, x_j),    x_j ∈ N(x_i; k),

where N(x_i; k) is the set of the k-nearest neighbors of x_i and σ_i is the radius in R^D of the ball containing N(x_i; k). A robust way to estimate σ_i is discussed later in this section. UMAP uses d_M to construct a representation of the data distribution on M. This representation takes the form of a fuzzy set (A, µ) defined by a reference set A and a membership strength function µ : A → [0, 1]. UMAP constructs a fuzzy simplicial set, with A being the set of edges (1-simplices) that connect neighboring points. This fuzzy simplicial set can also be viewed as a weighted k-nearest neighbor graph (k-NNG), with nodes {1, ..., N}, edges A, and edge weights µ. Concretely,

    A = {(x_i, x_j) : x_j ∈ N(x_i; k) or x_i ∈ N(x_j; k)},
    µ(x_i, x_j) = µ_{i|j} + µ_{j|i} − µ_{i|j} µ_{j|i},
    µ_{i|j} = exp(ρ_i − d_M(x_i, x_j)),    ρ_i = min_{x_j ∈ N(x_i; k)} d_M(x_i, x_j).

The value of µ is larger when points are closer together on the manifold. The particular form of µ is motivated by results from algebraic topology and category theory [1,57,58]. A number of packages can be used to efficiently compute approximate nearest neighbors in R^D.
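The geodesic distance normalization and membership strengths described above can be sketched in plain Python. This is an illustrative toy (brute-force neighbor search, with σ_i taken simply as the radius of the k-neighbor ball rather than the more robust estimate discussed below); all function names are our own:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn(X, i, k):
    """Indices of the k nearest neighbors of X[i] (brute force)."""
    order = sorted((j for j in range(len(X)) if j != i),
                   key=lambda j: euclidean(X[i], X[j]))
    return order[:k]

def memberships(X, k):
    """Directed membership strengths mu[i][j] = exp(rho_i - d_M(x_i, x_j)).

    sigma_i is taken here as the distance to the k-th neighbor (a simple
    proxy for the ball radius); the paper estimates sigma_i by binary
    search instead."""
    mu = {}
    for i in range(len(X)):
        nbrs = knn(X, i, k)
        sigma = euclidean(X[i], X[nbrs[-1]]) or 1.0
        d_M = {j: euclidean(X[i], X[j]) / sigma for j in nbrs}
        rho = min(d_M.values())
        mu[i] = {j: math.exp(rho - d) for j, d in d_M.items()}
    return mu

X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0)]
mu = memberships(X, k=2)
```

Note that each point's nearest neighbor receives membership 1, and memberships decay for more distant neighbors, as the definition of µ_{i|j} requires.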
To finish constructing the fuzzy simplicial set, σ_i still needs to be estimated. Since the data is uniformly distributed on M, the geodesic distances between neighboring data points are i.i.d., and so we expect Σ_j µ_{i|j} to be the same for each i. Therefore, σ_i is taken to satisfy the relation

    Σ_j µ_{i|j} = c,

where c is some constant. UMAP sets c to log₂(k) based on empirical results. Binary search is used to estimate the values of σ_i that satisfy this relation.
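The binary search for σ_i can be sketched as follows. Written in ambient distances, the UMAP-style membership exp(ρ_i − d_M(x_i, x_j)) equals exp(−(d − ρ_i)/σ_i) with ρ_i the distance to the nearest neighbor, so we search for the σ that makes the neighbor sum hit log₂(k). A toy with hypothetical names, not the library implementation:

```python
import math

def find_sigma(dists, k, n_iter=64, tol=1e-8):
    """Binary-search sigma so that sum_j exp(-(d_j - rho)/sigma) = log2(k).

    dists: ambient distances from x_i to its k nearest neighbors.
    The sum increases monotonically with sigma, so we first expand the
    upper bound until the target is bracketed, then bisect."""
    target = math.log2(k)
    rho = min(dists)

    def total(sigma):
        return sum(math.exp(-(d - rho) / sigma) for d in dists)

    lo, hi = 1e-12, 1.0
    while total(hi) < target:   # expand until the target is bracketed
        hi *= 2.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if abs(total(mid) - target) < tol:
            break
        if total(mid) < target:
            lo = mid
        else:
            hi = mid
    return mid

sigma = find_sigma([0.3, 0.5, 0.9, 1.4], k=4)
```

As σ → 0 only the nearest neighbor contributes (sum → 1), and as σ → ∞ the sum approaches k, so a root for any target in between is guaranteed.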
UMAP also constructs a fuzzy set (A, ν) representation of the data in the low-dimensional space R^d. Ultimately, UMAP optimizes the layout of the data so that (A, ν) resembles (A, µ). We are not interested in creating a fuzzy set that accurately captures the manifold structure of R^d. So a simpler form of the membership function is used that encourages points close on M to also be close in the low-dimensional space while keeping ν in the range [0, 1]. The membership function in the low-dimensional space is defined as

    ν(y_i, y_j) = (1 + a ‖y_i − y_j‖^{2b})^{−1},

where a and b are user-defined positive values. To quantify the difference between the fuzzy set of the data on the manifold and the fuzzy set in the low-dimensional space, UMAP uses fuzzy set cross entropy [59]. The cross entropy of (A, ν) relative to (A, µ) is given by

    C((A, µ), (A, ν)) = Σ_{(x_i, x_j) ∈ A} [ µ(x_i, x_j) log( µ(x_i, x_j) / ν(y_i, y_j) ) + (1 − µ(x_i, x_j)) log( (1 − µ(x_i, x_j)) / (1 − ν(y_i, y_j)) ) ].

UMAP initializes Y to the spectral layout of the k-NNG, and then uses stochastic gradient descent to minimize C with respect to Y. This optimization scales linearly with the number of data points, and so is quite efficient. The optimized Y is returned as the low-dimensional projection of the data.
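The fuzzy set cross entropy is straightforward to compute once both fuzzy sets are represented as edge-to-membership maps. A minimal sketch (the clamping constant `eps` is our own guard against log(0), not part of the formulation):

```python
import math

def fuzzy_cross_entropy(mu, nu, eps=1e-12):
    """Fuzzy set cross entropy C((A, mu), (A, nu)) summed over shared edges.

    mu and nu map each edge (i, j) to a membership strength in [0, 1].
    Each edge contributes an attractive term (mu pulls nu up) and a
    repulsive term (1 - mu pushes nu down)."""
    c = 0.0
    for edge, m in mu.items():
        m = min(max(m, eps), 1.0 - eps)
        n = min(max(nu[edge], eps), 1.0 - eps)
        c += m * math.log(m / n) + (1.0 - m) * math.log((1.0 - m) / (1.0 - n))
    return c

mu = {(0, 1): 0.9, (1, 2): 0.4}
same = fuzzy_cross_entropy(mu, mu)                       # identical sets
mismatch = fuzzy_cross_entropy(mu, {(0, 1): 0.2, (1, 2): 0.4})
```

The cross entropy vanishes when the two fuzzy sets agree and grows with their disagreement, which is what makes it a sensible optimization objective for the layout.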

Generalized Multimodal Setting
Here, we introduce a generalized description of multimodal data, which will be the setting of MultiMAP. In this setting, we have multiple datasets

    {X^v = {x^v_1, ..., x^v_{N_v} ∈ R^{D_v}}},    v ∈ {1, ..., V}.

We also assume we can calculate distances between data points in different datasets. To further preserve the generality of the setting, we take these distances to be defined between select pairs of datasets, for a select number of points, in a potentially limited feature space. The existence of only some distances is what characterizes this setting as multimodal. If distances were defined between every point using the same features, all data would reside in a shared feature space. Concretely, for each pair of datasets (u, v) in a set S of dataset pairs, distances can be calculated between members x^u_i, i ∈ I^{uv} ⊆ {1, ..., N_u}, of X^u and members x^v_j, j ∈ J^{uv} ⊆ {1, ..., N_v}, of X^v, in a shared space X^{uv} with distance metric d_{X^{uv}}:

    D^{uv}_{ij} = d_{X^{uv}}(x^u_i, x^v_j),    (u, v) ∈ S, i ∈ I^{uv}, j ∈ J^{uv}.

Distances can be calculated based on features, labels, or points shared between datasets. We take X^{vv} = X^v and I^{vv} = J^{vv} = {1, ..., N_v}, so that distances are always defined within each dataset. So in summary, in the generalized multimodal setting, we are given {X^{uv}}, {I^{uv}}, {J^{uv}}, and S. Alternatively, instead of {X^{uv}}, we can be given the distances {D^{uv}_{ij}} directly. This multimodal setting is "generalized" because we do not assume the data all reside in the same feature space, nor that the datasets have the same dimensionality, nor that distances are defined between every pair of data points. As we discuss later, many approaches for multimodal integration require that every data point reside in exactly the same feature space, a setting which makes all of these assumptions. In contrast, approaches in the generalized multimodal setting can potentially leverage features not present in all datasets, define distances between certain data using side information such as class labels, and eliminate distances between certain data points if, for example, they are noisy or uncertain.
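The bookkeeping of the generalized setting can be made concrete with a small sketch. Here two toy datasets of different dimensionality share only their first two features, so the shared space X^{12} is built from that feature subset; the datasets, index sets, and the choice of shared space are all illustrative:

```python
# Two toy datasets with different dimensionality.
X1 = [[0.0, 0.0, 7.0], [1.0, 0.0, 3.0]]          # N_1 = 2, D_1 = 3
X2 = [[0.1, 0.1], [0.9, 0.1], [5.0, 5.0]]        # N_2 = 3, D_2 = 2

# Dataset pairs for which distances are defined (self pairs always are).
S = {(1, 1), (2, 2), (1, 2), (2, 1)}

# Index sets: which points of each dataset participate in the pair (1, 2).
I = {(1, 2): [0, 1]}          # all of X1
J = {(1, 2): [0, 1, 2]}       # all of X2

def d_shared(p, q):
    """Distance in the shared space X^{12}: the first two features of an
    X1 point against all features of an X2 point (illustrative choice)."""
    return sum((a - b) ** 2 for a, b in zip(p[:2], q)) ** 0.5

# D^{12}_{ij}, defined only for i in I[(1, 2)], j in J[(1, 2)].
D12 = {(i, j): d_shared(X1[i], X2[j])
       for i in I[(1, 2)] for j in J[(1, 2)]}
```

In a real single-cell application, the shared space could instead come from, e.g., gene activity scores computed from chromatin accessibility matched against gene expression.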

Foundations of MultiMAP
Here we introduce Multimodal Manifold Approximation and Projection (MultiMAP). MultiMAP is an approach to dimensionality reduction which projects multimodal data into a single low-dimensional embedding. Whereas traditional dimensionality reduction methods project a single dataset, MultiMAP operates on multiple datasets simultaneously to identify a projection that is suitable for all of the data. MultiMAP builds upon the framework of UMAP to extend it to the generalized multimodal setting. In doing so, MultiMAP preserves and extends the mathematical motivation and computational efficiency of UMAP. We emphasize that MultiMAP is a novel algorithm distinct from UMAP: MultiMAP operates on any number of datasets with different dimensions (UMAP operates on only one dataset) and has different graph construction, edge weighting, and optimization as compared to UMAP.
A principal hypothesis of MultiMAP is that if different data modalities characterize the same underlying system, the data from each modality should be uniformly distributed on a single latent manifold. We refer to this idea as the multimodal manifold hypothesis. This manifold is a single landscape on which data from different modalities reside in an integrated and unified manner. In general, the manifold is an abstract object that does not exist in the coordinate system of any particular dataset. In the context of single-cell measurements, this manifold can be thought of as the landscape of cell states, with distinct regions corresponding to cell types and continuous regions corresponding to transitions between cell types. MultiMAP aims to recover the structure of the data distribution on the manifold and project it into a low-dimensional space.
A second principal hypothesis of MultiMAP is that if data points from different modalities are close together on the latent manifold, they will also be close in the coordinate systems of each of the datasets. We refer to this idea as the invariance of similarity. This can be quite a robust assumption: measurement differences often produce global shifts that keep neighboring points in the same vicinity. For example, affine transformations, smooth nonlinear distortions, and random noise generally keep close points together. In the context of single-cell measurements, if cells have a similar underlying biological state (belong to the same type or are at a similar point in a dynamic trajectory), they likely exhibit similar transcriptomes, epigenomes, proteomes, and other characteristics. Visualizations showing that cells of the same type cluster together for a variety of omics, batches, individuals, and species confirm that cell similarity is preserved across modalities. MultiMAP leverages the invariance of similarity to recover information about the manifold from the multimodal data.
The key steps of MultiMAP are depicted in Figure 1 and summarized as follows. MultiMAP takes the multimodal data to be distributed uniformly on a manifold. MultiMAP estimates geodesic distances on this manifold between data points. These distances are used to construct a fuzzy set representation of the structure of the data on the manifold. A fuzzy set representation of the data is also constructed in a low-dimensional space. MultiMAP then optimizes the layout of the data in the low-dimensional space to minimize a weighted cross entropy between the two representations. While at a high level these steps mirror those of UMAP, the mathematical formulation and computation of each step differ to account for the multimodal setting.
MultiMAP operates in the generalized multimodal setting and takes as input {X^{uv}} or {D^{uv}_{ij}}, in addition to {I^{uv}}, {J^{uv}}, and S, as described in the previous section. MultiMAP returns an embedding {Y^v = {y^v_1, ..., y^v_{N_v} ∈ R^d}} of all of the data into a single low-dimensional space, where y^v_i is the projection of x^v_i. MultiMAP can also return a graph, which we call the MultiGraph, consisting of all of the data integrated into a single graph structure. We describe the MultiGraph in more detail later in this section.
Motivated by the multimodal manifold hypothesis, MultiMAP takes the data to be uniformly distributed on a manifold M with Riemannian metric g. In the generalized multimodal setting, different data points on M exist in different feature spaces. We therefore allow M to have multiple ambient spaces. Each ambient space X^{uv} is the shared feature space in which distances between datasets X^u and X^v are defined. In general, g is different with respect to each of these ambient spaces, so we denote by g^{uv} the metric in the coordinates of X^{uv}. The metric g^{uv} gives a way to calculate geodesic distances between points on M given the coordinates of the points in X^{uv}.
MultiMAP seeks to estimate geodesic distances between points on M. This is challenging since we usually do not know g^{uv}. But we can make headway if we can say something further about g^{uv}. Motivated by the invariance of similarity, we take g^{uv} to have a form that results in points close on M also being close in the ambient space X^{uv}. Specifically, we take g^{uv} to be a constant diagonal matrix within a ball B(p) centered at point p ∈ M that contains the k nearest data points to p. We do allow g^{uv} to be different for each u, v, and p. If g^{uv} takes this form, data points are nearest neighbors in X^{uv} if and only if they are nearest neighbors on M, preserving a sense of similarity between p and its neighbors. This is formalized in Lemma 2, which is proven at the end of this section.
Lemma 2 Let (M, g^{uv}) be a Riemannian manifold with an ambient space X^{uv}, p ∈ M be a point, and P be a finite set of points on M. Let N_M(p, k) ⊆ P be the set of k points that are nearest to p on M, and N^{uv}(p, k) ⊆ P be the set of k points that are nearest to p in X^{uv}. If, within a ball B centered at p and containing N_M(p, k), g^{uv} is locally constant such that it is a scalar matrix in ambient coordinates, then N_M(p, k) = N^{uv}(p, k).
Since g^{uv} is a constant diagonal matrix within B(p), it follows that we can calculate geodesic distances between p and its k-nearest data points by normalizing distances in X^{uv} with respect to the radius of the ball. This is formalized in Lemma 3, which adapts the notation of Lemma 1 to the multimodal setting.
Lemma 3 Let (M, g^{uv}) be a Riemannian manifold with an ambient space X^{uv}, and let p ∈ M be a point. If g^{uv} is locally constant about p in an open neighbourhood U such that g^{uv} is a scalar matrix in the coordinates of X^{uv}, then in a ball B ⊆ U centered at p, the geodesic distance d_M from p to any point q ∈ B is (1/σ^{uv}_p) d_{X^{uv}}(p, q), where σ^{uv}_p is the radius of the ball in X^{uv} and d_{X^{uv}} is the distance metric in X^{uv}.

This allows calculation of the geodesic distances

    d_M(x^u_i, x^v_j) = (1/σ^{uv}_i) d_{X^{uv}}(x^u_i, x^v_j),    x^v_j ∈ B(x^u_i), (u, v) ∈ S, i ∈ I^{uv}, j ∈ J^{uv},

where σ^{uv}_i is the radius of B(x^u_i) in the ambient space X^{uv}. The restricted values of u, v, i, and j reflect the fact that we only know some D^{uv}_{ij} in the generalized multimodal setting.
To calculate d_M, we need an approach to determine if x^v_j ∈ B(x^u_i). Per the multimodal manifold hypothesis, since all of the data are uniformly distributed on M, we expect B(x^u_i) to contain k N_v / N data points from X^v, where N = Σ_v N_v. We also expect B(x^u_i) to contain k |J^{uv}| / N data points x^v_j, j ∈ J^{uv}, as long as J^{uv} can be approximated as a random sample of {1, ..., N_v}. By Lemma 2, these x^v_j are nearest neighbors of x^u_i in X^{uv}. Since we have pairwise distances between all data in X^{uv}, we can calculate these nearest neighbors, giving us points in B(x^u_i). We can therefore estimate the geodesic distances

    d_M(x^u_i, x^v_j) = (1/σ^{uv}_i) d_{X^{uv}}(x^u_i, x^v_j),    x^v_j ∈ N^{uv}(x^u_i; K),    K = k |J^{uv}| / N,

where N^{uv}(x^u_i; K) is the set of K points x^v_j, j ∈ J^{uv}, that are closest to x^u_i in the space X^{uv}. The value of K must be an integer, so k |J^{uv}| / N is rounded. A robust way to estimate σ^{uv}_i is discussed later in this section. MultiMAP uses d_M to construct a representation of the data distribution on M. Like UMAP, this representation takes the form of a fuzzy set (A, µ), defined by a reference set A and a membership strength function µ : A → [0, 1]. MultiMAP constructs a fuzzy simplicial set, A being the set of edges (1-simplices) that connect neighboring points, and µ taking on a larger value if the points are closer together on the manifold. In the generalized multimodal setting, we must construct (A, µ) differently from UMAP. Since we cannot define geodesic distances between some pairs of points, we must leave out edges connecting these pairs. The fuzzy simplicial set is defined as

    A = {(x^u_i, x^v_j) : x^v_j ∈ N^{uv}(x^u_i; K) or x^u_i ∈ N^{vu}(x^v_j; K)},
    µ(x^u_i, x^v_j) = µ^{uv}_{i|j} + µ^{vu}_{j|i} − µ^{uv}_{i|j} µ^{vu}_{j|i},
    µ^{uv}_{i|j} = exp(ρ^u_i − d_M(x^u_i, x^v_j)),    ρ^u_i = min_{v, x^v_j} d_M(x^u_i, x^v_j).

The form of µ is the same as that of UMAP and is motivated in [14]. The fuzzy simplicial set can also be viewed as a graph with nodes {x^v_i}, edges A, and edge weights µ. We call this the MultiGraph, as it integrates multimodal data into a single graph structure. The MultiGraph connects data points if they share a neighborhood on M. The MultiGraph is generally not a k-nearest neighbor graph because its nodes can differ in degree.
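The MultiGraph construction can be sketched end to end on toy inputs. The sketch below computes the per-modality neighbor count K = round(k|J^{uv}|/N), converts distances to memberships, and symmetrizes; for brevity σ^{uv}_i is fixed to 1 (the paper estimates it by binary search), so this is a schematic of the graph assembly, not the full algorithm:

```python
import math

def multigraph(D, sizes, k):
    """Sketch of MultiGraph construction from cross-modality distances.

    D maps dataset pairs (u, v) to {(i, j): distance}; sizes maps v to N_v.
    Nodes are (dataset, index) pairs. sigma^{uv}_i is fixed to 1 here,
    and rho is the distance to the nearest retained neighbor."""
    N = sum(sizes.values())
    directed = {}
    for (u, v), dists in D.items():
        J = {j for (_, j) in dists}
        K = max(1, round(k * len(J) / N))     # per-modality neighbor count
        by_i = {}
        for (i, j), d in dists.items():
            by_i.setdefault(i, []).append((d, j))
        for i, cand in by_i.items():
            nearest = sorted(cand)[:K]
            rho = nearest[0][0]
            for d, j in nearest:
                directed[((u, i), (v, j))] = math.exp(rho - d)
    # Symmetrize: mu = mu_ab + mu_ba - mu_ab * mu_ba.
    edges = {}
    for (a, b), m_ab in directed.items():
        if (b, a) in edges or a == b:
            continue
        m_ba = directed.get((b, a), 0.0)
        edges[(a, b)] = m_ab + m_ba - m_ab * m_ba
    return edges

D = {(1, 2): {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.8, (1, 1): 0.2},
     (2, 1): {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.2}}
G = multigraph(D, sizes={1: 2, 2: 2}, k=2)
```

Because K differs between modality pairs and the symmetrization merges directed memberships, nodes of the resulting graph can differ in degree, matching the observation above that the MultiGraph is generally not a k-NNG.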
As we discuss later, the MultiGraph is itself useful for integrated analyses.
To finish constructing the fuzzy simplicial set, we still need to estimate σ^{uv}_i. Per the multimodal manifold hypothesis, since the data is uniformly distributed on M, we expect Σ_{v,j} µ^{uv}_{i|j} to be the same for each x^u_i. Say this sum is equal to the constant c. The multimodal manifold hypothesis further states that every individual dataset is uniform on M. Thus if we fix v, we expect Σ_j µ^{uv}_{i|j} to still be the same for each x^u_i. Since the sum now has |J^{uv}| terms instead of N, it should equal c |J^{uv}| / N, as long as J^{uv} can be approximated as a random sample of {1, ..., N_v}. Further, if k is reasonably large, ρ^u_i = min_{x^v_j} d_M(x^u_i, x^v_j) will be about the same for any choice of v. We can therefore take σ^{uv}_i to satisfy the relation

    Σ_{j ∈ J^{uv}} µ^{uv}_{i|j} = c |J^{uv}| / N.

We set c to log₂(k) to be consistent with UMAP. Binary search is used to estimate the values of σ^{uv}_i that satisfy this relation.
MultiMAP proceeds to construct a fuzzy set (A, ν) representation of the data in the low-dimensional space R^d. Ultimately, MultiMAP optimizes the layout of the data so that (A, ν) resembles (A, µ). We are not interested in creating a fuzzy set that accurately captures the manifold structure of R^d. So we use a simpler form of the membership function that encourages points close on M to also be close in the low-dimensional space, while keeping ν in the range [0, 1] as required by the definition of fuzzy sets. As in UMAP, the membership function in the low-dimensional space is defined as

    ν(y^u_i, y^v_j) = (1 + a ‖y^u_i − y^v_j‖^{2b})^{−1},

where a and b are user-defined positive values. To quantify the difference between the fuzzy set of the data on the manifold and the fuzzy set in the low-dimensional space, like UMAP, MultiMAP uses fuzzy set cross entropy. The weighted cross entropy of (A, ν) relative to (A, µ) is given by

    C_ω((A, µ), (A, ν)) = Σ_{(x^u_i, x^v_j) ∈ A} ω^{uv} [ µ(x^u_i, x^v_j) log( µ(x^u_i, x^v_j) / ν(y^u_i, y^v_j) ) + (1 − µ(x^u_i, x^v_j)) log( (1 − µ(x^u_i, x^v_j)) / (1 − ν(y^u_i, y^v_j)) ) ],

where the weight ω^{uv} of each edge is determined by user-specified weights on the datasets of its endpoints, described below.
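The low-dimensional membership function is a simple rational curve in the embedding distance. A minimal sketch (the particular values of a and b here are illustrative, close to those UMAP derives for its default min_dist; they are user-settable, not fixed by the method):

```python
def nu(yi, yj, a=1.577, b=0.895):
    """Low-dimensional membership nu = 1 / (1 + a * ||yi - yj||^(2b)).

    a and b are user-defined positive values; the defaults here are an
    illustrative choice roughly matching UMAP's min_dist = 0.1 fit."""
    sq = sum((p - q) ** 2 for p, q in zip(yi, yj))
    return 1.0 / (1.0 + a * sq ** b)

close = nu([0.0, 0.0], [0.1, 0.0])
far = nu([0.0, 0.0], [3.0, 0.0])
```

The function equals 1 at zero distance and decays smoothly toward 0, so minimizing the cross entropy pulls high-µ pairs together and pushes low-µ pairs apart.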
MultiMAP initializes {Y^v} to the spectral layout of the MultiGraph, and then uses stochastic gradient descent to optimize {Y^v} to minimize the cross entropy. The optimized {Y^v} is returned as the low-dimensional projection of the data. Unlike UMAP, MultiMAP uses a weighted optimization scheme: the contribution of a point to the low-dimensional layout is weighted by the dataset it originates from. When an edge (x^u_i, x^v_j) ∈ A is sampled, the update of each endpoint is scaled by the weight of the dataset of the point at the other end of the edge. The gradient updates, with learning rate α, are given by

    y^u_i ← y^u_i − α ω^v ∂C^{uv}_{ij}/∂y^u_i,
    y^v_j ← y^v_j − α ω^u ∂C^{uv}_{ij}/∂y^v_j,

where C^{uv}_{ij} is the cross entropy term of the edge (x^u_i, x^v_j).
The weight ω^v ∈ R controls the influence of dataset X^v on the final embedding.
When ω^v has a larger relative value, the layout of all of the data will depend more strongly on X^v. This is especially useful when one dataset is known to be of higher or lower quality, in which case its ω^v can be set larger or smaller, respectively. In the case of two datasets {X^1, X^2}, setting ω^1 = 0 and ω^2 = 1 pulls the layout of the first dataset to that of the second dataset, without the first influencing the layout of the second. This can be useful for "aligning" a query dataset X^1 of unknown content or quality to a reference dataset X^2. The default value of all ω^v is 1, which equally weights the influence of all datasets. When all ω^v are the same, the optimization of MultiMAP is equivalent to that of UMAP.
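The effect of the dataset weights can be illustrated with a single attractive step along one cross-modality edge. This is a deliberately simplified schematic of the weighting (a linear pull rather than the true gradient of the cross entropy term, and no negative sampling); the point is that each endpoint's update is scaled by the weight of the *other* endpoint's dataset:

```python
def attract(y_query, y_ref, w_query, w_ref, alpha=0.5):
    """One attractive step along an edge between a query point (dataset 1)
    and a reference point (dataset 2). Each endpoint moves toward the
    other, scaled by the weight of the other point's dataset."""
    new_query = [q + alpha * w_ref * (r - q) for q, r in zip(y_query, y_ref)]
    new_ref = [r + alpha * w_query * (q - r) for q, r in zip(y_query, y_ref)]
    return new_query, new_ref

# omega^1 = 0 (query), omega^2 = 1 (reference): the query is pulled toward
# the reference, while the reference does not move at all.
q, r = attract([0.0, 0.0], [2.0, 2.0], w_query=0.0, w_ref=1.0)
```

With both weights equal to 1, the two endpoints move symmetrically, recovering the unweighted behavior described above.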
To deal with the unique challenges of the multimodal setting, MultiMAP is different from UMAP in several regards. MultiMAP differs from UMAP in its estimation of geodesic distances, construction of the fuzzy simplicial set, and use of a weighted cross entropy. The motivation of MultiMAP also incorporates new ideas including the multimodal manifold hypothesis and the invariance of similarity. These distinctions allow MultiMAP to deal with datasets of different dimensions and feature spaces, unknown distances between data in different datasets, and datasets of varying quality. MultiMAP can be seen as a generalization of UMAP to the multimodal setting, exactly reducing to UMAP when V = 1.

Proof of Lemma 2
Lemma 2 Let (M, g^{uv}) be a Riemannian manifold with an ambient space X^{uv}, p ∈ M be a point, and P be a finite set of points on M. Let N_M(p, k) ⊆ P be the set of k points that are nearest to p on M, and N^{uv}(p, k) ⊆ P be the set of k points that are nearest to p in X^{uv}. If, within a ball B centered at p and containing N_M(p, k), g^{uv} is locally constant such that it is a scalar matrix in ambient coordinates, then N_M(p, k) = N^{uv}(p, k).
We show that each point in N_M(p, k) is in N^{uv}(p, k), and that each point in N^{uv}(p, k) is in N_M(p, k).

First, consider a data point P such that P ∈ N_M and P ∉ N^{uv}. There must then be at least k points p̃_i ≠ P such that d_{X^{uv}}(p, p̃_i) < d_{X^{uv}}(p, P). Because p̃_i is closer than P to p in X^{uv}, and P ∈ B, it must be the case that p̃_i ∈ B. Since g^{uv} is a constant diagonal matrix at every point within B, by Lemma 3, d_M(p, P) = d_{X^{uv}}(p, P)/σ^{uv}_p and d_M(p, p̃_i) = d_{X^{uv}}(p, p̃_i)/σ^{uv}_p. Thus there are at least k points p̃_i such that d_M(p, p̃_i) < d_M(p, P), which contradicts P ∈ N_M. Therefore, if P ∈ N_M, then P ∈ N^{uv}.

Now consider P ∉ N_M and P ∈ N^{uv}. There must then be k data points p̃_i ≠ P with p̃_i ∈ N_M and d_M(p, p̃_i) < d_M(p, P). Since P ∈ N^{uv}, P must be closer than at least one such p̃_j to p in X^{uv}. Therefore, since p̃_j ∈ B, we have P ∈ B. Since g^{uv} is a constant diagonal matrix for all points in B, by Lemma 3, d_{X^{uv}}(p, P) = d_M(p, P) σ^{uv}_p and d_{X^{uv}}(p, p̃_i) = d_M(p, p̃_i) σ^{uv}_p. Thus there are k points p̃_i such that d_{X^{uv}}(p, p̃_i) < d_{X^{uv}}(p, P), which contradicts P ∈ N^{uv}. Therefore, if P ∈ N^{uv}, then P ∈ N_M.

Therefore N_M(p, k) and N^{uv}(p, k) contain the same points, i.e., N_M(p, k) = N^{uv}(p, k).
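The content of the lemma, that a locally constant scalar metric cannot change which points are nearest, can be checked numerically: dividing all ambient distances from p by a single positive constant never changes the ordering of neighbors. A tiny illustration:

```python
def k_nearest(dists, k):
    """Indices of the k smallest distances."""
    return sorted(range(len(dists)), key=lambda i: dists[i])[:k]

ambient = [0.4, 1.1, 0.2, 3.0, 0.7]        # distances from p in X^{uv}
sigma = 0.25                                # ball radius in the ambient space
geodesic = [d / sigma for d in ambient]     # Lemma 3 normalization

same = k_nearest(ambient, 3) == k_nearest(geodesic, 3)
```

Scaling by 1/σ^{uv}_p is a strictly monotone transformation of the distances, so the argsort, and hence every k-nearest-neighbor set, is preserved.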

Computation of MultiMAP
MultiMAP is outlined in Algorithm 1. For each data point, MultiMAP finds a set of nearest data points in each modality. The distances to these nearest neighbors are converted to geodesic distances on a shared latent manifold by normalizing with respect to a radius value. The number of nearest neighbors and the radius depend on the data point and the modality of the neighbor. The nearest neighbor pairs and geodesic distances are used to construct a fuzzy simplicial set and a MultiGraph. MultiMAP then initializes a layout of the data in a low-dimensional space to the spectral layout of the MultiGraph. MultiMAP proceeds to optimize the layout to minimize the weighted cross entropy between the fuzzy simplicial set and a fuzzy set of the data in the low-dimensional space. The optimization is performed using stochastic gradient descent. MultiMAP returns the optimized low-dimensional layout and the MultiGraph.
MultiMAP returns a MultiGraph which integrates multimodal data into a single graph structure. Construction of the MultiGraph is even more efficient than the full MultiMAP embedding, since the low-dimensional fuzzy set and embedding do not need to be constructed or optimized. The MultiGraph can be used with graph algorithms for clustering, node and link prediction, and other analyses.
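As a small illustration of downstream use, the MultiGraph can be handed directly to standard graph algorithms. Here a toy weighted edge list is clustered by connected components after thresholding weak memberships; this is a deliberately simple stand-in for the community detection methods one would use in practice (e.g., Leiden clustering), and the node labels are invented:

```python
from collections import defaultdict, deque

def components(edges, threshold=0.1):
    """Connected components of a weighted graph, ignoring edges whose
    membership strength falls below the threshold."""
    adj = defaultdict(set)
    for (a, b), w in edges.items():
        if w >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    seen, clusters = set(), []
    for node in adj:
        if node in seen:
            continue
        group, queue = set(), deque([node])
        while queue:                    # breadth-first traversal
            n = queue.popleft()
            if n in group:
                continue
            group.add(n)
            queue.extend(adj[n] - group)
        seen |= group
        clusters.append(group)
    return clusters

# Toy MultiGraph: nodes are (dataset, index) pairs, weights are memberships.
edges = {(("rna", 0), ("atac", 0)): 0.9,
         (("rna", 1), ("atac", 1)): 0.8,
         (("rna", 0), ("rna", 1)): 0.05}   # weak edge, dropped
clusters = components(edges)
```

Because the MultiGraph already places all modalities in one structure, any such clustering is automatically a joint clustering across modalities.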