MUON: multimodal omics analysis framework

Advances in multi-omics have led to an explosion of multimodal datasets to address questions from basic biology to translation. While these data provide novel opportunities for discovery, they also pose management and analysis challenges, thus motivating the development of tailored computational solutions. Here, we present a data standard and an analysis framework for multi-omics, MUON, designed to organise, analyse, visualise, and exchange multimodal data. MUON stores multimodal data in an efficient yet flexible and interoperable data structure. MUON enables a versatile range of analyses, from data preprocessing to flexible multi-omics alignment. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-021-02577-8.

programming languages. The currently existing solutions for multi-omics data (Seurat [8], MultiAssayExperiment [14]) are confined to the R programming language ecosystem, and typically require loading the full dataset, a limitation that prohibits dealing with larger datasets and can only be partially mitigated by using additional third-party software [15,16].
To address this, we here present MUON (multimodal omics analysis), an analysis framework that is designed from the ground-up to organise, analyse, visualise, and exchange multimodal data. MUON is implemented in Python and comes with an extensive toolbox to flexibly construct, manipulate and analyse multi-omics datasets. At the core of the framework is MuData, an open data structure standard, which is compatible with and extends previous data formats for single omics [9,17]. MuData files can be seamlessly accessed from different programming languages, including Python [18], R [19], and Julia [20]. We illustrate MUON in the context of different vignettes of its application with a major focus on single-cell data, including analysis of combined gene expression and chromatin accessibility assays as well as gene expression and epitope profiling.

MuData: a cross-platform multimodal omics data container
At the core of MUON is MuData (multimodal data)-an open data structure for multimodal datasets. MuData handles multimodal datasets as containers of unimodal data. This hierarchical data model generalises existing matrix-based data formats for single omics, whereby data from each individual omics layer are stored as an AnnData [17] object (Fig. 1a, c). MuData also provides a coherent structure for storing associated metadata and other side information, both at the level of samples (e.g. cells or individuals) and features (e.g. genes or genomics locations).  1 Architecture and content of a multimodal data container (MuData). a Schematic representation of the hierarchical structure of a MuData container. Raw data matrices from multiple modalities together with associated metadata are encapsulated in an array structure. For illustration, blue and red denote RNA-seq and ATAC-seq data modalities; green denotes multimodal annotation or derived data. b Example content of the structure in a. Shown are example content of a MuData container, consisting of count matrices, embeddings, neighbourhood graphs and cell annotations for individual modalities (blue, red), as well as derived data from multi-omics analyses (green). c Schematic representation of MUON storage model and its serialisation scheme using the HDF5 file format on disk. Left: Hierarchy of the storage model, with plates denoting different levels of hierarchy. Arrows signify access schemes of the HDF5 file using various programming languages. Right: Representation of the MuData object in Python, with metadata and derived annotations represented as NumPy arrays or Pandas DataFrames, and with individual modalities as AnnData objects Metadata tables can either be specific to a single stored data modality, or they can represent joint sample annotations that apply to all modalities stored in a MuData container. In a similar manner, MuData containers can be used to store derived data and analysis outputs, such as cluster labels or an inferred sample embedding (Fig. 1b).
MuData objects are serialised to HDF5 [21] files by default-the industry standard for storing hierarchical data. Individual omics layers are serialised using the existing AnnData serialisation format, thus permitting direct access to single omics using existing toolchains that build on this data standard (Fig. 1c). Basic access to MuData files is possible from all major programming environments that support access to HDF5 array objects. Additionally, MUON comes with dedicated libraries to create, read and write MuData files from Python, R, and Julia. These tools facilitate the exchange of multi-omics data across platforms and ensure consistent file format definitions.

MUON: a framework for multimodal omics data
The MUON framework allows for managing, processing, and visualising multi-omics data using the MuData containers. Existing workflows developed for single-omics can be reused and applied to the contents of a multi-omics container. For example, individual modalities of the simultaneous gene expression and chromatin accessibility profiling [22] can be processed using existing RNA [23] and ATAC [24] workflows. In this manner, canonical processing steps, including quality control, sample filtering, data normalisation and the selection of features for analysis can be transferred from single-omics analysis (Fig. 2a).
The integration of multiple modalities within a MuData container facilitates the definition of multi-omics analysis workflows, allowing to flexibly combine alternative processing steps (from left to right in Fig. 2b). For example, single-omics dimensionality reduction methods such as principal component analysis or factor analysis [25][26][27][28] can be used to separately process RNA-seq and ATAC-seq count matrices. Additionally, MUON comes with interfaces to multi-omics analysis methods that jointly process multiple modalities, including multi-omics factor analysis [29,30] (MOFA) to obtain lower-dimensional representations, and weighted nearest neighbours [31] (WNN) to calculate multimodal neighbours. Once the results from either dimensionality reduction strategy are stored in a MUON container, they can be used as input for defining cell neighbour graphs. This graph can be either estimated from individual omics modalities, from a multi-omics representation (e.g. as obtained from MOFA), or by fusing two single-omics neighbour representations (e.g. using methods such as similarity network fusion, SNF [32], or WNN [31]).
Finally, the latent or neighbourhood representations can serve as a starting point for downstream analysis and interpretation. For example, uniform manifold approximation and projection (UMAP) [33] can be directly applied to cell neighbourhood graphs to generate nonlinear embeddings of cells. Similarly, the alternative cell neighbourhood graphs can be used as input for identifying connected components and thereby putative cell types (e.g. using multiplex community detection techniques [34]). Example multi-omics analysis workflows implemented using MUON. a Construction and processing of individual modalities of a multi-omics scRNA-seq and scATAC-seq dataset. Processing steps for individual omics from left to right. Rectangles denote count matrices following each processing step, which are stored in a shared MUON data container. MUON provides processing functionalities for a wide range of single-omics, including RNA-seq, ATAC-seq, CITE-seq. Existing workflows and methods can be utilised, including those implemented in scanpy. Respective analysis steps are described below each step. b Alternative workflows for integrating multiple omics for latent space inference and clustering. MUON enables combining alternative analysis steps to define tailored multi-omics data integrations. Shown are canonical workflows from left to right: dimensionality reduction, definition of cell neighbourhood graphs, followed by either nonlinear estimation of cell embeddings or clustering. Letters W and Z denote matrices with feature weights (loadings) and factors (components), respectively. Triangles represent cell-cell distance matrices, with shading corresponding to cell similarity. Green colour signifies steps that combine information from multiple modalities; steps based on individual modalities only are marked with blue (RNA) or red (ATAC) respectively. The outputs of the respective workflows (right) are from top to bottom: UMAP space (i) and cell labels (ii) based on RNA or alternatively based on ATAC modality (iii, iv), cell labels based on two cell neighbour graphs from individual modalities (v), UMAP space and cell labels based on WNN output (vi, vii), UMAP space and cell labels based on MOFA output (viii, ix) The flexibility to choose and control individual processing steps in MUON makes it possible to compose tailored workflows for a particular dataset.

Application of MUON to single-cell multi-omics data
To illustrate MUON, we considered data from simultaneous scRNA-seq and scATACseq profiling of peripheral blood mononuclear cells (PBMCs), which were generated using the Chromium Single Cell Multiome ATAC + Gene Expression protocol by 10x Genomics [22]. Features in the RNA modality correspond to the expression level of genes, whereas the ATAC modality encodes accessible genomic loci as peaks. MUON supports the application of alternative dimensionality reduction strategies (Fig. 2). For example, multi-omics factors analysis [30]-an approach for integrating different omics modalities based on matrix factorization-yields a lower dimensional representation, including factors that capture variation of individual omics or shared variability (Additional file 1: Fig S1a), which in turn can be interpreted on the level of individual features (Fig. 3a, Additional file 1: Fig S1b). Here, the factors that explain the largest fraction of variance in PBMCs capture canonical biological differences, such as the myeloid-lymphoid axis and cytotoxicity (Fig. 3a, left). These factors capture both variation in mRNA abundance and chromatin accessibility, e.g. as CD3E expression and BCL11B promoter accessibility, which are characteristic for T cells [35,36] (Fig. 3a,  right).
A two-dimensional latent space recapitulating the structure of the data is commonly used for visualising cell type composition, cell-level covariates, or feature counts. For this, it is important that MUON allows to generate, store, and operate with multiple different embeddings constructed for individual modalities (Fig. 3b, left) or jointly for both modalities based, for instance, on the MOFA factors or the WNN graph (Fig. 3b,  right). Such visualisations can be generated from MuData objects without loading all the data into memory.
As a second example, we considered CITE-seq [37] data, which comprise gene expression and epitope abundance information in the same cells. To process the latter, specialised normalisation strategies for denoising and scaling [38] are available. Normalised protein counts can then be used to define cell types, akin to gating in flow cytometry [39] (Fig. 3c). Once the count matrices are processed, these can be integrated using alternative multimodal options (Fig. 2). For instance, using both modalities for cell-type annotation as well as for dimensionality reduction allows to attribute the distinction between naïve and memory T cells to the abundance of CD45 isoforms RA and RO at the protein level (Fig. 3d).

Discussion
Multimodal omics designs are increasingly accessible, allowing for characterising and integrating different dimensions of cellular variation, including gene expression, DNA methylation, chromatin accessibility, and protein abundance [3,40,41]. MUON directly addresses the computational needs posed by such multi-omics designs, including data processing, analysis, interpretation, and sharing (Fig. 1). Designed for the Python ecosystem, MUON operates on MuData objects that build on community standards for single-omics analysis [9]. Serialisation to HDF5 makes MuData objects accessible to other programming languages, including R and Julia. MUON is designed in a modular fashion, which means that existing methods and tools for processing individual omics can be reused to design more complex analysis workflows (Figs. 2 and 3). At the same time, the software facilitates combining single-omics analysis methods with a growing spectrum of multi-omics integration strategies [42,43] to define novel multi-omics workflows.
Looking ahead, MUON will be a robust platform to build upon and support future developments. On the one hand, handling novel assays for multi-omics that are emerging can be integrated. For example, mRNA and proteins can be assayed together not only with CITE-seq [37] but also with QuRIE-seq [44] or INs-seq [45]. Other examples include explicit support for genomic-coordinate based assays [46] or assays with spatial coordinates [47]. Moreover, trimodal assays such as scNMT-seq [48] or TEA-seq [49] allow to generate data beyond just two modalities and can be handled with MUON, which is designed to manage an arbitrary number of modalities. On the other hand, the complexity of experimental designs is rapidly increasing [50,51]. Already, MUON can take additional covariates into account during multimodal integration, for example, to perform temporally aware factor analysis [52]. Future development of MUON will include incorporating additional relationships in MuData, for example, to explicitly model the dependencies between feature sets across omics, or to account for dependencies between multiple sets of multi-omics experiments.

Conclusions
With MuData proposing a standardised and language-agnostic approach to manage, store, and share multimodal omics data, it is now possible to build methods and tools that can be applied to an increasingly large number of multi-omics datasets. As a multimodal framework, MUON addresses the need for multi-omics analysis workflows that are well integrated into the existing Python ecosystem, in particular with tools for omics analysis such as Scanpy [9]. At the same time, MuData facilitates the compatibility and data exchange with R and Julia.
(See figure on previous page.) Fig. 3 Single-cell multi-omics datasets processed and visualised using MUON. a MOFA factors estimated from simultaneous scRNA-seq and scATAC-seq profiling of PBMCs, with cells coloured by either left: coarsegrained cell type; or right: gene expression (in blue) and peak accessibility (in red). Displayed genes and peaks are selected to represent cell-type-specific variability along factor axes. b UMAP latent space for the same dataset as in a, constructed from left: principal components for individual modalities; or right: MOFA factors and WNN cell neighbourhood graph. Cells are coloured by coarse-grained cell type. c. Examples of individual feature values of protein abundance in the CITE-seq profiling of PBMCs after applying dsb normalisation. Colours correspond to the relative local density of cells with red for high density and blue for low density. d UMAP latent space for the same dataset as in c, constructed from MOFA factors (top) or WNN cell neighbourhood graph (bottom). Cells are coloured by their coarse-grained cell type or feature values (blue for gene expression, yellow for protein abundance

Implementation of MuData
The reference MuData implementation is written in the Python programming language and builds on AnnData [17]. A MuData object can be cast as a collection of singleomics modalities, each of which is represented as an AnnData object. Additionally, the MuData object provides basic selector operations, including access to individual modalities, subsetting of samples and/or features. When subsetting samples, these are selected in each modality as well as in multimodal annotations; features from different modalities can be used to obtain a MuData object with desired features. As with AnnData, unstructured data can be stored in a MuData object, which can be used for recording assay-specific information. Feature relations across modalities can be stored in the MuData object as a sparse multimodal graph.
MuData objects are serialised to .h5mu files, which are based on HDF5-industry standard for hierarchical storage of numerical data supported by many programming languages [21]. Individual modalities are stored in the file hierarchy in a way compliant with AnnData serialisation, enabling access to individual modalities from disk. Disk backing is implemented for MuData objects so that MuData files can be read without loading count matrices of individual modalities.
Cross-language capabilities of MuData files are demonstrated with Julia and R libraries. Julia library implements native AnnData and MuData objects whereas R libraries create MultiAssayExperiment [14] or Seurat [8] objects with information from MuData files. As .h5ad and .h5mu are not the native formats for R frameworks, standards are still to be developed for how to serialise auxiliary information stored in the R objectand, conversely, deserialize this information back from the files.

Comparison of MuData with alternative data formats
MuData and MUON take inspiration and build on concepts from AnnData [17] and Scanpy [9]. In fact, the software incorporates ideas and extends it in a modular fashion, similar to the existing practice in the Bioconductor community [60].

Processing gene expression and chromatin accessibility data
Single-cell multiome ATAC + gene expression demonstration data for peripheral blood mononuclear cells (PBMCs) from a healthy donor with granulocytes removed through cell sorting processed with ARC 1.0.0 pipeline were provided by 10X Genomics (https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets). Lognormalisation was used for both gene and peak counts, and respective values for highly variable features scaled and centred to zero mean and unit variance were then used as input to discussed algorithms such as PCA, as implemented in scikit-learn [55] and scanpy [9], or MOFA+ [30]. Differentially expressed genes and differentially accessible peaks were identified with respective functionality in scanpy and were used to compile gene lists for cell type identification.
Processing CITE-seq data CITE-seq data for PBMCs from a healthy donor were provided by 10X Genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_ protein_v3). Log-normalisation was used for gene counts, and dsb [38] was used to denoise and scale protein counts. Respective values for highly variable features scaled and centred to zero mean and unit variance were then used as input to discussed algorithms. The respective vignettes are available at https://muon-tutorials.readthedocs.io/ en/latest/cite-seq.
Additional file 1. Figure S1 Additional file 2. Review history