Panpipes: a pipeline for multiomic single-cell and spatial transcriptomic data analysis

Abstract

Single-cell multiomic analysis of the epigenome, transcriptome, and proteome allows for comprehensive characterization of the molecular circuitry that underpins cell identity and state. However, the holistic interpretation of such datasets presents a challenge given a paucity of approaches for systematic, joint evaluation of different modalities. Here, we present Panpipes, a set of computational workflows designed to automate multimodal single-cell and spatial transcriptomic analyses by incorporating widely-used Python-based tools to perform quality control, preprocessing, integration, clustering, and reference mapping at scale. Panpipes allows reliable and customizable analysis and evaluation of individual and integrated modalities, thereby empowering decision-making before downstream investigations.

Background

Single-cell omics is a rapidly evolving field, with studies constantly expanding in size and scope to tackle increasingly complex biological and biomedical questions associated with development and aging, health and disease, and vaccination and therapy [1,2,3]. Single-cell RNA sequencing (scRNA-seq) is now performed as a minimum, with a growing battery of methods becoming available to facilitate profiling of cell type-specific transcripts such as T- and B-cell receptor (TCR and BCR) repertoires (scV(D)J-seq) [4], as well as other molecular modalities, including chromatin states through single-cell sequencing Assay for Transposase-Accessible Chromatin (scATAC-seq) [5, 6], and over 100 cell-surface protein markers using antibody-derived tags (ADT) for Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-seq) [7]. Technological advances are also enabling experiments to move beyond the individual profiling of different molecular modalities and instead allow the simultaneous characterization of the cellular genome, epigenome, transcriptome, and/or proteome at a resolution that was previously inaccessible. To date, at least 25 different methods for the joint assaying of two or more modalities in single cells have been developed [3]. Single-cell multiomics are thus set to provide a fundamentally holistic understanding of cell and tissue circuitry and systems—surpassing insights that can be garnered from individual modalities alone.

Best practices for multimodal analysis are now emerging [8], with a wide range of packages and tutorials from which users can develop custom scripts [9, 10]. For an end user, a typical analysis workflow could consist of collections of notebooks, which are run interactively and customized for each individual project. At different stages, the user is required to make choices about, for example, filtering strategies, normalization, dimensionality reduction, and clustering to obtain a biologically meaningful interpretation of their data. However, this scenario does not constitute an efficient application of best practices: relying only on custom scripts poses a risk due to a lack of methodological consistency, thus jeopardizing reproducibility [11]. This problem is particularly relevant for large-scale projects, where sequential analysis rounds are necessary, as the dataset increases in size. Therefore, fully harnessing the power of multiomic single-cell technologies is impeded by the absence of comprehensive pipelines which seamlessly integrate best practices by jointly analyzing modalities in a reproducible, automated, and computationally efficient fashion.

To meet this need, we have developed Panpipes, a set of automated, flexible, and parallelized computational workflows for the processing of multimodal single-cell data. Panpipes is implemented in Python and is built around the scverse core tools for single-cell omics data analysis [12]. It leverages the efficient and flexible data handling structures AnnData [13] and MuData [14], complemented by a number of widely-used single-cell analysis tools including Scanpy [15], muon [14], scvi-tools [16], and scirpy [17]. These packages have been successfully applied in a variety of settings, including the building of large-scale atlases and deep learning computational tasks, for instance as part of the Human Cell Atlas reference building efforts [18], and they scale to millions of cells while maintaining reasonable computation times.

Single-cell analysis frameworks, such as Seurat [10] or Scanpy [15], have promoted the democratization of access to single-cell data analysis. Seurat leverages R’s statistical capabilities, while Scanpy relies on Python’s machine learning libraries, and both use distinct data structures (SeuratObject or AnnData and MuData, respectively). Each framework has its own inherent strengths and they cater to different programming communities (largely R versus Python users). Interactively analyzing single-cell data within a single framework can be useful for exploratory investigations and analysis of smaller datasets, but can pose challenges when testing multiple parameter combinations, especially for complex and large datasets. To meet these challenges, pipelines for single-cell analysis are emerging, which use workflow management software to orchestrate analysis with one or more frameworks. Such pipelines are designed for creating data processing workflows that automate and expedite complex processes involving multiple tasks and dependencies. Pipelines can thus enable the parallel comparison of different algorithms and tools. This is critical because, although benchmarking studies provide important guidance for algorithm or tool selection [19,20,21,22,23,24], no single method will necessarily generate the best results for all datasets, and benchmarking studies can also reach different conclusions [19, 20].

Published pipelines for single-cell analyses such as scFlow [25], scrnaseq [26], bollito [27], and pipeComp [28] are restricted to single modality (RNA only) datasets, and typically use R-based packages such as Seurat [10] and Scater [29]. Other published pipelines, such as SCiAp [30], Granatum [31], or ASAP [32], are designed to run using cloud computing and employ web-based interfaces. However, these web-based workflows can be restrictive in terms of analysis parameters and users may struggle with larger datasets. In contrast, Panpipes is designed to run on high-performance computing (HPC) clusters, but retains the capacity to be deployed locally for small datasets, providing the user with added run flexibility. The pipeline is managed using the Computational Genomics Analysis Tools (CGAT)-core framework [33], which simplifies and parallelizes job submission either on local computers or by interacting with common cluster workload managers such as SLURM [34].

Panpipes is the first set of open-source workflows for the analysis of multimodal single-cell and spatial transcriptomic datasets [35, 36]. Panpipes performs quality control, preprocessing, integration, clustering, reference mapping, and spatial transcriptomics deconvolution at scale. The user’s interaction with Panpipes is highly customizable, giving analysts fine control over their analyses in a reproducible manner. Our pipeline is written in a modular way such that the workflows can be further developed to keep up with the fast-moving field of single-cell multiomics and spatial transcriptomics. Because Panpipes relies on scverse, a well-maintained community project whose tools are interoperable between the Python and R ecosystems, this choice should help keep Panpipes future-proof.

Results

Panpipes: a pipeline for single-cell multiomic and spatial transcriptomic analysis

Panpipes comprises six workflows for the analysis of single-cell multiomic datasets: “Ingestion,” “Preprocessing,” “Integration,” “Clustering,” “Reference Mapping,” and “Visualization” (Fig. 1).

Fig. 1

The Panpipes workflows for single-cell multiomic and spatial transcriptomic analysis. Panpipes has a modular design and performs ingestion, preprocessing, integration and batch correction, clustering, reference mapping, and spatial transcriptomics deconvolution with custom visualization of outputs. The schematic demonstrates the flow of data within (solid arrows) and between (dashed arrows) workflows, and modality-specific steps are indicated

Panpipes also includes four workflows dedicated to spatial transcriptomics: “Ingestion,” “Preprocessing,” “Clustering,” and “Deconvolution” (Fig. 1). The unifying aim across these workflows is to guide the user through the key decision-making steps of the analytical process and to gather all the data necessary to annotate cell types and states.

The single-cell multiomics “Ingestion” workflow leverages Scanpy and muon functionalities, together with custom functions, to ingest single-cell data from any combination of assays including gene expression, immune repertoire profiling, CITE-seq, and ATAC-seq. A variety of file formats can be used as input, namely count matrices, CellRanger outputs, and h5 files. This flexibility simplifies the process of quickly analyzing published datasets as well as novel data from any single-cell platform. The data are then concatenated and saved to disk as a MuData (h5mu) file, a multimodal container [14]. Additionally, the workflows can incorporate metadata associated either with the sample, such as patient-level information, or with the barcode, such as demultiplexing information or cell-level annotations in the case of preprocessed data. Standard and custom QC metrics are computed and visualized for each modality.

The second stage of Panpipes, the “Preprocessing” workflow for the single-cell multiomic data, is used to (i) filter the data based on previously computed quality control (QC) metrics, (ii) (optionally) downsample, (iii) normalize, and (iv) scale the data, with different options available for each modality. After preprocessing, any of the following workflows can be run, depending on the analytical requirements: “Integration,” “Clustering,” and “Reference Mapping.” “Integration” is used to integrate and (optionally) batch correct via a choice of uni- and multimodal methods, which can be run in a parallel fashion. “Clustering” runs parallelized clustering over a wide range of parameters using dimension reductions from either “Preprocessing” or “Integration” as inputs. “Reference Mapping” utilizes query-to-reference (Q2R) and label transfer (LT) functionalities from scvi-tools and single-cell architectural surgery (scArches) [37] to integrate and annotate query data with reference data.
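The filter, normalize, and scale steps listed above can be illustrated with a minimal, standard-library sketch. This is not Panpipes' implementation: the function names, the QC keys (total_counts, pct_mt), the thresholds, and the use of the population standard deviation for scaling are all assumptions made for illustration, and the optional downsampling step is omitted.

```python
import math
from statistics import mean, pstdev


def filter_cells(qc, min_counts, max_pct_mt):
    """Return indices of cells passing thresholds on precomputed QC metrics."""
    return [i for i, m in enumerate(qc)
            if m["total_counts"] >= min_counts and m["pct_mt"] <= max_pct_mt]


def normalize_log1p(cell_counts, target_sum=10_000):
    """Rescale one cell's counts to a common total, then apply log(1 + x)."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0 for _ in cell_counts]
    return [math.log1p(x * target_sum / total) for x in cell_counts]


def scale_feature(values):
    """Center one feature across cells and scale it to unit variance."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


# Toy input (hypothetical): three cells x two genes, plus QC metrics per cell.
counts = [[1, 3], [2, 2], [0, 1]]
qc = [{"total_counts": 4, "pct_mt": 5.0},
      {"total_counts": 4, "pct_mt": 2.0},
      {"total_counts": 1, "pct_mt": 60.0}]

kept = filter_cells(qc, min_counts=2, max_pct_mt=20.0)                  # (i) filter
normalized = [normalize_log1p(counts[i], target_sum=4) for i in kept]   # (iii) normalize
scaled_gene0 = scale_feature([row[0] for row in normalized])            # (iv) scale
```

Keeping each step as a separate function mirrors the idea that different options can be chosen per modality; in practice these steps would be delegated to the underlying single-cell libraries.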

Finally, “Visualization” is included as a separate overarching workflow as the outputs from any of the other workflows can be used as its inputs. It produces a range of plots, combining the experiment-specific metadata and the analysis outputs from the other workflows, to aid the inspection and interpretation of results. Users who have run multiple methods and parameter choices in parallel can evaluate their results at each step and can export the final objects to cellxgene [38] for user-friendly and interactive exploration.

For spatial transcriptomics analyses, Panpipes’ “Ingestion” workflow leverages Scanpy and squidpy [39] functionalities to read data generated through the 10x Genomics’ Visium or Vizgen’s MERSCOPE platforms. After “Ingestion,” the “Preprocessing” workflow is used to (i) filter the data, (ii) visualize and evaluate QC metrics post-filtering, and (iii) normalize and (iv) scale the data. Subsequently, on each processed sample, “Clustering” is performed. For the 10x Genomics’ Visium data, whose resolution is dictated by the number of cells found over ‘spots’ containing spatially barcoded capture probes, a “Deconvolution” workflow can also be run after “Preprocessing,” which enables leveraging of single-cell references to computationally achieve a higher resolution of cell type identification within spots.

Evaluation of single-cell multiomic data quality with Panpipes

To enable data QC and thus the identification and retention of high-quality cells, Panpipes generates a battery of metrics standard to the evaluation of scRNA-seq data [40, 41], such as the total number of unique molecular identifiers (UMI) per cell-barcode and the percentage of UMIs assigned to mitochondrial transcripts. In addition, users can provide custom gene lists to score specific cellular phenotypes. This can facilitate the retention of cell types with more atypical properties, such as plasma cells or neutrophils [42], which might otherwise be excluded. It also renders Panpipes compatible with any genome, thereby enabling analyses of cells from other species.
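As a small illustration of the two metrics named above, the following sketch computes the total UMI count per cell barcode and the percentage of UMIs assigned to a user-supplied gene list. The function name, the output keys, and the toy counts are hypothetical; selecting mitochondrial genes by the "MT-" prefix is an assumption that holds for human gene symbols only, which is why the gene list is passed in explicitly.

```python
def qc_metrics(counts, gene_names, gene_list):
    """Per-cell total UMI count and percentage of UMIs falling in gene_list.

    counts: one list of UMI counts per cell, ordered like gene_names.
    """
    idx = [i for i, g in enumerate(gene_names) if g in gene_list]
    metrics = []
    for cell in counts:
        total = sum(cell)
        in_list = sum(cell[i] for i in idx)
        pct = 100.0 * in_list / total if total else 0.0
        metrics.append({"total_counts": total, "pct_in_list": pct})
    return metrics


# Toy data (hypothetical): two cells x four genes.
genes = ["MT-CO1", "MT-ND1", "CD3E", "MS4A1"]
mito = {g for g in genes if g.startswith("MT-")}  # human naming convention
counts = [[5, 5, 80, 10], [30, 20, 25, 25]]

metrics = qc_metrics(counts, genes, mito)
```

Because the gene list is an argument rather than a hard-coded pattern, the same function can score any custom list, for example one written for another species.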

In addition to RNA-associated metrics, Panpipes produces a number of QC visualizations which are specific to ATAC-seq assays (ATAC) or ADT assays (PROT), or are related to the joint QC of multiple modalities (Fig. 2). For ATAC, the fragment and barcode metrics are incorporated in the data object and the nucleosome signal is computed. With the inclusion of a peak annotation file which maps chromosome coordinates to gene IDs, transcription start site enrichment is also calculated. For PROT, comparing the UMI counts in the cell-containing foreground against the empty droplets in the background can give an indication of whether antibodies are binding specifically, or contributing to ambient contamination in the dataset (Fig. 2A). In the PROT assay, the level of background staining in empty droplets for each ADT correlates with the signal strength of that ADT after normalization, and is thus likely to influence downstream analysis. Panpipes provides two PROT normalization options, centered log-ratio transformation (CLR) [7] and denoised and scaled by background (dsb) normalization [43]. CLR generates a natural log ratio of the count for a protein in a cell relative to other cells, hence enabling improved distinction of cell populations, but without endeavoring to account for background or technical noise [7]. The dsb normalization aims to correct for ambient ADTs and unspecific binding of antibodies to cells [43]. Panpipes allows for the normalized PROT expression profile to be visually inspected for individual ADTs via histograms for each normalization method (Fig. 2B, C), while scatter plots facilitate head-to-head comparisons of the methods on a per ADT basis (Fig. 2D). In addition, Panpipes QC can profile the ambient fractions of RNA and PROT expression data, to provide insight into the variation of the background relative to the foreground across samples for both modalities (Fig. 2E).
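To make the CLR description concrete, here is a minimal sketch of a centered log-ratio transform for one protein across cells. It assumes a pseudocount of 1 (log(1 + x)) to handle zero counts and centers on the mean log value across cells, which is equivalent to dividing by the geometric mean of (1 + count). Published implementations differ in such details, so this should be read as one plausible variant rather than the transform Panpipes applies; dsb is not sketched because it additionally requires background droplets.

```python
import math


def clr_across_cells(protein_counts):
    """Centered log-ratio of one protein's counts across cells.

    Each value becomes log(1 + x) minus the mean of log(1 + x) over all
    cells, i.e. the log ratio to the geometric mean of (1 + count).
    """
    logs = [math.log1p(x) for x in protein_counts]
    center = sum(logs) / len(logs)
    return [v - center for v in logs]


# Toy counts (hypothetical) for one ADT in three cells.
clr_values = clr_across_cells([0, 9, 99])
```

By construction the transformed values sum to zero for each protein, and no information about empty droplets enters the calculation, which is why CLR does not account for ambient signal.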

Fig. 2

Cell-surface protein QC metric visualizations generated by Panpipes. A Expression (log2 of raw counts) of cell-surface protein markers in cells (foreground, red) versus empty droplets (background, blue). Dots represent mean expression and lines show the standard error of the mean. B Distribution of cell-surface protein marker expression after CLR normalization. C Distribution of cell-surface protein marker expression after dsb normalization. D Joint distribution plot of cell-surface protein marker expression normalized by CLR (x-axis) versus dsb (y-axis). E Joint distribution plot of RNA (x-axis) versus the cell-surface protein (encoded by the RNA; y-axis) in cells (red) versus empty droplets (blue). Expression of the RNA and protein is plotted as the log of the total counts (per cell barcode)

The capacity to extensively inspect QC metrics for all modalities present in a single-cell dataset is critical for subsequent clustering, annotation and downstream analyses [44], and can help inform decision-making with respect to multimodal integration.

Multimodal integration for unified cellular representation

Following QC, Panpipes offers a parallelized framework to aid the user in choosing a reduced dimensionality representation of a given dataset based on a unimodal or multimodal integration, with the option to apply batch correction to individual modalities or in a joint fashion.

To mimic a typical analysis scenario in which a user may wish to apply different processing choices simultaneously, we demonstrate Panpipes’ functionality on a trimodal dataset (TEA-seq) [45, 46] of three samples with joint single-cell measurements of RNA, PROT, and ATAC. The workflow enables each individual modality to be projected onto a latent representation with or without a selection of batch correction methods [47, 48] (e.g. BBKNN for RNA, Harmony for PROT, and BBKNN and Harmony for ATAC as shown in Additional file 1: Fig. S1). The batch correction methods offered for the different modalities have been selected based on underlying statistical assumptions and published benchmarks [20]. Multimodal batch-aware integration methods can also be employed for two or more modalities, including MultiVI (used for RNA + ATAC with this dataset) [49], totalVI (for RNA + PROT) [50], and weighted nearest neighbor (WNN; for ATAC + RNA + PROT) [10] (Fig. 3A–E). MultiVI and totalVI perform multimodal integration while accounting for batch covariates, whereas WNN affords the highest processing flexibility as it can perform multimodal integration after each modality has been batch corrected separately. Users are provided with a choice of unimodal and multimodal integration tools as each integration approach may answer a different biological question, depending on the dataset. The variation in the performance of these tools for batch merging can be visualized through UMAP representations (Fig. 3A–E) and is also evaluated by the calculation of Local Inverse Simpson’s Index (LISI) scores (Fig. 3F) [48].
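The intuition behind a LISI score can be shown with a simplified sketch: for one cell, take the batch labels of its nearest neighbors in the embedding and compute the inverse Simpson's index, which ranges from 1 (all neighbors from one batch) to the number of batches (perfect mixing). The published LISI weights neighbors with a perplexity-based kernel; the unweighted k-nearest-neighbor version below, along with its function names and toy coordinates, is an assumption made to keep the example short.

```python
import math
from collections import Counter


def inverse_simpson(labels):
    """Inverse Simpson's index: 1 / sum of squared label proportions."""
    n = len(labels)
    return 1.0 / sum((c / n) ** 2 for c in Counter(labels).values())


def neighbor_labels(points, labels, i, k):
    """Labels of the k nearest neighbors (Euclidean) of point i, excluding i."""
    ranked = sorted((math.dist(points[i], p), j)
                    for j, p in enumerate(points) if j != i)
    return [labels[j] for _, j in ranked[:k]]


# Toy embedding (hypothetical): two well-separated, unmixed batches.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
batches = ["b1", "b1", "b1", "b2", "b2", "b2"]

unmixed = inverse_simpson(neighbor_labels(points, batches, 0, k=2))
# A neighborhood drawing equally from three batches approaches the maximum.
mixed = inverse_simpson(["b1", "b2", "b3", "b1", "b2", "b3"])
```

Computing this value for every cell and plotting the distribution per integration method gives a picture comparable in spirit to the LISI panel described above.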

Fig. 3

Panpipes integration workflow enables evaluation of multimodal integration and batch correction. UMAPs showing individual batches (batch 1, blue; batch 2, ochre; batch 3, pink) after RNA and ATAC modality integration using MultiVI (A), RNA and ADT cell-surface protein (PROT) integration using totalVI (B), ATAC and PROT integration using WNN with no batch correction (C), RNA, ATAC and PROT integration with no batch correction (D) and with batch correction (E). F LISI score distribution for integrations depicted in A-E. G UMAP of Leiden clustering after trimodal WNN integration with batch correction. H ASW metric values for different clustering labels and cell embeddings. I Graph connectivity metric values for different clustering labels and cell embeddings. J ARI metric values for cluster matching with different clustering choices. K Per-cluster concordance of clustering choices relative to trimodal WNN Leiden clustering

After integration, each unimodal or multimodal embedding is clustered using the “Clustering” workflow (Fig. 3G), and further integration evaluation is carried out using a selection of single-cell integration benchmarking (scIB) metrics [20] and custom visualizations (Fig. 3H–K). Average silhouette width (ASW) and graph connectivity [51] are applied on an integrated object with a choice of clustering labels and cell embeddings (Fig. 3H, I). These metrics estimate how well similar cells cluster together by considering intra- and inter-cluster similarities and local connectivities, respectively, with higher scores signifying better performance. Since each clustering returns a cell partitioning from the embedding it was generated on, it may be anticipated that any single tested clustering would have the highest score for its original embedding; however, this is not always the case. For example, with the TEA-seq dataset, comparing the RNA clustering with the totalVI embedding and the PROT clustering with the WNN embedding yielded similarly high or higher ASW and graph connectivity scores (Fig. 3H, I). As specific multimodal integration metrics have not been developed yet, this demonstrates how Panpipes’ repurposed use of scIB metrics in the multimodal scenario is instrumental in identifying where individual modalities may have uneven contributions to the final cell classification.
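To show how a single set of cluster labels can be scored against different embeddings, the sketch below implements the average silhouette width from its textbook definition: for each cell, a is the mean distance to the other cells in its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b). It assumes Euclidean distances and at least two clusters, returns raw values in [-1, 1], and is not the scIB implementation; the coordinates are toy values.

```python
import math


def average_silhouette_width(points, labels):
    """Mean silhouette score of all points (requires at least two clusters)."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            d = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] == c and j != i]
            if d:
                mean_dist[c] = sum(d) / len(d)
        own = labels[i]
        if own not in mean_dist:          # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = mean_dist[own]
        b = min(v for c, v in mean_dist.items() if c != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy embedding (hypothetical) with two obvious groups.
embedding = [(0, 0), (0, 1), (10, 0), (10, 1)]
matching = average_silhouette_width(embedding, [0, 0, 1, 1])     # labels fit embedding
mismatched = average_silhouette_width(embedding, [0, 1, 0, 1])   # labels cut across it
```

Running the same labels against several embeddings, or several labelings against one embedding, yields the kind of label-by-embedding comparison described in the paragraph above.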

To further assess the concordance of clustering choices calculated from different modalities, Panpipes generates a cluster matching metric, the Adjusted Rand Index (ARI) [52], for global concordance evaluation (Fig. 3J). Panpipes also implements another clustering concordance visualization on a per-cluster basis, whereby one clustering choice is selected as the reference (in the example, multimodal WNN clustering; Fig. 3K). For each of the clusters identified in the reference, the extent to which alternative approaches provide at least one cluster that groups together the same cells as the reference is then scored. Higher scores indicate that a high percentage of the cells in the reference cluster are also grouped together in the alternative cell partitioning. With the TEA-seq dataset for instance, WNN cluster 6 is entirely recapitulated by all the alternative clustering choices, while cluster 9 is poorly represented by the alternatives (all scores < 0.60; Fig. 3K).
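Both concordance measures can be sketched with the standard library. The ARI below follows the usual pair-counting formula. The per-cluster score is one plausible reading of the description above, namely the largest fraction of a reference cluster's cells that fall into a single alternative cluster; the exact scoring used by Panpipes may differ, and the function names and labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two labelings of the same cells (n >= 2)."""
    n = len(a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:             # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def per_cluster_concordance(reference, alternative):
    """For each reference cluster, the largest share of its cells that one
    alternative cluster groups together."""
    scores = {}
    for ref_cluster, size in Counter(reference).items():
        overlap = Counter(alt for ref, alt in zip(reference, alternative)
                          if ref == ref_cluster)
        scores[ref_cluster] = max(overlap.values()) / size
    return scores


# Toy labelings (hypothetical): cluster 0 is recovered, cluster 1 is split.
reference = [0, 0, 0, 0, 1, 1, 1, 1]
alternative = ["a", "a", "a", "a", "b", "b", "c", "c"]

ari = adjusted_rand_index(reference, alternative)
concordance = per_cluster_concordance(reference, alternative)
```

The ARI summarizes agreement in a single number, while the per-cluster dictionary shows which reference clusters drive any disagreement.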

Thus, Panpipes provides the user with the capacity to efficiently run and thoroughly evaluate the correction of batch effects and the integration of individual and multiple modalities to facilitate the selection of the optimal integration method prior to downstream analyses.

Reference mapping with Panpipes

As large-scale single-cell multiomic datasets become increasingly available [1, 53], users will wish to take advantage of such resources to expedite cell annotation of their own data and aid biological interpretation. However, learning from reference datasets can pose an analytical challenge due to batch effects, computational resource limitations, and data access restrictions [26]. Panpipes can aid in building unimodal or multimodal references (“Reference Mapping”) and enables the user to query multiple references simultaneously using scArches [16, 37, 54]. For example, a user can perform filtering of low-quality cells on the input dataset (via “QC” and “Preprocessing”) and can then immediately run the “Reference Mapping” workflow without proceeding with the “Integration” and “Clustering” workflows. Alternatively, users can annotate their query dataset independently, then project it onto a reference and evaluate concordance with the reference labels. The concordance of the transferred labels with the original labels is evaluated in the query via a selection of scIB metrics. Furthermore, users can leverage Panpipes to query the same dataset on multiple references, allowing for comparison between them.

To demonstrate the “Reference Mapping” workflow, we performed Q2R and LT using as the query a unimodal scRNA-seq peripheral blood mononuclear cell (PBMC) dataset [55, 56] and three PBMC references varying in size and in the granularity of cell type labels. These references include one RNA-specific unimodal dataset (PBMC_R1) [57] and two multimodal PBMC datasets (PBMC_R2 and PBMC_R3) [10, 58]. Single-cell Annotation using Variational Inference (scANVI) [59] and totalVI were employed for the uni- and multimodal references, respectively (Fig. 4).

Fig. 4

Panpipes reference mapping allows evaluation of multiple reference datasets. A UMAP of unimodal Q2R integration between the query dataset (orange) and the PBMC_R1 reference dataset (blue) using scANVI. B UMAP of multimodal Q2R integration between the query dataset (orange) and the PBMC_R2 reference dataset (blue) using totalVI. C UMAP of multimodal Q2R integration between the query dataset (orange) and the PBMC_R3 reference dataset (blue) using totalVI. D Query to PBMC_R1 label transfer concordance with predicted (reference-derived) labels on the x-axis and observed (query-derived) labels on the y-axis. E Query to PBMC_R2 label transfer concordance with predicted (reference-derived) labels on the x-axis and observed (query-derived) labels on the y-axis. F Query to PBMC_R3 label transfer concordance with predicted (reference-derived) labels on the x-axis and observed (query-derived) labels on the y-axis. G Label conservation scoring using scIB metrics for each Q2R integration

Q2R integration is visually assessed by UMAP plots (Fig. 4A–C), while LT performance is evaluated by estimating the concordance of predicted and observed labels (Fig. 4D–F). Notably, for the datasets tested, there was variation in the cells present in the query relative to the reference data, and this was detectable both from the incomplete integrations seen in the UMAPs generated on the query and reference latent embeddings and from the imperfect LT concordance (Fig. 4A–F). Interestingly, with reference dataset PBMC_R3 [10, 58], a query cluster annotated as basophils (“BASO”) received three different proliferating lymphoid cell reference labels, suggesting that the outputs generated by Panpipes can help to identify annotation inconsistencies for further investigation and thus obtain an optimal annotation. Finally, label conservation is scored using metrics that assess local neighborhoods (including graph cLISI and graph connectivity), global cluster matching (including ARI and normalized mutual information (NMI) [60]), and relative distances as determined by cell-type ASW (Fig. 4G).
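The label transfer concordance described above amounts to a cross-tabulation of observed (query-derived) against predicted (reference-derived) labels. The sketch below row-normalizes that table so each observed label shows the fraction of its cells assigned to each predicted label. The function name and the toy labels are hypothetical and do not reproduce the results in Fig. 4; they only illustrate how an observed label spread over several predicted labels would stand out.

```python
from collections import Counter, defaultdict


def label_transfer_table(observed, predicted):
    """Row-normalized cross-tabulation: for each observed label, the fraction
    of its cells that received each predicted label."""
    counts = defaultdict(Counter)
    for obs, pred in zip(observed, predicted):
        counts[obs][pred] += 1
    return {obs: {pred: n / sum(row.values()) for pred, n in row.items()}
            for obs, row in counts.items()}


# Toy labels (hypothetical): "B" transfers cleanly, "T" is split in two,
# and "X" is scattered across three predicted labels.
observed = ["B", "B", "T", "T", "X", "X", "X"]
predicted = ["B", "B", "T", "NK", "P1", "P2", "P3"]

table = label_transfer_table(observed, predicted)
```

A row dominated by one entry indicates agreement between the query annotation and the reference, whereas a row with several comparable entries flags a population that may warrant closer inspection.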

The capacity of Panpipes to employ and compare multiple reference datasets will be critical as single-cell omics atlases continue to expand in scale and complexity and users will likely want to draw upon all resources available to arrive at a high-confidence annotation of their own data.

Orchestrating spatial transcriptomic analysis

The rapid evolution of spatial transcriptomics technologies allows us to capture gene expression within the context of tissue architecture [61,62,63]. Similar to the Panpipes single-cell workflows, the spatial transcriptomics workflows also include “Ingestion,” “Preprocessing,” and “Clustering,” and enable the parallel analysis of data derived from multiple spatial transcriptomics slides. Critically for the 10x Genomics Visium ‘spot’-based approach, whereby the data for each individual RNA capture area (‘spot’) will represent a mixture of transcriptomes from all the cells found in the area, a “Deconvolution” workflow is provided. This is based on the use of two methods that effectively resolve the transcriptomes from each capture area into finer cell types: the cell2location Bayesian model [64], and the Tangram deep-learning framework [65]. The “Deconvolution” workflow can utilize external single-cell datasets, but also seamlessly integrates with the single-cell multiomics workflows to utilize single-cell data generated subsequent to the Panpipes single-cell “Integration” and “Clustering” (Fig. 1).

Benchmarking

To demonstrate Panpipes’ performance, we ran the “Integration” workflow on six datasets of different sizes, representing the full data and subsamples of a PBMC dataset [10, 58] and the TAURUS study gut dataset [66], assessing runtime (Fig. 5A) and resource usage (Fig. 5B, C). Since Panpipes implements each integration method as an independent component, the main advantage of our pipeline is the management of data flow and the possibility to choose which method to run in a parallel fashion, allowing the independent processing of modalities across multiple methods (Fig. 5).

Fig. 5

Panpipes time and resource usage benchmarking. A Total runtimes (bar plot) and run times by integration method (dot plot) when using the “Integration” workflow on six different datasets representing the full data and subsamples of a PBMC CITE-seq dataset and the TAURUS study gut scRNA-seq dataset. B Resource usage as denoted by the number of central processing unit (CPU) cores or graphics processing units (GPUs) utilized for the six datasets. C Memory usage as denoted by the number of GB utilized for the six datasets

Furthermore, Panpipes’ outputs provide a biological perspective for evaluation and interpretation of the data in any biological context. For example, taking the PBMC data [10, 58] used for the time and resource benchmarking, for which cell type annotations are available, we ran the “Integration” workflow in a multimodal, batch-aware fashion, with the aim of assessing which of the multimodal integration methods included in Panpipes was able to recapitulate the ground truth annotation. Assessment of integrations generated by MOFA, totalVI, and WNN by the UMAP plot distribution of the cell type labels (Additional file 1: Fig. S2) shows a good separation of the cell types with every method, although with MOFA two batches of each cell type could be observed, suggesting poor integration. With totalVI and WNN these batches were not discernible, and WNN (with Harmony correction of RNA and PROT) resulted in the clearest separation of, for example, the CD4+ and CD8+ T and natural killer (NK) cell subsets (Additional file 1: Fig. S2).

Discussion

We have developed Panpipes, a highly flexible pipeline to perform finely-tuned analyses on both single modality and multiomic single-cell datasets. Panpipes is based on scverse [12], which includes the most popular Python-based single-cell packages, and thus readily scales up to millions of cells. Panpipes allows the ingestion, quality checking, filtering, integration, clustering, reference mapping, and visualization of complex single-cell datasets. Our workflow can process any single-cell dataset containing RNA, cell-surface proteins, ATAC, and immune repertoire modalities, as well as spatial transcriptomics data generated through the 10x Genomics’ Visium or Vizgen’s MERSCOPE platforms.

Panpipes is designed to help analysts run a comprehensive evaluation of their single-cell data. The pipeline does not stop at basic preprocessing but instead constitutes a flexible framework to explore more complex analytical choices including unimodal and multimodal integration with or without batch correction. Panpipes implements nine different integration methods, helping the user select the best parameterization for their specific analytical needs. The reference mapping functionality offered by Panpipes can expedite cell annotation and provides a powerful setting to evaluate the utility of multiple reference datasets. This may be particularly important given that individual references may not perfectly recapitulate the biological or biomedical context of the query dataset. Moreover, given the agility of Panpipes, the pipeline may be well suited to benchmarking studies, in particular in relation to multimodal integration and reference mapping, which are currently lacking in the field of single-cell multiomics.

We have developed Panpipes with a wide range of users in mind. The pipeline is publicly available with extensive documentation and tutorials which allow users to efficiently go from raw data to clustering and annotation in a semi-automated fashion—regardless of whether they are using unimodal, multimodal, or spatial transcriptomic datasets. Being Python-based, Panpipes could represent an easy entry point for users with a Machine Learning background who have more limited single-cell analysis expertise. Our pipeline may also appeal to core computational facilities in academia or in the pharmaceutical industry that need a quick and flexible single-cell pipeline that readily allows for the assessment of common problems such as ambient molecular contamination and batch effects, and that facilitates the utilization of external single-cell datasets to help inform target identification and evaluation.

Panpipes is also compliant with the FAIR principles [67], in line with the requirements of many funding agencies. The source code is easily findable and accessible on GitHub (https://github.com/DendrouLab/panpipes), where we implement GitHub Actions to automatically build and test the code as we develop Panpipes, in compliance with software engineering best practices [35, 36]. Panpipes is also available as a PyPI package, and further documentation and tutorials (https://panpipes-pipelines.readthedocs.io/en/latest/) are provided to encourage users to adapt the pipeline to their own needs.

Conclusions

The last few years have seen a continuous and rapid development of multimodal protocols that scale to millions of cells and thousands of analytes in the single-cell omics field. The collection of analytical methods that deal with the complexity of large single-cell datasets is likely to increase, with a marked interest in methods that allow integration of multiomic assays [68]. Given this fast evolution of the single-cell and also the spatial omics analysis landscape, Panpipes is in continuous development. Panpipes is modular by design to enable its extension to incorporate new methods that can deal with further omics modalities in the future. These could include single-cell genomic DNA sequencing and epigenome profiling beyond chromatin accessibility, and other technologies such as flow cytometry, mass spectrometry, and hyperplexed imaging. Panpipes provides a platform for both customization and reproducibility of single-cell multiomic and spatial transcriptomic analyses, ensuring a stable foundation for the consistency and continuity of scientific discovery.

Methods

Implementation details

Panpipes comprises workflows implemented using the CGAT-core framework [33]. CGAT-core automates the submission and parallelization of jobs across HPC clusters. Flexible environment control is implemented using Conda. To interact with the pipeline, the user simply edits a YAML file for each workflow to customize the parameters for their own analyses. Finer details of these options are listed below. We provide documentation on each workflow and how to run it at https://panpipes-pipelines.readthedocs.io.

Ingestion

Data from various sources is ingested to be combined and formatted as a MuData object. Specific QC metrics are computed for each modality, following guidelines defined in single-cell best practices [8]. Scrublet is used to compute doublet scores [69]. Cells are also scored based on custom gene lists (e.g. mitochondrial and ribosomal gene proportions). Gene lists compatible with human and mouse are provided and users can readily input features for alternative species or define their own QC metrics based on custom gene lists.
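The gene-list scoring described above can be illustrated with a minimal, dependency-free sketch. It is not the Panpipes implementation: the function name, the toy count matrix, and the use of the "MT-" prefix to pick mitochondrial genes are assumptions made purely for illustration.

```python
def gene_list_percentages(counts, gene_names, gene_list):
    """Return, per cell, the percentage of counts that fall in `gene_list`.

    counts     : list of per-cell count vectors (one list of numbers per cell)
    gene_names : gene symbols, in the same order as the count columns
    gene_list  : the custom gene set to score (hypothetical example below)
    """
    wanted = set(gene_list)
    columns = [i for i, gene in enumerate(gene_names) if gene in wanted]
    percentages = []
    for cell in counts:
        total = sum(cell)
        in_list = sum(cell[i] for i in columns)
        # Empty cells get 0.0 rather than a division-by-zero error.
        percentages.append(100.0 * in_list / total if total else 0.0)
    return percentages


# Toy data: three cells, four genes (values are made up for illustration).
gene_names = ["MT-CO1", "MT-ND1", "ACTB", "CD3E"]
counts = [
    [10, 10, 60, 20],  # 20 of 100 counts are mitochondrial
    [0, 0, 5, 5],      # no mitochondrial counts
    [0, 0, 0, 0],      # empty cell
]
mito_genes = [g for g in gene_names if g.startswith("MT-")]
pct_mito = gene_list_percentages(counts, gene_names, mito_genes)
```

The same function can score any other custom list (for example ribosomal genes) by passing a different `gene_list`.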

Preprocessing

The thresholds determined by the QC pipeline outputs are included as parameters in the YAML file, and the data are filtered accordingly. In the “Preprocessing” workflow, the user is able to specify custom filtering options on any set of metrics computed in the QC workflow. Next, for the RNA data, the data are normalized and scaled, and the highly variable genes are computed using Scanpy functionalities. In parallel, PROT data are normalized using either CLR [7] or dsb [43] using muon functionalities and functions implemented ad hoc. For example, users can specify which margin to normalize the PROT data to, namely by cell or within the features’ distribution. ATAC data are log normalized or normalized by term frequency-inverse document frequency, following the options offered by the muon package.
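To make the choice of normalization margin for PROT data concrete, the following sketch applies a centred log-ratio (CLR) transform either per cell or per feature. It uses a log1p-based CLR variant found in some single-cell toolkits; whether this matches the exact formula used by muon or Panpipes is an assumption, and the function names, the `margin` argument, and the toy matrix are hypothetical.

```python
import math


def clr_vector(values):
    """Log1p-based centred log-ratio of one vector of counts."""
    if not values:
        return []
    # Geometric-mean-like scale: exp of the summed log1p of the positive
    # entries, divided by the full vector length.
    log_sum = sum(math.log1p(v) for v in values if v > 0)
    scale = math.exp(log_sum / len(values))
    return [math.log1p(v / scale) for v in values]


def clr_normalise(matrix, margin="feature"):
    """Apply CLR to a cells-by-features matrix given as a list of rows.

    margin="cell"    : normalise each row (across the features of one cell)
    margin="feature" : normalise each column (across cells for one feature)
    """
    if margin == "cell":
        return [clr_vector(row) for row in matrix]
    if margin == "feature":
        columns = [clr_vector(list(col)) for col in zip(*matrix)]
        return [list(row) for row in zip(*columns)]
    raise ValueError("margin must be 'cell' or 'feature'")


# Toy ADT counts: two cells (rows) by three proteins (columns).
adt = [[10, 0, 5], [200, 3, 0]]
by_cell = clr_normalise(adt, margin="cell")
by_feature = clr_normalise(adt, margin="feature")
```

Normalizing by feature is equivalent to transposing the matrix, normalizing by cell, and transposing back, which is why a single helper can serve both margins.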

Integration

“Integration” implements a range of algorithms to batch correct individual modalities and to combine multiple modalities in a low-dimensional space. For each unimodal analysis, the dimensionality reduction of choice (PCA and/or Latent Semantic Indexing (LSI) for ATAC) is applied and the data are batch corrected based on user-defined parameters. Four unimodal batch correction algorithms are included in Panpipes: BBKNN [47], Harmony [48], Scanorama [70], and scVI [16, 54]. For multimodal integration, Panpipes supports both modality-specific, batch-aware methods, such as MultiVI [49] for ATAC and gene expression and totalVI [50] for PROT and gene expression, and modality-agnostic methods, such as MOFA [71] and WNN [10]. The results of these integrations are compared with the aid of scIB metrics [20], inspection of LISI scores [48], and visual inspection of UMAP plots.
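As an illustration of the idea behind LISI-style batch-mixing scores, the sketch below computes the inverse Simpson's index of batch labels among each cell's nearest neighbours. This is a deliberately simplified stand-in: the published LISI weights neighbours with a perplexity-based kernel, which is omitted here, and the function names, the brute-force neighbour search, and the toy embedding are assumptions.

```python
import math
from collections import Counter


def inverse_simpson(labels):
    """Inverse Simpson's index: 1 / sum(p_i ** 2) over label frequencies."""
    counts = Counter(labels)
    n = len(labels)
    return 1.0 / sum((c / n) ** 2 for c in counts.values())


def simple_lisi(embedding, batches, k=3):
    """Per-cell inverse Simpson's index of batch labels among k neighbours.

    embedding : list of coordinate tuples (e.g. rows of a latent space)
    batches   : batch label of each cell, in the same order as `embedding`
    """
    scores = []
    for i, point in enumerate(embedding):
        # Brute-force Euclidean distances to every other cell.
        dists = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(embedding)
            if j != i
        )
        neighbours = [batches[j] for _, j in dists[:k]]
        scores.append(inverse_simpson(neighbours))
    return scores


# Toy embedding in which the two batches occupy separate regions.
embedding = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
batches = ["a", "a", "a", "b", "b", "b"]
separated_scores = simple_lisi(embedding, batches, k=2)
```

A score near 1 means a neighbourhood is dominated by a single batch, whereas a score approaching the number of batches indicates good mixing.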

Clustering

“Clustering” implements both Leiden and Louvain clustering of a connectivity graph constructed on a reduced dimension computed in the “Integration” workflow. The reduced dimension data can be a single-modality representation, e.g., PCA or Harmony components, or a multi-modality representation, e.g., MultiVI or totalVI reduced dimension. The clusters are then visualized on a UMAP computed from the same dimensionality reduction, or the user has the option to project clusters onto any of the computed UMAPs from alternative dimensionality reductions. The user can compute clustering for a wide range of resolutions, to quickly assess the cell type representation within their dataset. Cluster assignments across different resolutions are compared using clustree [72]. Finally, the workflow calculates the top multimodal markers for each computed clustering, offering a choice of different statistical tests for the scoring of the features based on Scanpy’s rank_genes_groups().
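The comparison of cluster assignments across resolutions can be sketched as a simple cross-tabulation, in the spirit of a clustering tree. This is not clustree itself, nor the Panpipes code: the function name, the returned field names, and the toy labels are hypothetical.

```python
from collections import Counter


def resolution_edges(low_res, high_res):
    """Tabulate how clusters at one resolution split into clusters at the next.

    low_res, high_res : per-cell cluster labels at two resolutions, given in
                        the same cell order.
    Returns a dict mapping (low_cluster, high_cluster) to the number of shared
    cells and the fraction of the higher-resolution cluster they make up.
    """
    if len(low_res) != len(high_res):
        raise ValueError("both label lists must describe the same cells")
    shared = Counter(zip(low_res, high_res))
    high_sizes = Counter(high_res)
    return {
        (low, high): {"n_cells": n, "in_prop": n / high_sizes[high]}
        for (low, high), n in shared.items()
    }


# Toy labels for six cells at a lower and a higher clustering resolution.
low = ["0", "0", "0", "1", "1", "1"]
high = ["0", "0", "2", "1", "1", "2"]
edges = resolution_edges(low, high)
```

Here the higher-resolution cluster "2" draws half of its cells from each lower-resolution cluster, the kind of unstable split that such a comparison is meant to expose.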

Reference mapping

The “Reference mapping” workflow implements query-to-reference mapping (Q2R) and label transfer (LT) for models supported by scvi-tools and scArches, namely scVI, scANVI, and totalVI. The code is implemented with the scvi-tools package. Query and reference data can be supplied as individual AnnData [13] or MuData [14] objects, and reference models can be generated with any of the aforementioned methods. The user is required to specify a minimal set of mandatory parameters and can specify additional covariates and define custom training parameters by customizing the pipeline.yml.

Visualization

The “Visualization” workflow is implemented to aid inspection and interpretation of results. The visualization workflow uses matplotlib, seaborn, and ggplot to generate boxplots, histograms, scatterplots, and dimensionality reduction plots (such as PCA or UMAPs), using any combination of variables across the modalities, and experimental metadata. The “Visualization” workflow is also used to export the data objects to cellxgene [38] for interactive visualization. Importantly, this cellxgene object contains UMAP plots from multiple modalities so that the user can directly review gene, protein, peak expression, and repertoire information on the same set of UMAPs.

Ingestion (spatial)

The “Ingestion_spatial” workflow is implemented to ingest data from various spatial transcriptomics platforms such as 10x Genomics’ Visium or Vizgen’s MERSCOPE. Multiple slides can be processed in parallel. Similar to the single-cell “Ingestion” workflow, the spatial data are quality controlled following best practices recommendations. This workflow produces a MuData object with a “spatial” layer and the newly generated QC values.

Preprocessing and Clustering (spatial)

Similar to the single-cell workflow, “Preprocessing” for the spatial data follows the “Ingestion” workflow to allow filtering and processing of the spatial data. Custom QC parameter thresholds are included in the YAML file, and the data are filtered accordingly. Next, the data are normalized and scaled, and the highly variable genes are computed using Scanpy functionalities. Finally, dimensionality reduction is run and saved to the MuData “spatial” object.
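The normalization step can be pictured with a small sketch of library-size scaling followed by a log1p transform, which is conceptually what count normalization followed by log transformation achieves. The target sum of 10,000, the function name, and the toy counts are assumptions; scaling and highly variable gene selection are not shown.

```python
import math


def log_normalise(counts, target_sum=10_000):
    """Scale each cell or spot (row) to `target_sum` counts, then apply log1p."""
    normalised = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # Leave empty rows as zeros instead of dividing by zero.
            normalised.append([0.0 for _ in cell])
            continue
        factor = target_sum / total
        normalised.append([math.log1p(value * factor) for value in cell])
    return normalised


# Two rows with the same composition but different sequencing depth.
toy_counts = [[1, 3], [2, 6], [0, 0]]
log_norm = log_normalise(toy_counts)
```

Because the first two rows differ only in depth, they end up with identical normalized values, which is the purpose of library-size scaling.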

The output of the spatial “Preprocessing” workflow can be run through the spatial “Clustering” workflow, which is as described for the single-cell multiomics workflow but with additional parameters for spatial transcriptomic data.

Deconvolution

The “Deconvolution” workflow allows the inference of cell type composition of ‘spot’-based spatial transcriptomic data, using a single-cell reference. “Deconvolution” implements cell2location [64] and Tangram [65] and can be run on multiple individual slides with the same single-cell reference.

Processing of data for figures

Uni- and multimodal processing of the trimodal TEA-seq data

Data for the trimodal TEA-seq dataset was obtained from [45, 46]. Briefly, the three raw datasets for each individual modality were each concatenated into a unimodal AnnData object. ATAC fragment indexes were regenerated using Tabix [73], and the peaks of the three batches were merged following the signac tutorial [74]. The three objects were then partitioned to the cell barcodes in common across the modalities and fed to the “QC” pipeline as individual AnnData objects, which produced a unified MuData container for the three modalities. QC and filtering were performed independently on each modality and the intersection of the passing QC cells across the three modalities was chosen for further analysis. Cells were integrated using unimodal or multimodal integration methods as described (Fig. 4 and Additional file 1: Fig. S1).
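Restricting the modalities to their shared cell barcodes, and later to the cells passing QC in every modality, amounts to an ordered set intersection. The sketch below shows this under stated assumptions: the function name and the toy barcodes are hypothetical, and barcodes are assumed to be unique within each modality.

```python
def shared_barcodes(*barcode_lists):
    """Return the barcodes present in every list, in the order of the first list."""
    if not barcode_lists:
        return []
    common = set(barcode_lists[0]).intersection(*map(set, barcode_lists[1:]))
    return [bc for bc in barcode_lists[0] if bc in common]


# Toy barcodes for three modalities (made up for illustration).
rna = ["AAAC", "AAAG", "AACT", "AAGT"]
prot = ["AAAG", "AAAC", "AAGT"]
atac = ["AAGT", "AAAC", "CCCC"]
kept = shared_barcodes(rna, prot, atac)
```

The same helper can be reused after filtering by passing the lists of QC-passing barcodes for each modality.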

Reference mapping

We mapped the same query PBMC dataset [55, 56] to three different PBMC references [10, 57, 58]. For the scANVI example, we produced the reference model using scVI and then updated the model to scANVI to leverage the label transfer functionality. The query was then mapped to the reference data using the “Reference mapping” workflow by specifying the batch covariate and the cell type label. For the totalVI example, we used the data presented in https://docs.scvi-tools.org/en/stable/tutorials/notebooks/totalVI_reference_mapping.html. The second reference dataset (PBMC_R2) was obtained by downloading the 10x Genomics PBMC 5k and 10k datasets presented in https://scarches.readthedocs.io/en/latest/totalvi_surgery_pipeline.html. The reference was generated using the “Integration” workflow calling the totalVI algorithm, and cells were labeled using the expression of the protein surface markers. We generated the third reference dataset (PBMC_R3) following the process described in the scvi-tools tutorial, downloading the data using the internal scvi-tools function adata_reference = scvi.data.pbmc_seurat_v4_cite_seq(mask_protein_batches=5). To enable transferring the labels to the query from both PBMC_R2 and PBMC_R3, a Random Forest classifier was trained on the latent totalVI embedding of the reference model. All query and reference datasets were provided to the “Reference mapping” workflow to perform Q2R and LT, specifying batch covariates and cell type labels, training the query model for 200 epochs, and leaving all other parameters at their defaults.

Benchmarking

A PBMC CITE-seq dataset [10, 58] and the TAURUS study gut scRNA-seq dataset [66] were used for benchmarking. For these analyses, both the full datasets and the downsampled datasets (10K, 50K, and 100K cells for the PBMCs and 500K cells for the gut cells) were utilized. The code used for the benchmarking is available at https://github.com/DendrouLab/panpipes-benchmarks.
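The text does not state how the downsampled datasets were drawn. One plausible approach, shown here purely as an assumption, is uniform random sampling of cell barcodes without replacement using a fixed seed so that each subset is reproducible. The function name, the seed, and the toy barcodes are hypothetical; the actual procedure is in the benchmarking repository.

```python
import random


def downsample_cells(barcodes, n_cells, seed=0):
    """Draw `n_cells` barcodes without replacement, reproducibly for a given seed."""
    barcodes = list(barcodes)
    if n_cells >= len(barcodes):
        # Nothing to downsample: return every cell.
        return barcodes
    rng = random.Random(seed)
    return rng.sample(barcodes, n_cells)


# Toy example: 1,000 made-up barcodes downsampled to 100 cells.
all_cells = [f"cell_{i}" for i in range(1000)]
subset = downsample_cells(all_cells, 100, seed=42)
```

Fixing the seed keeps the subsets identical between runs, which matters when timing and memory measurements are compared across pipeline configurations.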

Availability of data and materials

Code availability: Source code, full documentation, and tutorials are available at https://github.com/DendrouLab/panpipes and https://panpipes-pipelines.readthedocs.io [35, 36], under an open source BSD 3-Clause License. Panpipes maintenance and updates are the responsibility of the co-senior authors Calliope A. Dendrou and Fabian J. Theis and co-first author Fabiola Curion.

Single-cell datasets: The data used in Fig. 2 to showcase the ADT-associated metrics were obtained from https://www.10xgenomics.com/resources/datasets/10-k-human-pbm-cs-with-total-seq-b-human-tbnk-antibody-cocktail-3-v-3-1-3-1-standard-6-0-0. The trimodal TEA-seq dataset was downloaded using dbGaP accession number phs002316 [45, 46]. For reference mapping analyses, the PBMC CITE-seq dataset used as the query was obtained from the Gene Expression Omnibus (GEO) under accession number GSE155673 [55, 56]. The PBMC datasets used as references are available as follows: CITE-seq data from [10, 58] available via GEO (accession number GSE164378) and dbGaP (accession number phs002315.v1.p1), and scRNA-seq data from [57] available via covid19cellatlas.org. The PBMC data (accession number GSE164378) were also used in Fig. 5 and the gut data for this figure were obtained from the TAURUS study [66].

References

  1. Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, et al. The human cell atlas. eLife. 2017;6:e27041.

  2. Rajewsky N, Almouzni G, Gorski SA, Aerts S, Amit I, Bertero MG, et al. LifeTime and improving European healthcare through cell-based interceptive medicine. Nature. 2020;587:377–86.

  3. Rood E, Maartens A, Hupalowska A, Teichmann SA, Regev A. Impact of the human cell atlas on medicine. Nat Med. 2022;28:2486–96.

  4. McDaniel JR, DeKosky BJ, Tanno H, Ellington AD, Georgiou G. Ultra-high-throughput sequencing of the immune receptor repertoire from millions of lymphocytes. Nat Protoc. 2016;11:429–42.

  5. Buenrostro JD, Wu B, Litzenburger UM, Ruff D, Gonzales ML, Snyder MP, et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature. 2015;523:486–90.

  6. Cusanovich DA, Daza R, Adey A, Pliner HA, Christiansen L, Gunderson KL, et al. Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science. 2015;348:910–4.

  7. Stoeckius M, Hafemeister C, Stephenson W, Houck-Loomis B, Chattopadhyay PK, Swerdlow H, et al. Simultaneous epitope and transcriptome measurement in single cells. Nat Methods. 2017;14:865–8.

  8. Heumos L, Schaar AC, Lance C, Litinetskaya A, Drost F, Zappia L, et al. Best practices for single cell analysis across modalities. Nat Rev Genet. 2023;24:550–72.

  9. Amezquita RA, Lun ATL, Becht E, Carey VJ, Carpp LN, Geistlinger L, et al. Orchestrating single-cell analysis with Bioconductor. Nat Methods. 2020;17:137–45.

  10. Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184:3573–87.

  11. Gibson G. Perspectives on rigor and reproducibility in single cell genomics. PLoS Genet. 2022;18:e1010210.

  12. scverse. https://scverse.org. Accessed March 2023.

  13. Virshup I, Rybakov S, Theis FJ, Angerer P, Wolf FA. anndata: Annotated data. bioRxiv. 2021. https://doi.org/10.1101/2021.12.16.473007.

  14. Bredikhin D, Kats I, Stegle O. MUON: multimodal omics analysis framework. Genome Biol. 2022;23:42.

  15. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:15.

  16. Gayoso A, Lopez R, Xing G, Boyeau P, Pour Amiri VV, Hong J, et al. A Python library for probabilistic analysis of single-cell omics data. Nat Biotechnol. 2022;40:163–6.

  17. Sturm G, Szabo T, Fotakis G, Haider M, Rieder D, Trajanoski Z, et al. Scirpy: a Scanpy extension for analysing single-cell T-cell receptor-sequencing data. Bioinformatics. 2020;36:4817–8.

  18. Sikkema L, Ramirez-Suastegui C, Strobl DC, Gillett TE, Zappia L, et al. An integrated cell atlas of the lung in health and disease. Nat Med. 2023;29:1563–77.

  19. Tran HTN, Ang KS, Chevrier M, Zhang X, Lee NYS, et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 2020;21:12.

  20. Luecken MD, Büttner M, Chaichoompu K, Danese A, Interlandi M, Mueller MF, et al. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods. 2022;19:41–50.

  21. Brombacher E, Hackenberg M, Kreutz C, Binder H, Treppner M. The performance of deep generative models for learning joint embeddings of single-cell multi-omics data. Front Mol Biosci. 2022;9. https://doi.org/10.3389/fmolb.2022.962644.

  22. Lance C, Luecken MD, Burkhardt DB, Cannoodt R, Rautenstrauch P, et al. Multimodal single cell data integration challenge: results and lessons learned. bioRxiv. 2022. https://doi.org/10.1101/2022.04.11.487796.

  23. Lee MYY, Kaestner KH, Li M. Benchmarking algorithms for joint integration of unpaired and paired single-cell RNA-seq and ATAC-seq data. bioRxiv. 2023. https://doi.org/10.1101/2023.02.01.526609.

  24. Makrodimitris S, Pronk B, Abdelaal T, Reinders M. An in-depth comparison of linear and non-linear joint embedding methods for bulk and single-cell multi-omics. bioRxiv. 2023. https://doi.org/10.1101/2023.04.10.535672.

  25. Khozoie C, Fancy N, Marjaneh MM, Murphy AE, Matthew PM, et al. scFlow: A scalable and reproducible analysis pipeline for single-cell RNA sequencing data. bioRxiv. 2021. https://doi.org/10.1101/2021.08.16.456499.

  26. scrnaseq. https://github.com/nf-core/scrnaseq/blob/2.3.2/README.md.

  27. García-Jimeno L, Fustero-Torre C, Jiménez-Santos MJ, Gómez-López G, Di Domenico T, Al-Shahrour F. Bollito: A flexible pipeline for comprehensive single-cell RNA-seq analyses. Bioinformatics. 2021;btab758. https://doi.org/10.1093/bioinformatics/btab758.

  28. Germain P-L, Sonrel A, Robinson MD. pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single cell RNA-seq preprocessing tools. Genome Biol. 2020;21:227.

  29. McCarthy DJ, Campbell KR, Lun ATL, Wills QF. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics. 2017;33:1179–86.

  30. Moreno P, Huang N, Manning JR, Mohammed S, Solovyev A, Polanski K, et al. User-friendly, scalable tools and workflows for single-cell RNA-seq analysis. Nat Methods. 2021;18:327–8.

  31. Zhu X, Wolfgruber TK, Tasato A, Arisdakessian C, Garmire DG, Garmire LX. Granatum: a graphical single-cell RNA-Seq analysis pipeline for genomics scientists. Genome Med. 2017;9:108.

  32. Gardeux V, David FPA, Shajkofci A, Schwalie PC, Deplancke B. ASAP: a web-based platform for the analysis and interactive visualization of single-cell RNA-seq data. Bioinformatics. 2017;33:3123–5.

  33. Cribbs AP, Luna-Valero S, George C, Sudbery IM, Berlanga-Taylor AJ, Sansom SN, et al. CGAT-core: a python framework for building scalable, reproducible computational biology workflows. F1000Research. 2019;8:377.

  34. Yoo AB, Jette MA, Grondona M. SLURM: Simple Linux Utility for Resource Management. In: Feitelson D, Rudolph L, Schwiegelshohn U, editors. Job Scheduling Strategies for Parallel Processing. JSSPP 2003. Lecture Notes in Computer Science, vol 2862. Springer; Berlin, Heidelberg. https://doi.org/10.1007/10968987_3.

  35. Curion F, Rich-Griffin C, Dendrou CA. Panpipes: a pipeline for multiomic single-cell and spatial transcriptomic data analysis (Version 1.0.0, computer software). GitHub. 2024. https://github.com/DendrouLab/panpipes.

  36. Curion F, Rich-Griffin C, Dendrou CA. Panpipes: a pipeline for multiomic single-cell and spatial transcriptomic data analysis. Zenodo. 2024. https://doi.org/10.5281/zenodo.11636539.

  37. Lotfollahi M, Naghipourfar M, Luecken MD, Khajavi M, Büttner M, Wagenstetter M, et al. Mapping single-cell data to reference atlases by transfer learning. Nat Biotechnol. 2022;40:121–30.

  38. CZ CELLxGENE Discover, Chan Zuckerberg Initiative. https://cellxgene.cziscience.com/. Accessed Mar 2023.

  39. Palla G, Spitzer H, Klein M, Fischer D, Schaar AC, et al. Squidpy: a scalable framework for spatial omics analysis. Nat Methods. 2022;19:171–8.

  40. Ilicic T, Kim JK, Kolodziejczyk AA, Bagger FO, McCarthy DJ, Marioni JC, et al. Classification of low quality cells from single-cell RNA-seq data. Genome Biol. 2016;17:29.

  41. Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019;15:e8746.

  42. Schulte-Schrepping J, Reusch N, Paclik D, Baßler K, Schlickeiser S, Zhang B, et al. Severe COVID-19 is marked by a dysregulated myeloid cell compartment. Cell. 2020;182:1419–40.

  43. Mulè MP, Martins AJ, Tsang JS. Normalizing and denoising protein expression data from droplet-based single cell profiling. Nat Commun. 2022;13:2099.

  44. COvid-19 Multi-omics Blood ATlas (COMBAT) Consortium. A blood atlas of COVID-19 defines hallmarks of disease severity and specificity. Cell. 2022;185:916–38.

  45. Swanson E, Lord C, Reading J, Heubeck AT, Genge PC, Thomson Z, et al. Simultaneous trimodal single-cell measurement of transcripts, epitopes, and chromatin accessibility using TEA-seq. eLife. 2021;10:e63632.

  46. Swanson E, Lord C, Reading J, Heubeck AT, Genge PC, Thomson Z, et al. Simultaneous trimodal single-cell measurement of transcripts, epitopes, and chromatin accessibility using TEA-seq. dbGaP. 2021. https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002316.v1.p1.

  47. Polański K, Young MD, Miao Z, Meyer KB, Teichmann SA, Park J-E. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics. 2020;36:964–5.

  48. Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods. 2019;16:1289–96.

  49. Ashuach T, Gabitto MI, Jordan MI, Yosef N. MultiVI: deep generative model for the integration of multi-modal data. bioRxiv. 2021. https://doi.org/10.1101/2021.08.20.457057.

  50. Gayoso A, Steier Z, Lopez R, Regier J, Nazor KL, Streets A, et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat Methods. 2021;18:272–82.

  51. Büttner M, Miao Z, Wolf FA, Teichmann SA, Theis FJ. A test metric for assessing single-cell RNA-seq batch correction. Nat Methods. 2019;16:43–9.

  52. Hubert L, Arabie P. Comparing partitions. J Classification. 1985;2:193–218.

  53. The Tabula Sapiens Consortium. The Tabula Sapiens: a multiple-organ, single-cell transcriptomic atlas of humans. Science. 2022;376:896.

  54. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nat Methods. 2018;15:1053–8.

  55. Arunachalam PS, Wimmers F, Mok CKP, Perera RAPM, Scott M, Hagan T, et al. Systems biological assessment of immunity to mild versus severe COVID-19 infection in humans. Science. 2020;369:1210–20.

  56. Arunachalam PS, Wimmers F, Mok CKP, Perera RAPM, Scott M, Hagan T, et al. Systems biological assessment of immunity to mild versus severe COVID-19 infection in humans. GEO. 2020. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE155673.

  57. Chan Zuckerberg Initiative Single-Cell COVID-19 Consortia, Ballestar E, Farber DL, Glover S, Horwitz B, Meyer K, et al. Single cell profiling of COVID-19 patients: an international data resource from multiple tissues. medRxiv. 2020. https://doi.org/10.1101/2020.11.20.20227355.

  58. Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. dbGaP. 2021. https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002315.v1.p1.

  59. Xu C, Lopez R, Mehlman E, Regier J, Jordan MI, Yosef N. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol Syst Biol. 2021;17:e9620.

  60. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.

  61. Williams CG, Lee HJ, Asatsuma T, Vento-Tormo R, Haque A. An introduction to spatial transcriptomics for biomedical research. Genome Med. 2022;14:68.

  62. Moffitt JR, Lundberg E, Heyn H. The emerging landscape of spatial profiling technologies. Nat Rev Genet. 2022;23:741–59.

  63. Cheng M, Jiang Y, Xu J, Mentis A-FA, Wang S, et al. Spatially resolved transcriptomics: a comprehensive review of their technological advances, applications, and challenges. J Genet Genomics. 2023;27:S1673-8527(23)00075-9.

  64. Kleshchevnikov V, Shmatko A, Dann E, Aivazidis A, King HW, et al. Cell2location maps fine-grained cell types in spatial transcriptomics. Nat Biotechnol. 2022;40:661–71.

  65. Biancalani T, Scalia G, Buffoni L, Avasthi R, Lu Z, et al. Deep learning and alignment of spatially resolved single-cell transcriptomes with Tangram. Nat Methods. 2021;18:1352–62.

  66. Thomas T, Rich-Griffin C, Pohin M, Friedrich M, Aschenbrenner D, et al. A longitudinal single-cell therapeutic atlas of anti-tumour necrosis factor treatment in inflammatory bowel disease. bioRxiv. 2023. https://doi.org/10.1101/2023.05.05.539635.

  67. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3:160018.

  68. Lotfollahi M, Litinetskaya A, Theis FJ. Multigrate: single-cell multi-omic data integration. bioRxiv. 2022. https://doi.org/10.1101/2022.03.16.484643.

  69. Wolock SL, Lopez R, Klein AM. Scrublet: computational identification of cell doublets in single-cell transcriptomic data. Cell Syst. 2019;8:281–91.

  70. Hie B, Bryson B, Berger B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat Biotechnol. 2019;37:685–91.

  71. Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC, et al. Multi-omics factor analysis—a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol. 2018;14:e8124.

  72. Zappia L, Oshlack A. Clustering trees: a visualization for evaluating clusterings at multiple resolutions. GigaScience. 2018;7:giy083.

  73. Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics. 2011;27:718–9.

  74. Merging objects. https://stuartlab.org/signac/articles/merging.html. Accessed Dec 2022.

Acknowledgements

The authors thank Dr Melissa Grant-Peters, Dr Jacqueline Siu, Mr Jarne Beliën, Ms Nitya Gupta, Mr Kin (Andrew) Lee, and Mr Vadim Nazarov for testing the code and identifying bugs. The authors also wish to thank Dr Luke Zappia for revising the manuscript and for helpful discussions. This work has been performed within the framework of the Cartography Consortium (https://www.medsci.ox.ac.uk/for-staff/resources/business-partnerships-office/oxford-janssen-working-in-collaboration), which authors CAD, CRG, and DA are members of, and which is funded by Janssen Biotech, Inc/Johnson & Johnson Innovative Medicine.

Peer review information

Veronique van den Berghe was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Review history

The review history is available as Additional File 2.

Funding

This work was performed with support from the Wellcome Trust and Royal Society (204290/Z/16/Z), the UK Medical Research Council (MR/T030410/1), the Rosetrees Trust (R35579/AA002/M85-F2), Cartography Consortium funding from Janssen Biotech Inc, the Kennedy Trust for Rheumatology Research, and the NIHR Oxford Biomedical Research Centre, Inflammation Across Tissues and Cell and Gene Therapy Themes to CAD; the NIHR Oxford BRC (BRCRCF10-04) and an Oxford-Janssen Cartography Consortium Fellowship from Janssen Biotech Inc/Johnson & Johnson Innovative Medicine to CRG and to DA; the DFG, German Research Foundation (SFB-TRR 338/1 2021, 452881907), and the Deutsche Forschungsgemeinschaft to FC and FJT; the Interdisciplinary Bioscience DTP, supported by the BBSRC to GELG; the Kennedy Trust for Rheumatology Research Arthritis Therapy Acceleration Programme (A-TAP) and an Educational Grant from Celsius Therapeutics to TT; and by the UK Medical Research Council (MR/R015708/1) to WL.

Author information

Contributions

FC and CRG conceived the study with input from CAD and FJT. FC and CRG wrote the code with contributions from DA, SO, KRA, LM, GELG, LH, TT, and WL. FC and CRG wrote the manuscript with input from CAD and FJT. CRG, FC, DS, CAD, and FJT read and revised the manuscript. FC and CRG are equal contributors to this work and can reference this work as a first authorship paper in their curriculum vitae.

Authors’ X handles

X handles: Lukas Heumos (@LukasHeumos); Devika Agarwal (@deevdevil); Calliope A. Dendrou (@CalliDendrou); Giulia E. L. Garcia (@Giuliaelgarcia).

Corresponding authors

Correspondence to Fabian J. Theis or Calliope A. Dendrou.

Ethics declarations

Ethics approval and consent to participate

Not relevant to our study.

Consent for publication

Not relevant to our study.

Competing interests

FJT consults for Immunai Inc., Singularity Bio B.V., CytoReason Ltd, and Omniscope Ltd, and has ownership interest in Dermagnostix GmbH and Cellarity. LH is an employee of Lamin Labs. The other authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Curion, F., Rich-Griffin, C., Agarwal, D. et al. Panpipes: a pipeline for multiomic single-cell and spatial transcriptomic data analysis. Genome Biol 25, 181 (2024). https://doi.org/10.1186/s13059-024-03322-7

  • DOI: https://doi.org/10.1186/s13059-024-03322-7