geneBasis: an iterative approach for unsupervised selection of targeted gene panels from scRNA-seq

scRNA-seq datasets are increasingly used to identify gene panels that can be probed using alternative technologies, such as spatial transcriptomics, where choosing the best subset of genes is vital. Existing methods are limited by a reliance on pre-existing cell type labels or by difficulties in identifying markers of rare cells. We introduce an iterative approach, geneBasis, for selecting an optimal gene panel, where each newly added gene captures the maximum distance between the true manifold and the manifold constructed using the currently selected gene panel. Our approach outperforms existing strategies and can resolve cell types and subtle cell state differences.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-021-02548-z.
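The iterative selection described in the abstract can be illustrated with a minimal sketch. It is not the geneBasis implementation: the function names (`knn_indices`, `neighbour_mean`, `select_panel`), the brute-force Euclidean k-nearest-neighbour search, the choice of the highest-variance gene as the starting point, and the defaults `k=5` and `p=3` are all assumptions made for illustration. The "true manifold" is approximated here by a neighbour graph built on all genes, and each step adds the gene whose neighbour-averaged expression differs most between that graph and the graph built on the current panel.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbours (Euclidean) of each row of X, excluding the row itself."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # a cell is never its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

def neighbour_mean(expr, nn):
    """Average expression over each cell's neighbours; returns a cells x genes array."""
    return expr[nn].mean(axis=1)

def select_panel(expr, n_genes, k=5, p=3, seed_gene=None):
    """Greedy panel selection on a cells x genes matrix (illustrative sketch only)."""
    # "True" graph: neighbours computed from all genes.
    true_pred = neighbour_mean(expr, knn_indices(expr, k))
    if seed_gene is None:
        # Assumption: start from the most variable gene.
        seed_gene = int(np.argmax(expr.var(axis=0)))
    panel = [seed_gene]
    while len(panel) < n_genes:
        # Graph built from the currently selected genes only.
        sel_pred = neighbour_mean(expr, knn_indices(expr[:, panel], k))
        # Per-gene Minkowski distance (order p) between the two neighbour-averaged profiles.
        dist = np.sum(np.abs(true_pred - sel_pred) ** p, axis=0) ** (1.0 / p)
        dist[panel] = -np.inf  # never re-select a gene already in the panel
        panel.append(int(np.argmax(dist)))
    return panel

rng = np.random.default_rng(0)
expr = rng.normal(size=(60, 8))  # synthetic data: 60 cells, 8 genes
panel = select_panel(expr, n_genes=4)
print(panel)
```

The brute-force neighbour search keeps the sketch dependency-free; for real datasets an approximate nearest-neighbour method would be the more practical choice.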

[Schematic. Panel labels: true graph; zoomed view. For each cell (in black): take the vector of distances from the cell to all other cells; normalisation: z-score and multiply by -1.]

Systematic assessment of the 'completeness' of the generated gene panels. We provide several semi-orthogonal metrics to evaluate how complete the current panel is. Analysis is performed for each dataset (in rows).
(A) Heatmap representing the fraction of cells with a low neighbourhood preservation score as a function of the number of genes (X-axis) and the threshold for the neighbourhood preservation score (Y-axis). Colours are log-scaled to facilitate visualisation. Numbers inside correspond to actual values.
(B) Distribution of gene prediction score as a function of the gene panel size. Colours correspond to different methods.
(C) Heatmap representing statistics on the gene prediction score as a function of the number of genes (X-axis) and the threshold for the gene prediction score (Y-axis). Numbers inside correspond to the number of genes with a gene prediction score lower than the corresponding threshold. Colours correspond to the maximum correlation across all pairwise comparisons between genes with low gene prediction scores. Colours are log-scaled to facilitate visualisation.

(A) Heatmaps showing the co-expression (Spearman correlation) between genes selected by geneBasis and SCMER. The overall degree of co-expression is lower for geneBasis than for SCMER, and this is consistent across all benchmarked datasets.

[Panel title: Mouse embryo]
(B) Boxplots representing pairwise correlation of log-normalised expression values (Y-axis) for the first genes selected by geneBasis (red) and SCMER (light blue).
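The pairwise co-expression summarised in these panels can be computed as in the following sketch. It assumes a cells x genes matrix of log-normalised expression restricted to the selected genes; the helper names (`average_ranks`, `spearman_matrix`, `pairwise_values`) are hypothetical and are not taken from geneBasis or SCMER. Ties receive average ranks, which matters for scRNA-seq data where many cells share a zero count.

```python
import numpy as np

def average_ranks(x):
    """Ranks of a 1-D array (1-based), with tied values sharing their mean rank."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    upper = np.cumsum(counts)                 # highest rank within each tie group
    mean_rank = upper - (counts - 1) / 2.0    # mean rank of each tie group
    return mean_rank[inverse]

def spearman_matrix(expr):
    """Gene x gene Spearman correlation for a cells x genes matrix."""
    ranks = np.column_stack([average_ranks(expr[:, j]) for j in range(expr.shape[1])])
    return np.corrcoef(ranks, rowvar=False)   # Pearson correlation of the ranks

def pairwise_values(corr):
    """Correlations for all unordered gene pairs (upper triangle, diagonal excluded)."""
    i, j = np.triu_indices(corr.shape[0], k=1)
    return corr[i, j]

# Toy example: gene 1 increases with gene 0, gene 2 decreases with gene 0.
a = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
toy = np.column_stack([a, a ** 3, -a])
corr = spearman_matrix(toy)
values = pairwise_values(corr)
print(values)
```

The vector returned by `pairwise_values` is the kind of quantity a boxplot of pairwise correlations would display, one value per gene pair.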

[Panel titles: Pancreas; Initial selection; Initial selection + geneBasis]

(A) Mouse embryo. Celltype confusion matrices for the initial semi-random selection (left), for the updated selection with 6 additional genes (middle) and for the updated selection with 12 additional genes (right).
(B) Spleen. Celltype confusion matrices for the initial semi-random selection (left), for the updated selection with 12 additional genes (middle) and updated selection with 24 additional genes (right).
(C) Pancreas. Celltype confusion matrices for the initial semi-random selection (left), for the updated selection with 6 additional genes (middle) and updated selection with 12 additional genes (right).

Fig. S5. geneBasis accounts for batch effects even with highly unbalanced celltype composition.
(A) Overview of the in silico experiment. Gene search was performed for 5 datasets (in rows), and inclusion of blood lineage for each sample (i.e. batch) is specified in columns.
(B) Table represents whether blood markers were selected for each dataset and method.
(C) UMAPs for mouse embryogenesis, coloured by expression of the selected blood marker genes.

(A) UMAP representation coloured by celltypes. Note that it duplicates part of Figure 2C and is introduced here solely to facilitate interpretation of the data.
(B) Celltype confusion matrix for the first 20 selected genes.
(C) Celltype confusion matrix for the first 50 selected genes.
(D) Box plots representing bulk log-normalised expression across celltypes for genes that are differentially expressed between Visceral endoderm and Gut.
(E) Box plots representing bulk log-normalised expression across celltypes for genes that are differentially expressed between Caudal mesoderm and NMP.
(F) Co-expression (within Cardiomyocytes) of genes prioritised by geneBasis and differentially expressed in Cardiomyocytes.
(G) For manually selected genes denoted as relevant to inter-cardiomyocyte heterogeneity: UMAP representation of Cardiomyocytes coloured by expression and box plots representing bulk log-normalised expression across mapped cardiac clusters.
(H) Co-expression (within Gut cells) of genes prioritised by geneBasis and differentially expressed in Gut.
(I) For manually selected genes denoted as relevant to inter-gut heterogeneity: UMAP representation of Gut cells coloured by expression and box plots representing bulk log-normalised expression across mapped cardiac clusters.

(A) UMAP representation coloured by celltypes. Note that it duplicates part of Figure 2C and is introduced here solely to facilitate interpretation of the data.
(C) Celltype confusion matrix for the first 75 selected genes.

(D) UMAP representation of T cells (upper panel), B cells (middle panel) and Plasma cells and Plasmablasts (lower panel), coloured by genes relevant for inter-celltype variability within the corresponding celltypes.
(E) UMAP representation coloured by celltypes. Note that it duplicates part of Figure 2C and is introduced here solely to facilitate interpretation of the data.

[Panel titles: All seqFISH+ genes; Selection from seqFISH+ data; Selection from scRNA-seq data]

Fig. S8. geneBasis selects genes that recover biological heterogeneity in seqFISH datasets.
(A) UMAP plots representing the joint embedding of matched scRNA-seq (left) and seqFISH+ (right) data from a mouse olfactory bulb. Colours correspond to the respective annotated cell types/clusters.
(B) Heatmap representing the confusion matrix for seqFISH+ cells between the cluster originally assigned in seqFISH+ (X-axis) and the mapped class from scRNA-seq (Y-axis).
(C) Heatmaps representing cell type mapping accuracy (for seqFISH+ data) for the selection of 150 genes derived from scRNA-seq (left); the selection of 150 genes derived from seqFISH+ (centre); all 10000 genes probed in seqFISH+.

(A) Box plots representing the overall cell neighbourhood preservation score distribution as a function of the number of genes in the selections (X-axis) and the different orders of Minkowski distance (p) used for the algorithm (in colour).
(B) Box plots representing overall gene prediction score distribution as a function of number of genes in the selections (X-axis) and p (orders of Minkowski distance) used for the algorithm (in colour).
(C) Heatmaps representing the ability of a selection with the given order of Minkowski distance to prioritise lowly expressed genes. X-axis corresponds to the choice of p, Y-axis corresponds to the number of genes in the selection, and both numbers (absolute scale) and colours (log scale) correspond to the minimum expression level (in terms of the fraction of cells with non-zero counts) across selected genes.
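To make the role of the Minkowski order p concrete, the sketch below compares two hypothetical per-cell error profiles with the same total absolute error: one spread thinly over all cells, and one concentrated in a few cells, as might happen for a gene expressed in a rare population. The function name `minkowski` and the toy vectors are assumptions for illustration and are not taken from the geneBasis code. At p = 1 the two profiles are indistinguishable, whereas higher orders weight the concentrated error more heavily.

```python
import numpy as np

def minkowski(u, v, p):
    """Minkowski distance of order p between two equal-length vectors."""
    diff = np.abs(np.asarray(u, dtype=float) - np.asarray(v, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

n_cells = 100
zero = np.zeros(n_cells)

# Small error in every cell (total absolute error of 10).
spread = np.full(n_cells, 0.1)

# Large error confined to 5 cells (same total absolute error of 10).
rare = np.zeros(n_cells)
rare[:5] = 2.0

for p in (1, 2, 3):
    print(p, minkowski(spread, zero, p), minkowski(rare, zero, p))
```

Under these assumptions, a selection criterion based on a higher-order distance would rank the gene with the concentrated error above the gene with the diffuse error, even though both have the same summed error.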