An algorithm for gene selection.
For a given scRNA-seq dataset, we represent transcriptional similarities between cells using k-nearest neighbor (k-NN) graphs (Methods). A k-NN graph that is constructed using the entire transcriptome represents the “true” relationships between cells as manifested by the similarities in expression levels of individual genes between cells and their “true” neighbors. Subsequently, we aim to find a selection of genes that yields a k-NN graph similar to the “true” graph and is capable of predicting the true expression of any gene by using each cell’s neighbors.
Specifically, at each iteration (i.e., given the currently selected gene panel) we compare the graph constructed using the entire transcriptome (“true” k-NN graph) with a graph constructed using the current “selection” (“selection” graph). At a single gene level, we assess how well we can predict a gene’s expression levels across cells, using the cells neighbors in the “selection” graph compared to the “true” graph, and select the gene that shows the biggest discrepancy between the two graphs (Fig. 1). More precisely, for each gene, in the “true” graph we compute, across cells, the Minkowski distance between a gene’s expression in a given cell and its average expression in that cell’s k-nearest neighbors. Intuitively, this provides a baseline measure for how well a gene’s expression can be predicted by its nearest neighbors in a best-case scenario. Similarly, we compute the Minkowski distance for each gene using the “selection” graph. We then compute, for each gene, the difference in the distances between the “true” and “selection” graph and add the gene with the largest difference in distance to the selected set. This process is then repeated until the desired number of genes has been chosen.
An important limitation of existing methods is the inability to properly assess how complete and informative the designed gene panel is. To address this, we derived three metrics that evaluate gene panels at different levels of granularity: cell type, cell, and gene level.
The first metric explores whether cell types can be assigned in a specific and sensitive manner. Specifically, while geneBasis does not rely on clustering, for scRNA-seq datasets where annotations already exist, they can be exploited. To do so, we construct a k-NN graph for all cells using the selected gene panel and compare for each cell its (pre-determined) cell type label with the (pre-determined) labels assigned to its nearest neighbors (Methods). This allows construction of a confusion matrix that provides insight into whether cells from the same type are grouped together in the graph constructed using the selected panel.
As a second metric, which does not rely on cell type annotation, we focused on examining how well an individual cell’s neighborhood is preserved. For each cell, we compute its nearest neighbors in the “true” and “selected” graph. Subsequently, in the “true” graph space, we compare the distance between each cell and these two sets of nearest neighbors (Additional file 1: Fig. S1). Intuitively, when the set of neighbors is identical in both the “true” and “selection” graph, this metric will output a score of 1, while a score close to 0 indicates that a cell’s neighbors in the “selection” graph are randomly distributed across the “true” manifold.
Finally, a third metric for assessing the quality of the selected gene panels focuses on individual genes (Methods). For each gene, we compute the correlation (across cells) between its log-normalized expression values and its average log-normalized expression values across k-nearest neighbors in a graph. We compute correlations for the “true graph” and for the “selected graph,” and use the ratio (“selected”/“true”) as the final gene prediction score.
geneBasis allows recovery of local and global variability
To evaluate whether geneBasis efficiently recovers a selection of genes that preserves both local and global structure, we applied geneBasis, as well as SCMER and scGeneFit (Methods), to four diverse systems representing challenges that frequently arise in scRNA-seq datasets:
-
1.
A study of mouse embryogenesis at embryonic day (E)8.5, coinciding with early organogenesis [32]. This dataset contains both abundant and rare cell types as well as multiple different trajectories.
-
2.
A newly generated dataset profiling cells from the adult human spleen. As part of the lymphatic system, the spleen contains transcriptionally similar immune cell types that are not straightforward to resolve with a limited number of genes.
-
3.
Multiple independent studies of the adult human pancreas dataset, consisting of multiple batches from various donors and experiments [33,34,35,36,37,38].
-
4.
Transcriptional profiling of melanoma samples from 19 donors collected at different stages of disease progression [39].
Overall, using the metrics defined above, the gene panels chosen by geneBasis (Additional file 2: Table S1) yield better performance than those chosen by SCMER (Fig. 2B, C). This is true both in terms of preserving cell neighborhoods and in terms of the fraction of cells for which the neighborhood was poorly predicted (Fig. 2B, Additional file 1: Fig. S2A). The effect was most noticeable when a small number of genes were selected, with differences in performance decreasing as the number of genes selected increased (with the exception of the spleen, where geneBasis consistently outperforms SCMER). This is of practical relevance for designing gene panels for FISH-based experiments, where the number of selected genes can be an important limiting factor.
We hypothesize that the difference in performance arises due to the nature of the L1-regularization utilized by SCMER. In general, regularizations like Lasso tend to discourage redundancy among the selected features. However, this may not hold if the strength of the regularization is chosen manually and lies above an appropriately selected range [40]. We empirically support this hypothesis by assessing the redundancy within selected gene panels for both geneBasis and SCMER (Fig. 2A, Additional file 1: Fig. S3). Specifically, when a low number of genes is selected, SCMER tends to select highly co-expressed genes.
Additionally, from the beginning of the search, geneBasis prioritizes genes that allow delineation of cell types as well as scGeneFit, a method that specifically addresses the task of cell type separation (Fig. 2C). Moreover, geneBasis robustly recovers genes that delineate cell types even if a preselected set of genes is provided (Methods, Additional file 1: Fig. S4). Taken together, we suggest that geneBasis represents an efficient method to successfully resolve cell types and to preserve the overall manifold.
Another essential aspect of targeted gene panel design is to determine whether enough genes have been selected to capture variability in the dataset of interest. To this end, the cell neighborhood preservation metric can be exploited to investigate whether specific transcriptional neighborhoods are enriched for cells with low scores, suggesting that these neighborhoods are not preserved well by the current selection. Across the benchmarked datasets, we illustrate this by focusing on the first 150 selected genes (Fig. 2D, Additional file 1: Fig. S2). We observe that although there exist subtle differences between cell types, the majority of cell types (Fig. 2D, upper panel) and regions of the high-dimensional space (Fig. 2D, lower panel) were well covered by the selected panels, indicative of good performance (the only striking exception was in the melanoma dataset, in particular for cells that could not be assigned a clear label in the original analysis). Consistent with these results, we also observed high imputation accuracy for the expression of genes that were not among the selected set (Additional file 1: Fig. S2B, C). More generally, we anticipate that a user may not necessarily want to select an exact number of genes, but rather may have an upper limit of genes that they could feasibly profile. In these cases, we advise running geneBasis until the limit is reached, before assessing when the cell neighborhood preservation and gene prediction score distributions converge as a function of the number of genes (Fig. 2, Additional file 1: Fig. S2). Finally, it is important to stress that the gene panel evaluation is independent of the selection method employed and can be used to compare two or more panels generated by any approach.
geneBasis accounts for batch effect and handles unbalanced cell type composition
Careful consideration and adjustment for batch effects is one of the most challenging aspects of scRNA-seq data analysis [41]. In the context of gene selection, SCMER performs gene selection for each sample individually, and then retains features identified in all/most samples. While this is an efficient way to discard genes that show technical variability, such an approach might also discard genes with strong biological relevance that are only captured in a small fraction of samples, e.g., cell type markers for cell types present in < 50% of the batches. Importantly, such unbalanced cell type composition is not uncommon in practice.
To account for this, in cases where a batch is specified by a user (i.e., every cell is assigned to a batch), we construct k-NN graphs (“true” and “selection”) within each batch, thereby assigning nearest neighbors only from the same batch and mitigating technical effects. Minkowski distances per genes are calculated across all cells from every batch thus resulting in a single scalar value for each gene. Importantly, if a certain cell type is present only in one batch, cells from this cell type will be consistently “mismapped” if none of its marker genes are selected, and therefore, the algorithm will select one of the marker genes.
To assess our approach, we utilized the first dataset described above, which consisted of four independent batches (samples) from the same stage of mouse development (Additional file 1: Fig. S5). In each batch, a considerable fraction of cells were associated with blood lineage (11–27%, Methods). Next, we artificially removed cells in a batch-aware manner, thus keeping cells from the blood lineage in 1, 2, or 3 batches, as well as adding a positive control (retaining blood lineage in all the batches) and a negative control (removing blood lineage from all the batches). We applied geneBasis and SCMER to these different settings. geneBasis efficiently (among first 10 genes) selected one of the blood markers for each combination except for the negative control (Additional file 1: Fig. S5C). SCMER identified strong markers in the positive control setting and when the blood lineage was retained in at least 2 of the 4 samples. However, it did not select any blood markers when the blood lineage was retained in only one sample (overall fraction of blood lineage cells in this combination was nearly 5%), even though the selection provided by SCMER was identified by the algorithm as of satisfactory quality (all selected genes for this case and KL divergence are listed in Additional file 3: Table S2).
Computational complexity of geneBasis
With rapid increases in the size of scRNA-seq datasets, computational complexity of any algorithm is of increasing importance. Consequently, we sought to estimate the running time of geneBasis, to ensure its scalability. We established that three parameters are relevant for the computational complexity of the algorithm: n - number of cells, D - number of pre-filtered genes that undergo gene search, and p - number of genes we want to select. In practice, p will almost never exceed a few hundreds and D will never exceed 15,000–20,000, meaning that the limiting parameter is the number of cells. Accordingly, we estimated running time for a series of spleen scRNA-seq sub-datasets (Methods), while varying n from ~ 5000 to ~ 30,000. We also varied D from 1000 to 10,000, and p from 50 to 150. For the selected range, we observe a linear relationship between elapsed time and number of cells, and exponential increase in complexity as a function of p (Fig. 3A). As anticipated, change in n had a higher effect on the running time compared to change in D (Fig. 3B). Additionally, we compared computational complexity between geneBasis and SCMER and established that across the tested range of n, p, and D geneBasis outperforms SCMER (Fig. 3C, for the details of the comparison see Methods). Note that this might not hold true for higher values of p, but since for all tested datasets selections with 150 genes appeared to be sufficient and complete (Fig. 2), we suggest that p < = 150 is a reasonable assumption. We further suggest that if a greater reduction in computational complexity is desired, appropriate downsampling strategies can be applied (Additional file 4: Supplementary Note 1).
geneBasis resolves rare cell types and unannotated inter-cell-type variability
Cellular identity, which is typically represented on the level of cell type, is a basic and essential unit of information that any gene library needs to recapitulate. In other words, a “biologically” complete gene library should delineate all cell types, including rare ones, for which statistical methods can be occasionally underpowered. However, it also should be able to capture intra-cell-type variability in cases where discrete clustering has not been performed to the appropriate resolution and identify genes that display gradual changes in expression across the high-dimensional space, consistent with developmental trajectories. To address whether geneBasis satisfies the above criteria, we focused on three datasets: mouse embryogenesis, human spleen, and human pancreas. All datasets contain annotated cell type labels and involve combinations of abundant and rare cell types (Fig. 4B, H, J; Additional file 1: Fig. S6A; S7A,E). Moreover, each dataset contains cell types that are distinct and closely related, making them highly appropriate for the task (Additional file 1: Fig. S6A; S7A,E).
For the study of mouse embryogenesis, selecting only 20 genes accurately resolves most cell types (Additional file 1: Fig. S6B), with 50–100 genes being sufficient to achieve nearly perfect matching (Fig. 4A; Additional file 1: Fig. S6C). The minor exceptions comprise transcriptionally related cell types, such as the visceral endoderm and the gut, where only few (if any) differentiating markers exist and cell types, and cell type classification is made using the degree of expression of commonly shared markers. Nevertheless, in all such cases, geneBasis selects several genes that could in principle be used to differentiate the cell types (Additional file 1: Fig. S6D, E).
Having determined that geneBasis identifies genes that characterize annotated cell types, we next asked whether it could find genes associated with more subtle biological signals present in the data. To this end, we exploited recently published studies focused on myocardium development [42] and gut tube formation [43] (Methods). We first integrated (using all genes) cells annotated as cardiomyocytes from the scRNA-seq atlas for the whole embryo together with cells from [42] that were isolated from the anterior cardiac region of mouse embryos, and assigned cluster identities to cardiomyocytes cells (Methods; Fig. 4C). As expected, all cells were assigned to the mesodermal cardiac lineage, with the majority being assigned to the most differentiated cardiomyocytes (Me3). Consistent with Tyser et al., the integrated data also reveal two trajectories leading to the most differentiated Me3 cluster: an FHF-like differentiation trajectory linking Me5, Me4, and Me3 and a SHF-like trajectory via Me7. Importantly, when using only the first 100 genes selected by geneBasis, we were able to recapitulate this structure (Fig. 4C, D). Among the 100 genes, 13 were differentially expressed in cardiomyocytes (Additional file 1: Fig. S6F), including markers for the FHF-trajectory (Sfrp5), strong markers for the differentiated Me3-state (Myl2) and contractile markers (Ttn) that exhibit gradual expression along the differentiation trajectory of the myocardium (Additional file 1: Fig. S6G). Additionally, the selection of genes when performed only within Cardiomyocytes also contains genes marking different Me-clusters (Additional file 1: Fig. S6J). Similarly, when focusing on the gut tube, and using the refined annotation provided by [43], we observed that the set of genes chosen by geneBasis recovered the distinct populations of cells arranged across the Anterior-Posterior axis (Fig. 4E, F; Additional file 1: Fig. S6H, I, K).
Next, we explored the ability of geneBasis to select genes that captured heterogeneity among the populations of immune cells present in the adult human spleen. As expected, compared to mouse development, the ability to discriminate between the set of transcriptionally similar cell types present in the spleen required more genes (Additional file 1: Fig. S7B, C). Nevertheless, selection of 100 genes allowed most cell types to be identified in a sensitive and specific manner (Fig. 4G, Additional file 1: Fig. S7D). Similarly to mouse embryogenesis, the occasional mismapping of cells occurred between transcriptionally similar cell types, such as follicular and marginal zone B cells, as well as CD4+ and CD8+ T cells and innate lymphoid cells. Of note, the difficulty of using only the transcriptome to distinguish between different types of T cells is well known [44], and it has been suggested that a more precise annotation of T cells can be obtained by generating paired measurements of cellular transcriptomes and immunophenotypes [45].
Finally, we benchmarked geneBasis on a thoroughly annotated and integrated dataset of human pancreas, containing numerous rare cell types—8 of the 13 annotated cell types account for less than 5% of the overall dataset. Nevertheless, and similarly to the previous examples, we quickly and accurately predict all cell types with the exception of rare cycling cells, which are occasionally conflated with highly abundant alpha cells (Fig. 4I; Additional file 1: Fig. S7F, G).
geneBasis captures signals associated with cell states
We next examined whether geneBasis could identify sources of biological variation that in principle can be recovered from scRNA-seq data but that are frequently overlooked because they do not contribute to cell identity per se. Examples of such genes include cell state markers associated with transient biological processes, such as cell cycle, DNA damage repair, and oxidative stress. Depending on the biological context, this cell state information can be highly relevant for in situ profiling. For example, cell cycle genes are typically not included in gene panels for spatial transcriptomics experiments [11, 14, 16], partially due to high abundance of cyclin mRNAs and partially due to the notion that this signal is frequently deemed to be uninteresting. However, in the context of tumorigenesis, DNA damage, cell cycle, and proliferation of malignant cells are the hallmarks of disease progression, and mapping this information spatially could be highly insightful.
To illustrate that geneBasis successfully and sufficiently recovers cell state genes indicative of ongoing biological processes, we analyzed the set of genes recovered in two distinct biological settings: melanoma (Fig. 5A) as well as activated lymphocytes (Additional file 5: Supplementary Note 2, for CITE-seq dataset [46]). The melanoma dataset consists of ~ 4500 cells isolated from 19 patients, containing malignant, immune, stromal, and endothelial cells. Within the first 100 genes selected, we identified 53 markers that were differentially expressed between the annotated cell types (Methods) and verified that these resolve all cell types, including the rare population of NK cells that are transcriptionally similar to highly abundant T cells (Fig. 5B, C).
The majority of the remaining genes were globally differentially expressed between malignant and non-malignant cells (8 and 26 genes were down- and upregulated respectively in malignant cells). In the original study, a substantial degree of transcriptional heterogeneity was characterized across individual cells, both malignant and healthy, mostly associated with cell cycle, variation in MITF and AXL levels, and activation of the exhaustion program in T cells. Importantly, among the upregulated genes, we identified markers of transcriptional programs described in the original study such as tumor progression and cell exhaustion.
Finally, 13 genes were neither identified as cell type markers nor significantly up- or downregulated in malignant cells. Among those we found JunB, B2M, and Narf2—markers of inflammation, cell exhaustion, and melanoma oncogenic programs taking place in malignant and T cells. Overall, we select cell state markers for all cell states characterized in the original publication programs including MITF and AXL (Fig. 5D). We observed a low degree of co-expression among these genes, with the exception of cell cycle genes and genes associated with the MITF program. Low co-expression is consistent both across all cells and within malignant cells, suggesting that these cell state genes are indicative of distinct transcriptional programs.
geneBasis selects genes that resolve cell types in seqFISH+ data
Technical differences between scRNA-seq and FISH-based technologies affect the scaling, magnitude of variance, and overall distribution of the expression, either across multiple genes or in a gene-specific manner [46]. To assess whether the panel selected by geneBasis resolves cell types not only in the scRNA-seq from which it was derived, but in the matched -FISH data themselves, we applied geneBasis to a published seqFISH+ study profiling the olfactory bulb of a mouse using a 10,000 gene probe set [13]. To select a ranked gene panel, we used corresponding scRNA-seq from [47], covering various anatomical units of a mouse brain.
First, to ensure that the reference scRNA-seq dataset sufficiently covers the heterogeneity captured in the seqFISH+ sample, we generated a joint embedding in the latent space between scRNA-seq and seqFISH+ cells (Methods, Additional file 1: Fig. S8A). Both data sets contain cluster/cell type labels, and we confirm that cells from the same or comparable cell types fall in the same latent spaces (Additional file 1: Fig. S8A, B). This suggests that biological variability between cells from different cell types outweigh technical differences between scRNA-seq and seqFISH technologies, thereby supporting the notion of using scRNA-seq data to select genes that are relevant for seqFISH.
Next we applied geneBasis to the corresponding scRNA-seq dataset and considered the top 150 genes (across genes probed in the seqFISH+ assay) and assessed whether the selected genes can resolve distinct seqFISH+ clusters identified using the full dataset. To do so, we estimated the cell type mapping accuracy for normalized logcounts from the seqFISH+ data using only the selected panel (Additional file 1: Fig. S8C, left panel). To estimate how much scaling differences between scRNA-seq and seqFISH factor in the performance of the library, we added a positive control where we applied geneBasis to the seqFISH+ data itself, and then evaluated the performance of this panel using the seqFISH+ data (Additional file 1: Fig. S8C, center panel). Finally, to estimate a per cell type upper bound of cell type mapping accuracy, we calculated cell type mapping accuracy on seqFISH+ data using all 10,000 genes (Additional file 1: Fig. S8C, right panel). Overall, the selection from seqFISH+ shows a moderate increase in cell type mapping accuracy for some cell types when compared to the selection from scRNA-seq. Nevertheless, both selections show rapid convergence to the estimated upper limit with the exception of a small number of rare cell types (Additional file 1: Fig. S8C, D). Thus, we suggest that geneBasis captures biological heterogeneity that is capable of overcoming technical differences between scRNA-seq and seqFISH.