A clusterability measure for single-cell transcriptomics reveals phenotypic subpopulations

The ability to discover new cell populations by unsupervised clustering of single-cell transcriptomics data has revolutionized biology. Currently, there is no principled way to decide, whether a cluster of cells contains meaningful subpopulations that should be further resolved. Here we present SIGMA, a clusterability measure derived from random matrix theory, that can be used to identify cell clusters with non-random sub-structure, testably leading to the discovery of previously overlooked phenotypes.

To alleviate this problem, we developed SIGnal-Measurement-Angle (SIGMA), a clusterability measure for scRNA-seq data.We consider clusterability to be the theoretically achievable agreement with the unknown ground truth clustering, for a given signal-to-noise ratio.
Importantly, our measure can estimate the level of achievable agreement without knowledge of the ground truth.High clusterability (indicated by SIGMA close to 1) means that multiple phenotypic subpopulations are present and clustering algorithms should be able to distinguish them.Low clusterability (indicated by SIGMA close to 0) means that the noise is too strong for even the best possible clustering algorithm to find any clusters accurately.If SIGMA equals 0, the observed variability within a cluster is consistent with random noise.
To derive SIGMA, we considered the unobserved, actual gene expression profiles (the signal matrix) as a perturbation to a random noise matrix (Fig. 1a).This is the exact opposite of the conventional view, which considers noise as a perturbation to a signal.Note that both the biological variability within a phenotype as well as technical variability (due to variable capture and conversion efficiencies etc.) contribute to random noise.Our point of view allowed us to leverage well-established results from random matrix theory 6,7 and perturbation theory 8 .Briefly, we first calculate the singular value distribution of the measured expression matrix.If the data is preprocessed appropriately (Extended Data Fig. 1), the bulk of this distribution is described by the Marchenko-Pastur (MP) distribution, which corresponds to the random component of the measurement.The singular values outside of the MP distribution and above the Tracy Widom (TW) threshold correspond to the signal (i.e. the unobserved gene expression profiles).Using just these singular values and the dimensions of the measurement matrix, we can calculate the angles between the singular vectors of the measured expression matrix and those of the (unobserved) signal matrix.SIGMA is the squared cosine of the smallest angle.See Supplementary Note 1 for a detailed derivation.Simulations of data sets with varying signal-to-noise ratios illustrate the calculation of SIGMA (Fig. 1b,c).Data sets with higher signal-to-noise ratios have more easily separable clusters and larger singular values outside of the MP distribution (Fig. 1b).By definition, that results in higher values of SIGMA (Fig. 1c).
To show that SIGMA is a proxy for clusterability, we have to make the concept of clusterability more precise and quantifiable.First, we adopted the Adjusted Rand Index (ARI) 9 as a wellestablished measure for the agreement between an empirically obtained clustering and the ground truth.Next, we argued that perfect agreement with the ground truth (ARI = 1) is not achievable in the presence of noise, even with the best conceivable clustering algorithm.Using a simple case of two clusters of cells with varying signal-to-noise ratios, we estimated the Bayesian error rate 10 (i.e. the lowest possible error) for this binary classification problem in simulated data (Extended Data Fig. 2a).Based on this error rate, we calculated a theoretically achievable ARI (tARI, see also Supplementary Note 1).We showed empirically that commonly used clustering methods do not exceed this limit (Extended Data Fig. 2b,c).The tARI, therefore, quantifies our notion of clusterability.Importantly, SIGMA is strongly correlated with the tARI (Fig. 1d) and thus allows us to estimate clusterability without knowing the ground truth.To confirm this result in experimentally measured data, we chose two very distinct clusters from a PBMC data set 11 and created two new clusters as weighted averages, which allowed us control over the signal-to-noise ratio.Also for this data, SIGMA strongly correlates with the tARI (Fig. 1d).As an alternative to the tARI, we also calculated the theoretically achievable silhouette coefficient 12 (tSIL), which considers the distances between the best possible clusters (Extended Data Fig. 3 a-c).The tSIL quickly jumps to higher values for minimal deviations from pure noise, due to the correct classification of a few outlier cells, which makes it less useful for assessing overall clusterability.We also compared SIGMA to ROGUE 13 , a recently published clusterability measure (Extended Data Fig. 3d).In contrast to it.This is not a direct replacement for differential expression tests, but a way to understand the variability within the cell-singular vectors.
Finally, we tested the performance of SIGMA and G-SIGMA in measurements of complex tissues.In a data set of bone marrow mononuclear cells (BMNC) 16 we calculated SIGMA for the clusters reported by the authors.After correction for confounding factors (Extended Data Fig. 4 d,e), SIGMA corresponded well with a visual inspection of the cluster UMAPs (Fig. 2a).
For all clusters, the bulk of the singular value distribution was well-described by the MP distribution and, by construction, only clusters with SIGMA > 0 had significant singular values (Fig. 2b).Reassuringly, many progenitor cell types received a high SIGMA (indicating possible sub-structure) in agreement with the known higher variability in these cell types.Ranking existing clusters by G-SIGMA resulted in a very similar order (Extended Data Fig. 7a).To confirm the presence of relevant sub-structure, we sub-clustered the two original clusters with the highest SIGMA (Extended Data Fig. 7 b-e).In the red blood cell (RBC) progenitors, we identified 4 subsets that correspond to different stages of differentiation, ranging from erythroid precursors to highly differentiated RBCs as identified by F.V Mello et al. 17 .In the dendritic cell (DC) progenitor cluster, two sub-clusters were identified, which correspond to precursors of either classical or plasmacytoid DCs 18 .For both examples, the variance-driving genes found in the gene-singular vectors were localized to their corresponding clusters (Extended Data Fig. 7 c,d) and overlapped strongly with differentially expressed genes found after subclustering (see Supplementary Table 2).
In a second example, we applied SIGMA to a fetal human kidney data set we published previously 19 (Fig 3a).As for BMNCs, SIGMA corresponded well with a qualitative assessment of cluster variability and G-SIGMA resulted in a similar ranking (Extended Data Fig. 8a).Subclustering of the cluster with the highest SIGMA, ureteric bud/collecting duct (UBCD), revealed a subset of cells with markers of urothelial cells (UPK1A, KRT7) (Fig. 3b, Extended Data Fig. 8b-e).Immunostaining of these two genes, together with CDH1 expressed in the collecting system, in week 15 fetal kidney sections confirmed the presence of the urothelial subcluster (Fig. 3c, Extended Data Fig. 9a).Another subset of cells we did not find in our original analysis, are the parietal epithelial cells (PECs), which could now be identified within the SSBpr cluster (S-shaped body proximal precursor cells) (Fig. 3b, Extended Data Fig. 8b-e).To reveal these cells in situ, we stained for AKAP12 and CAV2, which were among the top differentially expressed genes in this subcluster (Supplementary Table 3), together with CLDN1, a known marker of PECs, and MAFB, a marker of the neighboring podocytes (Fig. 3d, Extended Data Fig. 9b).Together with the PECs and proximal tubule precursor cells, SSBpr also contained a few cells that were misclassified in the original analysis, indicating the additional usefulness of SIGMA as a means to identify clustering errors.Further analysis of a cluster of interstitial cells (ICa) revealed multiple subpopulations (Fig. 3b, Extended Data Fig. 8b-e).
Immunostainings revealed that a POSTN-positive population is found mostly in the cortex, often surrounding blood vessels, whereas a SULT1E1-positive population is located in the inner medulla and papilla, often surrounding tubules (Fig. 3e, Extended Data Fig. 9c).
CLDN11, another gene identified by analysis of the gene-singular vectors (Extended Data Fig. 8b-e) was found mostly in the medulla, but also in the outermost cortex.A more detailed, biological interpretation of the results can be found in Supplementary Note 2.
In summary, we presented SIGMA, a clusterability measure that can help to detect easily overlooked, subtle phenotypes in scRNA-seq data.Our approach also identifies variancedriving genes and brings renewed awareness to random noise as a factor setting hard limits on clustering and identifying differential expression.

Preprocessing
Before applying the method to simulated or measured single-cell RNA-seq data sets, several preprocessing steps are necessary.The raw counts are first normalized and log-transformed.
Next, the expression matrix is standardized, first gene-wise, then cell-wise.These steps assure the proper agreement of the bulk of the singular value distribution with the MP distribution (Extended Data Fig. 1).See also Supplementary Note, Section 3.1.

Signal-Measurement angle (SIGMA)
SIGMA is based on the assumption that the expression matrix  " measured by scRNA-seq, can be written as the sum of a random matrix , which contains random biological variability and technical noise, and a signal matrix , which contains the unobserved expression profiles of each cell:

𝑋 " = 𝑋 + 𝑃
Note that in this decomposition, cells that belong to the same cell type have identical expression profiles in the signal matrix .This notion of clusterability, based on the signal-tonoise ratio, is inspired by the notion of detectability in networks 20,21 .
Treating the signal matrix  as a perturbation to the random matrix , we can apply results from both random matrix theory and low-rank perturbation theory.Random matrix theory 22,23 predicts that the singular value distribution of  is a Marchenko-Pastur (MP) distribution 7,24,25 , which coincides with the bulk of the singular value distribution 6,26,27 of  ' .The singular values of  " above the values predicted by the MP distribution characterize the signal matrix .Since the agreement with the MP distribution holds strictly only for infinite matrices, we use two additional concepts to identify relevant singular values exceeding the range defined by the MP distribution.The Tracy-Widom 25,28 (TW) distribution describes the probability of a singular value to exceed the MP distribution, if the matrix is finite.Additionally, since singular vectors of a random matrix are normally distributed, relevant singular vectors have to be significantly different from normal 6 .To test for normality we used the Shapiro-Wilk test.
We apply low-rank perturbation theory 8 to calculate the singular values ( * ) of  from the relevant singular values ( * ) of the measured expression matrix  ' : where c is the cell-to-gene ratio, i.e. the total number of cells divided by the total number of genes.
The values of  * are then used to obtain the angles  * between the singular vectors of  " and .These angles are conveniently expressed in terms of their squared cosine as > ?@ 4> ?@ =;5 .
The squared cosine of the smallest angle, i.e. the largest squared cosine, is then used as a measure of clusterability: For a detailed derivation of SIGMA, see Supplementary Note 1, Section 2.1-2.4.

Confounder regression
scRNA-seq data contains various confounding factors that drive uninformative variability.
These either emerge from technical issues (such as the varying efficiency of transcript recovery, which cannot be fully eliminated by normalization) or biological factors (such as cell cycle phase, metabolic state, or stress), see Extended Data Fig. 4. To account for these factors, a regression step, inspired by current gene expression normalization methods 11,15 , is included.If a singular vector is biased by any of the considered confounders, its singular value will be reduced, which leads to a lower SIGMA value.See also Supplementary Note, Section 3.2.

Theoretically achievable clustering quality
To construct a Bayes classifier 10 , which achieves the minimal error rate, we need to know the ground truth clustering.Hence, we used data simulated with Splatter 29 , containing two clusters.For each ground truth cluster, we fit a multidimensional Gaussian to the corresponding elements of the singular vectors (see Extended Data Fig. 2a).We only consider singular vectors with singular values larger than predicted by the MP distribution.For the fit, we use the mclust 30 R package (V 5.4.6).We then construct a classifier by assigning a cell to the cluster for which it has the highest value of the fitted Gaussian distribution.This classifier is thus approximately a Bayes classifier (for a true Bayes classifier, we would need to know the exact distributions of the singular vector entries).The ARI 9 calculated based on this classification is thus approximately the best theoretically achievable ARI (tARI).
The silhouette coefficient 12 was calculated on Euclidean distances in the first singular vectors and the average silhouette coefficient was reported.In the RNA-mix data, Euclidean distances were calculated using singular vectors whose singular values exceed the range defined by the MP distribution and the ground truth clustering.For the simulated data sets with 2 clusters, the silhouette coefficient was calculated on the first singular vector and the clusterings produced by the different methods (Extended Data Fig. 3).tSIL was calculated with Euclidean distances in the first singular vector on the best theoretically achievable clustering.The calculation of tARI and tSIL is described in more detail in Supplementary Note 1, section 2.5.

Clustering methods
For For the RNA-mix data set ROGUE (V 1.0) was used with 1 sample (see Fig S5 ), "UMI" platform, and a span of 0.6.For the simulated data sets, ROGUE was used with k = 10 (Extended Data Fig. 2 d).

Variance driving genes
Genes that drive the variance in the significant singular vectors can be used to explore the biological information in the sub-structures.Genes with large positive or negative entries in a gene-singular vector are localized in cells with high positive or negative entries in the corresponding cell-singular vector.It is also possible to assess the signal-to-noise ratio for the genes by calculating the angle between the gene singular vectors of the measured expression matrix  " and the gene singular vectors of the signal matrix , given by 15 , where c is the cell-to-gene ratio.We call  N the gene SIGMA (G-SIGMA).See Supplementary Note 1, section 2.4 for a more detailed discussion.

Data sets
Simulated data were produced with the splatter 29 R package (V 1.10.1).The parameters used for the simulation are shown in Supplementary Table 1.For Fig. 1c, Extended Data Fig. 2b, Extended Data Fig. 2c, Extended Data Fig. 3, and Extended Data Fig. 6a  ).For the calculation of the tARI, clustering with Scanpy, TSCAN, k-means, and hierarchical clustering, preprocessing was performed with the scanpy python package (V 1.4.6)following the provided pipeline (https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html) for the filtering of cells and genes, normalization, and log-transformation as well as cluster annotation.
For the clustering with Seurat, the provided Seurat pipeline was used (https://satijalab.org/seurat/archive/v3.2/pbmc3k_tutorial.html)for preprocessing, such as cell and gene filtering, normalization, log-transformation and cluster annotation using the Seurat R package (V 3.1.5).CD8 T cells and B cells were extracted from the data and each cluster was standardized gene-wise and cell-wise before the calculation of the singular value decomposition.To remove any sub-structure in these clusters and before the reconstruction of the matrices from the SVD, singular values above the MP distribution were moved into the bulk, and the transcriptome mode (i.e. the singular vector that would have the largest singular value without normalization, see Supplementary Methods RNA-mix data 14 was downloaded from the provided GitHub page.The data were normalized with the R scran package (V 1.14.6) and then log-transformed.Confounder regression was performed for the total number of transcripts, average mitochondrial gene expression, and average ribosomal gene expression.Two different merged clusters were created from the provided RNA mixtures as shown in Extended Data Fig. 5.
Bone marrow mononuclear cell data set (BMNC) 16 was downloaded from the R package SeuratData (bmcite, V 0.2.1).Normalization and the calculation of the G2M score 32 were performed with the Seurat R package (V 3.1.5).Confounder regression was performed for the log-transformed total number of transcripts, cell cycle score, and average expressions of each: mitochondrial genes and ribosomal genes (list obtained from the HGNC website).
For the fetal kidney data set 19 , the same preprocessing and normalization was used as reported previously (scran R package 33 ).The data was then log-transformed and the G2M score was calculated with the Seurat R package.Confounder regression was performed for the log-transformed total number of transcripts, G2M scores, and the average expressions of each: mitochondrial genes, ribosomal genes, and stress-related genes 34 .

Differential expression test
Differentially expressed genes within the sub-clusters found in Extended Data Fig. 7 and Extended Data Fig. 8 were calculated with the function findMarkers of the scran R package on log-transformed normalized counts.Genes with a false discovery rate below 0.05 were selected and then sorted by log2 fold change.In Figures S7e and S8e, genes with the top 20 highest/lowest values in the gene singular vectors are listed and colored blue if they correspond to the top 20 DE genes.

Staining
A human fetal kidney (female) at week 15 of gestation was used for immunofluorescence using the same procedure as reported previously 19 .The following primary antibodies were used: rabbit anti-UPK1A ( the validation of the tARI and tSIL, several clustering methods were used on simulated data with two clusters.Seurat clustering 1 was performed with the Seurat R package with 10 principal components (PCs) and 20 nearest neighbors.Three different resolution parameters were used: 0.1, 0.6, and 1.6.Scanpy clustering 2 was performed with the scanpy python package with 10 PCs and 20 nearest neighbors.Three different resolution parameters were used: 0.1, 0.6, and 1.6.Hierarchical clustering 4 was performed on the first 10 PCs and Euclidean distances.The hierarchical tree was built with the Ward linkage and the tree was cut at a height where 2 clusters could be identified.K-means 3 was performed on the first 10 PCs using Euclidean distances and two centers.TSCAN31 was calculated on the first 10 PCs.ROGUEROGUE13 is an entropy-based clusterability measure.A null model is defined under the assumption of Gamma-Poisson distributed gene expression and its differential entropy is then compared to the actual differential entropy of the gene expression.

Note 1 )
was moved above the MP distribution.Then, two synthetic clusters containing 150 cells each were created from the cleaned-up original clusters.For cluster 1, a weighted average of a randomly picked B cell with expression profile  O and a randomly picked CD8 T cell with expression profile  PQR S was calculated according to:  < =  •  O + (1 − ) •  PQR S .For cluster 2, the weights were flipped:  0 = (1 − ) •  O +  •  PQR S . was chosen in a range from 0 to 1.  close to 0.5 produced highly similar clusters, while  close to 0 or 1 produced maximally different clusters (see Fig S2d).For each value of , the procedure was repeated 50 times, each with a different seed for selecting 300 cells per cell type, and the results were averaged.

Uniform
Manifold Approximation and Projections35 (UMAPs) for individual clusters were calculated with the R package umap (V 0.2.7.0) on the first 10 PCs, 20 nearest neighbors, min_dist = 0.3, and Euclidean distances.The umap for BMNC data was calculated with the Seurat R package using 2000 highly variable genes (hvg), d = 50, k = 50, min.dist= 0.6 and metric = cosine.For the fetal kidney data set a force-directed graph layout was calculated using the scanpy python package.The graph was constructed using 100 nearest neighbors, 50 PCs, and the ForceAtlas2 layout for visualization.