Accounting for cell type hierarchy in evaluating single cell RNA-seq clustering

Cell clustering is one of the most common routines in single cell RNA-seq data analyses, for which a number of specialized methods are available. The evaluation of these methods typically ignores an important biological characteristic: the structure of a population of cells is hierarchical. Ignoring this hierarchy can lead to misleading evaluation results. In this work, we develop two new metrics that take into account the hierarchical structure of cell types. We illustrate the application of the new metrics in constructed examples as well as several real single cell datasets and show that they provide more biologically plausible results.

The structured entropy that reflects the separation of the branches is defined as

Obviously, when all cells are considered as one type, H(R_1) = 0. When all J types are considered distinct, setting d_j = 1 for all j reduces H* to H. In general, however, smaller d values on the lower branches give less weight to the fine details of the tree than to its main structure. Two trees with the same topology but different branch lengths can therefore lead to different values of the structured entropy H*.

Figure S2A represents four clusters that are essentially distinct, with all d close to 1; thus H* is approximately the same as the classical entropy. Figure S2B represents two large classes, each with two subgroups that are very similar, in which case the weighted entropy of the four classes is only slightly greater than the entropy of the division into two large classes (ignoring the entropy within subgroups).

S3 Sensitivity analyses
There are various ways to set the weight matrices to reflect the partial credit or penalty a user wishes to attribute to different clustering results. The weights may be derived from prior biological knowledge such as cell lineage. When external information for constructing the weights is lacking, they may be estimated from the same dataset on which the clustering is performed. For illustration purposes, the examples shown in Figure 2 and Figures S3-S5 use such internal weights. This, however, is not the preferred method and should only be used when there is no external information on the hierarchical relationships between cell types. In general, we recommend using external public databases and biological knowledge for well-studied cell types. Below we assess the sensitivity of the proposed weighted metrics to the method of constructing the weight matrices.
S3.1 Sensitivity to the number of genes used to compute the weights
When weights are computed using the same dataset on which the clustering is performed, we use the genes with the most variable expression across cell types. Specifically, we compute the mean expression level of each gene within each cell type, and then compute the variance of these mean expression levels across cell types. The top-ranked genes with the largest variances are selected as features, and their expression levels are used to compute the weights. The number of genes used in this procedure has some impact on the weights, and subsequently on the values of the weighted metrics.
Figure S6 shows wRI and wNMI from different methods, using different numbers of genes to construct weight matrices from the Brain and PBMC2 datasets. The wRI generally shows a modest increase as more genes are used to compute the similarity measure. This is not surprising because when more genes are included, cells may appear more similar. On the other hand, the wNMI reaches its maximum for all methods when 500 genes are used to construct the weights. In general, we find that the relative performance of the methods under the weighted metrics (wRI and wNMI) is similar regardless of the number of genes used. This is, however, not true when using unweighted metrics (traditional RI and NMI). For example, Seurat is significantly better than monocle, CIDR, and TSCAN in the PBMC2 dataset in terms of RI. Under wRI, these methods become rather similar when 500 or more genes are used in constructing the weights.
When using the same dataset to construct weights, an alternative to selecting highly variable genes is to directly identify genes with differential expression. Since reference cell labels are needed for the traditional RI/NMI calculation and are therefore already available, we can identify genes that demonstrate differential expression for each cell type at a given false discovery rate threshold. As discussed in the manuscript, it is critical that the choice of the reference tree/weights is made independently of the clustering methods to avoid cherry-picking.
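The variance-based gene selection described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the figures: the function names `select_top_variable_genes` and `correlation_similarity` are hypothetical, the expression matrix is assumed to be a cells-by-genes NumPy array that has already been normalized, and Pearson's correlation between cell-type mean profiles is used as the similarity. How similarities are subsequently turned into weights (for example, how negative correlations are treated) is not specified here and is left out.

```python
import numpy as np


def select_top_variable_genes(expr, labels, n_genes):
    """Rank genes by the variance of their cell-type means (hypothetical helper).

    expr    : array of shape (n_cells, n_genes), assumed already normalized
    labels  : reference cell type label for each cell
    n_genes : number of top-ranked genes to keep
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    cell_types = np.unique(labels)
    # Mean expression of every gene within each cell type: (n_types, n_genes).
    type_means = np.vstack([expr[labels == t].mean(axis=0) for t in cell_types])
    # Variance of those mean expression levels across cell types, per gene.
    gene_var = type_means.var(axis=0)
    # Indices of the genes with the largest variances (stable sort for ties).
    top_genes = np.argsort(-gene_var, kind="stable")[:n_genes]
    return cell_types, type_means, top_genes


def correlation_similarity(type_means, gene_idx):
    """Pearson correlation between cell types over the selected genes."""
    # np.corrcoef treats each row (one cell type) as a variable.
    return np.corrcoef(type_means[:, gene_idx])
```

Repeating the two calls for several values of `n_genes` gives one similarity matrix per setting, which is the kind of comparison summarized in Figure S6.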

S3.2 Sensitivity to the similarity measure
In the illustrations in Figure 2 and Figures S3-S5, Pearson's correlation coefficient r is used as the similarity measure for constructing the weight matrices used in the wRI. Other choices are qualitatively consistent. For example, since the power function is monotonic on nonnegative values, r^k with other powers k can also be used as a similarity measure that preserves the ordering of similarity between cell types. As k increases, the weights become closer to the identity matrix; thus, the weighted index reduces to the unweighted index as k goes to infinity. Figure S7 shows the sensitivity of wRI to different choices of similarity measure. In particular, we construct weight matrices for wRI using Pearson's correlation (r) or R^2. In general, using R^2 results in slightly lower wRI, but the relative order of performance of the different methods remains mostly the same as when using r.
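The effect of the power k can be checked numerically with a small sketch. The matrix `r` below is a made-up example, not data from the paper, and the helper name `power_similarity` is hypothetical. The sketch assumes nonnegative correlations with off-diagonal entries strictly below 1, so that raising to a power preserves their ordering and drives them toward zero.

```python
import numpy as np


def power_similarity(r, k):
    """Elementwise r**k for a similarity matrix with entries in [0, 1]."""
    return np.asarray(r, dtype=float) ** k


# Made-up similarity matrix for three cell types (illustration only).
r = np.array([
    [1.0, 0.9, 0.3],
    [0.9, 1.0, 0.2],
    [0.3, 0.2, 1.0],
])

for k in (1, 2, 10, 100):
    w = power_similarity(r, k)
    # Largest absolute deviation of the weight matrix from the identity.
    max_dev = np.abs(w - np.eye(r.shape[0])).max()
    print(f"k={k}: max deviation from identity = {max_dev:.3g}")
```

The diagonal stays at 1 for every k, while the off-diagonal entries shrink, which is why the weighted index approaches the unweighted one for large k.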