scPred: accurate supervised method for cell-type classification from single-cell RNA-seq data

Single-cell RNA sequencing has enabled the characterization of highly specific cell types in many tissues, as well as both primary and stem cell-derived cell lines. An important facet of these studies is the ability to identify the transcriptional signatures that define a cell type or state. In theory, this information can be used to classify an individual cell based on its transcriptional profile. Here, we present scPred, a new generalizable method that provides highly accurate classification of single cells by combining unbiased feature selection from a reduced-dimension space with a machine-learning, probability-based prediction method. We apply scPred to scRNA-seq data from pancreatic tissue, mononuclear cells, colorectal tumor biopsies, and circulating dendritic cells and show that scPred is able to classify individual cells with high accuracy. The generalized method is available at https://github.com/powellgenomicslab/scPred/.

Effect of the sequencing depth on prediction performance of gastric tumour cells. The sensitivity of the classification showed no changes across sequencing depths, while the specificity, AUROC, and AUPRC show a considerable decrease once the average number of reads per cell reaches 20,000. An average sequencing depth of 20,000 reads per cell represents approximately 50% sequencing saturation of the original library. The specificity dropped to zero when the average sequencing depth was 10,000 reads. Ten bootstrap replicates were performed.
Fig. S3. Effect of the number of cells on prediction performance of gastric tumour cells. Sensitivity and specificity decay in direct proportion to the number of cells used to train the classifiers. When only 100 cells were included, the mean sensitivity fell to 0.741, whilst the specificity changed from 0.974 to 0.885 and the F1 score from 0.990 to 0.776 with respect to the 953 cells used originally. The AUROC and AUPRC showed a minimal decrease. Ten bootstrap replicates were performed.
Fig. S4. Distribution of conditional class probabilities for single cells from the Baron test dataset across all four models. Each panel corresponds to a true cell-type class and each color to the class predicted by scPred. The right-skewed distributions for α, β, δ, and γ cells indicate a high-confidence prediction for most cells from the islets of Langerhans. The left-skewed distributions for "Other" cells suggest that most of these cells are not likely to belong to any of the cell types of interest.
Table S2. Summary of pancreas datasets. The training dataset consisted of 4,292 cells and 32 human samples in total. All datasets were generated using different protocols. All four samples from the Muraro dataset derive from healthy individuals, as do 6 samples from the Segerstolpe dataset and 12 from Xin. The remaining 10 samples from the training reference and 4 from the testing phase come from diabetic individuals. The incorporation of 32 individuals (both healthy and diabetic) to train the prediction model captured a broad biological variability to assess the cell identity of pancreatic cells in other datasets.
Table S3: Prediction results of pancreatic cells from Baron dataset. The first column shows the classes predicted by scPred and the remaining columns the true classes. Values along the diagonal correspond to the number of cells that were correctly classified. The "Unassigned" label is used by scPred when a cell cannot be classified with confidence as Alpha, Beta, Delta or Gamma. The "Other" column comprises all cell types other than those from the islets of Langerhans.

Method
Additional file 2: Table S5: Prediction performance of pancreatic cells from Baron et al. dataset using different prediction models described in Table S1. Using a threshold of 0.9 to define class identity for each cell, the support vector machine model with a radial kernel performed better than the other models, as its prediction results show high specificity (for other cells) and high sensitivity (for cell types from the islets of Langerhans).
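The thresholding rule described above can be illustrated with a minimal sketch. It assumes that each cell comes with one probability per target class and that a cell receives a class label only when its highest probability exceeds the threshold; the function name, the strict inequality, and the toy probabilities are illustrative assumptions, not part of scPred's interface.

```python
def assign_label(class_probs, threshold=0.9, unassigned="Unassigned"):
    """Return the most probable class, or `unassigned` if it is not confident enough.

    class_probs: dict mapping class name -> probability for one cell (hypothetical input).
    threshold:   0.9 as in the caption; the strict '>' comparison is an assumption.
    """
    best_class = max(class_probs, key=class_probs.get)
    if class_probs[best_class] > threshold:
        return best_class
    return unassigned


# Toy probabilities for two cells (made-up values for illustration only).
cells = {
    "cell_1": {"Alpha": 0.97, "Beta": 0.01, "Delta": 0.01, "Gamma": 0.01},
    "cell_2": {"Alpha": 0.55, "Beta": 0.40, "Delta": 0.03, "Gamma": 0.02},
}
labels = {name: assign_label(probs) for name, probs in cells.items()}
print(labels)
```

Under this rule an ambiguous cell such as `cell_2` is left unassigned rather than forced into the nearest class, which is what allows cells outside the islets of Langerhans to be rejected.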
Additional file 2: Table S6: Prediction results using Baron dataset as reference. Values along the diagonal represent the sensitivities for the target cell-type classes. The average sensitivity and specificity across all three datasets were 0.89 and 0.40 respectively. Contingency tables are also shown for all predictions.
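As a hedged illustration of how such values can be derived from a contingency table, the sketch below counts (predicted, true) label pairs and computes per-class sensitivity as the diagonal count divided by the column total. Treating specificity as the fraction of "Other" cells left "Unassigned" is one reading consistent with the captions, not a definition confirmed by the source; all labels and counts are toy data.

```python
from collections import Counter


def contingency(predicted, truth):
    """Count cells for every (predicted label, true label) pair."""
    return Counter(zip(predicted, truth))


def sensitivity(table, cell_type):
    """Fraction of true `cell_type` cells that were predicted as `cell_type`."""
    column_total = sum(n for (_, true), n in table.items() if true == cell_type)
    return table[(cell_type, cell_type)] / column_total


def specificity_for_other(table, other="Other", unassigned="Unassigned"):
    """Assumed definition: fraction of `other` cells that were left unassigned."""
    column_total = sum(n for (_, true), n in table.items() if true == other)
    return table[(unassigned, other)] / column_total


# Toy labels (made up for illustration): 4 Alpha, 2 Beta, 4 Other cells.
truth = ["Alpha"] * 4 + ["Beta"] * 2 + ["Other"] * 4
predicted = ["Alpha", "Alpha", "Alpha", "Unassigned",
             "Beta", "Beta",
             "Unassigned", "Unassigned", "Unassigned", "Alpha"]

table = contingency(predicted, truth)
print(sensitivity(table, "Alpha"), sensitivity(table, "Beta"),
      specificity_for_other(table))
```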
Additional file 2: Table S7: Classification performance of scmap-cluster using the Baron dataset as training. Classifier was applied to Muraro, Segerstolpe, and Xin datasets. All gamma cells were labeled as "unassigned".
Additional file 2: Table S8: Classification performance of scmap-cell using the Baron dataset as training. Classifier was applied to Muraro, Segerstolpe, and Xin datasets. All gamma cells were labeled as "unassigned".
Additional file 2: Table S9: Classification performance of caSTLe using the Baron dataset as training. Classifier was applied to Muraro, Segerstolpe, and Xin datasets. All gamma cells from the Segerstolpe and Xin datasets were labeled incorrectly.
Additional file 2: Table S10: Classification performance of singleCellNet using the Baron dataset as training. Classifier was applied to Muraro, Segerstolpe, and Xin datasets. Gamma cells had an accuracy lower than 10% in all datasets.
Additional file 2: Except for random forests, all models showed a high accuracy for dendritic cells from peripheral blood. For cord blood-derived cells, wide differences are observable across models due to the presence of other subpopulations. The support vector machine model showed the best accuracy for peripheral blood-derived dendritic cells. Accuracy is defined as the fraction of real dendritic cells correctly predicted by scPred.
Additional file 2: Table S14: Differentially expressed genes between cells left unassigned by scPred and the remaining cord blood-derived cells. Top upregulated genes include the T-cell receptor gamma locus and myeloid-related genes.
Additional file 2: Table S15: Gene ontology overrepresentation results of overexpressed genes from unassigned cells. A Fisher's exact test with FDR multiple-testing correction was used. Biological processes involving myeloid cells and neutrophils were overrepresented. The X06776, XIST, BC039116, M64936, TARP, FCGR1C, and ECRP gene identifiers did not map to the query database.
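For readers unfamiliar with this kind of overrepresentation analysis, the following is a minimal sketch using only the Python standard library. It assumes a one-sided Fisher's exact test (the upper tail of the hypergeometric distribution) and the Benjamini-Hochberg procedure for FDR control; the caption does not state which FDR procedure or which tool was used, so both choices, the function names, and the example p-values are assumptions for illustration.

```python
from math import comb


def fisher_overrepresentation(k, n, K, N):
    """One-sided Fisher's exact p-value, P(X >= k), under the hypergeometric null.

    k: query genes annotated to the term    n: query genes in total
    K: background genes annotated to term   N: background genes in total
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four hypothetical GO terms.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

In practice each GO term would contribute one `fisher_overrepresentation` p-value, and the adjusted values would then be compared against the chosen FDR cutoff.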