Skip to main content

Table 1 Statistical methods for single-cell RNA-seq experiments

From: Design and computational analysis of single-cell RNA-sequencing experiments

Name Description Requirements/deliverables
Normalization
GRM [57] Fits polynomial gamma regression model to FPKM data from spike-ins; estimated parameters are used to convert FPKM of endogenous genes to an absolute scale within each cell Performs within cell normalization and may be used with FPKM, RPKM, or TPM
SAMstrt [56] The resampling-based bulk normalization method in SAMseq is applied to spike-ins Assumes that an equal number of spike-in control RNA molecules have been added to all samples
Identifying highly variable genes
Brennecke et al. [48] A gamma generalized linear model fit to the mean-variance relationship quantified by the square of the coefficient of variation (CV2) of the spike-ins estimates technical noise parameters. These parameters are then used to estimate technical variability for endogenous genes and to test whether each gene exceeds a variability threshold Spike-ins and endogenous genes are normalized separately using the median normalization method. Gene specific P values are provided to identify highly variable genes
Kim et al. [63] Uses spike-ins to estimate parameters related to technical variance, allowing for differences in variability across cells. Estimates gene-specific biological variability by subtracting technical variability from total variance Normalization factors are estimated using the median normalization method. A simulation based framework to test for highly variable genes is provided
BASiCS [54] Jointly models spike-ins and endogenous genes as two Poisson-Gamma hierarchicalmodels with shared parameters Estimates normalization parameters jointly across all genes. Gene-specific posterior probabilities are provided to identify both lowly and highly variable genes
Noise reduction
scLVM [47] Uses a Gaussian Process Latent variable model to estimate the covariance matrix associated with latent factors. Residuals from a linear mixed model with the covariance term represent de-noised expression estimates Requires genes associated with the latent factor to be identified a priori. Normalization factors are estimated using the median normalization method
OEFinder [12] Uses orthogonal polynomial regression to identify genes whose expression is associated with position on the C1 Fluidigm integrated fluidic circuit (IFC) Gene-specific P values are provided to identify genes affected by the artifact
Sub-population identification
ZIFA [70] Models dropout rate as a function of expression in a factor analysis (linear dimension reduction) framework Requires normalized, log-transformed estimates of gene expression (zeros are not transformed)
Destiny [81, 82] Extends diffusion maps (a non-linear dimension reduction approach) to handle zeros and sampling density heterogeneities inherent in single cell data Requires variance-stabilized gene expression estimates; works best with a large number of cells
SNN-Cliq [71] Clusters cells by identifying and merging sub-graphs (quasi-cliques) in a shared nearest neighbor (SNN) graph; the number of clusters is chosen automatically Requires a reduced set of genes. Xu and Su [71] recommend using genes with average RPKM >20 and using a log transformation to reduce the effect of outliers. Relies on a valid choice of graph parameters
RaceID [59] Uses k-means applied to a similarity matrix of Pearson’s correlation coefficients for all pairs of cells; the number of clusters is chosen using the gap statistic. Outlier cells are those that cannot be explained by a background model that accounts for technical and biological noise. In a second step, rare subpopulations can be identified and outlier cells may be merged to an outlier cluster; new cluster centers are then computed and each cell is assigned to the most highly correlated cluster center Requires a reduced set of genes. Grün et al. [59] consider genes with a minimum of five transcripts in at least one cell
SCUBA [73] Uses k-means to cluster data along a binary tree detailing bifurcation events for time-course data. Models expression regulation along the tree using bifurcation theory Requires a reduced set of genes. Marco et al. [73] recommend using the 1000 most variable genes that are expressed in at least 30 % of cells
BackSPIN [60] Iteratively splits a two-way sorted (by both genes and cells) expression matrix into two clusters containing independent cells and genes, for a maximum number of splits. The algorithm has a stopping condition to avoid splitting data that are very homogeneous Requires a reduced set of genes and the maximum number of splits allowed. Zeisel et al. [60] recommend selecting the top 5000 genes that have the largest residuals after fitting a simple noise model
PCA/t-SNE [69] Linear/non-linear dimension reduction approach used for unsupervised clustering of cells Input is typically a correlation or similarity matrix
PAGODA [68] Allows for both detection and interpretation of the transcriptional heterogeneity within a cell population. A weighted principal component analysis (PCA) is conducted for each gene set; those sets for which the variance explained by the first principal component significantly exceeds genome-wide background expectation are identified. To provide a non-redundant view of heterogeneity structure, principal components from different gene sets showing high similarity are combined to form a single component of heterogeneity Requires un-normalized gene expression counts (performs internal correction as in SCDE). Uses gene ontology (GO) annotated or user-defined gene sets
Differential detection
MAST [76] A logistic regression model is used to test differential expression rate between groups while a Gaussian generalized linear model (GLM) describes expression conditionally on non-zero expression estimates. Models are corrected for cellular detection rate Requires normalized gene expression estimates and provides gene-specific P values from summing likelihood ratio or Wald tests from the two components
SCDE [77] Models gene-specific expression as a two-component mixture: a Poisson component describes zero and a Negative Binomial describes non-zero measurements Requires un-normalized gene expression counts (performs internal correction) and provides gene-specific posterior probabilities of differential expression (DE) between two biological conditions. Tests for DE are performed on non-zeros
scDD [78] Models expressed counts as a Dirichlet process mixture (DPM) of normals to test for differentially distributed (DD) genes associated with multi-modality in the expressed component. Samples from the posterior further characterize the gene-specific distributional difference between two biological conditions to identify genes that are differentially expressed (DE), differ in the proportion of cells within modes (DP), differ in the number of modes (DM), or are both DE and DM (DB) Requires normalized, log-transformed gene expression estimates and provides gene-specific P values (or a false discovery rate (FDR)-controlled list) of DD genes between two biological conditions. Each DD gene is then classified into a specific type of distributional difference
Pseudotemporal ordering
Monocle [36] Reduces data using independent component analysis (ICA) and constructs a minimum spanning tree (MST) to order cells in pseudotime Requires normalized, log-transformed gene expression estimates and a reduced set of genes. Trapnell et al. [39] recommend identifying genes that are differentially expressed between time points or, if data at multiple time points are not available, choosing genes above a mean and variance threshold
Waterfall [80] Unsupervised clustering is used to identify clusters of cells for which a putative ordering is determined on the basis of their relative location in a PCA plot. K-means clustering of single-cell transcriptomes on the PCA plot and an MST that connects cluster centers determines pseudotime Requires normalized estimates of gene expression with outliers removed
Sincell [83] A flexible R workflow for building cell hierarchies with multiple options for dimension reduction, clustering, and graph building. Allows the user to assess the similarity of graphs and performs resampling or random cell substitution with simulated replicates to assess the robustness of estimated hierarchies Requires normalized, log-transformed gene expression estimates and a reduced set of genes. Juliá et al. [83] recommend identifying highly variable genes
Oscope [11] Uses a paired-sine model and K-medoids clustering to identify groups of oscillatory genes. For each oscillatory group, an extended nearest insertion algorithm is used to construct the cyclic order of cells, defined as the order that specifies each cell’s position within one cycle of the oscillation of that group Identifies groups of oscillatory genes, when present. Requires normalized gene expression and use of only high mean, high variance genes is recommended
Wanderlust [93] Cells are represented as nodes in an ensemble of k-nearest neighbor graphs. For each graph, a user-defined starting cell is used to calculate an orientation trajectory by iteratively computing the shortest-path distance between cells. The final trajectory is an average over all graphs Developed using single-cell mass cytometry data, which typically describe few genes (<50) and tens of thousands of cells