PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data

To investigate molecular mechanisms underlying cell state changes, a crucial analysis is to identify differentially expressed (DE) genes along the pseudotime inferred from single-cell RNA-sequencing data. However, existing methods do not account for pseudotime inference uncertainty, and they have either ill-posed p-values or restrictive models. Here we propose PseudotimeDE, a DE gene identification method that adapts to various pseudotime inference methods, accounts for pseudotime inference uncertainty, and outputs well-calibrated p-values. Comprehensive simulations and real-data applications verify that PseudotimeDE outperforms existing methods in false discovery rate control and power. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-021-02341-y).

1. In PseudotimeDE, the dispersion parameter φ j of gene j is estimated simultaneously with the mean parameter µ ij of gene j in cell i by the penalized-restricted maximum likelihood (PRML); in contrast, NBAMSeq uses the maximum a posterior (MAP) dispersion estimate calculated by R package DESeq2 [1]. NBAMSeq is designed for bulk RNA-seq and accounts for a samll sample size (i.e., number of replicates); therefore, the MAP estimate by DESeq2 is more biased (for variance reduction purpose) than the PRML estimate used by Pseudo-timeDE. Given that scRNA-seq has a large enough sample size (i.e., number of cells), we think that the PRML estimate of φ j is preferred over the MAP estimate.
2. In PseudotimeDE, the number of knots K is pre-defined as a fix number (the default K = 6, which is usually large enough). In contrast, tradeSeq randomly subsamples a small set of genes to choose a K for all genes.
3. PseudotimeDE fits a NB-GAM to each lineage, while tradeSeq incorporates a lineage covariate so that it can fit a NB-GAM to multiple lineages jointly. The reason of this difference is that PseudotimeDE needs to account for the uncertainty in pseudotime inference, and within-lineage model fitting makes the inference easier. 4. Pseudotime also implements a zero-inflated NB-GAM (ZINB-GAM) that includes a dropout probability parameter for each gene, and PseudotimeDE estimates this parameter internally.
In contrast, while tradeSeq can also turn its NB-GAM into ZINB-GAM, it estimates the dropout probability parameters externally by ZINB-WaVE [2]; NBAMSeq is designed for bulk RNA-seq data and does not consider zero inflation.
Although the NB-GAM fitting of PseudotimeDE is similar to that of the other two methods, its inference (p-value calculation) is completely different, and this is the main novelty of PseudotimeDE.

The number of knots K
The number of knots, K, a parameter in the generalized additive model (GAM), defines the number and positions of piecewise polynomial functions in a spline function.
is the smooth spline function of gene j, T i is the pseudotime of cell i, and K is the number of knots. It is worth noting that K −1 is also the nominal degree of freedom of the model. K needs to be specified before the negative binomial-GAM (NB-GAM) is fitted. For the GAM, "exact choice of K is not generally critical [3]," if the GAM is correctly used for inference. This is why we do not include the K selection step in PseudotimeDE. The default K = 6 in PseudotimeDE is usually large enough for modelling gene trajectories. Using a larger K is not harmful, but it will increase the computational time.
In contrast, tradeSeq has a K selection step, which we think is problematic. To choose K, tradeSeq randomly samples a small set of genes and uses AIC to find an "appropriate" K for all genes (see Choosing an appropriate number of knots in the tradeSeq paper [4]). We follow their vignette and obtain K = 4 in Fig. 3a and K = 6 in Fig. 3f. In tradeSeq, the null distribution of the test statistic S j is Therefore, when K = 4 (Fig. 3a), the null distribution of every gene is χ 2 3 ; when K = 6 ( Fig. 3f), the null distribution of every gene is χ 2 5 . As a result, the distributions of tradeSeq p-values under the null are quite different between Fig. 3a and Fig. 3f. It seems that the result of tradeSeq is highly sensitive to the choice of K. Such sensitivity is probably due to the incorrect null distribution.
That is, tradeSeq uses the nominal degree of freedom K − 1 in the null distribution (χ 2 K−1 ); however, the correct way is to use the effective degree of freedom, which is gene-specific and always smaller than K − 1 due to the penalization on the likelihood function. As a result, tradeSeq uses null distributions (of its test statistics) that have heavier right tails than those of the correct null distributions, resulting in conservative p-values (whose distribution under the null has a mode near 1). Details about the theory and usage of the effective degree of freedom in GAM can be found in Dr. Simon Woods' book [5] and 2013 Biometrika paper [6].