IDEAS: individual level differential expression analysis for single-cell RNA-seq data

We consider an increasingly popular study design where single-cell RNA-seq data are collected from multiple individuals and the question of interest is to find genes that are differentially expressed between two groups of individuals. Towards this end, we propose a statistical method named IDEAS (individual level differential expression analysis for scRNA-seq). For each gene, IDEAS summarizes its expression in each individual by a distribution and then assesses whether these individual-specific distributions are different between two groups of individuals. We apply IDEAS to assess gene expression differences of autism patients versus controls and COVID-19 patients with mild versus severe symptoms. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-022-02605-1).

1 Supplementary Methods 1.1 Implementation of DCA-direct estimation method DCA-direct method estimates the expression distribution of one gene across the cells of an individual by making use of the parameter estimates from DCA output in a straight forward manner. DCA takes the observed count matrix (rows for genes and columns for cells) as input and outputs four parameter matrices, normalized mean matrixM , dispersion parameter matrix Θ, dropout probability matrix Π, and mean matrix M . Each matrix has the same dimension as that of the input count matrix, with the value at each position providing a denoised parameter estimate for the corresponding gene and cell. The difference between normalized mean matrixM and mean matrix M lies in DCA's adjustment for cell-level read-depth. To account for the factors that read-depth varies among different cells, DCA first normalizes the input raw count matrix by size factors computed based on cell-level read-depth, before proceeding to learn essential latent features and generate denoised parameter estimates. As an intermediate output of DCA for estimating the mean parameters,M is on the normalized scale. While as a final output, M is on the original scale of raw count matrix, and it is derived by multiplying the size factors back toM . For DCA-direct method, we make use of normalized mean matrixM together with Θ and Π. For the simplicity of notation, we assume that the cells belong to one specific subject occupy the first J columns of the count matrix. For one specific gene i and a cell j, let π ij ,μ ij and θ ij denote the corresponding elements on position (i, j) of matrices Π,M and Θ respectively. The denoised distributions from DCA with normalized mean for the expression of this gene on the J cells are: P i1 = ZINB(π i1 ,μ i1 , θ i1 ), P i2 = ZINB(π i2 ,μ i2 , θ i2 ), . . . , P iJ = ZINB(π iJ ,μ iJ , θ iJ ).
For this individual, to get an expression distribution estimate for gene i across cells, on each possible count value, we calculate the probability estimate simply by averaging the corresponding probabilities from P ij , j = 1, 2, . . . , J. Once we have estimated distributions for individuals, we can calculate the distance matrix either by Jensen-Shannon divergence or Wasserstein distance, and compute p-values through kernel-based association test or Permutational Multivariate Analysis of Variance.

Supplementary results from simulation study 2.1 Power analysis for all possible options of IDEAS methods
We have two options for each of three choices: • methods for testing: kernel regression (KR), and Permutational Multivariate Analysis of Variance (PERMANOVA, or PS), • methods for density estimation: zero-inflated negative binomial (zinb) or kernel density estimate (kde), • methods for calculating distances of two distributions: Jensen-Shannon divergence (JSD) or Wasserstein distance (Was).
Here we show the performance are similar for all 8 options to run IDEAS from two simulation setups. The results are similar in other simulation setups. It worth noting that we did find in real data analysis, the non-parametric estimation of density (kde) has worse performance when it is necessary to account for cell level read-depth difference. In the following, if not otherwise specified, the IDEAS results are based on permutation test (PS) and Wasserstein distance (Was).  Figure S1: Evaluation of type I error and power for different versions of IDEAS methods using simulated data. (A) 5 cases vs. 5 controls, with 1080 cells per subject. (B) 13 cases vs. 10 controls, with 360 cells per subject. In the legend "KR" means kernel regression, and "PS" means Permutational Multivariate Analysis of Variance (PERMANOVA). "kde" and "zinb" are two methods to estimate the distributions by non-parametric kernel density estimates or zero-inflated negative binomial distribution, respectively. "JSD" and "was" are two methods to calculate the distance of two distributions: Jensen-Shannon divergence (JSD) or Wasserstein distance (was).

Power analysis for additional simulation setups with 360 cells per subject
Here we fix the effect sizes to be 1.2 fold change for mean expression and 1.5 fold changes for variance, and compare the results with different sample sizes. The case with sample size of 13 cases vs. 10 controls is included in Figure 2B of main text. The IDEAS results are based on permutation test (PS) and Wasserstein distance (Was), and two options to calculate density, parametric (ZINB) or non-parametric density estimation (KDE).  Figure S4: 20 cases vs. 20 controls.

Power analysis for additional simulation setups with 1080 cells per subject
Again, we fix the effect sizes to be 1.2 fold change for mean expression and 1.5 fold changes for variance, and compare the results with different sample sizes. The case with sample size of 5 cases vs. 5 controls is included in Figure 2A of main text. The IDEAS results are based on permutation test (PS) and Wasserstein distance (Was), and two options to calculate density, parametric (ZINB) or non-parametric density estimation (KDE). It is computationally much more demanding to run scDD and ZINB-WaVE when there are 1080 cells per subject. Since we have demonstrated that they have inflated type I error, we skip their results here.   Figure S7: 20 cases vs. 20 controls.

Power analysis for different effect sizes
Here we fix the sample size to be 10 cases vs. 10 controls and examine the power for different effect sizes. The methods that compare the gene expression between two groups of cells (e.g., rank-sum test, MAST) are not included because they cannot control type I error. The IDEAS results are based on permutation test (PS) and Wasserstein distance (Was), and two options to calculate density, parametric (ZINB) or non-parametric density estimation (KDE). As we have pointed out in the main text, although MAST glmer has good type I error control in simulations, it dose have inflated type I error in real data analysis with similar sample sizes, likely due to extra noise in the real data. Therefore MAST glmer should be used with caution in real data analysis. For example, to double check its results after permuting case/control labels.  Figure S8: Compare the type I error of different methods with respect to effect size changes. The x-axis shows the fold change for mean, and the variance fold change is always mean fold change + 0.3. Note that the type I error were evaluated on the subset of equivalently expressed (EE) genes and thus their expression are not affected by neither the mean nor the variance fold change. In this section we compare p-values from using NB or ZINB distributions to characterize individual-level gene expression through running analysis on both simulated data and real count data. Supplementary Figure S10 demonstrates the comparison on simulated data generated based on count data from the cell type excitatory neurons on layer 2/3 (L2/3) according to the process described in "Design of simulation studies" subsection in main text. The p-values from these two approaches are very similar. (I) Figure S10: Comparison between the pvalues from NB and those from ZINB approaches on simulated data. Data generation process is described in "Design of simulation studies" subsection under the Result section of the main text. Each row corresponds to one group of simulated genes, with top, middle and bottom row corresponding to meanDE, varDE and EE respectively. The first column gives the histogram of p-values from using NB distribution in each group, and the second column gives those from using ZINB distribution. The last column provides the scatter plot comparing NB and ZINB approaches on their negative log10-transformed p-values. Each point in the scatter plot corresponds to one of the 8,000 simulated genes. For both NB and ZINB, the cell-level covariate to adjust for is read-depth, the method used to calculated distance between gene expression distribution is Wasserstein distance (Was). The p-values are computed through PERMANOVA.
Supplementary Figure S11 compares the p-values from IDEAS with NB and ZINB distributions on real count data from the cell type L2/3. The shape of the histograms and the scatter plot show that the results from these two approaches are highly consistent. Scatter plot of -log10(p-values) from NB v.s. that from ZINB approach. Each point in the plot corresponds to one of the 8,260 genes expressed in at least 20% of 8,626 L2/3 neuron cells. The value on x-axis gives the negative log-transformed p-value with based 10 from ZINB, and the value on y-axis gives that from NB. For both NB and ZINB, the cell-level covariate to adjust for is read-depth, the method used to calculated distance between gene expression distribution is Wasserstein distance (Was), and the p-values are computed through PERMANOVA.

Comparison of sampling-based approach vs. DCA-direct method
Besides DCA-direct method, another way to utilize DCA outputs is to generate count data according to the denoised distribution estimate given by DCA for each gene and each cell, then pool the data across cells and fit a negative distribution for each individual using the pooled data, calculate the distance matrix and compute p values. Following the notations as in the subsection "Implementation of DCA-direct estimation method" in main text, different from DCA-direct method, this DCA sampling-based approach makes use of the mean matrix M instead of the normalized mean matrixM , together with the dispersion parameter matrix Θ and dropout probability matrix Π.
Again for the simplicity of notation, we assume that the cells belonging to a specific individual occupy the first J columns of the count matrix. For gene i and cell j, let π ij , µ ij and θ ij denote the corresponding elements on position (i, j) of matrices Π, M and Θ respectively. The denoised distributions from DCA with mean on the scale of original raw count for this gene on the J cells are: Then for each gene and each cell, we can generate multiple counts (for example, m = 5, 10, or 20) from the corresponding denoised distribution: Now for this individual and gene i, we have in total mJ counts sampled from the denoised distributions for the original J cells. Next, we treat all these mJ sampled counts as counts from different cells, and apply IDEAS on the matrix consisting of these sampled counts.
Supplementary Figure S12 compares the p-values from DCA sampling-based and DCA-direct approaches, under generated sample size m = 5, 10, 20 for DCA sampling-based approach. The trends look relatively consistent, but the DCA sampling-based approach is computationally much more expensive due to the extra computational cost to fit NB distributions. The cell type here is L2/3. (A) Comparison between -log10(p-values) from DCA sampling-based approach (y-axis) with m = 5 and those from DCA-direct approach (x-axis). (B-C) Similar comparison to that in (A), except that (B) has m = 10 and (C) has m = 20 for DCA sampling-based approach on y-axis. For DCA sampling-based approach, we fit a NB distribution per individual, with adjustment for the cell-level read-depth. For both DCA sampling-based and DCA-direct approaches, the method taken to calculated distance between gene expression distribution is Wasserstein distance (Was), and the p-values are computed through PERMANOVA.

Estimates of type I error
We estimate type I error in real data by applying different methods on permutated case/control labels. We estimate the type I error for each method across 17 cell types by calculating the proportion of genes with p-values smaller than a threshold and summarize them by a box-plot (Supplementary Figure S13). The conclusions are consistent with our results from simulation studies. First, there are severe inflation of type I error for cell-level DE methods (rank sum test or MAST). Second, MAST lmer has moderate level of inflation of type I error. Third, methods designed for individual level DE (DESeq2 or IDEAS methods) do not have any inflation of type I error. If case/control status indeed affects gene expression, using the permutated case/control labels leads to an mis-specified model where one informative covariate is ignored, which can leads to over-estimate of variation and thus deflated type I error. This may explain the slight deflation of type I error by IDEAS. IDEAS-DCA and IDEAS-SAVER have slightly more deflated type I error, which is likely because stronger dependence across genes are introduced after denoising the scRNA-seq data.

Mean and pseudo dispersion regression
To explore what kind of differential expression patterns IDEAS DCA captures, we compute subject-level mean and a pseudo dispersion parameter based on the denoised cell-level distributions given by DCA, and explore the associations between the log transformation of these two quantities with case/control status.
Following the notations in section 1.1 of supplementary material, for one specific gene i and a cell j, let π ij , µ ij and θ ij denote the corresponding elements on position (i, j) of dropout probability matrix Π, normalized mean matrixM , and dispersion parameter matrix Θ respectively. The DCA-denoised distribution of gene i in cell j is P ij = ZINB(π ij ,μ ij , θ ij ).
We treat these J distributions equally and view the cell-level expression of this gene in this subject as a random variable Y ik with 1/J probability to be from each of the J distributions. Then the mean and variance of Y ik can be computed as If we assume Y ik can be approximated by a negative binomial distribution, but we can compute a pseudo dispersion parameter for it as Next we can assess the association between case/control status and mean or pseudo dispersion by linear regression. Specifically, for gene i, we can assess the association between mean expression and case/control status by a linear regression with log[E(Y ik )] as the response variable and case/control status together with other individual level features as covariates. Similar analysis can be conducted using log[disp(Y ik )] as response instead as well.
Figure S14: Compare the mean and variance for gene-specific and subject-specific distributions in the Autism data. Each of 8260 × 23 points represents one gene and one subject. 23 points having mean ≥ 35 are treated as outliers and are removed. The reference lines are y = x in both figures.  Rank sum p-val = 3.7e-17 Figure S15: Compare the four groups of genes that identified as DE/EE (differentially expressed / equivalently expressed) by IDEAS DCA or DESeq2 : 38 genes identified as DE genes by both methods, 222 and 93 genes identified by IDEAS DCA and DESeq2 only, respectively, and the remaining 7907 genes. The first / second row shows the p-value (or log10(p-value) distribution when comparing mean values / pseudo-dispersion from DCA-denoised scRNA-seq data.

Supplementary results from COVID-19 data analysis 4.1 Data processing and DE results
We studied a single-cell RNA-seq data set of COVID-19 patients to assess gene expression difference between patients with mild and severe symptoms. The data set we used was collected from blood samples (cohort 1 of Schulte-Schrepping et al. 2020) using 10x Chromium platform. It contains normalized expression of 46,584 genes in 99,049 cells, and can be downloaded from link https://beta.fastgenomics.org/datasets/ detail-dataset-952687f71ef34322a850553c4a24e82e. We recovered the RNA-seq counts from the normalized expression by taking exponential, subtracting 1, dividing by 10,000 and multiplying by total UMI counts. 10,176 genes were left after genes appearing in less than 2,000 cells (about 2%) were filtered out. We focused on cell type CD8+ T cells 1 (cluster 1 of CD8+ T cells) in COVID patients and filtered out the patients with less than 10 cells in this this cluster. After filtering, the dataset includes 5,557 cells from 17 COVID patients (10 severe cases vs. 7 mild ones). Altogether 5,160 genes expressed in at least 90% of the 5,557 cells were kept for analysis.
The current version of DCA code from https://github.com/theislab/dca (accessed on 11/02/2021) no longer provides the normalized mean matrixM as one of the output files. ButM can be retrieved based on the mean matrix M from the output and the original count matrix. The code for this step of processing on COVID data is included in our github repo: https://github.com/Sun-lab/ideas_pipeline/tree/main/COVID.

Type I error evaluation
Similar to the analysis of Autism data, we permuted case/control labels and then calculated the proportion of DE genes at different p-value cutoffs to evaluate type I error. For Autism data, we considered multiple cell types and thus summarized the type I error across cell types by a box plot. Here we only considered one cell type and we listed the type I error at different p-value cutoffs. At p-value cutoff 0.01 or 0.05, MAST glmer has much higher type I error than in Autism data or simulation, likely because smaller sample size and more uneven number of cells per subject.

GSEA using gene mean expression level
To explore the possible influence of gene expression level on the pathways identified by different methods in the COVID-19 data set, we compute a mean expression level for each gene, rank genes by this quantity, and carry out gene set enrichment analysis.
To compute the mean expression level, we first rescale the counts in the pseudo bulk count matrix to adjust for subject-level read depth. For a pseudo bulk count matrix K of size (m, n), with each row corresponding to a gene and each column corresponding to a subject, the rescaled counts are computed by After rescaling, the mean expression level of the i-th gene is computed as n j=1 R ij /n and it is used as the gene-level statistic for gene set enrichment analysis. On this COVID-19 data set, at adjusted p-value cutoff 0.05, GSEA using mean expression level identified 235 pathways.

GSEA results
The GSEA results from all the methods were included in Additional File 2. Here we illustrate the results based on the gene rankings by IDEAS DCA and IDEAS SAVER.  Figure S16: GSEA results based on the gene rankings by IDEAS DCA (A) and IDEAS SAVER(B). All these pathways have adjusted p-value smaller than 0.05. The underscored pathways are those mentioned in the main text, and the ones with * to the left are those that are also identified by GSEA analysis if we rank genes by gene expression level.