 Method
 Open Access
 Published:
CoCoAdiff: counterfactual inference for singlecell gene expression analysis
Genome Biology volume 22, Article number: 228 (2021)
Abstract
Finding a causal gene is a fundamental problem in genomic medicine. We present a causal inference framework, CoCoAdiff, that prioritizes disease genes by adjusting confounders without prior knowledge of control variables in singlecell RNAseq data. We demonstrate that our method substantially improves statistical power in simulations and realworld data analysis of 70k brain cells collected for dissecting Alzheimer’s disease. We identify 215 differentially regulated causal genes in various cell types, including highly relevant genes with a proper cell type context. Genes found in different types enrich distinctive pathways, implicating the importance of cell types in understanding multifaceted disease mechanisms.
Backgrounds
Singlecell RNAseq is a scalable approach to measure thousands of gene expression values in hundreds of thousands of cells, sampled from a hundred individuals. As technology becomes mature and economical, singlecell sequencing methods have been used to solve a variety of biological and medical problems, and many largescale data sets are becoming available to research communities. Unlike previous bulk RNAseq, singlecell RNAseq analysis quantifies gene expression changes from a large number of cells, and researchers dare to ask unprecedented questions, which had not been feasible in bulk data analysis. Only a subset of such examples includes celllevel developmental trajectory analysis [1], spatial transcriptomics [2], regulatory network reconstruction with perturbation [3], and variance quantitative trait analysis [4, 5].
Interestingly, some research questions hitherto remain fundamentally attractive since gene expression microarrays [6,7,8] and bulk RNAseq [9,10,11,12,13] era. Differential expression analysis is such a classical problem. For casecontrol studies, knowing differentially expressed genes (DEGs) is often of research and clinical interest. Our primary interest also centres on developing a statistical method for differential expression analysis between different groups of individuals, not between cells. The underlying statistical problem is straightforward. However, finding DEGs from casecontrol singlecell data poses several challenges in practice. This work seeks to identify and propose an algorithmic approach that resolves two of those challenges from a causal inference perspective.
Firstly, cells are not independently and identically distributed. Instead, cells belong to a particular individual, hierarchically organized, and naturally create “batch” effects (Fig. 1a). Cells belonging to the same individual are necessarily affected by the same biological and technical factors. The number of individuals essentially determines the statistical power of DEG discovery in singlecell data. Along the same line, a benchmark comparison demonstrates that existing bulk RNAseq methods on pseudobulk data (using the individuallevel aggregate of cells of a particular cell type) still perform decently while correctly controlling false discovery rates [14, 15]. Likewise, for genetic analysis (expression quantitative trait loci), the statistical power of eQTL discovery is primarily determined by the degree of genetic variation across individuals rather than the number of cells per individual [16]. Nonetheless, differential expression analysis of singlecell RNAseq is a stateoftheart and unbiased approach to characterize celltypespecific transcriptomic changes.
Another challenge stems from the study design of casecontrol data analysis. In contrast to randomized control trials, most studies are observational, and we have incomplete knowledge of a disease assignment mechanism. Investigators usually cannot make an intervention for practical and ethical reasons. Considering that many complex disease phenotypes occur at the late onset of a lifetime, finding a suitable set of covariates for causal inference is often infeasible as well. Matrix factorization or latent variable modelling can be used to characterize technical covariates or batch effects. However, it is difficult to identify which principal axes of variation capture confounding effects, independently from unknown diseasecausing mechanisms. A latent variable model of a singlecell count matrix is frequently used for clustering and cell type annotations, and the resulting latent factors are more suitable for the characterization of intercellular heterogeneity than interindividual variability.
We present a novel application of a causal inference method as a straightforward approach to improve the statistical power in casecontrol singlecell analysis while adjusting for unwanted confounding effects existing across heterogeneous individuals. We establish our causal claims in differential expression analysis based on Rubin’s potential outcome framework [17, 18]. Our method is inspired by the seminary work of outcome regression analysis by a matching algorithm [19, 20]. We highlight that our causal inference approach is beneficial in the analysis of disease casecontrol studies, especially when metadata for covariates are scarcely available, and covariates may influence both disease status and gene expressions simultaneously. With respect to the underlying causal structural model (disease to gene expression), we seek to identify genes that are differentially expressed as a result of disease.
Results
Overview of our causal inference approach
Definition of causal genes
Here, we ask whether a gene is causally affecting or affected by a disease variable but not affected by other technical and biological covariates, which may confound the disease status and gene expressions. In this work, a causal gene is defined as a gene affecting or being affected by a disease status independent of other confounding variables. Although many differentially expressed genes can be considered a result of disease status for most lateonset disorders, we also acknowledge that aberrant changes on a handful of genes can initiate disease phenotypes. To distinguish causal vs. anticausal mechanisms, we would need additional perturbation experiments. Alternatively, driver genes can be characterized by mediation analysis using genetic variants as an instrumental variable (Mendelian randomization) [21].
Moreover, concerning cell types and states, we need to assume that cell type fractions are not a mediating factor between the disease and gene expression variables. We found a negligible correlation between celltype proportions and observed disease status in the study of Alzheimer’s disease [22]. Under this causal assumption, the stratification procedure for cell types provides a legitimate strategy to control celltype biases that may impact on identifying DEGs. We think there is almost no chance of a “mediation fallacy [23,24,25].”
Differential analysis on pseudobulk expression profiles
We are interested in comparing pseudobulk gene expression profiles stratified within each cell type and individual between the case and control samples. Letting Y_{gj} be a gene expression of a gene g on a cell j and S_{i} be a set of cell indexes for an individual i ∈ [n], we can create a pseudobulk expression by aggregating all the expression vectors. We will use λ_{gi} to generally refer to a pseudobulk estimate of a gene g on an individual i. For instance, we could take an average, \( {\lambda}_{gi}\approx \sum \limits_j\leftI\right.\left\{j\in {S}_i\right\}{Y}_{gj}/\mid {S}_i\mid \), or take the total count, \( {\lambda}_{gi}\approx \sum \limits_j{\leftI\right.}_{j\in {S}_i}{Y}_{gj} \). Given the estimate of the λ values across n individuals, {λ_{gi} : i ∈ [n]}, we can construct a hypothesis test that seeks to reject a null hypothesis that the distributions of pseudobulk profiles are the same among the case and control individuals (Wilcoxon’s test).
Potential outcome framework for singlecell differential expression analysis
In observational data, where the label assignment is not controlled, data matrices of raw {Y_{gj}} and pseudobulk count {λ_{gi}} can become confounded with the disease label assignment by unknown biological and technical covariates (Fig. 1b). Such confounding factors obfuscate actual diseasespecific effects with other effects of unknown covariates and may lead to false discoveries and dampen the statistical power of differential expression analysis. Rubin’s potential outcome framework [17, 18] seeks to separate the actual disease (or treatment) effects from other effects by asking the following counterfactual questions:

What would be a gene expression if an individual had not been exposed to a disease?

What would be a gene expression if an individual had been exposed a disease?
In our pseudobulk analysis context, we are interested in estimating the following quantities:

\( {\lambda}_{gi}^{(0)} \): What would be the pseudobulk expression of a gene g if an individual i had not been exposed to a disease?

\( {\lambda}_{gi}^{(1)} \): What would be the pseudobulk expression of a gene g if an individual i had been exposed to a disease?
In a binary casecontrol study, we observe one of the values for each individual while the other side is left unobserved (denoted by the “?” mark). Letting W_{i} ∈ {0, 1} be a disease label assignment variable for an individual i, only a part of potential gene expressions are made directly observable from data:
At the cell level (∀j ∈ S_{i}), we have the same structure:
If both sides of the potential expression, \( \left\{{Y}_{gj}^{(0)},{Y}_{gj}^{(1)}\right\} \), were known, we would be able to estimate the disease effect on a gene g for each individual by comparing pseudobulk profiles (\( {\lambda}_{gi}^{(0)} \) vs. \( {\lambda}_{gi}^{(1)} \)) constructed from the potential singlecell gene expressions. The ultimate goal of causal inference in Rubin’s potential outcome framework is to impute the missing part of potential outcomes since a comparison between the case and control becomes straightforward on the imputed data.
The definition of a confounding variable and causal assumptions
We define that a variable can confound a disease label (W) and gene expressions (λ and Y) if (1) it is associated with the disease and gene expression variables and (2) it is still associated with the expression even after conditioning on the disease label [26]. Unless we adjust/stratify a sufficient set of confounding variables, gene expression changes observed between the case and control samples are not necessarily the causal effect of disease mechanisms.
It is crucial to state casual assumptions to proceed with our causal inference:

Stable individual disease effect: We assume that the potential expressions of an individual i, namely \( {\lambda}_{gi}^{(0)} \) and \( {\lambda}_{gi}^{(1)} \), are not affected by the expressions of other individuals {i ′ ∈ [n] : i ′ ≠ i}.

Conditional independence of the potential expression and disease exposure (conditional ignorability [17, 18]): For a noncausal gene, by definition, gene expressions are independent of disease status. Therefore, we do not need an assumption on this matter. However, for a causal gene, we assume that potential (counterfactual) gene expressions are independent of a disease label conditioning on a sufficient set of confounding variables. In other words, genes differentially regulated for a diseased individual would not have been aberrantly expressed if this individual had not developed the disease.

Overlap of confounding effects between the case and control: Within a stratum of individuals, homogeneous with respect to confounding variables, we have both the case and control subjects with nonzero probability. In the singlecell analysis, we assume disease and nondisease cells simultaneously exist in a homogeneous group of cells stratified by confounding factors.
CoCoAdiff for singlecell differential expression analysis
The purpose of our counterfactual confounder adjustment for differential singlecell gene expression analysis (CoCoAdiff) (Fig. 1c) is to impute the missing part of potential outcomes of singlecell profiles (step 1), propagate the imputed results to the pseudobulk estimation, and decompose the total pseudobulk profiles into the confounding (step 2) and differential effects (step 3). Using a singlecell gene expression matrix, \( \left\{{Y}_{gj}:g\in \mathsf{genes},j\in \mathsf{cells}\right\} \), we want to estimate two types of pseudobulk data: (1) the estimated confounders, \( \left\{{\mu}_{gi}:g\in \mathsf{genes},i\in \mathsf{individuals}\right\} \), and (2) the residual differential effects, \( \left\{{\delta}_{gi}:g\in \mathsf{genes},i\in \mathsf{individuals}\right\} \). In other words, we want to estimate the decomposition of pseudobulk data, such as λ_{gi} = μ_{gi}δ_{gi}. Briefly, the algorithm proceeds as follows: (1) we seek to estimate (or impute) counterfactual measurement of single cells’ expression by matching cells in a particular condition with neighbouring cells in the opposite conditions. The distance between cells was calculated on the top principal component axes. (2) Having paired sets of observed and counterfactual singlecell data, we estimate the mean expression of genes shared across two opposite conditions in Bayesian posterior inference. We treat them as putative confounding factors. (3) While holding the estimated confounding effects fixed, we measure the conditional (or residual) mean effect on the observed cells. We refer the readers to Materials and Methods for technical details.
To demonstrate how CoCoAdiff actually works, we simulated a singlecell data matrix consisting of 10,000 genes and 40 individuals (Fig. 1d). Each individual contains 50 cells on average; 50 of the 10,000 genes are causally affected by disease labels (W → λ) and confounding factors (X → λ). The other genes are only affected by confounding factors (X → λ); we introduced five confounding variables \( \left\{{X}_{ik}:i\in \mathsf{individuals},k\in \left[5\right]\right\} \), and a linear combination of these X variables introduces biases on W_{i} and λ_{gi}. Here, we set the variance of λ explained by confounding variables X to 0.5 and the variance of disease label W explained by the same confounding variables to 0.5, but we varied the true disease variability between 0.1 to 0.3 on 50 causal genes (W → λ; the xaxes of Fig. 1e, f). As expected, on noncausal genes (Fig. 1e), we found a strong correlation between the estimated confounding effects (μ_{gi}) and true confounding effects (the linear combination of X_{ik} variables), which is far greater than the correlation with the estimated differential effects (δ_{gi}). We also observed that the unconfounded pseudobulk data (removing the effects of X variables) are correlated with the estimated differential effects (δ_{gi}), consistently not affected by the change of disease variability (Fig. 1f).
As an example, we demonstrate the effectiveness of our approach in the case of APOE gene measured in microglia samples (Fig. 1 g). For better visualization, we removed an individual with only a single read was observed on the APOE gene. In pseudobulk data analysis with 39 individuals, it appears that the total expression values are only mildly upregulated in disease subjects (Wilcoxon p = 0.23) even though APOE overexpression is one of the most frequently observed hallmark of Alzheimer’s disease (AD). We also found that the shared confounding factors across the case and control exhibit almost no apparent correlation with the disease label (p = 0.46). After adjusting the confounders on the data, we recover a significant correlation of APOE gene expressions with AD status (p = 5.2 ×10^{−7}).
Simulation experiments
The design of simulation experiments
We evaluated the performance of our approach in differential expression analysis using a simulated singlecell data matrix (Y) of 10,000 genes with 40 individuals with 50 causal and 9,950 noncausal genes (Fig. 2a). We introduced two types of total five covariates on the individuallevel expressions λ_{gi}, explicitly designating confounding variables X and nonconfounding batch effect B. By definition, confounding factors X_{ik} affect the label assignment W_{i} and the individuallevel mean values λ_{gi}. For a causal gene g, λ_{gi} values are determined by the disease label W_{i}, the confounders X_{ik}, and the batch effect variables B_{il}; for a noncausal gene g, there is no contribution from the disease variable. In each simulation experiment, we specify the following parameters:

\( {\sigma}_{X\to W}^2 \): the variance of W explained by the covariate X.

\( {\sigma}_{W\to Y}^2 \): the variance of logλ explained by the disease assignment W.

\( {\sigma}_{X,B\to Y}^2 \): the variance of logλ explained by the covariates X and B.

d_{C}: the number of confounding variables (from 1 to 5).

d_{B}: The number of nonconfounding batch effects (from 0 to 4).
For each individual i ∈ [n], we first sample covariates \( {X}_{ik}\sim \mathcal{N}\left(0,1\right) \) for k ∈ [d_{C}] and \( {B}_{il}\sim \mathcal{N}\left(0,1\right) \) for l ∈ [d_{B}]. Given the X matrix, we sample the parameter vector α required to introduce biases on W and the residual error vector ϵ_{W} from isotropic Gaussian distributions and adjusted the scale of the error vector to have the simulation proportion of variance matched with the prescribed σ^{2} value, i.e. \( \mathbbm{V}\left[X\upalpha \right]/\mathbbm{V}\left[X\upalpha +{\upvarepsilon}_W\right]={\sigma}_{X\to W}^2 \). We generate a binary label assignment for an individual i by flipping a coin:
Combining these values, we construct the mean values of a gene expression g for an individual i by a generalized linear model:
where the covariate effect \( {\beta}_{kg}\sim \mathcal{N}\left(0,{\sigma}_{X,B\to Y}^2/\left({d}_C+{d}_B\right)\right),{\gamma}_{lg}\sim \mathcal{N}\left(0,{\sigma}_{X,B\to Y}^2/\left({d}_C+{d}_B\right)\right) \), genelevel causal effect \( {\tau}_g\sim \mathcal{N}\left(0,{\sigma}_{W\to Y}^2\right) \), and the residuals \( {\epsilon}_{\lambda}\sim \mathcal{N}\left(0,1{\sigma}_{X\to Y}^2{\sigma}_{W\to Y}^2\right) \). Using the individuallevel mean values λ, we stochastically generated celllevel expressions by multiplying the individuallevel average expressions with random sequencing depth (ρ), sampled from ρ ∼ Gamma(6.25, 6.25). For each cell j,
Once we have estimated different types of pseudobulk data, we ranked genes based on Wilcoxon’s ranksum test [27] implemented in R and constructed receiveroperating and precisionrecall curves to calculate the power and AUPRC using DescTool [28] implemented in R.
Here, we show and compare the performance of differential analysis conducted on the five different pseudobulk data:

CoCoA: individuallevel disease effects δ_{gi} estimated by CoCoAdiff algorithm (Fig. 1c, step 3).

Confounder: the confounder effects μ_{gi} estimated by CoCoAdiff algorithm (Fig. 1c, step 2)

Bayesian: Bayesian estimate of pseudobulk expression averaged over cells within each individual (μ_{gi}δ_{gi} combined, not decomposed).

Mean: arithmetic mean of celllevel expressions within each individual (\( {\left{S}_i\right}^{1}\sum \limits_{j\in {S}_i}{Y}_{gj} \)).

Total: summation of celllevel expressions within each individual \( \left({\sum}_{j\in {S}_i}{Y}_{gj}\right) \).
In addition, we considered a celllevel differential expression method although such celllevel model estimation/hypothesis test violates exchangeability assumptions across different individuals (Fig. 1a).

MAST: Modelbased Analysis of Singlecell Transcriptomics [29] implemented in R
Counterfactual adjustment of pseudobulk data improves statistical power
We repeated our experiments 20 times for all the different configurations and summarized the performance by the area under the precisionrecall curve (AUPRC), varying the gene expression variance caused by disease (\( 0\le {\sigma}_{W\to Y}^2\le 0.3 \)) and the number of individuals (from 10 to 40), also considering a different number of cells per individual (20 and 50). Since our method primarily focuses on adjusting confounding factors (X variables in Fig. 2a), we highlight the results, where all the five covariates act as a confounder (Fig. 2b). However, we generally reached qualitatively a similar conclusion in further experiments, where batch effect variables (B variables in Fig. 2a) coexist with confounding effects (see Fig. S2 for the details). The performance gap between the CoCoAdiff and other pseudobulk analysis methods persists in almost all cases, regardless of the sample size of individuals and cells. As expected, causal genes are located at the bottom of the list ranked by confounding effects, yielding poor AUPRC scores. The celllevel DEG analysis (MAST) performed better than other pseudobulk methods only if the number of individuals is few (N = 10). Considering that model fitting based on celllevel data generally demands higher computational costs, individuallevel pseudobulk analysis is better suited for DEG analysis if the data come with sufficient sample size (individuals) and casecontrol labels were assigned at an individual level.
As demonstrated by previous analysis [14], we also confirmed that pseudobulk methods effectively control type I errors (Fig. 2c–e). However, a celllevel test often produces an inflated pvalue histogram; thus, a subsequent empirical FDR estimation method, such as qvalue [30, 31], ends up drawing a decision boundary at a wrong p value cutoff. Even when we included no causal effect, the celllevel method (MAST) predicted that a high fraction of genes are differentially expressed, whereas the other pseudobulkbased methods, including CoCoAdiff, made almost no discoveries (Fig. 2c). As long as we keep the contributions from confounding effects low, all the pseudobulk methods conservatively (and correctly) control type I errors (Fig. 2d). We found that CoCoAdiff might loosely control type I errors, higher than desired by an empirical false discovery rate (eFDR) calibration method when the simulated data were contaminated by confounding factors (Fig. 2e). We define eFDR as the fraction of positive discovery when multiple hypothesisadjusted q values were empirically calibrated by qvalue [30, 31] package controlled at 1%.
In some sense, the loosened type I error control can arise due to the suboptimal choice of the kNearest Neighbour parameter in the cellcell matching step. We evaluated the performance of the CoCoAdiff methods with different settings of the kNN parameters (Fig. 2f) and calculated eFDR (Fig. 2g). When confounding effects are absent (\( {\sigma}_{X,B\to Y}^2=0 \)), the k parameter does not affect the AUPRC performance and eFDR for k > 1. However, a sufficient number of knearest neighbours are required for large confounding effects (\( {\sigma}_{X,B\to Y}^2=0.5 \)); in our experiments, the AUPRC scores saturated after k ≥ 50 (Fig. 2f) and the eFDR levels decreased when a larger knearest neighbours were used to control confounding effects.
We further conducted simulation experiments, where the conditional ignorability [17, 18] assumption no longer holds due to feedback loops on causal genes and leaking causal effects (see Fig. S1a). In particular, we infused the first principal component (PC) of disease effects on causal genes and reintroduced collider effects shared between the causal and noncausal genes. Unfortunately, unlike confounding variables, adjusting a collider between multiple variables creates spurious associations between them [32]. Moreover, having such a collider variable creates substantial challenges in eFDR calibrations for all the methods (Fig. S1b). However, in terms of gene ranking tasks, CoCoAdiffadjusted pseudobulk analysis still outperforms other methods, consistently across many different settings (Fig. S2c).
Case study: finding celltypespecific causal genes in Alzheimer’s disease
We reanalyzed published singlenuclei RNAseq (snRNAseq) data of 48 individuals in postmortem brain samples [22]. To our knowledge, this is one of the largest snRNAseq data on casecontrol disease studies. Of the 48 individuals, we included 40 individuals for differential expression analysis because we found no casecontrol disease labels on the reaming eight individuals.
Cell type annotations of 70,634 cells
We directly annotated the cell types of these 70,634 cells using the list of celltypespecific genes provided by the PsychENCODE project [33]. Of the total 2648 PsychENCODE marker genes, we used 1726 genes expressed in this data set as features (Fig. S3). We identified the eight celltype clusters of cells while estimating a mixture of von MisesFisher distributions, measuring cells’ likelihood to centroids by angular distance (see the “Methods” section). We found that this genetocelltype membership information was sufficient enough to distinguish eight cell types. These eight cell types include expiatory (Ex) and inhibitory neurons (In), oligodendrocytes (Oligo), oligodendrocyte progenitor cells (OPC), microglia, astrocytes (Astro), pericytes (Per), and endothelial cells (Endo). We found that our annotation almost perfectly agrees with the original paper’s cell type annotation (Fig. S4). We also found that cell types correspond to unique cell clusters after BBKNN (batchbalancing kNearest Neighbour) [34] preprocessing (Fig. 3a), showing no apparent bias induced by other demographic and pathological variables (Fig. S5). Moreover, we further dissected four different cortical layerspecific cell types for excitatory neurons (Fig. 3b) and four different subtypes for inhibitory neurons (Fig. 3c) using a refined set of marker genes provided by previous singlenucleus analysis [35].
We notice a wide spectrum of celltype variability across 48 individuals, both in terms of the number of cells and proportions (Fig. 3d). The Mathys et al. data contain on average 1444 cells per individual, of which 50.65% cells stem from Ex (N = 726 ± 382 SD), 12.71% cells from In (N = 182 ± 107), 25.72% cells for Oligo (N = 380 ± 252), 3.56% cells from OPC (N = 54 ± 34), 2.43% cells from Microglia (N = 33 ± 24), 4.94% cells from Astro (N = 69 ± 46), and 0.1% cells from Endo and Per. We have 726 excitatory neurons per individual (Fig. 3e), of which 52.62% (N = 399 ± 250) cells from the layer 4, 33.1% (N = 233 ± 108) cells from the layer L2/3, 8.26% (N = 50 ± 28) cells from the cortical layer L5/6 (CC), and 6.03% (N = 44 ± 31) cells from the layer 5/6. We have 182 inhibitory neurons per individual (Fig. 3c), consisting of 32.9% (N = 59 ± 34) cells from inhibitory neurons (VIP), 31.72% (N = 59 ± 40) cells from inhibitory neurons (PV), 24.51% (N = 43 ± 26) cells from inhibitory neurons (SST), and 11.1% (N = 22 ± 17) cells from inhibitory neurons (SV2C).
Cell type stratification improves the statistical power and interpretation of differential expression analysis
We investigated the impact of such a high level of celltype heterogeneity on subsequent differential expression analysis. Tissuelevel bulk RNAseq analysis data can be arguably thought of as an aggregate of singlecelllevel expressions. If genes were similarly affected by the disease phenotype in most cell types, we would expect the bulklevel associations to be similar, and celltypestratified analysis would benefit less than more of stochasticity—fewer cells per individual. On the other hand, if most diseaseresponsive genes act through a celltypespecific mechanism, celltypestratified data analysis will improve statistical power and render better biological interpretations in genomics analysis.
Using these cell type annotations, we constructed celltypestratified pseudobulk data for all the genes and individuals in each cell type independently, treated them as a gene expression matrix, and tested associations of genes with AD status. We also constructed the pseudobulk profiles by combining all the cells in each individual, ignoring the cell type annotations, and carried out the same association analysis. It is clearly shown that the number of discoveries (unique genes) dramatically increase with celltypespecific stratification steps in both studies using CoCoAdiff and total expression profiles (Fig. 3g). Considering the variety of cell types in each p value cutoff, such celltypespecific mechanisms are likely to remain hidden in bulk, combined differential analysis but better revealed after taking into account cell type heterogeneity (Fig. 3h).
Disease status modulates the celltypespecific gene expressions
215 genes are differentially regulated with AD pathology
We prioritized genes based on testing a hypothesis that the pseudobulk profiles processed by CoCoAdiff are differentially ranked by AD pathology (Wilcoxon’s ranksum test) [27]. We conservatively adjusted putative confounding effects with (100nearest neighbour search) in a spectral space constructed by 50 principal components. Controlling the false discovery rate (FDR [36]) < 1%, we found 1648 genes (11.68% of 14,106), consisting of 672 genes found in ExL4, 522 in ExL2/3, 297 in ExL5/6, 210 in InPV, 98 in Oligo, 84 in Microglia, 80 in Astro, 57 in InVIP, 49 in OPC, 11 in InSST, and 4 in InSV2C. Controlling familywise error rate (FWER [37]) at 1%, we found a total of 215 genes (1.52%), which consist of 55 genes found in ExL4, 39 in ExL2/3, 28 in ExL5/6, 28 in Oligo 24 in InPV, 19 in Microglia, 19 in Astro, 9 in OPC, 7 in InVIP, and 3 in InSST.
We confirmed that the CoCoAdiff procedure did not introduce a systematic bias by shrinking variance on the case or control samples (Fig. S8). We tested our method on four different phenotypes using twelve celltypespecifically confounderadjusted profiles and celltypesorted pseudobulk data. Moreover, visual inspection of the p value distributions for Wilcoxon’s tests suggests no apparent inflation/deflation in our multiple hypothesis testing (Fig. S9).
In addition to the nonparametric ranksum test, we propose a modelbased Wald statistic for an individuallevel test (for each gene g and an individual i), namely \( {Z}_{gi}=\mathbbm{E}\left[\ln \kern0.20em {\delta}_{gi}\right]/\mathbbm{V}\left[\ln \kern0.20em {\delta}_{gi}\right] \), and the groupwise average disease effect size (ADE) and standard error (SE) for each gene g:
where \( {\omega}_i=1/\mathbbm{V}\left[\ln \kern0.20em {\delta}_{gi}\right] \) for brevity (the method). We found that genelevel ADE values are marginally independent of average confounding effects (the top panels of Fig. S10). However, we confirmed that average disease effects on the disease samples (ADD) generally align well with the average disease effects computed on the control samples (the bottom, Fig. S10).
The false sign rate (FSR) of these Bayesian estimates of ADE and SE can be controlled by an empirical Bayes procedure, such as ashr [38]. Controlling the FSR of ADEs and FDR of the ranksum tests both below 1%, we found a total of 1330 AD genes (1669 gene and cell type pairs) and an average of 152 (± 206) genes per cell type. Of them, we highlighted 182 genes sampled at most 20 genes within each cell type (Fig. 3i) and annotated 17 genes specifically acted in the microglia. We found multiple lines of independent evidence to corroborate the causal role of these genes.
Of these top AD genes found in microglial cells, we highlight five genes, including APOE, showing gene expressions upregulated clearly among the AD individuals (the bottom panels of Fig. 3i). SH3RF3 gene has been found significantly associated with the age at onset of AD in the familybased genomewide association studies [39]. Interestingly, regarding Parkinson’s disease, another neurodegenerative disorder, genetic variants located in LHFPL2 have been associated with accelerated onset of the disease by 8 to 12 years [40]. BLNK plays a key regulatory role in wellknown microgliaspecific TREM2 signalling pathway [41] and has been proved to be upregulated with the increase of amyloid β protein [42]. To some degree, conditional genetic analysis suggested that PTPRJ is a link to explain pleiotropy between lateonset AD and major depressive disorder [43].
Gene ontology analysis characterizes a variety of celltypespecific pathways in AD
We sought to characterize celltypespecific mechanisms potentially perturbed by average 1077 (± 1122) significant AD genes found in each cell type (FDR < 20%) using goseq [44] package. Gene ontology (GO) enrichment analysis shows that DEGs identified in different cell types indeed influence markedly different biological mechanisms. By visual inspection, we can identify celltypespecific modules of the enriched GO terms in the biological process category (Fig. 4b–d). For instance, upregulated AD genes found in excitatory neurons are highly enriched in neurodevelopmental pathways, such as “modulation of chemical synaptic transmission” and “regulation of transsynaptic signalling.” However, microglial DEGs are mostly found in immunerelated activities, such as “interferongammamediated signalling pathway” and “regulation of lymphocytemediated immunity,” and oligodendrocyte DEGs enrich terms reflect the functional role of the cell type such as “myelination” and “axon ensheathment.” For the GO terms in the cellular component category, DEGs found in neurons overrepresent synapse and axon, but glial celltypespecific DEGs highlight cellcell junction and focal adhesion (Fig. 4e). DEGs found in neurons generally participate in ion channel activities, but we noticed that microglial DEGs are highly relevant to Rho GTPase and cadherin binding activities (Fig. 4f).
Discussion
We addressed a subset of a causal inference problem that emerges in disease studies. We sought to characterize and estimate the average causal effect of genes between the case and control individuals from observational singlecell data. Delineating confounding and noncausal factors from causal effects is a crucial step to many genomics problems. Not to be trapped in circular reasoning (identifiability issue), the genomics community has been using socalled control genes and samples to extract factors shared in both control and discovery data [45,46,47]. One of the steps in our algorithm enjoys a similar idea, but there is no need to prescribe control cells or genes for our purposes. Along the same line, only if control features were known a priori, contrastive principal component analysis [48] could pick out noncausal factors in its latent space. Likewise, only if nuisance variables are independently observed, the variational fair autoencoder model [49] project cells onto unconfounded (“fair”) latent space.
Our method builds on the outcome regression facilitate by a matching algorithm [19, 20]. Like most existing singlecell analysis pipelines, finding reliable knearest neighbour cells is a crucial step. If some cells in one condition were poorly matched with other cells in the opposite condition, failing to capture a shared component of confounding effects, our analysis might not work as expected. However, we want to emphasize that a failure of the matching step does not lead to an overcorrection of pseudobulk data. It is important to understand and reliably quantify to what degree a cellcell matching procedure can address the intrinsic and another technical variability of a singlecell RNAseq data matrix.
A sparsity of singlecell data still casts a wide range of modelling questions. As we only consider the average effect within each individual, and we take a simple model that is just enough to capture our estimands. We ignored the notion of zeroinflation since we treat singlecell data as a count matrix, not being transformed by logarithm [50]. However, future research can take advantage of more sophisticated modelling of the individual and celllevel observations [51], perhaps involving latent variables for representational learning.
Conclusions
We present a causal inference method that identifies and removes putative confounding effects from singlecell RNAseq data so that the subsequent differential expression analysis can become unbiased and gain more statistical power. We have empirically shown that CoCoAdiff improved the downstream data analysis in extensive simulation experiments. We also demonstrated in realworld snRNAseq data that the CoCoAdiff approach was necessary to reveal both wellestablished and novel causal genes in AD. Our work is the first application of counterfactual inference to singlecell genomics to the best of our knowledge. We expect that many existing inference methods and models can be reformulated in the same causal inference framework. More broadly, we believe that causal inference methods can improve the interpretation of genomics analysis and ultimately benefit translation researches.
Methods
Preliminary modelling of singlecell RNAseq counting data
Individuallevel gene expression quantification
We describe the singlecell RNAseq datagenerating process in a PoissonGamma hierarchical model. For each individual, we measure thousands of gene expression on nearly a thousand cells. Here, we denote each individual, gene, and cell by index i, g, j, respectively. We model the expression count Y_{gj} of a gene g in a cell j follows Poisson distribution with the composite rate parameter, λ_{gi}ρ_{j}, where λ_{gi} quantifies the gene’s mean activity in the corresponding individual i, and ρ_{j} accounts for the sequencing depth of a cell j. More precisely, we define the likelihood of Y_{gj}:
We assume the gene and cell parameters, λ, ρ, follow a conjugate prior distribution (Gamma); more precisely, we parameterize the density function: \( \mathsf{Gamma}\left(\theta a,b\right)\equiv {b}^a/\Gamma (a){\theta}^{a1}\exp \left( b\theta \right) \). We assume smooth a prior distribution for the ρ and λ parameters, namely \( {\rho}_j,{\lambda}_{gi}\sim \mathsf{Gamma}\left(1,1\right) \). A smaller value for the hyperparameters, such as \( \mathsf{Gamma}\left({10}^{4},{10}^{4}\right) \), could encourage the effect of prior distributions vanish; however, we found it often results in numerically unstable posterior estimation when RNAseq samples are shallowly sampled.
For the gene parameter λ_{gi}, if we defined its distribution: \( {\lambda}_{gi}\sim \mathsf{Gamma}\left({\phi}^{1},{\phi}^{1}/{\mu}_{gi}\right) \), we would have \( \mathbbm{E}\left[\lambda \right]=\mu \) and \( \mathbbm{V}\left[\lambda \right]={\mu}^2\phi \). Integrating out the uncertainty over λ, we derive the following negative binomial model:
which preserves the characteristic quadratic relationship between the mean and variance: \( \mathbbm{V}\left[Y\right]=\mathbbm{E}\left[Y\right]+\mathbbm{E}{\left[Y\right]}^2\phi \).
Variational Bayes for parametric inference
We estimate the posterior distribution of λ_{gi} and ρ_{j} by minimizing KullbackLeibler divergence between the joint likelihood \( \mathcal{L}\equiv \prod \limits_{gj}f\left({Y}_{gj};{\lambda}_{gi},{\rho}_j\right)f\left(\lambda \right)f\left(\rho \right) \) and the fully factored variational distributions [52], \( q\left(\lambda \right)=\mathsf{Gamma}\left(\lambda {\alpha}_{\lambda },{\beta}_{\lambda}\right) \) and \( q\left(\rho \right)=\mathsf{Gamma}\left(\rho {\alpha}_{\rho },{\beta}_{\rho}\right) \). We can quickly reach convergence by alternating the following update equations:
Here, we first initialize \( \mathbbm{E}\left[{\rho}_j\right]=1 \) for all j, and add pseudocount 1 on both numerators and denominators because of the prior distribution of ρ and λ.
Counterfactual confounder adjustment for differential expression analysis
Step 1: Imputation of potential outcomes by Poisson regression)
We assume binary treatment assignment and denote disease assignment (or nature’s treatment) by W ∈ {0, 1}. We denote an individual have suffered from a disease by W = 1 and the healthy one by W = 0. For clarity, we introduce the potential outcome notations to the gene expression variables. Let \( {Y}_{gj}^{(w)} \) be gene expression of a gene g in a cell j if this expression value was observed from an individual with a disease label W = w. For a disease individual, Y^{(1)} is the same as observed Y value, but Y^{(0)} is unknown, requiring counterfactual inference; for the opposite case, a healthy individual, Y^{(0)} is observed, but Y^{(1)} is counterfactual. To proceed, we assume the following causal assumptions [18, 20, 53]: (1) The disease assignment mechanism (W) is unconfounded with potential outcomes Y^{(0)}, Y^{(1)}, conditioning on some covariates X. (2) There is sufficient overlap between the case and control cells with respect to the covariates X. In other words, in almost every X = x, we have 0 < P(W = 1 X = x) < 1.
How do we find the counterfactual Y^{(1 − w)} for the observed Y^{(w)}? We construct feature vectors for potential outcome prediction by searching knearest neighbours (kNN) from the cells belonging to the opposite conditions. To avoid the curse of dimensionality, we first perform spectral decomposition of singlecell data matrix and efficiently search kNN on the spectral domain with hierarchical hashing algorithm [54]. Using these counterfactually matched cells, we construct feature matrix with each element \( {F}_{gk}^{\left(1w\right)}=\log \left(1+{Y}_{gk}^{\left(1w\right)}\right) \) and quickly estimate regression coefficients β’s in the Poisson regression by coordinatewise descent method [55]:
where β_{0} captures the intercept term.
Given the optimized coefficients, we predict the potential outcome \( {\hat{Y}}_{gj}^{\left(1w\right)}\leftarrow \exp \left(\sum \limits_{j^{\prime}}^k{F}_{gj\prime}^{\left(1w\right)}{\hat{\beta}}_{j\prime }+{\hat{\beta}}_0\right) \), ignoring the residual errors (ϵ). We also considered a nonparametric imputation method which takes weighted average over the matched cells [34, 56, 57]. Although such nonparametric methods are frequently used in singlecell data analysis, we found that Poisson regression yields more robust performance with fewer neighbouring cells than the other kNNbased imputation methods.
Step 2: Identification of potential confounding effects)
After the matching followed by the regression, we have observed \( {Y}_{gj}^{(w)} \) and counterfactual \( {\hat{Y}}_{gj}^{\left(1w\right)} \). By construction, one of them carry diseaserelevant effects unlike the other one. However, both of them can provide diseaseinvariant information that implicate potential confounding effects, denoted by μ_{gi} for a gene g and individual i:
where we introduce the conditionalspecific sequencing depth parameters ρ^{(w)}. However, note that μ_{gi} is shared and labelinvariant.
We estimate the posterior mean of μ_{gi} by variational Bayes by alternating the following update equations until convergence:
for all w ∈ {0, 1}.
Step 3: Confounder adjustment)
While fixing the value \( {\hat{\mu}}_{gi} \) to its (variational) posterior mean \( {\mathbbm{E}}_q\kern0ex \left[{\mu}_{gi}\right] \), we redeem the confounderadjusted mean parameters δ_{gi}, by maximizing the data likelihood:
Again, the posterior distributions are found by alternating the following update equations:
Since the δ_{gi} variable follows Gamma distribution, we also have
where ψ(·) is the digamma function, and approximate its variance,
See the following derivations of the Gaussian approximation of Gamma distribution.
Technical details
Local Gaussian approximation of Gamma distribution
We approximate the distribution of lnλ by constructing a local quadratic approximation of the original logprobability density function:
Letting ϕ = ln λ, we can rewrite the above as:
At some \( \hat{\phi} \), we can find a quadratic form:
Setting \( {e}^{\hat{\phi}}=\left(\alpha 1\right)/\beta \) (the mode of Gamma distribution), we have
Finally, we have
\( p\left(\phi \alpha, \beta \right)\approx \mathcal{N}\left\left(\phi \ln \left(\left(\alpha 1\right)/\beta \right),{\left(\alpha 1\right)}^{1}\right)\right.. \) In our case, we assumed λ ∼ Gamma(1, 1) a priori and only derived approximate Gaussian whenever we have at least 1 read per individual; therefore, α > 1. However, if 0 < α ≤ 1, we can approximate the Gaussian at λ = α/β, and this results in \( p\left(\phi \alpha, \beta \right)\approx \mathcal{N}\left\left(\phi \ln \left(\alpha /\beta \right),{\alpha}^{1}\right)\right.. \)
Derivation of average disease effect across individuals (metaanalysis)
From the above, we derived the posterior distribution of ϕ_{i}(≡ ln λ_{i}) variables. Let η_{i} = ln((α_{i} − 1)/β) and \( {\sigma}_i^2={\left({\alpha}_i1\right)}^{1} \). Then we have \( {\phi}_i\sim \mathcal{N}\left({\phi}_i{\eta}_i,{\sigma}_i^2\right) \). We can find another variational distribution \( r\equiv \mathcal{N}\left({\phi}_i\overline{\eta},{\overline{\sigma}}^2\right) \) averaging over all these individuallevel posterior distributions by optimizing the following KullbackLeibler divergence:
By Jensen’s inequality,
Optimizing this with respect to \( \overline{\eta} \) and \( \overline{\sigma} \), we have:
Cell type annotation by constrained mixture of von MisesFisher
We classify a cell type of 70,634 cells based on the prior knowledge of celltypespecific 2648 marker genes on 8 brain cell types [33]. Using 1726 genes present in our data, we construct a normalized vector m_{j} for each cell with the dimensionality d = 1726 and ∥m_{j} ∥ = 1. Additionally, we define a label matrix L to designate the activities of the marker genes to the relevant cell types. Each element L_{gk} takes 1 if and only if a gene g ∈ [d] is active on a k ∈ [8] cell type; otherwise, we set L_{jk} = 0. We assume that each normalized vector m_{j} follows von MisesFisher (vMF) distribution with cell type kspecific mean vector θ_{k} with the concentration parameter κ, shared across all the cell types:
where n = 70634 for the cells and K = 8 for the cell types. Here, we introduce z_{jk}, an indicator variable to mark the assignment of a cell j to a cell type k. Our goal is to estimate the posterior probability of z_{jk} = 1 by stochastic expectation maximization (EM) algorithm. In the Estep, we simply sample the latent membership z_{jk} from the discrete distribution proportional to \( \exp \left({\mathbf{m}}_j^{\top }{\boldsymbol{\uptheta}}_k\right) \). In the Mstep, we maximize the mean and concentration parameters with the cell type constraints L:
where \( r=\left\Vert \sum \limits_j{\mathbf{m}}_j\right\Vert /n \). Derivation for the optimization of κ can be found in the previous work on von Mises Fisher mixture model [58].
Availability of data and materials
We made the C++ source code of binary programs used in simulation and data analysis available in the following public repository, https://ypark.github.io/mmutil/ under MIT License. We also deposited a compressed tarball in Zenodo under the following accession [59]: https://doi.org/10.5281/zenodo.5106691 A full list of differential expression analysis, analysis pipeline (GNU Makefile), and the vignettes of simulation experiments are available in the separate repository: https://ypark.github.io/cocoa_paper/ [60].
The results published here are in whole or in part based on data obtained from the AD Knowledge Portal (https://adknowledgeportal.synapse.org). Study data were provided by the Rush Alzheimer’s Disease Center, Rush University Medical Center, Chicago.
References
Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014;32(4):381–6 https://doi.org/10.1038/nbt.2859.
Ståhl PL, Salmén F, Vickovic S, Lundmark A, Navarro JF, Magnusson J, et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science. 2016;353(6294):78–82 https://doi.org/10.1126/science.aaf2403.
Norman TM, Horlbeck MA, Replogle JM, Ge AY, Xu A, Jost M, et al. Exploring genetic interaction manifolds constructed from rich singlecell phenotypes. Science. 2019;365(6455):786–93 https://doi.org/10.1126/science.aax4438.
van der Wijst MGP, et al. Singlecell RNA sequencing identifies celltypespecific ciseQTLs and coexpression QTLs. Nat Genet. 2018.
Sarkar AK, Tung PY, Blischak JD, Burnett JE, Li YI, Stephens M, et al. Discovery and characterization of variance QTLs in human induced pluripotent stem cells. PLoS Genet. 2019;15(4):e1008045 https://doi.org/10.1371/journal.pgen.1008045.
Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001;98(9):5116–21 https://doi.org/10.1073/pnas.091062498.
Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, et al. Multiplelaboratory comparison of microarray platforms. Nat Methods. 2005;2(5):345–50 https://doi.org/10.1038/nmeth756.
Irizarry RA, Hobbs B, Collin F, BeazerBarclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4(2):249–64 https://doi.org/10.1093/biostatistics/4.2.249.
Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23(21):2881–7 https://doi.org/10.1093/bioinformatics/btm453.
Robinson MD, Smyth GK. Smallsample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008;9(2):321–32 https://doi.org/10.1093/biostatistics/kxm030.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106 https://doi.org/10.1186/gb20101110r106.
Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model analysis tools for RNAseq read counts. Genome Biol. 2014;15:R29.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014;15:1–21.
Soneson C, Robinson MD. Bias, robustness and scalability in singlecell differential expression analysis. Nat Methods. 2018;15(4):255–61 https://doi.org/10.1038/nmeth.4612.
Crowell HL, et al. On the discovery of populationspecific state transitions from multisample multicondition singlecell RNA sequencing data. bioRxiv. 2019:713412.
Mandric I, et al. Optimized design of singlecell RNA sequencing experiments for celltypespecific eQTL analysis. Nat Commun. 2020;11:1–9.
Rubin DB. Bayesian inference for causal effects: the role of randomization. aos. 1978;6:34–58.
Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat. 1985;39:33–8.
Heckman JJ, Ichimura H, Todd PE. Matching as an econometric evaluation estimator: evidence from evaluating a job training programme. Rev Econ Stud. 1997;64(4):605–54 https://doi.org/10.2307/2971733.
Abadie A, Imbens GW. Large sample properties of matching estimators for average treatment effects. Econometrica. 2006;74(1):235–67 https://doi.org/10.1111/j.14680262.2006.00655.x.
Davey Smith G, Hemani G. Mendelian randomization: genetic anchors for causal inference in epidemiological studies. Hum Mol Genet. 2014;23(R1):R89–98 https://doi.org/10.1093/hmg/ddu328.
Mathys H, DavilaVelderrain J, Peng Z, Gao F, Mohammadi S, Young JZ, et al. Singlecell transcriptomic analysis of Alzheimer’s disease. Nature. 2019;570(7761):332–7 https://doi.org/10.1038/s4158601911952.
VanderWeele TJ. Bias formulas for sensitivity analysis for direct and indirect effects. Epidemiology. 2010;21(4):540–51 https://doi.org/10.1097/EDE.0b013e3181df191c.
Glynn AN. The product and difference fallacies for indirect effects: the product and difference fallacies for indirect effects. Am J Polit Sci. 2012;56(1):257–69 https://doi.org/10.1111/j.15405907.2011.00543.x.
Pearl, J. & Mackenzie, D. The book of why: the new science of cause and effect. 2018. Basic Books.
VanderWeele TJ, Shpitser I. On the definition of a confounder. Ann Stat. 2013;41(1):196–220 https://doi.org/10.1214/12aos1058.
Wilcoxon F. Individual comparisons by ranking methods. Biom Bull. 1945;1(6):80–3 https://doi.org/10.2307/3001968.
Andri et mult. al., S. DescTools: tools for descriptive statistics. 2021.
Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, et al. MAST: A flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in singlecell RNA sequencing data. Genome Biol. 2015;16(1):278 https://doi.org/10.1186/s1305901508445.
Storey JD. A direct approach to false discovery rates. J R Stat Soc Ser B Stat Methodol. 2002;64(3):479–98 https://doi.org/10.1111/14679868.00346.
Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci. 2003;100(16):9440–5 https://doi.org/10.1073/pnas.1530509100.
Cole SR, Platt RW, Schisterman EF, Chu H, Westreich D, Richardson D, et al. Illustrating bias due to conditioning on a collider. Int J Epidemiol. 2010;39(2):417–20 https://doi.org/10.1093/ije/dyp334.
Gandal MJ, Zhang P, Hadjimichael E, Walker RL, Chen C, Liu S, et al. Transcriptomewide isoformlevel dysregulation in ASD, schizophrenia, and bipolar disorder. Science. 2018;362(6420):eaat8127 https://doi.org/10.1126/science.aat8127.
Polański K, Young MD, Miao Z, Meyer KB, Teichmann SA, Park JE. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics. 2019; https://doi.org/10.1093/bioinformatics/btz625.
Velmeshev D, Schirmer L, Jung D, Haeussler M, Perez Y, Mayer S, et al. Singlecell genomics identifies cell typespecific molecular changes in autism. Science. 2019;364(6441):685–9 https://doi.org/10.1126/science.aav8130.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Stat Methodol. 1995;57:289–300.
Holm S. A simple sequentially rejective multiple test procedure. Scand Stat Theory Appl. 1979;6:65–70.
Stephens M. False discovery rates: a new deal. Biostatistics. 2017;18(2):275–94 https://doi.org/10.1093/biostatistics/kxw041.
Lee JH, Cheng R, Vardarajan B, Lantigua R, ReyesDumeyer D, Ortmann W, et al. Genetic modifiers of age at onset in carriers of the G206A mutation in PSEN1 with familial Alzheimer disease among caribbean hispanics. JAMA Neurol. 2015;72(9):1043–51 https://doi.org/10.1001/jamaneurol.2015.1424.
HillBurns EM, Ross OA, Wissemann WT, SotoOrtolaza AI, Zareparsi S, Siuda J, et al. Identification of genetic modifiers of ageatonset for familial Parkinson’s disease. Hum Mol Genet. 2016;25(17):3849–62 https://doi.org/10.1093/hmg/ddw206.
Zajkowicz A, GdowiczKłosok A, Krześniak M, Janus P, Łasut B, Rusin M. The Alzheimer’s diseaseassociated TREM2 gene is regulated by p53 tumor suppressor protein. Neurosci Lett. 2018;681:62–7 https://doi.org/10.1016/j.neulet.2018.05.037.
Sierksma A, et al. Novel Alzheimer risk genes determine the microglia response to amyloidβ but not to TAU pathology. EMBO Mol Med. 2020;12:e10606.
Lutz MW, Sprague D, Barrera J, ChibaFalek O. Shared genetic etiology underlying Alzheimer’s disease and major depressive disorder. Transl Psychiatry. 2020;10(1):88 https://doi.org/10.1038/s413980200769y.
Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for RNAseq: accounting for selection bias. Genome Biol. 2010;11(2):R14 https://doi.org/10.1186/gb2010112r14.
Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNAseq data using factor analysis of control genes or samples. Nat Biotechnol. 2014;32(9):896–902 https://doi.org/10.1038/nbt.2931.
GagnonBartsch JA, Speed TP. Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012;13(3):539–52 https://doi.org/10.1093/biostatistics/kxr034.
Schölkopf B, Hogg DW, Wang D, ForemanMackey D, Janzing D, SimonGabriel CJ, et al. Modeling confounding by halfsibling regression. Proc Natl Acad Sci U S A. 2016;113(27):7391–8 https://doi.org/10.1073/pnas.1511656113.
Abid A, Zhang MJ, Bagaria VK, Zou J. Exploring patterns enriched in a dataset with contrastive principal component analysis. Nat Commun. 2018;9(1):2134 https://doi.org/10.1038/s41467018046088.
Louizos, C., Swersky, K., Li, Y., Welling, M., Zemel, R. The variational fair autoencoder. 2015. at http://arxiv.org/abs/1511.00830.
Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for singlecell RNASeq based on a multinomial model. Genome Biol. 2019;20(1):295 https://doi.org/10.1186/s1305901918616.
Sarkar A, Stephens M. Separating measurement and expression models clarifies confusion in single cell RNAseq analysis. bioRxiv. 2020:2020.04.07.030007.
Jordan MI, Jaakkola TS, Saul LK. An introduction to variational methods for graphical models. Mach Learn. 1999;37(2):183–233 https://doi.org/10.1023/A:1007665907178.
Hirano K, Imbens GW, Ridder G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica. 2003;71(4):1161–89 https://doi.org/10.1111/14680262.00442.
Malkov, Y. A. & Yashunin, D. A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. 2016. at http://arxiv.org/abs/1603.09320.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22.
Haghverdi L, Lun ATL, Morgan MD, Marioni JC. Batch effects in singlecell RNAsequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol. 2018;36(5):421–7 https://doi.org/10.1038/nbt.4091.
Hie B, Bryson B, Berger B. Efficient integration of heterogeneous singlecell transcriptomes using scanorama. Nat Biotechnol. 2019;37(6):685–91 https://doi.org/10.1038/s4158701901133.
Banerjee A, Dhillon IS, Ghosh J, Sra S. Clustering on the unit hypersphere using von MisesFisher distributions. J Mach Learn Res. 2005;6:1345–82.
Park, Y. Matrix Market Utility for singlecell sequencing data analysis. 2021. Zenodo. https://doi.org/10.5281/zenodo.5106691.
Park, Y. CoCoAdiff: counterfactual inference for singlecell gene expression analysis. Source code. 2021. https://github.com/ypark/cocoa_paper.
Acknowledgements
We thank Liang He, Matthew Lincoln, and Tomokazu Sumida for biological inspiration, which later led to adopting von MisesFisher model for cell type annotations. We also thank Abhishek Sarkar for helpful discussion on individuallevel quantification model for singlecell RNAseq data. We owe a debt of gratitude to anonymous reviewers for constructive feedback that significantly improved the exposition of this manuscript.
Review history
The review history is available as Additional file 2.
Peer review information
Barbara Cheifet was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Funding
This work was supported by NIH grants U01NS110453, U24HG009446, and U01RFAHG009088 (MK). We also acknowledge generous supports from the BC Cancer Foundation, Project ID 1NSRG048 (YPP).
Author information
Authors and Affiliations
Contributions
YPP and MK conceived the concept and designed the study. MK directed and coordinated data acquisition. YPP implemented software and scripts, and conducted the statistical analysis. YPP and MK wrote the manuscript.
Authors’ information
Authors’ Twitter handle:
YPP: @ypp_lab, MK: @manoliskellis
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Data collection was supported through funding by NIA grants RF1AG57473 (single nucleus RNAseq) and the Illinois Department of Public Health (ROSMAP). The Religious Orders Study and Rush Memory and Aging Project were approved by an IRB of Rush University Medical Center.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1: Figs. S1 to S10.
with the figure legend texts.
Additional file 2.
Review history
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Park, Y.P., Kellis, M. CoCoAdiff: counterfactual inference for singlecell gene expression analysis. Genome Biol 22, 228 (2021). https://doi.org/10.1186/s13059021024384
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13059021024384
Keywords
 Causal inference
 Singlecell RNAseq
 Counterfactual inference
 Alzheimer’s disease