Statistical model
We assume there are K known distinct cell populations among the M cell samples, where cell j has Nj observed transcripts. We denote the native expression distribution for cell population k as a G-length vector ϕk. For notational convenience, we use ϕ−k={ϕk′:k′≠k,k′∈{1,2,...,K}} to represent the gene expression distributions of all cell populations other than k. Each cell j has a parameter θj representing the proportion of transcript counts derived from the native expression distribution. A priori, θj is assumed to come from a global beta distribution with hyperparameters a1 and a2, which captures the variation in contamination level across all cells in the dataset. The tth transcript xjt in cell j has a hidden state, yjt, which follows a Bernoulli distribution parameterized by θj and denotes the transcript’s membership in the native expression distribution (yjt=1) or the contamination distribution (yjt=0). Assuming that transcripts are conditionally independent given the hidden state yjt and the cell’s population zj, xjt follows a multinomial distribution parameterized either by \(\phi _{z_{j} }\), denoting native expression, or by \(\boldsymbol {\phi }_{-z_{j}}\), denoting contamination. The joint distribution can be expressed as:
$$\begin{array}{@{}rcl@{}} {\begin{aligned} P(\boldsymbol{X}, \boldsymbol{Z}, \boldsymbol{Y}, \boldsymbol{\theta} | \boldsymbol{\phi}, a_{1},a_{2}) &= \prod_{j=1}^{M} p(\theta_{j} | a_{1}, a_{2}) \prod_{t=1}^{N_{j}}\\& \left(\left[ p(y_{jt}=1|\theta_{j}) \cdot p(x_{jt}=g|\phi_{z_{j} }) \right]^{I(y_{jt}=1)}\right.\\ & \left.\left[ p(y_{jt}=0|\theta_{j}) \cdot p(x_{jt}=g|\boldsymbol{\phi}_{-z_{j}}) \right]^{I (y_{jt}=0)} \right). \end{aligned}} \end{array} $$
(1)
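To make the generative process concrete, the following minimal Python sketch simulates counts from the model above; the variable names, dimensions, and the uniform mixing of the other populations’ distributions are illustrative assumptions, not the DecontX implementation.

```python
# Illustrative simulation of the generative model in Eq. 1 (a sketch only).
import numpy as np

rng = np.random.default_rng(0)
K, G, M = 3, 100, 50                        # populations, genes, cells (arbitrary)
phi = rng.dirichlet(np.ones(G), size=K)     # native expression distributions
z = rng.integers(0, K, size=M)              # known cell population labels
a1, a2 = 5.0, 2.0                           # beta hyperparameters (arbitrary)

counts = np.zeros((M, G), dtype=int)
for j in range(M):
    theta_j = rng.beta(a1, a2)              # native proportion for cell j
    N_j = rng.integers(500, 2000)           # observed transcripts in cell j
    # contamination distribution: here a uniform mixture of the other populations
    eta_j = phi[np.arange(K) != z[j]].mean(axis=0)
    y = rng.random(N_j) < theta_j           # hidden native/contamination states
    counts[j] += rng.multinomial(y.sum(), phi[z[j]])       # native transcripts
    counts[j] += rng.multinomial(N_j - y.sum(), eta_j)     # contaminating transcripts
```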
To simplify computation and notation, we assume the contamination distribution ηk is a linear combination of ϕ−k, such that:
$$\begin{array}{@{}rcl@{}} \eta_{k} &=& \sum_{k': k' \neq k} w_{k'} \phi_{k'}, \end{array} $$
(2)
where the weight \(w_{k^{\prime }}\) is the proportion of native transcripts from cluster k′ and is calculated using expected values; its full definition is given later in the inference section. The DecontX model construction is shown in the plate diagram.
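As a small illustration of Eq. 2, the sketch below forms ηk as a weighted average of the other populations’ native distributions; the weight vector w is a placeholder here, since its actual definition in terms of expected native counts is given in the inference section.

```python
# Eq. (2) as code: the contamination distribution for population k is a
# weighted combination of the other populations' native distributions.
import numpy as np

def contamination_distribution(phi, w, k):
    """phi: (K, G) native distributions; w: (K,) weights, with w[k] ignored."""
    mask = np.arange(phi.shape[0]) != k
    w_other = w[mask] / w[mask].sum()     # renormalize over k' != k
    return w_other @ phi[mask]            # eta_k, a length-G distribution
```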
Variational inference
We use variational inference [21] to approximate the posterior distribution of our model.
Similar to LDA [14], the following variational distributions are introduced to break the coupling between θ and Y for variational inference:
$$\begin{array}{@{}rcl@{}} \begin{aligned} q (\boldsymbol{\theta}, \boldsymbol{Y} | \boldsymbol{\gamma}, \boldsymbol{\pi}) &= \prod_{j=1}^{M} q(\theta_{j} | \gamma_{j}) \prod_{t=1}^{N_{j}} q(y_{jt} | \pi_{jt}), \end{aligned} \end{array} $$
(3)
where the beta parameter γj={γj1,γj2} and Bernoulli parameter πjt={πjt1,πjt2} are the free variational parameters. πjt satisfies πjt1+πjt2=1, and \( q(y_{jt}) = \pi _{jt1}^{ I(y_{jt} =1)} \pi _{jt2}^{ I(y_{jt} =0)} \). The variational beta distribution for θj is \(q(\theta _{j}) = \frac {\Gamma (\gamma _{j1} + \gamma _{j2}) }{\Gamma (\gamma _{j1}) \Gamma (\gamma _{j2}) } \theta _{j1}^{\gamma _{j1} -1} \theta _{j2}^{\gamma _{j2} -1} \).
The need to compute the expectation of log θj arises when deriving the variational inference. Using the general fact for exponential families that the derivative of the log normalization factor with respect to the natural parameter equals the expectation of the sufficient statistic (logθji, i∈{1,2}, for our beta distribution), we have:
$$\begin{array}{@{}rcl@{}} \begin{aligned} E [ \log \theta_{ji} | \gamma_{j1}, \gamma_{j2} ] = \Psi (\gamma_{ji}) - \Psi (\gamma_{j1} + \gamma_{j2}), i \in \{1, 2\}, \end{aligned} \end{array} $$
(4)
where Ψ is the digamma function, the first derivative of the log gamma function.
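The identity in Eq. 4 can be checked numerically; the short sketch below compares the digamma-based expectation with a Monte Carlo estimate (the parameter values are arbitrary).

```python
# Eq. (4): expectation of log(theta_ji) under the variational beta distribution.
import numpy as np
from scipy.special import digamma

def expected_log_theta(gamma_j1, gamma_j2):
    total = digamma(gamma_j1 + gamma_j2)
    return digamma(gamma_j1) - total, digamma(gamma_j2) - total

# Monte Carlo check of the identity (values are illustrative)
rng = np.random.default_rng(1)
g1, g2 = 3.0, 7.0
samples = rng.beta(g1, g2, size=200_000)
print(expected_log_theta(g1, g2)[0], np.log(samples).mean())  # close agreement
```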
For simplicity in notation, let us use Q={θ,Y} and a={a1,a2}. We begin variational inference by bounding the log-likelihood using Jensen’s inequality.
$$\begin{array}{@{}rcl@{}} {\begin{aligned} \log p(\boldsymbol{X}, \boldsymbol{Z} | {a}, \boldsymbol{ \phi}) &= \log \int_{Q} p(\boldsymbol{X}, \boldsymbol{Z}, \boldsymbol{\theta}, \boldsymbol{Y} | a, \boldsymbol{ \phi}) d Q\\ & = \log \int_{Q} \frac{ p(\boldsymbol{X}, \boldsymbol{Z}, \boldsymbol{\theta}, \boldsymbol{Y} | a, \boldsymbol{\phi}) } { q (\boldsymbol{\theta}, \boldsymbol{Y} | \boldsymbol{\gamma}, \boldsymbol{\pi})} q(\boldsymbol{\theta}, \boldsymbol{Y} | \boldsymbol{\gamma}, \boldsymbol{\pi}) d Q \\ & \ge \int_{Q} \log \frac{ p(\boldsymbol{X}, \boldsymbol{Z}, \boldsymbol{\theta}, \boldsymbol{Y} | a, \boldsymbol{\phi}) } { q (\boldsymbol{\theta}, \boldsymbol{Y} | \boldsymbol{\gamma}, \boldsymbol{\pi})} q(\boldsymbol{\theta}, \boldsymbol{Y} | \boldsymbol{\gamma}, \boldsymbol{\pi}) d Q \\ & = E_{Q} [\log p(\boldsymbol{X}, \boldsymbol{Z}, \boldsymbol{\theta}, \boldsymbol{Y} |a, \boldsymbol{\phi}) ] - E_{Q} [\log q (\boldsymbol{\theta}, \boldsymbol{Y} | \boldsymbol{\gamma}, \boldsymbol{\pi}) ]. \end{aligned}} \end{array} $$
(5)
Jensen’s inequality provides us with a lower bound on the log likelihood for an arbitrary variational distribution q(θ,Y|γ,π).
We then expand the lower bound:
$$\begin{array}{@{}rcl@{}} \begin{aligned} L(\boldsymbol{\gamma}, \boldsymbol{\pi}; a, \boldsymbol{ \phi}) &= E_{Q} \left[ \log p(\boldsymbol{X}, \boldsymbol{Z}, \boldsymbol{\theta}, \boldsymbol{Y} |a, \boldsymbol{\phi}) \right] \\&\quad- E_{Q} \left[\log q (\boldsymbol{\theta}, \boldsymbol{Y} | \boldsymbol{\gamma}, \boldsymbol{\pi}) \right] \\ &= E_{Q} \left[ \log p(\boldsymbol{\theta} | a) + \log p(\boldsymbol{Y} | \boldsymbol{\theta})\right. \\ &\quad \left.+ \log p(\boldsymbol{X}, \boldsymbol{Z} | \boldsymbol{Y}, \boldsymbol{\phi}) \right] \\ & \quad - E_{Q} \left[ \log q(\boldsymbol{\theta} | {\boldsymbol{\gamma}}) + \log q(\boldsymbol{Y} | \boldsymbol{\pi}) \right]. \end{aligned} \end{array} $$
(6)
We expand each term in the lower bound by taking expectations with respect to q(θ,Y|γ,π):
$$\begin{array}{@{}rcl@{}} {\begin{aligned} E_{Q} \left[ \log p(\boldsymbol{\theta} | a) \right] &= E_{Q} \left[ \log \prod_{j=1}^{M} p (\theta_{j} | a) \right] \\&= E_{Q} \left[ \sum_{j=1}^{M} \log p(\theta_{j} |a) \right] = \sum_{j=1}^{M} E_{Q} \left[ \log p(\theta_{j} |a) \right] \\ &= \sum_{j=1}^{M} E_{Q} \left[ \log \Gamma(a_{1}+a_{2}) - \log \Gamma(a_{1}) \right.\\& \left.- \log \Gamma(a_{2}) +(a_{1}-1) \log \theta_{j1} + (a_{2}-1) \log \theta_{j2} \right] \\ &= \sum_{j=1}^{M} \left[ \log \Gamma(a_{1}+a_{2}) -\left(\sum_{i=1}^{2} \log \Gamma(a_{i}) \right)\right. \\ & \left.+\sum_{i=1}^{2}(a_{i}-1) \left(\Psi \left(\gamma_{ji} \right) - \Psi \left(\gamma_{j1}+\gamma_{j2} \right) \right) \right] \end{aligned}} \end{array} $$
(7)
$$\begin{array}{@{}rcl@{}} {\begin{aligned} E_{Q} \left[ \log p(\boldsymbol{Y} | \boldsymbol{\theta}) \right] &= E_{Q} \left[ \log \prod_{j=1}^{M} \prod_{t=1}^{N_{j}} p(y_{jt} | \theta_{j}) \right] \\&= E_{Q} \left[ \sum_{j=1}^{M} \sum_{t=1}^{N_{j}} \log p(y_{jt} | \theta_{j}) \right] \\ & = \sum_{j=1}^{M} \sum_{t=1}^{N_{j}} E_{Q} \left[ \log p(y_{jt} | \theta_{j}) \right] \\ &= \sum_{j=1}^{M} \sum_{t=1}^{N_{j}} E_{Q} \left[ y_{jt} \log \theta_{j1} + (1-y_{jt}) \log \theta_{j2} \right] \\ &= \sum_{j=1}^{M} \sum_{t=1}^{N_{j}} \left[ \pi_{jt1} \left(\Psi(\gamma_{j1}) - \Psi(\gamma_{j1}+\gamma_{j2}) \right) \right.\\& \left.+ \pi_{jt2} \left(\Psi(\gamma_{j2}) - \Psi(\gamma_{j1}+\gamma_{j2}) \right) \right] \end{aligned}} \end{array} $$
(8)
$$\begin{array}{@{}rcl@{}} {\begin{aligned} E_{Q} \left[ \log p(\boldsymbol{X}, \boldsymbol{Z} | \boldsymbol{Y}, \boldsymbol{\phi}) \right] &= E_{Q} \left[ \log \prod_{j=1}^{M} \prod_{t=1}^{N_{j}} p(x_{jt}, z_{j} | y_{jt}, \phi_{z_{j}},\eta_{z_{j}}) \right] \\ & = \sum_{j=1}^{M} \sum_{t=1}^{N_{j}} E_{Q} \left[ \log p(x_{jt}, z_{j} | y_{jt}, \phi_{z_{j}},\eta_{z_{j}}) \right]\\ &= \sum_{j=1}^{M} \sum_{t=1}^{N_{j}} E_{Q} \left[ \sum_{g=1}^{G} x_{jt}^{g} y_{jt} \log \phi_{z_{j},g} \right.\\ &\quad \left.+ x_{jt}^{g} (1- y_{jt}) \log \eta_{z_{j},g} \right] \\&= \sum_{j=1}^{M} \sum_{t=1}^{N_{j}} \sum_{g=1}^{G} E_{Q} \left[ x_{jt}^{g} y_{jt} \log \phi_{z_{j},g} \right.\\& \quad \left.+ x_{jt}^{g} (1- y_{jt}) \log \eta_{z_{j},g} \right] \\ &= \sum_{j=1}^{M} \sum_{t=1}^{N_{j}} \sum_{g=1}^{G} \left[ x_{jt}^{g} \pi_{jt1} \log \phi_{z_{j},g} \right. \\&\quad \left.+ x_{jt}^{g} \pi_{jt2} \log \eta_{z_{j},g} \right] \end{aligned}} \end{array} $$
(9)
$$\begin{array}{@{}rcl@{}} {\begin{aligned} E_{Q} \left[ \log q(\boldsymbol{ \theta} | \boldsymbol{ \gamma}) \right] &= E_{Q} \left[ \log \prod_{j=1}^{M} q(\theta_{j} | \gamma_{j}) \right] \\&= E_{Q} \left[ \sum_{j=1}^{M} \log q(\theta_{j} | \gamma_{j}) \right] = \sum_{j=1}^{M} E_{Q} \left[ \log q(\theta_{j} | \gamma_{j}) \right] \\ &= \sum_{j=1}^{M} E_{Q} \left[ \log \Gamma \left(\gamma_{j1} + \gamma_{j2} \right) - \log \Gamma (\gamma_{j1}) \right.\\ &- \log \Gamma (\gamma_{j2}) + \left(\gamma_{j1} -1 \right) \log \theta_{j1} \\ & \left. + \left(\gamma_{j2} -1 \right) \log \theta_{j2} \right] \\ &= \sum_{j=1}^{M} \left[ \log \Gamma \left(\gamma_{j1} + \gamma_{j2} \right) - \left(\sum_{i=1}^{2}\log \Gamma (\gamma_{ji}) \right) \right. \\& \left.+\sum_{i=1}^{2} \left(\gamma_{ji} -1 \right) (\Psi (\gamma_{ji}) - \Psi (\gamma_{j1} + \gamma_{j2})) \right] \end{aligned}} \end{array} $$
(10)
$$\begin{array}{@{}rcl@{}} {\begin{aligned} E_{Q} \left[ \log q(\boldsymbol{ Y} | \boldsymbol{ \pi}) \right] &= E_{Q} \left[ \log \prod_{j=1}^{M} \prod_{t=1}^{N_{j}} q(y_{jt} | \pi_{jt}) \right] \\&\quad= E_{Q} \left[ \sum_{j=1}^{M} \sum_{t=1}^{N_{j}} \log q(y_{jt} | \pi_{jt}) \right] \\ & = \sum_{j=1}^{M} \sum_{t=1}^{N_{j}} E_{Q} \left[ \log q(y_{jt} | \pi_{jt}) \right] \\ &= \sum_{j=1}^{M} \sum_{t=1}^{N_{j}} E_{Q} \left[ y_{jt} \log \pi_{jt1} + (1 - y_{jt}) \log \pi_{jt2} \right] \\ &= \sum_{j=1}^{M} \sum_{t=1}^{N_{j}} \left[ \pi_{jt1} \log \pi_{jt1} + \pi_{jt2} \log \pi_{jt2} \right]. \end{aligned}} \end{array} $$
(11)
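Putting Eqs. 7–11 together, the sketch below evaluates the lower bound of Eq. 6 on gene-aggregated counts; because πjt depends on transcript t only through its gene g, transcripts of the same gene share a single π value. The array layout and function name are illustrative assumptions, not the package code.

```python
# Lower bound of Eq. (6), assembled from Eqs. (7)-(11); a sketch only.
import numpy as np
from scipy.special import digamma, gammaln

def elbo(X, z, phi, eta, pi1, gamma, a1, a2):
    """X: (M, G) counts; z: (M,) labels; phi, eta: (K, G) distributions;
    pi1: (M, G) variational P(native), assumed strictly in (0, 1); gamma: (M, 2)."""
    pi2 = 1.0 - pi1
    e_log_theta = digamma(gamma) - digamma(gamma.sum(1, keepdims=True))  # Eq. (4)
    # Eq. (7): E[log p(theta | a)]
    t7 = np.sum(gammaln(a1 + a2) - gammaln(a1) - gammaln(a2)
                + (a1 - 1) * e_log_theta[:, 0] + (a2 - 1) * e_log_theta[:, 1])
    # Eq. (8): E[log p(Y | theta)], summed over transcripts via the counts X
    t8 = np.sum(X * (pi1 * e_log_theta[:, [0]] + pi2 * e_log_theta[:, [1]]))
    # Eq. (9): E[log p(X, Z | Y, phi)]
    t9 = np.sum(X * (pi1 * np.log(phi[z]) + pi2 * np.log(eta[z])))
    # Eq. (10): E[log q(theta | gamma)]
    t10 = np.sum(gammaln(gamma.sum(1)) - gammaln(gamma).sum(1)
                 + ((gamma - 1) * e_log_theta).sum(1))
    # Eq. (11): E[log q(Y | pi)]
    t11 = np.sum(X * (pi1 * np.log(pi1) + pi2 * np.log(pi2)))
    return t7 + t8 + t9 - t10 - t11
```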
We then maximize the lower bound with respect to the variational parameters γ and π.
First, we maximize the lower bound with respect to π. Since the πjt are independent for t∈{1,2,...,Nj}, we isolate the terms that contain πjt. A Lagrange multiplier is added to enforce the constraint πjt1+πjt2=1. We replaced \(x_{jt}^{g} \pi _{jt1} \log \phi _{z_{j},g}\) and \(x_{jt}^{g} \pi _{jt2} \log \eta _{z_{j},g}\) from Eq. 9 with \( \pi _{jt1} \log \phi _{z_{j},g}\) and \(\pi _{jt2} \log \eta _{z_{j},g}\), respectively, since \( x_{jt}^{g} = I(x_{jt} =g) \) is observed:
$$\begin{array}{@{}rcl@{}} \begin{aligned} L_{[ \pi_{jt} ]} &= \left[ \pi_{jt1} \left(\Psi(\gamma_{j1}) - \Psi(\gamma_{j1}+\gamma_{j2}) \right)\right. \\&\quad \left.+ \pi_{jt2} \left(\Psi(\gamma_{j2}) - \Psi(\gamma_{j1}+\gamma_{j2}) \right) \right] \\ & \quad + \left[ \pi_{jt1} \log \phi_{z_{j},g} + \pi_{jt2} \log \eta_{z_{j},g} \right] \\ & \quad - \left[ \pi_{jt1} \log \pi_{jt1} + \pi_{jt2} \log \pi_{jt2} \right] \\ & \quad - \lambda (\pi_{jt1} + \pi_{jt2} - 1). \end{aligned} \end{array} $$
(12)
Taking the derivative with respect to πjt1, we obtain:
$$\begin{array}{@{}rcl@{}} \begin{aligned} \frac{\partial L }{\partial \pi_{jt1}} & = \left(\Psi(\gamma_{j1}) - \Psi(\gamma_{j1}+\gamma_{j2}) \right) \\&\quad+ \log \phi_{z_{j},g} -\log \pi_{jt1} - \lambda - 1. \end{aligned} \end{array} $$
(13)
Setting this derivative to zero yields the maximizing value of the variational parameter πjt1:
$$\begin{array}{@{}rcl@{}} \begin{aligned} \pi_{jt1} \propto \phi_{z_{j},g} \exp (\Psi(\gamma_{j1}) - \Psi(\gamma_{j1}+\gamma_{j2})). \end{aligned} \end{array} $$
(14)
Similarly, we obtain πjt2:
$$\begin{array}{@{}rcl@{}} \begin{aligned} \pi_{jt2} \propto \eta_{z_{j},g} \exp (\Psi(\gamma_{j2}) - \Psi(\gamma_{j1}+\gamma_{j2})). \end{aligned} \end{array} $$
(15)
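A sketch of this coordinate update for the variational Bernoulli parameters (Eqs. 14 and 15), again on a per-cell, per-gene basis; the names and array shapes are assumptions.

```python
# Update of Eqs. (14)-(15): because the update depends on transcript t only
# through its gene, pi is stored per cell and gene. Illustrative sketch.
import numpy as np
from scipy.special import digamma

def update_pi(z, phi, eta, gamma):
    """Returns pi1 with shape (M, G): variational P(native) per cell and gene."""
    e_log_theta = digamma(gamma) - digamma(gamma.sum(1, keepdims=True))  # (M, 2)
    log_p1 = np.log(phi[z]) + e_log_theta[:, [0]]   # proportional to Eq. (14)
    log_p2 = np.log(eta[z]) + e_log_theta[:, [1]]   # proportional to Eq. (15)
    m = np.maximum(log_p1, log_p2)                  # normalize in log space
    p1, p2 = np.exp(log_p1 - m), np.exp(log_p2 - m)
    return p1 / (p1 + p2)
```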
Next, we maximize the lower bound with respect to γ. Since the γj are independent for j∈{1,2,...,M}, each γj can be estimated separately. We isolate the terms that contain γj:
$$\begin{array}{@{}rcl@{}} \begin{aligned} L_{\left[ \gamma_{j} \right]} &= \sum_{i=1}^{2}(a_{i}-1) \left(\Psi \left(\gamma_{ji} \right) - \Psi \left(\gamma_{j1}+\gamma_{j2} \right) \right) \\ & + \sum_{t=1}^{N_{j}} \left[ \pi_{jt1} \left(\Psi(\gamma_{j1}) - \Psi(\gamma_{j1}+\gamma_{j2}) \right)\right. \\& \left. \quad + \pi_{jt2} \left(\Psi(\gamma_{j2}) - \Psi(\gamma_{j1}+\gamma_{j2}) \right) \right] \\ & - \left[ \log \Gamma \left(\gamma_{j1} + \gamma_{j2} \right) - \left(\sum_{i=1}^{2}\log \Gamma (\gamma_{ji}) \right) \right.\\& \left. +\sum_{i=1}^{2} \left(\gamma_{ji} -1 \right) (\Psi (\gamma_{ji}) - \Psi (\gamma_{j1} + \gamma_{j2})) \right]. \end{aligned} \end{array} $$
(16)
Taking the derivative with respect to γji, we obtain:
$$\begin{array}{@{}rcl@{}} \begin{aligned} \frac{ \partial L }{ \partial \gamma_{ji} } & = \Psi ' (\gamma_{ji}) \left(a_{i} + \sum_{t=1}^{N_{j}} \pi_{jti} -\gamma_{ji} \right) - \Psi ' (\gamma_{j1} + \gamma_{j2})\\&\quad \left(a_{1} + \sum_{t=1}^{N_{j}} \pi_{jt1} -\gamma_{j1} + a_{2} + \sum_{t=1}^{N_{j}} \pi_{jt2} -\gamma_{j2} \right), \end{aligned} \end{array} $$
(17)
where Ψ′ is the derivative of the digamma function. Setting this derivative to zero yields a maximum at:
$$\begin{array}{@{}rcl@{}} \begin{aligned} \gamma_{ji} = a_{i} + \sum_{t=1}^{N_{j}} \pi_{jti}, i \in \{ 1,2 \}. \end{aligned} \end{array} $$
(18)
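The closed-form update of Eq. 18 amounts to adding the expected native and contaminating transcript counts of each cell to the prior pseudo-counts; a sketch, assuming the same gene-aggregated layout as above.

```python
# Update of Eq. (18) for the variational beta parameters.
import numpy as np

def update_gamma(X, pi1, a1, a2):
    """X: (M, G) counts; pi1: (M, G) variational P(native). Returns gamma (M, 2)."""
    expected_native = (X * pi1).sum(axis=1)           # sum over t of pi_{jt1}
    expected_contam = (X * (1.0 - pi1)).sum(axis=1)   # sum over t of pi_{jt2}
    return np.column_stack([a1 + expected_native, a2 + expected_contam])
```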
Finally, we estimate ϕ and a and update η.
To maximize with respect to ϕk, we isolate the relevant terms and add a Lagrange multiplier for the constraint \(\sum _{g=1}^{G} \phi _{kg} = 1\):
$$\begin{array}{@{}rcl@{}} \begin{aligned} L_{[\phi_{k}]} &= \sum_{j: z_{j} = k} \sum_{t=1}^{N_{j}} \sum_{g=1}^{G} x_{jt}^{g} \pi_{jt1} \log \phi_{kg} - \lambda (\sum_{g=1}^{G} \phi_{kg} -1). \end{aligned} \end{array} $$
(19)
Taking the derivative with respect to ϕkg and setting it to zero, we get:
$$\begin{array}{@{}rcl@{}} \begin{aligned} \phi_{kg} \propto \sum_{j: z_{j}=k} \sum_{t=1}^{N_{j}} x_{jt}^{g} \pi_{jt1}. \end{aligned} \end{array} $$
(20)
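The update of Eq. 20 pools the expected native counts within each cell population and renormalizes over genes; a sketch under the same assumptions as above.

```python
# Update of Eq. (20) for the native expression distributions.
import numpy as np

def update_phi(X, z, pi1, K):
    """X: (M, G) counts; z: (M,) labels; pi1: (M, G). Returns phi (K, G)."""
    phi = np.zeros((K, X.shape[1]))
    for k in range(K):
        phi[k] = (X[z == k] * pi1[z == k]).sum(axis=0)   # expected native counts
    return phi / phi.sum(axis=1, keepdims=True)          # renormalize over genes
```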
The weight \(w_{k^{\prime }}\) is the proportion of native transcripts from cluster k′ and is calculated using expected values:
$$\begin{array}{@{}rcl@{}} \begin{aligned} w_{k'} = \frac{ \sum_{j: z_{j} = k'} \sum_{t=1}^{N_{j}} \pi_{jt1} }{ \sum_{j: z_{j} \neq k} \sum_{t=1}^{N_{j}} \pi_{jt1} }. \end{aligned} \end{array} $$
(21)
Hence, the updated \(\eta _{kg}\) is:
$$\begin{array}{@{}rcl@{}} \begin{aligned} \eta_{kg} &= \frac{\sum_{k': k' \neq k} \left(\sum_{j: z_{j} = k'} \sum_{t=1}^{N_{j}} \pi_{jt1} \right) \phi_{k'g}} { \sum_{j: z_{j} \neq k} \sum_{t=1}^{N_{j}} \pi_{jt1} } \\ &= \frac{ 1} { \sum_{j: z_{j} \neq k} \sum_{t=1}^{N_{j}} \pi_{jt1} } \sum_{k': k' \neq k} \left(\sum_{j: z_{j} = k'} \sum_{t=1}^{N_{j}} \pi_{jt1} \right) \phi_{k'g} \\ &= \frac{ 1} { \sum_{j: z_{j} \neq k} \sum_{t=1}^{N_{j}} \pi_{jt1} } \sum_{k': k' \neq k} \left(\sum_{j: z_{j} = k'} \sum_{t=1}^{N_{j}} \pi_{jt1} \right)\\&\quad \frac{\sum_{j: z_{j}=k'} \sum_{t=1}^{N_{j}} x_{jt}^{g} \pi_{jt1}}{\sum_{j: z_{j} = k'} \sum_{t=1}^{N_{j}} \pi_{jt1}} \\ &= \frac{ 1} { \sum_{j: z_{j} \neq k} \sum_{t=1}^{N_{j}} \pi_{jt1} } \sum_{k': k' \neq k} \left(\sum_{j: z_{j}=k'} \sum_{t=1}^{N_{j}} x_{jt}^{g} \pi_{jt1} \right) \\ &= \frac{ \sum_{k': k' \neq k} \left(\sum_{j: z_{j}=k'} \sum_{t=1}^{N_{j}} x_{jt}^{g} \pi_{jt1} \right) } { \sum_{j: z_{j} \neq k} \sum_{t=1}^{N_{j}} \pi_{jt1} }. \end{aligned} \end{array} $$
(22)
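The update of Eq. 22 can be computed directly from the expected native counts, as sketched below (illustrative names, not the package code).

```python
# Update of Eq. (22) for the contamination distributions.
import numpy as np

def update_eta(X, z, pi1, K):
    """X: (M, G) counts; z: (M,) labels; pi1: (M, G). Returns eta (K, G)."""
    native = X * pi1                                   # expected native counts
    per_cluster = np.zeros((K, X.shape[1]))
    for k in range(K):
        per_cluster[k] = native[z == k].sum(axis=0)    # pooled within cluster k
    eta = np.zeros_like(per_cluster)
    for k in range(K):
        others = per_cluster[np.arange(K) != k].sum(axis=0)  # pool clusters != k
        eta[k] = others / others.sum()                 # Eq. (22)
    return eta
```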
To maximize with respect to a, we isolate the relevant terms:
$$\begin{array}{@{}rcl@{}} \begin{aligned} L_{[a]} &= \sum_{j=1}^{M} \left[ \log \Gamma(a_{1}+a_{2}) - \left(\sum_{i=1}^{2}\log \Gamma(a_{i}) \right)\right. \\ &\quad \left. +\sum_{i=1}^{2}(a_{i}-1) \left(\Psi \left(\gamma_{ji} \right) - \Psi \left(\gamma_{j1}+\gamma_{j2} \right) \right) \right]. \end{aligned} \end{array} $$
(23)
A Newton iteration can be used to find the maximizing value of a [22], which requires both the first and second derivatives of L[a]. The gradient ∇L and the Hessian matrix H are:
$$\begin{array}{@{}rcl@{}} \begin{aligned} \nabla L_i = \frac{ \partial L_{[a]} }{ \partial a_{i}} &= \sum_{j=1}^{M} \left(\Psi (a_{1} + a_{2}) - \Psi (a_{i}) \right.\\ &\quad \left.+ \Psi (\gamma_{ji}) - \Psi \left(\gamma_{j1}+\gamma_{j2} \right) \right) \end{aligned} \end{array} $$
(24)
$$\begin{array}{@{}rcl@{}} \begin{aligned} & H_{ii} = \frac{ \partial^{2} L_{[a]} }{ \partial a_{i}^{2} } = M \left(\Psi '(a_{1} + a_{2}) - \Psi '(a_{i}) \right), i \in \{1,2\} \\ & H_{ij} = \frac{ \partial^{2} L_{[a]} }{ \partial a_{i} \partial a_{j} } = M \Psi '(a_{1} + a_{2}), j \neq i. \end{aligned} \end{array} $$
(25)
One Newton step is then:
$$\begin{array}{@{}rcl@{}} \begin{aligned} a^{new} = a^{old} - H^{-1} \nabla L. \end{aligned} \end{array} $$
(26)
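One Newton step for a (Eqs. 24–26) needs only the digamma and trigamma functions; the following sketch omits safeguards such as step-size control, which a production implementation might add.

```python
# One Newton step for the beta hyperparameters, following Eqs. (24)-(26).
import numpy as np
from scipy.special import digamma, polygamma

def newton_step_a(a, gamma):
    """a: length-2 array (a1, a2); gamma: (M, 2) variational beta parameters."""
    M = gamma.shape[0]
    e_log_theta = digamma(gamma) - digamma(gamma.sum(1, keepdims=True))
    grad = M * (digamma(a.sum()) - digamma(a)) + e_log_theta.sum(axis=0)  # Eq. (24)
    hess = np.full((2, 2), M * polygamma(1, a.sum()))                     # Eq. (25)
    hess[np.diag_indices(2)] -= M * polygamma(1, a)
    return a - np.linalg.solve(hess, grad)                                # Eq. (26)
```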
Analysis of sorted human-mouse mixture single-cell dataset
A mixture of fresh frozen human (HEK293T) and mouse (NIH3T3) cells was sequenced on the 10X Genomics Chromium platform. The data are available on the 10X Genomics website [23]. A total of 6164 human cells, 5915 mouse cells, and 741 multiplets were detected by CellRanger. Excluding multiplets, 12,079 cells with CellRanger-predicted cell types were used to estimate contamination with DecontX.
Analysis of sorted PBMC single-cell datasets
Nine publicly available PBMC datasets totaling 84,432 cells [24] were obtained from 10X Genomics. Each dataset consisted of a population of cells that had been isolated with flow cytometry based on the expression of a predefined protein marker. Cell populations included progenitor cells (CD34+), monocytes (CD14+), B cells (CD19+), natural killer cells (CD56+), helper T cells (CD4+), regulatory T cells (CD4+/CD25+), naive T cells (CD4+/CD45RA+/CD25−), naive cytotoxic T cells (CD8+/CD45RA+), and cytotoxic T cells (CD8+). A total of 7363 genes which contained at least 3 counts across 3 cells were included in the analysis. DecontX used the flow cytometry cell labels to estimate contamination. Celda [17] was used to identify 76 gene modules and 21 cell clusters, including 8 clusters predominantly expressing T cell markers, 2 clusters predominantly expressing natural killer cell markers, 2 clusters predominantly expressing B cell markers, 2 clusters predominantly expressing monocyte markers, and 7 clusters predominantly expressing CD34+ progenitor cell markers. These computationally inferred cell type labels were used in downstream analyses that examined the percentage of cells expressing various marker genes. Using computationally derived cell clusters mitigated instances in which a cell was improperly sorted and labeled by flow cytometry as belonging to one population when it was in fact transcriptionally similar to another population.
Analysis of the 4K PBMC single-cell dataset
A total of 4340 PBMCs [25] from a healthy donor were sequenced in a single channel of the 10X Genomics Chromium. A total of 4529 genes which contained at least 3 counts across 3 cells were included in the analysis. Nineteen cell clusters and 150 gene modules were identified with Celda [17]. Cell clusters 2 and 3 were classified as B cells (MS4A1+); cell clusters 5, 6, 7, 8, 9, and 11 were classified as T cells (CD3D+/CD3E+); cell clusters 13 and 14 were identified as LYZ+ monocytes; cell cluster 15 was identified as FCGR3A+ monocytes; cell cluster 10 was identified as NKG7+ and GNLY+ NK cells; cell clusters 18 and 19 were identified as FCER1A+ dendritic cells; cell cluster 4 was identified as IRF7+ and IRF8+ plasmacytoid dendritic cells; cell cluster 16 was identified as PPBP+ megakaryocytes; cell cluster 1 was identified as IGHG1+ and IGHG2+ plasma cells; cell cluster 12 was identified as CD34+ cells; and cell cluster 17 was likely composed of multiplets, as it expressed the IL7R, CD3D, and CD14 markers. DecontX used the Celda-estimated cluster labels to estimate contamination.
Analysis of benchmark datasets
Data were generated as previously described [7] and are available in the associated GitHub repository [26]. Briefly, five human lung adenocarcinoma cell lines (HCC827, H1975, A549, H838, and H2228) were cultured separately, and the same batch was processed in two different ways to create three datasets. For the two single-cell datasets, single cells from three or five cell lines were mixed together, with libraries generated using three different protocols (10X Chromium, Drop-seq, CEL-seq2). For the mixRNA datasets, RNA was extracted in bulk from three cell lines (HCC827, H1975, and H2228), mixed in seven different proportions, and diluted to single-cell equivalent amounts, with libraries generated using either the CEL-seq2 or SORT-seq protocol. The cell clusters estimated by Demuxlet in the original paper were used by DecontX to estimate contamination on both single-cell datasets; all cells were included in the DecontX analysis. For the mixRNA datasets, we assigned each pseudo-cell to the cell cluster of the predominant cell line, defined as the cell line contributing more than 50% of the total mRNA.
Analysis of three tissues across two 10X Chromium platforms
Three tissue types (mouse brain, mouse heart, and PBMCs from a healthy donor) were profiled using two different 10X 3’ protocols (V2 and V3). The six datasets are available from 10X Genomics [27–32]. A total of 1206 cells were detected in BrainV2, 1301 in BrainV3, 712 in HeartV2, 1011 in HeartV3, 996 in PBMCV2, and 1222 in PBMCV3. Genes which contained at least 3 counts across 3 cells were included in the analysis. Automatic clustering was performed on each dataset. Specifically, for each dataset, genes were collapsed into 100 gene modules using Celda [17]; UMAP [33] was applied to the 100 gene modules to define spatial similarity between cells in a reduced two-dimensional space; and density-based spatial clustering of applications with noise (DBSCAN) [34] was used with the epsilon parameter set to 1 to define cell clusters.
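A rough Python sketch of this clustering step, assuming the cell-by-module matrix produced by Celda has already been exported to a file (the file name is a placeholder); it uses the umap-learn and scikit-learn packages rather than the original R tooling.

```python
# Gene modules -> 2D UMAP embedding -> DBSCAN clusters (eps = 1); a sketch only.
import numpy as np
import umap
from sklearn.cluster import DBSCAN

modules = np.load("cell_by_module_matrix.npy")                 # (cells x 100), hypothetical file
embedding = umap.UMAP(n_components=2).fit_transform(modules)   # 2D similarity map
labels = DBSCAN(eps=1.0).fit_predict(embedding)                # cell cluster labels
```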