Vireo: Bayesian demultiplexing of pooled single-cell RNA-seq data without genotype reference

Multiplexed single-cell RNA-seq analysis of multiple samples using pooling is a promising experimental design, offering increased throughput while allowing to overcome batch variation. To reconstruct the sample identify of each cell, genetic variants that segregate between the samples in the pool have been proposed as natural barcode for cell demultiplexing. Existing demultiplexing strategies rely on availability of complete genotype data from the pooled samples, which limits the applicability of such methods, in particular when genetic variation is not the primary object of study. To address this, we here present Vireo, a computationally efficient Bayesian model to demultiplex single-cell data from pooled experimental designs. Uniquely, our model can be applied in settings when only partial or no genotype information is available. Using pools based on synthetic mixtures and results on real data, we demonstrate the robustness of Vireo and illustrate the utility of multiplexed experimental designs for common expression analyses.


Supplementary methods
In this supplementary section, we will first re-introduce the notation we use, then derive the detailed computation of the lower bound of the variational distribution L(q) in Eq(7) in the main text, and lastly derive the updates of each variational component in equations Eq(9-11) in the main text. By leveraging the read counts of alternative alleles A and both alleles (namely depth) D from N variants in M cells, Vireo aims to estimate the joint posterior distribution of sample identity Z for each cell j from each sample k, the genotype G for variant i in each sample k, and the corresponding alternative allele rate θ for each genotype t ∈ {0, 1, 2}. As described in the main text, we used multinomial priors for the categorical variables Z and G with hyper-parameters π and U , respectively, and by default both take uniform multinomial priors. We used beta priors for the parameter of the alternative allele rate θ, and we took the hyperparameters (α t , β t ), t ∈ {0, 1, 2} that generally fit well to highly expressed germline variants in standard scRNA-seq data set (not multiplexed). Specifically, the default prior distribution are: θ 0 ∼ beta(0.3, 29.7), θ 1 ∼ beta(3, 3), and θ 2 ∼ beta(29.7, 0.3).
Next, the lower bound L(q) in Eq (7) can be written as follows where each part is expressed below.
Note, the variables with tilde hat are the estimated parameters otherwise are fixed hyper parameters, including α t and β t . Same below. Then, following the general updating rule in the mean-field variational inference (see main text Eq(8)), we can update the parameters in each component alternately while fixing all other components of the variational distributions and reach the finalized equations Eq(9-11) in the main paper.
First, by using the distributions of genotype G and alternative allele rate θ that are estimated from a previous step in the iteration, we can analytically update the distribution of the sample assignment Z by a categorical distribution.
where ϕ(·) is the digamma function, the same below. As q(Z j ) for any j follows a multinomial distribution, we can therefore have the updated parameter r j,k , namely the probability of cell j from component k as follows, Second, with a similar procedure, the analytical updates for the genotype distribution can be written in the form of a categorical distribution as follows, where the updated probability of variant i in component k equals to t can be expressed as follows, . (S10) Lastly, the analytical updates of the distribution of the alternative allele rate θ can be expressed in the form of a beta distribution as follows, where the parameters for this beta distribution are Now, by updating these parameters iteratively, we can achieve the maximized lower bound of L(q), and equivalently the minimized KL divergence KL(q(Z, G, θ)||p(Z, G, θ|A, D)).              Figure S12: Sample alignment between pooled scRNA-seq data and bulk RNA-seq data from HipSci Project (see Methods). Left panel: size of inferred samples from a six-sample pooled scRNA-seq data based on 10x Genomics platform. Right panel: alignment between bulk RNA-seq samples to inferred samples from scRNA-seq data with Vireo by comparing the estimated genotype. The value of heatmap is the fraction SNPs with matched genotype between single-cell and bulk RNA-seq samples. Figure S13: Identification of discriminatory variants from estimated genotype for the six pooled samples as in Supp. Fig. S12. Upper panel: two variants with three genotype categories. 0: homozygous reference allele; 1: heterozygous alleles; 2: homozygous alternative allele. Bottom panel: three variants without homozygous alternative allele.