All relevant source code is available at https://github.com/jon-xu/scSplit/.
Overview
The overall pipeline for the scSplit tool includes seven major steps (Fig. 4):
- 1
Data quality control and filtering: The mixed sample BAM file is first filtered to keep only the reads with a list of valid barcodes to reduce technical noise. Additional filtering is then performed to remove reads that meet any of the following: mapping quality score less than 10, unmapped, failing quality checks, secondary or supplementary alignment, or PCR or optical duplicate. The BAM file is then marked for duplication, sorted and indexed.
- 2
SNV calling (Fig. 4a): Freebayes v1.2 [11] is used to call SNVs on the filtered BAM file, set to ignore insertions and deletions (indels), multi-nucleotide polymorphisms (MNPs), and complex events. A minimum base quality score of one and minimum allele count of two is required to call a variant. The output VCF file is further filtered to keep only SNVs with quality scores greater than 30.
- 3
Building allele count matrices (Fig. 4b): The “matrices.py” script is run which produces two.csv files, one for each of reference and alternate allele counts as output.
- 4
Model initialization (Fig. 4c): find the distinct groups of cells in the scRNA-seq and use them to initialize the Allele Fraction Model (SNVs by samples).
- 5
E-M iterations till convergence (Fig. 4d): Initialized allele fraction model and the two allele count matrices are used together to calculate the probability of each cell belonging to the clusters. After each round, allele fraction model is updated based on the probability of cell assignment and this is iterated until overall likelihood of the model reaches convergence.
- 6
Alternative presence/absence genotypes (Fig. 4e): matrix indicating cluster genotypes at each SNV is built in this step.
- 7
Find distinguishing variants for clusters and use to assign samples to clusters (Fig. 4f): In order to assign each model cluster back to the specific sample, distinguishing variants are identified so that genotyping of the least number of loci using the a suitable platform may be performed. Gram-Schmidt orthogonalization [12] is used to get the minimum set of informative P/A genotypes.
Data quality control
Samtools was used to filter the reads with verified barcodes for mapping and alignment status, mapping quality, and duplication (samtools view -S -bh -q 10 -F 3844 [input] >[output]). Duplicates were removed (samtools rmdup [input] [output]) followed by sorting and indexing.
SNV calling on scRNA-seq dataset
SNVs were called on the scRNA-seq mixed sample BAM file with freebayes [11], a widely used variant calling tool. The freebayes arguments “-iXu -q 1” were set to ignore indels and MNPs and exclude alleles with supporting base quality scores of less than one. This generated a VCF file containing all SNVs from the mixed sample BAM file. Common SNPs of a population (for example results from The international Genome Sample Resource [9]) were recommended be used to filter out noisy SNVs.
Building allele count matrices
Allele count matrices were then built from (1) the provided mixed sample BAM file and (2) the VCF file obtained from the SNV calling program. Two allele count matrices were generated, one for the reference alleles and one for the alternate alleles, each with SNVs in rows and barcodes in columns. Each data element in the matrix indicated either the number of reference or alternate alleles detected in one cell barcode at that specific SNV position. This provided a full map of the distribution of reference and alternate alleles across all barcodes at each SNV.
The allele count matrices captured information from all reads overlapping SNVs to reflect the different allele fraction patterns from different barcodes or samples. To build the allele count matrices, pysam fetch [13] was used to extract reads from the BAM file. The reads overlapping each SNV position were fetched and counted for the presence of the reference or alternate allele. In order to increase overall accuracy and efficiency, SNVs whose GL(RA)(likelihood of heterozygous genotypes) was lower than log10(1−error) where error = 0.01 were filtered out. These were more homozygous and thus less informative for detecting the differences between the multiple samples. The generated matrices were exported
Model initialization by using maximally informative cluster representatives
To initialize the model, initial probabilities of observing an alternative allele on each SNV position in each cluster were calculated. The overall matrix was sparse and a dense sub-matrix with a small number of zero count cells was generated. To do that, cells were first sorted according to their number of zero allele counts (sum of reference and alternative alleles) at all SNVs and SNVs were similarly sorted according to their number of zero allele counts (sum of reference and alternative alleles) across all cells. Next, we selected and filtered out 10% of the cells among those with the most number of zero expressed SNVs and 10% of the SNVs among those where the most number cells had zero counts. This was repeated until all remaining cells had more than 90% of their SNVs with non-zero allele counts and all SNVs had non-zero counts in more than 90% of cells. This subset of matrices was the basis for the seed barcodes to initialize the whole model. The sub-matrix was transformed using PCA with reduced dimensions and then K-means clustering was performed to split the cell subset into expected number of clusters. By using the allele fractions on the subset of SNVs in these initially assigned cells, each cluster of the model could be initialized. Let N(Ac,v) and N(Rc,v) be the Alternative and Reference allele counts on SNV v and cell c accordingly, and let pseudo AR be the pseudo allele count for both Alternative and Reference alleles, and pseudo A be the pseudo allele count for Alternative alleles, we calculated P(Av|Sn), the probability of observing Alternative allele on SNV v in Sample n, according to below equation:
$$ P(A_{v}|S_{n}) = \frac{\left[\sum\limits_{c}N(A_{c,v}) + \text{pseudo}_{A}) \right]} {\left[\sum\limits_{c}N(A_{c,v}) + \sum\limits_{c}N(R_{c,v}) + \text{pseudo}_{AR} \right]} $$
(1)
We also initiated the probability of seeing the nth sample as evenly distributed across all samples. Let P(Sn) be the probability of seeing the n-th sample, and N(S) be the number of samples to be demultiplexed:
$$ P(S_{n}) = \frac{1}{N(S)} $$
(2)
Expectation–maximization approach
The expectation-maximization (EM) algorithm [14] was used to conduct iterations using the full allele count matrices (Fig. 4). Each iteration consisted of an E-step to calculated the probability of seeing cells in all clusters, based on the allele fraction model, and an M-step to use the new probability of seeing cells in all clusters to update the allele fraction model. EM iterations stopped when convergence was reached, so that the overall probability of observing the cells, or the reference/alternative alleles count matrices, was maximized.
During the E-step, the tool first calculated P(Ci|Sn), the likelihood of observing a cell Ci in sample Sn, which was equal to the product of the probability of observing the allele fraction pattern over each SNV, which in turn equaled to the product of probability of having observed the count of alternative alleles and probability of having observed the count of reference alleles. Let ci be the i-th cell, Sn be the n-th sample, Av be the Alternative allele on SNV v, and N(A), N(R) be the quantity of Alternative and Reference alleles:
$$ {\begin{aligned} P(C_{i}|S_{n}) &= P(A_{c_{i}}, R_{c_{i}}|S_{n}) \\&= \prod\limits_{v}\left[P(A_{v}|S_{n})^{N(A_{c_{i},v})}[1\,-\,P(A_{v}|S_{n})]^{N(R_{c_{i},v})}\right] \end{aligned}} $$
(3)
And then P(Ci|Sn) was transformed to P(Sn|Ci), the cell-sample probability, i.e. the probability of a cell Ci belonging to sample Sn, using Bayes’ theorem, assuming equal sample prior probabilities (P(S1) = P(S2) =... = P(Sn)):
$$ P(S_{n}|C_{i}) = \frac{P(C_{i}|S_{n})}{\sum\limits_{x=1}^{N}P(C_{i}|S_{x})} $$
(4)
Next, weighted allele counts were distributed to the different cluster models according to the cell-sample probability, followed by the M-step, where the allele fraction model represented by the alternative allele fractions was updated using the newly distributed allele counts, so that allele fractions at all SNV positions in each sample model were recalculated:
$$ P(A_{v}|S_{n}) = \frac{\sum\limits_{i}N(A_{c,v})P(S_{n}|C_{i})+\text{pseudo}_{A}}{\sum\limits_{i}N(T_{c,v})P(S_{n}|C_{i})+\text{pseudo}_{AR}} $$
(5)
And the sample probability P(Sn) was also updated by the newly calculated cell likelihoods:
$$ P(S_{n}) = \frac{\sum\limits_{i}P(S_{n}|C_{i})}{\sum\limits_{n}\sum\limits_{i}P(S_{n}|C_{i})} $$
(6)
The overall log-likelihood of the whole model [15] was calculated as:
$$ {\begin{aligned} \mathcal{L}_{\text{model}} &= \sum\limits_{i}\log\sum\limits_{n}P(C_{i}|S_{n}) \\&= \sum\limits_{i}\log\sum\limits_{n}\prod\limits_{v}\left[P(C_{i,v}|S_{n,v})P(S_{n})\right] \end{aligned}} $$
(7)
Multiple runs to avoid local maximum likelihood
The entire process was repeated for 30 rounds with the addition of randomness during model initialization and the round with the largest sum of log likelihood was taken as the final result. Randomness was introduced by randomly selecting the 10% of cells and SNVs to be removed from the matrices during initialization from a range of the lowest ranked cells and SNVs as detailed previously.
Cell cluster assignment
Next, probability of a cell belonging to a cluster P(Sn|Ci) was calculated. Cells were assigned to a cluster based on a minimum threshold of P > 0.99. Those cells with no P(Sn|Ci) larger than the threshold were regarded as unassigned.
Handling of doublets
During scRNA-seq experiments, a small proportion of droplets can contain cells from more than one sample. These so called doublets, contain cells from same or different samples sharing the same barcode, which if not addressed would cause bias. Our model took these doublets into consideration. During our hidden state based demultiplexing approach, we included an additional cluster so that doublets could be captured. To identify which cluster in the model was the doublet cluster in each round, the sum of log-likelihood of cross assignments was checked:
$$ P(c ~is~ \text{doublet}) = \sum\limits_{i\notin c}\prod\limits_{v}\left[P(C_{i,v}|S_{c,v})P(S_{c})\right] $$
(8)
The sum log-likelihood of cells from all other clusters being assigned to a specific cluster was calculated for each cluster in turn and compared. The cluster with the largest sum log-likelihood of cross assignment was designated as the doublet cluster. We allow user input on the expected proportion of doublets. If the expected number of doublets was larger than those detected in the doublet cluster, cells with largest read depth were moved from singlet clusters to doublet cluster, so that the total number of doublets meet expectation as input.
Alternative allele presence/absence genotyping for clusters
To identify a minimum set of variants, which can distinguish between sample clusters, we generated alternative allele P/A genotype matrix (SNVs by clusters). To do that, sum of reference and alternate allele counts across all cells assigned to each cluster were calculated. And for each SNV and each cluster, “P” was marked if there were more than 10 alternative allele counts, and “A” for more than 10 reference allele counts but no alternative allele count. "NA" was set if neither criterion was met.
Mapping clusters back to individual samples using minimal set of P/A genotypes
Based on the P/A matrix, we started from informative SNVs which had variations of “P” or “A” across clusters and avoid picking those with “NAs”. Then, unique patterns involved in those SNVs were derived and for each unique P/A pattern, one allele was selected to subset the whole matrix. Next, Gram-Schmidt orthogonalization [12] was applied on the subset of P/A matrix, in order to find the variants which can be basis vectors to effectively distinguish the clusters. If not enough SNVs were found to distinguish all the clusters, the clusters were split into smaller groups so that for each group there was enough variants to distinguish the clusters within that group. And to distinguish clusters from different smaller groups, if the selected variants could not be used to distinguish any pair of clusters, additional variants were selected from the whole list of variants where no NAs were involved and P/A was different between the pair of clusters. Ideal situation was N variants for N clusters, but it was possible that >N variants were needed to distinguish N clusters.
As such, the P/A genotyping of each cluster, on the minimum set of distinguishing variants, could be used as a reference to map samples to clusters. After running genotyping on this minimum set of loci for each of the individual samples, a similar matrix based on sample genotypes could be generated, by setting the alternative presence flag when genotype probability (GP) was larger than 0.9 for RA or AA, or absence flag when GP was larger than 0.9 for RR. By comparing both P/A matrices, we could link the identified clusters in scSplit results to the actual individual samples.
In practice, samples can be genotyped only on the few distinguishing variants, so that scSplit-predicted clusters can be mapped with individual samples, while the whole genotyping is not needed. When the whole genotyping is available, we also provide an option for users to generate distinguishing variants only from variants with R2 >0.9, so that they can compare the distinguishing matrix from scSplit with that from known genotypes on more confident variants.
Data simulation
To test the consistency of the model, and the performance of our demultiplexing tool, reference/alternative count matrices were simulated from a randomly selected scRNA-seq BAM file from Zheng et al. [2] and a 32-sample VCF file used in Fig. 2 supplementary data of Kang et al. [6]. We assume the randomly selected BAM file had a representative gene expression profile.
First, data quality was checked and the BAM and VCF files were filtered. Second, barcodes contained in the BAM file were randomly assigned to samples in the VCF file, which gave us the gold-standard of cell-sample assignments to check against after demultiplexing. Then, all the reads in the BAM file were processed, that if a read overlapped with any SNV position contained in the merged VCF file, its barcode was checked to get its assigned sample and the probability P(Ac,v) of having the alternative allele for that sample was calculated using the logarithm-transformed genotype likelihood (GL) or genotype probability (GP) contained in the VCF file. The probability of an allele being present at that position could then be derived so that the ALT/REF count at the SNV/barcode in the matrices could be simulated based on the alternative allele probability. Let \(\mathcal {L}(AA)\) and \(\mathcal {L}(RA)\) be the likelihood of seeing AA and RA of a certain cell c on a certain SNV v:
$$ {P(A_{c,v}) = \frac{1}{2}10^{[\log_{10}\mathcal{L}(RA)]}+ 10^{[\log_{10}\mathcal{L}(AA)]}} $$
(9)
Finally, doublets were simulated by merging randomly chosen 3% barcodes with another 3% without overlapping in the matrix. This was repeated for every single read in the BAM file. This simulation modeled the number of reads mapped to the reference and alternative alleles directly. In our simulations, there were 61 576 853 reads in the template BAM file for 12 383 cells, which was equivalent to 4973 rpc.
With the simulated allele fraction matrices, the barcodes were demultiplexed using scSplit and the results were compared with the original random barcode sample assignments to validate.
Result evaluation
We used both TPR/FDR and Cohen’s Kappa [16] to evaluate the demultiplexing results against ground truth. R package “cluster” [17] was used in evaluating the clusters on UMAPs in Fig. 3.
Single cell RNA-seq data used in testing scSplit
In Tables 3 and 4, we used published hashtagged data from GSE108313 and PBMC data from GSE96583. For Tables 2 and 5, endometrial stromal cells cultured from 3 women and fibroblast cells cultured from 38 healthy donors over the age of 18 years respectively were run through the 10x Genomics Chromium 3’ scRNA-seq protocol. The libraries were sequenced on the Illumina Nextseq 500. FASTQ files were generated and aligned to Homo sapiens GRCh38p10 using Cell Ranger. Individuals were genotyped prior to pooling using the Infinium PsychArray.
Full sibling data from UK biobank used in simulation
In Table S2 in Additional file 2, we used genotype data of three pairs of full siblings from UK Biobank, which contained 564 981 SNVs, from which we used 258 077 SNVs within gene ranges, provided on the resource website of plink [18]: https://www.cog-genomics.org/plink/1.9/resources.