 Method
 Open Access
 Published:
GMMDemux: sample demultiplexing, multiplet detection, experiment planning, and novel celltype verification in single cell sequencing
Genome Biology volume 21, Article number: 188 (2020)
Abstract
Identifying and removing multiplets are essential to improving the scalability and the reliability of single cell RNA sequencing (scRNAseq). Multiplets create artificial cell types in the dataset. We propose a Gaussian mixture modelbased multiplet identification method, GMMDemux. GMMDemux accurately identifies and removes multiplets through sample barcoding, including cell hashing and MULTIseq. GMMDemux uses a droplet formation model to authenticate putative cell types discovered from a scRNAseq dataset. We generate two inhouse cellhashing datasets and compared GMMDemux against three stateoftheart sample barcoding classifiers. We show that GMMDemux is stable and highly accurate and recognizes 9 multipletinduced fake cell types in a PBMC dataset.
Background
Dropletbased single cell RNA sequencing (scRNAseq) [13, 18, 48] has provided many valuable insights into complex biological systems, such as rare celltype identification [26, 32, 39, 41], differential expression analysis at the single cell level [2, 5, 9], and cell lineage studies [9, 15, 24, 30]. While the percell cost of library preparation has decreased over the years, the scalability of dropletbased scRNAseq remains limited, mostly due to rapidly increasing, yet hard to anticipate, multiplet rates as more cells are loaded during single sequencing cell library preparation [17]. Multiplets significantly confound the analysis of single cell experiments and can lead to false discoveries [10, 17], such as false lineages in cell lineage tracing [14, 20, 29], incorrect categorizations in celltype classification [27, 43, 49], or false findings in rare celltype discovery [22, 44]. Large cell populations are especially required for rare celltype discovery, but loading large cell populations during scRNAseq library preparation leads to high multiplet rates. As a result, researchers are challenged with identifying real raretype cells in a multipletfilled scRNAseq dataset. Overall, the scalability of scRNAseq can be significantly improved, greatly reducing the percell library preparation cost, if multiplets can be identified and removed from downstream analysis. To achieve greater adoption of single cell sequencing technology, it is crucial to (1) identify and remove multiplets from downstream analysis, (2) anticipate the multiplet rate prior to conducting an experiment, and (3) verify whether rare cell types identified from a single cell dataset are authentic and are not multiplets.
Recently, emerging sample barcoding technologies, such as cell hashing [36] or MULTIseq [21], enable identification of multiplets arising from more than one uniquely labeled sample and facilitate their subsequent removal from downstream analysis. Both methods use oligonucleotidelabeled reagents that conjugate on the cell surface to produce samplespecific markings on cells: cell hashing, an extension of the cellular indexing of transcriptomes and epitopes by sequencing (CITEseq) technology [35], uses barcoded oligoconjugated antibodies that target ubiquitously expressed surface markers, such as CD298 and beta2microglobulin, while MULTIseq uses lipid and cholesterolmodified oligonucleotides that attach to the cell surface membrane and the cell nuclei membrane. For simplicity, we refer to the oligonucleotidelabeled reagents used in both methods as samplehashtag oligonucleotides (HTOs). Sample barcoding involves labeling cells from each sample with samplespecific HTO conjugates and then pooling the HTOlabeled cells from different samples for dropletbased scRNAseq sequencing library preparation. During library preparation, the pooled cell assay is driven through a microfluidic chip to form cellassay droplets. A fraction of cellassay droplets are combined with barcodeenclosing gel beads and form Gel Beads in Emulsion, or GEMs. Inside each GEM, HTO barcodes are combined with GEM barcodes. Subsequent sequencing simultaneously recovers the HTO barcode(s) and the GEM barcode for each GEM. An abstract workflow of a 3sample sample barcoding experiment is provided in Additional file 1: Fig. S2. Finally, the count of the HTO unique molecular identifiers (UMIs) for each sample, which translates to the number of cellattached, samplespecific HTO antibodies of each GEM, is summarized in a matrix, called the HTO matrix. Table 1 depicts an example 3sample HTO matrix.
There are three types of droplets in a sample barcoding scRNAseq dataset: (1) multisample multiplets (MSMs), droplets that contain more than one cell from more than one HTO sample; (2) singlesample multiplets (SSMs), droplets that contain more than one cell from a single HTO sample; and (3) singlets, droplets that contain a single cell. We combine singlets and SSMs into a single category called singlesample droplets (SSDs) to differentiate them from MSMs. The relationship between MSM, SSM, singlet, and SSD is summarized in Fig. 1. MSMs can be distinguished from SSDs in the HTO matrix: MSMs typically have high HTO UMI counts from more than one HTO barcode, while SSDs typically have high UMI counts from a single HTO barcode and low HTO UMI counts from all other HTO barcodes. However, sample barcoding cannot separate singlets from SSMs, as these two droplet types are indistinguishable in the HTO matrix. As a result, SSMs cannot be removed by sample barcoding and will remain in the dataset as noise.
GEMs can also be classified based on the number of cell types enclosed in them. GEMs that contain a single cell type are named puretype GEMs whereas GEMs that contain multiple cell types are named phonytype GEMs. An illustration of phonytype GEMs and puretype GEMs is provided in Fig. 2a. Puretype GEMs are not necessarily singlets—a puretype GEM can still be a multiplet, but contains cells of exactly the same cell type. Hence, a puretype GEM could be a singlet, a MSM, or a SSM. Phonytype GEMs, on the other hand, are all multiplets. Hence, they must be either MSMs or SSMs.
Phonytype GEMs can be misclassified as novel rare cell types. Figure 2b depicts the gating results of a 16KGEM PBMC cellhashing and CITEseq dataset. The CD3CD19 scatter plot shows GEMs comprised of CD19^{+} B cells, CD3^{+} T cells, and also a CD3^{+}CD19^{+} doublepositive TB cell GEM cluster. Similarly, in the CD4CD8 scatter plot of all CD3^{+} T cells, besides the CD4^{+} helper T cells and CD8^{+} cytotoxic T cells, there also exists a CD4^{+}CD8^{+} helpercytotoxic T cell GEM cluster. Both clusters are highlighted in red circles. Existence of such PBMC types at the observed frequencies is unlikely, as CD3^{+}CD19^{+} cells and CD4^{+}CD8^{+} T cells are believed to be extremely rare [1, 33]. In fact, as revealed by cell hashing, both cell types, along with many other alleged novel rare cell types discovered in this dataset, are all phony cell types. Most, if not all, GEMs in these phonycelltype clusters are phony GEMs: instead of containing real TB cell(s), each CD3^{+}CD19^{+} GEM is a multiplet that contains individual CD3^{+} T and CD19^{+} B cell(s). When compared against true cell types, such as CD19^{+} B cells or CD4^{+} helper T cells, phonytype GEMs are most likely to be MSMs. Figure 2c displays the MSM ratios of the CD19^{+} B cell, the CD4^{+} helper T cell, and the CD8^{+} cytotoxic T cell truecelltype GEM clusters (also referred to as puretype GEM clusters), as well as the MSM ratios of the CD3^{+}CD19^{+} and the CD4^{+}CD8^{+} phonycelltype GEM clusters (or simply phonytype GEM clusters). From the figure, we observe that phonycelltype GEM clusters have much higher MSM ratios than truecelltype GEM clusters (∼75% vs. <14%).
Existing MSM classifiers, including the heuristic classifier from Seurat [4, 36], the heuristic classifier from MULTIseq [23], and the modelbased classifier demuxEM [8], suffer from one or multiple shortcomings, including low classification accuracy, nondeterministic output, unreliable heuristics, and inaccurate model assumptions. Additionally, existing classifiers do not model SSM. Therefore, they cannot estimate the percentage of singlets and SSMs in the dataset and they cannot predict the percentages of MSMs, singlets, and SSMs of the conceived output of a planned sample barcoding experiment. Most importantly, without a droplet formation model, they cannot determine whether an alleged novel cell typedefining GEM cluster consists of mainly puretype GEMs. Hence, they are not able to (and are not designed to) use the sample barcoding information to authenticate the legitimacy of putative novel cell types in a scRNAseq dataset.
In this work, we propose a modelbased Bayesian framework, GMMDemux, for sample barcoding data processing. GMMDemux consistently and accurately separates MSMs from SSDs; estimates the percentage of SSMs and singlets among SSDs; anticipates the MSM, SSM, and singlet rates of planned future sample barcoding experiments; and verifies the legitimacy of putative novel cell types discovered in samplebarcoded scRNAseq datasets. Specifically, GMMDemux independently fits the HTO UMI counts of each sample into a Gaussian mixture model [34]. From each Gaussian mixture model, GMMDemux computes the posterior probability of a GEM containing cells from the corresponding sample. From the posterior probabilities, GMMDemux computes the probabilities of a GEM being a MSM or a SSD. Among SSDs, GMMDemux estimates the proportion of SSMs and singlets in each sample using an augmented binomial probabilistic model. Using the probabilistic model, GMMDemux checks if a proposed putative cell typedefining GEM cluster is a puretype GEM cluster or a phonytype GEM cluster, and based on the classification of the GEM cluster, GMMDemux proves or rejects the novel celltype proposition.
To benchmark the performance of GMMDemux, we conducted two inhouse cellhashing and CITEseq experiments; collected a public cellhashing dataset; and simulated 9 in silico cellhashing datasets. We compare GMMDemux against three existing, stateoftheart MSM classifiers and show that GMMDemux is highly accurate and has the most consistent performance among the batch. From the cellhashing and CITEseq PBMC dataset, we extracted 9 putative novel type GEM clusters through in silico gating, Further analysis by GMMDemux shows that all 9 putative noveltype GEM clusters are phonytype GEM clusters and are removed from the dataset. Out of the 15.8K GEMs of the PBMC dataset, GMMDemux identifies and removes 2.8K multiplets, reducing the multiplet rate from 23.9 to 6.45%. After removing all phonytype GEM clusters, GMMDemux further reduces the multiplet rate to 3.29%.
Results
Datasets
Real datasets
We benchmark GMMDemux on three separate HTO datasets from three independent sources. In addition to a public dataset from Stoeckius et al. [36] (PBMC2), we conducted two additional inhouse cellhashing experiments independently in two separate labs (PBMC1, Memory T). A summary of the three datasets is provided in Table 2.
Cells in the PBMC1 dataset are drawn from a healthy donor following the same protocol described in a previous study [38]. These cells are divided into four samples. Each sample is subjected to the TotalseqA and cellhashing protocol [36], targeting a recovery of ∼ 5000 cells per sample. All HTOtagged cells are pooled together and are prepared using the 10X Genomics platform with Gel Bead Kit V2. The prepared assay is subsequently sequenced on an Illumina Hiseq platform with a depth of 50K reads per cell. In addition to cell hashing, cells in this dataset are simultaneously measured for their surface marker abundance through CITEseq [35]. Eight surface markers are measured for every cell: CD3, CD4, CD8, CD11, CD14, CD16, CD19, and CD56.
Cells in the CD4^{+} Memory T dataset were enriched from the peripheral blood of a healthy adult human volunteer using the MACSxpress® Whole Blood CD4 Memory T Cell Isolation Kit, human (Miltenyi Biotec). The cells were then incubated for 12 h at 37 ∘C, 5% CO_{2}, and at a concentration of 1×10^{6} cells/mL in serumfree, XVIVO20 medium (Lonza BioWhittaker) with T cell activation beads coated with antiCD2/CD3/CD28 antibodies (Miltenyi Biotec) alone or in combination with four different sets of recombinant human inflammatory mediators (i.e., five different culture conditions). The cells were then harvested from the culture medium for cellhashing [36] and CITEseq [35] single cell sequencing library preparation following the CITEseq and hashing protocol available at https://citeseq.com. The mRNA, HTO, and ADTderived libraries were then pooled at approximately 85%, 5%, and 10% proportions, respectively, and the pool of these sequencing libraries was sent for 150bp pairedend sequencing in two lanes of an Illumina HiSeq sequencer (MedGenome, Inc.).
All subjects were given informed consent, and the study is approved by the University of Pittsburgh IRB.
Simulation dataset
We also generated a simulated dataset by augmenting the PBMC1 dataset. Specifically, we classify GEMs in the PBMC1 dataset using both GMMDemux and the heuristic classifier of Seurat. Then, we extract GEMs that are classified as SSDs by both classifiers. We recovered SSDs from all four samples. We assume these GEMs are SSDs in truth. A summary of SSDs from the four samples is provided in Table 3. Notice that the mean of the samplelabeling HTO count of sample 1 (HTO 1) is significantly larger than the other three samples (HTO 2 in sample 2, HTO 3 in sample 3, and HTO 4 in sample 4). This shows that the sample barcoding could be susceptible to experimental inconsistencies and may include inconsistent levels of HTO counts among samples.
We used the extracted SSDs to generate a batch of simulated datasets covering a wide range of possible sample barcoding scenarios, including varying number of samples for barcoding, varying MSM percentages, and varying degrees of population imbalances between samples. For each dataset, we randomly distribute the SSDs into droplets. If a droplet is assigned with a single SSD, then it inherits the HTO counts of that SSD. If a droplet is assigned with more than one SSD, then the new HTO counts of the droplet are computed by adding the HTO counts of its assigned SSDs together. Let j denote a simulated multiSSD droplet and \(\mathbb {SSD}_{j}\) denote the set of SSDs assigned to j, we compute the new HTO counts of j as \({\bar {x}}_{j} = \sum _{i \in \mathbb {SSD}_{j}} w_{i} \cdot {x}_{i}\), where w_{i} is a random weight generated from \(\mathcal {N}(\mu = 1,\,\sigma ^{2}=0.04)\) and x_{i} is the HTO count vector of SSD i. Simulated multiSSD droplets that contain SSDs from multiple samples are marked as MSMs in ground truth.
We generated three sets of simulated datasets. In the first set, we generated datasets using different numbers of samples (2, 3, and 4 samples) while maintaining a fixed MSM percentage at 10% and equal SSD populations among samples. In the second set, we used all four samples with equal populations and generated simulated datasets with different MSM percentages (5%, 10%, and 15%). In the third set, we selected three samples (sample 1, sample 2, and sample 3), fixed the MSM percentage at 10%, and downsized sample populations into geometric sequences. We generated three datasets with common ratios of 1, \(\frac {1}{2}\), and \(\frac {1}{3}\), respectively. A summary of all nine simulation datasets is provided in Table 4.
Multisample multiplet classification results
For each cellhashing dataset, we compare the MSM classification results of five MSM classifiers: the GMMDemux classifier, the heuristic classifier of Seurat, the heuristic classifier of MULTIseq, the modelbased classifier demuxEM, and a humansupervised classifier. For the humansupervised classifier, a trained laboratory technician classifies GEMs based on the CLRtransformed HTO matrix.
The classification results are visualized in 2D tSNE plots [16]. The tSNE plots are generated directly from the HTO matrix. Note that the tSNE transformation is probabilistic and nondeterministic: GEMs with similar HTO UMI count profiles are likely to be grouped together, but there is no guarantee [42]. Sometimes, a small fraction of GEMs are incorrectly clustered with dissimilar neighbors, due to inaccuracies of the tSNE transformation. We use tSNE plots only for visualization and do not expect it to 100% reflect the truth.
Classification results on real datasets
The classification results of the PBMC1 dataset are shown in Fig. 3. Shown in the top panel are the GMMDemux classification result, the humansupervised classification result, the Seurat classification result, the MULTIseq classification result, and the demuxEM classification result, and a set of HTO UMI count heat maps of individual samples in the bottom panel. In each heat map, GEMs with higher HTO UMI counts of the sample have darker colors. For simplicity, we lump all MSMs together as a single class—the MSM class, while maintaining SSDs of different samples as separate classes. Additional classification results for the PBMC2 and the Memory T datasets are provided in Additional file 1: Fig. S3. If needed, GMMDemux is able to subdivide MSMs into subclasses where each sample combination is given a distinct class. Distinct MSM classification results are provided in Additional file 1: Fig. S4.
Figure 3 shows that the classification results from all five classifiers are mostly consistent. We compare the classification results against the HTO UMI count heat maps: a correct SSD classification should have a dark color in a single heat map and light colors in the rest of the heat maps; a correct MSM classification should have dark colors in more than one heat map. As evident in Fig. 3, the heat maps reinforce the MSM classifications by GMMDemux.
Even though Seurat generates classification results similar to those produced with the GMMDemux classifier, it is heuristicbased and unstable. Figure 4 illustrates the heuristic and unstable nature of the Seurat classifier. Results in this figure are generated from the PBMC1 dataset. Since the heuristic classifier relies on the HTO UMI count threshold for classification, which is indirectly controlled by t_{l}, it generates different classification results with different t_{l} values, as shown in Fig. 4a–d. From the figures, we observe that while a smaller t_{l} produces fewer negative classifications, it generates more MSM classifications. This is expected as a smaller t_{l} reduces the HTO UMI count threshold, which in turn increases the number of cellenclosing GEMs in each sample. Without ground truth, however, it is not obvious which t_{l} provides the most accurate classification result. Such high variations in the classification results, as well as the heavy reliance on heuristic parameters, reduce the reliability of the Seurat classifier. In practice, it is difficult to select the appropriate t_{l} for the best accuracy.
On top of its heuristic nature, because it uses the nondeterministic Kmedoid clustering algorithm, the Seurat classifier generates different results between two runs even with the same heuristic parameter. This is visualized by comparing Fig. 4a against e. Both figures are generated under t_{l}=0.99. Differences between them (highlighted in reddotted circles) stem solely from the nondeterminism of the Kmedoid algorithm.
Finally, the Seurat classifier is highly sensitive to changes in the dataset. In Fig. 4f, we randomly subsample GEMs from samples 3 and 4 (by 10% and 50%, respectively). When compared against Fig. 4a, we observe substantial changes in the classification result, highlighted in reddotted circles. This is because as the sample composition changes, the HTO count threshold of each sample also changes, even without updating t_{l}. As a result, previously classified MSMs now become SSDs and vice versa.
The GMMDemux classifier, on the other hand, is modelbased, stable, and far more deterministic. The GMMDemux classifier does not require heuristic parameters for MSM classification and generates consistent classification results across repetitive runs. Despite of uncertainties introduced by the EM algorithm, because GMMDemux is modelbased and the HTO UMI count distributions possess obvious features of a 2component Gaussian mixture, the EM algorithm always converges. Hence, GMMDemux generates consistent results. Figure 4g and 4h show the classification results of two repetitive runs of GMMDemux. There exist little differences between the two figures. Similarly, the GMMDemux classifier is much less susceptible to subsampling, as shown in Fig. 4i, where we subsampled GEMs from samples 3 and 4, as we did in Fig. 4f. By comparing Fig. 4i against g, we observe minimal changes in GEM classifications. A more detailed stability analysis across all four sample barcoding classifiers is included in Additional file 1: Fig. S5.
Not all GEMs can be confidently classified by GMMDemux. Some GEMs have low HTO UMI counts across all samples, while other GEMs have similar probabilities between multiple classes (such as between a l_{1} SSD and a l_{1}∩l_{2} MSM). Neither type of GEMs can be well classified: the former are classified as negative GEMs, which should be experimental errors, while the latter are classified as unclear GEMs, which are too ambiguous to be included in the final result. GMMDemux lets the user specify the confidence threshold, c, such that the user can customize the removal of unclear GEMs: a low confidence threshold salvages more unclear GEMs in the final result at the expense of decreased MSM classification quality. Across all three cellhashing datasets, over 99% of GEMs have confidence scores above 0.8. Therefore, we set the default confidence threshold of GMMDemux at 0.8 (c=0.8). Detailed distributions of confidence scores are provided in Additional file 1: Fig. S14.
Classification results on the simulation datasets
We benchmark the accuracy of GMMDemux against the other three classifiers (Seurat, MULTIseq, and demuxEM) by applying all four methods to the 9 simulation datasets and compare their classification results against the ground truth. All classifiers are benchmarked with their default parameters and are repeated 20 times for each dataset. An example set of classification results of the 4sample simulation dataset is visualized in Fig. 5.
For each classification, we compute the Adjusted Mutual Information (AMI) score between itself and the ground truth. The AMI score comparison across all simulation datasets is provided in Fig. 6a. As shown in the figure, GMMDemux achieves high classification accuracies across all scenarios, whereas other sample barcoding classifiers have faltered accuracy under low sample numbers (2 samples) or high sample imbalances (2 × scale and 3 × scale). In particular, MULTIseq failed to derive a stable quantile HTO count cutoff for the 2sample dataset and cannot converge to a stable classification solution. A detailed explanation of why MULTIseq fails is provided in the “Related works” section. Figure 6a proves that GMMDemux is highly accurate and is the most stable sample barcoding classifier.
Figure 6b records the execution time of each classifier over all simulated datasets. As shown in the figure, GMMDemux is significantly faster than other sample barcoding classifiers.
Samesample multiplet rate estimation results
We prove the correctness of the SSM estimator indirectly by validating the GEM formation model. Even though the SSM rate truth is not directly observable, if the underlying probabilistic model is accurate, then the SSM rates derived from the model should also be trustworthy. For this purpose, we compare the modelderived MSM rates against the GMMDemux classifierobserved MSM rates. If the numbers match, then we claim the GEM formation model must accurately characterize the GEM formation process.
For comprehensiveness, we compare not only the overall MSM rates of a dataset, but also the MSM rates of individual sample combinations. For each sample combination, we compare the modelderived MSM UMI count against the MSM classifierobserved UMI count. The comparison results are summarized into Venn diagrams, which illustrate the number of SSDs of each sample as well as the number of MSMs of each sample combination. We compare the modelderived Venn diagram against the MSM classifierobserved Venn diagram. Figure 7 includes the Venn diagram comparisons of the PBMC1 and the CD4^{+} Memory T datasets. Comparison of the PBMC2 dataset is included in the table of Additional file 2 (its persample combination classification result cannot be visualized in a Venn diagram due to a large number of sample combinations).
From Fig. 7, we observe that the modelderived MSM counts are consistent with the observed values from the MSM classifier. Therefore, we prove that the droplet formation model is accurate.
The estimated number of droplets (X) and the modelestimated singlet, MSM (Est. MSM), SSM, and relative SSM (RSSM) rates of each sample are summarized in Table 5. Also included in Table 5 are the GMMDemux classifierobserved MSM rates (Obs. MSM) and the proportions of unclear GEMs (GEMs with confidence scores below c=0.8) and negative GEMs in each dataset. Except the number of droplets (X), all rates are presented as percentiles (%). As shown in the table, the modelderived MSM rates are generally consistent with the classifierobserved MSM rates.
A detailed introduction of the droplet formation modelbased online experiment planner is provided in Additional file 1: Section S8. A suite of profiling results produced by the online experiment planner under varying experimental settings is provided in Additional file 1: Section S9.
Celltype authentication results
Celltype authentication via joint analysis with surface marker data
GEMs in the PBMC1 dataset are manually assigned into 17 distinctive clusters following the gating strategy detailed in Maecker et al. [19], which is visualized in Fig. 8. Among the 17 GEM clusters, 8 of them represent wellcharacterized cell types found in PBMCs (highlighted in green bounding boxes); 9 of them are rarely observed in PBMCs and are labeled as putative novel celltype candidates (highlighted in orange bounding boxes). All GEM clusters, annotated by their defining surface markers and their inferred cell types, if available, are summarized in Table 6.
For each GEM cluster, GMMDemux computes the MSM percentage of the cluster and compares it against the anticipated puretype MSM percentage as well as the anticipated phonytype MSM percentage of the cluster. The anticipated puretype MSM percentage of the cluster is a hypothetical value derived from the GEM formation model by assuming that the GEM cluster represents a real cell type. Similarly, the anticipated phonytype GEM percentage is computed by assuming the GEM cluster is a phonytype GEM cluster. Based on the observed and anticipated MSM percentages, GMMDemux performs puretype and phonytype hypothesis testings and classifies the GEM cluster according to the p values of both tests. The classification results, as well as the intermediate results in classifying each GEM cluster, are also included in Table 6. As summarized in Table 6, the PBMC1 dataset contains 9 cell types rarely observed in PBMCs. Named after their defining surface markers, these are as follows:

CD14^{+}CD56^{+}

CD3^{+}CD4^{+}CD14^{+}

CD3^{+}CD4^{+}CD19^{+}

CD3^{+}CD4^{+}CD56^{+}

CD3^{+}CD4^{+}CD8^{+}

CD3^{+}CD8^{+}CD14^{+}

CD3^{+}CD8^{+}CD19^{+}

CD3^{+}CD8^{+}CD56^{+}

CD3^{+}CD4^{+}CD14^{+}CD56^{+}
Upon further investigation, we observe that all 9 putative novelcelltypedefining GEM clusters have very high MSM percentages, approaching and exceeding their anticipated phonytype MSM percentages. When tested with puretype hypothesis, all 9 clusters have extremely small p values; and large p values from phonytype hypothesis tests. Consequently, GMMDemux designates all 9 GEM clusters as phonytype clusters.
Such result suggests that all 9 GEM clusters contain multiplets of different cell types. For instance, the CD14^{+}CD56^{+} GEM cluster contains multiplets that include both monocyte cells (CD14^{+}) and NK cells (CD56^{+}). Among the 9 phonytype GEM clusters, the CD3^{+}CD4^{+}CD14^{+}CD56^{+} GEM cluster has the largest MSM percentage, significantly larger than the rest. With further examination of its defining surface markers, we conclude that it contains tripletype GEMs—GEMs that include CD3^{+}CD4^{+} T cells, CD14^{+} monocytes, and CD56^{+} NK cells. According to the GEM formation model for phonytype hypothesis testing, detailed in Additional file 1: Section S3, tripletype phony GEM clusters have higher MSM percentages than doubletype phony GEM clusters. This explains the larger MSM percentage of the CD3^{+}CD4^{+}CD14^{+}CD56^{+} GEM cluster.
For the remaining 8 GEM clusters, which represent wellcharacterized cell types in PBMCs, 6 of them are classified as puretype GEM clusters, with the exception of the CD14^{+}CD16^{+} nonclassical monocyte GEM cluster and the CD56^{+}CD16^{+} NK GEM cluster. Both clusters are classified as mixture GEM clusters, suggesting that they contain both puretype and phonytype GEMs. This classification result is reasonable, as both GEM clusters contain fractions of indistinguishable multiplets. For instance, inside the CD14^{+}CD16^{+} GEM cluster, there could be a small fraction of CD14^{+}CD16^{+}andCD14^{+}CD16^{} phonytype GEMs. These phonytype GEMs are CD14^{+}CD16^{+}andCD14^{+}CD16^{} the CD14^{+}CD16^{+} puretype GEMs in gating. In gating, boundaries between cell types are drawn in a logtransformed surface marker space. After log transformation, the surface marker expression profile of a CD14^{+}CD16^{+}andCD14^{+}CD16^{} phonytype GEM is almost identical to a CD14^{+}CD16^{+} puretype GEM, even if they contain the same CD14^{+}CD16^{+} nonclassical monocyte cell. The only difference: the CD14^{+}CD16^{+}andCD14^{+}CD16^{} phonytype GEM is likely to have a slightly larger logtransformed CD14 expression value. Such subtle differences do not warrant the separation of CD14^{+}CD16^{+}andCD14^{+}CD16^{} phonytype GEMs from CD14^{+}CD16^{+} puretype GEMs. Due to intrinsic variations in surface marker expression levels, the two types of GEMs intermix with each other into a single, indivisible GEM cluster. Similarly, CD56^{+}CD16^{+}andCD56^{+}CD16^{} phonytype GEMs are also indistinguishable from CD56^{+}CD16^{+} puretype GEMs. This explains the slightlyhigherthanexpected MSM percentages in the CD14^{+}CD16^{+} monocyte GEM cluster and the CD56^{+}CD16^{+} NK GEM cluster, which resulted in designating them as mixture GEM clusters. Nonetheless, these should be the only phonytype GEMs they contain. Therefore, the MSM percentages of both clusters are only moderately above their corresponding puretype MSM percentages, remaining significantly smaller than their corresponding phonytypequalifying MSM percentage thresholds, reflecting that both clusters still have a puretype GEM majority. Overall, we conclude that the 8 GEM clusters with low MSM percentages represent real cell types in PBMC, in concordance with previous knowledge on PBMCs [19].
To validate the classification results of GMMDemux, we conducted an additional CITEseq sequencing experiment over a PBMC sample from the same donor of PBMC1. The additional CITEseq experiment measures the same set of surface markers as in PBMC1. To control the percentage of multiplets, we loaded only 3.2K cells while harvesting ∼ 1.6K GEMs. The online experiment planner estimated percentage of multiplets of this dataset is 1.9%, compared to 23.9% in PBMC1. We sorted GEMs following the same gating strategy illustrated in Fig. 8. Table 7 records the percentages of the 17 manually gated cell types in both PBMC1 and the validation dataset. We observe that all 9 phonytype GEM clusters identified in PBMC1 have muchreduced, closetozero presence in the validation dataset, while the 8 puretype GEM clusters have similar footprints. This confirms the classification results of GMMDemux.
Table 7 proves that removing MSMs alone does not eliminate all multiplets. None of the phony GEM clusters has a MSM percentage of 100%. All phony GEM clusters have nonnegligible fractions of SSMs, which cannot be revealed or removed through sample barcoding alone. After removing all phonytype GEM clusters, we estimate the RSSM rate of PBMC1 is further reduced to 3.29%, from 6.45%.
Gating refinement and joint celltype authentication with transcriptomic data
The selection of the CD14^{+}CD16^{+} nonclassical monocyte cell and the CD56^{+}CD16^{+} NK cells can be further refined with transcriptomic data. Figure 9a and 9b depict the previous surface markerbased classifications of monocyte and NK cells visualized in RNA UMAP plots, respectively. In both figures, we observe fractions of CD16^{+} cells disperse into the CD16^{} cell groups. Following the assumption that phonytype GEMs inherit RNA profiles from both member cell types, we refine the selection of both CD16^{+} GEM groups by manually removing GEMs that disperse into the CD16^{} GEM cluster. The refined cell selections are highlighted in Fig. 9c and 9d. After refinement, the MSM percentages of CD14^{+}CD16^{+} and CD56^{+}CD16^{+} GEM clusters reduce to 9.32% and 6.77%, respectively.
We also applied DoubletFinder and Scrublet to the PBMC1 dataset. Figure 10 displays the GMMDemux MSM classification result, the distribution of phonytype GEMs in Table 7, and the crosscelltype multiplet identification results of DoubletFinder and Scrublet. Comparing the four plots, we observe that phonytype GEM clusters (Fig. 10b) have higher MSM concentrations (Fig. 10a) and house the majority of the crosscelltype multiplets identified by DoubletFinder and Scrublet (Fig. 10c, d).
The DoubletFinder and Scrublet crosscelltype multiplet identification results reinforce the putative celltype authentication result of GMMDemux. A detailed comparison of the MSM percentages and the DoubletFinderidentified doublet percentages of individual putative cell types is provided in Table 8. Putative cell types that have high MSM percentages also have high DoubletFinderidentified crosscelltype multiplet percentages and vice versa. The concordance between the GMMDemux authentication result and the RNAbased crosscelltype multiplet identification results provides support for the correctness of GMMDemux. Parameter selections for both DoubletFinder and Scrublet are included in Additional file 1: Section S12.
Additional analysis on the impact of phonytype GEMs in downstream scRNAseq analysis is provided in Additional file 1: Section S13. We show that phonytype GEMs can confound downstream analysis and degrade RNA clustering accuracy, as well as generating lowquality clusters with high MSM concentrations.
Discussion
Related works
Currently, there are three analytical methods for processing sample barcoding data: the heuristic classifier provided by Seurat (or simply Seurat), the heuristic classifier provided by MULTIseq (or simply MULTIseq), and the modelbased classifier demuxEM. Seurat relies on the Kmedoid clustering algorithm [11], a probabilistic method [31], to classify MSMs. Assuming there are a total of M samples, for each sample, it clusters all GEMs into M groups using the Kmedoid clustering algorithm. Then, it removes the group with the highest mean, combines the remaining groups, fits the combined data with a negative binomial distribution, excludes the top 5% values as outliers, computes the q=t_{l} quantile (t_{l} is set to 99% by default) of the fitted distribution, and finally tags GEMs with HTO UMI values that are greater than q as samplespecific GEMs. If a GEM is classified as cellenclosing in multiple samples, then Seurat brands it as a MSM.
While Seurat has sufficiently demonstrated the benefit of sample barcoding, it is heuristicbased and is unstable. It includes a number of arbitrary parameters. It does not explain why it fits the data with a negative binomial distribution as opposed to other distributions, nor does it explain why it removes the top 5% values as outliers or sets t_{l}=99% as the default value. As we will see in the “Results” section, by setting t_{l} differently, it generates conflicting results and it is not evident which t_{l} provides the best result. Furthermore, because it relies on the Kmedoid clustering algorithm, which generates inconsistent results over repetitive runs, Seurat also generates inconsistent classification results over repetitive executions.
MULTIseq uses simple quantile cutoffs to classify GEMs. It assumes that the HTO antibody distributions across all samples have similar shapes. By design, MULTIseq first finds the two maximums that correspond to the two peaks of the two Gaussian components in each HTO distribution (CLRtransformed), termed the ontarget (\(\mathcal {N}_{high}\)) and the offtarget (\(\mathcal {N}_{low}\)) maximums. It then sets a universal quantile HTO count cutoff between the two maximums across all barcodes: GEMs with HTO counts of a sample that exceed the quantile cutoff are classified as containing cells from that sample. GEMs which have HTO counts from a single sample exceeding the quantile cutoff are SSDs, GEMs that have HTO counts from multiple samples exceeding the quantile cutoff are MSMs, and GEMs that do not have any HTO count exceeding the quantile cutoff are negative GEMs. MULTIseq sets the quantile cutoff in an iterative and heuristic manner: it finds a cutoff that yields the highest count of SSDs across all samples. Then, it classifies all droplets accordingly and removes all negative GEMs. It repeats the process until there is no negative droplet left. MULTIseq performs a final, reclassification step which uses Kmeans to update the classification of some of the previously classified negative GEMs into SSDs.
The implementation of MULTIseq, however, depends on an unreliable heuristic. Instead of finding the HTO values that correspond to the two peaks of the two Gaussian components in each HTO distribution, MULTIseq generates an array of local maximums in each distribution and designates the maxima with the largest HTO count as the ontarget maxima of the sample and the maxima that produces the highest peak in the distribution as the offtarget maxima of the sample. In doing so, MULTIseq implicitly assumes that there are always more offtarget GEMs than ontarget GEMs in each HTO distribution. In reality, when there are only two samples in a sample barcoding experiment, or when one sample has a larger population than the rest combined, then the above assumption of MULTIseq no longer holds. In those cases, MULTIseq is not applicable as we show in the “Results” section.
DemuxEM is similar to GMMDemux in principle: it assumes that HTO antibodies in a GEM come from two separate sources—antibodies from the background and antibodies from sample staining. However, it differs from GMMDemux in modeling the background antibodies. GMMDemux models the background antibodies as freefloating antibodies that rebind to cells in pooling. demuxEM models the background antibodies as freefloating antibodies that never bind to any cell but are encapsulated in the GEM emulsion. As a result, demuxEM derives the background antibody distribution by examining empty droplets—droplets that do not contain any cell, instead of examining the antibody distributions of the cellenclosing droplets. Through our experiments, we observe that this core assumption of demuxEM is flawed. Most empty droplets have closetozero antibody counts in all samples while most cellenclosing droplets (GEMs) have decent antibody counts in all samples. As a result, demuxEM underestimates the background antibody distribution, which reduces its classification accuracy, as our simulation shows. A more detailed analysis of background antibodies is provided in Additional file 1: Section S11.
Finally, none of the above methods proposes a model for the GEM formation process and none of them models SSMs. As a result, they are incapable of estimating the postMSMremoval multiplet percentages and they cannot authenticate putative cell types.
Prior to sample barcoding, multiplets can be identified experimentally by mixing samples of different donors. The most reliable method of finding multiplets involves mixing cells of different species [6, 13, 18, 48]. Multiplets are identified as GEMs whose reads are confidently mapped across multiple species. However, this method does not work when mixing samples of the same species. Instead, when working with samples of the same species, as long as the donors show sufficient amount of genetic variations, then multiplets can be identified as GEMs which contain distinct genetic signatures from multiple donors [12]. Unfortunately, neither method works when samples come from a single donor, which limits their applicability in scaling up single cell experiments. Sample barcoding, on the other hand, is capable of identifying multiplets even when samples are drawn from the same donor.
Besides the aforementioned methods, it is also plausible to identify some doublets through examining single cell expression profiles. When working with assays that contain multiple cell types, under the assumption that cells of the same type have highly similar expression profiles while cells of different cell types have drastically different expression profiles, multiplets are identified as small GEM groups whose expression profiles share similarities to multiple distinct large GEM groups or to multiple expression profiles of known distinct cell types [35, 48]. This idea can be further expanded to artificially create synthetic doublets from a single cell dataset and detect doublets by selecting GEMs whose expression profiles resemble synthetic doublets [22, 44]. While the idea has shown promise, a major limitation of RNAbased doublet finding studies is the lack of reliable evaluation mechanisms. The most reliable evaluation methods that are employed in these studies are still crossspecies validation, crossdonor validation, and crosscelltype validation. In crosscelltype validations, cell types of distant expression profiles are employed to secure reliable identifications of phony cell types. GMMDemux supplements RNAbased doublet finding studies by providing an additional means for evaluating the efficacy of their doublet identification results.
Sample barcoding provides an additional domain to the above experimental methods and has a wider applicability. Crossspecies, crossdonor, and crosscelltype multiplet identification methods rely on biological features of their respective domains, while sample barcoding gives the end users the freedom to customize the experiment, finetune the multiplet detection resolution, and bypass the reliance on biological features. For instance, in our Memory T dataset, cells of all five samples come from the same donor and consist of a single cell type. None of the traditional multiplet identification methods is applicable to this experiment, as there is only a single species, a single donor and a single cell type. Sample barcodingbased multiplet detection methods, such as GMMDemux, demuxEM, and MULTIseq, remain functional as they do not rely on a specific set of biological features. GMMDemux, specifically, is able to work in junction with multiplet identification methods of other domains (when possible). It can use the sample barcoding information to authenticate multiplet classifications predicted by methods of other domains (when applicable).
There are only a few prior studies on modeling multiplet rates. Demuxlet [12], a genetic variationbased multiplet identifier, models the singlet rate as \((1d_{0})^{\frac {Y}{Y_{0}}}\), where Y is the planned number of cells and d_{0} is the observed doublet rate (obtained through a mixedspecies experiment) when loading Y_{0} cells in library preparation. By default, Demuxlet assumes d_{0}=0.01 with Y_{0}=1K. Although not elaborated in the Demuxlet paper, we notice that the singlet rate equation in Demuxlet bears a striking resemblance to the singlet rate equation used by GMMDemux. Specifically, within the range of Y∈[1K,40K], \({(1\frac {1}{100})^{\frac {Y}{1,000}} \simeq (1\frac {1}{(100,000)})^{Y}}\). This is because the curve \(f(x) = (1  \frac {x}{100,000})^{\frac {Y}{x}}\) is almost flat within x∈[1,10,000]. Hence, the singlet formula used by Demuxlet under d_{0}=0.01 and Y_{0}=1K can be approximately explained by GMMDemux as randomly partitioning Y cells among a total of X=100K cellassay droplets. Despite apparent similarities between their formulas, GMMDemux and Demuxlet employ different underlying statistical mechanics. Demuxlet uses a discriminative model, which uses regression to subjectively model the multiplet rate as a parametrized curve. GMMDemux, on the other hand, uses a selfexplanatory, generative model that directly simulates the GEM formation process. The generative model allows GMMDemux to estimate the MSM rates of puretype and phonytype GEM clusters in a sample barcoding dataset, while the discriminative model of Demuxlet does not. The generative model also enables GMMDemux to accurately simulate multiplets, including both puretype and phonytype GEMs; singlets; SSMs; and MSMs, whereas Demuxlet cannot.
Alternatively, other works model the number of cells in a GEM with Poisson distributions [3, 7, 25]. A major downside of this branch of methods is the difficulty in estimating the model parameters. A Poisson model uses the average number of cells in a GEM as its parameter. However, this number changes when the number of loaded cells changes. As a result, these models cannot be readily used for experiment planning. Interestingly, Poisson distribution is a special case of the binomial distribution, where the number of probabilistic experiments in the binomial process (X, in this case) approaches infinity [28]. Poisson distribution is often used as a numerical approximation of binomial distributions, especially when the number of droplets (X) is large and the average number of cells in a droplet is small. Poisson distributionbased multiplet rate estimators in fact support the GEM formation model of GMMDemux and can be considered as numerical approximations of GMMDemux.
Despite outperforming existing methods, the underlying assumptions of GMMDemux impose a number of limitations. First, GMMDemux assumes a wide gap in the HTO concentrations before and after sample pooling. HTO concentration gaps are key to defining the two peaks in the bimodal distribution of HTO UMI counts. Although from our observation, the two peaks are always well defined and are always far apart from each other in the HTO UMI count distributions, this is not 100% guaranteed, especially when the sample number is low (e.g., M=2). When pooling fewer samples together, the HTO concentration reduction by pooling could diminish. However, this is more of a limitation of the sample barcoding technology, rather than a limitation of GMMDemux. Based on the premise of the sample barcoding technology, which strives to tag only samplespecific cells with HTOs, we believe that the bimodal distribution assumption should always hold. Second, the online experiment planner requires prior knowledge of the number of cellassay droplets generated by the library preparation equipment. We suggest users profile their library preparation equipment once with GMMDemux for the cellassay droplet count and use the profiled number in future experiment planning. While it is logical to assume that the same library preparation equipment generates the same number of cellassay droplets over repetitive runs, this is yet to be confirmed. In reality, based on the total volume of the loaded cell assay, the total count of cellassay droplets could vary, even if the cellassay pump operates at a constant frequency. Such variation, however, does not affect the MSM classifier, the SSM rate estimator, or the putative celltype authenticator. It only affects the online experiment planner and can be potentially alleviated by running the experiment planner with a suite of likely cellassay droplet configurations. Third, GMMDemux cannot identify phonytype GEMs on its own. Rather, GMMDemux authenticates preclustered, potential cell typedefining GEM groups. The efficacy of the celltype authentication result depends on the quality of the clustering input: GMMDemux is able to accurately classify GEM groups into puretype and phonytype GEM clusters if the clustering input has high fidelity (GEMs of different cell types are organized into individual clusters). Otherwise, given a lowquality clustering input, GMMDemux will label most clusters as mixture GEM clusters. By decoupling clustering from celltype authentication, GMMDemux provides the end users the freedom of selecting and customizing the clustering algorithm that best fits their specific applications. Finally, GMMDemux assumes cells are partitioned into droplets independently. This model does not consider the volume taken up by each cell. A more realistic model would assign diminishing likelihoods to having additional cells partitioned into a droplet as more cells accumulate in the droplet. To that end, GMMDemux does not take cell size differences into consideration either. As cells differ in size, a more accurate model would assign a smaller likelihood to having two large cells partitioned into the same droplet than that of two small cells. Unfortunately, the cell size and droplet size information is not readily available in sample barcoding data, which limits us from studying the effect of cell size on multiplet rates. Nevertheless, given that the probability of a droplet containing more than three cells is already close to zero according to our current droplet formation model, and the fact that the cellassay droplet size has to be large enough to accommodate the largest possible cell in a tissue, we believe it is unnecessary to further complicate the GEM formation model to include the cell size information.
We further benchmarked GMMDemux with an additional 4HTO colonoscopic biopsy cellhashing experiment from paired inflamed and uninflamed biopsies from a patient with Crohn’s disease. We observed results that are in concordance with the PBMC and the Memory T datasets in the “Results” section. The medical useonly colonoscopic biopsy dataset is excluded from the main results because of privacy constraints.
Conclusion
In this paper, we proposed a modelbased Bayesian framework, GMMDemux, for detecting sample barcodingdetectable multiplets in a sample barcoding dataset, estimating the percentage of sample barcodingundetectable multiplets in the remaining dataset, predicting the multiplet rates of planned future sample barcoding experiments, and validating the existence of putative cell types. At its core, GMMDemux uses Gaussian mixture models to identify GEMs that contain samplespecific cells and then uncovers MSMs by selecting GEMs that contain cells from multiple samples. We showed that GMMDemux accurately and consistently classifies GEMs into SSDs and MSMs and generates more accurate and more consistent results when compared against existing methods. We further proposed a GEM formation model to estimate the SSM rate in a sample barcoding dataset. The GEM formation model describes the GEM formation process as an augmented binomial process. We showed that the GEM formation model accurately characterizes the GEM formation process. We built an online experiment planner that estimates the multiplet rate of planned future sample barcoding (or an ordinary single cell) experiments. Then, we used the online experiment planner to generate a series of multiplet profiles under various experimental setups. Finally, we proposed putative cell type authenticator that authenticates the existence of putative cell typedefining GEM clusters, and showed that GMMDemux correctly identifies phonytype GEM clusters in single cell datasets.
GMMDemux is the first work that is able to not only accurately and consistently classify sample barcodingdetectable MSMs in a sample barcoding dataset, but also estimate the undetectable SSM rates among the remaining SSDs. Furthermore, GMMDemux is the first work attempting to model the GEM formation process using a generative model. GMMDemux incorporates its GEM formation model into an online experiment planner that is capable of anticipating experimental outcomes of planned sample barcoding experiments, and it is a first in systematically verifying the legitimacy of putative cell types using sample barcoding information.
In our future work, we intend to perform more sample barcoding experiments with different tissues and investigate the underlying mechanisms that govern the number of cellassay droplets and the capture rate in a sample barcoding experiment. We seek to expand the GEM formation model and use it to detect false lineage discoveries and false celltype discoveries in single cell data analysis. We also plan to investigate how to identify SSMs within SSDs.
Methods
GMMDemux is built around four goals: (1) separate MSMs from SSDs in a sample barcoding dataset; (2) estimate singlet and SSM rates of a sample barcoding dataset; (3) plan future sample barcoding experiments—estimate the anticipated MSM, SSM, and singlet rates of a planned future experiment; and (4) determine whether a homogeneous GEM cluster is a puretype GEM cluster. GMMDemux has two separate components: (1) a Gaussian mixture modelbased MSM classifier and (2) a modelbased SSM rate estimator. The MSM classifier classifies GEMs into MSMs and SSDs using Gaussian mixture models and computes the likelihood of each classification. The SSM rate estimator estimates the SSM and the singlet rate of the dataset. The SSM rate estimator models the GEM formation process as an augmented binomial process. It infers the latent parameters of the model, such as the number of cells of each sample and the number of cellassay droplets formed during sequencing, from observed variables, including the number of cellenclosing GEMs of each sample and the number of MSMs of each sample pair. Finally, the SSM rate estimator computes the estimated singlet and SSM rates of each sample with the inferred latent parameters. With the GEM formation model, GMMDemux determines whether a proposed homogeneous GEM cluster is a puretype GEM cluster, a phonytype GEM cluster, or a mixture cluster.
Based on the GEM formation model, we build an online sample barcoding experiment planner that estimates the multiplet rates of future sample barcoding experiments. Researchers can use the experiment planner to anticipate the outcome of a sample barcoding experiment without actually conducting the experiment. The online experiment planner takes the number of cells planned for sequencing as well as the number of samples planned for sample barcoding as inputs and outputs of the estimated MSM, SSM, and singlet rates of the anticipated outcome.
Multisample multiplet (MSM) classifier
The MSM classifier preprocesses the HTO matrix with centeredlogratio (CLR) normalization [35, 36]. CLR normalizes the HTO UMI counts of each GEM columnwise (samplewise) as follows:
Here, \(x_{i}^{l}\) denotes the CLRnormalized HTO UMI count of the lth sample in the ith GEM (the ith row and the lth column of the HTO matrix); \(\bar {x}_{i}^{l}\) denotes the original HTO UMI count of the lth sample in the ith GEM and n denotes the total number of GEMs.
The distributions of the CLRtransformed HTO UMI counts of a 4sample cellhashing experiment are illustrated in Fig. 11. From this figure, we observe that for each sample, the CLRtransformed HTO UMI counts follow a bimodal distribution which resembles a mixture of two Gaussian distributions. GMMDemux models the HTO UMI count distribution with an aggregated twoGaussian distribution mixed model. We color the two distributions as red and green, respectively, in Fig. 11. For a specific sample \(\hat {l}\) (\(l=\hat {l}\)), the Gaussian distribution with the smaller mean, \(\mathcal {N}_{low}^{\hat {l}}(\mu _{low}^{\hat {l}},\,(\sigma _{low}^{\hat {l}})^{2})\) (in red), accounts for GEMs that do not contain cells from \(\hat {l}\) (\(\hat {l}\)cellfree GEMs). The other distribution, \(\mathcal {N}_{high}^{\hat {l}}(\mu _{high}^{\hat {l}},\,(\sigma _{high}^{\hat {l}})^{2})\) (in green), on the contrary, models GEMs that contain cells from \(\hat {l}\) (\(\hat {l}\)cellenclosing GEMs). It is worth noting that GEMs from \(\mathcal {N}_{low}^{\hat {l}}(\mu _{low}^{\hat {l}},\,(\sigma _{low}^{\hat {l}})^{2})\) still have positive HTO counts. In cell hashing, when cell assays of all samples are pooled together, freefloating HTO antibodies that have not yet bound to any cell still exist in the solution, as shown in Additional file 1: Figure S2. These residual freefloating HTO antibodies bind randomly to all cells from all samples (the restaining step in Additional file 1: Figure S2). However, as cell assays are pooled together, antibodies are diluted; hence, \(\mathcal {N}_{low}^{\hat {l}}(\mu _{low}^{\hat {l}},\,(\sigma _{low}^{\hat {l}})^{2})\) has a lower mean (\(\mu _{low}^{\hat {l}} < \mu _{high}^{\hat {l}}\)).
GEMs from \(\mathcal {N}_{high}^{\hat {l}}(\mu _{high}^{\hat {l}},\,(\sigma _{high}^{\hat {l}})^{2})\), on the other hand, bind with HTO antibodies prior to pooling of samples. Before pooling, HTO antibodies have much higher concentrations. As a result, \(\mathcal {N}_{high}^{\hat {l}}(\mu _{high}^{\hat {l}},\,(\sigma _{high}^{\hat {l}})^{2})\) has a higher mean.
For each sample, GMMDemux uses its Gaussian mixture model to find GEMs that contain cells from the sample. Given a GEM, i, and a sample \(\hat {l}\), GMMDemux tests whether \(x_{i}^{\hat {l}}\) originates from the \(\mathcal {N}_{high}^{\hat {l}}(\mu _{high}^{\hat {l}},\,(\sigma _{high}^{\hat {l}})^{2})\) distribution of \(\hat {l}\): if \(x_{i}^{\hat {l}}\) originates from \(\mathcal {N}_{high}^{\hat {l}}(\mu _{high}^{\hat {l}},\,(\sigma _{high}^{\hat {l}})^{2})\), then i must contain cells from \(\hat {l}\); otherwise, \(x_{i}^{\hat {l}}\) must belong to \(\mathcal {N}_{low}^{\hat {l}}(\mu _{low}^{\hat {l}},\,(\sigma _{low}^{\hat {l}})^{2})\), which means GEM i does not contain cells from \(\hat {l}\).
Let \(Z_{i}^{\hat {l}}=high\) denote the event that \(x_{i}^{\hat {l}}\) originates from \(\mathcal {N}_{high}^{\hat {l}}(\mu _{high}^{\hat {l}},\,(\sigma _{high}^{\hat {l}})^{2})\) and \(Z_{i}^{\hat {l}}=low\) denote the event that \(x_{i}^{\hat {l}}\) originates from \(\mathcal {N}_{low}^{\hat {l}}(\mu _{low}^{\hat {l}},\,(\sigma _{low}^{\hat {l}})^{2})\). Let \(P(Z_{i}^{\hat {l}}=high)\) and \(P(Z_{i}^{\hat {l}}=low)\) denote the prior probability of GEM i originating from \(\mathcal {N}_{high}^{\hat {l}}(\mu _{high}^{\hat {l}},\,(\sigma _{high}^{\hat {l}})^{2})\) and \(\mathcal {N}_{low}^{\hat {l}}(\mu _{low}^{\hat {l}},\,(\sigma _{low}^{\hat {l}})^{2})\), respectively. Then, the probability of observing HTO count value \(x_{i}^{\hat {l}}\) in GEM i equals to:
where \(P(x_{i}^{\hat {l}} \mid Z_{i}^{\hat {l}}=high) \sim \mathcal {N}_{high}^{\hat {l}}(\mu _{high}^{\hat {l}},\,(\sigma _{high}^{\hat {l}})^{2})\) and \(P(x_{i}^{\hat {l}} \mid Z_{i}^{\hat {l}}=low) \sim \mathcal {N}_{low}^{\hat {l}}(\mu _{low}^{\hat {l}},\,(\sigma _{low}^{\hat {l}})^{2})\).
GMMDemux computes the mean and the standard deviation of \(\mathcal {N}_{high}^{l}(\mu _{high}^{\hat {l}},\,(\sigma _{high}^{\hat {l}})^{2})\) and \(\mathcal {N}_{low}^{l}(\mu _{low}^{\hat {l}},\,(\sigma _{low}^{\hat {l}})^{2})\), as well as the prior probabilities \(P(Z_{i}^{l}=high)\) and \(P(Z_{i}^{l}=low)\) of each sample l using the Expectation Maximization (EM) Technique [34].
With all Gaussian mixture models computed across all samples, for each GEM i, GMMDemux computes the posterior probability of GEM i containing cells from sample \(\hat {l}\). Let \(P(Z_{i}^{\hat {l}}=high \mid x_{i}^{\hat {l}})\) denote the posterior probability of \(x_{i}^{\hat {l}}\) originating from \(\mathcal {N}_{high}^{\hat {l}}(\mu _{high}^{\hat {l}},\,(\sigma _{high}^{\hat {l}})^{2})\), and \(P(Z_{i}^{\hat {l}}=low \mid x_{i}^{\hat {l}})\) denote the probability of \(x_{i}^{\hat {l}}\) originating from \(\mathcal {N}_{high}^{\hat {l}}(\mu _{high}^{\hat {l}},\,(\sigma _{high}^{\hat {l}})^{2})\). Both posterior probabilities (\(P(Z_{i}^{\hat {l}}=high \mid x_{i}^{\hat {l}})\) and \(P(Z_{i}^{\hat {l}}=low \mid x_{i}^{\hat {l}})\)) are computed using Bayes’ rule:
The probability (\(P(i \in SSD_{\hat {l}})\)) of i being a singlesample droplet (SSD) of sample \(\hat {l}\) (\(SSD_{\hat {l}}\)) can be computed as:
The probability of i being a multisample multiplet (MSM) can be computed as:
GMMDemux classifies GEMs by ranking above probabilities: a GEM i is classified as a SSD of \(\hat {l}\) if \(P(i = SSD_{\hat {l}})\) is the largest among all, or as a MSM if P(i=MSM) is the largest among all.
In fact, GMMDemux is able to compute the probability of a GEM containing cells of any specific multisample configuration. Assume U is a set of samples (e.g., sample l_{1} and sample l_{4}). The probability of GEM i containing cells from U, MSM_{U}, can be computed by:
This allows GMMDemux to not only identify and count SSDs, but also identify and count MSMs of specific sample combinations in a sample barcoding dataset. Counting MSMs of specific sample combinations is key to verifying the correctness of the SSM rate estimator, as we will show in later sections.
GMMDemux lets the user specify a confidence cutoff c to filter out uncertain classifications. Sometimes, GEMs have HTO UMI counts that reside in the junction area between \(\mathcal {N}_{high}^{l}(\mu _{high}^{\hat {l}},\,(\sigma _{high}^{\hat {l}})^{2})\) and \(\mathcal {N}_{low}^{l}(\mu _{low}^{\hat {l}},\,(\sigma _{low}^{\hat {l}})^{2})\) on a HTO sample dimension. Such GEMs produce ambiguous classification results: they have similar likelihoods between multiple classifications, which typically are all below 0.5. Uncertain GEMs are pruned by the confidence cutoff c: GEMs with maximum probabilities across all classifications which are less than c are deemed uncertain GEMs and are removed from the population. By tweaking c, GMMDemux allows users to adjust the level of rigorousness in identifying SSDs and MSMs.
Samesample multiplet (SSM) rate estimator
As previously discussed, sample barcoding cannot distinguish SSMs from singlets. While GMMDemux does not seek to identify SSMs in SSDs, it estimates the percentage of SSMs and singlets in each sample using the SSM rate estimator. Estimating the SSM rate in a dataset is critical for quality control. SSM rate represents the noise level of a sample. Samples with high SSM rates have low quality and should be removed.
GMMDemux estimates the percentage of SSMs among all GEMs using a probabilistic model that models the entire GEM formation process in sample barcoding. The GEM formation process occurs after pooling of samples and governs the subsequent random distribution of cells into GEMs. GMMDemux models the GEM formation process as an augmented binomial process: it assumes that after pooling of samples, the entire cell assay is divided into a finite number of droplets, called cellassay droplets. Each cell is randomly and independently partitioned into a cellassay droplet. During the single cell barcoding process, a fraction of all cellassay droplets are combined with gel beads and form GEMs. The rest of the cellassay droplets do not form GEMs and will not be sequenced. We use the term droplet capture rate to denote the probability that a cellassay drop is combined with a gel bead. GEMs, which contain both cellenclosing cellassay droplets and gel beads, are recovered after sequencing and are summarized in a HTO matrix. A detailed illustration of the GEM formation model is provided in Additional file 1: Section S1.
The rates of multiplets, including both SSM rates and MSM rates, are modeled as the probability of having multiple cells (from the same or different samples) partitioned into the same cellassay droplet. A major challenge for this method is that key parameters, namely the number of cells in each sample, the droplet capture rate, and the total number of cellassay droplets, are not directly observable. Instead, from the MSM classifier, we observe the number of samplespecific GEMs as well as the number of MSMs of any sample pair. Combined with the prior knowledge of the estimated total number of cells loaded for sample barcoding, the SSM rate estimator derives the latent parameters of the model and uses the complete model to estimate the multiplet rates of the dataset.
Modeling multiplets
The SSM rate estimator models the GEM formation process as follows: Assume there are a total of X cellassay droplets. Also assume there are y_{l} cells in a sample, l, with Y denoting the overall population of all cells, or \(Y = \sum _{l} y_{l}\). The model assumes that each cell is independently and randomly partitioned into a cellassay droplet. Consequently, a cell has a probability of 1/X to reside within a specific cellassay droplet. Assuming that no bias exists among cells from different samples, then the probability of a cellassay droplet, i, being a singlet, given that i is not empty, can be calculated as:
where \(\mathbb {E}[\#_{\text {\texttt {singlets}}}]\) is the expected number of singlets and \(\mathbb {E}[\#_{\mathtt {non{\text }empty drops}}]\) is the expected number of nonempty cellassay droplets. For simplicity, in the rest of this paper, we refer to cellassay droplets simply as droplets.
Since cells are randomly partitioned into droplets, \(\mathbb {E}[\#_{\text {\texttt {singlets}}}]\) can be computed from a binomial model. Specifically, we have \({\mathbb {E}[\#_{\text {\texttt {singlets}}}] = X \cdot P(i \in {\text {\texttt {singlet}}})}\), where P(i∈singlet) denotes the probability of having one and only one cell, out of a total of Y cells, residing in i. All other cells are partitioned into other droplets. Mathematically, we have:
Similarly, the expected number of nonempty droplets can be computed as \(\mathbb {E}[\#_{\mathtt {non{\text }empty\ drops}}] = X \cdot P(i \in {\mathtt {non{\text }empty\ drops}})\). P(i∈nonemptydrops) is the probability of i being nonempty, and it equals to 1−P(i∈emptydrops). According to binomial distribution, P(i∈emptydrops) equals to the probability of all cells residing in droplets other than i. Overall, we have:
Equally, the probability of i being a MSM given i is not empty, P(i∈MSM), can be computed as:
with P(i∈SSD) denoting the probability of i being a SSD.
When more than one sample is labeled in sample barcoding, we have \(P(i \in {\text {\texttt {SSD}}})= \sum _{l}P(i \in \text {\texttt {SSD}}_{l})\), where P(i∈SSD_{l}) is the probability of i being a SSD of sample l. Let set D_{l} represent all and only lcellenclosing droplets and set \(D_{l}^{C}\) to represent all and only lcellfree droplets. The probability of i being a SSD of sample \(\hat {l}\) (\(\text {\texttt {SSD}}_{\hat {l}}\)) equals the probability of i being a cellenclosing droplet in \(\hat {l}\) and a cellfree droplet in all other samples. Based on binomial distribution, the probability of i belonging to \(D_{\hat {l}}^{C}\), \(P(i \in D_{\hat {l}}^{C})\), equals the probability of all cells of \(\hat {l}\) residing in droplets other than i, which is \((1  \frac {1}{X})^{y_{\hat {l}}}\). As \(D_{\hat {l}}\) and \(D_{\hat {l}}^{C}\) complement each other, we have \(P(i \in D_{\hat {l}}) = 1  P(i \in D_{\hat {l}}^{C})\). We expand \(P(i \in \text {\texttt {SSD}}_{\hat {l}})\) into the following:
where \(P(i \in D_{\hat {l}})\) and \(\prod _{l\neq \hat {l}}P(i \in D_{l}^{C})\) can be computed as:
Finally, the probability of i being a SSM is simply the probability of i being neither a MSM nor a singlet. Mathematically, we have:
Alternatively, we can compute P(i∈SSM) as \(P(i \in \text {\texttt {SSM}}) = \sum _{l} P(i \in \text {\texttt {SSM}}_{l})\), with P(i∈SSM_{l}) denoting the probability of i being a SSM of sample l. Because a SSD of l must be either a SSM of l or a singlet of l, therefore, event {i∈SSM_{l}∣i∈SSD_{l}} and event {i∈singlet_{l}∣i∈SSD_{l})} must be collectively exhaustive events. Together, P(i∈SSM_{l}) can be computed as:
Since all singlets of l are SSDs of l, we have:
The two methods (Eqs. (13) and (14)) of calculating P(i∈SSM) are equivalent (details are omitted to conserve space).
Overall, given X and y_{l} for every sample l of a sample barcoding dataset, the SSM rate estimator estimates the singlet rate (P(i∈singlet), Eq. (7)), the MSM rate (P(i∈MSM), Eq. (10)), and the SSM rate (P(i∈SSM), Eqs. (13) and (14)) of the dataset. Unlike the SSM rate, which can only be inferred indirectly through the GEM formation model, the MSM rate can be obtained both analytically through the GEM formation model and numerically by interpreting the MSM classification result. As a result, we can validate the correctness of the GEM formation model by comparing the MSM rates obtained through both methods. In the “Results” section, we show that both methods provide consistent MSM rates.
We perform simulations to verify the correctness of the above equations. The simulation results are included in Additional file 1: Section S2. Specifically, we repeatedly simulate the GEM formation process. We show that the singlet, SSM, and MSM rates measured from simulations asymptotically match the values analytically computed with above equations.
Estimating model parameters
GMMDemux relies on X and y_{l} of every sample l to compute the SSM rates. However, neither X nor y_{l} is directly observable in a sample barcoding dataset. Instead, from the classification result, GMMDemux observes z_{l}, the number of GEMs in D_{l}.
Let r_{cap} denote the droplet capture rate. From z_{l} and a userprovided estimation of the total cell count, Y, GMMDemux computes both X, r_{cap}, and y_{l}. For a HTO sample l, based on our multiplet model, we have \(P(i \in D_{l}^{C} \mid X, y_{l}) = (1  \frac {1}{X})^{y_{l}}\) (X and y_{l} serve as parameters) and \(P(i \in D_{l} \mid X, y_{l}) = 1  P(i \in D_{l}^{C} \mid X, y_{l})\). Let random variable Z_{l} denote the number of GEMs that enclose cells from l and let P(Z_{l}=z_{l}∣X,r_{cap},y_{l}) denote the probability of observing z_{l}lcellenclosing GEMs under the parameter set [X,r_{cap},y_{l}]. According to the GEM formation model, which models partitioning of cells into droplets with a binomial distribution, we have:
We derive the model parameters by computing:
where Y is the userprovided total number of cells loaded for library preparation, which can be obtained from the hemocytometer.
Online sample multiplexing experiment planner
The online sample barcoding experiment planner estimates the singlet, SSM, and MSM rates of a planned sample barcoding experiment via the GEM formation model. Specifically, it takes the estimated number of cells (Y), the planned number of samples for sample barcoding (M), the estimated number of droplets (X), and the droplet capture rate (r_{cap}) in library preparation as inputs, and it computes the estimated multiplet rates. The online experiment planner assumes cells are evenly distributed among M samples.
The online experiment planner also estimates the relative singlesample multiplet (RSSM) rate, defined as the estimated number of SSMs among SSDs. Mathematically, the RSSM rate is defined as:
The RSSM rate marks the overall quality of a sample barcoding dataset. It represents the percentage of irremovable multiplets among SSDs, after removing all MSMs in the dataset. If the RSSM rate of the estimated outcome is too high, then the planned experiment should be aborted, as the anticipated outcome will be too noisy for downstream analysis. While dividing the cell assay into more samples drives down the RSSM rate, as it reduces \(\mathbb {E}[\#_{\text {\texttt {SSM}}}]\), it increases both the cost and the complexity of the experiment. With the multiplet rate estimator, researchers can determine the minimum number of HTO samples to use in a sample barcoding experiment, to save cost while meeting the RSSM rate target.
The online experiment planner computes the multiplet rates as follows:
The above equations show that the number of samples, M, does not affect the singlet rate. The singlet rate is solely determined by X and Y. However, a greater M reduces the SSM rate and increases the MSM rate. Therefore, we conclude that dividing a cell assay into more samples by sample barcoding transforms more SSMs into MSMs. Transforming SSMs into MSMs improves the quality of the dataset. With fewer SSMs, the RSSM rate of the dataset decreases. In comparison, having more MSMs does not affect the quality of the dataset, as MSMs are removed by GMMDemux.
Given r_{cap}, the online experiment planner also computes the estimated number of cellenclosing GEMs in the final output, as well as the estimated number of SSDs after removing MSMs. The number of cellenclosing GEMs, #_{nonemptyGEM}=#_{nonemptydrops}·r_{cap} (#_{nonemptydrops} is computed in Eq. (9)). The number of SSDs is computed as #_{SSD}=#_{nonemptyGEM}·(1−P(i∈MSM)).
Among all four inputs, Y and M are usercontrolled while X and r_{cap} are largely dictated by the library preparation equipment. However, based on our observations, we found that X mostly varies between 65K and 80K. To account for the wide ranges of variability of the inputs, the online experiment planner uses sliders for selecting X, Y, M, and r_{cap}, which have ranges of 60K– 100K, 1K– 80K, 1–20, and 0–1, respectively. The online experiment planner supports dynamic updates. It computes the estimated multiplet rates in real time as the user updates input parameters. In practice, we recommend that users profile their library preparation equipment once for the total number of droplets (X) in a sequencing run, by performing a smallscale sample barcoding experiment, and use the profiled X (included in the GMMDemux output) in planning future experiments.
Puretype GEM verification
In novel celltype identification, a celltype classifier is used to group GEMs into clusters. Each cluster is assumed to represent a unique cell type. Clusters with average expression profiles that do not match any known cell types are identified as novel cell types [40].
After clustering, phonytype GEMs are grouped into distinct clusters. Phonytype GEM clusters may be incorrectly identified as novel cell types, as their expression profiles do not match known cell types, generating false discoveries. GMMDemux rectifies true novel cell types by validating if the alleged novel celltype GEM cluster contains mainly puretype GEMs. Based on the GEM composition in the cluster, GMMDemux classifies GEM clusters into three categories: puretype GEM clusters, phonytype GEM clusters, and mixture clusters. Phonytype GEM clusters contain mostly phonytype GEMs. Puretype GEM clusters contain mostly puretype GEMs. Mixture clusters contain large quantities of both puretype and phonytype GEMs.
Let G represent a GEM cluster. GMMDemux classifies G by examining the MSM ratio of G. For simplicity, we assume cells are equally randomly divided into the M sample barcoding samples. If G is a phonytype GEM cluster, the MSM ratio of G must be very high. Elaborated in Additional file 1: Section S3, the expected MSM ratio of a phonytype cluster approaches and exceeds \(1\frac {1}{M}\). Otherwise, if G is a puretype GEM cluster, its MSM ratio should not be greater than the MSM ratio of the entire sample barcoding dataset, which is much smaller than \(1\frac {1}{M}\). The MSM ratio reflects the GEM composition of G: in a phonytype GEM cluster, all GEMs are multiplets; hence, the MSM ratio of G, \(r_{\mathtt {MSM_{G}}}\), equals to \(r_{\mathtt {MSM_{G}}} = \frac {\#_{\mathtt {MSM_{G}}}}{\#_{\mathtt {SSM_{G}}}+\#_{\mathtt {MSM_{G}}}}\), where \(\#_{\mathtt {MSM_{G}}}\) and \(\#_{\mathtt {SSM_{G}}}\) denote the number of MSMs and SSMs in G, respectively; in a puretype GEM cluster, however, we have \(r_{\mathtt {MSM_{G}}} = \frac {\#_{\mathtt {MSM_{G}}}}{\#_{\mathtt {singlet_{G}}}+\#_{\mathtt {SSM_{G}}}+\#_{\mathtt {MSM_{G}}}}\) instead, where \(\#_{\mathtt {singlet_{G}}}\) denotes the number of singlets in G. By comparing the two ratios, we observe that puretype GEM clusters include singlet counts in the denominator, whereas phonytype GEM clusters do not. As a result, the MSM ratio is much higher in phonytype GEM clusters than in puretype GEM clusters. Complex situations where cells are not evenly distributed among sample barcoding samples are discussed in Additional file 1: Section S3.
GMMDemux uses hypothesis testing to measure the confidence of each classification. GMMDemux prepares two hypotheses, the phonytype hypothesis and the puretype hypothesis, which assume G being a puretype or a phonytype GEM cluster, respectively. GMMDemux tests both hypotheses with the binomial test and computes a p value for each hypothesis. Based on the hypothesis testing results, GMMDemux classifies G as a puretype GEM cluster, a phonytype GEM cluster, or a mixture cluster. Details of the hypothesis tests are provided in Additional file 1: Section S3.
Based on the classification result of G, GMMDemux recommends different actions. Being classified as a phonytype GEM cluster suggests that the proportion of puretype GEMs in G, if there exists any, is extremely small and most GEMs in G are phonytype GEMs. GMMDemux recommends excluding G from further analysis. Being classified as a mixture cluster suggests that G mixes puretype GEMs and phonytype GEMs together and has nontrivial numbers of GEMs in both categories. This is often a result of poor clustering quality where G becomes a supercluster over several puretype and phonytype GEM clusters. GMMDemux recommends refinement over the clustering method and subdividing G into puretype GEM and phonytype GEM subclusters. Finally, being classified as a puretype GEM cluster suggests that it is plausible that G defines a real cell type. Further analysis over G is recommended.
Compatibility
The GMMDemux classifier is compatible with CellRanger3.1.0 from 10X Genomics. It takes the sample barcoding data of postfiltering, nonempty droplets, in the market matrix (ṁtx) format, together with the estimated number of cells (Y), as inputs, and it outputs a double column table as the classification result. The row indices of the output table are GEM barcodes. The two columns are the classification of each GEM and the confidence score of each classification, respectively. With M samples, GMMDemux classifies GEMs into a maximum of 2^{M}+1 classes. Besides the uncertain class, the negative class, and M SSD classes, there are \(\binom {M}{2}\) bisample classes, \(\binom {M}{3}\) trisample classes, …and \(\binom {M}{M}=1\)Msample class. Additionally, GMMDemux produces a SSM rate summary file, which includes the SSM rate and the RSSM rate of each sample, and a summary file that includes the multiplet rates of the entire dataset. The summary file also includes the estimated number of cellassay droplets (X) and the estimated droplet capture rate (r_{cap}) of the library preparation equipment. Example outputs are provided in Additional file 1: Section S4.
Availability of data and materials
The source code of GMMDemux is accessible at Github [45] under the MIT license and Zenodo [46]. The inhouse cellhashing datasets are available in the Gene Expression Omnibus repository, under accession GSE152981 [47]. The public PBMC cellhashing dataset is also available in the Gene Expression Omnibus repository, under accession GSE131756 [37]. The colonoscopic biopsy dataset can be shared via a Material Transfer Agreement after it is reviewed by the University of Pittsburgh.
References
 1
Ahmed R, Omidian Z, Giwa A, Cornwell B, Majety N, Bell DR, Lee S, Zhang H, Michels A, Desiderio S, et al.A public bcr present in a unique dualreceptorexpressing lymphocyte from type 1 diabetes patients encodes a potent t cell autoantigen. Cell. 2019; 177(6):1583–99.
 2
Babtie AC, Chan TE, Stumpf MP. Learning regulatory models for cell development from single cell transcriptomic data. Curr Opin Syst Biol. 2017; 5:72–81.
 3
Bloom JD. Estimating the frequency of multiplets in singlecell RNA sequencing from cellmixing experiments. PeerJ. 2018; 6:e557.
 4
Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating singlecell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018; 36(5):411.
 5
Campbell KR, Yau C. Order under uncertainty: robust differential expression analysis using probabilistic models for pseudotime inference. PLoS Comput Biol. 2016; 12(11):e1005,212.
 6
Cao J, Packer JS, Ramani V, Cusanovich DA, Huynh C, Daza R, Qiu X, Lee C, Furlan SN, Steemers FJ, et al.Comprehensive singlecell transcriptional profiling of a multicellular organism. Science. 2017; 357(6352):661–667.
 7
Collins DJ, Neild A, Liu AQ, Ai Y, et al.The Poisson distribution and beyond: methods for microfluidic droplet production and single cell encapsulation. Lab Chip. 2015; 15(17):3439–59.
 8
Gaublomme JT, Li B, McCabe C, Knecht A, Yang Y, Drokhlyansky E, Van Wittenberghe N, Waldman J, Dionne D, Nguyen L, et al.Nuclei multiplexing with barcoded antibodies for singlenucleus genomics. Nat Commun. 2019; 10(1):1–8.
 9
Haghverdi L, Buettner M, Wolf FA, Buettner F, Theis FJ. Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods. 2016; 13(10):845.
 10
Ilicic T, Kim JK, Kolodziejczyk AA, Bagger FO, McCarthy DJ, Marioni JC, Teichmann SA. Classification of low quality cells from singlecell RNAseq data. Genome Biol. 2016; 17(1):2.
 11
Jin X, Han J. KMedoids Clustering In: Sammut C, Webb Geoffrey I, editors. Encycl Mach Learn Data Min. US Boston, MA: Springer: 2010. p. 564–565. isbn="9780387301648". https://doi.org/10.1007/9780387301648_426.
 12
Kang HM, Subramaniam M, Targ S, Nguyen M, Maliskova L, McCarthy E, Wan E, Wong S, Byrnes L, Lanata CM, et al.Multiplexed droplet singlecell RNAsequencing using natural genetic variation. Nat Biotechnol. 2018; 36(1):89.
 13
Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW. Droplet barcoding for singlecell transcriptomics applied to embryonic stem cells. Cell. 2015; 161(5):1187–201.
 14
Kuipers J, Jahn K, Raphael BJ, Beerenwinkel N. Singlecell sequencing data reveal widespread recurrence and loss of mutational hits in the life histories of tumors. Genome Res. 2017; 27(11):1885–1894.
 15
Kumar P, Tan Y, Cahan P. Understanding development and stem cells using single cellbased analyses of gene expression. Development. 2017; 144(1):17–32.
 16
Maaten Lvd, Hinton G. Visualizing data using tSNE. J Mach Learn Res. 2008; 9(Nov):2579–605.
 17
Macaulay IC, Ponting CP, Voet T. Singlecell multiomics: multiple measurements from single cells. Trends Genet. 2017; 33(2):155–68.
 18
Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al.Highly parallel genomewide expression profiling of individual cells using nanoliter droplets. Cell. 2015; 161(5):1202–14.
 19
Maecker HT, McCoy JP, Nussenblatt R. Standardizing immunophenotyping for the human immunology project. Nat Rev Immunol. 2012; 12(3):191.
 20
Magella B, Adam M, Potter AS, Venkatasubramanian M, Chetal K, Hay SB, Salomonis N, Potter SS. Crossplatform single cell analysis of kidney development shows stromal cells express Gdnf. Dev Biol. 2018; 434(1):36–47.
 21
McGinnis CS, Patterson DM, Winkler J, Hein MY, Srivastava V, Conrad DN, Murrow LM, Weissman JS, Werb Z, Chow ED, et al.Multiseq: scalable sample multiplexing for singlecell rna sequencing using lipidtagged indices. Nature Methods. 2019; 16(7):387241. Nature Publishing Group.
 22
McGinnis CS, Murrow LM, Gartner ZJ. Doubletfinder: doublet detection in singlecell RNA sequencing data using artificial nearest neighbors. Cell Syst. 2019; 8(4):329–337. Elsevier.
 23
McGinnis CS, Patterson DM, Winkler J, Conrad DN, Hein MY, Srivastava V, Hu JL, Murrow LM, Weissman JS, Werb Z, et al.Multiseq: sample multiplexing for singlecell rna sequencing using lipidtagged indices. Nat Methods. 2019; 16(7):619.
 24
Moignard V, Göttgens B. Dissecting stem cell differentiation using single cell expression profiling. Curr Opin Cell Biol. 2016; 43:78–86.
 25
Moon S, Ceyhan E, Gurkan UA, Demirci U. Statistical modeling of single target cell encapsulation. PloS One. 2011; 6(7):e21,580.
 26
Muraro MJ, Dharmadhikari G, Grün D, Groen N, Dielen T, Jansen E, van Gurp L, Engelse MA, Carlotti F, de Koning EJ, et al.A singlecell transcriptome atlas of the human pancreas. Cell Syst. 2016; 3(4):385–94.
 27
Nguyen A, Khoo WH, Moran I, Croucher PI, Phan TG. Single cell RNA sequencing of rare immune cell populations. Front Immunol. 2018; 9:1553.
 28
Novak SY. Extreme value methods with applications to finance. 2011. https://doi.org/10.1201/b11537.
 29
Olsson A, Venkatasubramanian M, Chaudhri VK, Aronow BJ, Salomonis N, Singh H, Grimes HL. Singlecell analysis of mixedlineage states leading to a binary cell fate choice. Nature. 2016; 537(7622):698.
 30
Parra RG, Papadopoulos N, AhumadaArranz L, El Kholtei J, Mottelson N, Horokhovskyi Y, Treutlein B, Soeding J. Reconstructing complex lineage trees from scRNAseq data using MERLoT. Nucleic Acids Research. 2019; 47(17):8961–8974. Oxford University Press.
 31
Popat SK, Emmanuel M. Review and comparative study of clustering techniques. Int J Comput Sci Inf Technol. 2014; 5(1):805–12.
 32
Poulin JF, Tasic B, HjerlingLeffler J, Trimarchi JM, Awatramani R. Disentangling neural cell diversity using singlecell transcriptomics. Nat Neurosci. 2016; 19(9):1131.
 33
Quandt D, Rothe K, Scholz R, Baerwald CW, Wagner U. Peripheral cd4cd8 double positive t cells with a distinct helper cytokine profile are increased in rheumatoid arthritis. PloS One. 2014; 9(3):e93,293.
 34
Reynolds D. Gaussian mixture models. Encycl Biom. 2015:827–32.
 35
Stoeckius M, Hafemeister C, Stephenson W, HouckLoomis B, Chattopadhyay PK, Swerdlow H, Satija R, Smibert P. Simultaneous epitope and transcriptome measurement in single cells. Nat Methods. 2017; 14(9):865.
 36
Stoeckius M, Zheng S, HouckLoomis B, Hao S, Yeung BZ, Mauck WM, Smibert P, Satija R. Cell hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol. 2018; 19(1):224. Accessed 5 January 2019.
 37
Stoeckius M, Zheng S, HouckLoomis B, Hao S, Yeung BZ, Mauck WM, Smibert P, Satija R. Cell hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Cell Hashing Scrnaseq Data. 2018. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108313.
 38
Sun Z, Chen L, Xin H, Jiang Y, Huang Q, Cillo AR, Tabib T, Kolls JK, Bruno TC, Lafyatis R, et al.A Bayesian mixture model for clustering dropletbased singlecell transcriptomic data from population studies. Nat Commun. 2019; 10(1):1649.
 39
Trapnell C. Defining cell types and states with singlecell genomics. Genome Res. 2015; 25(10):1491–8.
 40
Tsoucas D, Yuan GC. Giniclust2: a clusteraware, weighted ensemble clustering method for celltype detection. Genome Biol. 2018; 19(1):58.
 41
Villani AC, Satija R, Reynolds G, Sarkizova S, Shekhar K, Fletcher J, Griesbeck M, Butler A, Zheng S, Lazo S, et al.Singlecell RNAseq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science. 2017; 356(6335):eaah4573.
 42
Wattenberg M, Viégas F, Johnson I. How to use tSNE effectively. Distill. 2016; 1(10):e2.
 43
Wersto RP, Chrest FJ, Leary JF, Morris C, StetlerStevenson M, Gabrielson E. Doublet discrimination in DNA cellcycle analysis. Cytom J Int Soc Anal Cytol. 2001; 46(5):296–306.
 44
Wolock SL, Lopez R, Klein AM. Scrublet: computational identification of cell doublets in singlecell transcriptomic data. Cell Syst. 2019.
 45
Xin H, Lian Q, Jiang Y, Luo J, Wang X, Erb C, Xu Z, Zhang X, HeidrichO’Hare E, Yan Q, Duerr R, Chen K, Chen W. GMMDemux: sample demultiplexing, multiplet detection, experiment planning and novel cell type verification in single cell sequencing. 2020. https://github.com/CHPGenetics/GMMdemux. Accessed 1 July 2020.
 46
Xin H, Lian Q, Jiang Y, Luo J, Wang X, Erb C, Xu Z, Zhang X, HeidrichO’Hare E, Yan Q, Duerr R, Chen K, Chen W. GMMDemux: sample demultiplexing, multiplet detection, experiment planning and novel cell type verification in single cell sequencing. 2020. https://doi.org/10.5281/zenodo.3929654.
 47
Xin H, Lian Q, Jiang Y, Luo J, Wang X, Erb C, Xu Z, Zhang X, HeidrichO’Hare E, Yan Q, Duerr R, Chen K, Chen W. GMMDemux: sample demultiplexing, multiplet detection, experiment planning and novel cell type verification in single cell sequencing. CITEseq, scRNAseq and cell hashing data. 2020. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152981. Accessed 1 July 2020.
 48
Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al.Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017; 8:14049.
 49
Zunder ER, Finck R, Behbehani GK, Elad DA, Krishnaswamy S, Gonzalez VD, Lorang CG, Bjornson Z, Spitzer MH, Bodenmiller B, et al.Palladiumbased mass tag cell barcoding with a doubletfiltering scheme and singlecell deconvolution algorithm. Nat Protocol. 2015; 10(2):316.
Acknowledgements
We wish to thank the UPMC Children’s Hospital of Pittsburgh for their support. We also appreciate the useful discussion and suggestion from Dr. Anuradha Ray and Dr. Prabir Ray.
Review history
The review history is available as Additional file 3.
Funding
This work is supported by the National Institute of Health grants R01HL137709 (W.C. and K.C.), U01DK062420 (W.C. and R.D.), P01AI106684 (W.C.), and UL1TR001857 (W.C. and X.W.).
Author information
Affiliations
Contributions
H.X. and Q.L. developed the theoretical formalism of GMMDemux. Q.Y. and Q.L. verified the theoretical soundness of GMMDemux. H.X. engineered the program and performed the experiments. Y.J. benchmarked Seurat. X.W. helped preparing the figures. R.D., E.O., X.Z., Z.X., J.L., C.E., and K.C. generated the inhouse scRNAseq datasets. H.X, R.D, Q.Y., Q.L., K.C., and W.C. contributed to the writing of the manuscript. W.C. and K.C. supervised the project. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Supplementary figures
Additional file 2
Supplementary tables
Additional file 3
Review history
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Xin, H., Lian, Q., Jiang, Y. et al. GMMDemux: sample demultiplexing, multiplet detection, experiment planning, and novel celltype verification in single cell sequencing. Genome Biol 21, 188 (2020). https://doi.org/10.1186/s13059020020842
Received:
Accepted:
Published:
Keywords
 Single cell RNA
 Multiplets
 Rare cell type
 Phony cell type
 Demultiplex
 Sample barcoding