GMM-Demux: sample demultiplexing, multiplet detection, experiment planning, and novel cell-type verification in single cell sequencing

Identifying and removing multiplets are essential to improving the scalability and the reliability of single cell RNA sequencing (scRNA-seq). Multiplets create artificial cell types in the dataset. We propose a Gaussian mixture model-based multiplet identification method, GMM-Demux. GMM-Demux accurately identifies and removes multiplets through sample barcoding, including cell hashing and MULTI-seq. GMM-Demux uses a droplet formation model to authenticate putative cell types discovered from a scRNA-seq dataset. We generate two in-house cell-hashing datasets and compared GMM-Demux against three state-of-the-art sample barcoding classifiers. We show that GMM-Demux is stable and highly accurate and recognizes 9 multiplet-induced fake cell types in a PBMC dataset.


Modeling the General Droplet Formation Process
: Illustration of the co-flow process.
Outlined in the original droplet-seq paper [18], in droplet-based single cell library preparation, GEMs are formed by "co-flowing" two aqueous solutions across an oil channel to form nanoliter-sized droplets. One flow contains gel beads, microcapsules that contain barcoded microparticles suspended in a lysis buffer. The other flow contains the cell assay, which is a HTO-tagged cell suspension. Figure S1 illustrates the co-flow process. In our model, the two flows are driven by two separate pumps that operate in discrete pulses. Each pulse pumps unit volume solution into oil, forming an emulsion droplet. In our model, the two pumps does not operate in sync and their frequencies differ. Assuming the unit volume of each pulse from the cell assay pump is µ, and the total volume of the cell assay solution is V , then the total number of pulses from the cellassay pump throughout a sample barcoding library preparation can be calculated as X = V µ . Therefore, in the final solution, there are a total of X = V µ droplets that contain cell-assay solution.
Not all cell-assay droplets can be recovered through sequencing. A droplet is sequenced only if it also contains a gel bead, which is not always guaranteed. In our model, we assume r cap as a constant probability of a droplet getting a gel bead. Cell-assay droplets without gel beads are not harvested through sequencing.
An illustration of the workflow of sample barcoding is provided in Figure S2 2 Validating Multiplet Rate Equations We validate the multiplet rate equations by comparing the equation-derived multiplet rates against the multiplet rates obtained through simulation. Specifically, given [Y, M, X] (Y is the total number of cells, M is the number of samples and X is the total number of cell-assay droplets), we divide Y cells evenly into M samples and randomly partition each cell into a droplet. We do not include r cap in our simulation as r cap does not affect the multiplet rates. We collect all droplets from the simulation; count the numbers of singlets, SSMs and MSMs and computes the singlet, SSM and MSM rates. For each [Y, M, X] parameter set, we repeat the simulation 500, 1000 and 1500 times and average the singlet, SSM and MSM rates over all iterations. The comparison results are shown in Table S1, S2 and S3. As shown in the tables, as the number of iterations increases, the multiplet rates obtained through simulation asymptotically approaches the model-derived multiplet rates. This suggests that the equations for computing multiplet rates are correct.

Hypothesis Testings in Pure-Type GEM Verification
To classify a GEM cluster G, GMM-Demux tests two hypotheses: G is a pure-type GEM cluster; and G is a phony-type GEM cluster. In both hypothesis tests, GMM-Demux first computes the expected value of P (i ∈ MSM G |i ∈ G)-the probability of a GEM in G being a MSM, under the respective hypotheses. Then GMM-Demux computes the statistical significance of observing the true MSM population of G under the expected P (i ∈ MSM G |i ∈ G) derived from the previous step. GMM-Demux tests both hypotheses with the binomial test: let P 0 (i ∈ MSM G |i ∈ G) denote the expected value of P (i ∈ MSM G |i ∈ G); GMM-Demux assumes that each GEM i in G has the probability of P 0 (i ∈ MSM G |i ∈ G) to become a MSM, and computes the statistical significance (the p-value) of observing the true number of MSMs in G. The null hypotheses of the binomial test is rejected if the p-value of the test falls below a user defined alpha level (0.05 by default). According to the binomial test results of both hypotheses, G is classified as a pure-type GEM cluster, a phony-type GEM cluster or a mixture cluster accordingly. GMM-Demux derives P 0 (i ∈ MSM G |i ∈ G) from the GEM formation model. For simplicity, we assume cells are uniformly randomly distributed into M sample barcoding samples. Formally, the probability of a GEM i in G is a MSM can be formulated as the following: where c i denotes the number of cells in GEM i and c i = j states that there are j cells in i. If G is a phony-type cluster, since all GEMs are multiplets, P (c i = 1) = 0. For Let k denote the number of cell types in a GEM. If G is a k-cell-type phonytype cluster, where each GEM in G contains (the same) k (k > 1) cell types, In general, let y G represent the total cell population size of the k cell types of G in the cell assay. In a k-cell-type phony-type cluster, P (c i = j) = 0 for all j < k and is slightly different in more complex scenarios where the cell assay is not evenly distributed into the M sample barcoding samples. Assume cells are not uniformly randomly distributed into sample barcoding samples and let p x denote the probability of a cell being distributed to a sample x. Then the probability of a droplet i being a SSM of sample x, given that i already contains j cells. In other words, (p x ) k under the condition of X > y G >> 1 and k > 1.
With P 0 (i ∈ MSM G |c i = j) computed for k > 1, the null and alternative hypotheses of the phony-type hypothesis test are formulated as follows: the null hypothesis states that H phony . The null hypothesis is tested with the binomial test and is rejected if the p-value of the test is smaller than the alpha level. Rejecting the null hypothesis suggests that G is not a phony-type GEM cluster.
The phony-type hypothesis test can be further extended to infer the number of cell types (k) in G. For simplicity, again cells are assumed to be uniformly distributed into M sample barcoding samples, without loss of generality. For k = 2, For each P 0 , GMM-Demux performs the binomial test and rejects the null hypothesis when the p-value drops below the alpha value. Since P 0 (i ∈ MSM G |i ∈ G) increases as k increases, rejecting the null hypothesis suggests that G is not a phony-type GEM cluster of no-less-than k cell types (but could still be a phony-type GEM cluster with fewer cell types).
Rejecting H phony 0 does not make G a pure-type GEM cluster. G could also be a mixture cluster, which contains both pure-type GEMs and phony-type GEMs. To check if G is a pure-type GEM only cluster, GMM-Demux performs the pure-type hypothesis test H pure , which assumes that G includes only k = 1 cell type.
When k = 1, P 0 (i ∈ MSM G |i ∈ G) is much smaller than 1 − 1 M . Let τ denote the cell type of G and assume that G includes all pure-type GEMs of type τ . Let y τ denote the number of τ cells in the cell assay and z G denote the number of GEMs in G. According to the GEM formation model, the probability of i being a pure-type GEM of type τ can be computed as P Consequently, GMM-Demux generates an estimation of y τ by maximizing the likelihood of P (# GEMτ = z G ).
With y τ computed, the probability of i being a pure-type MSM of type τ , denoted as P (i ∈ MSM τ ), can be computed in a manner that is similar to Equation 10. Subsequently, we have P 0 (i ∈ MSM G |i ∈ G) = P (i∈MSMτ )·X z G . The null and alternative hypotheses of the pure-type hypothesis test are formulated as H pure 0 : P i∈M SMτ ≤ P (i ∈ MSM τ ) 0 and H pure a : P (i ∈ MSM τ ) > P (i ∈ MSM τ ) 0 , respectively. GMM-Demux tests the null hypothesis with the binomial test; computes the p-value; compares the p-value against the alpha level and rejects H pure 0 if the p-value is smaller than the alpha level. Rejecting H pure 0 suggests that G is not a pure-type GEM cluster. Based on the hypothesis testing results of both the pure-type and phony-type hypothesis tests, G is classified as a pure-type GEM cluster, a phony-type GEM cluster or a mixture cluster. Specifically, G is classified as a pure-type GEM cluster if H pure G has the potential to reveal a new cell type only if G is classified as a pure-type cluster. When classified as a phony-type cluster, all GEMs in G are recommended to be removed from the dataset, as all SSDs in G has a high probability to be SSMs. Finally, being classified as a mixture cluster suggests that G is not homogeneous and the clustering quality is low. Further refinement over G is recommended.

Example GMM-Demux Output
GMM-Demux generates three output files. It generates a classification result file, a classification header file, and a summary file. An example result and header file set is presented in Table S4 and S5. In the classification result file, each sample and each multi-sample combination is given a unique classification id. The classification id is matched with a description text in the header file. The summary file contains two tables, illustrated in Table S6 and S7. The first table presents a summary of the computed and inferred parameters of the dataset, including the number of cellassay droplets (X), the capture rate (r cap ), the singlet, MSM and SSM rates of the entire dataset. The second table contains a per-sample summary, which stores the estimated SSM rate of each sample.

GEM Barcodes
Classification

Classification Results of the PBMC-2 and the Memory-T Datasets
MSM classification results of the PBMC-2 and the Memory-T datasets are included in Figure S3. Figure S4 shows the GMM-Demux classification results where each sample combination is given a distinct color. Notice that when the number of classes is large (e.g., the PBMC-2 sample), the color difference between classes become less noticeable to a naked eye. Figure S5 shows the stability benchmarks of GMM-Demux, multi-seq, demuxEM and Seurat. The stability is measured by computing the percentage of GEMs that changes classifications between repetitive runs. As shown in the figure, Seurat displays low stability while others classifiers produce consistent classifications between runs.

Online Experiment Planner Illustration
A screenshot of the online experiment planner is presented in Figure S6. To use the online experiment planner, the user provides the anticipated number of cells in the cell assay (Y ), the number of HTO samples for sample barcoding (M ), the estimated number of cell-assay droplets (X) and the estimated capture rate r cap of the single cell library preparation equipment as inputs. The experiment planner outputs the estimated number of GEMs; the estimated singlet, MSM and SSM rates of GEM population; the estimated number of same-sample droplets (SSD) after removing MSMs using GMM-Demux; and the estimated RSSM rate of the remaining SSD population. Both X and r cap can be obtained from the SSM rate estimator by running GMM-Demux over an existing sample barcoding scRNA-seq dataset. We observe that X typically ranges between 65K-80K and r cap typically varies between 0.4-0.6. In practice, we recommend profiling the library preparation equipment by first conducting a pilot sample barcoding scRNA-seq experiment and The online experiment planner dynamically interacts with the user. It dynamically updates the outputs upon changes on input values. Given a fixed number of cells as well as a fixed equipment profile (a fixed number of cell-assay droplets and a fixed capture rate), by dragging the slider handle of the HTO-sample-number input (M ), the user can observe updates on multiplet rates in real time and find the minimum number of HTO samples required to meet a specific RSSM rate target.

Multiplet Rate Profiling Results
The profiles of the singlet, MSM, SSM rates, as well as the estimated number of harvested GEMs, under varying cell populations (Y ) and sample counts (M ) are shown in Figure S7. Figure S7 is generated under the setting of 80K cell-assay droplets (X = 80K) and 0.5 capture rate (r cap = 0.5). The above setting is based on the observation from Table 5.
As we have previously discussed, the singlet rate of an experiment, as well as the number of harvested gel-bead-enclosing GEMs only depend on the cell population size Y and are unaffected by the sample barcoding sample size M . Figure S7a and S7b displays the singlet rates and the estimated number of harvested GEMs under varying cell population sizes (Y ). Between 1K to 40K, as more cells are loaded into the experiment, the singlet rate monotonically and super-linearly decreases, while the number of GEMs monotonically and sub-linearly increases.
The MSM and SSM rates, however, depend on both the cell population size Y and the sample number M . Under a fixed Y , a larger M transforms more SSMs into MSMs, although the margin quickly diminishes as M grows. On the other hand, under a fixed M , a larger Y leads to both greater MSM and greater SSM rates, in almost a linear fashion. The MSM and SSM rates under variate Y s and M s are shown in Figure S7c and S7d, respectively.
The singlet rates reported in Figure S7a are consistent with findings from previous works. As reported in Zheng et al [49], the observed singlet rates are 98.4% for loading 2K cells and 96.9% for 6K cells, under the assumption of a 50% capture rate. In our work, according to the online experiment planner, under X = 80K and r cap = 0.5, the estimated singlet rate of Y = 2K is 98.75% while the singlet rate of Y = 6K is 96.3%.
Finally, we profile the minimum number of samples required to achieve certain RSSM rate targets under various cell population sizes (Y ). Such profile helps in designing future sample barcoding experiments that minimize cost and complexity while meeting data quality targets. Figure S7f and S7e shows the profiling results of 1% and 5% RSSM rate targets, respectively, over various cell population sizes, again under the assumption of X = 80K and r cap = 0.5.
Singlet, MSM, SSM, RSSM rate profiles, as well as GEM count profiles under other settings ([X, r cap ] other than [80K, 0.5]) are provided in Figure S8 to S13. Together, they serve as guidance in designing future sample barcoding experiments.    Figure S4: GMM-Demux detailed classification results. Each sample and each multi-sample combination is given a unique color. As the number of classes grows, difference between colors becomes less noticeable. Figure S5: Stability comparison between sample barcoding classifiers on simulated datasets. Seurat is notably less stable than other classifiers. Figure S6: Screenshot of the online sample barcoding experiment planner. The experiment planner computes multiplet rates in real time as parameters are updated. (e) Sample number profile for 5% RSSM rate target.
(f) Sample number profile for 1% RSSM rate. Figure S7: Sample barcoding experiment profiles under the assumption of 80K cell-assay droplets (X = 80) and 0.5 capture rate (r cap = 50%). The singlet rate monotonically decreases as the cell population grows. Under a fixed cell population size, however, as the number of samples grows, more SSMs transit into MSMs. (e) Sample number profile for 5% RSSM rate target.
(f) Sample number profile for 1% RSSM rate. Figure S13: Sample barcoding profiles under the assumption of 120K cellassay droplets (X = 120K) and 0.5 capture rate (r cap = 50%). Figure S14 shows the confidence score distributions of the three cell hashing datasets presented in Table 2. As shown in the figure. Most droplets (> 95%) have confidence scores above 0.8. Figure S14: Confidence Score Distributions. Figure S15 displays the CLR-normalized per-sample HTO antibody count distributions in the PBMC-1 dataset, including both GEMs and empty droplets. If the background HTO antibody assumption of demuxEM is correct, where HTO antibodies come from either sample-staining antibodies or free-floating antibodies in the suspension, then we should observe only two Gaussian components in each distribution. However, as shown in all plots, there exist more than two Gaussian components. This proves that free-floating antibodies in the GEM suspension is not the only source of background HTO antibodies and the assumption of demuxEM is inconclusive.

Parameter Selections for DoubletFinder and Scrublet
We applied both DoubletFinder and scrublet to the PBMC-1 dataset with default parameters. For scrublet, we did not specify any parameters to the scrub doublet() function. For DoubletFinder, we used the following parameters: PCS = 1:10, pN = 0.25, pK = 0.01 (as suggested in the manuscript), and we assumed a doublet formation rate of 0.125 (we assumed that the DoubletFinder is agnostic to the sample-barcoding information and does not know the total multiplet rate). We used the Seurat clustering result to derive the pure-type multiplet rate and then calculated the phony-type multiplet rate accordingly, following the instructions included in the DoubletFinder manuscript.

Impact of Phony-Type GEMs in Downstream Analysis
We use scanpy [44] to benchmark the impact of phony-type GEMs in downstream scRNA-seq analysis. We compare the RNA-based clustering results of the PBMC-1 dataset with and without removing phony-type GEMs. Both clustering results are visualized in UMAP plots in Figure S16, with their respective surface-marker-based annotations on the side as references. As shown in Figure S16A, without removal of phony-type GEMs, the RNA clustering result is less in accordance with the surface marker annotation result ( Figure S16B). Cluster 6, in particular, has a MSM percentage of 51.57%. Further decomposition of cluster 6 ( Figure S16E) shows that it consists GEMs of multiple types, of which 19.8% are phony-type GEMs. After removal of phony-type GEMs, however, the RNA-based clustering result (Figure S16C) is a lot similar to the surface marker annotation result ( Figure S16D). The AMI scores of both RNA clustering results, measured by comparing against the surface marker annotation, with and without removal of phony-type GEMs, are 0.72 and 0.53, respectively. Note that we only measure the AMI scores against the pure-type GEMs. We remove all phony type GEMs prior to computing the AMI scores in both cases. A higher AMI score indicates that removing phony type GEMs creates better clusters of pure-type GEMs that are more in accordance with their surface marker annotations. Overall, we conclude that phony-type GEMs can confound downstream scRNA-seq analysis and their subsequent removal is critical to the integrity of downstream scRNA-seq analysis.

MSM Estimation Result for PBMC-2
The comparison between observed and model-derived MSM counts for the PBMC-2 dataset is summarized in Table 8 (provided separately). Similar to the PBMC-1 and the CD4 + Memory T datasets, the observed counts and model-derived counts are highly consistent for each and every SSD and MSM category. Figure S16: Impact of phony-type GEMs in downstream RNA analysis. Figure A illustrates the clustering result of the RNA data of the PBMC-1 dataset. Included in the parenthesis are MSM percentages of the cluster. Figure B shows the cell type annotations of each GEM according to their surface marker profiles. Figure C shows the clustering result of the RNA data of just the pure-type GEMs. Figure D depicts the cell type annotation of each GEM in Figure C, using the surface marker data. Figure E shows the composition of cluster 6 in Figure A, according to the surface marker data.