CB2 improves power of cell detection in droplet-based single-cell RNA sequencing data

An important challenge in pre-processing data from droplet-based single-cell RNA sequencing protocols is distinguishing barcodes associated with real cells from those binding background reads. Existing methods test barcodes individually and consequently do not leverage the strong cell-to-cell correlation present in most datasets. To improve cell detection, we introduce CB2, a cluster-based approach for distinguishing real cells from background barcodes. As demonstrated in simulated and case study datasets, CB2 has increased power for identifying real cells which allows for the identification of novel subpopulations and improves the precision of downstream analyses.

scRNA-seq data is distinguishing those barcodes corresponding to real cells from those binding ambient, or background, RNA.
Early methods to address this challenge defined real cells as those barcodes with total read counts exceeding some threshold [1,2]. Such methods are suboptimal as they discard small cells as well as those expressing relatively few genes. To address this, Lun et al. [4] developed EmptyDrops (ED), an approach to identify individual barcodes with distributions varying from a background distribution. Similar to previous approaches [1,2], ED identifies an upper threshold and defines real cells as those barcodes with counts above the threshold. As a second step, ED uses all barcodes with counts below a lower threshold to estimate a background distribution of ambient RNA against which remaining barcodes are tested. Those having expression profiles significantly different from the background distribution are deemed real cells. The ED approach is current state-of-the-art in the field. However, given that ED performs tests for each barcode individually, it does not leverage the strong correlation observed between cells and, consequently, compromises power for identifying cells in many datasets.
To increase the power for identifying real cells, we propose CB2, a cluster-based approach for distinguishing real cells from background barcodes in droplet-based scRNAseq experiments. CB2 extends the ED framework by introducing a clustering step that groups similar barcodes, then conducts a statistical test to identify groups with expression distributions that vary from the background (Fig. 1, Additional file 1: Figure S2). CB2 is implemented in the R package scCB2.

Results
CB2 was evaluated and compared with ED on simulated and case study data. In SIM IA, counts are generated as in Lun et al. [4]. Briefly, given an input dataset, an inflection point dividing low from high-count barcodes is determined. Low count barcodes are pooled to estimate the background distribution. Background barcodes are sampled from this distribution to match the total number and size of barcodes below the inflection point in the input dataset. Six thousand real cells are then generated as follows. First, 2000 barcodes are randomly sampled from the high-count barcodes (referred to as G 1 cells [4]); a second set of 2000 high-count barcodes is sampled and then downsampled by 90% to give G 2 cells; the third set (G 1.5 ) is obtained by sampling 2000 barcodes from the high-count range and downsampling by 50%. We note that in Lun et al. [4], only G 1 and G 2 cells were considered. Here, G 1.5 cells were added to better reflect real data. Additional file 1: Figure S3 shows increased power of CB2 with well controlled false discovery rate (FDR) for the 6 datasets considered in Lun et al. [4] as well as 4 additional datasets. SIM IB, also considered by Lun et al. [4], is similar to SIM IA, but in SIM IB 10% of the genes in the real cells are shuffled making the real cells more different from the background and therefore easier to identify (Additional file 1: Figure  S4). Additional file 1: Figure S5 shows the increased power of CB2 is maintained.
To further evaluate CB2, we applied CB2 and ED to the ten case study datasets used to generate the simulated data as well as one additional dataset considered in the ED case study and compared the number of cells identified in common as well as those uniquely identified by each approach. Additional file 2: Table S1 shows that CB2 finds 24% more cells on average (range 4-81%). Of the extra cells identified, 88% on average (range 44-100%) add to existing subpopulations. The remaining 12% (range 0-56%) make up novel subpopulations.
As an example, Fig. 2 and Additional file 1: Figure S6 show results from the Alzheimer data [5] where CB2 identifies 18% more cells. A detailed look at the unique CB2 identifications suggests that the extra cells identified are not false positives, but rather they add to existing excitatory neuron and inhibitory neuron subpopulations, and also reveal a novel subpopulation consisting of 209 cells. Specifically, Fig. 2 b and c show distribution plots and an expression heatmap of the 100 genes having the highest average expression in Subpop1 (the largest subpopulation) for cells identified by both CB2 and ED as well as those identified uniquely by CB2. As shown, cells uniquely identified by CB2 have a distribution similar to other cells, and they differ from the background. Using the marker genes from Mathys et al. [5], Fig. 2d and Additional file 1: Figure  S6(b) suggest that cells identified uniquely by CB2 in Subpops 1-4 are neurons, as they show relatively high expression of neuron marker genes SYT1, SNAP25, and GRIN1. More specifically, the CB2 cells in Subpops 1-2 exhibit high expression of excitatory neuronal markers whereas the cells in Subpops 3-4 appear to be inhibitory neurons (Additional file 1: Figure S6(c) and (d)). The novel subpopulation (Subpop5) uniquely shows high expression of both oligodendrocyte and astrocyte marker genes, suggesting Fig. 1 Overview of CB2. a Projection of a hypothetical cell population containing three subpopulations (red, green, and blue where intensity corresponds to read depth). CB2 takes as input a gene by barcode matrix of UMI counts and returns a gene by cell matrix. b High-count barcodes with counts above a pre-specified upper threshold are considered real cells; barcodes with counts below a lower threshold are used to estimate a background distribution (Additional file 1: Figure S2). The remaining barcodes are clustered, and tight clusters are tested as a group against the estimated background distribution; barcodes not in tight clusters are tested individually (not shown). High-count barcodes and those identified by CB2 are retained for downstream analysis that this group may be mixed phenotype glial cells [6] (Additional file 1: Figure S6(e) and (f)).
By increasing the number of real cells identified, CB2 also improves the power to differentiate Alzheimer's patients from controls. Specifically, Mathys et al. [5] profiled expression from the prefrontal cortex of 24 AD-pathology patients as well as 24 agematched controls, and they validated differentially expressed genes in different cell types, including 9 genes in excitatory neurons and 9 in inhibitory neurons. Additional file 1: Figure S7 shows that by identifying additional cells, CB2 improves downstream differential expression analysis by resulting in more significant p values and stronger fold changes.
In a second case study (PBMC8K), CB2 increases the number of cells identified across six subpopulations by over 80% (Additional file 2: Table S1). Results are shown in Fig. 3 and Additional file 1: Figure S8. Similar to the Alzheimer's data analysis, Additional file 1: Figure S8(b) and (c) show that cells identified uniquely by CB2 in Subpop1 have an expression profile that is similar to other cells and differs from the background. Figure 3 provides a detailed look at marker gene expression for the Fig. 2 Results from the Alzheimer dataset. a t-SNE plot of cells identified by CB2 and ED. High-count barcodes exceeding an upper threshold are identified as real cells by both methods without a statistical test (dark pink); barcodes identified as cells by both methods following statistical test are shown in pink. Cells identified uniquely by CB2 (yellow) and ED (black) are also shown. CB2 identifies an increased number of cells in existing subpopulations (Subpop1-Subpop4) and also identifies a novel subpopulation (Subpop5). b Distribution plots of the 100 genes having highest average expression in Subpop1 are shown for cells identified by both CB2 and ED (upper) and identified uniquely by CB2 (middle). The estimated background distribution is also shown (lower). Cells uniquely identified by CB2 in Subpop1 have a distribution similar to other Subpop1 cells and differ from the background. c Heatmap of log transformed raw UMI counts for the same 100 genes for barcodes identified by CB2 and ED (left) and barcodes uniquely identified by CB2 (right). d t-SNE plots of cells colored by neuron marker genes SYT1, SNAP25, and GRIN1 in all cells (upper) and those identified uniquely by CB2 (lower) well-characterized PBMC8K cells using markers considered in Zheng et al. [2]. As shown in Fig. 3b, CB2 identifies additional CD14+ monocytes, T cells, B cells, and megakaryocytes. Results from two additional datasets are shown in Additional file 1: Figure S9-S10.

Discussion
Taken together, the results presented here demonstrate that CB2 provides a powerful approach for distinguishing real cells from background barcodes which will increase the number of cells identified in existing cell subpopulations in most datasets and may facilitate the identification of novel subpopulations. While advantages are expected in many settings, users will benefit from the following considerations. CB2 does not test for doublets or multiplets, and consequently, some of the high-count identifications may consist of two or more cells. Methods for identifying multiplets such as Scrublet [7], DoubletDecon [8], or DoubletFinder [9] may prove useful after applying CB2. A second important post-processing step is filtering based on mitochondrial expression. As noted in Lun et al. [4], any method for distinguishing cells from background barcodes is technically correct in identifying low-quality cells given that damaged cells exhibit expression profiles that differ from the background. Specifically, mitochondrial gene expression is often high in damaged cells; an example is shown in Subpopulation 5 of the PBMC8K data (Fig. 3b). Such cells are typically not of interest in downstream analysis and should therefore be removed. The GetCellMat function in R/scCB2 may be used toward this end.

Conclusions
Droplet-based scRNA-seq technologies provide unprecedented opportunity to address biological questions, but efficient pre-processing is required to maximize the information obtained in an experiment. CB2 allows investigators to maximize the number of cells retained and consequently to increase the power and precision of downstream analysis. Fig. 3 Results from the PBMC8K dataset. a t-SNE plot of cells identified by CB2 and ED. High-count barcodes exceeding an upper threshold are identified as real cells by both methods without a statistical test (dark pink); barcodes identified as cells by both methods following statistical test are shown in pink. Cells identified uniquely by CB2 (yellow) and ED (black) are also shown. CB2 increases the number of cells identified across the six subpopulations by over 80% (Additional file 2: Table S1). b Subpopulations 1-5 ordered by median normalized UMI count along with marker gene expression for each subpopulation. Marker gene expression in cells uniquely identified by CB2 is similar to that in other groups, and differs from the background. Subpopulation 5 contained no high-count common cells; subpopulation 6 contained no unique CB2 identifications and is therefore not shown in panel b

CB2
As CB2 relies on ED, we briefly review the ED approach before detailing the clustering test introduced in CB2. ED expects as input a G × B feature-by-barcode matrix with G features (for simplicity, we refer to features as genes) and B barcodes. Barcodes having zero counts for all genes are filtered out, and the remaining barcodes are divided into three groups based on the sum of gene expression (UMI) counts within a barcode. The background group, B 0 , contains all barcodes with counts less than or equal to a pre-defined lower threshold (defaults to 100); the high-count barcodes, B 2 , contain barcodes with counts exceeding an upper threshold (defaults to knee point); the remaining barcodes (B 1 ) are tested (Additional file 1: Figure S2). ED assumes that counts from a background barcode are distributed as Dirichlet-Multinomial with probability vector p B 0 estimated by averaging the counts in B 0 and applying the Good-Turing algorithm [16] to ensure that all probabilities are non-zero, denoted asp B 0 . For a barcode b ∈ B 1 , ED tests p b ¼ p B 0 against the alternative p b ≠p B 0 using the log-likelihood underp B 0 as the test statistic. A Monte-Carlo p value is calculated via simulating Dirichlet-Multinomial barcodes of size |b| underp B 0 and calculating the proportion of simulated barcodes having a test statistic more extreme than (or equal to) b's. The false discovery rate is controlled using the Benjamini-Hochberg procedure [17].
CB2 follows ED by filtering out genes with zero counts and dividing the remaining barcodes into three groups. However, instead of testing all barcodes from B 1 individually, CB2 first clusters barcodes and then tests tight clusters to identify those that differ from the background. As in methods for genome-wide association studies (Mieth et al. 2016 [18]), gene co-expression network analysis (Botía et al., 2017 [19]), and de novo transcriptome analysis (Malik et al., 2018 [20]), clustering prior to testing increases power by reducing the total number of tests and increasing the signal to noise ratio. CB2 proceeds as follows: 1. Barcodes grouped by size. CB2 orders barcodes in B 1 by total counts where X b denotes the count vector of barcode b, |X b | denotes the total UMI count of barcode b, and |B 1 | denotes the number of barcodes in B 1 . Groups of size S (defaults to 1000 in R/scCB2) are constructed consisting of barcodes ranging in size from smallest to largest: is rounded up if not an integer. If jB 1K j < S 2 , barcodes in B 1K are merged with those in B 1(K − 1) . Sorting barcodes by size reduces bias in the clustering and testing steps that follow.
2. Barcodes clustered within group: Barcodes within each group B 1j are clustered using hierarchical clustering with pairwise Pearson correlation as the similarity metric. A cluster is considered tight if the average within-cluster pairwise Pearson correlation exceeds a data-driven threshold. Tight clusters are retained for further analysis as described in step 3, below. To determine thresholds, ten tight clusters of varying size are simulated by generating 100 samples from a multinomial distribution with parameters (N, p) where N ranges from 100 to 1000 in increments of size 100. This range is chosen as we found little variation in thresholds for barcode sizes exceeding 1000; p is set to eitherp B 0 orp B 2 , whichever has larger Shannon entropy [21] as the distribution with larger entropy is less affected by outliers. For each simulated cluster C, the threshold κ C is defined by its average pairwise Pearson correlation. A cluster is considered tight if the average within-cluster pairwise Pearson correlation exceeds κ C for the simulated cluster of closest size.
3. Tight clusters tested: For each tight cluster C, we conduct a Monte-Carlo test to assess dissimilarity from the background. Pairwise Pearson correlations are calculated between every barcode in C andp B 0 ; the test statistic for cluster C, T c , is defined to be the median of these correlations. Similar to ED, to simulate background barcodes, we sample barcodes X Ã 1 ; …; X Ã M from a multinomial (N;p B 0 ) where N is the size of the barcode giving T c . The Monte-Carlo p value is: where cor X Ã i ;0 is the Pearson correlation between X Ã i andp B 0 (M defaults to 1000 in R/scCB2). Monte-Carlo p values are calculated for each cluster followed by Benjamini-Hochberg [17] to control the FDR. All barcodes within a significant cluster are identified as real cells.

Individual barcodes tested: Barcodes that were not included in a tight cluster in
Step 2 as well as those in a tight cluster that were not found to be significant in Step 3 are tested individually using ED. It is important to note that some of the barcodes identified in this step do not overlap with identifications made when ED is applied to the full set of barcodes given differences in the rates of real cells to background barcodes and differences in error rate control.

Simulations
Counts are generated as in Lun et al. [4]. As detailed there, each simulation requires an input dataset. We constructed simulations from 10 datasets: Alzheimer [5], PBMC8K, PBMC33K, mbrain1K, mbrain9K, PanT4K, MALT, PBMC4K, jurkat, and T293 (Additional file 2: Table S2). For each input dataset, the inflection point of the UMI count by sorted barcode plot is used to divide lower count from higher count barcodes. The barcodes in the lower count range are considered background. In SIM IA, two sets of 2000 barcodes randomly sampled from the higher count range are considered real cells. The first set of 2000 is referred to as large ( G 1 ) cells; the second set is downsampled by 90% to give small (G 2 ) cells. We added a third set of medium (G 1.5 ) cells by sampling 2000 cells from the higher count range and downsampling by 50%. The process for simulating data in SIM IB is identical to SIM IA except that in SIM IB, 10% of the genes in each simulated real cell are shuffled making the real cells more different from the background barcodes and, consequently, making real cells easier to identify. SIM IA is a more realistic simulation (Additional file 1: Figure S4).

Case studies
We evaluated the 10 datasets used in the simulation and also the placenta data evaluated in Lun et al. [4]. These datasets vary in sequencing depth as well as in the extent of differences between the real cell and background distributions (Additional file 1: Figure S4). CB2 and ED were applied to each dataset using default settings. For plots that compare identifications between CB2 and ED, cells identified by either approach (or both) were combined and UMI counts were normalized via scran. The Seurat pipeline was used to generate t-SNE plots from the top 4000 most highly variable genes and top 50 principal components. Expression heatmaps show log transformed raw UMI counts. For heatmaps and distribution plots, mitochondrial and ribosomal genes were removed.

Differential expression analysis in Alzheimer data
Cells identified by CB2, ED, or both were combined into a single matrix and filtered similar to Mathys et al. [5]. Specifically, cells with mitochondrial gene expression making up 40% or more of the total UMI counts were removed; genes detected in fewer than two cells were also excluded giving a matrix of 28,208 genes and 74,579 barcodes. Normalization was performed using scran. Cell types were annotated using marker genes as in Mathys et al. [5] Differential expression (DE) tests between cells from Alzheimer's cases and controls were conducted using Wilcoxon rank-sum tests as in Mathys et al. [5]. Results were compared for known DE genes extracted from Mathys et al. [5].

Implementation of CB2 and ED
For all simulation and case study analyses, CB2 and ED were implemented using default parameters. A target FDR was set at 1%.

Existing subpopulations vs. novel subpopulations
The FindNeighbors and FindClusters functions in Seurat were used with default settings to assign each cell to a cluster, referred to here as a subpopulation. For each subpopulation, we calculated the percentage of cells identified by both CB2 and ED as well as those identified uniquely by CB2. Subpopulations for which over 80% of the cells are uniquely identified by CB2 are referred to as novel subpopulations (Additional file 2: Table S3 shows the number of novel subpopulations identified using 70%, 80%, or 90% as thresholds).