Hashtag-enabled demultiplexing based on ubiquitous surface protein expression
We sought to extend antibody-based multiplexing strategies [16, 17] to scRNA-seq using a modification of our CITE-seq method [18]. We initially chose a set of monoclonal antibodies directed against ubiquitously and highly expressed immune surface markers (CD45, CD98, CD44, and CD11a), combined these antibodies into eight identical pools (pool A through H), and subsequently conjugated each pool to a distinct Hashtag oligonucleotide (henceforth referred to as HTO, Fig. 1a; “Methods” section). The HTOs contain a unique 12-bp barcode that can be sequenced alongside the cellular transcriptome, with only minor modifications to standard scRNA-seq protocols. We utilized an improved and simplified conjugation chemistry compared to our previous approach [18], by using iEDDA click chemistry to covalently attach oligonucleotides to antibodies [19] (“Methods” section).
We designed our strategy to enable CITE-seq and Cell Hashing to be performed simultaneously, but to generate separate sequencing libraries. Specifically, the HTOs contain a different amplification handle than our standard CITE-seq antibody-derived tags (ADT) (“Methods” section). This allows HTOs, ADTs, and scRNA-seq libraries to be independently amplified and pooled at desired quantities. Notably, we have previously observed robust recovery of antibody signals from highly expressed epitopes due to their extremely high copy number. This is in contrast to the extensive “dropout” levels observed for scRNA-seq data and suggests that we can faithfully recover HTOs from each single cell, enabling assignment to sample of origin with high fidelity.
To benchmark our strategy and demonstrate its utility, we obtained peripheral blood mononuclear cells (PBMCs) from eight separate human donors (referred to as donors A through H) and independently stained each sample with one of our HTO-conjugated antibody pools, while simultaneously performing a titration experiment with a pool of seven immunophenotypic markers (“Methods” section) for CITE-seq. We subsequently pooled all cells together in equal proportion, alongside an equal number of unstained HEK293T cells (and 3% mouse NIH-3T3 cells) as negative controls, and ran the pool in a single lane on the 10x Genomics Chromium Single Cell 3′ v2 system. Following the approach in Kang et al. [13], we “super-loaded” the 10x Genomics instrument, loading cells at a significantly higher concentration with an expected yield of 20,000 single cells and 5000 multiplets. Based on Poisson statistics, 4365 multiplets should represent cell combinations from distinct samples and can potentially be discarded, leading to an unresolved multiplet rate of 3.1%. Notably, achieving a similar multiplet rate without multiplexing would yield ~ 4000 singlets. As the cost of commercial droplet-based systems is fixed per run for sample preparation, multiplexing therefore allows for the profiling of ~ 400% more cells for the same cost.
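The loading arithmetic above can be approximated with a short calculation. This is a minimal sketch, assuming equal-sized samples and treating every multiplet as a random pair of cells; the function name and the choice of denominator for the unresolved rate are illustrative assumptions, not the exact Poisson calculation used for the reported figures. Under these simplifications, the result (4375 detectable multiplets, roughly 3% unresolved) is close to the reported 4365 and 3.1%.

```python
def multiplexing_yield(n_samples, singlets, multiplets):
    """Approximate how many multiplets are detectable via sample barcodes.

    Assumes equal-sized samples and random pairing, so a doublet joins two
    cells from the same sample with probability 1 / n_samples.
    """
    same_sample_fraction = 1 / n_samples
    detectable = multiplets * (1 - same_sample_fraction)
    unresolved = multiplets - detectable
    # Assumption: rate is taken over barcodes kept after discarding
    # detectable multiplets (singlets plus unresolved multiplets).
    unresolved_rate = unresolved / (singlets + unresolved)
    return detectable, unresolved, unresolved_rate


detectable, unresolved, rate = multiplexing_yield(8, 20_000, 5_000)

# Relative gain over a non-multiplexed run yielding ~4000 singlets
gain = 20_000 / 4_000 - 1

print(f"detectable multiplets: {detectable:.0f}")
print(f"unresolved multiplet rate: {rate:.1%}")
print(f"additional cells profiled: {gain:.0%}")
```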
We performed partitioning and reverse transcription according to the standard protocols, utilizing only a slightly modified downstream amplification strategy (“Methods” section) to generate transcriptome, HTO, and ADT libraries. We pooled and sequenced these on an Illumina HiSeq2500 (two rapid run flowcells), aiming for a 90%:5%:5% contribution of the three libraries in the sequencing data. Additionally, we performed genotyping of all eight PBMC samples and HEK293T cells with the Illumina Infinium CoreExome array, allowing us to utilize both HTOs and sample genotypes (assessed by demuxlet [13]) as independent demultiplexing approaches.
When examining pairwise expression of two HTO counts, we observed relationships akin to “species-mixing” plots (Fig. 1b), suggesting mutual exclusivity of HTO signal between singlets. Extending beyond pairwise analysis, we developed a statistical model to classify each barcode as “positive” or “negative” for each HTO (“Methods” section). Briefly, we modeled the “background” signal for each HTO independently as a negative binomial distribution, estimating background cells based on the results of an initial k-medoids clustering of all HTO reads (“Methods” section). Barcodes with HTO signals above the 99% quantile for this distribution were labeled as “positive,” and barcodes that were “positive” for more than one HTO were labeled as multiplets. We classified all barcodes where we detected at least 200 RNA UMI, regardless of HTO signal.
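The thresholding logic described above can be sketched as follows. This is an illustrative simplification, not the implementation described in the “Methods” section: the background counts for each HTO are assumed to be supplied directly (rather than derived from a k-medoids clustering), the negative binomial is fitted by the method of moments, and the function names `nb_quantile` and `classify` are hypothetical.

```python
from statistics import mean, pvariance


def nb_quantile(background, q=0.99):
    """Smallest count k at which the fitted negative binomial CDF reaches q."""
    m = mean(background)
    if m == 0:
        return 0
    # A negative binomial requires variance > mean; guard against underdispersion.
    v = max(pvariance(background), m * 1.01)
    p = m / v                # success probability
    r = m * m / (v - m)      # size (dispersion) parameter
    pmf = p ** r             # P(X = 0)
    cdf, k = pmf, 0
    while cdf < q:
        # Recurrence: P(k + 1) = P(k) * (k + r) / (k + 1) * (1 - p)
        pmf *= (k + r) / (k + 1) * (1 - p)
        k += 1
        cdf += pmf
    return k


def classify(counts, thresholds):
    """Label a barcode from its per-HTO counts and per-HTO thresholds."""
    positives = sorted(h for h, c in counts.items() if c > thresholds[h])
    if not positives:
        return "negative"
    if len(positives) == 1:
        return "singlet:" + positives[0]
    return "multiplet:" + "+".join(positives)


# Toy background counts per HTO (made-up numbers for illustration only)
background = {
    "A": [0, 1, 2, 3, 5, 1, 0, 2, 8, 3],
    "B": [1, 0, 2, 4, 1, 6, 0, 3, 2, 7],
}
thresholds = {h: nb_quantile(c) for h, c in background.items()}

print(classify({"A": 500, "B": 2}, thresholds))
print(classify({"A": 400, "B": 350}, thresholds))
print(classify({"A": 1, "B": 0}, thresholds))
```

Fitting each HTO independently means that a barcode is called a multiplet only when it exceeds the background threshold of at least two separate HTOs.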
Our classifications (visualized as a heatmap in Fig. 1c) suggested a clear identification of 8 singlet populations, as well as multiplet groups. We also identified barcodes with negligible background signal for any of the HTOs (labeled as “negatives”), consisting primarily (86.5%) of HEK293T and mouse cells. We removed all HEK293T and mouse cells from downstream analyses (“Methods” section), with the remaining barcodes representing 14,002 singlets and 2974 identifiable multiplets, in line with expectations. Our classifications were also fully concordant with a tSNE embedding, calculated using only the 8 HTO signals, which enabled the clear visualization not only of the 8 groups of singlets (donors A through H) but also the 28 small groups representing all possible doublet combinations (Fig. 1d). Moreover, we observed a clear positive shift in the distribution of RNA UMI per barcode for multiplets, as expected (Fig. 1e), while the remaining negative barcodes expressed fewer UMIs and may represent failed reactions or “empty” droplets containing only ambient RNA. These results strongly suggest that HTOs successfully assigned each barcode into its original sample and enabled robust detection of cross-sample multiplets. The large dynamic range of RNA UMI per cell barcode in multiplets (Fig. 1e) illustrates the difficulty of unambiguous multiplet assignment based on higher UMI counts, and we observe the same challenges with total HTO signal (Additional file 1: Figure S1A). Performing transcriptomic clustering of the classified singlets enabled clear detection of seven hematopoietic subpopulations, which were interspersed across all 8 donors (Fig. 1f).
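The count of 28 doublet groups follows directly from the number of unordered donor pairs. A minimal check, using the donor labels A through H from the text:

```python
from itertools import combinations

donors = "ABCDEFGH"

# Every unordered pair of distinct donors corresponds to one cross-sample doublet group
doublet_pairs = list(combinations(donors, 2))

print(len(doublet_pairs))  # 28
```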
Genotype-based demultiplexing validates Cell Hashing
We next compared our HTO-based classifications to those obtained by demuxlet [13]. Overall, we observed a strong concordance between the techniques, even when considering the precise sample mixture in called doublets (Fig. 2a). Exploring the areas of disagreement, we identified 871 barcodes that were classified based on HTO levels as singlets but were identified as “ambiguous” by demuxlet. Notably, the strength of HTO classification for these discordant barcodes (represented by the number of reads assigned to the most highly expressed HTO) was identical to that of the barcodes that were classified as singlets by both approaches (Fig. 2b). However, discordant barcodes did have reduced RNA UMI counts (Fig. 2c). We conclude that these barcodes likely could not be genetically classified at our relatively shallow sequencing depth (~ 24,115 reads per cell), which is below the recommended depth for using demuxlet, but likely represent true single cells based on our HTO classifications.
In addition, we observed 2528 barcodes that received discordant singlet/doublet classifications between the two techniques (Fig. 2d). We note that this reflects a minority of barcodes (compared to 13,421 concordant classifications) and that in these discordant cases, it is difficult to be certain which of these methods is correct. However, when we examined the UMI distributions of each classification group, we observed that only barcodes classified as doublets by both techniques exhibited a positive shift in transcriptomic complexity (Fig. 2d). This suggests that these discordant calls are largely made up of true singlets and represent conservative false positives from both methods, perhaps due to ambient RNA or HTO signal. Consistent with this interpretation, when we restricted our analysis to cases where demuxlet called barcodes as doublets with > 95% probability, we observed a 75% drop in the number of discordant calls (Fig. 2e). Demuxlet requires sufficient numbers of reads and SNPs to unequivocally classify a cell to a donor, and as expected, discordantly classified cells had lower numbers of sequencing reads and SNPs (Additional file 1: Figure S2A-D).
Finally, we observed a small number of cases where both Cell Hashing and demuxlet classified cells as singlets but assigned them to different donors (216/11,464; 1.9%). To investigate further, we took advantage of the fact that all donors (A–G) except one (H) were also stained with CITE-seq antibodies, and therefore, donor H cells should not contain ADT reads. However, in 40 instances where demuxlet, but not Cell Hashing, classified cells as donor H, we observed robust (> 1000) ADT counts in 37 cases, suggesting that these discordant calls are misclassification errors from demuxlet (Additional file 1: Figure S2E), in line with demuxlet’s estimated error rate of 1–2% [13].
To further ensure that background binding levels did not lead to incorrectly demultiplexed samples, we performed a separate experiment where we mixed four cell lines (HEK293T, THP1, K562, and KG1) together, each independently labeled with three distinct Cell Hashing oligos. After demultiplexing, to assign each barcode to a cell line of origin, we clustered cells on the basis of their RNA expression levels, obtaining four transcriptomic clusters (as expected). Comparing our transcriptomic clusters with the demultiplexing results, we observed nearly perfect concordance (99.7%), demonstrating a low rate of misassignment for this experiment (Additional file 1: Figure S3A, B).
Finally, we attempted to estimate the false-negative rates for Cell Hashing, representing true single cells that do not receive sufficient Cell Hashing signal to be classified as singlets. To do this, we examined all HTO-classified “singlet” and “negative” barcodes from the PBMC experiment and performed clustering based on transcriptome data. As expected, we found that “negative” cells predominantly formed a distinct cluster from singlets. However, we did observe 117 barcodes originally classified as negatives, but whose transcriptomic profiles clustered across PBMC singlet subtypes. These barcodes likely represent single cells that were incorrectly classified from Cell Hashing, representing a false-negative rate of 0.9% (Additional file 1: Figure S4), but have negligible effects on cell type proportion estimates. Taken together, our results validate that Cell Hashing enables robust and accurate sample classification across diverse systems.
Cell Hashing enables the efficient optimization of CITE-seq antibody panels
Our multiplexing strategy not only enables pooling across donors but also the simultaneous profiling of multiple experimental conditions. This is widely applicable to the simultaneous profiling of diverse environmental and genetic perturbations, but we reasoned that we could also efficiently optimize experimental workflows, such as the titration of antibody concentrations for CITE-seq experiments. In flow cytometry, antibodies are typically run individually over a large dilution series to assess signal-to-noise ratios and identify optimal concentrations [20]. While such experiments would be extremely cost prohibitive if run as individual 10x Genomics lanes, we reasoned that we could multiplex these experiments together using Cell Hashing.
We therefore incubated the PBMCs from different donors with a dilution series of antibody concentrations ranging over three orders of magnitude (“Methods” section). Concentrations of CITE-seq antibodies were staggered between the different samples to keep the total amount of antibody and oligo consistent in each sample. After sample demultiplexing, we examined ADT distributions across all concentrations for each antibody (examples in Fig. 3a–c) and assessed signal-to-noise ratio by calculating a staining index similar to commonly used metrics for flow cytometry optimization (Fig. 3d) (“Methods” section).
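For illustration, one common flow cytometry formulation of the staining index divides the separation between positive and negative population medians by twice the standard deviation of the negative population. The sketch below assumes this formulation and that ADT counts have already been split into positive and negative populations; the exact definition used for Fig. 3d is given in the “Methods” section and may differ. The function name and the toy counts are hypothetical.

```python
from statistics import median, stdev


def staining_index(positive, negative):
    """Signal-to-noise ratio: median separation over twice the negative spread.

    Assumption: the common flow cytometry form
    (median_pos - median_neg) / (2 * SD_neg).
    """
    return (median(positive) - median(negative)) / (2 * stdev(negative))


# Toy ADT counts for one antibody at one concentration (made-up values)
positive_counts = [100, 110, 120]
negative_counts = [8, 10, 12]

print(staining_index(positive_counts, negative_counts))
```

Computing this value for each antibody at each concentration in the dilution series gives a titration curve, and the concentration at which the curve begins to plateau is a reasonable working choice.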
All antibodies exhibited only background signal in the negative control conditions and very weak signal-to-noise at 0.06 μg/test. We observed that the signal-to-noise ratio for most antibodies began to saturate within the concentration range of 0.5 to 1 μg/test, comparable to the recommended concentrations for flow cytometry (Fig. 3d). This experiment was meant as a proof of concept: an ideal titration experiment would use cells from the same donor for all conditions and a larger range of concentrations. Nevertheless, it clearly demonstrates how Cell Hashing can be used to rapidly and efficiently optimize experimental workflows.
Cell Hashtags enable the discrimination of low-quality cells from ambient RNA
Our cell hashtags can discriminate single cells from doublets based on the clear expression of a single HTO, and we next asked whether this feature could also distinguish low-quality cells from ambient RNA. If so, this would enable us to reduce our UMI “cutoff” (previously set at 200) and would allow for the possibility that certain barcodes representing ambient RNA may express more UMI than some true single cells. Most workflows set stringent UMI cutoffs to exclude all ambient RNA, biasing scRNA-seq results against cells with low RNA content and likely skewing proportional estimates of cell type.
Indeed, when considering 4344 barcodes containing 50–200 UMI, we recovered 1110 additional singlets based on HTO classifications, with 3108 barcodes characterized as negatives. We classified each barcode as one of our previously determined 7 hematopoietic populations (“Methods” section; Fig. 1f) and visualized the results on a transcriptomic tSNE embedding, calculated independently for both “singlet” and “negative” groups. For predicted singlets, barcodes projected to B, NK, T, and myeloid populations that were consistently separated on tSNE, suggesting that these barcodes represent true single cells (Fig. 3e). In contrast, “negative” barcodes did not separate based on their forced classification, consistent with these barcodes reflecting ambient RNA mixtures that may blend multiple subpopulations. We therefore conclude that by providing a readout of sample identity that is independent of the transcriptome, Cell Hashing can help recover low-quality cells and/or cells with very low RNA content that can otherwise be difficult to distinguish from ambient RNA (Fig. 3f).
Towards a universal Cell Hashing antibody reagent
For our proof of principle experiments, we used a pool of antibodies directed against highly expressed immune surface markers (CD45, CD98, CD44, and CD11a). To enable multiplexing of any cell type and sample, we decided to redesign our panel to target more ubiquitously expressed surface markers. MHC class I complex (beta-2-microglobulin) and the sodium-potassium ATPase-subunit (CD298) are among the most broadly expressed surface proteins in human tissues [21]. Using a pool of antibodies directed against both proteins would allow us to multiplex virtually any cell type in one experiment. While this manuscript was under revision, the same antibody combination was demonstrated by Hartmann and colleagues to be a universal multiplexing reagent for CyTOF [22]. The extremely high expression levels of both markers should enable robust HTO demultiplexing, but in principle could label cells with an overwhelming number of single-stranded polyA oligos that might compete with polyadenylated cellular mRNAs, resulting in lower gene and/or UMI counts per cell. To investigate this potential competition, we stained Jurkat cells with a dilution series of Cell Hashing antibodies, ran a lane of 10x Chromium single cell 3′ v2 alongside a lane with non-hashed cells, and sequenced the resulting transcriptome libraries. Transcriptomic complexity levels, as indicated by the relationship between sequencing reads and UMI counts per cell, were indistinguishable from non-hashed cells over all tested concentrations of Cell Hashing antibodies, illustrating no disadvantages when multiplexing samples (Additional file 1: Figure S5). Taken together, these results demonstrate how Cell Hashing can be easily applied to virtually any human sample with readily available commercial reagents and without a loss of transcriptomic complexity.