DropletQC: improved identification of empty droplets and damaged cells in single-cell RNA-seq data

Background Advances in droplet-based single-cell RNA-sequencing (scRNA-seq) have dramatically increased throughput, allowing tens of thousands of cells to be routinely sequenced in a single experiment. In addition to cells, droplets capture cell-free “ambient” RNA predominantly caused by lysis of cells during sample preparation. Samples with high ambient RNA concentration can create challenges in accurately distinguishing cell-containing droplets and droplets containing ambient RNA. Current methods to separate these groups often retain a significant number of droplets that do not contain cells or empty droplets. Additionally, there are currently no methods available to detect droplets containing damaged cells, which comprise partially lysed cells, the original source of the ambient RNA. Results Here, we describe DropletQC, a new method that is able to detect empty droplets, damaged, and intact cells, and accurately distinguish them from one another. This approach is based on a novel quality control metric, the nuclear fraction, which quantifies for each droplet the fraction of RNA originating from unspliced, nuclear pre-mRNA. We demonstrate how DropletQC provides a powerful extension to existing computational methods for identifying empty droplets such as EmptyDrops. Conclusions We implement DropletQC as an R package, which can be easily integrated into existing single-cell analysis workflows. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-021-02547-0.

problem, several methods have been developed to estimate and remove its contribution to gene expression [2][3][4].
High levels of ambient RNA also create challenges in accurately identifying cellcontaining droplets. This is a particular problem for data generated from solid tissues, where more fragile cells are more likely to become damaged during dissociation and contribute to ambient RNA. We thus have three scenarios that need to be differentiated: empty droplets containing high concentrations of ambient RNA, droplets containing damaged cells, and droplets containing cells with limited ambient RNA. Using cutoffs based on the total number of RNA fragments assigned to each droplet, such as those originally proposed by Macosko et al. [5] and Zheng et al. [6], risks both including empty droplets and excluding small cells with below-average RNA content. The EmptyDrops method [7] addresses this issue through a more sophisticated approach, calculating the profile of the ambient RNA pool and testing each barcode for significant deviations from this profile. A favored alternative to simple UMI cut-offs, this method has been integrated as the default cell-calling algorithm in the widely used 10x Genomics Cell Ranger pipeline [6]. However, cell-free droplets with high ambient RNA concentration and damaged cells are still retained by this method.
Here, we present DropletQC, a new method that is able to simultaneously improve the detection of cell free droplets and droplets containing damaged cells. Taking advantage of the observation that unspliced and spliced mRNAs can be distinguished in common scRNA-seq protocols [8], we develop a novel metric: the nuclear fraction. The nuclear fraction quantifies, for each droplet, the proportion of RNA originating from unspliced pre-mRNA. Ambient RNA consists predominantly of mature cytoplasmic mRNA. This may arise as RNA is released from damaged cells in which the nuclear envelope remains intact, or capped and polyadenylated transcripts may be more stable in the extracellular environment ( Fig. 1). Regardless, droplets that contain only ambient RNA have a low nuclear fraction compared to droplets containing cells (Additional file 1: Figure S1). In contrast, damaged cells due to the depletion of cytoplasmic RNA will have a higher nuclear fraction compared to intact cells. By using the nuclear fraction score in combination with UMIs per droplet, we are able to accurately distinguish between empty droplets, damaged cells, and intact cells.
To assess the ability of DropletQC to identify both empty droplets and droplets containing damaged cells, we applied it to four independent scRNA-seq datasets; embryonic mouse brain, glioblastoma tumor, peripheral blood mononuclear cells (PBMCs), and Hodgkin's lymphoma tumor. To determine whether DropletQC could identify empty droplets missed by current methods, barcodes were filtered using 10x Genomics Cell Ranger 6.1.1 [6], CellBender [9], EmptyNN [10], or EmptyDrops as implemented in DropletUtils [7]. For all methods tested, DropletQC identified additional cell free droplets (Additional file 1: Figure S2). These droplets are identified by DropletQC using an automatic threshold based on the distinctly lower nuclear fraction scores exhibited compared to droplets containing cells, which contain a mixture of mature (cytoplasmic) and precursor (nuclear) mRNA (Fig. 2). To validate the identified droplets do not contain intact cells, the levels of two transcripts; MALAT1 and NEAT1 were quantified for each droplet. These abundant lncRNAs maintain structural roles in nuclear speckles and paraspeckles respectively and are retained exclusively within the cell nucleus [11]. Droplets identified by DropletQC as cell-free displayed low levels of both transcripts (Additional file 1: Figure S3), indicating the droplets identified as empty do not contain intact cells. Cell free droplets represent 9.5% of mouse brain, 6.0% of Hodgkin's lymphoma, 4.0% of glioblastoma, and 0.4% of PBMCs retained following filtering with Empty-Drops (Additional file 1: Table S1). Cells from dissociated tissue (Fig. 2a-c) contained more empty droplets with high RNA content than PBMCs (Fig. 2d), suggesting ambient RNA may be released from cells damaged during sample preparation.
Following identification of empty droplets, droplets containing damaged cells are identified using expectation maximization and a Gaussian mixture model to separate them from droplets containing intact cells. As both the total UMI count and nuclear fraction scores display distinct distributions for different cell types (Additional file 1: Figure S4), it is necessary to first annotate cell types, so that damaged cells may be identified separately for each. Cells were annotated for each sample using a combination of gene markers and supervised classification with scPred [12]. Of the remaining cells, 14.0% of mouse brain, 5.2% of Hodgkin's lymphoma, 9.8% of glioblastoma tumor cells, and one PBMC cell were identified as damaged cells (Additional file 1: Table S1).
As an additional test of the ability of DropletQC to identify damaged cells, we applied the method to data from a recent investigation on the effects of cryopreservation on the transcriptomes of macaque microglia [13]. DropletQC revealed an increase in the proportion of damaged cells following cryopreservation from 4.1 to 13.8% (Additional file 1: Figure S5, Table S2). These findings have implications for the suitability of prospectively archiving samples for scRNA-seq studies through cryopreservation and To validate that the cells identified by DropletQC are indeed damaged, we applied the method to data from HEK293 cells stressed with staurosporine and captured in healthy, pro-, and lateapoptotic states [14]. DropletQC identified an increase in both the proportion of damaged cells and cell free droplets as cells progressed toward the late-apoptotic state (Additional file 1: Figure S6a-d). HEK293 cells identified as damaged were associated with a higher mitochdondrial gene content (Additional file 1: Figure S6e) a hallmark of damaged and dying cells [15]. Similarly, damaged cells for other datasets were associated with a higher mitochondrial gene content (Additional file 1: Figure S7).
To assess whether DropletQC is applicable to commonly used 5′ scRNA-seq protocols, we examined two additional scRNA-seq datasets; glioblastoma tumor 5′ v1 and mouse splenocytes 5′ v2. After filtering with EmptyDrops, DropletQC identified both empty droplets and damaged cells in both samples (Additional file 1: Figure S8, Table  S3). As common protocols capture unspliced intronic sequences [8], we anticipate Dro-pletQC will be applicable to most scRNA-seq datasets.
By default, the DropletQC software provides a flag for empty droplets and damaged cells, but does not remove them from the dataset. Depending on the biological analyses, damaged cells may retain useful information, and as such, it may be desirable to retain this metadata throughout downstream analyses. In addition, care should be taken that cells are accurately annotated, to avoid confounding distinct cell types as populations of damaged and intact cells. Similarly, cells such as erythrocytes, which contain small amounts of mature mRNA, may be misidentified as empty droplets and can be rescued downstream if desired. We note that intron and exon boundaries should be welldefined in the reference transcriptome for accurate estimation of the nuclear fraction. Total UMI counts (y-axis) and nuclear fraction scores (x-axis) are shown for each cell, with colors representing the status of each cell assigned by DropletQC. Empty droplets contain less RNA than cells and a higher fraction of cytoplasmic RNA (low nuclear fraction score). Damaged cells contain less RNA than intact cells and a higher proportion of unspliced RNA fragments (high nuclear fraction score) An option to calculate the nuclear fraction using a user-provided gene annotation file is provided, so that it may be easily recalculated as the quality of a species' gene annotation is improved.
For samples with large percentages of ambient RNA, some damaged cells and empty droplets may be missed by DropletQC. However, these can be identified by their low RNA content (Fig. 2a) and may be easily flagged using a minimum UMI threshold. We recommend that DropletQC be used in tandem with a tool such as EmptyDrops to prune the majority of cell barcodes, before identifying any remaining cell free droplets or damaged cells. Calculation of the nuclear fraction, identification of empty droplets and damaged cells are implemented as separate functions within the DropletQC package.
In summary, we have shown that DropletQC is able to successfully identify both empty droplets and damaged cells in data from a range of tissue types.

Nuclear fraction calculation
The DropletQC method first calculates the nuclear fraction for each droplet, which is the proportion of RNA fragments that originate from intronic regions. It is calculated as: where NF i is the nuclear fraction for droplet i, IR i are the reads that map to intronic regions for droplet i, and ER i are those that map to exonic regions. We have implemented two methods to map reads to either intronic or exonic regions. The first takes advantage of region tags, such as those added by 10x Genomics' Cell Ranger count analysis pipeline that identify the region type of each genome-aligned RNA fragment; exonic, intronic, or intergenic. These are efficiently counted using the nuclear_fraction_ tags function to calculate a nuclear fraction score for each provided cell barcode. Alternatively, if region tags are missing, our second method assesses aligned reads for overlap with intronic regions using the nuclear_fraction_annotation function in combination with a user-provided gene annotation file. To speed up processing of indexed, coordinate-sorted alignment files, reads are split across a user-specified number of genomic regions to allow parallel computation. The four samples presented in the manuscript were processed with 8 CPUs and 16Gb of RAM with an average processing time of 106 s per 100 million reads using the nuclear_fraction_tags function and 132 s per 100 million reads using the nuclear_fraction_annotation function.

Identifying empty droplets and damaged cells
Empty droplets are classified as all barcodes that fall below a defined nuclear fraction threshold. To identify a suitable threshold, a kernel density estimate is calculated using the nuclear fraction scores. The first derivative of the estimate is then calculated to identify the local minimum immediately following the first peak, corresponding to the population of empty droplets. If the automatically selected cut-off misidentifies the empty droplet population, two user-adjustable parameters are provided; a nuclear fraction threshold and a total UMI threshold, above which all barcodes are marked as cells.
To identify droplets containing damaged cells, barcodes are assessed separately for each cell type. It is assumed damaged cells have both a lower UMI count and higher nuclear fraction score than undamaged cells. We therefore use a two component (k) gaussian mixture model, implemented with mclust [16], to classify droplets containing damaged cells: where X is a dataset with log 10 (UMI) and estimated nuclear fractions for 1-n droplets of a given cell type. μ and σ 2 are the mean and variance, and α represents the mixing weight of a given component. The initial model parameters are calculated as: where N k is the number of data points in the kth component. Following the initialization, we estimate parameters using expectation maximization by asking what is the posterior probability that a droplet (x i ) belongs to component k j : where, N k in the initial component parameters are replaced with the posterior probability and recalculated, with these steps repeated until convergence determined using the Bayesian information criterion. This model identifies the minimum separation required between the identified distributions for a population of droplets to be marked as damaged. We then label droplets as containing a damaged cell based on a higher mean nuclear fraction and lower mean UMI than the cell population, a mean nuclear fraction greater than the cell population mean by a user-adjustable amount (default 0.15), and a mean UMI count lower than the cell population (default 50%).

Data Cell filtering and annotation
For the mouse brain, Hodgkin's lymphoma, glioblastoma, and PBMC samples presented in Fig. 2, prior to calculating the nuclear fraction score, all cell barcodes were assessed for a significant deviation from the ambient RNA expression pattern using the Empty-Drops method implemented in DropletUtils [7]. The lower bound on the total UMI count used to identify empty droplets was increased from 100 to 500 and all other parameters were left at their default values. Barcodes below a false discovery rate threshold of 1% were excluded. Remaining barcodes were additionally filtered for a maximum mitochondrial gene content of 15% to exclude low-quality cells, in line with current best practices when assessing common scRNA-seq quality control metrics. Mouse brain and PBMC cell types were annotated by supervised classification with the scPred [12] using annotated PBMC [17], mouse brain [18], and developing mouse brain [19] reference datasets. The glioblastoma sample cell types were identified using cell-type specific gene markers for oligodendrocytes (

Method comparison
CellBender's remove-background function was run with the number of epochs set to 150 and fpr of 0.01, as per the default parameters. The total-droplets-included argument was set to 20,000 and the expected-cells argument to 10,000 for the mouse brain and PBMC datasets and 5000 for the glioblastoma and Hodgkin's lymphoma samples. EmptyNN was run with the number of k-folds set to 10, for 10 iterations. The UMI counts threshold was set to the default value of 100, as well as a value of 500. Empty-Drops was run with default parameters and the lower bound on the total UMI count set to the default value of 100, as well as 500. Barcodes below a false discovery rate threshold of 1% were excluded as empty droplets. To provide a fair comparison with existing practices for assessing QC metrics, cells were additionally filtered with a mitochondrial gene content threshold of 15% before being assessed with DropletQC.

Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1186/s13059-021-02547-0. Peer review information Barbara Cheifet was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Review history
The review history is available as Additional file 2.
Authors' contributions WM and JEP developed the DropletQC method and wrote the manuscript. WM conceived the idea of the project and performed the computational analysis. Both authors read and approved the final manuscript.

Authors' information
Twitter handles: @WalterMuskovic (Walter Muskovic); @drjosephpowell (Joseph E Powell) Funding J.E.P is supported by a National Health and Medical Research Council Investigator Grant (1175781). This work was also supported by the Australian Research Council Discovery project DP190100825.

Availability of data and materials
The four 3′ single-cell gene expression datasets presented in Fig. 2 of the manuscript as well as the two 5′ datasets presented in Additional file 1: Figure S8 are made publicly available through the 10x Genomics website: https:// support.10xgenomics.com/single-cell-gene-expression/datasets [26][27][28][29][30][31]. The macaque microglia expression data is available from the NCBI GEO database, under accession GSE162663 [32]. The staurosporine-treated HEK293 cell dataset is available from the ENA repository with the study accession number PRJEB33078 [33]. All of the code used to produce the analyses and figures presented in the manuscript, along with links to individual datasets, are available through GitHub at https://github.com/powellgenomicslab/dropletQC_paper [34] under the MIT license and on zenodo at doi:10.5281/zenodo.5708997 [35]. DropletQC is available as an R package at https://github.com/powellgenomicslab/ DropletQC [36] under the MIT license and on zenodo at doi:10.5281/zenodo.5708994 [37].

Declarations
Ethics approval and consent to participate Not applicable.