 Software
 Open Access
dropEst: pipeline for accurate estimation of molecular counts in dropletbased singlecell RNAseq experiments
 Viktor Petukhov^{1, 2},
 Jimin Guo^{2},
 Ninib Baryawno^{3, 4, 5},
 Nicolas Severe^{3, 4, 5},
 David T. Scadden^{3, 4, 5},
 Maria G. Samsonova^{1} and
 Peter V. Kharchenko^{2, 4}Email author
 Received: 19 August 2017
 Accepted: 9 May 2018
 Published: 19 June 2018
Abstract
Recent singlecell RNAseq protocols based on droplet microfluidics use massively multiplexed barcoding to enable simultaneous measurements of transcriptomes for thousands of individual cells. The increasing complexity of such data creates challenges for subsequent computational processing and troubleshooting of these experiments, with few software options currently available. Here, we describe a flexible pipeline for processing dropletbased transcriptome data that implements barcode corrections, classification of cell quality, and diagnostic information about the droplet libraries. We introduce advanced methods for correcting composition bias and sequencing errors affecting cellular and molecular barcodes to provide more accurate estimates of molecular counts in individual cells.
Background
RNAseq protocols have been optimized to enable largescale transcriptional profiling of individual cells. Such singlecell measurements require both improved molecular techniques as well as effective ways to isolate and process a large number of cells in parallel. While singlecell RNAseq (scRNAseq) remains a challenging technique, several solutions are being increasingly applied, most notably techniques based on droplet microfluidics such as inDrop [1], Dropseq [2], and the 10x Chromium platform. In these approaches, cells are encapsulated in waterbased droplets together with barcoded beads and necessary reagents within an oilbased flow. This allows the RNA material extracted from each cell to be contained within the droplet and tagged by a unique cellular barcode (CB) carried on the bead.
InDrop and similar approaches pool material from different cells to prepare the library, and rely on computational analysis to recognize the reads originating from the same cell based on the CB contained in the read sequence. The reads also carry a random barcode—a unique molecular identifier (UMI) [3, 4]—that can be used to discount the redundant contribution of reads originating from the same cDNA molecule as a result of library amplification. As such, the primary aim of the dataprocessing pipeline, including the one presented here, is to provide accurate estimates of the number of molecules that have been observed for each gene in each measured cell—a molecular count matrix. Accurate estimation of such a matrix is crucial, as it commonly provides the starting point for all downstream analysis, such as cell clustering or tracing of cell trajectories.
Several factors complicate the estimation of this molecular count matrix, well beyond simple parsing of the read sequences. First, the procedure must separate reads originating from droplets containing real cells from contributions of empty droplets which can amplify extracellular background transcripts and significantly outnumber the real cells. Some of the droplets may contain damaged or fragmented cells, which complicates such separation. The procedure must also address problems stemming from sequencing errors, particularly errors within the CB or UMIs which result in misclassification of reads. Similarly, skewed distribution of UMIs can lead to biased estimation of molecular counts. Finally, as dropletbased scRNAseq protocols are still relatively new, detailed diagnostics and multiple quality control steps are typically needed to ensure highquality measurements and identify likely sources of problems. Given the current lack of such general processing pipelines for dropletbased scRNAseq, we have set out to provide an opensource implementation.
Results
We have developed a highperformance pipeline to perform initial preprocessing and analysis of dropletbased scRNAseq data. The pipeline characterizes the quality of a library using a wide range of diagnostic indicators, filters out artefactual cellular barcodes, evaluates and corrects for potentially confounding effects of uneven UMI coverage, and corrects for UMI and cellular barcode sequencing errors based on molecular similarity measures that do not require prior knowledge of the possible barcode sequences. It is designed to be used with different alignment methods and provides configuration options to accommodate alternative scRNAseq protocol designs.
Uneven UMI frequency distribution distorts molecular count estimation
Errors in UMI sequence lead to overestimation of molecular counts
An error introduced into a UMI sequence during the library preparation can be mistakenly interpreted as an additional molecule. Computational corrections have been proposed to avoid such overestimation. The simplest such approach [5] omits for a given gene all UMIs that have an adjacent UMI sequence (Hamming distance equal to 1) with a larger number of reads (as in [6] we refer to this method as cluster). Indeed, the probability of two molecules of the same transcript in the same cell being labeled by UMIs of Hamming distance 1 is low, given sufficient size of the UMI pool relative to the number of molecules of that transcript (see “Methods”). However, for moderately expressed genes the observed number of such events exceeds the expected frequency by a factor of ~ 40 (Additional file 1: Figure S3), suggesting that most adjacent UMI occurrences are erroneous. A more complex, networkbased solution [6] (referred to here as directional) considers a UMI to be erroneous if it has an adjacent UMI with more than twice the number of reads.
An alternative approach, implemented in the 10x Chromium Cell Ranger pipeline [7], uses UMI base call quality to distinguish erroneous UMIs. Examining different dropletbased datasets, we find that the fraction of UMI errors that can be distinguished by lower base call quality varies between datasets, within the range of 29.4–85.6% (Additional file 1: Figure S4). This suggests that a substantial fraction of UMI errors may originate during PCR amplification or other library preparation steps preceding the sequencing itself. Base call quality would not be informative in such cases. Furthermore, the existing methods do not consider the total number of molecules for a given gene, even though the probability of observing adjacent UMIs by chance increases. Such increase is further exacerbated by an uneven distribution of UMI frequencies described in the previous section. For instance, for the inDrop Bone Marrow dataset (dataset 11; see “Methods”), the probability of observing adjacent UMIs under the empirically observed distribution is up to 20% higher than under the uniform distribution (Additional file 1: Figure S5A).
To improve the accuracy of UMI filtering, we developed a Bayesian approach to estimate the posterior probability of a UMI being erroneous based on the gene expression magnitude, the observed number of adjacent UMI sequences, the prior distribution of UMIs, as well as the basecall quality in the position of the nucleotide substitution (see “Methods”). To evaluate performance of different UMI correction approaches, we used artificially trimmed UMIs, where the expected ground truth would be known with high certainty. Specifically, we used the 10x posttransplant bone marrow mononuclear cell (BMMC) data (dataset 7), which has relatively long 10 bp UMIs [8]. Given the lower rate of accidentally observing UMIs at Hamming distance 1 in these longer UMIs, we applied the cluster UMI filtering procedure to obtain benchmark ground truth expression estimates for the dataset. We then simulated datasets with shorter UMIs by trimming the UMI sequences, comparing the resulting molecular count estimates to the corresponding fulllength benchmark values (see “Methods”). As nucleotide diversity can vary depending on the position in the UMI, we used two versions of trimming: from the front of the UMI sequence and from the back. Both scenarios showed significant excess of UMI collisions compared to what is expected from the uniform distribution (Additional file 1: Figure S5B). More collisions were observed under the front trimming scenario leading to more collisions than with back trimming, indicating lower sequence diversity towards the end of the UMI.
To further compare the accuracy of UMI corrections introduced by different methods, we examined the distribution of edit distances between two random UMIs following different corrections, comparing it with the expected analytical estimate of the edit distance distribution. Such validation was originally proposed by Smith et al. [6] in describing the directional method. Figure 2d shows that, in contrast to other correction methods, the distribution of edit distances after the Bayesian correction closely resembles the theoretical estimation. The differences in probabilities are most notable for edit distances of 1 and 2.
Correction of the cellular barcode sequence errors
We first examined methods for correcting CB sequence errors. In doing so we considered two scenarios: one where a list of possible valid CB sequences is known (e.g., 10x or inDrop), and another where CBs can be an arbitrary nucleotide sequence (e.g., Dropseq). Predesigned CB sequences are typically evenly spaced in the sequence space, and replacing an erroneous CB with the closest matching valid CB sequence is an effective strategy. The space of potential valid CBs can be further narrowed down by taking into account that the valid CB shouldn’t have fewer counts than the erroneous CB. However, if the list of possible valid CBs is unknown, or if there are many similar CBs (e.g., short barcodes), the number of possible merge targets increases significantly (Fig. 3b). To accurately determine the probability that two CBs originated from one CB, we used UMI–gene composition similarity, which evaluates the likelihood that two independent cells will end up producing equivalent UMI–gene combinations (see “Methods”). This method was compared with the simpler approach, which, for every CB, checks if another CB exists with similar CB sequence (Hamming distance ≤ 2) and containing more molecules, and then merges such CBs.
Analysis of merge targets on human–mouse mixture datasets
Dataset  Merge type  Number of merges  Fraction of mixed merges  Similarity to merge with barcodes 

10x  Poisson  8999  0.58%  99.74% 
10x  Known barcodes  8985  0.62%  100% 
10x  Simple  21,827  32.96%  20.67% 
Dropseq  Poisson  15,186  0.83%  – 
Dropseq  Simple  26,154  8.74%  – 
Recognizing damaged or low quality cells
The number of molecules associated with a given CB generally provides reasonable criteria for selecting real cells [1, 7]. Similarly, CBs with very few associated reads likely represent empty droplets. However, classifying CBs in the intermediate range poses a challenge. The intermediate size CBs likely contain damaged or dying cells from which relatively little mRNA material could be recovered [9]. This complicates the optimization of a size separation threshold. Such lowquality cells could also cover a range of sizes, making the use of a single size cutoff ineffective.
Fivefold CV comparison of classifiers
Classifier  Sensitivity on CV (%)  Specificity on CV (%)  Stability on class 1 (%)  Stability on class 0 (%) 

KDE  90.4 (±0.8)  91.1 (±3.4)  89.8 (±2.9)  97.3 (±1.3) 
Random Forest  89.6 (±2.3)  92.7 (±1.6)  87.3 (±7.3)  97.7 (±1.3) 
Robust GP  87.7 (±2)  94.8 (±2.2)  85.3 (±0)  99.3 (±0.5) 
Discussion
Dropletbased microfluidics protocols and other highthroughput methods are enabling production of large singlecell RNAseq datasets (10^{3}–10^{6} cells). Complex barcoding schemes employed by such methods require indepth computational analysis to achieve accurate recovery of molecules associated with different cells and genes. In order to avoid collisions of cellular barcodes, large numbers of cells necessitate longer CBs, increasing the probability that a sequence error will be introduced into a CB during the bead construction steps [1], library preparation procedures, or library sequencing. We show that for many such errors there are multiple equidistant CBs from which the molecule may have originated. The implemented solution, which merges CBs based on the probabilistic assessment of the molecular overlap between the CBs, provides accurate correction even in cases when the set of possible valid CBs is not known in advance.
Errors affecting molecular barcodes (UMIs) pose a similar challenge, which in this case is driven by increasing sequencing depth of individual cells. This has been recognized by earlier studies [5], and several correction strategies have been proposed. We show that the overall distribution of UMI sequence occurrences is not uniform, and the resulting bias reduces the effective UMI space leading to increased number of UMI collisions in wellexpressed genes and deflated molecular counts. Some of the UMI errors appear to result from occurrence of aberrant library molecules incorporating mononucleotide primers, such as poly(T) into the UMI position. On the other hand, point mutations in UMIs and aberrant base calls can lead to inflated molecular counts. While most UMI errors can be mitigated experimentally by increasing the UMI length, we show that taking into account empirical distribution of UMI frequencies allows adjustment for both UMI collision and sequence error effects.
Even with corrections of CB sequence errors, most of the CBs encountered in the current dropletbased datasets do not represent real cells. These additional molecules may originate from empty droplets capturing extracellular background. Indeed, examination of mouse–human dataset mixtures suggests that smaller CBs have a higher crossorganism contamination fraction than one would expect from extracellular background. In addition to empty droplets, some of the lowmagnitude CBs may represent damaged, dying, or dead cells, as well as cells that were not successfully measured for other reasons. The challenge of identifying damaged cells has been previously examined by Ilicic et al. [9] in the context of the Fluidigm C1 protocol, where the proportion of lowquality cells is typically in the range of 10–40%. This fraction can be much higher in the inDrop data (e.g., 90% of CBs), and obtaining microscopybased labeling for the classification is challenging given the rapid flow within the devices. We instead explored application of faulttolerant classifiers to identify technical features consistent with an imperfect initial separation of highquality cells based on the size criteria alone. Such an approach is able to pick up relatively large cells that resemble poorly measured cells based on their technical features, and rescue some of the smaller cells that look consistent with the highquality tail of the cell distribution.
Conclusions
Overall, we hope that the developed pipeline will facilitate analysis of dropletbased singlecell RNAseq data, providing helpful diagnostics (see Additional file 3: Supplementary Note 1 for an example of a dropEst pipeline report) and improving the accuracy of the resulting expression estimates.
Methods
The dropEst pipeline operates in three phases: i) identifier parsing phase; ii) read mapping phase; and iii) filtering and quality control phase. The first phase takes as an input FASTQ files containing pairedend read and index data. The output of this phase is a modified FASTQ file with reads which can be aligned to a transcriptome reference during the second phase using a standard spliceaware aligner (e.g., STAR [16] or TopHat 2.1.0 [17], which was used in our work). The third phase takes BAM files with the aligned reads [18] and a gene annotation file in GTF format. BAM files produced by the 10x Cell Ranger pipeline can also be provided when running this stage. The result of pipeline is an Rreadable file that contains molecular count matrix and other processed information, as well as a report with diagnostic information on the library. Sample runtimes for different pipeline steps are shown in Additional file 2: Table S8.
Correction of UMI collisions
In cases when the number of UMIs per gene is comparable to the total UMIs pool size, the gene expression level will be underestimated [3] and needs to be adjusted by taking UMI collisions into account. Fu et al. [3] assumed uniform distribution of UMI probabilities, and then the number of unique UMIs expected for number of molecules n from a UMI pool of size m is k = m[1 − e^{−(n/m)}], thus \( n=m\ \mathit{\ln}\left(1\frac{k}{m}\right) \). To account for a nonuniform UMI distribution observed in the droplet datasets, we estimated n(k, m) by modeling the collisions process. Let’s assume that we have a gene with k distinct UMIs G_{ k } and a distribution of UMI probabilities P(u_{ i }). In this case, the probability of observing a new distinct UMI is \( p\left({u}^{\prime}\notin {G}_k\right)={\sum}_{i=1}^mp\left({u}_i\right)\ast {\left(1p\left({u}_i\right)\right)}^{n(k)} \). The expected number of collisions prior to obtaining a new distinct UMI is equal to p(u' ∉ G_{ k })^{−1} and n(k + 1) = n(k) + p(u' ∉ G_{ k })^{−1}. Thus, we can use a stepbystep procedure to estimate n(k) ∀ k ∈ 1 : m.
To validate the developed method, we simulated UMI collisions using a bootstrap procedure (Additional file 1: Figure S2). To do so, we estimated the number of collisions by sampling UMIs from the common distribution one by one, until the expected number of distinct UMIs was reached.
Correction of UMI sequence errors
To determine whether two UMIs represent technical variations (sequencing errors) of the same UMI, we use a Bayesian approach to estimate the number of errors within each gene within each cell by maximal likelihood. To do so, we can model the process of generating UMI composition.

U, sequence of the first UMI.

u, sequence of the second UMI.

R, number of reads for the first UMI.

r, number of reads for the second UMI, r ≤ R.

N_{ S }, number of adjacent (Hamming distance of 1) UMIs for the UMI U with the number of reads r' : r' ≤ R.

N_{ L }, number of adjacent UMIs for the UMI U with the number of reads R' : R' > R.

S_{ g }, number of UMIs in the gene.

q, mean Phred quality score of the distinguishing position for the UMI u. Here, we use mean value as we already include parameter r, which is strongly correlated with the total (sum) quality score.

#Errors is the number of erroneous UMIs

#Real is the number of real UMIs
Here, event Err(Ω_{ E }) means that all UMIs from Ω_{ E } were generated from U by an error \( \left( Er{r}_{U,{u}^{\prime }}\forall {u}^{\prime}\in {\Omega}_E\right) \). We can omit \( p\left(\overrightarrow{r},R,U,{N}_S,{N}_L,{S}_g\right) \) as it doesn’t depend on the separation Ω_{ E }, Ω_{¬E}. We can also assume independence of properties of Ω_{ E } and properties of Ω_{¬E}:
Direct estimation of the distribution p(#Errors = k) = ∑ p(Ω_{ E }, Ω_{¬E}) requires an exhaustive search over all subsets of Ω, which takes O(2^{Ω}) operations, making it computationally intractable. To optimize this estimation, let us assume that we can estimate p(u ∈ Ω_{ E }). Furthermore, we can assume that event (u ∈ Ω_{ E }) is equal to (u′ ∈ Ω_{ E }) ∀ u′ : p(u′ ∈ Ω_{ E }) ≥ p(u ∈ Ω_{ E }). The opposite would be true as well: event (u ∉ Ω_{ E }) is equal to (u′ ∉ Ω_{ E }) ∀ u′ : p(u′ ∈ Ω_{ E }) < p(u ∈ Ω_{ E }). Thus, we can order all UMIs according to this probability and reduce the search space to O(Ω). In practice, we don’t even need to estimate p(u ∈ Ω_{ E }), because for a fixed U it depends only on two parameters: r and q. Moreover, it decreases exponentially with increasing r (see explanation below), but there is no such fast dependency for q. So, we can order UMIs by descending r (first), and then by q (second).
Estimating probabilities
Estimation of the quality probabilities
Distribution p(q ¬Err_{U, u}) can be estimated as p(q ¬Err_{U, u}) ≈ p(q), since an event ¬Err_{U, u} does not by itself gurantee that u is real as u can be produced by an error from a UMI other than U. Though distribution p(q) is continuous, we estimated quantized version of this distribution through the following procedure. First, we estimated k uniformly distributed quantiles. All quantiles with the difference in indexing variable q less than 10^{−5} were assumed to be equal and merged. Then, each value of q was rounded off to the nearest quantile. As a result we obtained a discrete distribution with no more than k possible values of the indexing variable. In this work we used k = 15.
To estimate p(q Err_{U, u}) we created a training sample, which contained only pairs of UMIs where u occurred because of an error in U. Such a set was assembled by choosing genes containing two adjacent UMIs only. The theoretical probability p(u, U S_{ g } = 2, ¬Err_{(U, u)}) is negligible. We therefore expect almost all such events to have occurred because of an error in U. Such a training sample is representative because q is independent of S_{ g }. Because values of q are discrete following the quantization, estimation of p(q Err_{U, u}) becomes straightforward.
Estimation of the number of real UMIs

L, length of an UMI.

N_{ UMI }, total number of possible UMIs (in most cases is equal to 4^{ L }).

K, maximum number of the adjacent UMIs (in most cases is equal to 3L).

p_{ Adjacent } = p_{ Adjacent }(U), probability to observe an UMI, adjacent to U. It is equal to \( \sum \limits_{u\in Adjacent(U)}p(u) \).

N′, total number of real adjacent UMIs for the UMI U.
 1.
We were previously in the state (k, s − 1) and picked a UMI which was not a new adjacent UMI (i.e., either a previously observed adjacent UMI or not an adjacent UMI). The probability of such a pick is \( \left(1\frac{Kk}{K}{p}_{Adjacent}(U)\right) \).
 2.
We were previously in a state (k − 1, s − 1) and picked an UMI which is a new adjacent UMI. The probability of such a pick is \( \frac{Kk1}{K}{p}_{Adjacent}(U) \).
Dynamic programming matrix with distributions of the number of adjacent UMIs
\( \frac{S_g}{K} \)  1  2  3  …  S _{ g } 
0  1  1 − p_{ Neighb }  (1 − p_{ Neighb })^{2}  ...  \( {\left(1{p}_{Neighb}\right)}^{S_g1} \) 
1  0  p _{ Neighb }  \( \left(1{p}_{Neighb}\right)\ast {p}_{Neighb}++{p}_{Neighb}\ast \left(1{p}_{Neighb}\frac{K1}{K}\right) \)  …  \( {t}_{0,S1}\ast {p}_{Neighb}++{t}_{1,S1}\ast \left(1{p}_{Neighb}\frac{K1}{K}\right) \) 
2  0  0  \( {p}_{Neighb}^2\frac{K1}{K} \)  …  \( {t}_{1,S1}\ast {p}_{Neighb}\frac{K1}{K}++{t}_{2,S1}\ast \left(1{p}_{Neighb}\frac{K2}{K}\right) \) 
…  …  …  …  …  … 
k  0  0  0  …  \( {t}_{k1,S1}\ast {p}_{Neighb}\frac{Kk+1}{K}++{t}_{k,S1}\ast \left(1{p}_{Neighb}\frac{Kk}{K}\right) \) 
…  …  …  …  …  … 
K  0  0  0  …  \( {t}_{K1,S1}\frac{p_{Neighb}}{K}+{t}_{K,S1} \) 
Estimation of the number of erroneous UMIs
where r_{ i } is ith component of vector \( \overrightarrow{r_{\Omega_E}} \).
Iterative procedure of UMI sequence error correction
After the estimation of the decision boundary, all UMIs that are determined to be erroneous are removed. This changes the input parameters S_{ g }, N_{ L }, and N_{ S } of the algorithm. Therefore, to perform a precise filtration, the procedure is run iteratively.This does not add a significant amount of runtime complexity because: i) dynamic programming matrices are calculatd only once, since the gene size cannot increase during filtration; ii) for genes with a small number of UMIs, the procedure converges after one or two iterations.
Validation
UMI trimming
The UMI error correction algorithms become less effective as the number of molecules per gene increases. To model such situations, we used the 10x posttransplant BMMC dataset, which has 10bp UMIs and relatively small sequencing depth. We then simulated more saturated measurements by trimming UMIs to shorter lengths. The information about each UMI consists of its sequence, the number of reads per UMI, and the mean basecall quality for each nucleotide in the sequence. By trimming both the nucleotide sequence and the quality vector we obtain a new, shorter UMI. After trimming, sequences of some UMIs become identical, which naturally models UMI collision events. All such UMIs are merged by summing their number of reads and calculating the weighted mean of basecall quality vectors (the weight of each vector is equal to its number of reads). For most of the analyses, we trimmed UMIs from the end (back trimming). However, to test for variation of nucleotide diversity along the UMI length, we also trimmed UMIs from the front (see “Results”).
Distribution of Hamming distances between UMIs of the same gene
Errors in UMI sequences lead to more frequent occurrence of adjacent UMIs. Yet, simply omitting all adjacent UMIs would also be incorrect, as the probability of adjacent UMI occurrence is nonnegligible for shorter UMIs and highly expressed genes. Thus, to assess the quality of UMI error correction methods we followed Smith et al. [6] and analyzed distribution of Hamming distances between UMIs within the same gene. To do so we first estimated all pairwise distances between UMIs within each gene within each cell, pooling all distances together. Next, we estimated frequencies of each distance value P(ED = k), and compared it with the theoretical distribution P^{∗}(ED = k) of such distances. The theoretical distribution was estimated by random sampling of UMI pairs from a common UMI distribution. The relative difference between the observed distribution and the theoretical one \( \left(\frac{\leftP\left( ED=k\right){P}^{\ast}\left( ED=k\right)\right}{P^{\ast}\left( ED=k\right)}\right) \) was compared for different correction algorithms.
Correction of cellular barcode sequence errors
CB sequence errors split a fraction of the molecules originating from one cell into smaller CBs. Given that the number of reads per UMI is generally higher than one, the smaller CBs will contain some of the same gene–UMI combinations as the true CB. In other words, the smaller CBs will have similar molecular composition—the set of cell unique genes–UMI combinations. We use composition similarity as a criterion for determining whether the two barcodes should be merged. The size of the compositional intersection between two independent cells is modeled using Poisson distribution with the mean dependent on the UMI distribution and the number of molecules associated with the CBs.
Let us denote S_{c, g} as the set of all UMIs detected for gene g in a cell c. The number of common gene–UMI pairs for cells i and j can be estimated as \( {C}_{i,j}=\sum \limits_{k=1}^m\mid {S}_{i,k}\cap {S}_{j,k}\mid \). Thus, expectation would be \( E{C}_{i,j}=E\sum \limits_{k=1}^m\mid {S}_{i,k}\cap {S}_{j,k}\mid =\sum \limits_{k=1}^mE\mid {S}_{i,k}\cap {S}_{j,k}\mid \). The expectation of the UMI intersection (i.e., the number of shared UMIs) for a pair of genes can be estimated as \( E{C}_{i,j}=\sum \limits_{u\in UMIs}\left(1{\left(1p(u)\right)}^{\left{S}_{i,k}^{\prime}\right}\right)\ast \left(1{\left(1p(u)\right)}^{\left{S}_{j,k}^{\prime}\right}\right) \), where \( {S}_{j,k}^{\prime } \) is the number of UMIs in a gene adjusted for UMI collisions. It is important to note that the expected number of shared UMIs needs to be calculated only once for each pair (S_{ i }, S_{ j }) : S_{ i } ≤ S_{ j }. Having estimated EC_{i, j} we can then assume that C_{i, j} follows Poisson distribution with the mean equal to EC_{i, j}. Using this estimated distribution, we then perform a statistical test for hypothesis H_{0}: the observed size of the intersection S^{∗}_{ intersection } was obtained by chance. The P value of this test is the tail probability of the Poisson distribution \( {P}_{\hat{\lambda}}\left({S}_{intersection}\ge {S^{\ast}}_{intersection}\overrightarrow{E_i},\overrightarrow{E_j},{P}_{UMI}\right) \), where \( \hat{\lambda} \) is the estimated intensity parameter.
The implemented pipeline uses this test to compare each cell C_{ i } with all other cells C_{ j } that 1) have a higher total number of molecules (S_{ j } ≥ S_{ i }) and 2) whose CBs have a Hamming distance from the CB of C_{ i } that is lower than a fixed constant. In the presented results this distance constant was taken to be 2. Bonferroni correction was used to account for multiple comparisons.
To compare merge algorithms, we evaluated their quality on 10x [10] and Dropseq [2] human–mouse mixture datasets (datasets 4 and 12) using the following procedure. First, we filtered out all cells that had < 30 genes for 10x and < 20 genes for Dropseq. Next, for each cell we determined the most likely organism, assigning cells to the genome for which they had more molecules. Next, we chose the largest cells and considered them as real. The exact choice of number of real cells did not have a notable impact on the results. We used 6000 cells for 10x and 1000 cells for Dropseq. In comparing merge algorithms we counted the number of merges performed between different organisms. Only merges to real cells were counted.
Classifying damaged or lowquality cells
Classification algorithm
The implemented approach for classification of damaged and low quality cells can be split into three tasks: (i) creation of the training sample, i.e., establishing the initial class labeling; (ii) feature selection; and (iii) application of the classifier algorithm.
The initial class labels were assigned based on the cell size. To do so, the dataset was split into three parts: ‘big’ cells, ‘intermediate’ cells, and ‘small’ cells. To determine the borders of big and small cells we used the plot log(#UMI in cell) vs log(cell rank) (Additional file 1: Figure S10C). This heuristic is based on an observation that the left part of such a plot has a negative second derivative, followed by a linear part, and a third part of the plot has a positive second derivative. We implemented an automated procedure that locates the upper (t_{ U }) and the lower (t_{ L }) position of the linear part. The cells with size smaller than t_{ L } were then assigned the initial label of ‘lowquality’ cells. The top 75% of the cells with size larger than t_{ U } were assigned the initial label of ‘highquality’ cells. The remaining cells were labeled as ‘unknown’. Alternatively, the initial label borders can be specified manually, for instance based on the shape of the log(#UMI in cell ∗ # Cells) vs log(#UMI in cell) plots (Additional file 1: Figure S10A, B).
 1.
Mean number of reads per UMI.
 2.
Mean number of UMIs per gene.
 3.
Fraction of lowexpressed genes (genes with one molecule).
 4.
Fraction of intergenic reads.
 5.
Fraction of notaligned reads (optional feature, as it typically has to be calculated during the identifier parse phase).
Given the initial labeling and the feature set, the cell classification problem was considered as a problem of establishing robust classification in the presence of training label noise [21]. We compared three classifiers: Kernel Density Estimation classifier [12], Random Forest [13], and Robust Gaussian Processes Classifier [14]. Following evaluation of robustness we chose the KDE classifier with Normal Scale bandwidth selector [22] (the implementation provided by the R package ‘ks’ was utilized [23]). The computational complexity of the KDE classifier estimation has exponential dependency on the dimensionality of the feature space. We therefore reduced the feature space by using the first three principal components of the feature space for classification. To increase the algorithm robustness, we used sparse robust principal component analysis [24] (R package ‘pcaPP’) with sparsity level λ = 1.
The algorithm’s performance can be improved by labeling cells with very high fractions of mitochondrial and intergenic reads as ‘lowquality’. This can be done prior to classifier training, or simply as an additional filter after classifier training. To choose between these two options we employed the following condition: if the intergenic or mitochondrial fraction contributes to any of the first three PCs with the loading ≥5%, we assume that the algorithm is able to distinguish between high/low fraction values, and labeling for the corresponding fraction can be done prior to the classifier training. Otherwise, labeling is done after the training. The extreme fraction thresholds we determined as m + 4a, where m is the 20% trimmed mean mitochondrial (or intergenic) fraction across cells in the dataset, and a is the median absolute deviation of the corresponding fraction.
Validation of the results
 1.
For each cluster we identified differentially expressed genes by comparing it against every other cluster using the Seurat R package [25].
 2.
For each differential gene we counted the number of clusters where it was detected. No more than 50 genes with the largest number of clusters were picked.
 3.
Only genes that were expressed in > 60% of the cells in at least one of the clusters were shown.
Mouse bone marrow inDrop measurements
Whole bone marrow cells were isolated from 11weekold C57Bl/6 male mice (Jackson Laboratory). The epiphysis/metaphysis fraction from long bones was collected, crushed, cut into small pieces, and digested using Collagenase I (STEMCELL Technologies) with agitation for 30 min at 37 °C. Bone marrow cells were filtered through a 70μm filter. Red blood cells were lysed using Acklysis (ThermoFisher Scientific) on ice for 5 min, quenched with Media 199 (ThermoFisher Scientific) supplemented with 2% fetal bovine serum (ThermoFisher Scientific), and spun down at 500 g for 5 min. Cells were stained for 30 min with the red blood cell marker TER119 (Biolegend) and cells were sorted using DAPI (ThermoFisher Scientific) as a live/dead viability marker. Live whole bone marrow cells (400,000; negative for TER119) were sorted into medium 199 (ThermoFisher Scientific). Before inDrop encapsulation cells were counted using a Cellometer (Nexcelom Bioscience). Cell viability was over 90%.
InDrop processing
The concentration of cells was adjusted to 300,000 cells/ml by adding PBS to the sorted cells. The cell suspension was then mixed 1:1 (v/v) with PBS containing 30% OptiPrep Density Gradient Medium (Sigma D1556) to obtain 150,000 cells/ml in 15% Optiprep. Using four microfluidics pumps and a polydimethylsiloxane (PDMS) microfluidic device, about 10,000 cells were coencapsulated with barcoded polyacrylamide beads and a reverse transcription mixture containing Superscript III into waterinoil droplets, according to a published protocol [26]. The library preparation and quality control procedures were carried out as described [26]. Indexed libraries were pooled and sequenced on a Nextseq 500 system (Illumina) at 2 pM concentrations.
Mouse–human cell line mixture inDrop measurement
CK1750 mouse lung cancer cells (Carla Kim laboratory, Boston Children’s Hospital) and K562 human immortalized myelogenous leukemia cells (ATCC) were mixed at a 1:1 ratio to obtain 70,000 cells/ml in PBS containing 15% Optiprep. About 3000 cells were coencapsulated with barcoded polyacrylamide beads and a reverse transcription mixture containing Superscript III into waterinoil droplets; and a library was prepared according to a published protocol [26]. The library was sequenced on a MiSeq system (Illumina).
Availability and requirements
Name: dropEst
Homepage: https://github.com/hmsdbmi/dropEst
OS: linux, OS X
Programming language: C++, R
License: GPL3.
Declarations
Acknowledgements
We would like to thank Allon Klein and Sinisa Hrvatin for their helpful comments on the analysis of inDrop datasets, as well as Carla Kim for providing the mouse cell line for inDrop testing.
Funding
PVK was supported by NIH R01HL131768 from NHLBI and CAREER (NSF14532) award from NSF. DTS and PVK were supported by NIH R01DK107784 from NIDDK.
Availability of data and materials
The following datasets were used in evaluating the developed pipeline:
1. Mouse embryonic stem cells (935 cells, inDrop, SRA accession SRR1784310 (GEO GSE65525) [1]).
2. Human pancreatic cells (inDrop, sample 4, run 2, SRA accession SRR3879612 GEO (GSM2230760) [15]).
3. Mouse pancreatic cells (inDrop, sample 2, SRA accession SRR3879617, SRR3879618, SRR3879619 (GSM2230762) [15]).
4. Mixture of 6k human and mouse cells (6807 cells, 10x Genomics mixture of fresh frozen human (HEK293T) and mouse (NIH3T3) cells [7, 10]).
5. Human frozen peripheral blood mononuclear cells (2900 cells, 10x Genomics frozen PBMCs (donor A), [7, 27]).
6. Human pretransplant bone marrow mononuclear cells (900 cells, 10x Genomics AML035 pretransplant BMMCs [7, 28]).
7. Human posttransplant bone marrow mononuclear cells (900 cells, 10x Genomics AML035 posttransplant BMMCs [7, 8]).
8. Human frozen bone marrow mononuclear cells (2000 cells, 10x Genomics frozen BMMCs (healthy control 1) [7, 29]).
9. Human CD34+ cells (9000 cells, 10x Genomics CD34+ [7, 30]).
10. Human 33k PBMCs (33,000 cells, 10x Genomics human 33k PBMCs from a healthy donor [11]).
11. Mouse bone marrow cells (inDrop, GEO GSE109989).
12. Mixture of 1k human and mouse cells (1100 cells, Dropseq, GEO GSE63269 [2]).
13. Human 8k PBMCs (8000 cells, 10x Genomics human 8k PBMCs from a healthy donor, [31]).
The dropEst pipeline implementation is available on github (under GPL3 license): https://github.com/hmsdbmi/dropEst [32].
The code to reproduce the figures in this paper is also available on github: https://github.com/VPetukhov/dropEstAnalysis (under BSD 3Clause “New” or “Revised” License) [33].
Authors’ contributions
VP and PVK designed, implemented, and tested the methods and drafted the manuscript. JG, NB, and NS conducted inDrop measurements of mouse BM and provided technical advice. PVK oversaw the design of the study, with input from MGS and DTS. All authors read and approved the final manuscript.
Ethics approval
The animal work conducted for this study was approved by the Institutional Animal Care and Use Committee (IACUC) of Massachusetts General Hospital.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Klein AM, et al. Droplet barcoding for singlecell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–201.View ArticlePubMedPubMed CentralGoogle Scholar
 Macosko EZ, et al. Highly parallel genomewide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161(5):1202–14.View ArticlePubMedPubMed CentralGoogle Scholar
 Fu GK, et al. Counting individual DNA molecules by the stochastic attachment of diverse labels. Proc Natl Acad Sci U S A. 2011;108(22):9026–31.View ArticlePubMedPubMed CentralGoogle Scholar
 Islam S, et al. Quantitative singlecell RNAseq with unique molecular identifiers. Nat Methods. 2014;11(2):163–6.View ArticlePubMedGoogle Scholar
 Bose S, et al. Scalable microfluidics for singlecell RNA printing and sequencing. Genome Biol. 2015;16:120.View ArticlePubMedPubMed CentralGoogle Scholar
 Smith TA, et al. UMItools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res. 2017;27(3):491–9.View ArticlePubMedPubMed CentralGoogle Scholar
 Zheng GX, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049.View ArticlePubMedPubMed CentralGoogle Scholar
 Zheng, G.X., et al. AML035 Posttransplant BMMCs. 2016. Available from: https://support.10xgenomics.com/singlecellgeneexpression/datasets/1.1.0/aml035_post_transplant. [cited 2018 11 March].Google Scholar
 Ilicic T, et al. Classification of low quality cells from singlecell RNAseq data. Genome Biol. 2016;17:29.View ArticlePubMedPubMed CentralGoogle Scholar
 Zheng GX, et al. 6k 1:1 mixture of fresh frozen human (HEK293T) and mouse (NIH3T3) Cells. 2016. Available from: https://support.10xgenomics.com/singlecellgeneexpression/datasets/2.0.1/hgmm_6k. [cited 2018 11 March].Google Scholar
 Zheng GX, et al. 33k PBMCs from a healthy donor. 2016. Available from: https://support.10xgenomics.com/singlecellgeneexpression/datasets/1.1.0/pbmc33k. [cited 2018 11 March].Google Scholar
 Friedman JH, Tibshirani R, Hastie T. The elements of statistical learning: data mining, inference, and prediction. In: Springer series in statistics. New York: Springer; 2009. p. 208–10.Google Scholar
 Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.View ArticleGoogle Scholar
 Kim HC, Ghahramani Z. Outlier robust gaussian process classification. In: da Vitoria Lobo N, et al., editors. Structural, Syntactic, and Statistical Pattern Recognition: Joint IAPR International Workshop, SSPR & SPR 2008, Orlando, USA, December 4–6, 2008. Proceedings. Berlin, Heidelberg: Springer Berlin Heidelberg; 2008. p. 896–905.View ArticleGoogle Scholar
 Baron M, et al. A singlecell transcriptomic map of the human and mouse pancreas reveals inter and intracell population structure. Cell Syst. 2016;3(4):346–60. e4.View ArticlePubMedPubMed CentralGoogle Scholar
 Dobin A, et al. STAR: ultrafast universal RNAseq aligner. Bioinformatics. 2013;29(1):15–21.View ArticlePubMedGoogle Scholar
 Kim D, et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.View ArticlePubMedPubMed CentralGoogle Scholar
 Barnett DW, et al. BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics. 2011;27(12):1691–2.View ArticlePubMedPubMed CentralGoogle Scholar
 Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25(1):25–9.View ArticlePubMedPubMed CentralGoogle Scholar
 Gene Ontology C. Gene Ontology Consortium: going forward. Nucleic Acids Res. 2015;43(Database issue):D1049–56.View ArticleGoogle Scholar
 Frenay B, Verleysen M. Classification in the presence of label noise: a survey. IEEE Trans Neural Networks Learning Systems. 2014;25(5):845–69.View ArticleGoogle Scholar
 Chacon JE, Duong T, Wand MP. Asymptotics for general multivariate kernel density derivative estimators. Stat Sin. 2011;21(2):807–40.View ArticleGoogle Scholar
 Duong, T. Package ‘ks’. 2014. Available from: http://cran.rproject.org/web/packages/ks/ks.pdf. [cited 2017 30 July].Google Scholar
 Croux C, Filzmoser P, Fritz H. Robust sparse principal component analysis. Technometrics. 2013;55(2):202–14.View ArticleGoogle Scholar
 Satija R, et al. Spatial reconstruction of singlecell gene expression data. Nat Biotechnol. 2015;33(5):495–502.View ArticlePubMedPubMed CentralGoogle Scholar
 Zilionis R, et al. Singlecell barcoding and sequencing using droplet microfluidics. Nat Protoc. 2017;12(1):44–73.View ArticlePubMedGoogle Scholar
 Zheng, G.X., et al. Frozen PBMCs (donor A). 2016. Available from: https://support.10xgenomics.com/singlecellgeneexpression/datasets/1.1.0/frozen_pbmc_donor_a. [cited 2018 11 March]Google Scholar
 Zheng GX, et al. AML035 pretransplant BMMCs. 2016. Available from: https://support.10xgenomics.com/singlecellgeneexpression/datasets/1.1.0/aml035_post_transplant. [cited 2018 11 March].Google Scholar
 Zheng GX, et al. Frozen BMMCs (healthy control 1). 2016. Available from: https://support.10xgenomics.com/singlecellgeneexpression/datasets/1.1.0/frozen_bmmc_healthy_donor1. [cited 2018 11 March].Google Scholar
 Zheng, G.X., et al. CD34+. 2016. Available from: https://support.10xgenomics.com/singlecellgeneexpression/datasets/1.1.0/cd34. [cited 2018 11 March].Google Scholar
 Zheng GX, et al. 8k PBMCs from a healthy donor. 2016. Available from: https://support.10xgenomics.com/singlecellgeneexpression/datasets/1.3.0/pbmc8k. [cited 2018 11 March].Google Scholar
 Petukhov V, et al. dropEst: pipeline for accurate estimation of molecular counts in dropletbased singlecell RNAseq experiments. Github. 2018. https://github.com/hmsdbmi/dropEst.Google Scholar
 Petukhov V, et al. dropEst: pipeline for accurate estimation of molecular counts in dropletbased singlecell RNAseq experiments. Github. 2018. https://github.com/VPetukhov/dropEstAnalysis.Google Scholar