Initial whitelisting and barcode correction
After standard quality control procedures, the first step of existing single-cell RNA-seq processing pipelines [1–3] is to extract cell barcode and UMI sequences and to add this information to the header of the sequenced read or save it in temporary files. This approach, while versatile, can create many intermediate files on disk for further processing, which can be time- and space-consuming.
Alevin begins with sample-demultiplexed FASTQ files. It quickly iterates over the file containing the barcode reads and tallies the frequency of all observed barcodes (regardless of putative errors). We denote the collection of all observed barcodes as \(\mathcal {B}\). Whitelisting involves determining which of these barcodes may have derived from a valid cell. When the data has been previously processed by another pipeline, a whitelist may already be available for alevin to use. When a whitelist is not available, alevin uses a two-step procedure for calculating one. An initial draft whitelist is produced using the procedure explained below, to select CBs for initial quantification. This list is refined after per-cell-level quantification estimates are available (see “Final whitelisting (optional)” section) to produce a final whitelist.
To generate a putative whitelist, we follow the approach taken by other dscRNA-seq pipelines by analyzing the cumulative distribution of barcode frequencies and finding the knee in this curve [1, 2]. Those barcodes occurring after the knee constitute the whitelist, denoted \(\mathcal {W}\). We use a Gaussian kernel to estimate the probability density function for the barcode frequency and select the local minimum corresponding to the “knee.” In the case of a user-provided whitelist, the provided \(\mathcal {W}\) is used as the fixed final whitelist.
Next, we consider those barcodes in \(\mathcal {E} = \mathcal {B} \setminus \mathcal {W}\) to determine, for each non-whitelisted barcode, whether (a) its corresponding reads should be assigned to some barcode in \(\mathcal {W}\) or (b) this barcode represents some other type of noise or error (e.g., ambient RNA, lysed cell) and its associated reads should be discarded. The approach of alevin is to determine, for each barcode \(h_{j} \in \mathcal {E}\), the set of whitelisted barcodes with which hj could be associated. We call these the putative labels of hj—denoted as ℓ(hj). Following the criteria used by previous pipelines [1], we consider a whitelisted barcode wi to be a putative label for some erroneous barcode hj if hj can be obtained from wi by a substitution, by a single insertion (and clipping of the terminal base) or by a single deletion (and the addition of a valid nucleotide to the end of hj). Rather than applying traditional algorithms for computing the all-versus-all edit-distances directly, and then filtering for such occurrences, we exploit the fact that barcodes are relatively short. Therefore, we can explicitly iterate over all of the valid \(w_{i} \in \mathcal {W}\) and enumerate all erroneous barcodes for which this might be a putative label. Let Q(wi,H) be the set of barcodes from \(\mathcal {E}\) that adhere to the conditions defined above; then, for each hj∈Q(wi,H), we append wi as putative label for the erroneous barcode hj.
Once all whitelisted barcodes have been processed, each element in \(\mathcal {E}\) will have zero or more putative labels. If an erroneous barcode has more than one putative label, we prioritize substitutions over insertions and deletions. If this does not yield a single label, ties are broken randomly. If no candidate is discovered for an erroneous barcode, then this barcode is considered “noise,” and its associated reads are simply discarded. Note that, although adopted from existing methods, the alevin initial whitelisting process is designed to output a larger number of CBs.
Mapping reads and UMI deduplication
After labeling each barcode, either as noise or as belonging to some whitelisted barcode, alevin maps the sequenced reads to the target transcriptome [10, 11]. Reads mapping to a given transcript (or multimapping to a set of transcripts) are categorized hierarchically, first based on the label of their corresponding cellular barcode, and then based on their unique molecular identifier (UMI). At this point, it is then possible to deduplicate reads based on their mapping and UMI information.
The process of read deduplication involves the identification of duplicate reads based on their UMIs and alignment positions. Most amplification occurs prior to fragmentation in library construction for 10X Chromium protocols [23]. Because of this, the alignment position of a given read is not straightforward to interpret with respect to deduplication, as the same initial unique molecule may yield reads with different alignment coordinatesFootnote 2. UMIs can also contain sequence errors. Thus, achieving the correct deduplication requires proper consideration of the available positional information and possible errors.
Our approach for handling sequencing errors and PCR errors in the UMIs is motivated by “directional” approach introduced in UMI-tools [5]. Let \(\mathcal {U}_{i}\) be the set of UMIs observed for gene i. A specific UMI \(u_{n} \in \mathcal {U}_{i}\), observed cn times in gene i, is considered to have arisen by PCR or sequence error if there exists \(u_{m} \in \mathcal {U}_{i}\) such that d(un,um)=1 and cm>2cn+1, where d(·,·) is the Hamming distance. Using this information, only UMIs that could not have arisen as an error under this model are retained. However, this approach may over-collapse UMIs if there exists evidence that similar UMIs (i.e., UMIs at a Hamming distance of 1 or less) may have arisen from different transcripts and, hence, distinct molecules. Moreover, this approach first discards reads that multimap to more than one read, causing it to lose a substantial amount of information before even beginning the UMI deduplication process.
As previously proposed to address the problem of cell clustering [24], an equivalence class [12, 13, 25–29] encodes some positional information, by means of encoding the set of transcripts to which a fragment is mapped. Specifically, these equivalence classes can encode constraints about which UMIs may have arisen from the same molecule and which UMIs—even if mapping to the same gene—must have derived from distinct pre-PCR molecules. This can be used to avoid over-collapsing UMI tags that are likely to result from different molecules by considering UMIs as distinct for each equivalence class. However, in its simplest form, this deduplication method is prone to reporting a considerably higher number of distinct UMIs than likely exist. This is because reads from different positions along a single transcript, and tagged with the same UMI, can give rise to different equivalence classes, so that membership in a different equivalence class is not, alone, sufficient evidence that a read must have derived from a distinct (pre-PCR) molecule. This deters us from directly using such a UMI-collapsing strategy for deriving gene-level counts, though it may be helpful for other types of analyses.
Given the shortcomings of both approaches to UMI deduplication, we propose, instead, a novel UMI resolution algorithm that takes into account transcript-level evidence when it exists, while simultaneously avoiding the problem of under-collapsing that can occur if equivalence classes are treated independently for the purposes of UMI deduplication.
UMI resolution algorithm
A potential drawback of the gene-level deduplication is that it discards transcript-level evidence. In this case, such evidence is encoded in the equivalence classes. Thus, gene-level deduplication provides a conservative approach and assumes that it is highly unlikely for molecules that are distinct transcripts of the same gene to be tagged with a similar UMI (within an edit distance of 1 from another UMI from the same gene). However, entirely discarding transcript-level information will mask true UMI collisions to some degree, even when there is direct evidence that similar UMIs must have arisen from distinct transcripts. For example, if similar UMIs appear in transcript-disjoint equivalence classes (even if all of the transcripts labeling both classes belong to the same gene), then they cannot have arisen from the same pre-PCR molecule. Accounting for such cases is especially true when using an error-aware deduplication approach and as sequencing depth increases.
To perform UMI deduplication, alevin begins by constructing a parsimonious UMI graph (PUG), G=(V,E), for each cell, where each vi=(u,Ti) is a tuple consisting of UMI sequence u and a set of transcripts \(T_{i} = \{t_{i_{1}}, t_{i_{2}}, \dots, t_{i_{m}}\}\). There is a count associated with each vertex such that c(vi)=ci is the number of times this UMI equivalence class pair is observed. G contains two types of edges: directed and bi-directed. There exists a directed edge between every pair of vertices (vi,vj) for which ci>2cj−1, |Ti∩Tj|>0, and d(umi(vi),umi(vj))=1. For every pair of vertices for which there is no directed edge, there exists a bi-directed edge if d(umi(vk),umi(vℓ))≤1, and |Tk∩Tℓ|>0. Once the edges of this PUG have been formed, we no longer need to consider the counts of the individual UMI equivalence class pairs.
Before proceeding further, we introduce the notion of monochromatic arborescences in terms of this graph G. We can refer to the transcript labels of each node as the potential colors of the node. Since our graph is directed, an arborescence would be a rooted tree in the graph, where each node within the arborescence has exactly one directed path reaching it from a determined root node, using edges in the arborescence. Given these definitions, a monochromatic arborescence is one where the set of colors of the nodes within the arborescence have a non-null intersection and, hence, the arborescence can be labeled using a single color. Then, for a given connected component in the graph, we can find different sets of monochromatic arborescences and, for our graph, each one represents a single pre-PCR molecule.
However, motivated by the principle of parsimony, we wish to explain the observed vertices (i.e., UMI, equivalence class pairs) via the minimum possible number of pre-PCR molecules that are consistent with the observed data. Hence, we pose this problem in the following manner. Given a graph G, we seek a minimum cardinality covering by monochromatic arborescences. In other words, we wish to cover G by a collection of vertex-disjoint arborescences, where each arborescence is labeled consistently by a set of transcripts, which are the pre-PCR molecule types from which its reads and UMIs are posited to have arisen. Further, we wish to cover all vertices in G using the minimum possible number of arborescences. Here, the graph G defines which UMI, read pairs can potentially be explained in terms of others (i.e., which vertices may have arisen from the same molecule by virtue of different fragmentation positions or which vertices may have given rise to other through PCR duplication with error). The decision version of this problem is NP-complete, as shown below and so, alevin employs a greedy algorithm in practice to obtain a valid, though not necessarily minimum, covering of G. We note that while numerous covering and packing problems related to arborescences have appeared in the literature (Bernáth and Pap [30] and references therein), to the best of our knowledge, the following problem formulation is new.
Theorem 1
Minimum cardinality covering by monochromatic arborescences is NP-complete.
Proof
Consider a reduction from dominating set. Let (G,k) be an instance of the dominating set problem where G=(V,E) is an undirected graph. Then, we can construct a new graph G′=(V,E′) such that G′ has a minimum cardinality covering by ≤k monochromatic arborescences if and only if G has a minimum dominating set of size ≤k. The color of an arborescence is chosen from among the intersection of the set of labels for each node it covers and, hence, is non-null. Construct G′ as follows. Convert each edge in G to a bi-directed edge in G′ and label each node with the union of its own label and the labels of all nodes to which it is directly connected in G. In other words, Ti={i}∪{j∣{i,j}∈E}.
→ If G has a minimum dominating set of size k, then G′ has a minimum cardinality covering by k monochromatic arborescences. Every node in the original graph G has to be connected to at least one node in the dominating set. Due to the manner in which node labels are assigned in G′, this means that every node in G′ can be covered by an arborescence starting from a dominating set node; this arborescence is colored by the label assigned to that node. Since there are k nodes in the dominating set, there will be k monochromatic arborescences in G′, and since the k nodes in G dominate V, the arborescences will cover all of V.
← If G′ has a covering of k monochromatic arborescences, then G has a dominating set of size k. An arborescence is assigned a color, let us say ℓi, from the intersection of the labels of the nodes it covers. Hence, the node with label ℓi in G′ has to be one of the nodes covered by this arborescence. That node connects to all the nodes in this arborescence; otherwise, they would not have shared this label. Let these nodes be selected as the dominating set of G. Hence, if there are k arborescences, there are k such nodes that are part of the dominating set, and because the arborescences cover all of G′, the selected nodes, likewise, dominate G. □
The algorithm employed by alevin works as follows. First, we note that weakly connected components of G can be processed independently, and so, we describe here the procedure used to resolve UMIs within a single weakly connected component—this is repeated for all such components. Let C=(VC,EC) denote our current component. We perform a breadth-first search starting from each vertex vi∈VC and considering each transcript \(t_{i_{j}}\) (the jth transcript in the equivalence class labeling vertex vi). We compute the size (cardinality) of the largest arborescence that can be created starting from this node and using this label to cover the visited vertices. Let \(\phantom {\dot {i}\!}v_{i^{\prime }}, t_{i^{\prime }_{j^{\prime }}}\) be the vertex, transcript pair generating the largest arborescence, and let \(\phantom {\dot {i}\!}a\left (v_{i^{\prime }}, t_{i^{\prime }_{j^{\prime }}}\right)\) be the corresponding arborescence. We now remove all of the vertices in \(\phantom {\dot {i}\!}a\left (v_{i^{\prime }}, t_{i^{\prime }_{j^{\prime }}}\right)\), and all of their incident edges, from C, and we repeat the same procedure on the remaining graph. This process is iterated until all vertices of C have been removed. This procedure is guaranteed to select some positive order arborescence (i.e., an arborescence containing at least one node) in each iteration and hence is guaranteed to terminate after at most a linear number of iterations in the order of C.
After computing a covering, each arborescence is labeled with a particular transcript. However, the selected transcript may not be the unique transcript capable of producing this particular arborescence starting from the chosen root node. We can compute, for each arborescence, the set of possible transcript labels that could have colored it (i.e., those in the intersection of the equivalence class labels for all of the vertices in the arborescence). If the cardinality of this set is 1, then only a single transcript is capable of explaining all of the UMIs associated with this arborescence. If the cardinality of this set is > 1, then we need to determine if all transcripts capable of covering this arborescence belong to the same gene, or whether transcripts from multiple genes may, in fact, be capable of explaining the associated UMIs. In the former case, the count of pre-PCR molecules (i.e., distinct, deduplicated UMIs) associated with this uniquely selected gene is incremented by 1. In the latter case, the molecule associated with the arborescence is considered to potentially arise from any of the genes with which it could be labeled. Subsequently, an EM algorithm is used to distribute the counts between the genes. Note that other pipelines simply discard these gene-ambiguous reads and that both manners in which alevin attempts to resolve such reads (i.e., either by being selected via the parsimony condition or probabilistically allocated by the EM algorithm) are novel in the context of scRNA-seq quantification. The EM procedure we adopt to resolve ambiguous arborescences proceeds in the same manner as the EM algorithm used for transcript estimation in bulk RNA-seq data [13], with the exception that we assume the probability of generating a fragment is directly proportional to the estimated abundance, rather than the abundance divided by the effective length (i.e., we assume that, in the tagged-end protocols used, there is no length effect in the fragment generation process).
Tier assignment
The alevin program also outputs a tier matrix, of the same dimensions as the cell gene count matrix. Within a cell, each gene is assigned one of four tiers. The first tier (assigned 0) is the set of genes that have no read evidence in this cell and are, therefore, predicted to be unexpressed (whether truly absent, or the effect of some dropout process). The rest of the tiers (1, 2, and 3) are assigned based on a graph induced by the transcript equivalence classes as follows:
-
1
All equivalence classes of size 1 are filtered out. The genes associated with the transcripts from these classes are assigned to tier 1.
-
2
For the remaining equivalence classes, of size > 1 gene, a graph G is constructed. The nodes in G are transcripts, and two nodes share an edge if their corresponding transcripts belong to a single equivalence class.
-
3
All the connected components in G are listed, and the transcript labels on the nodes mapped to their corresponding genes. If any component contains a node whose gene has previously been assigned to tier 1, that gene and all other genes in this connected component are assigned to tier 2. Hence, tier 2 contains genes whose quantification is impacted by the EM algorithm (after the UMI deduplication).
-
4
Genes associated with the remaining nodes in the graph are assigned to tier 3. These are genes that have no unique evidence and do not share reads (or, in fact, paths in the equivalence class graph) with another gene that has unique evidence. Hence, the EM algorithm will distribute reads between these genes in an essentially uniform manner, and their estimates are uninformative. Their abundance signifies that some genes (at least 1) in this ambiguous family are expressed, but exactly which and their distribution of abundances cannot be determined.
Alevin, optionally (using the —numCellBootstraps flag), also outputs bootstrap variance estimates for genes within each cell. These variance estimates could conceivably be used by downstream tools for dimensionality reduction, differential expression testing, or other tasks.
Final whitelisting (optional)
Many existing tools for whitelisting CBs, such as Cell Ranger [3] and Sircel [7] perform whitelisting only once. As discussed above, both tools rely on the assumption that the number of times a CB is observed is sufficient to identify the correct CBs, i.e., those originating from droplets containing a cell. However, as observed by Petukhov et al. [8], there is considerable variation in sequencing depth per cell, and some droplets may contain damaged or low-quality cells. Thus, true CBs may fall below a simple knee-like threshold. Similarly, erroneous CBs may lie above the threshold. Petukhov et al. [8] proposed that instead of selecting a single threshold, one should treat whitelisting as a classification problem and segregate CBs into three regions: high quality, low quality, and uncertain/ambiguous. Here, high quality refers to the CBs which are deemed to be definitely correct, and low quality are the CBs which are deemed to most likely not arise from valid cells. A classifier can then be trained on the high- and low-quality CBs to classify the barcodes in the ambiguous region as either high or low quality. We adopt this approach in alevin, using our knee method’s cutoff to determine the ambiguous region. Specifically, we divide everything above the knee threshold into two equal regions: high-quality valid barcodes (upper-half), denoted by \(\mathcal {H}\), and ambiguous barcodes (lower-half), denoted by \(\mathcal {L}\). Since the initial whitelisting procedure is very liberal in selecting a threshold, most of the recoverable, low-confidence CBs tend to reside in the ambiguous region, and to learn the low-quality region, we take \(n_{l} = \max (0.2 \cdot \left |\mathcal {H}\right |, 1000)\) barcodes just below the knee threshold.
In the implementation of Petukhov et al.[8], a kernel density estimation classifier was trained using features which described the number of reads per UMI, UMIs per gene, the fraction of intergenic reads, non-aligned reads, the fraction of lowly expressed genes, and the fraction of UMIs on lowly expressed genes. In addition, a maximum allowable mitochondrial read content was set for a CB to be classified as “high-quality.” Whilst these features enabled the authors to build a classifier which efficiently separated “high-quality” cells from “low-quality” cells, we believe it may be possible to improve this set of features. Specifically, most of these features would be expected to correlate with the number of reads or UMIs per CB. Thus, the classifier is biased towards attributes associated with higher read depth, when in fact one wants it to learn the feature attributes associated with high-quality cells. We therefore used a slightly different set of features, listed below, which we believe may better capture the differences between high- and low-quality cells. While these features work in general, they may not be suitable for all analyses and will have to be tweaked accordingly. We chose to use a naïve Bayes classifier to perform classification, since we observed no clear difference between multiple ML methods (not shown), and the naïve Bayes classifier yields classification probabilities which are easy to interpret. Our final set of whitelisted CBs are those classified as high confidence.
-
1
Fraction of reads mapped
-
2
Fraction of mitochondrial reads (optionally activated by —mRNA flag)
-
3
Fraction of rRNA reads (optionally activated by —rRNA flag)
-
4
Duplication rate
-
5
Mean gene counts post deduplication
-
6
The maximum correlation of gene-level quantification estimates with the high-quality CBs (optionally activated by –useCorrelation flag)
Machine configuration and pipeline replicability
10x v2 chemistry benchmarking has been scripted using CGATCore (https://github.com/cgat-developers/cgat-core). The full pipeline and analysis are performed using Stony Brook’s seawulf cluster with 164 Intel Xeon E5–2683v3 CPUs.
For all analyses, the genome and gtf versions used for human datasets were GENCODE release 27, GRCh38.p10, and for mouse datasets were GENCODE release M16, GRCm38.p5. All transcriptome files were generated using these with “rsem-prepare-reference.”
Cell Ranger (v2.2.0): The following additional flags were used, as recommended by the Cell Ranger guidelines: —nosecondary —expect-cells NumCells, where NumCells is 10,000 for PBMC 8k and Neurons 9k, 5,000 for PBMC 4k, and 2,000 for Neurons 2k and Neurons 900.
Alevin (v0.13.0): Run with default parameters with the Chromium protocol and —keepDuplicates flags and the -lISR to specify strandedness. The mRNA and rRNA lists were obtained from the relevant annotation files and passed as input. Experiments on v1 chemistry can be run using the same flags but with the —gemcode protocol flag. Alevin also supports 10x v3 chemistry via the command-line flag —chromiumV3.
STAR (v2.6.0a): The following flag was used, as recommended by the guidelines of UMI-tools: —outFilterMultimapNmax 1
featureCounts (v1.6.3): This was run to obtain an output BAM file and with stranded input (-s 1).
UMI-tools (v0.5.4): The extract command was used to get the CBs/UMIs, when provided with an external CB whitelist and attach it to the corresponding reads. The following flags were used in the count command to obtain the per-cell gene count matrix: —gene-tag=XT —wide-format-cell-counts
DropEst (v0.8.5): This was run with the default parameters on the 10x BAM files, and the predicted cell counts from Cell Ranger were used as input.
Dropseq utils (v2.0.0): All the commands were run as recommended by the authors in the tool’s manual.
The bulk datasets were quantified using Bowtie2 and RSEM, run as follows:
Bowtie2 (v2.3.4.3): The following flags were used, as recommended in the guidelines of RSEM: —sensitive —dpad 0 —gbar 99999999 —mp 1,1 —np 1 —score-min L,0,-0.1 —no-mixed —no-discordant
RSEM (v1.3.1): Run with default parameters.