Fig. 1 | Genome Biology

Fig. 1

From: Alevin efficiently estimates accurate gene abundances from dscRNA-seq data

Overview of the alevin pipeline. The inputs to the pipeline are sample-demultiplexed FASTQ files, and several steps, outlined here, are required to process these data and obtain per-cell, gene-level quantification estimates. The first step is cell barcode (CB) whitelisting based on barcode frequencies. Barcodes neighboring whitelisted barcodes are then associated with (collapsed into) their whitelisted counterparts. Reads from whitelisted CBs are mapped to the transcriptome, and the UMI-transcript equivalence classes are generated. Each equivalence class contains a set of transcripts, the UMIs associated with the reads that map to that class, and the read count for each UMI. This information is used to construct a parsimonious UMI graph (PUG), in which each node represents a UMI-transcript equivalence class and nodes are connected based on the associated read counts. The UMI deduplication algorithm then attempts to find a minimal set of transcripts that cover the graph, where each consistently labeled connected component (each monochromatic arborescence) is associated with a distinct pre-PCR molecule. In this way, each node is assigned a transcript label and, in turn, an associated gene label. Reads associated with arborescences that could be consistently labeled by multiple genes are divided among these possible loci probabilistically using an expectation-maximization algorithm. Finally, as an optional step used when no high-quality CB whitelist is provided externally, an intelligent whitelisting procedure finalizes a list of high-quality CBs using a naïve Bayes classifier to differentiate between high- and low-quality cells.
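
The first two steps of the caption (frequency-based CB whitelisting, then collapsing neighboring barcodes into their whitelisted counterparts) can be illustrated with a minimal Python sketch. This is not alevin's implementation. It rests on several assumptions that the caption does not state: the whitelist is simply the `top_n` most frequent barcodes, "neighboring" means a Hamming distance of 1, and a barcode with several whitelisted neighbors is assigned to the most frequent one. The function names (`whitelist_by_frequency`, `hamming_neighbors`, `collapse_barcodes`) and the toy barcodes are hypothetical.

```python
from collections import Counter


def whitelist_by_frequency(barcode_counts, top_n):
    """Keep the top_n most frequent barcodes.

    Assumption: a fixed top_n cutoff stands in for whatever
    frequency-based rule the real pipeline applies.
    """
    ranked = sorted(barcode_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {cb for cb, _ in ranked[:top_n]}


def hamming_neighbors(barcode, alphabet="ACGT"):
    """Yield every barcode that differs from `barcode` at exactly one position."""
    for i, base in enumerate(barcode):
        for alt in alphabet:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]


def collapse_barcodes(barcode_counts, whitelist):
    """Map each observed barcode to a whitelisted barcode where possible.

    Whitelisted barcodes map to themselves. A non-whitelisted barcode maps
    to its most frequent whitelisted neighbor (assumed tie-break: the
    lexicographically larger barcode). Barcodes with no whitelisted
    neighbor are left out of the mapping, i.e. their reads are discarded.
    """
    mapping = {cb: cb for cb in whitelist}
    for cb in barcode_counts:
        if cb in whitelist:
            continue
        hits = [n for n in hamming_neighbors(cb) if n in whitelist]
        if hits:
            mapping[cb] = max(hits, key=lambda n: (barcode_counts[n], n))
    return mapping


# Toy data: two frequent barcodes, one likely sequencing error, one orphan.
observed = ["AAAA"] * 5 + ["CCCC"] * 4 + ["AAAT"] + ["GGGG"]
counts = Counter(observed)
whitelist = whitelist_by_frequency(counts, top_n=2)
mapping = collapse_barcodes(counts, whitelist)
```

In this toy input, `AAAT` is one substitution away from the whitelisted `AAAA` and is collapsed into it, whereas `GGGG` has no whitelisted neighbor and is dropped.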
