Alternative splicing is essential for the specialized functions of eukaryotic cells, necessary for development [1], and a greater contributor to genetic disease burden than mutations [2]. Despite the importance of splicing and massive single-cell RNA-seq (scRNA-seq) data generated, the extent to which the diversity of RNA splicing in single cells is regulated and functional versus transcriptional noise remains contentious [3].
Current spliced aligners call many false-positive spliced junctions, partly because they are computational procedures operating on noisy observations of expressed RNA. The factors influencing this noise include sequence properties of the genome: repetitive genomic sequence within and between genes, biochemical noise introduced during library preparation which could cause mismatches, template switches, and technical noise causing base call errors during sequencing [4]. Differences between the reference and sequenced transcriptomes caused by polymorphisms and other genetic variations yield further false positives during the process of alignment [5–8]. False-positive alignments are further exacerbated in scRNA-seq due to higher levels of biochemical noise specific to single-cell preparations (i.e., higher prevalence of low-entropy-reads in scRNA-Seq as discussed below) as well as statistical issues: false-positive splice junction calls are more prevalent in larger datasets. Together, these challenges combine to generate an ongoing debate regarding whether 10x Chromium (10x) can be used for reliable de novo splice junction detection [9, 10], despite the presence of a large number of junctional reads in the datasets generated by this protocol (Fig. 1A).
The resolution and massive number of available scRNA-seq datasets hold great promise for discovering regulatory and functional splicing biology, including (a) identifying novel unannotated regulated alternative splicing in rare cell types present even in well-annotated organisms such as human [13], (b) de novo prediction of dysregulated splicing in single-cell data from diseases such as in cancer or neurodegenerative diseases, and (c) automatic high-quality junction prediction from poorly annotated organisms such as new model organisms.
Typical workflows from commodity platforms such as 10x genomics (e.g., Cell Ranger [14]) have preprocessing steps that remove all unannotated junctions before reporting spliced alignments. Further, the vast majority of publications characterizing splicing in single cells require ad hoc lower bounds on junction overlap or total number of supporting reads [15]. These approaches will miss and highlight the need for a statistically driven method to discover novel regulation of splicing in single cells. There is thus a great biological need for a method that both reduces the false-positive identification of junctions while having power to detect bona fide unannotated splicing (controlling false negatives). Overall, existing approaches to splicing analysis in the scRNA-seq data either lack sufficient sensitivity to identify splice junctions or specificity to identify false positives [8].
In this paper, we introduce SICILIAN (SIngle Cell precIse spLice estImAtioN), a statistical wrapper for precise splice junction quantification in single cells. SICILIAN deconvolves biochemical noise (generated during library preparation) and computational noise (generated by the spliced aligner while mapping reads to the genome), both being highly prevalent in scRNA-seq and can lead to false-positive junctions. To identify spliced alignments that are erroneously reported by the aligner due to this combined noise, SICILIAN employs generalized linear statistical modeling, with predictors being various read mapping features. In this paper, we use STAR [16] as the spliced aligner in SICILIAN, though the general statistical framework can be applied to refine the splice junction calls from any spliced aligner generating a BAM file. SICILIAN can be applied to both single-end (such as 10x) and paired-end data (such as Smart-seq 2). When running on paired-end data, SICILIAN utilizes alignment features for both mate reads (R1 and R2) in its model and extracts spliced junctions from both R1 and R2 alignments.
The SICILIAN algorithm has three main steps: (1) assign a statistical score to each junctional read’s alignment to quantify the likelihood that the read alignment is truly from RNA expression rather than artifacts; (2) aggregate read scores to summarize the likelihood that a given junction is a true positive; and (3) report single-cell resolved junction expression quantification, corrected for multiple hypotheses testing ( “Methods” section, Fig. 1B,C).
The goal of step (1) above is to statistically evaluate the confidence of the alignment for each junctional read. To do this, SICILIAN fits a penalized generalized linear model [17] on the input RNA-seq data, where positive and negative training classes are defined based on whether each junctional read also has a genomic alignment. Our analysis has shown that this definition for training data well approximates the alignment profile of reads aligned to the false-positive and true-positive junctions (Additional file 1: Fig. S1, Fig. S2). Furthermore, training a new model for each input dataset allows SICILIAN to adapt to batch effects. The model uses the following predictors: the number of alignments for the read, the number of bases in the longer and shorter read overhangs on each side of the junction, the alignment score adjusted by the read length, the number of mismatches, the number of soft-clipped bases, and read entropy ( “Methods” section). These predictors have the power to distinguish reads aligned to false-positive and true-positive junctions defined by ground truth from simulated datasets. The positive and negative training reads used for training the model can reliably model the general profile of the true-positive and false-positive reads (Additional file 1: Fig. S1, Fig. S2).
Read entropy, a quantitative measure of how repetitive a sequence is, is not generally appreciated as an important variable in scRNA-seq reads even though it is characteristic of technical artifacts [18], underlining its importance in the SICILIAN model. Read sequence entropy is expected to be a highly informative predictor of false-positive spliced alignments for two reasons: (1) reverse transcriptase or PCR enzymes are known to generate sequences of low entropy (“PCR stutter”) [18], and PCR crossover is common in these regions; (2) these low-entropy sequences typically map to many places in the genome [19]. Also, the entropy could be more informative and variable in scRNA-seq compared to bulk RNA-seq. For example, in the 10x data from a human lung study [11] (HLCA data set), the average read entropy in 20% of cells is < 4 (Additional file 1: Fig. S3A), which is much more than the fraction of low-entropy reads in bulk RNA-seq datasets, where the entropy is < 4 in only 0.09% and 0.4% of reads in one simulated8 and five bulk cell lines, respectively (Additional file 1: Fig. S3B,C).
In step (2), the statistical scores assigned to each junction’s aligned reads are aggregated using a Bayesian hypothesis testing framework to obtain aggregated junction-level scores. SICILIAN subsequently uses the distribution of aggregated scores for likely false-positive junctions to predict an empirical p value for each junction. Finally, in step (3), SICILIAN corrects for multiple hypotheses testing by taking the median of the empirical p-values for each junction across samples and reports it as the final “SICILIAN score” for the junction (Fig. 1C). User-defined thresholding on this score allows for a junction to be either called or thrown out consistently across all samples. In this paper, we used a threshold of 0.15, which was selected to maximize the sum of sensitivity and specificity on the benchmarking datasets with known ground truth (“Methods” section).
We benchmarked SICILIAN using two different types of benchmarking data: matched scRNA-seq and bulk data from five human lung adenocarcinoma cell lines [20] and simulated data with known ground truth [8, 12]. We compared SICILIAN to commonly used filtering criteria in the field: all junction calls based on STAR [16] raw alignments, the junctions supported only by uniquely mapping reads [21], and calling junctions based on read counts [22, 23].
SICILIAN increases the concordance of junction calls on matched single-cell and bulk datasets [20] (Fig. 1D; Additional file 1: Fig. S4A). We define “concordance” to be the fraction of junctions detected in the single cells from a cell line that are also present in the bulk data from that cell line. SICILIAN increases the concordance between the detected junctions from 10x and bulk RNA-seq regardless of the pairs’ cells of origin, which is consistent with SICILIAN identifying and removing scRNA-seq-specific artifacts (i.e., false-positive junctions that are present in only scRNA-Seq data). SICILIAN improves the concordance for all cell lines (Fig. 1D), e.g., for cell line HCC827, the concordance based on raw STAR calls is 0.54, and SICILIAN increases it to 0.75, while calling junctions based on a 10-read filter only increases the concordance to 0.66. Also, considering junctions detected in the single-cell and bulk datasets of the same cell line as true positives, SICILIAN outperforms the read count criterion in terms of the AUC value (Additional file 1: Fig. S4B).
As there is no single-cell dataset with fully-known ground truth and SICILIAN’s modeling is general and can be applied to both bulk and scRNA-Seq, we resorted to bulk-level simulated datasets and ran the identical SICILIAN model on them. SICILIAN increases prediction accuracy on four bulk simulated datasets with known ground truth [8, 12]. For these datasets, SICILIAN uniformly achieves AUCs of ~ 0.94, a significant increase over the AUCs of 0.66–0.89 based on the read count criterion (Fig. 1E).
In addition to benchmarking datasets, we ran SICILIAN on 36,583 10x and 6,565 Smart-seq2 (SS2) cells from two individuals from the human lung cell atlas (HLCA) [11] and 16,755 10x lung cells from two individuals from the mouse lemur cell atlas (MLCA) [24]. Since there is no ground truth for real data, we use annotation status as an approximate surrogate for ground truth. Knowing that many annotated junctions have been manually curated or experimentally validated, we expect that an algorithm with ability to correctly identify false positives should enrich for annotated junctions, particularly for organisms with extensive annotation such as humans [25]. Because transcript annotations are not part of the SICILIAN model, this serves as an orthogonal measure for performance evaluation. SICILIAN modeling increases the proportion of annotated to unannotated junctions in all four human and mouse lemur individuals compared to the original STAR calls and read-count criterion (Fig. 2). In all 10x datasets from four human and mouse individuals, SICILIAN calls a higher proportion of annotated junctions (83.6% of annotated junctions are called on average out of all annotated junctions) than unannotated junctions (29.2% of unannotated junctions are called on average out of all unannotated junctions), and a higher proportion of annotated junctional reads (87.6% of annotated junctional reads are called on average out of all annotated junctional reads) than unannotated junctional reads (23.9% of unannotated junctional reads are called on average out of all unannotated junctional reads), excluding junctions that only appear once in the dataset (Fig. 2A; Additional file 1: Fig. S5). For many genes, including GDI1, GOT2, and CD14, SICILIAN removed all noisy unannotated junctions and kept only annotated junctions although the algorithm was agnostic to the annotation (Fig. 2B; Additional file 1: Fig. S6). Considering annotated and unannotated junctions as surrogates for true-positive and false-positive junctions in human lung data, SICILIAN achieves an AUC of 0.74, while that of the read-count-based approach is 0.5 (Fig. 2C). Note that due to the power of scRNA-seq in capturing cells from rare cell types, if there are unannotated junctions with low read counts in these rare cell types, we would not be able to detect those junctions based on a read-count criterion and need a better detection regime such as SICILIAN for calling them.
SICILIAN also increases the agreement of splicing calls between individuals. At almost all fixed junction expression levels, SICILIAN makes more consistent decisions (either calls or rejects the junction in both individuals) than inconsistent decisions for both human and mouse lemur, emphasizing the robustness of SICILIAN (Fig. 2D,E). It calls 117,684 shared junctions between individuals in HLCA 10x, while a 10-read cutoff calls only 80,292 (Fig. 2D). SICILIAN makes a consistent decision for 83.0% of junctions. Similarly, in MLCA, SICILIAN calls 36,446 shared junctions in both individuals, much higher than 17,798 that were called by a 10-read cutoff criterion, and makes a consistent decision about 69.9% of junctions.
To further identify whether SICILIAN enriches for known junctions, we compared the junctions called by SICILIAN in HLCA 10x with two of the most recent and precise databases of human splice junctions: CHESS [25] and the Genotype-Tissue Expression (GTEx) project [26]. Only 8% and 7% of raw STAR calls are present in CHESS and GTEx, respectively, but SICILIAN increases these percentages to 48.8% and 45.4%, respectively (Fig. 2F). We also looked at the junctions within each lung cell type and found that the fraction of junctions that are not present in CHESS or GTEx varies substantially across different cell types, even those with similar sequencing depth, indicating that the rate of novel splicing may vary between cell types (Additional file 1: Fig. S7).
Finally, mouse lemur calls by SICILIAN are enriched for having annotated orthologous junctions (obtained via the LiftOver tool [27]) in the human transcriptome, which is much more complete than the mouse lemur transcriptome [28] (“Methods” section). Strikingly, more than 48.4% of lemur-unannotated junctions called by SICILIAN are annotated in the human transcriptome, compared to only 11.2% of unfiltered STAR calls, which supports the claim that SICILIAN filtering enriches for true-positive junctions and highlights the power of SICILIAN for improving transcriptome annotation in poorly studied organisms. We also compared the detected junctions in MLCA and HLCA datasets: applying SICILIAN increases the fraction of junctions in mouse lemur that have been also detected in HLCA from 15 to 54% (Fig. 2G).
Taken together, our results demonstrate that the SICILIAN method enables a new level of precision in splice junction detection from single-cell platforms such as 10x and SS2. SICILIAN allows automatic junction discovery even for poorly annotated splicing programs such as rare cell types or in new model organisms. The conceptual models used in SICILIAN are also applicable to other data types such as emerging single-cell sequencing technologies and bulk sequencing, exemplified by the recent application of SICILIAN for detecting SARS-COV-2 subgenomic RNAs in the swab samples taken from patients [19]. Additionally, we anticipate that this framework can be expanded to detect nonlinear RNA variants such as gene fusions and circRNAs using chimeric alignments reported by the aligner. Underlining the importance of unannotated junction discovery, SICILIAN discovered new regulated splicing patterns in primary human samples that were impossible using annotations [29].