Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing

A modified Chromium 10x droplet-based protocol that subsamples cells for both short-read and long-read (nanopore) sequencing together with a new computational pipeline (FLAMES) is developed to enable isoform discovery, splicing analysis, and mutation detection in single cells. We identify thousands of unannotated isoforms and find conserved functional modules that are enriched for alternative transcript usage in different cell types and species, including ribosome biogenesis and mRNA splicing. Analysis at the transcript level allows data integration with scATAC-seq on individual promoters, improved correlation with protein expression data, and linked mutations known to confer drug resistance to transcriptome heterogeneity. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-021-02525-6.


Background
Single-cell RNA sequencing (scRNA-seq) is a widely adopted method for profiling transcriptomic heterogeneity in health and disease [1]. However, assessing transcriptlevel changes between cell types using current scRNA-seq protocols is challenging due to their reliance on short-read sequencing. Previous studies using plate-based methods [2,3] have focused on individual alternative splicing events such as exon skipping, due to the fundamental limitation of short-read sequencing in linking distal splicing outcomes belonging to the same transcript. The Smartseq3 protocol [4] can achieve fulllength transcript coverage but is still unable to assemble the complete transcript sequence and is heavily reliant on the reference annotation. Droplet-based methods [5] such as 10x only sequence the 3′ or 5′ end of transcripts which largely precludes PromethION. The remainder of the cDNA from the 10-20% sample together with the GEMs from the 80-90% sample are used for regular 10x library preparation and Illumina sequencing in parallel. In the end, long-read data from the 10-20% subsample of cells and Illumina shortread data for all cells are generated by this protocol.
We demonstrate the use of this protocol by profiling 16,660 cells from diverse biological systems, 2737 of which were sequenced by both long-read and short-read technologies (Fig. 1A, Additional file 1). Firstly, we used our previously published   Fig. 1 Overview of experimental design, modified 10x protocol, FLAMES method, and basic summary statistics. A Summary of the study design, with an overview of the modified 10x protocol and FLAMES data processing pipeline. B UMAP visualization of cells in each sample, cells colored in red are sampled for long-read sequencing. scmixology1 and scmixology2 were integrated and shown together in one plot. All UMAP visualizations are based on short-read data. C The number of nanopore reads generated from each sample, and the percentage of reads that were assigned a cell barcode. D Distribution of UMI counts per cell for Illumina and nanopore data in each sample. E Correlations between gene UMI counts generated from nanopore long-read and Illumina short-read data. F Density scatter plot shows the relationship between transcript-level counts and scATAC-seq read counts around the TSS regions. The horizontal orange line shows the threshold calculated that separates open chromatin from the background. The percentage shows the transcripts that have their TSSs in open chromatin regions scmixology design [13], which involved an equal mixture of cells from five cell lines (H2228, H838, H1975, HCC827, A549). Two biological replicates were profiled (scmix-ology1 and scmixology2) with this protocol together with 10x scATAC-seq for the second replicate. In addition to the cell-line mixtures, we sequenced freshly isolated quiescent and activated muscle stem cells (MuSCs) from mouse. Lastly, we applied this protocol to a cryogenically preserved peripheral blood mononuclear cell (PBMC) sample from a patient (CLL2) whose chronic lymphocytic leukemia (CLL) had progressed on venetoclax treatment after a durable response. The sample was prepared together with the 10x CITE-seq assay with 17 antibody markers. This demonstrated the broad utility of this approach, which is compatible with different 10x transcriptomic assays and can be applied to both fresh and frozen samples. The Uniform Manifold Approximation and Projection (UMAP) visualization presented the cell populations as expected (Additional file 2: Fig. S1) and revealed no obvious bias in the GEM sampling process (Fig. 1B). Collectively, we generated 230 million long reads across all samples, together with scATAC-seq and CITE-seq data for the scmixology2 and CLL2 samples, respectively.

Single-cell isoform detection and quantification with FLAMES
We developed a flexible computational framework called FLAMES (Full-Length Analysis of Mutations and Splicing in long-read RNA-seq data) to detect and quantify isoforms for both single-cell and bulk long-read data (Fig. 1A, Additional file 2: Fig. S2 and Methods). Input to FLAMES are fastq files generated from the long-read platform. Using the cell barcode annotation obtained from short-read data as the reference, it identifies and trims cell barcodes/UMI sequences from the long reads. After barcode assignment, all reads were aligned to the relevant genome to obtain a draft read alignment. The draft alignment is then polished and grouped to generate a consensus transcript assembly. All reads are aligned again using the transcript assembly as the reference and quantified. Benchmarking of the isoform detection component of FLAMES on bulk synthetic spike-in control samples showed good performance in terms of recovery of known transcripts while identifying fewer false-positive transcripts compared to other leading methods FLAIR, TALON, and StringTie2. Next, we benchmarked FLAMES on a bulk SIRV spike-in dataset [14] for which the isoform structure and abundances are known a priori. Our results clearly show that FLAMES outperforms FLAIR, TALON, and StringTie2 [15] both in terms of the isoform detection (Fig. S3A, Additional file 3) and quantification (Additional file 2: Fig. S3B). Additional benchmarking of FLAMES on a bulk sequins spike-in dataset showed similar results [16].
We used FLAMES to preprocess and analyze the four datasets generated. Forty to 60% of the long reads could be assigned to an expected cell barcode and were kept for further analysis (Fig. 1C). The transcript coverage of reads realigned to assembled transcripts showed that the percentages of full-length reads decreased for longer transcripts (Fig. S4), which has also been shown in another study [17]. Reads that are not fulllength and cannot be uniquely assigned to transcripts are discarded during data processing. The average UMI count per cell ranges from 10,000 to 60,000, varied by cell type, and was comparable to the short-read counts from the same cells (Fig. 1D). Gene-level UMI counts between the matched nanopore and Illumina data were also found to be highly correlated (Fig. 1E). The data processed by FLAMES showed that our modified 10x protocol generated high-quality long-read data that was comparable to the short-read Illumina data when analyzed at the gene level.
To validate the transcription start sites (TSSs) from the isoforms generated by FLAMES, we compared them to the FANTOM5 TSS peaks [18] and found that around 75% of the TSSs are within the FANTOM5 annotation (Additional file 2: Fig. S5A). For the scmixology data where scATAC-seq data from the same populations were also available, we aggregated scATAC-seq signals around the TSSs as an indicator of open promoters. The result showed scmixology1 and scmixology2 have similar open promoters and more than 85% of the TSSs are within active promoters, supporting the existence of these transcripts (Fig. 1F). In contrast, when we process the scmixology data using TALON, FLAIR, and StringTie2, the results were less optimal. The majority of transcripts generated by FLAIR and TALON did not match the reference annotations (Additional file 2: Fig. S5B), and FLAIR, TALON, and StringTie2 had fewer TSSs overlapping the FANTOM5 annotations compared to FLAMES (Fig. S5C). We found similar results when comparing the scATAC-seq signals around the TSS regions from transcripts generated by different methods (Fig. S5D), with TALON and FLAIR having around 40% and 50% of their TSSs in open chromatin regions, respectively. In summary, our comparisons yield similar results between a spike-in dataset and the scmixology dataset, with FLAMES outperforming StringTie2, FLAIR, and TALON with the latter two likely generating many false transcripts.
Characterization of isoforms reveals the distinct splicing landscapes of different cell populations We compared the transcripts generated by FLAMES to the reference annotation and classified them using the scheme introduced in SQANTI [19], including transcripts with all splice junctions matching to reference transcripts (full splice match, FSM) or partially matching to consecutive splice junctions for a reference transcript (incomplete splice match, ISM) and transcripts with novel splice junctions with new (Novel, not in catalog, NNC) or known (Novel, in catalog, NIC) donor and acceptor sites ( Fig. 2A). We observed that the number of transcripts detected varies between samples (Fig. 2B) and is correlated with the sequencing throughput as shown in Fig. 1C. While around half of the transcripts detected were novel, the majority of reads aligned to known transcripts with novel transcripts having lower abundance in general. In addition to the comparison with reference annotations, we also compared the transcripts generated from the three human samples to each other (Fig. 2C) and found many transcripts unique to each sample (22%, 27%, and 68% for scmixology1, scmixology2, and CLL2, respectively). A majority of transcripts (~60%) were shared between biological replicates and among these 30% were novel, suggesting that many conserved alternative splicing events in these cell lines have not been annotated. The results from FLAMES were also compared to short-read bulk RNA-seq profiled from matched samples analyzed using STAR (Additional file 2: Fig. S6). The degree of overlap of known junctions between the long-and short-read results was relatively high across all samples (Additional file 2: Fig. S6A). For novel junctions, the overlap was considerably less (Additional file 2: Fig.  S6B), with relatively fewer junctions detected in the short-read data across the different samples compared to the long-read data, which identified between 3-and 14-fold more novel junctions.
Within samples, we found consistent alterations in splicing patterns between cell populations (Fig. 2D). CLL cells had higher proportions of novel transcripts, especially transcripts with novel splice junctions (NNC), compared to non-CLL cells from the same sample, including T cells, NK cells, and monocytes. Similarly, quiescent muscle stem cells also had higher proportions of novel transcripts compared to activated stem cells. Analysis of intronic reads from RNA-seq data [20] has demonstrated that intron retention, which would produce novel transcripts, is increased in quiescent mouse showing the percentage of transcript abundance relative to gene abundance for genes express multiple transcripts. Transcripts are ranked by abundance, shown on the x-axis. G Summary of the type of alternative splicing between the two most abundant transcripts of each gene. The "Complex splicing changes" category represents transcripts with more than one type of constitutive alternatively spliced event muscle stem cells and is essential for these cells to exit the quiescent state. This is consistent with our results and suggests that differences in splicing patterns between different cell populations may act as a regulatory mechanism. After comparing the transcripts identified against reference annotations, we sought to characterize isoforms within the same gene. Around 80% of genes can be expressed as multiple isoforms (Fig. 2E). The average number of isoforms expressed per gene ranges between 3 and 6 and varied between the different samples (Fig. 2E). The distribution of isoform expression is skewed with only a few abundant transcripts dominantly expressed for most genes (Fig. 2F), consistent with previous results [21]. On average, the two most highly expressed isoforms account for 80% of the total counts (median 85%). Next, we categorized the types of alternative splicing between the two most highly expressed isoforms (Fig. 2G). Alternative splicing has mostly been studied based on particular events such as exon skipping or alternative 5′ splice sites using short-read sequencing technology [22]. We found the most common category, comprising around 30% of genes, has more than one type of alternative splicing event between the top two isoforms. This means that the 2 most highly expressed isoforms differ by complex splicing changes involving multiple exons, which may not be accurately characterized by short reads because two isoforms could have a skipped exon near the 5′ end and a different splice site near the 3′ end.

Common classes of genes with differential transcript usage across samples
Following the analysis of isoform abundance, we investigated whether genes with multiple isoforms exhibited differential transcript usage (DTU) between the clusters or cell types shown in Additional file 2: Fig. S1. We focused on changes in the use of internal splice junctions, grouping transcripts with the same intron chain. To mitigate the high dropout rate in single-cell data, we aggregated the transcript counts into pseudo-bulk values per cluster and filtered out transcripts with low abundance. A chi-square test was performed on the pseudo-bulk count matrix for DTU analysis. To assess how sensitive the DTU results are to sequencing level, we down-sampled the number of longreads from the scmixology2 dataset (from 20 to 80% of the total number). We observed that although the number of detected isoforms (Additional file 2: Fig. S7A) and the number of genes with DTU (Additional file 2: Fig. S7B) decrease with decreasing sequencing level per cell, 58% of the DTU changes could still be detected with as little as 20% of the starting reads (Additional file 2: Fig. S7B). We performed DTU analysis on all samples and identified between 500 and 1000 genes with differential transcript usage (gDTU) in each sample (Fig. 3A, Additional file 4). Nearly half of the gDTU (244 out of 573) are shared between scmixology1 and scmixology2. Besides this overlap, gDTU were largely unique for each sample (Additional file 2: Fig. S8A). The functional annotation clustering of these genes revealed shared pathways across different samples, especially pathways related to transcription and translation, such as mRNA splicing and ribosome biogenesis (Fig. 3B, Additional file 5). As an example, different gDTU related to mRNA splicing were found in different samples. PQBP1, which is involved in pre-mRNA splicing, is a gene with DTU in the scmixology samples (Additional file 2: Fig. S8B), while SRSF2 and SRSF3, which belong to a family that acts both as general splicing factors and as regulators of alternative splicing [23], exhibit DTU in the CLL2 sample (Additional file 2: Fig. S8C).
Apart from the gDTU that are unique to each sample, we also found a few genes (18) that appeared in all samples (Additional file 2: Fig. S8A), including a ribosomal protein RPS24 that had the smallest P-value among different samples (Additional file 4). RPS24 is a highly conserved gene between mouse and human, with exons 5 and 6 alternatively spliced to generate different protein coding sequence (Fig. 3C, D). Further analysis showed transcript usage of this gene is altered at multiple levels in different samples. Firstly, the major transcript of RPS24 was distinct between different samples. Transcripts without exons 5 and 6 were most abundant in the cell lines (scmixology1 and scmixology2, Fig. 3C), while the transcript with only exon 5 was highly expressed in the patient-derived PBMC sample (CLL2, Fig. 3C) and MuSC sample (Fig. 3D). In addition to the major transcript, additional transcripts showed differential transcript usage in different cells within the same sample. The transcript with exons 5 and 6 was more frequently expressed in CLL cells while the transcript without exons 5 and 6 was preferentially expressed in non-CLL cells (Fig. 3E). Besides these known transcripts, we also identified a new isoform in mouse (Rps24-001) that was preferentially expressed in quiescent MuSCs (Fig. 3F). Different RPS24 isoforms have tissue-specific expression [24] and some have been linked to tumor progression [25]; however, the functional differences between the encoded proteins remain unclear. In summary, we have highlighted the heterogeneity of expression of the RPS24 transcript that is missed in a typical genelevel analysis (Fig. 3E, F).
Another category of gDTU of interest is cell-surface proteins. Genes encoding cellsurface proteins produce alternative mRNA isoforms, usually by changing the combinations of consecutive exons corresponding to certain functional domains. Some of the isoforms have been characterized, such as CD45, where alternative splicing of exons 4 to 6 are expressed in different lymphocytes [26]. By analyzing the CLL2 data where the surface marker expression is available through CITE-seq, we detected multiple isoforms of CD45 (Fig. 3G). We found similar expression patterns between the transcript and the protein it encodes, where the protein was quantified by counting antibody-derived tags (ADT) from CITE-seq data. The result both validated the isoform quantification from the FLAMES pipeline and showed that transcript-level analysis can provide a better correlation between mRNA and protein expression that cannot be achieved using gene-level quantification. CD45 was the only DTU gene identified among the CITE-seq panel of 17 antibodies. We also observed other genes encoding important cell surface markers with DTU, such as Cd82 (Fig. 3H), CD47 (Additional file 2: Fig. S8D), and CD44 (Additional file 2: Fig. S8E), with different combinations of consecutive exons. Notably, some of the novel transcripts in CD47 can be observed in different samples (Additional file 2: Fig. S8D), suggesting conserved alternative splicing patterns that are missing from the reference annotation. In sum, we observed previously overlooked isoform diversity of cell-surface proteins, which could introduce functional diversity and contribute to development and diseases [27,28].
Alternative promoters have been shown to regulate cancer-specific transcription [29]. Here, we found transcripts with different TSSs expressed in different cancer cell lines. We have shown that open chromatin captured by scATAC-seq is correlated with the promoter region indicated by the TSS (Fig. 1F). As an example, we found multiple isoforms of PRDX1, including a new isoform PRDX1-001 with a different TSS. The new isoform expressed in H838 and A549 contains a novel first exon that is not found in the other cell lines (Fig. 3I). Additional open promoter regions are observed in the scATAC-seq data from H838 and A549 that coincide with the new exon (Fig. 3I). This result suggests that the long-read scRNA-seq data can be integrated with scATAC-seq at the promoter level to further enhance the resolution of integration and reveal promoter heterogeneity that cannot be found via gene-level integration [30].
Long-read scRNA-seq links coding sequence variation to transcriptome heterogeneity Nanopore long-read sequencing allows full transcript coverage compared to 3′ or 5′end short-read sequencing, which provides us with a better chance of identifying coding variations. However, the high error rate in nanopore sequencing presents an obstacle [31]. To overcome this challenge, we first exclude homopolymer regions since they have higher sequence error rates [32,33]. After excluding these regions,~95% of regions have a reference allele frequency >90% (Additional file 2: Fig. S9A). The remaining single-nucleotide variants (SNVs) were filtered again using a statistical test in FLAMES to enrich for true positives and clonal variants, based on the assumption that the occurrence of sequencing errors is independent of cell barcode and such errors will occur randomly in all cells (Fig. 4A and the "Methods" section). We tested our approach on the scmixology samples where the five cell lines carry distinct variants. The principal component analysis (PCA) on the filtered allele count matrix successfully recapitulated the expected population structure (Fig. 4B). Louvain clustering on leading principal components generated similar results to the cell type assignment obtained from running Demuxlet [34] on the short-read data (98.7% and 99.4% concordance for scmixology1 and scmixology2, respectively), which indicated that we can successfully capture variants in these different cell lines using long-read data. We then performed  Candidate variants are filtered based on allele frequency first, then based on per cell allele frequency to remove technical artifacts. The remaining variants are used for differential allele frequency analysis. B PCA on alternative allele matrix using the variants after filtering, colored by unsupervised clustering results using top PCs and annotated with cell lines. C Manhattan plot of P-values from a differential allele frequency analysis with Benjamini-Hochberg adjustment. Genes with significant variants are labeled. D UMAP visualization highlighting two CLL populations that have differential allele frequency for the significant variants. E UMAP visualization of cells colored by Gly101Val mutation status differential allele frequency analysis to find SNVs that were specific to each cell line.
The results showed a high precision (80.1%) with the SNVs called from bulk RNA-seq data from individual cell lines.
After we confirmed the approach using the cell lines, a similar analysis was performed on the CLL2 dataset to examine the relationship between transcriptome heterogeneity and sequence variations. We searched for SNVs that only existed in the cancer cells and had a differential allele frequency across different CLL transcriptional clusters. We found four significant variants (adjusted P-value < 0.05, chi-square test, Benjamini-Hochberg correction) associated with different CLL clusters (Fig. 4C, D). Material from the same sample was further analyzed using bulk RNA-seq and whole exome sequencing. All four variants were detected in the bulk RNA-seq, and 3 out of 4 were present in the bulk whole exome sequencing (the RAD21 variant was outside the capture region). The variant in PHEX is a known polymorphism and the variant allele frequency was consistent with it being a germline SNV, so the difference of its allele frequency could be a result of allele-specific expression. By investigating the allele frequency of these variants across clusters (Additional file 2: Fig. S9B), we identified two subclones, where SNVs in the genes BCL2, RAD21, and PHEX are enriched in subclone1 and the variant located in the gene B3GAT3 is enriched in subclone2 (Fig. 4D). Although the analysis itself does not require genomic sequence data as a reference, it is important to validate the variants detected since the changes in allele frequency may also come from allele-specific expression or RNA editing. The Gly101Val mutation has been confirmed to promote resistance to venetoclax treatment by reducing the affinity of BCL2 for venetoclax [35], and patient CLL2 was known to carry~25% Gly101Val mutations. Our modified protocol shows that the mutation is not just subclonal, but also linked to specific transcriptional clusters. The distribution of the Gly101Val mutation was further confirmed by capture enrichment of the BCL2 gene (Additional file 2: Fig S9C), which greatly increased the coverage of the mutation site (Additional file 2: Fig S9D). Taken together, this approach provides unbiased high-throughput linking of single-cell variants and transcriptomic heterogeneity.

Discussion
Transcriptional activity is typically summarized at the gene level due to the limitations of short-read technology, especially in scRNA-seq studies. The recent development of long-read sequencing technology promises sequencing of the full-length transcript, but its application to single cells has been slowed by a lack of protocols and data analysis pipelines. Using a modified 10x Genomics scRNA-seq protocol that includes subsampling of cells, we demonstrate that a similar per-cell sequencing level can be achieved to short-read technologies at a much lower cost than other methods [6] because only a subset of the cell population undergoes nanopore sequencing. As library preparation artifacts can cause a substantial fraction of reads to lack barcodes (40-60% in this study), enriching for full-length cDNAs that contain barcodes as proposed in Lebrigand et al. [9] has the potential to further enhance the efficiency of sequencing. The hybrid approach taken in our study combines the advantages of throughput and accuracy from short-read sequencing to identify cell barcodes and perform clustering with the strengths of long-read sequencing that enables isoform discovery and quantification, the results of which are combined in an integrated analysis. This modified protocol is compatible with a variety of 10x transcriptomic assays and could be potentially applied to any single-cell transcriptomic protocol with cell barcoding [36]. It could also be combined with other single-cell long-read sequencing approaches such as R2C2 [37] or other sequencing platforms including PacBio Sequel II. As the throughput and accuracy of long-read sequencing platforms increases, it may be possible to skip the sampling step and short-read sequencing altogether and apply long-read sequencing alone rather than the current hybrid approach to achieve similar aims and further reduce sequencing costs.
To analyze noisy long-read data, we developed the FLAMES pipeline that can detect and quantify novel and known isoforms in single-cell and bulk samples [16]. FLAMES implements a strict filter on consensus read clustering and filters out spurious reads that are mostly truncated reads with incomplete 3′ or 5′ ends. It also has two rounds of alignment using the genome and the recovered isoforms as reference to reduce alignment artifacts and distinguish reads aligned to similar isoforms. The recently published SiCeLoRe method can also detect and quantify isoforms, although it requires deep sequencing to perform UMI correction and has not been as extensively validated. FLAMES also identifies cell-type-specific splicing patterns and variants through comparative analysis. Further validation of our approach is available in the recent demonstration that gene fusions can be detected using the scmixology data and the output of FLAMES together with the JAFFAL [38] long-read fusion finding tool. Areas for further improvement of FLAMES include using more sophisticated methods for assessing differential transcript usage and allele frequency between clusters (a simple chi-square test is currently applied), extending the framework to accommodate multi-sample analyses, and applying an optimized nanopore-specific variant caller to improve the mutation analysis. There is also a pressing need for further systematic benchmarking of longread isoform analysis methods, which is currently being pursued by the Long-read RNA-seq Genome Annotation Assessment Project (LR-GASP, https://www. gencodegenes.org/pages/LRGASP/).
With FLAMES, we characterized full-length isoforms at the single-cell level across species and cell types. We detected thousands of novel transcripts expressed at low levels, most of which were unique to each sample. Novel transcripts also showed distinct enrichment patterns at the single-cell level where CLL cells and quiescent muscle stem cells often have more novel transcripts. We frequently found complex splicing changes between the two most abundant isoforms of a gene, suggesting the inadequacy of studying individual splicing junctions via short reads. We observed cell-type-specific isoform usage for genes that are enriched for common functions such as mRNA splicing and ribosome biogenesis. The expression of ribosomal and spliceosome proteins has low correlation with their gene-level expression [39], indicating some protein expression levels might be regulated by alternative splicing. Genes encoding cell-surface proteins such as CD44 and CD47 often have cell-type-specific splice variants, some of which have not been previously annotated, and may result in different functions [40]. The examples we show indicate that identification of cell-type-specific isoforms may have broad applications in many fields such as immunotherapy, where cell surface proteins play an important role.
Compared to short-read scRNA-seq, our approach provides a better linkage between proteome and genome as it can resolve the full-length transcript. Transcript abundance can be aligned directly to single-cell protein measurements, which is particularly helpful for proteins with multiple isoforms such as CD45. We also showed that data integration with scATAC-seq is possible by summarizing open chromatin signals at the exact transcription start site rather than at the gene level. Full transcript coverage unlocks the potential to detect mutations and allelic expression for specific isoforms in an unbiased way compared to methods such as GoT-seq [41]. Through unsupervised analysis, we not only identified coding variants that cannot be easily detected by 3′ or 5′end short-read sequencing, but we could also associate variants to different transcriptional signatures. This can improve our understanding of the correlation between the single-cell transcriptome and genome in cancer. The detection of cluster-specific transcript fusions is also possible using this type of data [38].

Conclusions
Our approach provides new avenues for characterizing single-cell transcriptomic heterogeneity at the transcript level and unveils new questions and challenges. Benchmarking studies that compare the performance of emerging long-read sequencing protocols and computational methods are needed to help our understanding of the strengths and weaknesses of different approaches. Many unanswered questions about transcript expression and alternative splicing in single cells also remain. For example, do the majority of novel transcripts simply reflect stochastic noise in the splicing machinery [42], or are they indicative of a genuine increase in protein diversity? Our methods and analysis provide a starting point for addressing these questions to accelerate isoform-level studies in single cells.

Human cell lines
The cell culture and sample preparation of the scmixology cell lines are as previously described. Briefly, the five cell lines (H2228, H838, H1975, HCC837, A549) were retrieved from the ATCC (https://www.atcc.org/) and cultured in Roswell Park Memorial Institute (RPMI) 1640 medium with 10% fetal calf serum and 1% penicillin-streptomycin. The cells were grown independently at 37°C with 5% carbon dioxide until near 100% confluency. The cells were then counted and mixed in equal numbers. The mix was used for Chromium 10x library preparation. The first batch was derived from the same sample featured in a previous study (scmixology1). The cell lines were cultured again using the same protocol to create a second batch (scmixology2) processed by both Chromium 10x scRNA-seq and scATAC-seq.

Animals
All procedures were approved by the Animal Ethics Committee of The University of Melbourne and conformed to the Australian code of practice for the care and use of animals for scientific purposes as stipulated by the National Health and Medical Research Council of Australia. Mice were housed in the Biological Research Facility at The University of Melbourne under a 12-h light-dark cycle, with drinking water and standard chow provided ad libitum. Pax7 creERT2 R26-eYFP fl/fl mice were generated from Pax7 creERT2 and R26-eYFP fl/fl founder mice strains [43,44] both on a C57BL6 background.

Muscle injury and stem cell isolation
At 3 months of age, Pax7 creERT2 ; R26R YFP mice received daily 100μL tamoxifen (20mg/ml in corn oil) for 5 days to label MuSCs with YFP. Two weeks after the final injection, animals were killed and hindlimb muscles excised and dissociated as described previously [45]. Cells were sorted on a FACS Aria III (BD Biosciences) with gating based on YFP.
To obtain activated MuSCs, mice were anesthetized with isoflurane and muscles were injured with 1.2% barium chloride (Sigma). Lower hindlimbs received 40μL barium chloride via an intramuscular injection. Activated cells were isolated 72 h postinjury while quiescent MuSCs were isolated from uninjured muscles as described previously.

Human CLL sample (CLL2)
After providing written informed consent, the patient sample was collected after progression on venetoclax treatment [46]

scRNA-seq cell isolation and library preparation
Single-cell capture and cDNA amplification for mouse skeletal muscle stem cell and cell lines was performed using 10x Genomics Chromium Single cell 3′ Library and Gel Bead (v2 for MuSC and scmixology1 and v3 for scmixology2) and for the CLL2 sample using Chromium Single Cell 5′ Library & Gel Bead Kit (v1) according to manufacturer instructions with some modifications. Full-length cDNA generation was carried out as described in detail at protocols.io [47]. Briefly, we followed the standard 10x Genomics user guide, with RT time increased to 2 h in the CLL2 sample only to potentially increase the reverse transcription of longer transcripts. After GEM-RT, we transferred 10-20% volumes of GEMs into a new tube and performed subsequent steps in parallel for both 10-20% and 80-90% subsample where each subsample is treated as a separate sample according to the 10x user guide for Illumina library preparation for scmixology and MuSC samples and increasing cDNA PCR extension step to 3 min for the CLL2 sample. Long-read library preparation is detailed at protocols.io [47]. cDNA generation for hybrid capture input was similar to the main protocol except that cDNA from the remaining sample (i.e., 80-90%, excluding the 10-20% subsample) was used for amplification using primers 10x Partial R1: CTACACGACGCTCTTCCGATCT and T5′ PCR Primer IIA: AAGCAGTGGTATCAACGCAGAG in place of FPSfilA and RPSfilBr.
To capture the BCL2 gene, all annotated isoforms from Gencode and Ensembl data bases and isoforms identified in~20% subsamples were integrated and used for IDT xGEN Lockdown capture probe design with one probe for each exon up to 1Kb. To design probes for exons shorter than 120 nucleotides, they were concatenated to flanking exons and two probes were designed to include short exons and part of flanking exons. Hybrid capture was done with 2μg of cDNA input using IDT xGEN Hybridisation and Wash Kit. Captured cDNA were pulled down with streptavidin beads and washed with IDT buffers. The targeted cDNA library was amplified in a similar manner to the main protocol for 12 cycles with Takara LA Taq DNA polymerase in place of PrimeSTAR GXL DNA polymerase.
Standard single-cell Illumina libraries were prepared according to 10x protocol. Illumina HiSeq2500 was used for sequencing scmixology1 (2 × 125 cycles) and MuSC (2 × 150 cycles). Other libraries were sequenced on NextSeq 500 (1 × 28/1 × 91 cycles plus 8 base index cycle) using the v2 150 cycle High Output kit (Illumina) as per the manufacturer's instructions. The base calling and quality scoring were determined using Real-Time Analysis on board software v2.4.6, while the FASTQ file generation and de-multiplexing utilized bcl2fastq conversion software v2.15.0.4. Full-length cDNA libraries from subsamples or capture cDNA were prepared using Oxford Nanopore Technologies SQK-LSK109 Ligation Sequencing Kit with the following modifications: incubation times for endpreparation and A-tailing were lengthened to 15 min, and all washes were performed with 1.8X Ampure beads to conserve smaller fragments. SFB was used for the final wash of the libraries. Fifty femtomole per library was sequenced on PromethION FLO-PRO002 R9.4.1 flow cells according to manufacturer protocols.

Single-cell ATAC-seq (scATAC-seq)
The cells from five cell lines were counted and mixed equally. Cell nuclei were isolated and washed according to the Nuclei Isolation for Single Cell ATAC Sequencing (10x Genomics) protocol, with 1 million cells to start with (0.2 million from each cell line) and 3 min lysis on 100μL buffer. Nuclei were then used to generate scATAC-seq libraries according to the Chromium Single Cell ATAC Reagent Kits User Guide (10x Genomics; CG000168 Rev B). Sequencing libraries were loaded on an Illumina sequencer with 2 × 75 paired-end kits using the following read length: 72 bp read 1, 8 bp i7 index, 16 bp i5 index, and 72 bp read 2. In the sequencing reaction, reads 1 and 2 contain the DNA insert, while the index reads, i5 and i7, capture the cell barcodes and sample indices, respectively. Cells were sequenced on Illumina HiSeq2500 with near around 300 million read pairs in total.
scATAC-seq data analysis scATAC-seq sequencing data was demultiplexed, preprocessed, and aligned with the default settings of the single-cell ATAC Cell Ranger platform (1.0). The reference used for alignment through the Cell Ranger platform was hg38. Next, Picard tools (http:// broadinstitute.github.io/picard/) was used to remove the PCR duplicates. Samtools [48] (1.7) was used to extract read pairs that have mapping quality (MAPQ)>30, were nonmitochondrial, and not chimerically mapped. bedtools [49] (v.2.26.0) was used to identify reads in mate pairs (i.e., fragments) and adjust the start of the paired reads to account for the 9-bp region that the transposase enzyme occupies during transposition (i.e., +4 bp for + strand and −5 bp for − strand). Next, Demuxlet was used to identify the cell lines of each cell barcode using the genotypic information acquired in our previous benchmarking studies [13,50].
The count matrix was generated for each barcode-separated BAM file using the fea-tureCounts function in the Rsubread package [51] (1.32.4) in the R environment (v. 3.5.1). The annotation features were promoter regions (i.e., TSS − 500 bp to TSS + 200 bp) corresponding to isoforms identified by the various long-read analysis methods (i. e., FLAMES, TALON, FLAIR, StringTie2). To determine the background of scATACseq and identify the open promoter, we performed analysis on randomized TSS, where a random position in each gene body was used as the TSS. The 90% percentile of the fragment count around random TSS (7.64) was used as the threshold to determine the open promoters and is annotated in Fig. 1F and Additional file 2: Fig. S5D.

Illumina short-read data analysis
The fastq data were processed by scPipe to generate a gene count matrix for all samples except CLL2, which was processed by Cell Ranger (3.0.0) to generate the antibody and gene count matrix. Each gene count matrix was used as input to the standard Seurat pipeline with normalization performed by SCTransform [54]. Clustering was performed for the MuSC and CLL2 samples with a resolution equal to 0.6. The cell line annotation for scmixology was acquired using Demuxlet with the same parameters as our previous benchmark study. Integration of the scmixology1 and scmixology2 datasets was performed using Seurat. The clustering results and cell line annotation are shown in Additional file 2: Fig. S1.

Nanopore sequencing and data preprocessing
We performed basecalling on the raw fast5 data using Guppy (1.8.1 for MuSC sample and 3.1.5 for scmixology and CLL2) from Oxford Nanopore Technologies. For each read, we locate the barcode sequence by searching for the flanking sequence before the cell barcode. The cell barcodes identified from the short-read data provide a reference to search for and trim in the long reads. An edit distance of up to 2 is allowed during cell barcode matching. Reads that failed to match any cell barcode were discarded. Sequences following the cell barcode were used as UMIs and trimmed. For the 3′ end protocol, the polyA tail after the UMI sequence was trimmed and sequences after the polyA tail were kept. The cell barcode and UMI sequence were integrated into the fastq read header as per scPipe [55]. The processed fastq was used as input for genome alignment and further analysis.

Detection and quantification of isoforms
Reads were aligned to the genome by minimap2 [56] (-ax splice --junc-bonus 1 -k14 --secondary=no --junc-bed) using Gencode reference (human hg38.v33, mouse mm10.vM24). FLAMES summarizes the alignment for each read by grouping reads with similar splice junctions (<5bp by default) to get a raw isoform annotation. The raw isoform annotation is compared against the reference annotation to correct potential splice site and transcript start/end errors. Transcripts that have similar splice junctions (<5bp by default) and transcript start/end (<100bp by default) to the reference transcript were merged with the reference. This process will also collapse isoforms that are likely to be truncated transcripts. This is achieved by modeling the possibility of a read to be truncated as a linear function to the isoform length, given that longer isoforms are more likely to have truncated reads with incomplete 5′/3′ ends (depending on the 10x protocol applied). Next, the sequence of each polished transcript was extracted and used as the updated reference. The reads were realigned to this reference by minimap2. The transcript coverage of individual reads is summarized in Additional file 2: Fig. S3. We noticed that the scmixology2 data contained more reads that were not full-length, which might relate to the difference in sample preparation time or the v2 and v3 10x chemistry. The transcripts with fewer than 5 full-length aligned reads (>95% coverage by default) were discarded. The reads were assigned to transcripts based on both alignment score, fractions of reads aligned, and transcript coverage. Reads that cannot be uniquely assigned to transcripts or had low transcript coverage (< 60%) were discarded. The UMI transcript count matrix was generated by collapsing the reads with the same UMI in a similar way to what is done for short-read scRNA-seq data, but allowing for an edit distance of up to 2. The counts of transcripts from the same gene were aggerated to generate the gene-level UMI count and compared to the gene count generated from the short-read data in Fig. 1E. All default parameters mentioned above are encoded in a configuration file (config_sclr_nanopore_default.json) and can be changed to fit different protocols. For example, the 5-bp distance for splice site correction could be smaller for data generated by PacBio which has a lower error rate. FLAMES is written in python and R and uses other packages, including pysam

Comparison of FLAMES to other tools
Direct RNA sequencing data from SIRV spike-in E2 mix which contains 69 synthetic isoform transcripts (from 7 SIRV genes) was downloaded from NCBI (SRX3204589). The reference annotation for the spike-ins was provided to all methods to ensure a fair comparison. Alignment was performed using minimap2 (2.17), with "-ax splice --splice-flank=no -k14 --secondary=no" and "--junc-bed." For TALON, the mapped reads were processed using TranscriptClean (1.02) to correct for mismatches and microindels. Following correction, long reads were collapsed into a transcript isoform quantification table in TALON (4.1), using the SIRV annotation. We ran the FLAIR pipeline (1.4) using default parameters with all modules including "flair align," "flair correct," "flair collapse," and "flair quantify." The SIRV annotation was supplied in the "flair correct" step to help correct misaligned splicing sites. We ran StringTie2 (2.0. 4) with "-L -G -c 10" and the SIRV annotation. For FLAMES, we used the default parameters except "strand_specific:1" and we filtered transcripts with at least 10 reads. For comparison, transcripts generated by TALON and FLAIR were also filtered to have a read count of at least 10. The scmixology1 data was processed in a similar way to the SIRV data. Gencode human hg38.v33 was used as the gene reference annotation for each method. TALON was run in parallel on each chromosome to reduce compute time and the results were aggregated later. Unfortunately, we were unable to install the SiCeLoRe software on our system to compare with FLAMES.
Isoform classification and splicing analysis SQANTI2 (4.1, https://github.com/Magdoll/SQANTI2) was used to compare the transcripts identified to the reference with parameter "-g --cage_peak --coverage --force_ id_ignore." We used the FANTOM5 cage peak dataset on hg19 and mm9 and lifted these to the hg38 and mm10 reference using UCSC's liftOver tool (https://genome. ucsc.edu/cgi-bin/hgLiftOver). The isoform classification was extracted from the SQAN TI2 result and plotted in Fig. 2B. The gffcompare [58] (0.11.2) program was used with parameter "-T -R -M" to compare isoform annotations generated from different samples (Fig. 2B). It was also used to compare isoform annotations obtained after downsampling, which was achieved by randomly subsampling the long reads in the bam file (20%, 40%, 60%, and 80%) and re-running the pipeline with the same parameters (Additional file 2: Fig. S7A-B). Results from these comparisons were plotted using UpsetR [59] (1.4.0). We ranked transcript abundance for each gene that had multiple isoforms and obtained the alternative splicing events from the most expressed transcript and the second most expressed transcript. We used a common model to classify the splicing events [60], where alternative 5/3′ splice site includes alternative promoter and alternative polyadenylation. Transcripts with more than one splicing event were classified as complex splicing changes (Fig. 2G).

Differential transcript usage analysis
We filtered genes to have at least two isoforms, each with more than 15 UMI counts. For each gene, the per cell transcript counts were merged by group to generate pseudo-bulk samples. For scmixology, the groups are based on cell line identity inferred by known genetic variation, and for the MuSC and CLL2 data, the groups are based on the clusters acquired from Seurat clustering, shown in Additional file 2: Fig. S1. The top 2 highly expressed transcripts for each group were selected and a UMI count matrix where the rows are selected transcripts and columns are groups was used as input to a chi-square test of independence (chisq.test in R). P-values were adjusted by Benjamini-Hochberg correction [61] and results were summarized in Additional file 4. We performed functional clustering with DAVID [62] using genes with significant DTU (adjusted P-value < 0.01) as input (Additional file 5). The transcript structures in all figures were plotted using geom_alignment in ggbio [63] (1.36.0). We performed imputation of transcript counts on cells that were not sampled for long-read sequencing using the shared nearest neighbor network constructed by Seurat [64] (3.1.5). We then scaled the transcript expression matrix for each gene and the results were used in heatmaps and UMAP visualizations (available in Seurat) in Fig. 3 and Additional file 2: Fig.  S8. The imputed results were used for visualization purposes only.

Variant calling and clonal analysis
First, we identified candidate SNVs using FLAMES by excluding homopolymer regions (runs >3 of the same nucleotides), positions with coverage of fewer than 100 reads, and positions with reference allele frequency less than 10% or greater than 90%. For each candidate SNV, we generated an allele count matrix of the reference and alternative allele. Next, we collect the cells that have reads with the reference allele and the alternative allele and performed a binomial test on the allele counts, assuming that under the null hypothesis each cell has the same probability of having the alternative and reference allele as their allele frequency in the whole sample. P-values were adjusted by Benjamini-Hochberg correction and candidate SNVs with an adjusted P-value < 0.05 were kept for further analysis. PCA was applied to the filtered alternative allele count matrix from the scmixology data and first two PCs were plotted in Fig. 4B. The top 5 PCs were selected to build the shared neighbor network (scran::buildSNNGraph [65] k = 20, d = 5) which is used for Louvain clustering (igraph::cluster_louvain [66]) in order to examine whether the allele count matrix captured the variants in the cell lines. Next, we performed differential allele frequency analysis in a similar way to the DTU analysis, but with the allele counts aggregated for each cluster. The candidate SNVs with adjusted P-values < 0.05 were selected and compared to the VCF reference generated from a previous study using bulk RNA-seq. After analyzing the scmixology dataset, we processed the CLL2 data in the same way and conducted differential allele frequency analysis on the CLL2 clusters shown in Additional file 2: Fig. S1. The bar plot of cells with alternative alleles of significant SNVs is shown in Additional file 2: Fig S9B and summarized in Fig. 4D. The capture enrichment data was analyzed in a similar way, with the reads processed by FLAMES and an allele count matrix generated for the Gly101Val mutation. All plots were generated using ggplot2 unless otherwise specified and most of the analysis was performed in R [67] (4.0) unless otherwise specified.

Implementation of the FLAMES software
FLAMES is available as both a Python (https://github.com/LuyiTian/FLAMES) and R/Bioconductor package (https://bioconductor.org/packages/FLAMES) (0.99.31). The R package provides functionality of the original Python package using the basilisk software (1.2.1) (https://bioconductor.org/packages/release/bioc/html/basilisk.html) to create and manage a "frozen" conda environment, allowing all required Python packages to be downloaded from R. This allows for specific package versions to be accessed, without relying on the user to self-manage the required Python modules. Using this basilisk environment, the reticulate package (1.18) (https://cran.r-project.org/web/packages/reticulate/index. html) was used to call the FLAMES Python functions directly from R. The FLAMES package (Additional file 2: Fig S2) thus provides a series of wrapper functions for these reticulate Python calls. All basic data type conversion between R and Python, such as R named list to Python dictionary, is handled by reticulate; however, a few more complex data types required manual conversion.
The Python FLAMES package provided two pipelines, one for single-cell data and one for bulk long reads which have been re-written in R. Each function called from either pipeline is supplied as an R exported function to facilitate user configuration changes if wishing to manually run the pipeline, without the automated pipeline functions. To assist with interfacing with other Bioconductor packages, FLAMES stores the count and annotation data in either a Sin-gleCellExperiment (1.12.0) object (https://bioconductor.org/packages/release/bioc/html/ SingleCellExperiment.html) or SummarizedExperiment (1.20.0) (http://bioconductor.org/ packages/release/bioc/html/SummarizedExperiment.html) for the single-cell and bulk pipeline, respectively, as well as providing a number of additional output files. The package includes vignettes for running the FLAMES pipeline without the read alignment and realignment steps, which are platform dependent.
Bulk RNA-seq and exome data Bulk data from the 5 Human Cell lines used in the scmixology experiments were downloaded from GEO accession number GSE86337.
PBMCs from CLL2 were thawed and stained with CD19-BV510 antibody (BD Biosciences) at 4°for 30 min. PBMCs were washed stained with propidium iodide (PI, Sigma). Viable cells (PI negative) were flow sorted into a CD19+ tumor sample using the FACSAria (BD). RNA was extracted using an RNeasy Mini Kit (Qiagen #74104) according to the manufacturer's instructions. Library preparation and sequencing were performed, using the Truseq Stranded Total RNA library prep kit (Illumina) and 100bp single-end sequencing protocol.
The data was aligned to the human genome (hg38) using STAR (2.7.3) with default parameters. Junctions with more than 5 uniquely mapping reads were retained and classified as known or novel based on the Gencode annotation. These were compared to the unique junctions obtained from FLAMES for isoforms with at least 1 read in the relevant single-cell sample (scmixology1 and CLL2 with non-tumor cells removed) that were classified (as known and novel) and extracted using SQANTI3. DNA was extracted from the CLL2 PBMCs using the QIAamp DNA Mini Kit (Qiagen # 51304) according to the manufacturer's instructions. Library preparation and sequencing were performed at the Australian Genome Research Facility (AGRF), using the TruSeq Nano kit (Illumina), SureSelectXT2 Target Enrichment System, and Human All Exon v6 Capture Library (Agilent Technologies) and 150-bp paired-end sequencing protocol. The bulk sequencing data was aligned with BWA mem [68] (0.7.17) (exome) and STAR [69] (2.7.3) (RNA) and variants were called by superFreq [70] (1.4.3). Select variants detected in the scRNA-seq analysis were visually inspected.