Quantifying the benefit offered by transcript assembly with Scallop-LR on single-molecule long reads

Single-molecule long-read sequencing has been used to improve mRNA isoform identification. However, not all single-molecule long reads represent full transcripts due to incomplete cDNA synthesis and sequencing length limits. This drives a need for long-read transcript assembly. By adding long-read-specific optimizations to Scallop, we developed Scallop-LR, a reference-based long-read transcript assembler. Analyzing 26 PacBio samples, we quantified the benefit of performing transcript assembly on long reads. We demonstrate Scallop-LR identifies more known transcripts and potentially novel isoforms for the human transcriptome than Iso-Seq Analysis and StringTie, indicating that long-read transcript assembly by Scallop-LR can reveal a more complete human transcriptome.


Merging multiple SRA Runs from the same BioSample into one dataset
In most of the PacBio datasets in SRA, one BioSample has multiple SRA Runs. We merge multiple SRA Runs that belong to the same BioSample into one dataset. PacBio sequencing uses the template called SMRTbell that is a closed, single-stranded circular DNA created by ligating adaptors to both ends of a target double-stranded cDNA molecule. The sequencing is based on a SMRT Cell, a chip with consumable substrates comprising arrays of zero-mode waveguide (ZMW) nanostructures, and a SMRTbell diffuses into a sequencing unit ZMW on it. The realtime observation of a SMRT Cell is called a movie, and an SRA Run usually contains one movie and sometimes contains multiple movies. One BioSample has multiple SRA Runs because the experimenters used multiple movies (i.e. multiple SMRT Cells) to increase the coverage so that those low-abundance, long isoforms can be captured in Iso-Seq Analysis, since the "polish" step in Iso-Seq Analysis keeps only the isoforms with at least two full-length reads to support them. In most cases, the experimenters also used a size selection sequencing strategy, that is, isoforms that are in different size ranges are split into separate independent SMRTbell libraries for sequencing, so that larger isoforms are not detrimentally dominated by smaller isoform molecules during the sequencing. Thus, different SRA Runs are designated for different size ranges. Therefore, we use one BioSample instead of one SRA Run to represent one dataset in our analysis, and we merge multiple SRA Runs into that dataset.

Software versions and options used in the analysis workflow
The software versions and options used in the analysis workflow are summarized in the following: Iso

Assessment of predicted transcripts that partially match known transcripts in mouse data
Figures S1, S2, and S3 show box-whisker plots of matched transcripts in matched fraction bins, assembled isoforms in assembled fraction bins, "mean isoform assembly" and "mean fraction of transcript matched" for Scallop-LR, StringTie, and Iso-Seq Analysis on the eight mouse datasets based on rnaQUAST evaluations. Full results are shown in Tables S8.1-S8.8.
In the mouse data, Scallop-LR predicts more transcripts that have a high fraction of their bases matching reference transcripts than Iso-Seq Analysis. From Tables S8.1-S8.8, in the high % bins of the "x-y% matched transcripts" (75-95% and 95-100% matched), Scallop-LR consistently has more x-y% matched transcripts than Iso-Seq Analysis. However, unlike in the human data, Scallop-LR consistently has fewer x-y% matched transcripts than StringTie in the high % bins. These trends are visualized in Figure S1 (75-95% and 95-100% matched bins).
However, on average, Scallop-LR transcripts match reference transcripts better than StringTie transcripts. In Tables S8.1-S8.8, Scallop-LR consistently has much higher values of "Mean fraction of transcript matched" than StringTie. Scallop-LR has slightly lower values than Iso-Seq Analysis though. These trends are visualized in Figure S3 (right: "Mean fraction of transcript matched").
In the mouse data, there are more reference transcripts that have a high fraction of their bases being captured/covered by Scallop-LR transcripts than by Iso-Seq Analysis predicted transcripts. From Tables S8.1-S8.8, in the high % bins of the "x-y% assembled isoforms" (75-95% and 95-100% assembled), Scallop-LR consistently has more x-y% assembled isoforms than Iso-Seq Analysis. However, Scallop-LR consistently has fewer x-y% assembled isoforms than StringTie in the high % bins. These trends are visualized in Figure S2 (75-95% and 95-100% assembled bins).
However, on average, reference transcripts are better captured/covered by Scallop-LR transcripts than by StringTie transcripts and Iso-Seq Analysis transcripts. In Tables S8.1-S8.8, Scallop-LR consistently has higher values of "Mean isoform assembly" than both StringTie and Iso-Seq Analysis. Iso-Seq Analysis consistently has higher values than StringTie. This trend is visualized in Figure S3 (left: "Mean isoform assembly").
The quality of StringTie transcripts in the mouse data is somewhat improved compared to that in the human data. As in the human data, StringTie consistently has significantly more unannotated transcripts than both Scallop-LR and Iso-Seq Analysis (Tables S8.1-S8.8). However, in Figure  S1, unlike Figure 7, in the 0-50% matched bin StringTie no longer has a very high number of transcripts. This indicates that StringTie performs better in the mouse data than in the human data. In Figure S2, though, in the 0-50% assembled bin StringTie still has significantly higher numbers of isoforms than both Scallop-LR and Iso-Seq Analysis, similar to Figure 8.

Evaluation of Scallop-LR and StringTie on simulated human data
We evaluated Scallop-LR and StringTie on a simulated human dataset (Liu et al., 2019). The transcriptome that was used to generate the simulated long reads originated from the Ensembl annotation Homo sapiens GRCh38.94 and was a subset of the transcripts in this Ensembl annotation, by removing unfinished scaffolds, transcripts shorter than 200 bp, annotations with an unknown reference, etc. and randomly selecting alternative-splicing genes, single-splicing genes, and genes with small exons (< 31bp). The PacBio PBSIM tool was used to generate the simulated CCS reads from this transcriptome. The simulation was model-based using the CCS model, and three runs of simulations were performed by using three different sequencing depths 4X, 10X and 30X respectively. We merged the CCS reads generated with the three sequencing depths together to obtain this simulated human dataset. We used the transcripts in the transcriptome sequences that were used to generate the simulated CCS reads to extract the transcripts' records and their corresponding genes' records from the Ensembl annotation Homo sapiens GRCh38.94 to obtain an annotation GTF file. This extracted annotation GTF file serves as the "ground truth" and contains 7810 multi-exon transcripts.
In the Gffcompare evaluation, the extracted annotation GTF file corresponding to the transcriptome that was used to generate the simulated CCS reads serves as the reference annotation. Scallop-LR demonstrates both higher sensitivity and higher precision than StringTie (Table S12), consistent with the trends on the real human datasets. Note that since the simulated CCS reads do not contain the primer information, Scallop-LR's transcript boundary identification algorithm through extracting the boundary information from long reads is not used on the simulated data.
In the rnaQUAST evaluation, the extracted annotation GTF file corresponding to the transcriptome that was used to generate the simulated CCS reads is used to make the gene annotation database. Therefore, the metric "x-y% assembled isoforms" is computed relative to the initial set of expressed isoforms that was used to generate the simulated reads, rather than all known isoforms. StringTie has more "x-y% assembled isoforms" in the 95-100% bin than Scallop-LR (Table S13), consistent with the trend on the majority of the real human datasets. However, Scallop-LR has more "x-y% assembled isoforms" in the 75-95% bin than StringTie, which is different from the majority of the real human datasets. The results of the simulated data confirm that StringTie assembles more "95-100% assembled isoforms" than Scallop-LR, but Scallop-LR assembles more correctly predicted transcripts than StringTie (Table S12). This implies that while Scallop-LR outperforms StringTie in terms of the exact reference-matching transcripts (100% assembled, correct predictions), StringTie transcripts cover more reference transcripts on 95-99% of their bases than Scallop-LR transcripts.
We further performed the rnaQUAST evaluation on the simulated dataset by using the entire Ensembl annotation (Homo sapiens GRCh38.94) as the reference annotation. The resulting "95-100% assembled isoforms" are 3094 and 4613 for Scallop-LR and StringTie respectively. Compared with the corresponding results in Table S13 (which uses the "ground truth" as the reference annotation), StringTie assembles 365 more "95-100% assembled isoforms" when the entire Ensembl annotation is used as the reference. This implies that StringTie assembles 365 transcripts that are not in the ground truth but appear to be misassembled to match the reference transcriptome. For Scallop-LR, the number of "95-100% assembled isoforms" by using the entire Ensembl annotation as the reference is very close to the number of "95-100% assembled isoforms" by using the "ground truth" as the reference (the difference between the numbers of "95-100% assembled isoforms" by using these two references is -9). This result may suggest that, for the real datasets, the "95-100% assembled isoforms" of StringTie could be somewhat inflated (i.e. some assembled transcripts are not in the ground truth but are misassembled to match certain transcripts in the reference), as we do not know the ground truth for the real datasets and use all known isoforms as the reference when evaluating the real data. On the other hand, based on this evidence, it seems that Scallop-LR stays consistent on this measure when the entire Ensembl annotation or the "ground truth" is used as the reference. The above table compares the Gffcompare evaluation results for Scallop-LR, StringTie, and Iso-Seq Analysis on human data. 18 human PacBio datasets were extracted from SRA, each corresponding to one BioSample and named by the BioSample ID (except that the last four datasets are four replicates for one BioSample). Multiple SRA Runs that belong to each BioSample were merged into a large dataset to perform the analyses. The first nine datasets were sequenced using the RS instrument and the last nine datasets were sequenced using the RS II instrument. Sensitivity is the ratio of the number of correctly predicted known transcripts over the total number of known transcripts, and precision is the ratio of the number of correctly predicted known transcripts over the total number of predicted transcripts. PR-AUC was calculated from the precision-recall curves we generated. The values within the parentheses are the adjusted sensitivity and adjusted precision. The adjusted sensitivity for Scallop-LR was calculated by matching the precision of StringTie, and the adjusted precision for Scallop-LR was calculated by matching the sensitivity of StringTie. The adjusted sensitivity and precision were only calculated for the last six datasets, since the last six datasets have opposite trends on sensitivity and precision comparing Scallop-LR and StringTie. The above table compares additional Gffcompare evaluation results for Scallop-LR, StringTie, and Iso-Seq Analysis on human data. The same 18 human PacBio datasets as described in Table S1 were evaluated. A Correctly Predicted Known Transcript is a transcript that has the exact intron-chain matching with a transcript in the reference annotation. A Potential Novel Isoform is a predicted transcript that shares at least one splice junction with a reference transcript. The # Total Multi-Exon Transcripts is the total number of predicted multi-exon transcripts. This table compares the Gffcompare evaluation results for Scallop-LR, StringTie, and Iso-Seq Analysis on mouse data. Eight mouse PacBio datasets were extracted from SRA, each corresponding to one BioSample and named by the BioSample ID. Multiple SRA Runs that belong to each BioSample were merged into a large dataset to perform the analyses. All eight datasets were sequenced using the RS instrument. Sensitivity, precision, and PR-AUC are as described in Table S1. The values within the parentheses are the adjusted sensitivity and adjusted precision. The adjusted sensitivity for StringTie was calculated by matching the precision of Scallop-LR, and the adjusted precision for StringTie was calculated by matching the sensitivity of Scallop-LR. The adjusted sensitivity and precision were calculated for all eight datasets, since all eight datasets have opposite trends on sensitivity and precision comparing Scallop-LR and StringTie.   Table S3 were evaluated via rnaQUAST. Figure axis descriptions are the same as in Figure 7. Figure S2: Mouse data: box-whisker plots of assembled isoforms in four assembled fraction bins for Scallop-LR, StringTie, and Iso-Seq Analysis, based on rnaQUAST evaluations. This figure is to compare numbers of x-y% assembled isoforms. The same eight mouse PacBio datasets as described in Table S3 were evaluated via rnaQUAST. Figure axis descriptions are the same as in Figure 8.  Table S3 were evaluated via rnaQUAST. Figure axis descriptions are the same as in Figure 9.  This table compares the SQANTI evaluation results for Scallop-LR and Iso-Seq Analysis on mouse data. The same eight mouse PacBio datasets as described in Table S3 were evaluated. Metrics descriptions are the same as in Table S5. This table compares the rnaQUAST evaluation results for Scallop-LR, StringTie, and Iso-Seq Analysis on a human dataset. "Isoforms" refer to reference transcripts from the gene annotation database, and "transcripts" refer to predicted transcripts. "# Transcripts" is the total number of predicted transcripts (including single-exon transcripts). "Aligned" is the number of transcripts which have at least one significant alignment to the reference genome. "Uniquely aligned" is the number of transcripts which have a single significant alignment. "Unaligned" is the number of transcripts without any significant alignments. "Misassemblies" are the transcripts that have discordant best-scored alignments (i.e. partial alignments that are mapped to different strands, different chromosomes, in reverse order, or too far away). "Unannotated" is the number of transcripts that do not cover any isoform from the annotation database. "x-y% assembled isoforms" is the number of isoforms from the annotation database that have at least x% and at most y% captured by a single predicted transcript. "x-y% matched transcripts" is the number of transcripts that have at least x% and at most y% matching an isoform from the annotation database. "Mean isoform assembly" is the average value of assembled fractions, where the assembled fraction of an isoform is computed as the largest number of its bases captured by a single predicted transcript divided by its length. "Mean fraction of transcript matched" is the average value of matched fractions, where the matched fraction of a transcript is computed as the number of its bases covering an isoform divided by the transcript length.                     This table compares the Gffcompare evaluation results for Scallop-LR and StringTie on a simulated human dataset (Liu et al., 2019). The transcriptome that was used to generate the simulated long reads originated from the Ensembl annotation Homo sapiens GRCh38.94 and was a subset of the transcripts in this Ensembl annotation, by removing unfinished scaffolds, transcripts shorter than 200 bp, annotations with an unknown reference, etc. and randomly selecting alternative-splicing genes, single-splicing genes, and genes with small exons (< 31bp). The PacBio PBSIM tool was used to generate the simulated CCS reads from this transcriptome. The simulation was model-based using the CCS model, and three runs of simulations were performed by using three different sequencing depths 4X, 10X and 30X respectively. We merged the CCS reads generated with the three sequencing depths together to obtain this simulated human dataset. We used the transcripts in the transcriptome sequences that were used to generate the simulated CCS reads to extract the transcripts' records and their corresponding genes' records from the Ensembl annotation Homo sapiens GRCh38.94 to obtain an annotation GTF file. This extracted annotation GTF file serves as the reference in Gffcompare and the "ground truth", and it contains 7810 multi-exon transcripts.