A multi-split mapping algorithm for circular RNA, splicing, trans-splicing and fusion detection

Numerous high-throughput sequencing studies have focused on detecting conventionally spliced mRNAs in RNA-seq data. However, non-standard RNAs arising through gene fusion, circularization or trans-splicing are often neglected. We introduce a novel, unbiased algorithm to detect splice junctions from single-end cDNA sequences. In contrast to other methods, our approach accommodates multi-junction structures. Our method compares favorably with competing tools for conventionally spliced mRNAs and, with a gain of up to 40% of recall, systematically outperforms them on reads with multiple splits, trans-splicing and circular products. The algorithm is integrated into our mapping tool segemehl (http://www.bioinf.uni-leipzig.de/Software/segemehl/).


Supplementary Benchmarks
In the following we give additional benchmark data on the effect of errors and coverage on our tool ( Supplementary Fig. 1), running times and memory usage on simulated and real data sets (Supplementary Tab.1, 2) as well as the exact parameters used for your evaluation. For each of these simulated splice junctions we simulated reads without errors (no mismatches and indels; left column), 2% error rate (middle column) and 5% error rate (right column). As expected, the recall increases with the coverage. Overall, segemehl shows a high precision. Interestingly, only the recall but not the precision declines when errors are introduced.  For the long RefSeq reads, we used STAR following the supplement of [5].

Spliced leader trans-splicing in C. elegans
The C. elegans transcriptome data (SRX151602) was mapped using an increased accuracy of 95% (-A 95) and increased sensitivity for small splits (-Z 19) to the C. elegans reference genome ce6. Subsequently, we selected all split reads spanning more than 200K or split up between different chromosomes. After masking all splitreads starting or ending in the rRNA cluster on chrI:15060287-15072132 we found that more than 90% of all split-reads start or end in known spliced leader sequences (see Supplementary Tables 3, 4).    Table 5: SL usage. The spliced leader SL1 is the most frequently used spliced leader as determined by our split read alignment.

Validation of novel p53 isoforms
Applying our split read algorithm to the data of [14] (GSE29040) we predicted the following novel isoforms: For the validation of the newly identified canonical splice junctions in p53 we used RNA from venous fibroblasts. The RNA was isolated with TRIzol reagent (Life Technologies) and treated with RNase-free DNase (Qiagen) according to the manufacturers' instructions. The RNA samples were prepared in the context of another study recently published in PLoS Genetics [9]. The protocols for reverse transcription into cDNA are described therein. PCR reactions were prepared in a final volume of 25 µl using AmpliTaq Gold(R) 360 DNA Polymerase (Life Technologies) and primers were selected to span two exons in order to avoid co-amplification of genomic DNA. We used a common forward primer and three isoform-specific reverse primers. (Fig. 2).

Shannon entropy in ALU repeats and circular transcripts
To investigate whether our seed selection procedure and especially the Shannon entropy filter prohibits the detection of circular transcripts in ALU elements, we have calculated the median entropy in 20bp and 40bp windows for each ALU repeat, annotated in hg19 (Fig. 4). Within all ALUs, the smallest median entropy we measured was 1.785. Thus, our results indicate that the required minimum shannon entropy of 1.5 does not impede the split-read mapping procedure of segemehl. To check whether segemehl is indeed able to map successfully onto ALU repeats, we have extracted the sequences of all Human ALU repeats (∼1.1 Million) that were annotated in the RepeatMasker track of hg19. The ALU sequences were artificially circularized and reads of length 100nt were generated. Subsequently, the reads were mapped to the Human genome using segemehl. In total, 99% of the reads were aligned by our tool. Most of the aligned reads (54.5%) were aligned without a split (end-to-end), while only 35.5% of them contained a circular junction. This can easily be explained by the fact that ALU repeats are frequently found in multiple adjacent copies in the genome. By design, segemehl operates conservative and attempts to map any read collinearly if possible. Since 99% of the reads could be mapped, seed and alignment quality filters do not per se impede the mapping of ALU transcripts. However, the above result indicates that a circular transcript could be missed if it was also explainable by a collinear transcript. Obviously, the information of the read is simply not sufficient, e.g. too short, to tell circular from collinear apart and any aligner would naturally fail to reliably predict the circularization. total mapped split-mapped circular 1119183 1108492 (99%) 504118 (45%) 393212 (35.1%)

LnCaP split validation Transcriptome sequencing
Cells from the human prostate carcinoma cell line LNCaP were maintained and total RNA was isolated as described previously [3]. Ribsomal RNAs were removed from total RNA using the Ribo-Zero Kit (Epicentre, Madison, WI) according to the manufacturer's instructions. A strand-specific library for transcriptome sequencing was prepared using the ScripSeq Kit (Epicentre) following the manufacturer's instructions. The library concentration was determined using an Agilent 2100 Bioanalyzer system with a High Sensitivity DNA Kit (Agilent, Sanat Clara, CA) according to the manufacturer's instructions and relying on the concentration of fragments between 150 and 600 nt in size. 12 pmol of library were clustered on one lane of an Illumina paired-end flow cell and 2x100nt were sequenced according to the manufacturer's instructions using the v3 Cluster Generation and SBS Sequencing Kits (Illumina, San Diego, CA) on a HiSeq2000 system.

Split and isoform validation
Reverse transcriptase polymerase chain reaction (RT-PCR) was used to validate the observed split in total RNA of the human carcinoma cell line LNCaP. Complementary DNA was synthesized from 1µg total RNA using AMV reverse transcriptase (Finnzymes) and random hexamer primers (Invitrogen). PCR was performed using FAST SYBR TM green (Applied Biosystems) on a Applied Biosystems 7900 System with the following primer sets: forward primer 29: 5'-GGTCCCTTTTCTTTGACCAG-3', 300nM reverse primer 29: 5'-CTGCAGGATCTATAGGCAGCTT-3', 300nM forward primer 36: 5'-GTCCCTTTTCTTTGACCAGAT-3', 300nM reverse primer 36: 5'-GGGACAGGATCTATAGGCAGCTT-3', 300nM Primers were designed using primer3 [16] and the sequence of the read exhibiting the split as a template, which explains mismatches in the primer sequences compared to the reference genome. Following PCR and agarose gel electrophoresis, amplified bands (around 100bp) were gel excised (Qiagen MinElute Gel extraction kit) and the purified products were ligated into a pCR4-Topo vector and transformed using TOP10 chemical competent Escherichia coli cells (TOPO TA Cloning Kit, Invitrogen). Clones were checked for inserts with M13F and M13R primers and PCR-products were subjected to Sanger sequencing.  Figure 5: A split observed in the prostate carcinoma cell line LNCaP. The split aligns in parts to the forward and reverse strand of chromsome 10, respectively, and is located in the 3' UTR of stearoyl-CoA desaturase (SCD). This enzyme is known to promote prostate cancer cell proliferation and transactivation of the androgen receptor, the key signaling pathway in this disease [12]. LNCaP total RNA was sequenced strand-specific using Illumina paired-end sequencing subsequent to ribosomal RNA removal. For validation, PCR primers were designed using the sequence reads exhibiting the split as a template. Conventional RT-PCR was performed using these primers and the amplicon of the expected size shown in the inset was cloned. Sanger sequencing of this clone confirmed the observed split.

Recovery of published circular RNAs
The existence of circRNAs has been demonstrated in several high-impact publications over the last couple of years, which are cited in the manuscript. We demonstrate, using the RNA seq data for HEK293 cells from the Memczak et al. (Nature 2013), that segemehl is capable of recovering the circRNAs reported and validated in these authors' work. In total, segemehl found 1,712 circular junctions located at canonical splice motifs. From the 239 predicted circular junctions in the Memczak paper (cf. [13]), 191 (80%) were also predicted by our algorithm. More importantly, our mapper predicted all of the 19 circular RNAs that were experimentally validated by Memczak.
In fact, we observed that the majority of the validated RNAs (n=15) had a junction support of two or more reads. Applying this simple filter reduced the number of reported circular junctions to 373.