Global and unbiased detection of splice junctions from RNA-seq data
© Ameur et al.; licensee BioMed Central Ltd. 2010
Received: 23 October 2009
Accepted: 17 March 2010
Published: 17 March 2010
We have developed a new strategy for de novo prediction of splice junctions in short-read RNA-seq data, suitable for detection of novel splicing events and chimeric transcripts. When tested on mouse RNA-seq data, >31,000 splice events were predicted, of which 88% bridged between two regions separated by ≤100 kb, and 74% connected two exons of the same RefSeq gene. Our method also reports genomic rearrangements such as insertions and deletions.
High-throughput sequencing of mRNA opens unprecedented opportunities to identify the spectrum of splice events in a sample on a global scale. The typical approach for detecting splicing in RNA-seq experiments has been to map the reads to a junction library consisting of predefined exon-exon boundaries [1–6]. Although these strategies can successfully recover many splice events, they do not analyze splicing from a truly global and unprejudiced perspective. Only splice junctions present in the library can be identified, and it is simply not feasible to match against all possible combinations of exons. For example, a genome with 100,000 (105) exons, which is a low estimate for mammalian genomes, would yield 1010 combinations. To address this problem, the size of the junction library must be reduced dramatically, and consequently, most methods consider only the candidates involving known exons within the same gene. A severe limitation with this approach is that splicing events involving previously unknown exons cannot be identified. Also, this type of analysis is restricted to the relatively small number of species in which coordinates of genes and exons have been found.
To overcome some of these limitations, the splice-junction library can instead be created directly from the RNA-seq data without relying on any genome annotations. This approach is taken by the two packages G-Mo.R-Se  and TopHat . With these methods, all reads are first mapped to the reference genome, and transcribed fragments are identified through analysis of the coverage profile. The ends of these fragments are then combined into a library of putative exon boundaries to which the previously unmapped reads are aligned. Although this strategy has some advantages over methods that construct the library from known annotations, the problem of analyzing all possible exon combinations remains. G-Mo.R-Se and TopHat solve this problem by considering only putative junctions that span between neighboring (but not necessarily adjacent) transcribed fragments and those that contain a canonic (GT/C-AG) splice site. These restrictions imply that a substantial number of true splice junctions (for example, those with long introns or noncanonic splice sites) are outside of the detection range. A further limitation is that these methods are based on accurate de novo identification of exon boundaries from raw RNA-seq data, which in itself is a computationally challenging task, especially for transcripts expressed at lower levels.
An important application of deep RNA sequencing is the discovery of fusion transcripts in cancer, and two consecutive methods have been proposed by Maher and colleagues . Initially the authors used a combination of long reads (>200 bp) from the Roche 454 sequencer and shorter reads from the Illumina (Solexa) platform, and later they shifted to using paired-end sequencing (2 × 50 bp) . Although these strategies can successfully discover fusion transcripts, they have a number of important drawbacks. First, it is both costly and labor intensive to use two different sequencing platforms, as was done in their primary study. Second, the mate-pair approach complicates the analysis, because the expected insert size must be taken into account when estimating the expected distance between two mates in the sequenced transcript. This will be particularly problematic for mates that span over several splice junctions. Also, preparation of mate-pair libraries require larger amounts of RNA than the fragment libraries used in most RNA-seq experiments. The amount of RNA can be a crucial limitation, especially when studying clinical samples.
Here we present an alternative approach to identify splice junctions. The junctions are predicted de novo without any preassumed set of allowed exon boundaries. This implies that all types of splicing events in the RNA sample can be detected in a completely unbiased way, including previously unknown splice junctions and fusion transcripts. Also, we rely entirely on short reads (~50 bp) from fragment libraries, which is the type of RNA-seq data normally generated by using the Illumina or SOLiD platforms. By applying our method to available RNA-seq data from mouse cells , we showed that splice junctions can be identified at almost nucleotide precision and with a very low false-discovery rate (FDR). Moreover, this strategy also allows unbiased detection of insertions, deletions, and other types of genomic rearrangements within transcribed sequences. Indels and coding repeat expansions are important in a large number of human disorders . The potential for simultaneous detection of expression levels and coding-sequence variation in a single analysis pipeline will be beneficial for patient-sample analysis. We have implemented our method in a software called SplitSeek. The SplitSeek results can be directly uploaded to the UCSC genome browser  and used as input to the BEDTools software suite , which enables the user to visualize and analyze the predicted events in a genomic context.
Number of split read alignments
Anchor length 21
Anchor length 22
Anchor length 23
Anchor length 24
Splice junctions and insertions reported by SplitSeek with anchor length 22
Number processed reads
Predicted splice junctions
Within 1 Mb
Within 100 kb
Match to a RefSeq exon-exon boundarya
Expected false within 1 Mb (FDR)
Expected false within 100 kb (FDR)
Number of predicted small insertions and deletions within RefSeq exons and 3'UTRs
Insertions in RefSeq exons
Insertions in 3' UTR
Deletions in RefSeq exons
Deletions in 3' UTR
Our results demonstrate that SplitSeek has a high specificity, and the number of false positives could be reduced even further by requiring more unique reads to cover each junction. A more difficult task is to increase the sensitivity, but our comparison with the RNA-MATE program  suggests that one possible way is to use SplitSeek in combination with a complementary method that aligns the reads to a library of known exon boundaries. However, this comparison is focused only on splicing between annotated exons, whereas one of the strengths of SplitSeek is that it can perform other types of analysis in which RNA-MATE or other available tools cannot be directly applied. These include identification of splice sites in uncharacterized transcripts, detection of long-range fusion transcripts, and detection of small indels in transcribed sequences.
About 12% of the predicted junctions bridge between regions separated by ≥100 kb (see Table 2). Although a few of them can probably be explained by long introns (for example, Figure 5), this can not account for all detected long-range splicing and especially not the junctions bridging between different chromosomes. Instead, it is likely that many of them are false positives because of alignment issues or properties of the genome sequence. As an example, we may falsely detect splicing between different genes that belong to the same family just because of high sequence similarity in the exons. However, we cannot rule out that a substantial number of these unexpected splicing events are indeed true, and these would be interesting to investigate further. In that case, it might be reasonable to consider only the events bridging between regions identified as significantly transcribed from the RNA-seq data to filter out a large part of the false-positive long-range splicings.
The main limiting factor in the SplitSeek method is that there must be at least one read almost centered over an exon boundary; otherwise, it will not be detectable. When using 50-bp reads and 22-bp anchors as in this study, seven (14%) of 50 of the junction reads have this property. With a length of 75 bp and still splitting into 2 × 22 bp, this proportion would increase to 32 (43%) of 75, and this would likely increase the number of detected splicing events significantly. Another benefit of longer reads is that they could allow longer anchor lengths in the alignment, which might be necessary to discover junctions that are not uniquely mappable with shorter reads. However, it also is possible to increase the throughput by simply performing a deeper sequencing by using more of the 50-bp reads, and it is not obvious which is the optimal approach for this application. Although several benefits exist of using longer reads, some drawbacks might also occur, such as lower-quality base calls at the ends of the reads and difficulties in identifying splicing between very short exons.
Because of the recent improvements in throughput of the next-generation sequencing platforms, we believe that this strategy will make it feasible to investigate the entire spectrum of splicing events or gene fusions in an RNA sample in a completely unbiased way. We also want to emphasize the possibility of finding insertions, deletions, and other types of genetic rearrangements with the SplitSeek approach. This moves beyond the scope of RNA-seq data analysis, because it can equally well be used for DNA samples sequenced with high coverage.
We have developed a strategy for de novo detection of splice junctions in RNA-seq data. The exon-exon boundaries are identified almost at nucleotide resolution and with a low false-positive rate, <1 in 10,000 for junctions within 100 kb. Our method makes it possible to study splice junctions and fusion genes while also quantifying the gene expression, all from the same RNA-seq data. In addition, our method reports insertions and deletions in coding and noncoding parts of transcripts. We expect this to be an important application in a wide range of RNA-seq projects.
Materials and methods
Data acquisition and alignment
The raw RNA-seq data on mouse oocytes were downloaded from Gene Expression Omnibus , with accession number GSE:14605. The reads were aligned and extended by using version 1.0 of the whole transcriptome analysis tool available from Applied Biosystems . This software splits each read into two parts, or "anchors," which are aligned separately and extended as far as possible while still matching the reference sequence. We matched the reads by splitting into two parts of lengths 21 to 24, allowing up to two "color space" mismatches in each alignment. The minimum score required for an alignment to be reported in the final output was set to 20.
The SplitSeek program
Splice junctions were predicted from the alignment output files by using the SplitSeek software, which consists of two programs that are executed sequentially. In the first step, all candidate junction reads are identified and written to an intermediate BEDPE file. BEDPE is a file format that was recently introduced to give a concise description of paired-end sequence alignments . This intermediate file is then used as input to a second script that performs the remaining analysis. The algorithm is split into two parts because the first program is specific to the next-generation sequencing platform, in this case, SOLiD, whereas the second script is more general.
SplitSeek finds exon-exon boundaries that are supported by several split reads. In this case, we required each junction to be covered by at least two reads with unique starting points. Other parameters that may be specified by the user include the total number of reads required to cover a predicted junction, and the maximum allowed distance between two candidate junction reads that belong to the same predicted splice junction. SplitSeek groups candidate junction reads by traversing them in the order of their genomic coordinates and joining those where the two exon boundaries are both within the allowed distance. All groups in which the number of reads is greater than the user-defined threshold are then reported in the SplitSeek output. In some cases, SplitSeek may require an additional "chrmap" input file to ensure that the chromosome names of SplitSeek predictions agree with those in the genome databases. The user is allowed to specify an upper limit on the distance between the junctions (for example, 100 kb), so that longer splicing events are not reported.
The SplitSeek results are presented in two different formats, as a BED file and a BEDPE file. The BED file can be uploaded and viewed in the UCSC genome browser, whereas the BEDPE file can be used as input to BEDTools  or other analysis software for comparing genomic features. SplitSeek is implemented in perl, and the program is available as Additional file 1. The code also can be downloaded from the SOLiD software-development community . The current version is available for data generated by the SOLiD system, but it could be adapted to Illumina or other next-generation sequencing platforms. What then would be required is to perform a split read alignment and to write all candidate junction reads into a BEDPE formatted file to be processed by SplitSeek.
Calculating False Discovery Rate
To make an estimate of the false discovery rate (FDR) in our results, we assume a null hypothesis in which the two parts of a splice event are uniformly distributed over the genome sequence. We then estimated an FDR for all splicing events within 1 Mb by comparing the observed values with the expected. To calculate the number of expected events, we assume that the first anchor has already been randomly mapped to the genome. In that case, the second anchor must be mapped within a ± 1-Mb window surrounding the first anchor for the criteria to be fulfilled. The size of this window is 2 × 106 bases. Because the mouse reference sequence (mm9) used in the alignment consists of about 2.7 × 109 bases, the probability that two randomly placed splicing boundaries are located within 1 Mb is ~2 × 106/2.7 × 109 ≈ 7.4 × 10-4. Under the null hypothesis, the number of expected splicing events within 1 Mb can therefore be estimated by N × 7.4 × 10-4, where N is the total number of predicted junctions. The FDR is then calculated as the ratio between expected/observed events. In the same way, we calculated the FDR for results within 100 kb. The results are presented in Table 2.
Comparing SplitSeek to RNA-MATE
Version 1.01 of the RNA-MATE program was downloaded from the SOLiD software-development web page , along with junction library files constructed from all known genes, gene predictions, mRNA evidence, and EST evidence available at the time of creation (early 2007). The library files contains ~430,000 putative junctions, each of length 60 bp. The RNA-MATE program was then executed on the same set of reads from the oocyte1 dataset, as was used for SplitSeek. Matching in RNA-MATE was done recursively with 50-bp and 45-bp tag lengths using three allowed mismatches and default settings for all other parameters. The RNA-seq data in this experiment is not strand specific, and therefore, all junction reads from both strands were combined in the RNA-MATE output. All RNA-MATE exon boundaries with at least two reads were considered positive. A positive RNA-MATE junction was considered to coincide with a SplitSeek prediction if the difference was at most 5 bp at both ends of the junction.
expressed sequence tag
false discovery rate
high-throughput sequencing of RNA
- 3' UTR:
three prime untranslated region.
We thank Jonathan Mangion, Applied Biosystems UK, for his helpful suggestions regarding the software implementation. This work was supported by the Swedish Natural Sciences Research Council.
- Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM: Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods. 2008, 5: 613-619. 10.1038/nmeth.1223.PubMedView ArticleGoogle Scholar
- Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ: Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet. 2008, 40: 1413-1415. 10.1038/ng.259.PubMedView ArticleGoogle Scholar
- Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O'Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo ML: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008, 321: 956-960. 10.1126/science.1160342.PubMedView ArticleGoogle Scholar
- Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456: 470-476. 10.1038/nature07509.PubMedPubMed CentralView ArticleGoogle Scholar
- Cloonan N, Xu Q, Faulkner GJ, Taylor DF, Tang DT, Kolle G, Grimmond SM: RNA-MATE: A recursive mapping strategy for high-throughput RNA-sequencing data. Bioinformatics. 2009, 25: 2615-2616. 10.1093/bioinformatics/btp459.PubMedPubMed CentralView ArticleGoogle Scholar
- Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, Wang X, Bodeau J, Tuch BB, Siddiqui A, Lao K, Surani MA: mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009, 6: 377-382. 10.1038/nmeth.1315.PubMedView ArticleGoogle Scholar
- Denoeud F, Aury JM, Da Silva C, Noel B, Rogier O, Delledonne M, Morgante M, Valle G, Wincker P, Scarpelli C, Jaillon O, Artiguenave F: Annotating genomes with massive-scale RNA sequencing. Genome Biol. 2008, 9: R175-10.1186/gb-2008-9-12-r175.PubMedPubMed CentralView ArticleGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25: 1105-1111. 10.1093/bioinformatics/btp120.PubMedPubMed CentralView ArticleGoogle Scholar
- Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, Sam L, Barrette T, Palanisamy N, Chinnaiyan AM: Transcriptome sequencing to detect gene fusions in cancer. Nature. 2009, 458: 97-101. 10.1038/nature07638.PubMedPubMed CentralView ArticleGoogle Scholar
- Maher CA, Palanisamy N, Brenner JC, Cao X, Kalyana-Sundaram S, Luo S, Khrebtukova I, Barrette TR, Grasso C, Yu J, Lonigro RJ, Schroth G, Kumar-Sinha C, Chinnaiyan AM: Chimeric transcript discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci USA. 2009, 106: 12353-12358. 10.1073/pnas.0904720106.PubMedPubMed CentralView ArticleGoogle Scholar
- Chuzhanova NA, Anassis EJ, Ball EV, Krawczak M, Cooper DN: Meta-analysis of indels causing human genetic disease: mechanisms of mutagenesis and the role of local DNA sequence complexity. Hum Mutat. 2003, 21: 28-44. 10.1002/humu.10146.PubMedView ArticleGoogle Scholar
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12: 996-1006.PubMedPubMed CentralView ArticleGoogle Scholar
- Quinlan AR, Hall IM: BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26: 841-842. 10.1093/bioinformatics/btq033.PubMedPubMed CentralView ArticleGoogle Scholar
- UCSC Genome Bioinformatics. [http://genome.ucsc.edu]
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30: 207-210. 10.1093/nar/30.1.207.PubMedPubMed CentralView ArticleGoogle Scholar
- AB WT Analysis Pipeline. [http://solidsoftwaretools.com/gf/project/transcriptome]
- SplitSeek. [http://solidsoftwaretools.com/gf/project/splitseek]
- RNA-MATE. [http://solidsoftwaretools.com/gf/project/rnamate]
- GNU Operating System Licences. [http://www.gnu.org/licenses]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.