SpliceGrapher: detecting patterns of alternative splicing from RNA-Seq data in the context of gene models and EST data
© Rogers et al.; licensee BioMed Central Ltd. 2012
Received: 26 September 2011
Accepted: 31 January 2012
Published: 31 January 2012
We propose a method for predicting splice graphs that enhances curated gene models using evidence from RNA-Seq and EST alignments. Results obtained using RNA-Seq experiments in Arabidopsis thaliana show that predictions made by our SpliceGrapher method are more consistent with current gene models than predictions made by TAU and Cufflinks. Furthermore, analysis of plant and human data indicates that the machine learning approach used by SpliceGrapher is useful for discriminating between real and spurious splice sites, and can improve the reliability of detection of alternative splicing. SpliceGrapher is available for download at http://SpliceGrapher.sf.net.
Deep transcriptome sequencing (RNA-Seq) with next-generation sequencing (NGS) technologies is providing unprecedented opportunities for researchers to probe the transcriptomes of many species [1–5]. An important goal in these studies is to assess the extent of alternative splicing (AS), a process that increases transcriptome and proteome diversity, and plays a key role in regulating gene expression and protein function [6, 7]. Although it is inexpensive and easy to obtain whole transcriptome data using RNA-Seq, one limitation has been the lack of versatile methods to analyze these data. Consequently, there is an increasing demand for methods that can use the short reads produced in these studies to predict patterns of AS.
The sequences produced by NGS methods have characteristics that complicate the task of identifying the mRNA transcripts represented in a sample. A sequencing read may consist of fewer than 40 nucleotides, making it difficult to identify a unique origin within a reference sequence. In addition, NGS base-call error rates tend to increase with read length, raising the chance of a mismatch when aligning a read to a reference sequence . These ambiguities are exacerbated by the presence of paralogous genes that can give rise to reads that align well in multiple locations. Much of the work on analyzing NGS reads has focused on aligning reads within exonic regions, and many methods exist for the problem of aligning reads without gaps-for example, MAQ , PASS  and BowTie .
Reads that span splice junctions introduce additional challenges. A splice junction may occur anywhere within a short read, so the read may have just a few bases on one side of a junction. Such a short sequence may align in multiple locations, making it difficult to identify its true origin. One can use heuristics to restrict the number of candidate locations: for example, by establishing limits for permissible intron lengths, or by focusing on locations that are bounded by canonical GT-AG or GC-AG splice-site dimers. Several spliced alignment algorithms exist that use these and other approaches to identify unique alignments for spliced reads [12–17].
The first studies that used RNA-Seq data to predict AS focused on exon-skipping events, the most prevalent form of AS in mammals (see, for example, [1, 18–21]). To identify splice junctions recapitulated in short read data, these studies used exon sequences flanking annotated splice sites to produce a database of splice junction sequences. Using novel combinations of known acceptor and donor sites, researchers can create a database that consists of both known and putative splice junction sequences. RNA-Seq reads that align to these putative sequences then provide evidence for novel splicing events.
In this work we compare our approach with two methods: Cufflinks , and TAU . Both methods were originally designed for de novo splice form prediction, but can make limited use of existing annotations. TAU predicts splice forms by assembling for each gene all feasible combinations of exons that have been identified in the alignment and spliced-alignment step. This approach ensures that no splice form will be overlooked, but it can produce a large number of transcripts that make it difficult to identify the most realistic predictions. Cufflinks uses a more sophisticated approach that seeks to identify the smallest set of mRNA transcripts that explain the observed data . The Scripture method constructs a transcript graph on the basis of regions that exhibit statistically significant read depth compared to genomic background and predicts splice forms by considering all paths through the graph, an approach similar to TAU's . Methods that perform de novo transcriptome assembly without requiring a reference genome-for example, Trinity  and ABySS -are not included in our comparison.
A few tools, such as Sircah, can produce splice graphs from conventional EST or cDNA alignments . In previous work we enhanced Sircah's AS detection rules and extended the package to provide statistics and protein predictions for genome-wide studies of AS based on EST data . With SpliceGrapher we extend this idea by incorporating multiple forms of data, including RNA-Seq reads, into splice graph predictions.
SpliceGrapher was designed from the outset to integrate RNA-Seq data, annotated gene models and EST alignments to produce comprehensive splice graph predictions. It applies inference rules to generate predictions that are compatible with all available evidence. The package includes flexible visualization tools that can depict splice graphs along with the evidence used to predict them. We compare SpliceGrapher's predictions with splice graphs generated from TAU and Cufflinks output based on RNA-Seq data and find that our results are more consistent with existing evidence from curated gene models.
Results and discussion
Splice graph prediction pipeline
Figure 1 provides an example of a splice graph that SpliceGrapher predicted for a gene in Arabidopsis thaliana, along with the graph constructed from the gene model, the splice junctions that were recapitulated in the RNA-Seq data, and the read depth along the genomic region. SpliceGrapher combined these different forms of evidence and predicted only those novel splicing events it could resolve unambiguously.
Alternative splicing events
A. thaliana models
No gene models
With gene models
No gene models
With gene models
V. vinifera models
No gene models
With gene models
No gene models
With gene models
Comparison with TAU and Cufflinks
To see how SpliceGrapher's AS predictions compared with other tools, we ran the HashMatch/Supersplat/TAU and TopHat/Cufflinks pipelines, both with and without gene models, using the same RNA-Seq data as SpliceGrapher, and assembled splice graphs from their transcript predictions. We note that Cufflinks was designed for mammalian genomes: its assembly algorithm includes heuristics that are based on the characteristics of human and mouse transcripts . TAU was designed to analyze the A. thaliana data we consider here .
Recall of TAIR9 and TAIR10 annotations
Splicegrapher + EST
No gene models
With gene models
No gene models
With gene models
None of the packages was able to predict more than 8% of the new annotations that were added in the TAIR10 version of the Arabidopsis genome annotations (see Table 2 for details). SpliceGrapher achieved better recall than Cufflinks-with or without annotations-despite making fewer overall predictions of new exons and introns; this is a strong indication that SpliceGrapher's predictions have fewer false positives. Surprisingly, the recall level for Cufflinks with gene models was lower than without them. SpliceGrapher also achieved better recall than TAU without annotations. With gene models, TAU achieved the highest recall at the exon and intron levels, but we believe it is a result of a severe over-annotation that was discussed earlier. At the transcript level SpliceGrapher performed best. Its splice graphs were consistent with 28 new splice forms, compared with 4 for Cufflinks and 11 for TAU when provided with TAIR9 annotations. Both packages were unable to correctly predict any transcript without using the gene models. In Additional file 1 we compare Cufflinks and SpliceGrapher predictions to results we have recently obtained using a curated set of full-length cDNAs and ESTs for SR genes in Arabidopsis .
Alternative splicing predictions
SpliceGrapher predicts AS events in proportions that more closely match the gene models than TAU and Cufflinks (Table 1). TAU uses known splice sites to predict all possible putative exons for a gene, resulting in a vast number of IR events that represent 85% of its AS predictions (Figure S2 in Additional file 1 demonstrates the issue). There is a noticeable discrepancy in the rate at which Cufflinks detects exon skipping events. Exon skipping accounts for 9% of the AS events in the gene models, whereas 43% of the novel AS events predicted by Cufflinks with gene models are ES. Detection of an exon skipping event requires a novel splice junction, and therefore depends on accurate splice junction detection. Both TAU and Cufflinks ignore the sequence characteristics of putative junctions, a practice that can easily lead to false positives. SpliceGrapher uses splice site classifiers to ensure the accuracy of reads that span splice junctions. We believe that the lower rate of exon skipping detected by SpliceGrapher is a result of better control of the quality of splice junction alignments. Below we discuss this issue in detail.
Splice junction reads
We compared splice-junction read predictions made by the three packages. We restricted this comparison to junctions within known genes that were bounded by canonical GT-AG or GC-AG splice sites. Accurate splice junction alignments are crucial to the quality of AS predictions. This is illustrated in Figures S4 to S8 in Additional file 1. SpliceGrapher uses its splice-site classifiers to predict donor and acceptor sites that are then used to build a database of putative splice junction sequences. Short reads that align to these sequences provide evidence of splicing events recapitulated in the RNA-Seq data. Supersplat performs spliced alignment by attempting to map each end of a read to locations in the genome. It accepts alignments where a read's ends match genomic sequences with 100% identity and the inferred intron length is within specified limits. TopHat maps spliced reads by splitting the reads into segments, aligning the segments to genomic sequences, and accepting spliced alignments when they infer introns that are bounded by canonical GT-AG splice-site dimers and have lengths within specified limits.
Comparison of splice junctions identified by each package
Canonical junctions (GT-AG/GC-AG) within genes
Junctions in common
Novel junctions with a false-positive site
Novel junctions in common
Canonical junctions (GT-AG/GC-AG) within genes
Junctions in common
Novel junctions with a false-positive site
Novel junctions in common
Recall of splice junctions identified from EST data
We next illustrate that spliced alignment filtering is important, even as read length increases. To do so, we generated 41 million 76-nucleotide reads for A. thaliana and created a set of shorter reads by truncating the 76-nucleotide reads to 32 nucleotides; we then ran TopHat on the two sets. We used the same TopHat parameters in both cases except for the segment length, which we set to 20 nucleotides for the 32-nucleotide reads and 26 nucleotides for the 76-nucleotide reads. The 76-nucleotide reads produced more than four times as many spliced alignments as their truncated counterparts; the shorter reads resulted in a slightly larger number of ungapped alignments since their full-length versions might span splice junctions (see Table S2 in Additional file 1 for details). The additional spliced alignments for 76-nucleotide reads increased the number of novel junctions more than six-fold, but the proportion of false-positive junctions also jumped from 24% to 39%. These results demonstrate the value of increased read length for sensitivity in splice junction detection, and yet longer read length alone does not guarantee accurate splice junction predictions.
Inaccurate spliced-alignment can have a strong impact on the accuracy of AS detection. To demonstrate this, we looked at the splice junctions predicted by TAU and Cufflinks and collected statistics for the AS events associated with those junctions that SpliceGrapher's classifiers identified as false positives (Table S3 in Additional file 1). We identified spurious splice junctions as those that contained at least one false-positive splice site. With gene models, 24% of the AS events predicted by Cufflinks could be attributed to spurious splice junctions, compared to 59% without gene models. Note that the percentages were computed with respect to AS events that are not in the TAIR9 annotations, so do not depend on the algorithm's access to the annotations. With TAU we also saw a decrease in the percentage of false positive events when provided with gene models.
Each of the three packages use somewhat different criteria for accepting splice-junction alignments. Both Supersplat and TopHat allow users to control maximum allowed intron length; SpliceGrapher on the other hand only accepts splice junctions that are within a single gene. We have found many examples where TAU and Cufflinks predicted splice junctions that span two genes; with TAU for example, we found 309 cases of splice junctions that span multiple genes in A. thaliana and 341 cases in V. vinifera.
The addition of EST data increased the number of AS events from 7,388 to 9,916 and the number of genes where AS is observed from 4,901 to 6,162. The graphs augmented with EST alignment information contain novel AS events in proportions that were much closer to the proportions in the TAIR9 models than graphs predicted using RNA-Seq alone. Notably, the proportion of IR events is the same as in the gene models, demonstrating the utility of ESTs for detecting IR events. Finally, the level of recall of TAIR10 annotation has resulted in a large increase with the addition of the EST data (Table 2).
Applying SpliceGrapher to mammalian genomes
To demonstrate SpliceGrapher's utility with mammalian genomes, we used a subset of the human RNA-Seq data used in  to compare AS in 60 individuals. We focused on the two individuals for whom a much larger set of reads was available. The data consisted of 74 million paired-end 35-nucleotide reads from two individuals: Caucasian (35.7 million reads) and Yoruban (38.3 million reads). ENSEMBL gene models were used as additional input to SpliceGrapher.
Based on Homo sapiens gene models, SpliceGrapher's classifiers achieved ROC scores of 0.97, 0.95 and 0.91 for GT, AG and GC splice sites, respectively. These classifiers predicted 25.3 million potential splice sites out of 105.2 million dimers that occurred within genes. Because of the longer gene length in human, the approach of creating a splice junction database was not feasible. Instead, we used TopHat to perform spliced alignments and filtered the results using our predicted splice site database.
We ran TopHat with default parameter settings to align both data sets. Our classifiers flagged close to 60% of the splice junctions in both datasets as false-positives (Table S4 in Additional file 1). This is in comparison to 14% in Arabidopsis and 58% in grape, which is further illustration of the potential pitfalls in splice junction alignment of short reads.
Using the filtered TopHat splice junction alignments, we found 1,099 novel AS events in the Caucasian sample and 1,154 novel AS events in the Yoruban sample. The original paper reported statistics on 110 AS events in genes where genetic variation impacted differential alternative expression, while our predictions span the whole genome. The various types of AS are predicted at rates similar to those in the gene models with the notable exception of IR, which was relatively hard to predict from plant RNA-Seq as well. The number of novel AS predictions in the two samples were similar for each AS type. Full details are provided in Table S5 in Additional file 1; Figure S10 in Additional file 1 shows an example where we identified a difference in the observed pattern of AS in the two samples.
We have presented SpliceGrapher, a Python scripting package designed to enhance existing gene annotations by predicting splice graphs from RNA-Seq and EST data. In addition, SpliceGrapher includes modules for identifying AS events and for viewing predicted splice graphs along with the evidence used to generate them.
We compared SpliceGrapher to TAU and Cufflinks; unlike these packages, which make somewhat limited use of gene models, SpliceGrapher exploits gene annotations fully to construct its splice graphs. This allows SpliceGrapher to make meaningful predictions even for genes that have low read coverage, and helps us resolve AS events that are otherwise hard to detect from short-read data. Prediction of AS requires accurate alignment across splice junctions, which is a difficult task when read length is short. Our work demonstrates that using tools that do not use information about splice site characteristics can lead to a large number of possibly erroneous alignments. SpliceGrapher addresses this issue by using a machine learning approach to construct a database of potential splice sites that can be used to filter the output of a spliced alignment method such as TopHat, or to construct a database of potential splice junctions that allow the use of standard mapping tools that do not perform spliced alignment. The latter approach is only feasible for more compact genomes like the plant genomes we analyzed here.
We have demonstrated SpliceGrapher on RNA-Seq data from human and two plants, A. thaliana and V. vinifera. Plant genomes have different architectures from mammalian genomes and are therefore characterized by different types of AS events. In particular, IR is rare in mammalian genomes, but is the dominant form of AS in plants. Most existing work on predicting AS from short-read data has focused on exon skipping in mammals [1, 18, 19, 41–44]. We have shown that SpliceGrapher is able to predict different forms of AS, although the rate at which it detects intron retention is lower than its observed frequency in existing annotations. This is due to the inherent difficulty in predicting intron retention from RNA-Seq data, which requires consistent read coverage across an entire intron.
Materials and methods
In our experiments, we used publicly available short-read sequence data for the plants A. thaliana and V. vinifera. We downloaded FASTQ sequence files containing 284 million A. thaliana reads (accession number [SRA:009031]) and 59 million 36-nucleotide V. vinifera reads (accession number [SRX:012280]) from the NCBI Sequence Read Archive . A. thaliana reads were trimmed to 32 nucleotides as recommended in . As baseline annotation for A. thaliana we used the TAIR9 genome annotations ; the newer TAIR10 annotations were used for evaluation. For V. vinifera we used the Genoscope version 2 annotations and sequences, using only those sequences for the 19 best-characterized chromosomes. Additional RNA-Seq data were generated for A. thaliana as follows. Total RNA from 2-week-old A. thaliana (ecotype Columbia) seedlings grown on MS (Murashige and Skoog medium) plates was isolated using RNeasy Plant Mini Kit from Qiagen (Valencia, CA, USA). To remove any contaminating DNA, RNA was treated with DNAse . Isolation of poly (A) mRNA and preparation of a cDNA library were carried out using the Illumina TrueSeq RNA kit. Sequencing (72 cycle) was done on an Illumina Genome Analyzer II. This dataset, which altogether has 41 million reads, has been deposited in NCBI's Gene Expression Omnibus , and is accessible through GEO series accession number [GSE:32318].
The SpliceGrapher pipeline
SpliceGrapher's pipeline for predicting splice graphs uses several forms of data. Gene models are loaded directly into a splice graph that serves as a baseline for interpreting the RNA-Seq and EST data. Data from RNA-Seq experiments is aligned to the reference genome; ungapped alignment is carried out first, and all unaligned reads undergo spliced alignment, to infer reads that span splice junctions. SpliceGrapher can accept short-read alignments provided by a variety of tools [9–12, 14–17]. EST and cDNA sequences are aligned using conventional sequence alignment tools [39, 49] before SpliceGrapher includes the alignments in a splice graph. As SpliceGrapher loads each form of data, it integrates the new evidence with the gene models to construct a splice graph.
Splice graph construction
Each of the data sources used by SpliceGrapher-genome annotations, short-read data, and EST alignments-requires a distinct interpretation for splice graph construction. For the following discussion of the procedures used to interpret each type of data we borrow terminology from  and refer to exons that have explicit acceptor and donor splice sites as 'bounded' and those with an undefined acceptor or donor splice site as 'unbounded'. When one graph element, such as an exon or an intron, falls completely within the genomic coordinates of another graph element, we say that it is 'contained' within the other element.
SpliceGrapher accepts gene model annotations in GFF3 format  and constructs graphs using the sequence coordinates found in UTR, CDS and exon records. It interprets exon records as bounded exons and incorporates them directly into a splice graph. SpliceGrapher infers an intron whenever the corresponding exons are adjacent in a transcript. When an exon appears in multiple transcripts, a single exon node is created along with edges that link it to the exons that follow or precede it in the corresponding transcripts.
To incorporate RNA-Seq data into a splice graph, SpliceGrapher loads read alignments in SAM format  for all reads that map within the boundaries of some annotated gene. It then constructs exons (hereafter referred to as 'short-read exons') from clusters of contiguous ungapped alignments where the read depth remains above a minimum threshold (the default is 0). A short-read exon is discarded if it is contained within an existing exon. If a short-read exon does not extend exactly to known or predicted acceptor or donor splice sites, SpliceGrapher will extend it to the nearest splice site that is known or has support from spliced reads.
Because they are longer than RNA-Seq reads, ESTs provide more reliable transcriptional evidence. In our framework we run EST alignments through a series of processing steps designed to remove alignment artifacts  and convert the alignments directly to splice-graphs using the same procedures we developed for gene models. Exons at the 3' or 5' end of an EST are considered unbounded. These are merged with other exons in the graph according to the following rules. An exon that is unbounded at its 3' end is merged with another exon if they share the same acceptor site and both exons are unbounded at the 3' end. If one of the exons is bounded, the exons will be merged only if the unbounded exon is contained within the bounded exon. Analogous rules apply to exons unbounded at the 5' end. SpliceGrapher infers an intron between exons whenever they are adjacent in an EST.
Alternative splicing inference from RNA-Seq
Because of their short length, RNA-Seq data cannot be unambiguously interpreted as splice-graphs and AS events (Figure 3). SpliceGrapher's approach is to use as much data as possible to make confident predictions, and to annotate AS events as unresolved if the evidence does not clearly support a specific isoform. SpliceGrapher applies inference rules in the order presented in the following sections.
IR is arguably the most challenging form of AS to infer from RNA-Seq data. SpliceGrapher infers IR events from RNA-Seq evidence in two ways that exploit information from the gene models. When short-read coverage remains above a desired threshold across an intron's full length, it is evidence that the intron was retained in some transcripts (Figure 1). In this case the intron is excised in the constitutive form represented by the gene model. In an alternative scenario shown in Figure S11 in Additional file 1, the intron is retained in the known constitutive form. We detect this form of IR when a known exon has a novel splice junction within it.
When SpliceGrapher infers a novel IR event, it must identify unique exon boundaries for three exons: the longer exon in which the intron is retained, and the two shorter exons that flank the intron when it is excised. Usually the gene model provides good evidence for these boundaries, but in some cases it may not be possible to resolve them unambiguously.
When short-read coverage remains high across an intron's full length, SpliceGrapher will create a short-read exon that spans the intron. Its next task is to determine the exon's correct boundaries. SpliceGrapher first finds all exons in the graph that overlap the short-read exon. If one upstream exon overlaps its 5' end and one downstream exon overlaps its 3' end, SpliceGrapher creates a new exon whose boundaries are the acceptor site from the upstream exon and the donor site from the downstream exon. If more than one exon overlaps either end of the short-read exon, it is still possible to infer the new exon's boundaries provided all overlapping upstream exons share the same acceptor site and overlapping downstream exons share the same donor site (Figure S12 in Additional file 1). If the boundary at either end is ambiguous, SpliceGrapher creates an unresolved IR event. When a junction is used to infer an IR event through the scenario of a splice junction within an exon, SpliceGrapher must identify the boundaries for two new exons, which is performed in a manner analogous to the single exon case.
In some cases read coverage may remain high across two or more introns in succession, making it impossible to determine which of several possible splice forms is correct (Figure 3). In these cases, SpliceGrapher annotates the corresponding short-read exon as unresolved.
Alternative 3' and 5' events
When an intron is excised at more than one splice site, changing the boundaries of one of its flanking exons, we have evidence of an alternative 3' or 5' event. SpliceGrapher uses two forms of evidence to infer alternative 3'/5' events. When a short-read exon overlaps an existing exon but extends beyond its 3' or 5' end, it provides evidence for an alternative donor or acceptor site. In addition, when a novel splice junction appears between two exons it provides evidence for a novel intron (Figure S12 in Additional file 1). SpliceGrapher requires both forms of evidence to infer a novel alternative splice site. Below we describe the procedure for inferring an alternative acceptor (3' site). The procedure for an alternative donor (5' site) is analogous.
When a short-read exon overlaps an existing exon and extends into its upstream intron, it is evidence that the exon boundaries changed in some transcripts. To identify an acceptor site for the new exon, SpliceGrapher looks for junctions that have acceptor sites within the same intron, upstream of the short-read exon (Figure S12 in Additional file 1). SpliceGrapher then uses the acceptor site nearest the short-read exon as its acceptor site. If it finds no acceptor sites within the intron, SpliceGrapher annotates the short-read exon as unresolved.
If SpliceGrapher can resolve a new exon's acceptor site, it must resolve its donor site as well. The procedure is the same as that for identifying retained intron boundaries in the previous section. If one downstream exon overlaps the new exon's 3' end, SpliceGrapher uses the downstream exon's donor site as the new exon's donor site. If more than one downstream exon overlaps the new exon's 3' end, SpliceGrapher can still resolve its donor site provided all overlapping exons share the same donor site (Figure S12 in Additional file 1). If the overlapping exons have different donor sites, SpliceGrapher cannot resolve the new exon's donor, and it annotates the exon as unresolved.
An exon skipping event occurs when an exon is excised from some transcripts but included in others. SpliceGrapher infers skipped exons in two different ways. In the first scenario the exon is included in the known constitutive form represented in the gene model. If a novel splice junction spans the existing exon, it is evidence that the exon was skipped in some transcripts (see top panel of Figure S13 in Additional file 1). If the novel junction's acceptor and donor sites match those of established exons, SpliceGrapher adds the new intron to the graph and annotates the skipped exon.
An alternative scenario is when the exon is skipped in the constitutive form (bottom panel of Figure S13 in Additional file 1). In this case, if a short-read exon falls within an intron and is flanked by two novel junctions, it is evidence for a novel exon that is skipped in some transcripts. These clues may not provide enough evidence to resolve the event, so SpliceGrapher tries to associate the upstream junction's donor site and the downstream junction's acceptor site with exons in the graph. If this first step is successful, SpliceGrapher uses the upstream junction's acceptor site as the new exon's acceptor site and the downstream junction's donor site as the new exon's donor site. If it is unable to resolve the junctions, SpliceGrapher annotates the short-read exon as unresolved.
Splice graph assembly
The above rules are used to infer exons from gene models, ESTs and short-read data. The final step consists of adding edges (introns) that connect consecutive exons. For each newly predicted exon, SpliceGrapher looks for other exons in the graph that use the same acceptor site. If any of these exons has an intron from its acceptor site to a neighboring exon, SpliceGrapher adds an edge from the neighboring exon to the new exon. It repeats the process for all exons that have the same donor site. SpliceGrapher uses an analogous procedure to infer introns for the donor site of a new exon. Graphs are stored using a GFF file format.
Splice graph comparisons
TAU and Cufflinks predict transcripts that we convert into splice graphs associated with annotated genes. As de novo prediction tools, both TAU and Cufflinks make predictions without regard to known gene boundaries. Thus, to perform a meaningful comparison we associate each transcript with one or more genes that it overlaps. Once transcripts are associated with annotated genes, they are converted to splice graphs using the same procedure SpliceGrapher uses to assemble splice graphs from ESTs.
We define R(A|B) = 1 when B is the empty set. To compare two splice graphs we compute recall for exons and introns separately.
Short read alignment
SpliceGrapher includes modules that allow it to incorporate RNA-Seq data into splice graph predictions. It accepts short read alignments in SAM format . For this project we used PASS  to perform short-read alignments as we found it to be fast and accurate in preliminary tests on synthetic data.
To provide high-confidence splice graph predictions, we enforced strict criteria for accepting alignments. We accepted only reads that aligned to the genomic reference with 100% identity at a unique location. To increase confidence in these alignments, we also required that a read align nowhere else in the genome at 90% identity or above (a procedure adapted from ). We used uniquely aligned reads to identify exonic regions, and unaligned reads for splice-junction alignments.
Splice junction reads
SpliceGrapher performs spliced alignment by first constructing a database of splice junction sequences that are formed by concatenating the sequence directly upstream of a donor site with the sequence directly downstream of an acceptor site. We distinguish three kinds of splice junctions: 'known' junctions that are derived from gene model annotations; 'recombined' junctions constructed from novel combinations of known splice sites; and 'predicted' junctions in which one or both splice sites are novel. SpliceGrapher constructs junction sequences by pairing every known and predicted donor site in a gene with all known and predicted acceptor sites downstream of it in the same gene. This procedure has been used in other studies (see, for example, [1, 18, 19]) to construct a database of known and recombined splice junctions from known splice sites. SpliceGrapher extends this procedure to include predicted sites as well.
To avoid spurious alignments, we require that junction-crossing reads align with a minimum 10-nucleotide overlap on either side of a junction. Adopting nomenclature from , we refer to the 10-nucleotide region on either side of a junction as the 'anchor' region. For reads of length n and a required overlap size p, we can enforce this constraint by generating splice junction sequences of length 2(n-p). For example, for 32-nucleotide reads and a required 10-nucleotide minimum overlap, we generated 44-nucleotide sequences (22 nucleotides on either side of a junction). To improve sensitivity, we accepted alignments with 90% identity overall, provided they aligned with 100% identity within the anchor region.
The three pipelines each used slightly different alignment criteria. Our SpliceGrapher pipeline allowed mismatches in exonic alignments so that we could identify and eliminate reads that aligned well to multiple locations. We also accepted mismatches in splice-junction alignments to improve sensitivity, provided there were no mismatches within anchor regions. We used TopHat parameters that allowed us to duplicate our own alignment criteria, though TopHat accepts mismatches within splice-junction anchor regions. Supersplat accepts only reads that align with 100% identity within anchor regions.
Splice site prediction
We classify splice sites using support vector machines, using an approach that has been shown to be highly accurate in splice site prediction [12, 15, 52, 53]. To create splice site classifiers, SpliceGrapher extracts positive and negative example sequences for splice site donor and acceptor dimmers such as GT, GC and AG following the procedure described in . Briefly, we use known splice sites as positive examples for a given dimer and for negative examples we use all other occurrences of the dimer found within genes. Training examples then consist of intronic and exonic sequences taken from either side of a site.
SpliceGrapher's classifiers discriminate sequences using an implementation of the weighted-degree kernel . This kernel represents a sequence in a feature space of k-mers associated with positions in the sequence. Kernel parameters include exon and intron sequence lengths on either side of a splice site, k-mer length, number of mismatches to allow within a k-mer, and whether to allow shifts in k-mer position (for a detailed overview, see ). SpliceGrapher iterates over combinations of these parameters to identify the best-performing combination. It uses the PyML package  to train and test these support vector machines using the training examples described above. SpliceGrapher executes this procedure for all given acceptor and donor dimers to yield a classifier for each one. It then applies each classifier to corresponding dimers in the genomic reference sequences. Locations classified as splice sites contribute a pool of predicted sites that SpliceGrapher combines with known splice sites when it generates splice junction sequences. The Palmapper package  uses a similar set of classifiers, but evaluates acceptor and donor sites during spliced alignment instead of creating a splice junction database.
In predicting splice sites we focused on canonical GT and GC donor sites and AG acceptor sites. For A. thaliana, we used SpliceGrapher with TAIR9 annotations to extract 122,534 annotated GT, GC and AG splice sites. SpliceGrapher created splice site classifiers for GT and GC donor sites and AG acceptor sites. These classifiers achieved ROC scores of 0.97, 0.97 and 0.95 for GT, GC and AG sites, respectively, in five-fold cross-validation. ROC curves are provided in Figure S9 in Additional file 1. These classifiers predicted novel splice sites for 878,994 out of 9,753,440 dimers (9%) found within genes in the TAIR9 reference sequences. Combining these predicted sites with known splice sites, we generated a splice junction database with 8.2 million junctions, out of which 122,534 are known and 586,704 are recombined.
For V. vinifera, SpliceGrapher extracted 125,080 annotated GT, GC and AG splice sites from the gene models. The splice-site classifiers achieved ROC scores of 0.91, 0.81 and 0.88 for GT, GC and AG sites, respectively, in five-fold cross-validation. As V. vinifera is not as well annotated as A. thaliana, we decided to use the EST alignments described earlier to obtain better splice site models. We used SpliceGrapher to extract 84,811 splice sites from these alignments, and used them to improve our splice site models. The resulting classifiers achieved ROC scores of 0.98, 0.92 and 0.96 for GT, GC and AG sites, respectively, considerably higher than those based on the gene models. The ROC curves are shown in Figure S9 in Additional file 1. These classifiers predicted 2.2 million putative splice sites out of 11.2 million dimers (20%) within genes in the reference sequences. The resulting splice junction database contains 91 million splice junction sequences, out of which 125,080 are known, and 616,314 are recombined.
We ran HashMatch and Supersplat  on the two data sets, following the procedures outlined in . We first used HashMatch to perform ungapped read alignment. We then perform spliced alignment with Supersplat using those reads that did not result in a match in ungapped alignment. Supersplat performs alignment without regard for splice site dimers, so for a more realistic comparison we accepted only alignments that spanned canonical dimers. Supersplat accepts minimum and maximum intron lengths and anchor region size. It is prudent to select conservative intron sizes to prevent TAU from generating an inordinate number of transcripts. For A. thaliana we established an intron size range of 40 to 5,000, as 99.9% of known introns fell within this range. For V. vinifera the same criterion yielded an intron size range of 55 to 15,000. To fix a lower bound for overlaps on either side of a junction, we set the minimum anchor size to 10, and accepted only reads that aligned to a unique splice junction. All experiments used Supersplat version 1.0 and TAU version 1.4 (no version information was available for HashMatch).
TopHat and Cufflinks were designed for mammalian genomes and thus rely on some heuristics that are based on mammalian gene expression statistics and gene architecture. For example, TopHat's heuristic filter for spliced alignments is based on the observation that, in humans, minor splice forms usually have expression levels that are at least 15% as high as those of their corresponding major splice forms . Another heuristic embedded into TopHat is to report only alignments across GT-AG introns for reads shorter than 75 nucleotides.
We ran TopHat on the A. thaliana data using parameters that reflected the requirements we set for our own alignments: no multi-hits (-g 1), minimum anchor length 10 (-a 10), and minimum and maximum intron length 40 and 5,000, respectively (55 and 15,000 for V. vinifera). TopHat splits reads into segments for part of its search. To permit reads to align at 90% identity, we set the segment length to 20 (--segment-length = 20) and allowed up to 2 mismatches per segment (--segment-mismatches = 2). To eliminate the heuristic filter associated with mammalian genomes, we set the minimum normalized depth to 0 (-F 0). To make our comparison as fair as possible, we also ran TopHat with --segment-mismatches = 0 to force it to accept only alignments with 100% identity. We then approximated our own alignment criteria by using TopHat's ungapped alignments at 100% identity and its spliced alignments at 90% identity. The remaining difference between TopHat's splice-junction alignments and ours was a subset of TopHat read alignments that contained mismatches in the anchor region. All experiments used Bowtie version 4.1.2, TopHat version 1.3.3 and Cufflinks version 1.1.0.
expressed sequence tag
general feature format
National Center for Biotechnology Information
receiver operating characteristic
The Arabidopsis Information Resource
This project is supported by NSF ABI grant 0743097.
- Mortazavi A, Williams B, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.PubMedView ArticleGoogle Scholar
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 57-63. 10.1038/nrg2484.PubMedPubMed CentralView ArticleGoogle Scholar
- Filichkin S, Priest H, Givan S, Shen R, Bryant D, Fox S, Wong W, Mockler T: Genome-wide mapping of alternative splicing in Arabidopsis thaliana. Genome Res. 2010, 20: 45-10.1101/gr.093302.109.PubMedPubMed CentralView ArticleGoogle Scholar
- Harr B, Turner L: Genome-wide analysis of alternative splicing evolution among Mus subspecies. Mol Ecol. 2010, 19: 228-239.PubMedView ArticleGoogle Scholar
- Ramani A, Calarco J, Pan Q, Mavandadi S, Wang Y, Nelson A, Lee L, Morris Q, Blencowe B, Zhen M, Fraser A: Genome-wide analysis of alternative splicing in Caenorhabditis elegans. Genome Res. 2011, 21: 342-10.1101/gr.114645.110.PubMedPubMed CentralView ArticleGoogle Scholar
- Stamm S, Ben-Ari S, Rafalska I, Tang Y, Zhang Z, Toiber D, Thanaraj T, Soreq H: Function of alternative splicing. Gene. 2005, 344: 1-20.PubMedView ArticleGoogle Scholar
- Hallegger M, Llorian M, Smith CWJ: Alternative splicing: global insights. FEBS J. 2010, 277: 856-866. 10.1111/j.1742-4658.2009.07521.x.PubMedView ArticleGoogle Scholar
- Shendure J, Ji H: Next-generation DNA sequencing. Nat Biotechnol. 2008, 26: 1135-1145. 10.1038/nbt1486.PubMedView ArticleGoogle Scholar
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008, 18: 1851-10.1101/gr.078212.108.PubMedPubMed CentralView ArticleGoogle Scholar
- Campagna D, Albiero A, Bilardi A, Caniato E, Forcato C, Manavski S, Vitulo N, Valle G: PASS: a program to align short sequences. Bioinformatics. 2009, 25: 967-10.1093/bioinformatics/btp087.PubMedView ArticleGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R25-10.1186/gb-2009-10-3-r25.PubMedPubMed CentralView ArticleGoogle Scholar
- De Bona F, Ossowski S, Schneeberger K, Rätsch G: Optimal spliced alignments of short sequence reads. BMC Bioinformatics. 2008, 9: O7-10.1186/1471-2105-9-S10-O7.View ArticleGoogle Scholar
- Yassour M, Kaplan T, Fraser H, Levin J, Pfiffner J, Adiconis X, Schroth G, Luo S, Khrebtukova I, Gnirke A, Nusbaum C, Thompson D, Friedman N, Regev A: Ab initio construction of a eukaryotic transcriptome by massively parallel mRNA sequencing. Proc Natl Acad Sci USA. 2009, 106: 3264-10.1073/pnas.0812841106.PubMedPubMed CentralView ArticleGoogle Scholar
- Trapnell C, Pachter L, Salzberg S: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25: 1105-1111. 10.1093/bioinformatics/btp120.PubMedPubMed CentralView ArticleGoogle Scholar
- Jean G, Kahles A, Sreedharan V, Bona F, Rätsch G: RNA-Seq Read Alignments with PALMapper. Curr Protocols Bioinformatics. 2010, 32: 11.6.1-11.6.37.Google Scholar
- Wang K, Singh D, Zeng Z, Coleman S, Huang Y, Savich G, He X, Mieczkowski P, Grimm S, Perou C, MacLeod J, Chiang D, Prins J, Liu J: MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 2010, 38: e178-10.1093/nar/gkq622.PubMedPubMed CentralView ArticleGoogle Scholar
- Bryant D, Shen R, Priest H, Wong W, Mockler T: Supersplat-spliced RNA-seq alignment. Bioinformatics. 2010, 26: 1500-10.1093/bioinformatics/btq206.PubMedPubMed CentralView ArticleGoogle Scholar
- Pan Q, Shai O, Lee L, Frey B, Blencowe B: Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet. 2008, 40: 1413-1415. 10.1038/ng.259.PubMedView ArticleGoogle Scholar
- Sultan M, Schulz M, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O'Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo M: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008, 321: 956-959. 10.1126/science.1160342.PubMedView ArticleGoogle Scholar
- Wang E, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore S, Schroth G, Burge C: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456: 470-476. 10.1038/nature07509.PubMedPubMed CentralView ArticleGoogle Scholar
- Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, Wang X, Bodeau J, Tuch B, Siddiqui A, Lao K, Surani M: mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009, 6: 377-382. 10.1038/nmeth.1315.PubMedView ArticleGoogle Scholar
- Trapnell C, Williams B, Pertea G, Mortazavi A, Kwan G, van Baren M, Salzberg S, Wold B, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28: 511-515. 10.1038/nbt.1621.PubMedPubMed CentralView ArticleGoogle Scholar
- Guttman M, Garber M, Levin J, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol M, Gnirke A, Nusbaum C, Rinn J, Lander E, Regev A: Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol. 2010, 28: 503-510. 10.1038/nbt.1633.PubMedPubMed CentralView ArticleGoogle Scholar
- Grabherr M, Haas B, Yassour M, Levin J, Thompson D, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren B, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011, 29: 644-652. 10.1038/nbt.1883.PubMedPubMed CentralView ArticleGoogle Scholar
- Simpson J, Wong K, Jackman S, Schein J, Jones S, Birol I: ABySS: a parallel assembler for short read sequence data. Genome Res. 2009, 19: 1117-10.1101/gr.089532.108.PubMedPubMed CentralView ArticleGoogle Scholar
- Heber S, Alekseyev M, Sze S, Tang H, Pevzner P: Splicing graphs and EST assembly problem. Bioinformatics. 2002, 18: 181-188.View ArticleGoogle Scholar
- Xing Y, Resch A, Lee C: The multiassembly problem: reconstructing multiple transcript isoforms from EST fragment mixtures. Genome Res. 2004, 14: 426-10.1101/gr.1304504.PubMedPubMed CentralView ArticleGoogle Scholar
- Sammeth M, Valiente G, Guigo R: Bubbles: alternative splicing events of arbitrary dimension in splicing graphs. Lecture Notes Comput Sci. 2008, 4955: 372-10.1007/978-3-540-78839-3_32.View ArticleGoogle Scholar
- Harrington E, Bork P: Sircah: a tool for the detection and visualization of alternative transcripts. Bioinformatics. 2008, 24: 1959-10.1093/bioinformatics/btn361.PubMedView ArticleGoogle Scholar
- Bonizzoni P, Mauri G, Pesole G, Picardi E, Pirola Y, Rizzi R: Detecting alternative gene structures from spliced ESTs: a computational approach. J Comput Biol. 2009, 16: 43-66. 10.1089/cmb.2008.0028.PubMedView ArticleGoogle Scholar
- Labadorf A, Link A, Rogers M, Thomas J, Reddy A, Ben-Hur A: Genome-wide analysis of alternative splicing in Chlamydomonas reinhardtii. BMC Genomics. 2010, 11: 114-10.1186/1471-2164-11-114.PubMedPubMed CentralView ArticleGoogle Scholar
- Richardson D, Rogers M, Labadorf A, Ben-Hur A, Guo H, Paterson A, Reddy A: Comparative analysis of serine/arginine-rich proteins across 27 eukaryotes: insights into subfamily classification and extent of alternative splicing. PLoS ONE. 2011, 6: e24542-10.1371/journal.pone.0024542.PubMedPubMed CentralView ArticleGoogle Scholar
- Zenoni S, Ferrarini A, Giacomelli E, Xumerle L, Fasoli M, Malerba G, Bellin D, Pezzotti M, Delledonne M: Characterization of transcriptional complexity during berry development in Vitis vinifera using RNA-Seq. Plant Physiol. 2010, 152: 1787-10.1104/pp.109.149716.PubMedPubMed CentralView ArticleGoogle Scholar
- Reddy A: Alternative splicing of pre-messenger RNAs in plants in the genomic era. Annu Rev Plant Biol. 2007, 58: 267-294. 10.1146/annurev.arplant.58.032806.103754.PubMedView ArticleGoogle Scholar
- Wang B, Brendel V: Genomewide comparative analysis of alternative splicing in plants. Proc Natl Acad Sci USA. 2006, 103: 7175-10.1073/pnas.0602039103.PubMedPubMed CentralView ArticleGoogle Scholar
- Kim E, Magen A, Ast G: Different levels of alternative splicing among eukaryotes. Nucleic Acids Res. 2007, 35: 125-10.1093/nar/gkm529.PubMedPubMed CentralView ArticleGoogle Scholar
- Boguski M, Lowe T, Tolstoshev C: dbEST-database for "expressed sequence tags". Nat Genet. 1993, 4: 332-333. 10.1038/ng0893-332.PubMedView ArticleGoogle Scholar
- PlantGDB http://plantgdb.org/
- Wu T, Watanabe C: GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005, 21: 1859-10.1093/bioinformatics/bti310.PubMedView ArticleGoogle Scholar
- Montgomery S, Sammeth M, Gutierrez-Arcelus M, Lach R, Ingle C, Nisbett J, Guigo R, Dermitzakis E: Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010, 464: 773-777. 10.1038/nature08903.PubMedView ArticleGoogle Scholar
- Blencowe B, Ahmad S, Lee L: Current-generation high-throughput sequencing: deepening insights into mammalian transcriptomes. Genes Dev. 2009, 23: 1379-10.1101/gad.1788009.PubMedView ArticleGoogle Scholar
- Huang W, Khatib H: Comparison of transcriptomic landscapes of bovine embryos using RNA-Seq. BMC Genomics. 2010, 11: 711-10.1186/1471-2164-11-711.PubMedPubMed CentralView ArticleGoogle Scholar
- Wang L, Xi Y, Yu J, Dong L, Yen L, Li W: A statistical method for the detection of alternative splicing using RNA-Seq. PLoS ONE. 2010, 5: e8529-10.1371/journal.pone.0008529.PubMedPubMed CentralView ArticleGoogle Scholar
- Richard H, Schulz M, Sultan M, Nürnberger A, Schrinner S, Balzereit D, Dagand E, Rasche A, Lehrach H, Vingron M, Haas S, Yaspo M: Prediction of alternative isoforms from exon expression levels in RNA-Seq experiments. Nucleic Acids Res. 2010, 38: e112-10.1093/nar/gkq041.PubMedPubMed CentralView ArticleGoogle Scholar
- NCBI Sequence Read Archive. [http://www.ncbi.nlm.nih.gov/sra]
- Swarbreck D, Wilks C, Lamesch P, Berardini T, Garcia-Hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L, Radenbaugh A, Singh S, Swing V, Tissier C, Zhang P, Huala E: The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 2008, 36: D1009-PubMedPubMed CentralView ArticleGoogle Scholar
- Palusa S, Ali G, Reddy A: Alternative splicing of pre-mRNAs of Arabidopsis serine/arginine-rich proteins: regulation by hormones and stresses. Plant J. 2007, 49: 1091-10.1111/j.1365-313X.2006.03020.x.PubMedView ArticleGoogle Scholar
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Holko M, Ayanbule O, Yefanov A, Andrey , Soboleva : NCBI GEO: archive for functional genomics data sets-10 years on. Nucleic Acids Res. 2011, 39: D1005-D1010. 10.1093/nar/gkq1184.PubMedPubMed CentralView ArticleGoogle Scholar
- Kent W: BLAT-the BLAST-like alignment tool. Genome Res. 2002, 12: 656-PubMedPubMed CentralView ArticleGoogle Scholar
- Eilbeck K, Mungall C, Lewis S, Ashburner M: The Sequence Ontology Project 2009. [http://www.sequenceontology.org/gff3.shtml]
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25: 2078-10.1093/bioinformatics/btp352.PubMedPubMed CentralView ArticleGoogle Scholar
- Rätsch G, Sonnenburg S: Accurate splice site detection for Caenorhabditis elegans. Kernel Methods in Computational Biology. Edited by: Schölkopf B, Tsuda K, Vert JP. 2004, MIT Press, 277-Google Scholar
- Rätsch G, Sonnenburg S, SchÄolkopf B: RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics. 2005, 21: i369-i377. 10.1093/bioinformatics/bti1053.PubMedView ArticleGoogle Scholar
- Ben-Hur A, Ong C, Sonnenburg S, Schölkopf B, Rätsch G: Support vector machines and kernels for computational biology. PLoS Comput Biol. 2008, 4: e1000173-10.1371/journal.pcbi.1000173.PubMedPubMed CentralView ArticleGoogle Scholar
- PyML-machine learning in Python. [http://pyml.sourceforge.net/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.