Cross-kingdom patterns of alternative splicing and splice recognition

A comprehensive survey of alternate splicing across 42 eukaryotes so as to gain insight into how spliceosomal introns are recognized.


Background
Intron splicing occurs in all domains of life, but the splicing methods employed and the frequencies of splicing vary among organisms. Bacteria and archaea lack the spliceosomal pathway and splice infrequently via self-splicing introns. Among unicellular eukaryotes, there is substantial range in splicing frequency [1,2]. Many early-branching eukaryotes, including the protists Giardia, Cryptosporidia, Trypanosoma, Entamoeba, and Trichomonas, have few or no introns. Only 5% of genes are spliced in Saccharomyces cerevisiae [3], a yeast, while the average number of introns per gene among other fungi is generally low (with a few noteworthy exceptions). Their average intron density ranges from just over one in Schizosaccharomyces pombe to approximately five in Cryptococcus neoformans [4]. Protists  The number of introns and recognized splice sites may vary between individual mRNA transcripts of a single gene, giving rise to the phenomena of splice variation and alternative splicing. In this paper we use 'splice variation' to describe any difference in intron processing, reserving the term 'alternative splicing' for splice variation that is regulated and functionally significant. Observed splice variation is a combination of programmed alternative splicing events and splicing errors. Functional alternative splicing may result from various causes, including ontogenic changes and environmental stimuli. As the number of genes in an organism is not well correlated with its complexity, alternative splicing may provide an additional layer of regulation that permits greater complexity in higher organisms [5]. Multicellular organisms may generate different splice forms of the same gene in different tissues, or even within different cells in the same tissue [5,6]. More recently, it has also been demonstrated that alternative splicing can vary between individuals in a heritable manner [7].
The profile of splice variants in a given organism is likely influenced by the mechanisms it uses to identify and process splice sites. In eukaryotes, it has been proposed that the spliceosome recognizes splice sites in pairs, either across the intron (intron definition (ID)) or across the exon (exon definition (ED)) [9]. In ID, splice sites on either side of an intron are recognized as a unit, while in ED, splice sites on either side of an exon are recognized as a unit. Experiments in both yeast and Drosophila have shown that when splice sites are presumably recognized by ID, mutating a single splice site disrupts splicing of the intron adjacent to the mutation. This leads to an RI, but has no effect on the splicing of nearby introns [17,18] (Figure 1). In contrast, when splice sites are presumably recognized by ED, mutating a single splice site affects not only the splicing of the intron adjacent to the mutation, but also the intron on the other side of the exon adjacent to the mutation. This causes cassette exons to be skipped [19,20]. Therefore, it is believed that with ID, splicing errors are more likely to result in RIs, while with ED, splicing errors are more likely to result in CEs. ID and ED are not mutually exclusive; in Drosophila melanogaster, ID and ED have been shown to operate within a single mRNA [21].
The method used to recognize splice sites has been associated with restrictions on exon and intron length. Recognition of splice sites with ED appears to constrain exon length [20,22], while recognition with ID limits intron length [18,23]. Fox-Walsh et al. [24] suggest that splice site recognition across the intron in D. melanogaster ceases at lengths greater than around 200-250 bp. A review of previous studies suggests that phylogenetic trends in exon and intron length may be correlated with the relative occurrence of RIs and CEs and the use of ID or ED for splice junction recognition [16,18,20,23,24]. However, previous results have been limited in their phylogenetic scope.
In this paper, we report a comprehensive survey of splice variants in 42 eukaryotic organisms. Our survey covers a wide phylogenetic range, including 13 multicellular animals, 6 plants, 14 fungi, and 9 protists. We observe variation across major phylogenetic groups in the representation of RIs and CEs among splice variants that is consistent with variation in the mode of splice site recognition (ID or ED) used by these groups. We infer that groups with a high ratio of RIs to CEs (fungi and protists) operate predominantly by ID, while groups with a low ratio of RIs to CEs (multicellular animals) operate predominantly by ED. In organisms with evidence of both RIs and CEs (thus, employing both ID and ED), CEs are shorter than constitutive exons (exons that show no evidence of splice variation), and RIs are shorter than constitutive introns, suggesting that splice mechanisms are closely tied to gene structure.

Results
To assess splice variation in eukaryotes, we selected 42 organisms with genome assemblies and large numbers of publicly available expressed sequence tags (ESTs; Table 1), spanning the plants, fungi, protists and multicellular animals. We aligned ESTs to genome assemblies and constructed transcript fragments, examined all loci where the EST data indicated two or more overlapping non-compatible transcripts, and labeled every instance of splice variation. Table 2 shows the numbers of ESTs for each organism, as well as the numbers of transcripts and loci constructed. Table 3 lists the splice variants we found. A complete list of the locations of predicted sites of splice variation, as well as control introns and exons that show no splice variation despite high EST coverage, is available on the Broad Institute's ftp site [25].

All eukaryotes exhibit splice variation
We found that splice variation is present in all organisms we analyzed. Every eukaryote we studied exhibited RIs, and almost every organism exhibited competing 5' splice sites, competing 3' splice sites, and CEs. Several organisms showed zero or very few CEs or competing splice sites due to having only a small EST library or a small overall number of predicted splice variants (for example, Histoplasma capsulatum, Rhizopus oryzae, Entamoeba histolytica), or a small number of introns (for example, S. cerevisiae). We also found no CEs in Paramecium tetraurelia, despite a large EST library and numerous predicted splice variants. However, P. tetraurelia is unusual in that it has extraordinarily short introns (25 bp on average). As CEs are usually associated with longer introns and shorter exons, it is possible that this organism's gene structure renders CEs impossible. Figure 2 illustrates the relative proportions of the four different kinds of splice variants in each organism, along with previously published data for human for comparison [8]. Our results for Caenorhabditis elegans, D. melanogaster, A. thaliana, and O. sativa confirmed those of previous studies [10,12,13].
The ratio of competing 3' splice sites to competing 5' splice sites was fairly constant, with more competing 3' splice sites than competing 5' splice sites in almost every case (Table 3). This is consistent with results of previous studies of splice variation, including the nine organisms in the Altextron database [10] and several other organisms [8,9,12]. When we combine the data from all the organisms in our analysis, there are 1.7 times more competing 3' splice sites than competing 5' splice sites. Interestingly, Zavolan et al. [26] found that competing 3' splice sites are more likely to preserve the reading frame than competing 5' splice sites.
In contrast to the uniform ratio of competing splice sites, the ratio of CEs to RIs varies widely between organisms. We found the ratio (which we will refer to as the CE fraction) to be a useful metric for summarizing the pattern of these splice variants. The CE fraction is listed in Table 3 and illustrated in Figure 2 for each organism.

CE and RI prevalence vary by kingdom and by intron length
Major eukaryotic groups (animals, plants, fungi, and protists) exhibit very divergent CE fractions ( Figure 2 rerio and B. floridae), and has a correspondingly lower CE fraction (53%) than they do (D. rerio has a CE fraction of 68%; B. floridae, 95%).
In contrast, among the unicellular fungi and protists, we see very few CEs and an overwhelming preference for RIs. Overall, fungi and protists have 37 times more RIs than CEs (an average CE fraction of 3%). This preference for intron   [11][12][13]. The unicellular algae C. reinhardtii has a CE fraction of 22%, which is closer to the values seen in multicellular plants than other unicellular organisms. However, it has a large genome size for a unicellular organism (118 Mb). In contrast, O. lucimarinus is a much simpler unicellular green algae with smaller genome size (13 Mb), minimal cellular organization and no CEs, a genome structure that is more like those of unicellular fungi and protists.
Frequencies of different forms of splice variation, arranged by phylogenetic group Figure 2 Frequencies of different forms of splice variation, arranged by phylogenetic group. The two bar charts show the relative frequencies of each type of splice variation. The ratio of CEs to RIs is shown in the chart on the left, while the one on the right displays competing 5' and competing 3' splice sites. Note that the CE/RI ratio shows wide variation among kingdoms while the ratio of competing 5' to competing 3' splice sites is remarkably consistent. A high-level overview of the phylogenetic tree is shown on the far left, and the organisms' names are colored according to their phylogenetic grouping. To see all four forms of splice variation on a single bar plot, see Additional data file 1. The data for H. sapiens was taken from a previous study [8].  Table S2 in Additional data file 4 for average lengths based on genome annotations. ¥ Average length of the two introns surrounding each predicted CE in our EST alignments. # Number of CEs considered for 'Average intron length next to CE' in the previous column and in Figure 5. This excludes those CEs where the introns next to the CEs do not have identical boundaries between transcripts with and without that CE. **Fraction of constitutive introns longer than 200 bp. † † Calculated from constitutive exons in our EST alignments. ‡ ‡ Calculated from constitutive exons in our EST alignments (excluding exons that cannot be CEs, namely, the first and last exons in a gene, and exons in genes without introns). § § Average length of CEs seen in our EST alignments.
The observed variation in CE fraction closely parallels variation in intron length. Animals and plants have more long introns (introns greater than 200 bp) than do protists and fungi (Table 4). In Figure 3 we plot the fraction of constitutive introns greater than 200 bp versus the CE fraction, and demonstrate a direct correlation between the presence of long introns and high incidence of CEs (y = 0.84x + 0.00; R 2 = 0.73). This correlation also holds within each kingdom (fungi, protists, plants, and animals), providing a phylogenetic control. As discussed below, this correlation is consistent with the hypothesis that splice site recognition differs within these groups.

Variably spliced regions exhibit size constraints
As shown in Table 4, variably spliced introns and exons are usually shorter than those in our data set that display no splice variation, in agreement with previous observations [28]. Moreover, these length differences between constitutive and variably spliced introns and exons appear to be associated with the relative frequencies of splice variation via CEs and RIs. In organisms where CEs are rare, such as fungi, CEs tend to be noticeably shorter than internal constitutive exons. However, in organisms with substantial fractions of CEs (animals and multicellular plants) we observe no significant length difference between CEs and internal constitutive exons ( Figure 4). Intron retention displays the opposite behavior. In organisms where RIs are uncommon (animals and multicellular plants), RIs tend to be shorter than constitutive introns, while organisms with large numbers of RIs (fungi and pro-tists) show no substantial length difference between RIs and constitutive introns ( Figure 5). In general, CEs and RIs both tend to be shorter than their constitutively spliced counterparts, with the length difference most noticeable in organisms in which each splice variant was uncommon.

Most splice variants are not functional
We next sought to determine the degree to which the observed splice variants reflect programmed alternative splicing versus incomplete splicing or splicing errors. To do so, we examined the impact of observed splice variants on the corresponding open reading frame and resultant protein. We also examined conservation within regions containing splice variants to look for signatures of coding selection.
Previous analyses of splice variants in mammals have focused on the more prevalent CEs. One surprising result from these analyses is the high frequency of CEs that alter reading frame or introduce stop codons [29]. Overall, approximately half of human CEs in coding regions result in frameshifts, while an additional 15% of CEs that do not cause frameshifts introduce in-frame stop codons [29]. A more recent analysis of splice variants generated by the ENCODE consortium [30] revealed little evidence that alternative splice variants commonly give rise to functional isoforms. In the case of frameshifts, if both alternative open reading frames lead to functional proteins, one would expect the polymorphism or divergence level in all three codon positions to be the same [30]. Few splice variants in the ENCODE analysis displayed this property [30]. Our data are consistent with previous results. When looking at all 42 organisms in our analysis summed together (for a total of 7,115 CEs), CEs are more likely to have lengths that are a multiple of three (45% in all species examined), but over half of all CEs have lengths that leave remainders of one (28%) or two (27%) when divided by three.
Less has been reported about the functional impact of RIs. In humans, many RIs have been shown to be not merely partially spliced transcripts or splicing errors [31]. They were shown to have evidence of coding potential (having higher GC content than other introns, having codon usage more like exons, and having a lower frequency of stop codons). Many human RIs participate in coding for a protein domain (a smaller fraction than for exons, but a greater fraction than for constitutive introns) [31]. However, not all RIs in higher eukaryotes are necessarily functional. In plants, many RIs were shown to introduce premature termination codons or frameshifts [12].
The prevalence of RIs in all organisms we analyzed provides an opportunity to assess the possibility of a functional role for these observed events. Our analysis reveals that RIs do not display a preference for preserving reading frames: the lengths of all 11,925 observed RIs were roughly equally distributed between intron lengths evenly divisible by three (34%), and intron lengths with remainders of one (34%) and Relationship between long introns and CE fraction  two (33%) when divided by three. Among the ten organisms with greater than 500 RIs, the number evenly divisible by three is 34 ± 2%, with a slightly higher value of 37% for D. rerio.
Though our analysis shows little evidence of frame preservation in RIs, we do see weak selection for coding potential. Between the closely related species of C. neoformans and Coccidioides immitis dN/dS ratios for concatenated RIs showed weak but significant evidence of conservation at the amino acid level (p < 0.001 for C. neoformans and p = 0.05 for C. immitis; Table 5). We also observe significantly fewer in-frame stop codons within RIs than in constitutive introns, with 23% fewer (p < 0.0001) in C. immitis, and 20% fewer (p = 0.02) in C. neoformans (Table 5). We observe no significant functional group over-representation (Table S3 in Additional data file 4).
We thus see some evidence for coding potential in RIs, but taken together with previous observations of CEs, our results suggest that the majority of observed splice variants are unlikely to give rise to functional proteins. It has been proposed that splice variants leading to frameshifts or truncated proteins may be due in part to artifacts associated with EST library construction or sequencing. However, the universality of such disrupting variants across the many independent data sets and kingdoms analyzed here -and the occurrence of such disruptions associated with both RIs and CEs -suggest that these events occur naturally and frequently. In humans, plants, and fungi, transcripts containing premature stop codons are targeted for degradation through the process of nonsense mediated decay [32,33]. The widespread occurrence of premature stop codons in human splice variants has led to the hypothesis that unproductive splicing and translation may be pervasive [34]. Our results are consistent with this hypothesis.

Retained introns are associated with weak splice sites
Studies in mammals have demonstrated that splice sites adjacent to CEs and RIs are associated with weak splice site sig-Average lengths of CEs compared to average lengths of internal constitutive exons Figure 4 Average lengths of CEs compared to average lengths of internal constitutive exons. Species are sorted by the fractional difference between these two lengths. In organisms where CEs are common (animals and plants) CEs are almost identical in length to constitutive exons, while in species where CEs are rare (fungi and protists) CEs tend to be significantly shorter than constitutive exons. In animals and plants, where ED is common, CEs are spliced by the same process as constitutive exons and these two groups are thus subject to the same length constraints. In organisms that splice primarily by ID, including fungi and protists, the lengths of constitutive exons are not constrained by ED. However, CEs in these organisms are still recognized by ED. Thus, in these species, constitutive exons can grow longer than CEs.  nals [28,[35][36][37][38][39][40]. We evaluated RI information content in plants, fungi and protists and report results in agreement with previous studies.

s a v i g n y i S . p u r p u r a t u s B . f l o r i d a e A . m e l l i f e r a T . r u b r i p e s O . s a t i v a D . r e r i o C . i n t e s t i n a l i s C . r e i n h a r d t i i C . e l e g a n s P . p a t e n s P . t r i c h o c a r p a N . v e c t e n s i s A . a e g y p t i S . m a n s o n i A . g a m b i a e P . s o j a e A . t h a l i a n a D . m e l a n o g a s t e r P . i n f e s t a n s A . f l a v u s N . c r a s s a T . t h e r m
We quantified 3' and 5' splice site strength by calculating the information content of the splice sites (see Materials and methods; Additional data file 2; Table S1 in Additional data file 4). We found significantly lower information content on either side of retained introns than constitutive introns for all 40 organisms in our analysis (P values 3.2e-9 and 2.1e-11, respectively, calculated from t-test). Overall, RI 5' splice sites had 1.9 ± 1.4 bits (24%) less information content, while their 3' splice sites had 0.9 ± 0.8 bits (17%) less (see data in Table  S1 in Additional data file 4 and sequence logos [41] in Additional data file 2.) We observed the largest differences between RIs and constitutive introns in animals. The average Average lengths of RIs compared with lengths of constitutive introns and introns adjacent to CEs Figure 5 Average lengths of RIs compared with lengths of constitutive introns and introns adjacent to CEs. RI length, which is constrained by ID, is fairly constant across all organisms. In protists and fungi, average RI length is close to that of constitutive introns, because ID is the primary mode of splice site recognition for both groups. In animals, constitutive intron length differs substantially from RI length because most constitutive introns are recognized by ED and are not subject to the same length constraints as RIs, which are recognized by ID. Plants fall between unicellular organisms and animals. Data are shown only for organisms with at least five RIs. For introns next to CEs, data are shown only for organisms with at least eight CEs with unambiguous adjacent intron lengths on both sides. Introns next to CEs are usually longer than constitutive introns, because these introns are recognized by ED and are free from ID length constraints.  While weak splice sites, such as those we observe in our RIs, have been associated with functional RIs and CEs [28,[35][36][37][38][39][40], they are also expected to lead to greater occurrence of incomplete splicing [18]. Therefore, weaker splice sites are uninformative as to whether these RIs are functional or merely incomplete splicing.

Splice variation reveals mechanisms of splice site recognition
The variation we observe in the CE fraction, the correspondence of this variation with average intron length, and the size constraints observed in both RIs and CEs can be parsimoniously explained by proposing differences in the proportion of splice junctions each organism recognizes via ED and ID. Organisms with a high CE fraction presumably splice predominantly via ED, whereas organisms with a low fraction presumably have a preference for ID-mediated splicing. Our results suggest that the fraction of introns recognized by ID and ED vary extensively across kingdoms, yet both mechanisms play a role in splicing in all phylogenetic groups.
In animals, exons are short relative to introns (Table 4) and CEs are more common than RIs (Table 3). This pattern is consistent with the hypothesis that splice junctions in animals are primarily recognized through ED, as has been previously demonstrated in vertebrates [19]. Moreover, the predominance of ED in these species predicts that CEs should be approximately as long as constitutive exons, because they are spliced the same way and, thus, are subject to the same length constraints. This is precisely the behavior we observe ( Figure  4).
In fungi and protists, conversely, introns are short relative to exons (Table 4) and RIs far outnumber CEs. We propose that these groups primarily recognize splice junctions using ID ( Figure 5). Because the splicing machinery in these organisms recognizes RIs in the same way it recognizes constitutive introns, we expect both types of intron should be subject to the same length constraints. Supporting this hypothesis, Figure 5 shows that constitutive introns in these species are similar in length to RIs. Intron definition has been demonstrated experimentally in the yeasts Saccharomyces and Schizosaccharomyces [17,27], and in plants [42]. Our analysis extends this result to all ascomycetes, as well as basidiomycetes and zygomycetes, and suggests that ID is not simply a characteristic of the derived yeasts. Furthermore, due to their high prevalence of RIs, we predict that ID predominates in the basidiomycete C. neoformans, which possesses 5.4 introns per gene on average, as well as the protozoan Tetrahymena thermophila, which has 3.3 introns per gene. ID is not associated with low intron density per se.
In plants, intron lengths vary widely, and individual species show substantial numbers of introns both greater and less than 200 bp in length (Table 4). Correspondingly, we observe that RIs and CEs both occur in sizeable quantities in this group. We thus propose that ID and ED both play significant roles in splice site recognition in plants.
While ED is most common in animals and ID dominates in fungi and protists, nearly all species analyzed show evidence for using both mechanisms. ID and ED have previously been shown to operate within the same species and indeed within the same gene [21]. We thus propose that the intron and exon length distributions in any organism are each sums of two distributions: one made up of shorter introns recognized by ID and the longer exons that surround them, and one made up of shorter exons recognized by ED and the surrounding longer introns. When we examine the lengths of CEs, we sample from only one of these two distributions: the subset of exons recognized by ED. When we examine the lengths of RIs, we sample from the length distribution of introns recognized by ID. Both these distributions are biased to be short, and this length bias is particularly noticeable in organisms where these splice variants are rare. Finally, when we look at the lengths of introns surrounding CEs, we are primarily sampling from the distribution of introns associated with ED. This last distribution tends to be long (Additional data file 3), as is the length distribution of exons surrounding RIs, which are associated with ID. (See Additional data file 3 for more detail on intron and exon length distributions in CEs and RIs.) Importantly, our model of varying levels of ID or ED splicesite recognition explains the variation we see in CE fraction, whether or not individual variants lead to functional messages. As shown in Figure 1a, ID mis-recognition of a single splice site should lead to intron retention, as has been demonstrated by splice site mutation experiments in Drosophila and Schizosaccharomyces [17,18]. Creating a CE via ID, however, would theoretically require coordinated mis-recognition of two splice sites and pairing of splice sites over a greater distance. If this distance were greater than 200 bp, pairing with ID would be considerably hindered [24]. Similarly, as experimentally demonstrated and shown in Figure 1b, ED mis-recognition of a single splice site leads to a CE [19]. The hypothetical generation of RIs under ED would require multiple mis-recognitions and pairing of splice sites belonging to two possibly distant exons.
Thus, if many non-functional splice variants arise as a consequence of incorrect or incomplete splice site recognition, we would nonetheless expect that splice sites recognized by ID would more commonly give rise to RIs while those recognized by ED would more commonly give rise to CEs. Non-func-tional splice variants, therefore, are as informative as functional ones when considering the question of how splice sites are recognized. This is fortunate, as it is unclear at present what proportion of observed splice variation is indeed functionally significant. Splice variants and their characteristics, then, provide insight into the underlying mechanisms of splice site recognition, irrespective of whether such variants are functional or biological errors.

Evolutionary implications of splice variation
The varying prevalence of CEs and RIs across major eukaryotic groups raises evolutionary questions about modern variation in splice recognition among lineages, as well as the nature of splice site recognition in the last common eukaryotic ancestor. As RIs and CEs are exhibited in almost every organism we studied, it is likely that almost all extant eukaryotes are capable of recognizing splice sites via both ID and ED. The last common eukaryotic ancestor, then, may also have been capable of both types of splice site recognition. Indeed, Collins and Penny [43] report that most of the key components of the spliceosome were present in the last common eukaryotic ancestor. If the minimal mechanistic requirements to support both ID and ED were present in the last common eukaryotic ancestor, which was more prevalent: ID or ED? In our analysis, we note that intron density has less effect on the types of observed splice variation than does intron length. For example, the fungus C. neoformans recognizes splice sites almost exclusively by ID, despite having more introns per gene than many plants where ED is prevalent. Thus, whether the last common eukaryotic ancestor recognized splice sites predominantly via ID or ED is probably not a question of how many introns it had but rather how long its introns were. As very long introns appear to be a derived feature associated with multicellularity, we may speculate that the last eukaryotic common ancestor had introns similar in size to most protists, and, therefore, probably employed ID more than ED.
Evidence pertaining to the evolutionary origin of introns also suggests a prevalence of ID in early eukaryotes. Similarities in the splicing mechanism between self-splicing group II introns in prokaryotes and spliceosomal eukaryotic introns suggest that the former may have begat the latter [44][45][46][47][48]. As group II introns are mobile genetic elements that spread through retrotransposition, the ancestors of spliceosomal introns were probably self-contained in terms of their signals for excision, and would not likely have relied on splicing factors embedded in flanking coding sequence (as may be required for ED [49]). Therefore, the earliest spliceosomal introns may have employed a method of splice site recognition most similar to ID.
Recent molecular evidence suggests that SR (serine-argininerich) proteins may be associated with the ascendance of ED in animals and plants [50]. SR proteins bind to RNA sequences and assist in spliceosome assembly for both constitutively and alternatively spliced introns. They have been shown to enhance the splicing efficiency of introns with suboptimal splice signals in S. pombe [51]. Additionally, Shen and Green [52] have recently shown that SR proteins can rescue splicing of introns with suboptimal splice signals if those proteins are directed to bind to exonic mRNA sequence in S. cerevisiae, a species that has lost all native SR proteins. If the experimental binding of SR proteins to pre-mRNA sequences is indicative of the binding of SR proteins to exonic splicing enhancers in organisms where ED is prevalent, the fundamental splicing machinery of ID and ED may be closely related. Along these lines, Ram and Ast [50] speculate that the primary role of SR proteins has changed over time. In early eukaryotes, SR proteins assist in the recognition of suboptimal ID introns, while in higher eukaryotes, they bind to exonic splicing enhancers. This role change shifts the placement of the basal splicing machinery across exons instead of across introns and enables ED using the same spliceosomal machinery employed for ID [50].
The SR protein family has expanded in multicellular eukaryotic lineages [53], and this proliferation may have facilitated the widespread conversion of introns in those lineages from ID to ED. However, it is unclear what underlying neutral or selective forces could be responsible for this shift. Changes in genome size may have played a role. Multicellular eukaryotes generally exhibit larger genomes than unicellular eukaryotes, and organisms with large genomes tend to have long introns [54,55]. Because ID is only effective for introns less than approximately 250 bp [24], an upward trend in genome size in multicellular lineages (resulting from a reduced deletion rate, increased transposon activity, or both) could have favored ED introns that were spliceable in the face of this mutational pressure.
Regardless of which evolutionary forces are responsible for their ascendance, CEs produced through exon definition in multicellular eukaryotes allow for much greater flexibility and combinatorial complexity of alternatively spliced transcripts than would RIs recognized by ID. For example, the large number of transcripts (>30,000) that can be produced by the Dscam gene in D. melanogaster is facilitated by independent splicing of CEs [56]. The abundance of CEs enabled by ED in the human proteome may help to explain why the human genome contains only about as many genes as that of the worm or fly, despite the (admittedly biased) perception of our own much greater organismal complexity [57].

Conclusion
Using EST and cDNA data from 42 organisms, we find the prevalence of RIs and CEs to vary significantly across major eukaryotic groups, strongly suggesting that the underlying mode of splice site recognition (ID versus ED) also varies in prevalence across eukaryotes. Our results show that RIs, which are present in every organism we analyzed, are more widespread than previously thought. We also find a strong relationship between intron length and the prevalence of splice variation: the fraction of introns greater than 200 bp is correlated with the CE fraction. Shorter introns (<200 bp), such as those found in fungi and protists, are more likely to be recognized by ID. In all 23 fungi and protists that we examined, we observed that RIs are more common than CEs. In contrast, shorter exons surrounded by longer introns (>200-250 bp), such as those found in animals, are more likely to be recognized by ED. In the 13 multicellular animals in our analysis, CEs occur much more frequently, sometimes in greater numbers than RIs. The six plants in our analysis exhibited intermediate intron lengths, having more CEs than fungi and more RIs than animals.
We conclude that ID and ED are likely both present to some degree in all eukaryotes. We conclude that splicing proceeds primarily by ID in fungi and unicellular protists, due to the overwhelming majority of RIs and paucity of CEs observed in these organisms, as well as their short intron lengths and longer exon lengths. In contrast, splice sites in multicellular animals are recognized primarily via ED, due to the larger numbers of CEs observed, as well as these species' longer introns and shorter exons. However, the molecular mechanisms underlying the two different forms of recognition (ED and ID) are still unclear.
These findings help to reveal the complex interplay of selective constraints and mutational pressures underlying eukaryotic genome architecture, and improve our understanding of why eukaryotic genomes exhibit so much variation. Further sequencing of additional organisms, especially those that exhibit both unicellular and multicellular properties, will help to disentangle the effects of multicellularity, genome size, and intron length on the mechanisms of splice site recognition.

EST alignments
All EST and assembly data were publicly available and downloaded from the Broad Institute [58], GenBank, J Craig Venter Institute/The Institute for Genomic Research [59], Joint Genome Institute [60] or VectorBase [61] (Tables 1 and 2). We used BLAT [62] version 33 to align the ESTs to genomic sequence using the following parameters: --minIdentity = 95 --minScore = 50 --queryType = rna. We set the --maxIntron parameter to the longest annotated intron in each species.
We filtered the resulting alignments using the following criteria: each alignment must contain at least one canonical splice site (GT:AG, CG:AG, AT:AC); must have no non-canonical splice sites; must have ≥ 95% nucleotide identity; must not have more than nine consecutive insertions; and must not have more than nine consecutive deletions outside of an intron. We also required three or more exact matches at every intron-exon boundary [12]. If a single EST aligned to more than one location on the genome, we only considered the alignment with the highest score. To guard against redundant input data, if multiple alignments in the same locus had the same sequence after trimming, we disregarded all but one of them.
We discarded all unspliced alignments to prevent labeling pre-spliced transcripts as splice variants. Unspliced ESTs show no evidence of having been processed by the spliceosome, and may represent pre-spliced transcript fragments. It is difficult to distinguish between an unspliced EST that has been processed and an unspliced EST that has not, and thus we cannot report how many processed ESTs we discarded. However, if most unspliced ESTs have already been processed by the spliceosome, we would expect to find more of them in organisms with very few introns and/or very long exons. We found no such correlation in either case (Additional data file 5). We conclude that a substantial fraction of the unspliced ESTs represent pre-spliced ESTs, and, therefore, that unspliced ESTs are not a reliable indicator of splice variation.
The number of RIs changed substantially depending on whether we included or excluded unspliced ESTs, while the numbers of CEs and competing 3' and 5' splice sites changed very little. We believe that previous reports that do not exclude unspliced ESTs overestimate the frequency of intron retention.
After aligning the ESTs and filtering them, we built transcripts and transcript fragments using CallReferenceGenes, an unpublished tool used in the Broad Institute's genome annotation pipeline since 2005. Several previous papers provide an in-depth discussion of the problem [12,63]. We briefly sketch our algorithm here. The source code for CallReference-Genes, as well as the code that labels alternative splice forms, is included in Additional data file 6.
First, we partition the alignments into clusters, so that every alignment has exon-exon overlap with at least one other alignment in the cluster, and no alignment has exon-exon overlap with any alignment outside its cluster. Alignments that overlap no other alignments are ignored.
Next, for each cluster, we compare every alignment to every other alignment that overlaps it. (Here, as opposed to the previous step, we also consider exon-intron overlaps.) Relationships are directional, and are given one of three labels: 'conflicts', 'extended-by', or 'includes'. Two alignments conflict if any base in the region of overlap is exonic in one alignment and intronic in the other. If they do not conflict, one alignment extends another if it ends after the other and does not begin before it. Lastly, one alignment includes another if it does not conflict with it, starts before it, and ends after it.
At this stage the cluster is represented by a graph of nodes (representing alignments) and labeled edges (representing their relationships). To turn this into a more tree-like representation that is easier to traverse, we build an ordered list of alignments. We sort them by 3' coordinate, in ascending order. In the case where we have two alignments with identical 3' coordinates, we sort by 5' coordinate, again in ascending order. In this way, no alignment is extended by any element that sorts before it. An alignment usually, though not always, sorts after the alignments it includes.
To improve performance, we prune redundant relationships from the graph. We apply a series of heuristics to reduce the number of paths from an alignment to its descendants. These 'trimming rules' include: first, if A is extended by B, and C extends both A and B, and there is no element D such that B conflicts with D and C does not, then any path containing A and B will also contain C. The extended-by relationship from A to C is therefore redundant. Second, if A is included by B, and there is no element D such that B conflicts with D and A does not, then any path containing A will also include B. All extended-by relationships terminating in A are redundant. Third, if A is extended by B and both are included by C, and every element D that conflicts with C also conflicts with either A or B, then any path containing both A and B will also contain C. Thus, the extended-by relationship from A to B is redundant.
We traverse the list bottom-up so that every element is pruned before any element it extends. There will be multiple paths from a parent to a given child if a splice variation lies between them. If no splice variation occurs between them, generally, there will be one path, but the above rules are not exhaustive and some duplicate paths may remain after pruning. As we traverse the list and prune it, we track, per alignment, all alignments that can be reached through it (for example, all its extenders and includees, as well as their extenders and includees, and so on). We call these sets of alignments the 'descendants' of each alignment. (Note that pruning never reduces the membership of these sets, only the number of edges between them.) After ordering and pruning, we can traverse the list of alignments left-to-right, following 'extended-by' and 'includes' links to build paths linking splice-compatible alignments. To do this, we do the following. First, find starts -a start is any element that extends no other element and has at least one descendant that cannot be reached through any previously discovered start. Second, walk the tree -treating each start, in turn, as a root of a subtree, traverse extension and inclusion links as edges in a graph; each unique path represents (a fragment of) a transcript. Third, remove sub-paths -if all the alignments in one path are present in another, we discard the one containing fewer alignments. Fourth, overlapping paths with distinct alignments represent splice variants of the same gene.
Given two overlapping transcripts A and B, all bases in the region of overlap will be in one of four states: I, A and B are exonic; II, A is exonic; III, B is exonic; IV, neither is exonic. We group all adjacent states into a column with a single label, then use a sliding-window approach, across three such columns, to compare overlapping transcripts. CEs, RIs, and competing 5' and 3' splice sites all have a different signature appearance. CEs appear as IV:II:IV and IV:III:IV; RIs appear  as I:II:I and I:III:I; competing 5' and 3' splice sites appear as  I:II:IV, I:III:IV, IV:II:I, and IV:III:I. We filter the set of splice variants by requiring that every base in the variant region be exonic in at least one alignment and intronic in at least one alignment. Because EST data are fragmentary we cannot be sure that initial or terminal exons are complete. To ensure accurate labeling as well as reliable length statistics, we require that any exon in the alternatively spliced region not be an initial or terminal exon. Finally, we require that every alignment spanning the region conform to either the major or minor variant. The reported length of each splice variant is simply the width of the center column in the windows described above.
We also generated a list of constitutive introns and exons as a control. These are introns and exons predicted from loci where no ESTs conflict. Furthermore, every base within these constitutive elements must have ten or more ESTs supporting it. Note that there is no way to tell for sure that any exon or intron never exhibits splice variation; this method simply identifies loci where splice variation has never been seen, and if present, is presumably rare.

Testing retained introns for evidence of selection at the codon level
We used a comparative approach to test retained introns for purifying selection at the codon level using two groups of fungi, where genome sequences at suitable evolutionary distances were available. We examined all four sequenced serotypes of C. neoformans (JEC21, H99, R265 and WM276) in one group, while the second group consisted of C. immitis and the C735 strain of Coccidioides posadasii. Orthology of genes was determined using a reciprocal-best-BLAST criterion, while the alignment of orthologs was performed using Clus-talW [64]. Alignments of retained orthologous introns were concatenated to enhance power for detecting selection. Prior to concatenation, splice donor and acceptor sites were removed, and the 5' and 3' ends of each intron alignment were padded with gaps in order to preserve the native reading frames of introns. We used a 100 bp cutoff for retained introns in C. neoformans and 150 bp in C. immitis. Since 95% of the introns we identified in each organism were less than these cutoffs, we used a cutoff to eliminate unusually long introns. We analyzed 477 and 389 orthologous intron alignments in C. neoformans and C. immitis, respectively. The dN/dS ratio of concatenated intron alignments was calculated using the codeml program (model M0) in the PAML 3.15 software package [65]. The probability that the resulting dN/ dS ratios were significantly less than one, indicating purifying selection at the codon level, was calculated using a bootstrapping approach with a set of control introns. We identified 672 and 662 control introns in C. neoformans and C. immitis, respectively. We randomly resampled, with replacement, from the control introns to create 1,000 concatenated control alignments approximately equal in length to the concatenated retained intron alignments in each taxonomic group. Then, we used codeml to calculate the dN/dS ratio exhibited by each resampled control alignment to determine the probability of observing dN/dS ratios as low as or lower than those exhibited by the retained introns in each taxonomic group, under a null hypothesis that they are non-coding.

Authors' contributions
AMM and MDP contributed equally. MDP wrote the EST alignment software and contributed to writing the paper. AMM performed the analysis and drafted and finalized the manuscript. DEN performed the dN/dS and stop codon density analyses, and contributed to writing the paper. JEG initiated and supervised the study, and revised the manuscript.

Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is a figure showing the frequencies of different forms of splice variation. Additional file 2 is a figure showing sequence logos for splice sites for both RIs and controls. Additional data file 3 shows intron and exon length distributions for six example organisms. Additional data file 4 contains supplementary tables detailing the differences in information content between RIs and controls (Table S1), intron and exon lengths from annotations (Table S2), and details of our analysis of functional group enrichment of RIs (Table S3). Additional data file 5 is a spreadsheet containing a comparison of alternative splicing events for the situations where unspliced ESTs are included as well as discarded. This spreadsheet also includes data for a higher-confidence dataset requiring a greater number of ESTs supporting each predicted alternative splicing event.
Additional data file 6 contains computer source codes for the CallReferenceGenes program, as well as the code that labels alternative splice forms. These codes can be inspected but will not be functional without the rest of the Broad Institute's Calhoun environment.
Additional data file 1 Frequencies of different forms of splice variation The data for H. sapiens were taken from a previous study [8].
Click here for file Additional data file 2 Details of splice site strength analysis (a) Sequence logos [41] for 3' splice sites of RIs and control introns.
(b) Sequence logos [41] for 5' splice sites of RIs and control introns. For 5' splice sites, we show sequence logos for six intronic and three exonic bases. For 3' splice sites, we show sequence logos for three intronic and one exonic bases. Click here for file Additional data file 3 Intron and exon length distributions Six example organisms with large numbers of splice variants were chosen and their normalized intron and exon length distributions are plotted here. (a) CE distributions are shifted towards shorter lengths than the control exon length distribution, because of con-straints on exon length imposed by ED. (b) The peak at short intron lengths is very similar in RIs and constitutive introns, because this short intron peak is primarily made up of introns rec-ognized by ID. In contrast, almost no introns surrounding CEs have lengths close to this 'intron-definition peak' -almost all of them are spread out over a wide range of longer intron lengths, and have low values in the area of the short intron-length ID peak shown here (0-200 bp). Click here for file Additional data file 4 Supplementary Tables S1-S3 Table S1: differences in information content between RIs and con-trols. Table S2: intron and exon lengths from annotations. Table  S3: details of our analysis of functional group enrichment of RIs. Click here for file Additional data file 5 Summary of data including unspliced ESTs A comparison of alternative splicing events for the situations where unspliced ESTs are included as well as discarded. We also include data for a higher-confidence dataset requiring a greater number of ESTs supporting each predicted alternative splicing event.
Click here for file Additional data file 6 Computer source codes We include the source code for the CallReferenceGenes program, as well as the code which labels alternative splice forms. These codes can be inspected but will not be functional without the rest of the Broad Institute's Calhoun environment. These are included as Unix tar files. Click here for file