The fitness cost of mis-splicing is the main determinant of alternative splicing patterns
- Baptiste Saudemont†1, 2,
- Alexandra Popa†3, 4,
- Joanna L. Parmley†3, 5,
- Vincent Rocher3,
- Corinne Blugeon1,
- Anamaria Necsulea3,
- Eric Meyer1 and
- Laurent Duret3Email authorView ORCID ID profile
© The Author(s). 2017
Received: 3 April 2017
Accepted: 9 October 2017
Published: 30 October 2017
Most eukaryotic genes are subject to alternative splicing (AS), which may contribute to the production of protein variants or to the regulation of gene expression via nonsense-mediated messenger RNA (mRNA) decay (NMD). However, a fraction of splice variants might correspond to spurious transcripts and the question of the relative proportion of splicing errors to functional splice variants remains highly debated.
We propose a test to quantify the fraction of AS events corresponding to errors. This test is based on the fact that the fitness cost of splicing errors increases with the number of introns in a gene and with expression level. We analyzed the transcriptome of the intron-rich eukaryote Paramecium tetraurelia. We show that in both normal and in NMD-deficient cells, AS rates strongly decrease with increasing expression level and with increasing number of introns. This relationship is observed for AS events that are detectable by NMD as well as for those that are not, which invalidates the hypothesis of a link with the regulation of gene expression. Our results show that in genes with a median expression level, 92–98% of observed splice variants correspond to errors. We observed the same patterns in human transcriptomes and we further show that AS rates correlate with the fitness cost of splicing errors.
These observations indicate that genes under weaker selective pressure accumulate more maladaptive substitutions and are more prone to splicing errors. Thus, to a large extent, patterns of gene expression variants simply reflect the balance between selection, mutation, and drift.
The maturation of a primary transcript by the spliceosome can lead to the production of diverse transcripts, via the use of different splice sites and/or intron retention (IR). Alternative splicing (AS) is widespread in eukaryotes and it has been postulated that it might considerably expand the functional repertoire of eukaryotic genomes [1–3]. Many case studies have shown that some AS events are functional, i.e. that they play a physiological role, beneficial for the fitness of the organism (for review, see ). However, like any biological machinery, the spliceosome is not 100% accurate and the splicing of primary transcripts occasionally leads to the production of spurious messenger RNAs (mRNAs). These erroneous transcripts represent a waste of resources and may lead to the production of toxic protein variants and hence are expected to be deleterious for the fitness of organisms. Indeed, several quality control mechanisms exist in eukaryotic cells to mitigate the negative impact of erroneous transcripts . In particular, the nonsense-mediated decay (NMD) machinery is able to recognize and degrade cytoplasmic transcripts containing premature termination codons (PTCs) . However, these quality-control processes themselves are not 100% efficient. Hence, any transcriptome necessarily includes a fraction of variants that correspond to splicing errors and their frequency relative to functional AS events remains open for debate.
In a large majority of cases, splice variants contain PTCs (i.e. encode truncated proteins) and only a very small fraction (<0.6%) of annotated AS events lead to the production of a detectable amount of protein . The subset of AS variants that are detected in proteomic studies shows clear signs of protein functionality: 96% of them maintain the reading frame ; they rarely disrupt protein domains [7, 9]; and they are highly conserved, from mammals to bony fish . This contrasts with the bulk of AS events detected within transcriptomes: 58% of them induce frameshifts  and 70% disrupt protein domains . Moreover, comparative transcriptomic analyses revealed that only 1–3% of exon-skipping events detected by RNA-sequencing (RNA-seq) are conserved beyond mammals [11, 12] and alternative splice sites show no sign of selective constraint . The subset of exon-skipping events that are strongly tissue-specific and that preserve the reading frame is generally more conserved, which clearly suggests that this subset includes some functional events [11–14]. However, these cases represent only a small fraction of all AS events [11–14]. These observations indicate that only a small minority of AS events are involved in the production of functional protein variants (for review, see ). This led some authors to conclude that the vast majority of AS events correspond to splicing errors [10, 16–18] (we will hereafter refer to this hypothesis as the “noisy splicing” model).
However, this interpretation is contested by other authors who argue that AS might play another important role, not linked to the production of functional protein variants, but to the regulation of gene expression. Indeed, the maturation of primary transcripts into PTC-containing splice variants, which then get degraded by NMD, can be used as a way to regulate the amount of mRNA available for protein production (this post-transcriptional regulation pathway is termed AS-NMD, for AS coupled with NMD; for review, see [19, 20]). AS-NMD notably plays an important role in the regulation of genes involved in the splicing process itself, presumably to maintain the homeostasis of splicing factors via auto-regulatory loops [21, 22]. Interestingly, although the regulation of splicing factors by AS-NMD is well conserved across animals, the AS events that trigger NMD in these genes often involve different splice sites . The rapid evolution of AS events in mammals is therefore not necessarily in contradiction with the hypothesis that many of them play an important regulatory role. The comparison of transcriptomes in normal vs NMD-deficient cells revealed that a large fraction of genes produce splice variants (in a broad sense, i.e. including cases of IR) that are targeted by NMD [18, 24–27]. This pattern is widespread in eukaryotes and is not restricted to genes encoding splicing factors. Importantly, patterns of AS vary among tissues and during cell differentiation [28–30]. This led several authors to propose that AS-NMD might play a critical role in broadly regulating expression of a large percentage of genes [28–33].
Beyond a few case studies that provided clear evidence of genes regulated by AS-NMD, we still lack a global picture of the relative prevalence of functional AS compared to splicing errors. We propose here a test to quantify the fraction of splice variants corresponding to errors, i.e. having a negative impact on the fitness of organisms. The basis of this test is that the strength of splice signals is expected to reflect a balance between selection (which favors alleles that are optimal for splicing efficiency) and mutation and random genetic drift (which can lead to the fixation of non-optimal alleles) . This selection-mutation-drift equilibrium therefore predicts a higher splicing accuracy at introns where errors are more deleterious for the fitness of organisms. Hence, if AS events predominantly correspond to splicing errors, one should expect a negative correlation between the rate of AS events and their cost in terms of resource allocation (metabolic cost, mobilization of cellular machineries). The noisy splicing model therefore makes several specific predictions regarding the AS rate according to whether splice variants are detectable by NMD and according to the expression level, length, and number of introns of genes.
Results and discussion
Quantification of splicing variants in Paramecium
For a given gene, the abundance of splicing variants depends both on the intrinsic strength of splicing signals and on the relative stability of the different variants. Thus, to study the determinants of alternative splicing in P. tetraurelia, we sequenced the polyadenylated RNA fraction of cells, either in normal state (hereafter denoted wild-type [WT]) or rendered NMD-deficient by knocking down one of the main components of the NMD machinery (Upf1, Upf2, or Upf3). The inactivation of Upf genes leads to stabilization of PTC-containing transcripts that would normally be degraded by the NMD machinery, thus providing a proxy for the intrinsic splicing efficiency of introns.
We generated ten RNA-seq datasets (Additional file 1: Table S1): six distinct NMD knockdown experiments and four replicates of WT cell cultures (see “Methods”). All biological replicates gave similar results (Additional file 1: Figures S1 and S2). We therefore pooled the sequencing datasets, to increase the per gene read depth (50% of genes have a read depth > 41 and > 85 in WT and in NMD-deficient samples, respectively). We detected splicing events by mapping sequence reads to the genome. These splicing events were then compared to gene models of the reference genome annotation, which includes 39,642 protein-coding genes, among which 31,632 contain introns (n = 90,287 introns) .
We detected three types of AS events (Fig. 1c): IR; alternative splice site variants (ASSV); and splicing of cryptic introns (i.e. introns with both splice sites located within an annotated coding exon). It is important to note that the classification of splice variants relies on the definition of a canonical form (Fig. 1c): the distinction between a “cryptic intron” and a “retained intron” depends on which variant is considered as the reference. For the vast majority of introns (97.8%), we observed one single major splice form, at least five times more abundant than other forms (Additional file 1: Figure S3). We therefore decided to define the canonical form as the one that is the most abundant in WT cells (see Additional file 1: Text S1). To be able to identify canonical forms, we restricted all subsequent analyses to genomic segments covered by at least ten RNA-seq reads in WT samples. This subset includes 65,159 annotated introns (which constitute our reference intron dataset).
To compare AS rates between different samples, it is necessary to normalize variant counts by the sequencing depth . For introns, we computed the rates of retention and ASSV, defined as the proportion of variant reads among all reads spanning these reference introns (Fig. 1c). For cryptic introns, we considered all DNA segments potentially subject to cryptic splicing, i.e. segments of length 20–35 nt (matching the size distribution of observed introns and cryptic introns, Fig. 1), entirely located within an exon, starting with GT and ending with AG. These segments will hereafter be referred to as potential cryptic introns (PCIs). The rate of cryptic intron splicing is defined by the proportion of spliced reads among all reads spanning PCIs (Fig. 1c).
Summary of AS rates in paramecia and human
Number of protein-coding genes
Mean (median) number of introns per gene
Average ASSV rate per intron
Average IR rate per intron
Mean (median) number of PCIs per gene
Average splice rate per PCI
Impact of NMD on steady-state levels of splice variants
Lower rate of alternative splicing in long and highly expressed genes
The second point is that, for a given splicing error rate per intron, the rate of production of spurious transcripts increases with the number of introns present in a gene: the greater the number of introns, the greater the risk of having at least one error. The selective pressure on the strength of splice signals of each intron is therefore expected to increase with the number of introns in a gene and hence the AS rate (per intron) should be lower in genes with more introns. To test this prediction, we classified introns into three groups according to the number of introns present in their gene: genes with 1 intron; with 2–3 introns; and with at least 4 introns (mean = 5.2 introns) (the three groups correspond to 27.8%, 43.0%, and 29.2% of intron-containing genes, respectively). We then binned each group according to gene expression level and computed the AS rate per bin. Again, observations perfectly match predictions: for a given expression level, the AS rate per intron is higher in genes with fewer introns, both for IR (Fig. 3d) and for ASSV (Fig. 3e).
The third point is that the risk of cryptic intron splicing increases with the number of PCIs and therefore with the length of coding sequences (CDSs). The selective pressure to limit the strength of cryptic splice signals should therefore increase with CDS length and PCIs in long CDSs should have a lower splicing rate compared to PCIs in short CDSs. To test this prediction, we classified PCIs into three groups according to the length of the CDS in which they are located (each group corresponds to one-third of all genes) and then binned each group by gene expression level and computed the PCI splicing rate per bin. Again, the predictions of the model fit the observations: for a given expression level, the splicing rate per PCI is lower in genes with longer CDSs (Fig. 3f). Thus, all observations fit the three predictions of the “noisy splicing” model.
The genome-wide AS pattern is dominated by splicing errors
where AS f is the rate of functional AS events and AS e is the rate of erroneous splicing.
As a reference for highly constrained genes, we considered genes with a high expression level (top 10%) and with > 3 introns (for the quantification of erroneous ASSV and IR events) or with a CDS > 1400 bp (for the quantification of erroneous cryptic intron splicing). In WT cells, we observed that the ratio of the AS rate in genes with median expression level over the AS rate in highly constrained genes are r i = 12.0, r i = 20.3, and r i = 49.3 for IR, ASSV, and cryptic intron splicing, respectively. According to Eq. 4, this implies that for a median gene, 92–98% of splice variants detected in WT cells result from errors and this proportion might even be higher if the splicing error rate in highly expressed genes is not negligible (Eq. 3).
These estimates are based on the assumption that, on average, the rate of functional AS does not vary with gene expression level (i.e. AS h f = AS l f = AS f in Eq. 3). One may argue, however, that variation in AS rate with expression level might reflect differences in the propensity to use AS-NMD: it is in principle possible that weakly expressed genes are more prone to use AS-NMD to fine-tune their expression level (i.e. AS l f > AS h f ). For instance, one might speculate that highly expressed genes are preferentially regulated at the transcriptional level, to avoid the waste of resources caused by the post-transcriptional AS-NMD pathway. Furthermore, if gene regulation via AS-NMD requires only one AS-prone intron per gene, then this could explain why the average AS rates (measured over all introns) decrease with increasing number of introns per gene (Fig. 3). Thus, although all previous observations are consistent with the predictions of the noisy splicing model, they do not formally invalidate the AS-NMD hypothesis.
A dual strategy to limit the cost of splicing errors
In NMD-deficient cells, the IR rate is much higher for NMD-visible introns than for NMD-invisible introns, which indicates that the former has a lower intrinsic splicing efficiency (Fig. 2a). The difference in intrinsic splicing efficiency results, at least in part, from a difference in the strength of splice signals: on average, 77.4% of NMD-invisible introns match the consensus splicing signals [GTA..TAG], compared to only 69.8% for NMD-visible introns (Chi-squared test = 289.1, p < 10–15). However, in WT cells, the observed AS rate is similar for both categories of introns. This implies that the efficacy of NMD to eliminate transcripts with retained introns is strong enough to compensate the lower intrinsic splicing efficiency of NMD-visible introns.
The same pattern is observed for PCIs: in WT cells, NMD-visible and NMD-invisible PCIs show similar rates of splicing (Fig. 2b, Additional file 1: Figure S6A), despite the fact that the intrinsic rate of splicing of PCIs (observed in NMD-deficient cells) is about five times higher for NMD-invisible compared to NMD-visible PCIs (Fig. 2b, Additional file 1: Figure S6B). Thus, again, the higher intrinsic propensity of NMD-visible PCIs to be spliced out is compensated by the activity of NMD in WT cells.
Patterns of alternative splicing in humans are consistent with the noisy splicing model
As a reference dataset of highly constrained human genes, we considered genes with a high expression level (top 10%) and with > 21 introns (top 33%). The ratio of the AS rate in genes with median expression level over the AS rate in highly constrained genes are r i = 3.1 and r i = 3.6 for IR and ASSV, respectively. According to Eq. 4, this implies that for median genes, at least 68% of IR events and 72% of ASSV events correspond to errors. These estimates are lower than in paramecium (92% for IR, 95% for ASSV), which might reflect a higher proportion of functional AS events in mammals than in ciliates. One noticeable difference between AS patterns in these organisms is that exon-skipping is common in mammals, but absent in paramecium. Interestingly, in mammals, exon-skipping events that preserve the reading frame are more conserved than other AS events, which indicates that this subset includes a higher fraction of functional events . However, it should be noted that this subset represents only ~ 15% of ASSV events in human . In fact, the difference between human and paramecium estimates might simply result from a limitation of our methodology. Indeed, these estimates are based on the assumption that the error rate in the set of highly constrained genes is negligible. In paramecia, AS rates tend to plateau at high expression levels (Fig. 3), which is compatible with the hypothesis that this basal rate might correspond to functional splice variants. However, in human, contrary to paramecia, there is no sign that AS rates reach a basal value at high expression levels, both for IR and ASSV events (Fig. 5). It is therefore likely that the splicing error rate is substantial, even in the reference dataset of highly constrained genes. Hence, the above estimates are certainly an underestimate of the true splicing error rate in humans.
Fitness impact of mis-splicing in humans
To test whether IR rate co-varies with the fitness impact of mis-splicing, we binned introns according to their IR rate and computed πspl/π3 in each bin. We observed a positive correlation between πspl/π3 and the average IR rate per bin (R2 = 0.76, p < 10–6), with a twofold increase between bins of low IR compared to bins of high IR (Fig. 6c). Again, it is important to stress that the frequency of introns matching the GT..AG consensus does not vary with IR rate (Additional file 1: Figure S9). This implies that mis-splicing is deleterious, even in introns with high IR rate. However, in agreement with the noisy splicing model, introns that show a high IR rate correspond to introns where mis-splicing is relatively less deleterious.
The efficiency of excision of introns by the spliceosome is affected by different signals, located within introns and flanking exons (splice sites, branch point, polypyrimidine tract, splicing enhancers, or silencers). Besides the two splice sites that are critical for the splicing reaction (almost always GT for the donor and AG for the acceptor), all other signals tolerate some sequence flexibility. The probability for a mutation affecting a splicing signal to reach fixation depends on its fitness impact (i.e. the selection coefficient, s) and on the power of random genetic drift (i.e. the effective population size, N e ) . There is therefore necessarily a limit to the point up to which selection can optimize the strength of splice signals: if the splicing error rate is already low, any mutation that further improves splicing efficiency will necessarily have a weak fitness impact and hence will be subject to random drift (the so-called drift barrier effect ). This drift barrier therefore determines a basal splicing error rate, which depends on the mutation rate, on N e , and on the fitness cost of splicing errors (s).
For a given error rate, errors are expected to be more costly (in terms of metabolic resources and mobilization of cellular machineries) in highly expressed genes. Hence the fitness cost of mis-splicing is expected to increase with increasing expression level. Indeed, this is precisely what we observed in humans: the strength of selection against deleterious mutations at splice sites is strongly correlated to gene expression level (Fig. 6b). Since the risk of producing erroneous transcripts increases with the number of introns, this implies that all else being equal, there should be a stronger selective pressure against mis-splicing in intron-rich genes. The mutation-selection-drift theory therefore predicts that introns from weakly expressed/intron-poor genes should accumulate more non-optimal substitutions in their splice signals and therefore should show a higher splicing error rate. The relationships that we observe between AS rate, expression level, and intron number are perfectly consistent with these predictions, both in human (Fig. 5) and in paramecia (Fig. 3).
There are two possible ways to limit the deleterious impact of erroneous splicing: (1) improve the strength of splicing signals to increase intrinsic splicing efficiency and avoid the use of cryptic signals (error prevention); or (2) ensure that transcripts are degraded by NMD in case of splicing error (error mitigation). We observed that both strategies are used: there is a deficit of introns and cryptic introns that cannot trigger NMD in case of splicing error; and the rare introns that are not NMD-visible show stronger splicing signals (Additional file 1: Text S3, Additional file 1: Figure S10). The analysis of AS rate in NMD-deficient cells shows that NMD-invisible introns have a much higher intrinsic splicing accuracy than NMD-visible ones. This difference demonstrates that the biophysical limits of splicing accuracy have not been reached and that it would be possible to further improve splicing accuracy of NMD-visible introns by genetic engineering. However, the mutation-selection-drift theory predicts that once the basal splicing error rate has been reached, by error prevention or by error mitigation, then selection cannot further improve splicing efficiency. Thus, this model predicts that the steady state level of erroneous transcripts (after quality control by NMD) should be the same for NMD-visible and NMD-invisible introns. And this is precisely what we observed: in WT cells, NMD-visible and NMD-invisible AS events show similar rates (Fig. 2).
The fitness cost of splicing errors depends on the frequency of transcripts subject to at least one erroneous splicing event. Owing to the short length of RNA-seq sequence reads, it is not possible to directly quantify AS rates per transcript. However, given that AS rates (per intron) are similar in human and in paramecia (Table 1) and that human genes contain on average 3–4 times more introns than paramecia, this implies that the frequency of transcripts subject to at least one erroneous splicing event must be much higher in human than in paramecia. This is consistent with the drift-barrier hypothesis, which predicts that humans should have a higher splicing error rate (per gene), owing to their larger mutational targets (more introns) and to their smaller effective population size [41, 42].
There is clear evidence that some AS events are functional . Notably, we observed that AS-NMD probably plays an important role in the regulation of genes encoding splicing factors in paramecia (Additional file 1: Text S3), as previously shown in other eukaryotes [21, 22]. However, AS-NMD cannot explain the strong relationship between AS rate and expression level that is observed for NMD-invisible splicing variants (Fig. 4, Additional file 1: Figure S7). It has been recently shown that the retention of introns in nuclear transcripts (the so-called “detained” introns) might also contribute to the regulation of gene expression, independently of NMD . If weakly expressed genes were more prone to use this regulatory pathway, this might explain the relationship observed between expression level and IR rate. However, this model does not explain the relationship between IR rate and intron number (Figs. 3d and 5a) and, most importantly, cannot explain the relationship between expression level and other classes of AS events (ASSV or cryptic intron splicing; Figs. 3 and 5). The most parsimonious explanation is that the excess of AS in weakly expressed/intron-poor genes results from the accumulation of maladaptive substitutions, driven by random genetic drift in genes where the selective pressure is weaker. Our observations indicate that for median genes, the vast majority of observed splice variants correspond to errors, in contradiction with the panglossian view of a widespread role of AS-NMD in fine-tuning the expression of genes. Of course, this does not negate the importance of AS-NMD in the regulation of some genes. However, our results highlight the necessity of a careful consideration of non-adaptive hypotheses before concluding about the functionality of AS events.
Paramecium strain, cell culture, and inactivation of NMD
The entirely homozygous strain 51 of P. tetraurelia was grown in a wheatgrass powder infusion medium bacterized with Klebsiella pneumoniae the day before use and supplemented with 0.8 mg.L-1 ß-sitosterol. NMD was inactivated either by RNAi-mediated silencing of UPF genes during vegetative growth of WT cells or by generating somatic knockouts, i.e. clones in which these genes are deleted from the macronucleus. RNAi treatment was based on the double-stranded RNA feeding technique : briefly, cells were fed for seven days with E. coli (HT115) producing double-stranded RNA homologous to the target gene. Sequences used for silencing of UPF1A, UPF1B, UPF2, UPF3, and ICL7a (which encodes a cytoskeletal protein), were segments 1885–2289, 1887–2285, 1143–1546, 18–422, and 1–580 of the genes (from the ATG), respectively. These genes can be accessed with ParameciumDB (http://paramecium.cgm.cnrs-gif.fr/) under accession numbers GSPATG00034062001, GSPATG00037251001, GSPATG00017015001, GSPATG00001393001, and GSPATG00021610001, respectively. Somatic knockouts were generated by applying RNAi treatment during the development of a new somatic macronucleus, which results in the deletion of the targeted genes [45, 46]: WT conjugating pairs were transferred to “UPF” RNAi medium and, following their separation, individual exconjugants were isolated in the same medium. After 24 h of growth, cells were transferred to standard growth medium. Among the viable exconjugants obtained, somatic UPF deletions were screened for based on the slow growth phenotype and the inability to undergo autogamy, and later confirmed by Southern blots and PCR (Additional file 1: Figure S11).
Total RNA was extracted from cells grown on K. pneumoniae or the relevant feeding E. coli strains with the TRIzol (Invitrogen) procedure, modified by the addition of glass beads. All RNA samples were treated with DNase prior to library construction to minimize DNA contamination. For the first four RNA-seq datasets in Additional file 1: Table S1, poly(A) RNAs were purified from 100 μg of total RNA with the MicroPoly(A)purist kit (Ambion). Of the output, 25% was used for mRNA reverse transcription, using the SuperScript III kit (Invitrogen) and the anchor-oligo(dT) primer 5′-GCCCACCAGAGCCGGCGGATTTTTTTTTTTTTTTTT-3′. After alkaline lysis of RNA and removal of the oligo(dT) primer with G-50 columns (GE Healthcare), a poly(G) tail was added to single-stranded complementary DNAs (cDNAs) with terminal transferase (NEB) following the producer’s instructions. After phenol purification and ethanol precipitation, cDNAs were made double-stranded using the Phusion PCR enzyme (Finnzymes) and the anchor-oligo(dC) primer 5′-GCCCACCAGAGCCGGCGGACCCCCCCCCCCCCCCCC-3′. Double-stranded DNA was then purified using the Qiagen PCR purification kit and cDNA libraries were amplified by 15 cycles of PCR with the anchor primer. cDNA libraries were digested by EciI restriction enzyme (NEB) and purified (Qiagen) before addition of Illumina adaptors. For the last six RNA-seq datasets, library preparation and Illumina sequencing were performed at the ENS Genomic Platform (Paris, France). Poly(A) RNAs were purified from 1 μg of total RNA using oligo(dT). Libraries were prepared using the strand non-specific RNA-seq library preparation TruSeq RNA Sample Prep kit (Illumina) and multiplexed by 3 on 2 flowcell lanes. 101-bp paired-end read sequencing was performed on a HiSeq 1500 device (Illumina).
The sequencing of these ten samples yielded a total of 40.8 Gb (from 247,653,027 fragments), 25.1 Gb from NMD-deficient cells, and 15.7 Gb from control cells (Additional file 1: Table S1). Reads were mapped against the P. tetraurelia reference genome assembly (accession number: CAAL01000000) , using TopHat (version 1.4.1) . The minimal and maximal intron lengths were set to 10 nt and 500,000 nt, respectively. Reads that mapped at multiple positions on the genome were excluded from further analyses. Read coverage along transcription unit was obtained using annotated gene models from the reference genome . The expression level of genes was measured in reads per kilobase per million mapped reads (RPKM).
Detection of splicing events
For each annotated intron, we counted the number of mapped reads spanning both extremities (Fig. 1c). Reads aligning to the genome sequence without any gap were counted as IR. Reads showing a deletion corresponding exactly to the annotated intron were counted as splice events. Reads with a deletion that does not match the annotated intron (at one or both extremities) were counted as ASSV. Reads showing a deletion entirely located within an annotated coding exon were counted as cryptic intron splicing events (Fig. 1c).
The cell cultures that we analyzed are totally homozygous. However, it is important to note that in paramecia, the macronuclear genome is highly polyploid and that the different copies of a same gene may differ due to heterogeneity in the process of excision of internal eliminated sequences (IESs) . Thus, a fraction of the diversity detected in the transcriptome may in fact result from this macronuclear genomic heterogeneity. Among all alignment gaps detected by TopHat, > 97% match the consensus intron boundaries (GT/AG), which indicates that most of them correspond to bona fide splice events. To avoid any confusion between splice variants and IES excision variants, we counted as splice variants only those matching the GT/AG consensus.
The classification of splice variants (IR, ASSV, or cryptic intron splicing) was based on the comparison with the canonical form, defined as the major form observed in WT cells. Among the 90,287 annotated introns, we selected those that are spanned by at least 10 reads in WT samples (n = 70,242). Among those ones, 4045 were never observed as spliced and 1038 correspond to minor splice forms. Thus, our reference dataset includes 65,159 introns (72% of the initial dataset).
Quantification of AS rate
One important goal of this study was to analyze the relationship between AS rate and gene expression level. The AS rate at a given intron is defined by the proportion of splice variant reads among all reads spanning that intron (Fig. 1c). One difficulty is that the precision of this metric is strongly dependent on the sequencing read depth and, hence, the measure of the AS rate is much less accurate in weakly than in highly expressed genes. To circumvent this problem, we binned introns (or PCIs) by expression level and then measured the global AS rates in each bin (defined by the proportion of splice variant reads among all reads in that bin).
The measure of IR rate might potentially be biased by the presence of contaminant genomic DNA in the RNA-seq library. We checked that our results are robust to this possible artefact (see Additional file 1: Text S5).
Analysis of intron retention in humans
Braunschweig et al.  analyzed 52 RNA-seq samples from different tissues and cell types to quantify intron retention in human genes. For each gene, they selected one representative transcript, based on Ensembl annotations. Their initial dataset includes 202,973 introns from 20,959 protein-coding genes (Additional file 1: Tables S6 and S8 from Braunschweig et al. ). We computed the average gene expression level of each gene over the 52 samples, using data provided by the authors. We excluded data from genes that are not mapped on chromosomes of the reference genome assembly (n = 18,546 introns from 2185 genes annotated on unmapped contigs or additional haplotypes) or for which expression data were not available (n = 4844 introns from 871 genes).
To analyze the AS rate according to NMD visibility, we also excluded from their dataset all introns located within UTRs or within truncated CDS (i.e. CDS lacking start or stop codon or containing an internal stop codon): n = 10,780 introns from 912 genes. The final dataset includes 170,015 introns from 16,991 genes.
For each intron, Braunschweig et al.  quantified retention rates in all samples where it showed sufficient read depth (>10 reads spanning each flanking exon boundary). Among the 170,015 introns, we excluded those corresponding to minor splice forms (i.e. with an IR rate ≥ 50%, n = 580 introns), and selected all those for which the retention rate had been quantified in at least ten samples. For each of the selected introns (n = 118,703), we computed the average retention rate over all available samples (median = 38 samples).
Analysis of ASSV in humans
We estimated ASSV frequencies in 25 human tissues and cell lines, using 110 publicly available RNA-seq samples (Additional file 1: Table S3), corresponding to a representative subset of the samples analyzed by Braunschweig et al. . To increase comparability among samples, for paired-end data we analyzed only the first read of the pair and stranded samples were treated as unstranded. We aligned the RNA-seq data on the human genome (hg38 assembly, downloaded from Ensembl release 84) using TopHat 2.0.4 with the following options: minimum intron size for junction discovery = 40 nucleotides (nt), maximum intron size = 1 million nt, maximum one mismatch per read segment, anchor size 8 nt, no mismatches allowed in the anchor region, no coverage search. To aid the spliced read mapping process, we provided as an input for TopHat the set of introns annotated in Ensembl release 84, with the –j option. We re-estimated the splice junction frequencies using uniquely mapping reads, annotated with the NH:i:1 tag in the original TopHat alignments. For each tissue/cell line, we combined read counts from all available samples.
For a given intron, this parameter was computed only in tissues with sufficient read depth ((nE1E2 i + nEaE2 i + nE1Ea i ) > 10 reads). We excluded 3075 introns corresponding to minor splice forms (i.e. mean ASSV rate ≥ 50%) and selected all introns for which the ASSV rate had been quantified in at least ten tissues. For each of the selected introns (n = 102,697), we computed the average ASSV rate over all available samples (median = 22 samples).
Note that this definition of ASSV includes any splicing event that connects a donor (or acceptor) of the annotated intron, to an alternate acceptor (or respectively donor) in the same gene. This definition encompasses many different types of AS events: not only alternative 3′ or 5′ splice site usage (as shown in Fig. 1c for paramecia), but also exon skipping, alternative initial/terminal exons or mutually exclusive exons  (Additional file 1: Figure S12).
Definition of NMD-invisible alternative splicing events in humans
In mammals, NMD is able to recognize and degrade PTC-containing transcripts only if the PTC occurs more than 50 nucleotides upstream of the last exon-exon junction [6, 49]. Hence, alternative splicing events (IR or ASSV) affecting last introns were classified as NMD-invisible, whereas the other were classified as potentially NMD-visible.
Analysis of polymorphism at splice sites of human introns
For each of the 170,015 introns located within coding regions, we analyzed patterns of polymorphism in the vicinity of its donor splice site (last 20 bp from the upstream exon and first 30 bp of the intron) and of its acceptor splice site (last 30 bp of the intron and first 20 bp from the downstream exon), using polymorphism data from the 1000 Genomes Project (phase 3; ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) . In total, our dataset includes 447,659 SNPs (0.026 SNP per bp), among which 437,080 (97.6%) with DAF information.
We thank Linda Sperling for helpful comments and for sharing her analyses of the distribution of splice signals within paramecium CDSs. We thank Olivier Arnaiz for his precious help in analyzing RNA-seq data. We thank Ulrich Braunschweig for kindly providing data on intron retention rates and expression levels of human genes. This work was performed using the computing facilities of the CC LBBE/PRABI.
This work was supported by the Agence Nationale de la Recherche (ANR-12-BSV6-0017-04 INFERNO), and by the France Génomique national infrastructure, funded as part of the “Investissements d’Avenir” program managed by the ANR (ANR-10-INBS-09). It received support under the program “Investissements d’Avenir” launched by the French government and implemented by the ANR with the references ANR-10-LABX-54 MEMOLIFE and ANR-11-IDEX-0001-02 PSL Research University.
Availability of data and material
Illumina read sequences generated in this study have been submitted to the European Nucleotide Archive (ENA) (https://www.ebi.ac.uk/ena) under accession number PRJEB15532 . All datasets (human, paramecia) are available at http://doi.org/10.5281/zenodo.321639 .
EM and LD designed the study. EM and BS designed the experiments (paramecia). BS performed the experiments. BS and CB prepared the sequencing libraries. AP, JLP, and LD performed the bioinformatics analyses of paramecia transcriptomes. AN and LD performed the bioinformatics analyses of human transcriptomes. VR and LD analyzed human polymorphism data. LD wrote the manuscript, with the help of EM, BS, AN, and AP. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Graveley BR. Alternative splicing: Increasing diversity in the proteomic world. Trends Genet. 2001;17:100–7.View ArticlePubMedGoogle Scholar
- Nilsen TW, Graveley BR. Expansion of the eukaryotic proteome by alternative splicing. Nature. 2010;463:457–63.View ArticlePubMedPubMed CentralGoogle Scholar
- Blencowe BJ. Alternative splicing: new insights from global analyses. Cell. 2006;126:37–47.View ArticlePubMedGoogle Scholar
- Kelemen O, Convertini P, Zhang Z, Wen Y, Shen M, Falaleeva M, et al. Function of alternative splicing. Gene. 2013;514:1–30.View ArticlePubMedGoogle Scholar
- Graille M, Séraphin B. Surveillance pathways rescuing eukaryotic ribosomes lost in translation. Nat Rev Mol Cell Biol. 2012;13:727–35.View ArticlePubMedGoogle Scholar
- Popp MW-L, Maquat LE. Organizing principles of mammalian nonsense-mediated mRNA decay. Annu Rev Genet. 2013;47:139–65.View ArticlePubMedPubMed CentralGoogle Scholar
- Abascal F, Ezkurdia I, Rodriguez-Rivas J, Rodriguez JM, del Pozo A, Vázquez J, et al. Alternatively spliced homologous exons have ancient origins and are highly expressed at the protein level. PLoS Comput Biol. 2015;11:e1004325.View ArticlePubMedPubMed CentralGoogle Scholar
- Tress ML, Abascal F, Valencia A. Most alternative isoforms are not functionally important. Trends Biochem Sci. 2017;42:408–10.View ArticlePubMedGoogle Scholar
- Ezkurdia I, Del Pozo A, Frankish A, Rodriguez JM, Harrow J, Ashman K, et al. Comparative proteomics reveals a significant bias toward alternative protein isoforms with conserved structure and function. Mol Biol Evol. 2012;29:2265–83.View ArticlePubMedPubMed CentralGoogle Scholar
- Pickrell JK, Pai AA, Gilad Y, Pritchard JK. Noisy splicing drives mRNA isoform diversity in human cells. PLoS Genet. 2010;6:e1001236.View ArticlePubMedPubMed CentralGoogle Scholar
- Barbosa-Morais NL, Irimia M, Pan Q, Xiong HY, Gueroussov S, Lee LJ, et al. The evolutionary landscape of alternative splicing in vertebrate species. Science. 2012;338:1587–93.View ArticlePubMedGoogle Scholar
- Merkin J, Russell C, Chen P, Burge CB. Evolutionary dynamics of gene and isoform regulation in Mammalian tissues. Science. 2012;338:1593–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Reyes A, Anders S, Weatheritt RJ, Gibson TJ, Steinmetz LM, Huber W. Drift and conservation of differential exon usage across tissues in primate species. Proc Natl Acad Sci U S A. 2013;110:15377–82.View ArticlePubMedPubMed CentralGoogle Scholar
- Mudge JM, Frankish A, Fernandez-Banet J, Alioto T, Derrien T, Howald C, et al. The origins, evolution and functional potential of alternative splicing in vertebrates. Mol Biol Evol. 2011;44:1–36.Google Scholar
- Tress ML, Abascal F, Valencia A. Alternative splicing may not be the key to proteome complexity. Trends Biochem Sci. 2017;42:98–110.View ArticlePubMedGoogle Scholar
- Melamud E, Moult J. Stochastic noise in splicing machinery. Nucleic Acids Res. 2009;37:4873–86.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang M, Zhang P, Shu Y, Yuan F, Zhang Y, Zhou Y, et al. Alternative splicing at GYNNGY 5′ splice sites: more noise, less regulation. Nucleic Acids Res. 2014;42:13969–80.View ArticlePubMedPubMed CentralGoogle Scholar
- Stepankiw N, Raghavan M, Fogarty EA, Grimson A, Pleiss JA. Widespread alternative and aberrant splicing revealed by lariat sequencing. Nucleic Acids Res. 2015;43:8488–501.View ArticlePubMedPubMed CentralGoogle Scholar
- McGlincy NJ, Smith CWJ. Alternative splicing resulting in nonsense-mediated mRNA decay: what is the meaning of nonsense? Trends Biochem Sci. 2008;33:385–93.View ArticlePubMedGoogle Scholar
- Hamid FM, Makeyev EV. Emerging functions of alternative splicing coupled with nonsense-mediated decay. Biochem Soc Trans. 2014;42:1168–73.View ArticlePubMedGoogle Scholar
- Lareau LF, Inada M, Green RE, Wengrod JC, Brenner SE. Unproductive splicing of SR genes associated with highly conserved and ultraconserved DNA elements. Nature. 2007;446:926–9.View ArticlePubMedGoogle Scholar
- Ni JZ, Grate L, Donohue JP, Preston C, Nobida N, O’Brien G, et al. Ultraconserved elements are associated with homeostatic control of splicing regulators by alternative splicing and nonsense-mediated decay. Genes Dev. 2007;21:708–18.View ArticlePubMedPubMed CentralGoogle Scholar
- Lareau LF, Brenner SE. Regulation of splicing factors by alternative splicing and NMD is conserved between kingdoms yet evolutionarily flexible. Mol Biol Evol. 2015;32:1072–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Ramani AK, Nelson AC, Kapranov P, Bell I, Gingeras TR, Fraser AG. High resolution transcriptome maps for wild-type and nonsense-mediated decay-defective Caenorhabditis elegans. Genome Biol. 2009;10:R101.View ArticlePubMedPubMed CentralGoogle Scholar
- Weischenfeldt J, Waage J, Tian G, Zhao J, Damgaard I, Jakobsen JS, et al. Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns. Genome Biol. 2012;13:R35.View ArticlePubMedPubMed CentralGoogle Scholar
- Kalyna M, Simpson CG, Syed NH, Lewandowska D, Marquez Y, Kusenda B, et al. Alternative splicing and nonsense-mediated decay modulate expression of important regulatory genes in Arabidopsis. Nucleic Acids Res. 2012;40:2454–69.View ArticlePubMedGoogle Scholar
- Drechsel G, Kahles A, Kesarwani AK, Stauffer E, Behr J, Drewe P, et al. Nonsense-mediated decay of alternative precursor mRNA splicing variants is a major determinant of the Arabidopsis steady state transcriptome. Plant Cell. 2013;25:3726–42.View ArticlePubMedPubMed CentralGoogle Scholar
- Wong JJ-L, Ritchie W, Ebner OA, Selbach M, Wong JWH, Huang Y, et al. Orchestrated intron retention regulates normal granulocyte differentiation. Cell. 2013;154:583–95.View ArticlePubMedGoogle Scholar
- Braunschweig U, Barbosa-Morais NL, Pan Q, Nachman EN, Alipanahi B, Gonatopoulos-Pournatzis T, et al. Widespread intron retention in mammals functionally tunes transcriptomes. Genome Res. 2014;24:1774–86.View ArticlePubMedPubMed CentralGoogle Scholar
- Edwards CR, Ritchie W, Wong JJ-L, Schmitz U, Middleton R, An X, et al. A dynamic intron retention program in the mammalian megakaryocyte and erythrocyte lineages. Blood. 2016;127:24–35.View ArticleGoogle Scholar
- Ge Y, Porse BT. The functional consequences of intron retention: Alternative splicing coupled to NMD as a regulator of gene expression. Bioessays. 2014;36:236–43.View ArticlePubMedGoogle Scholar
- Smith JE, Baker KE. Nonsense-mediated RNA decay - a switch and dial for regulating gene expression. Bioessays. 2015;37:612–23.View ArticlePubMedPubMed CentralGoogle Scholar
- Wong JJL, Au AYM, Ritchie W, Rasko JEJ. Intron retention in mRNA: No longer nonsense: Known and putative roles of intron retention in normal and disease biology. Bioessays. 2016;38:41–9.View ArticlePubMedGoogle Scholar
- Bulmer M. The selection-mutation-drift theory of synonymous codon usage. Genetics. 1991;129:897–907.PubMedPubMed CentralGoogle Scholar
- Irimia M, Roy SW. Origin of spliceosomal introns and alternative splicing. Cold Spring Harb Perspect Biol. 2014;6:a016071.View ArticlePubMedPubMed CentralGoogle Scholar
- Jaillon O, Bouhouche K, Gout JF, Aury JM, Noel B, Saudemont B, et al. Translational control of intron splicing in eukaryotes. Nature. 2008;451:359–62.View ArticlePubMedGoogle Scholar
- Aury J-M, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, et al. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature. 2006;444:171–8.View ArticlePubMedGoogle Scholar
- Kim E, Magen A, Ast G. Different levels of alternative splicing among eukaryotes. Nucleic Acids Res. 2007;35:125–31.View ArticlePubMedGoogle Scholar
- Marquez Y, Höpfler M, Ayatollahi Z, Barta A, Kalyna M. Unmasking alternative splicing inside protein-coding exons defines exitrons and their role in proteome plasticity. Genome Res. 2015;25:995–1007.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang Z, Burge C. Splicing regulation: from a parts list of regulatory elements to an integrated splicing code. RNA. 2008;14:802–13.View ArticlePubMedPubMed CentralGoogle Scholar
- Sung W, Ackerman MS, Miller SF, Doak TG, Lynch M. Drift-barrier hypothesis and mutation-rate evolution. Proc Natl Acad Sci. 2012;109:18488–92.View ArticlePubMedPubMed CentralGoogle Scholar
- Sung W, Tucker AE, Doak TG, Choi E, Thomas WK, Lynch M. Extraordinary genome stability in the ciliate Paramecium tetraurelia. Proc Natl Acad Sci U S A. 2012;109:19339–44.View ArticlePubMedPubMed CentralGoogle Scholar
- Boutz PL, Bhutkar A, Sharp PA. Detained introns are a novel, widespread class of post-transcriptionally spliced introns. Genes Dev. 2014;29:63–80.View ArticleGoogle Scholar
- Beisson J, Bétermier M, Bré MH, Cohen J, Duharcourt S, Duret L, et al. Silencing specific paramecium tetraurelia genes by feeding double-stranded RNA. Cold Spring Harb Protoc. 2010;5:1–6.Google Scholar
- Garnier O, Serrano V, Duharcourt S, Meyer E. RNA-Mediated programming of developmental genome rearrangements in Paramecium tetraurelia. Mol Cell Biol. 2004;24:7370–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Dubois E, Mathy N, Régnier V, Bischerour J, Baudry C, Trouslard R, et al. Multimerization properties of PiggyMac, a domesticated piggyBac transposase involved in programmed genome rearrangements. Nucleic Acids Res. 2017;3:gkw1359.View ArticleGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL. TopHat: Discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105–11.View ArticlePubMedPubMed CentralGoogle Scholar
- Duret L, Cohen J, Jubin C, Dessen P, Goût J-F, Mousset S, et al. Analysis of sequence variability in the macronuclear DNA of Paramecium tetraurelia: a somatic view of the germline. Genome Res. 2008;18:585–96.View ArticlePubMedPubMed CentralGoogle Scholar
- Lindeboom RGH, Supek F, Lehner B. The rules and impact of nonsense-mediated mRNA decay in human cancers. Nat Genet. 2016;48:1–9.View ArticleGoogle Scholar
- The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65.View ArticlePubMed CentralGoogle Scholar
- Saudemont B, Popa A, Parmley JL, Rocher V, Blugeon C, Necsulea A, et al. Analysis of Paramecium tetraurelia transcriptome in normal or NMD-defficient cells. European Nucleotide Archive (ENA). 2017. https://www.ebi.ac.uk/ena/data/view/PRJEB15532. Accessed 26 Oct 2017.
- Saudemont B, Popa A, Parmley JL, Rocher V, Blugeon C, Necsulea A, et al. Quantification of alternative splicing in paramecium and in human. Zenodo. 2017. http://doi.org/10.5281/zenodo.321639. Accessed 26 Oct 2017.