The fitness cost of mis-splicing is the main determinant of alternative splicing patterns

Background Most eukaryotic genes are subject to alternative splicing (AS), which may contribute to the production of protein variants or to the regulation of gene expression via nonsense-mediated messenger RNA (mRNA) decay (NMD). However, a fraction of splice variants might correspond to spurious transcripts and the question of the relative proportion of splicing errors to functional splice variants remains highly debated. Results We propose a test to quantify the fraction of AS events corresponding to errors. This test is based on the fact that the fitness cost of splicing errors increases with the number of introns in a gene and with expression level. We analyzed the transcriptome of the intron-rich eukaryote Paramecium tetraurelia. We show that in both normal and in NMD-deficient cells, AS rates strongly decrease with increasing expression level and with increasing number of introns. This relationship is observed for AS events that are detectable by NMD as well as for those that are not, which invalidates the hypothesis of a link with the regulation of gene expression. Our results show that in genes with a median expression level, 92–98% of observed splice variants correspond to errors. We observed the same patterns in human transcriptomes and we further show that AS rates correlate with the fitness cost of splicing errors. Conclusions These observations indicate that genes under weaker selective pressure accumulate more maladaptive substitutions and are more prone to splicing errors. Thus, to a large extent, patterns of gene expression variants simply reflect the balance between selection, mutation, and drift. Electronic supplementary material The online version of this article (doi:10.1186/s13059-017-1344-6) contains supplementary material, which is available to authorized users.


Supplemental Text S1: Definition of canonical splice forms
The classification of splice variants relies on the definition of a canonical form (Fig1C): the distinction between a "cryptic intron" and a "retained intron" depends on which variant is considered as the reference. Here we decided to define the canonical form as the one that is the most abundant in WT cells. The underlying assumption is that for most genes, there exists a dominant transcript, and that the other variants (functional or not) are quantitatively minor. To test whether this assumption is correct, we measured the rate of splice variation of individual introns in WT cells. For a given intron, the rate of splice variation is defined as the proportion of reads that differ from the annotated spliced form (either IR or ASSV), among all reads spanning both flanking exons (Fig1C). To avoid artefacts owing to gene annotation errors, we restricted this analysis to introns for which the spliced form is observed in at least one read of the WT samples. And to limit the variance in the measurement, we analyzed introns overlapped by at least 50 sequence reads. This subset represents 44% of all annotated introns, and hence is expected to be representative of the entire genome. In 99.4% of cases, the annotated intron corresponds to the major splice form (Supplemental Figure S3). On average, the rate of splicing variations is 3% (median 1%), and for 97.8% of introns, minor variants correspond to less than 20% of all spanning reads. Thus, in the huge majority of cases, there exists one major splice form that strongly predominates over other variants. Hence, we considered the major splice form (in WT samples) to be the canonical one.

Supplemental Text S2: Regulation of splicing factors by AS-NMD in paramecia
Our results indicate that the pattern of alternative splicing in paramecium is dominated by splicing errors. But of course, this does not exclude that a small fraction of genes might be subject to functional alternative splicing events. In particular, there is evidence from different model organisms that AS-NMD can play an important role in the regulation of genes encoding splicing factors (Lareau et al., 2007;Ni et al., 2007). To test whether this is also the case in paramecia, we searched for homologs of arginine-serine rich splicing factor 5 (SRSF5), which has been shown to be regulated by AS-NMD both in fungi and animals (Lareau and Brenner, 2015). We identified 11 SRSF5-homologs, which can be classified into 4 distinct clades (which we refer to as A, B, C and D). As in other eukaryotes, these genes are expressed at high levels (Supplemental Figure S4). They contain two to four introns (the intron/exon structure is conserved within clades, but differ between clades). All these genes include at least one intron showing extremely weak intrinsic splicing efficiency, with rates of IR or ASSV ranging from 23% to 76% (mean = 48%) in NMD-deficient cells. In most cases, there is only one such intron (the only exception is gene C3, which has two), and it is conserved among paralogs from a same clade (Supplemental Figure S4). The abundance of these splice variants is strongly sensitive to NMD (Supplemental Figure S4). These observations indicate that in paramecia, as in many other eukaryotes, genes encoding splicing factors contain introns with very weak intrinsic splicing efficiency, which can trigger the regulation of gene expression by AS-NMD. It should be stressed that the levels splicing variation observed in the 12 weak introns from SRSF5-homologs is extremely high: compared to introns from genes with similar expression levels, 11 of these introns are in the top 1% with highest IR or ASSV rate and the last one is in the top 5% (Supplemental Figure S4). This suggests that a very high IR or ASSV rate is probably a good signature to predict genes subject to functional alternative splicing.

Identification of SRSF-like genes
Arginine-serine rich splicing factor 5 (SRSF5) and its two closely related paralogs (SRSF4 and SRSF6) have been shown to be regulated by AS-NMD both in fungi and animals (Lareau and Brenner, 2015). We compared human SRSF4, SRSF5 and SRSF6 proteins (Uniprot accession numbers Q08170, Q13243 and Q13247) against all paramecium proteins using BLASTP, with an E-value threshold of 10 -3 (Altschul et al., 1997). We identified 11 hits common to the three paralogs (Gene accession numbers: GSPATG00039656001, GSPATG00037425001, GSPATG00002543001, GSPATG00001514001, GSPATG00019872001, GSPATG00009396001, GSPATG00025558001, GSPATG00030796001, GSPATG00003006001, GSPATG00001197001, GSPATG00005005001). We manually corrected several errors in the original gene annotations, using RNAseq data from WT samples to identify the correct exon junctions. Note that these annotation errors did not affect the 12 introns that we identified as being subject to AS-NMD (Supplemental Figure S4), except intron 1 from GSPATG00030796001. The phylogenetic tree was inferred from the protein multiple with PHYML (Guindon and Gascuel, 2003), using the SEAVIEW software (Gouy et al., 2010).

Supplemental Text S3: Signatures of selective pressure against splicing errors
We detected cryptic intron splicing in more than 30% of paramecium genes. Their size distribution is very similar to that of introns, with 97.1% of them between 20 and 35 bp (mean= 26.0 bp) (Fig1B). We have previously shown that in paramecium, as in many other eukaryotes, introns are under selective pressure to ensure that NMD can detect and degrade transcripts in case of intron retention (Jaillon et al. 2008). This constraint leads to a deficit in introns of length multiple of three (3n introns), lacking in-frame stop codons (Jaillon et al. 2008). By definition cryptic introns are located within coding regions, and hence do not contain any in-frame stop codon. Interestingly, we observe a very strong deficit of 3n cryptic introns (Fig1B; overall, only 18% of cryptic introns are of length multiple of 3), which suggests that cryptic introns that preserve the reading frame are counter-selected. This is consistent with the hypothesis that cryptic introns generally result from errors of the splicing machinery, and that such errors are more costly in the case of 3n cryptic introns because they are not detectable by NMD. In paramecia, splice signals are characterized by a strong conservation of the first and last 3 bp of introns (71% of introns match the consensus sequence: [GTA(N) n TAG]). The analysis of the distribution of distances between GTA and TAG triplets within coding exons shows a deficit of TAG downstream of GTA, specifically in a window 20 to 35 bp, which corresponds to the length of bona fide introns (Supplemental Figure S10). This indicates that strong cryptic splice sites are counterselected within exons, in agreement with the hypothesis that they are generally deleterious.

Supplemental Text S4: Quantification of the proportion of splicing errors: extended model
Our estimates of the proportion of AS events corresponding to splicing errors are based on the assumption that the rate of functional AS events (AS f ) does not vary with gene expression (see Equation (3) in the main text for more details). To test whether this assumption could bias our results, we explored a more complex model, where we considered that AS f might vary with expression level. Let us note k, the ratio of the functional AS rate in a given bin of expression (i) over the AS rate in highly constrained genes: Thus, the proportion of splicing errors in expression bin (i) given by Equation (4) becomes: Equation (5) shows that if the rate of functional AS was negatively correlated with the gene expression level (i.e. if k>1), then Equation (4) would lead to overestimate the proportion of splicing errors (and conversely if k<1).
Estimates of the proportion of variants resulting from splicing errors (for a gene with median expression level) are given below for different values of k: The main conclusion reported in the main text (based on the assumption that k=1) is that in a median gene, the vast majority of AS events correspond to errors. Thus, this conclusion would remain valid up to k=4 for IR and k=8 for ASSV or PCI. We will discuss below whether such values of k are plausible or not.
First, let us precise one point of terminology. Two types of functional AS events can be distinguished: -AS events that lead to the production of functional protein variants.
-AS events that do not produce functional proteins, but that contribute to the regulation of gene expression level via AS-NMD.
We will hereafter refer to the first type as "AS-FPV" (functional protein variants) and the second one as AS-NMD (regulatory function). The rate of functional AS can therefore be decomposed as: Thus, if k>1, this implies that gene expression level is negatively correlated either with AS FPV or with AS NMD . As mentioned in the main text, our analyses rule out the latter hypothesis. Indeed, we observed a strong negative relationship between AS rate and gene expression level, both for NMD-visible and NMD-invisible splicing variants (which, by definition, cannot contribute to AS-NMD). This pattern is observed both in paramecium ( Fig. 4) and in human (Suppl. Fig. S7). Hence, if k > 1, this must be due to AS-FPV and not to AS-NMD.
It is in principle possible that AS FPV vary with expression level. However, AS FPV would have to be quite high to affect significantly the estimates of splicing error rates reported in the main text. For instance, as shown in the above table, a value k=4 implies that in a median gene, 20% of ASSV variants are functional (compared to 5% if k=1). Thus, if k=4, this would imply that at least 15% (and possibly up to 20%) of ASSV variants correspond to AS-FPV events. For IR variants, this proportion would have to be even higher (25% to 33%).
This is in contradiction with numerous lines of evidence indicating that AS-FPV represents only a tiny fraction of all AS events (reviewed in Tress et al. 2017a,b). Proteomic studies (covering > 100 distinct tissues and cell lines) showed that at the protein level, 98% of genes produce one single dominant isoform: only 0.6% of all annotated AS events lead to the production of a detectable amount of protein (Abascal et al. 2015). Of course, this does not exclude the possibility that many genes could produce functional protein variants at low levels, below the limit of detection of proteomic analyses. However, if such minor isoforms were functional, one would expect them to display the same signatures of protein functionality as the protein variants that have been detected in proteomic studies. Yet, the bulk of AS variants identified in transcriptomic studies show clearly distinct features compared to the ones that have been validated at the protein level: -70% of them disrupt the domain structure of proteins (compared to 15% for validated variants) (Abascal et al. 2015) -58% of them shift the reading frame (compared to 66% expected by chance) (Pickrell et al. 2010) whereas this is the case for only 4% of validated variants (Tress et al. 2017b) -comparative transcriptomic analyses revealed that only 1%-3% of exon-skipping events detected with RNA-seq are conserved beyond mammals (Merkin et al., 2012;Barbosa-Morais et al., 2012), whereas the analysis of 60 exon-skipping events validated by proteomics data revealed that 100% of them are conserved from mammals to bony fish (Abascal et al. 2015).
All these observations indicate that AS-FPV represents at most a few percent of all AS events, which implies that k must be lower than 4.
Our observations provide additional evidence indicating that AS-FPV cannot account for the relationship between AS rate and gene expression level. Indeed, the fact that 96% of the variants validated by proteomics data preserve the reading frame (Tress et al. 2017b) demonstrates that AS-FPV is extremely rare among AS events that induce PTCs. Thus, if AS-FPV contributed substantially to the increase in AS rate in weakly expressed genes (as expected if k=4), then one would expect this increase to be stronger among frame-preserving AS events than among PTC-inducing events. As shown in Figure 4, this is clearly not the case: the increase in AS rate with decreasing expression level is in fact stronger for PTCinducing events than for the frame-preserving ones (NMD-invisible).
Thus, our main conclusion remains robust for plausible values of k.
Supplemental Text S5: Estimates of IR rate are robust to possible contamination by genomic DNA It is in principle possible that a small fraction of RNAseq sequence reads correspond to contaminant genomic DNA fragments. Contaminant genomic reads spanning an intron are counted as unspliced reads, and therefore lead to bias the measure of IR rate. This artefact can potentially contribute to the observed relationship between IR rate and expression level: in highly expressed genes, the fraction of reads corresponding to contaminant DNA is expected to be negligible. But in weakly expressed genes, contaminant DNA reads might significantly inflate the observed IR rate.
To quantify the potential amount of DNA contamination in our RNAseq libraries, we analyzed read depth in intergenic regions. Given that UTRs and non-coding RNA genes are poorly annotated in the genome of paramecium, we defined here intergenic regions as the interval between coding regions of consecutive protein-coding genes. We measured the read depth at the center of each of these intervals. The genome of Paramecium is very compact (70% coding), with very short intergenic regions (mean = 337.1 bp, median = 156 bp). Thus, UTRs may represent a substantial fraction of the length of intergenic regions. Moreover, transcriptional read-through may lead to the production of bona fide RNA reads beyond the canonical polyadenylation site. Hence, it is expected that the RNAseq read depth in intergenic regions should depend on the expression level of their flanking genes. And indeed, we observed that the read depth in intergenic regions is strongly correlated to the level of expression of flanking genes (Supplemental Figure S13).
In the subset of intergenic regions flanked by genes with very low expression, the average read depth at their center is 0.58 in NMD-deficient samples (Supplemental Figure S13) and 0.59 in WT samples (not shown). The level of DNA contamination is not expected to depend on gene expression level. Hence, the read depth that is potentially due to DNA contamination is at most 0.59. It should be noted that this corresponds to an upper estimate, because some intergenic regions may contain unannotated genes (protein-coding genes or non-coding RNA genes).
We re-estimated IR rates in NMD-visible introns for each expression bin, after subtracting the number of unspliced reads potentially corresponding to DNA contaminants (0.59 reads per intron). This lower boundary of the true IR rate is indicated by dashed lines in Supplemental Figure S5A. This figure shows that even after controlling for potential DNA contaminants, there remains a strong negative relationship between IR rates and expression level. Hence, DNA contamination cannot be the cause of the correlation between IR rates and expression level.
It should also be noted that there is a strong negative relationship between AS rate and expression level, not only for IR, but also for ASSV and cryptic intron splicing (Figure 3). Obviously, these two latter categories cannot be explained by DNA contamination since, by definition, they correspond to sequence reads with evidence of splicing. Hence, our conclusion that AS rates correlate negatively with expression level is robust to possible issues of DNA contamination.

Supplemental Figures
Supplemental Figure S1: Impact of NMD on observed IR rates: comparison of biological replicates.   Supplemental Figure S3: Distribution of AS rate in WT cells.

N=39,461 annotated introns, covered by at least 50 sequence reads in WT cells, and for which at least one read corresponds to the annotated splicing form. The AS rate at a given intron is defined as the ratio (n 2 +n 3 )/(n 1 +n 2 +n 3 ), where n 1 is the number of reads matching with the annotated splicing event, n 2 is the number of reads showing splicing with alternative splice sites, and n 3 the number of reads showing intron retention, and (n 1 +n 2 +n 3 ) is the total number of reads spanning both flanking exons (see Fig1C).
Proportion of variant splice forms  Figure S4: NMD-sensitive introns in P. tetraurelia SRSF-like genes.  (N= 52,in red,and N=12,