High-resolution modeling of the selection on local mRNA folding strength in coding sequences across the tree of life

Background mRNA can form local secondary structure within the protein-coding sequence, and the strength of this structure is thought to influence gene expression regulation. Previous studies suggest that secondary structure strength may be maintained under selection, but the details of this phenomenon are not well understood. Results We perform a comprehensive study of the selection on local mRNA folding strengths considering variation between species across the tree of life. We show for the first time that local folding strength selection tends to follow a conserved characteristic profile in most phyla, with selection for weak folding at the two ends of the coding region and for strong folding elsewhere in the coding sequence, with an additional peak of selection for strong folding located downstream of the start codon. The strength of this pattern varies between species and organism groups, and we highlight contradicting cases. To better understand the underlying evolutionary process, we show that selection strengths in the different regions are strongly correlated, and report four factors which have a clear predictive effect on local mRNA folding selection within the coding sequence in different species. Conclusions The correlations observed between selection for local secondary structure strength in the different regions and with the four genomic and environmental factors suggest that they are shaped by the same evolutionary process throughout the coding sequence, and might be maintained under direct selection related to optimization of gene expression and specifically translation regulation.


List of supplementary figures
. List of species

Figure S1. Correlations of traits with ΔLFE are not present in its individual components
Coefficient of determination ( ) for GLS regression of the specified trait with ΔLFE and its components (ΔLFE -red; native LFE -green; randomized LFE -blue), at different positions relative to CDS start. Negative values indicate negative regression slope. The observed correlation between each trait and ΔLFE is not observed with the individial components (native or randomized LFE).

Figure S2. The ΔLFE profile is more conserved than other genomic traits
Correlation (expressed using Moran's I coefficient) between the values of different traits, for pairs of species of different phylogenetic distances. Genomic-GC% is positively correlated at short distances. ΔLFE values (at different positions relative to CDS start) are more strongly correlated than genomic-GC% at most phylogenetic distances, but less correlated than genome sizes. Confidence intervals represent 95% confidence calculated using 500 bootstrap samples. The 'Random' trait is a normallydistributed uncorrelated variable.

Figure S3. Local CUB vs. Local ΔLFE
Spearman correlations between the ΔLFE profile (i.e., mean value for a given species at each position relative to CDS start) and the corresponding CUB profiles (i.e., CUB for all CDSs for a given species at this position relative to CDS start) show no direct correspondence, indicating the ΔLFE profiles are not simply a side-effect of direct selection operating on CUB in different CDS regions. CUB measures were calculated for the sequences contained in the same 40nt windows, starting at positions 0-300nt relative to CDS start, with all the sequences for each species concatenated, for a random sample of N=256 species.

ΔLFE (kcal/mol/window)
Position-specific randomization (maintaining the encoded AA sequences as well as the codon frequency in each position (across all CDSs belonging to the same species) yields qualitatively similary results to the CDS-wide randomization used throughout the rest of this paper. This supports the conclusion that the observed ΔLFE profiles are not merely a result of position-dependent biases in codon composition. A Correlation between ΔLFE calculated using "CDS-wide" and "position-specific" randomizations (see methods), at each position relative to CDS start. Correlations were calculated for a random sample (N=23) of species. B Comparison of individual mean ΔLFE profiles calculated using "CDS-wide" (LFE-0) and "position-specific" (LFE-1) randomizations.

Figure S5. ∆LFE is stronger in highly expressed genes and genes encoding for highly abundant proteins
The observed average ∆LFE features are generally more prominent in highly expressed genes and in genes encoding for highly abundant proteins. This figure shows results for 32 species, plotted according to their position on a taxonomic tree (Left). Results are summarized for highly expressed genes based on transcriptomic RNA-sequencing for 29 species (green region) and for experimentally measured proteinabundance (PA) for 12 species (blue region). Also shown are results for purely computational translation elongation optimization scores, I_TE(34) (cyan region). For each evidence type, results are shown for regions [A]-[C] (as defined in Fig. 1A).

Note that the increase in regions [B]-[C] in highly expressed genes in Listeria monocytogenes and
Bacillus subtilis was previously reported in (35 Figs. 4,5). They report similar results in E. coli region [B] (which is marginally significant in our data).
For each region, the following symbols identify the relation between the "high" and "low" groups:

+
The trend observed in this region (i.e., increased or decreased folding strength) is more extreme in highly expressed or highly abundant genes.

−
The trend observed in this region (i.e., increased or decreased folding strength) is less extreme in highly expressed or highly abundant genes (or the opposite trend is observed).

(no symbol)
There is no consistent and statistically significant difference between the groups (or there is no ∆LFE trend in this region).

+/−
Inconsistent or contradictory results in different positions.

NA
Data was not available for this species.
Determination of each symbol (+/−) was based on results of a Mann-Whitney U test between the two groups of genes across the appropriate region, once for each direction (with the null hypothesis being that a value sampled from one group is not likely to be greater than an item from the other group). Fraction of positive species and total number of species are shown below for each evidence type.
On the right side, the table shows a summary of relevant characteristics for each species. From right to left -the average ∆LFE "heat-map" for this species, for the 300nt region at the beginning (left) and end (right) of the CDS, the average GC% for the genome, and the average ENc' (CUB) for the genome.
RNA sequencing data was obtained through ENA from the experiments detailed in the table below. Species were chosen based on availability of data using for the same strain or a closely related strain and using short-read sequencing technology compatible with the pipeline described here. Experiments are transcriptomic in their design and the control sample from each experiment was used (from the logarithmic growth phase if possible).
Normalized read counts were calculated as follows. Trimmomatic(36) version 0.38, using the single-end or paired-end mode and the Illumina adapters, sliding window with window size 4nt and quality threshold 15, leading and trailing below 3 and minimum length of 36nt. Reads were mapped to reference genomes obtained from Ensembl genomes (37), except for E. coli that was obtained from NCBI (38). Reads were mapped to genomic positions with Bowtie2(39) version 2.3.4.3 using local alignment with the default settings. Read were then assigned to coding sequences using htseq-count(40) version 0.11.2 in union mode with non-unique matches included and ignoring expected strand. Normlalized counts for each CDS were finally obtained by dividing by the CDS length. Genes were divided to the "low" and "high" groups based on the median normalized read count for each species, with genes having no reads counted as 0.
PA results were obtained from PaxDB(41) using the "Integrated" dataset. Genes were divided to the "low" and "high" groups based on the median count for each species, with genes having no reads counted as 0.
I_TE (34), a CUB measure designed to measure codon optimization for translation elongation, was computed using DAMBE7(42) based on the included codon frequency tables for each species.  To verify this analysis is robust, bootstrapping using 1000 repeats was used to measure the following values: RSD1 -Relative standard-deviation (SD/mean) for the angle between the loading vectors shown (i.e., those for ΔLFE profile positions 0nt and 250nt). Distribution of angles shown in C. RSD2 -Relative standarddeviation (SD/mean) for the explained variance of PC1. B PCA plot for ΔLFE profiles at positions 0-300nt relative to CDS end (created using the same method as A). C Distribution of angles between shown loading vectors (i.e., those for ΔLFE profile positions 0nt and 250nt) using 1000 bootstrap samples. The distribution mean is 2.08 radians ( ) and the relative standard deviation (also shown as RSD1 on A) is 1.4%.

Sources for RNA-seq data
This procedure was repeated for all species and for each domain individially (see also Fig. 4D). In each case, the first two PCs explain >80% of the variation. The loading vectors for positions 0nt and 250nt are not parralel nor orthogonal (and this is robust to sampling and persists in smaller groups, see Fig. 4D), indicating some level of dependence between the two positions (also indicated in Fig. 3E, Fig. S6).

Figure S7. ΔLFE profiles for all species
ΔLFE profiles calculated using the CDS-wide randomization for individual species arranged by NCBI taxonomy. The ΔLFE profiles shown are for positions 0-300nt relative to CDS start (left) and CDS end (right). The numbers of species included in each group is shown to the left of the group name.

Figure S8. Comparison between ΔLFE profiles in different domains
Distribution of ΔLFE profiles relative to CDS start (left) and end (right), for species belonging to each domain. In bacteria and archaea, only one species has positive ΔLFE in the mid-CDS region, despite this being common in eukaryotes.

Figure S11. Correlation of ∆LFE with different genomic measures of CUB is consistent
Using different measures of CUB generally leads to the same conclusion about the interaction between CUB and ∆LFE. Note that for CAI and DCBS, increasing values indicate stronger bias, whereas for ENc', decreasing values indicate stronger bias.
The following measures were used to estimate genomic CUB. CAI (66) was computed using codonw(67) version 1.4.4, using the entire genome as the reference set. ENc' (68) was calculated using ENCprime (github user jnovembre, commit 0ead568, Oct. 2016). DCBS(69) was calculated as described in the paper. All CUB measures were averaged for each genome and the resulting values were used in GLS regression against the ∆LFE at each position.

A B C D
To test if correlation between genomic-ENc′ and ∆LFE is related to the general magnitude of ∆LFE or to position-specific aspects of the ∆LFE profile, we performed the following test: we decomposed the values by normalizing each genomic profile by its standard-deviation (as a measure of its scale), thus getting profiles of equal scale. We then checked for correlation between the normalized ∆LFE profiles with genomic-ENc′. There was no correlation after this normalization (Fig. S9), but the correlation between genomic-ENc′ and the scaling factor was strong. This suggests that the correlation of ENc′ (in contrast to GC-content) is indeed caused by the magnitude of ∆LFE.
The observed correlation of ΔLFE with Genomic-ENc' (Fig. 6) is due to correlation with the magnitude of the ΔLFE profile. When all profiles are normalized to have the same scale (by dividing the values of each profile by their standard deviation so the resulting profiles all have standard deviation 1), most of the correlation is removed (A,B). For comparison, the same procedure is followed for genomic-GC (C,D). Values represent coefficient of determination ( ) for GLS regression of each trait (genomic-ENc' or genomic-GC%) vs. the normalized ΔLFE profile at different position relative to CDS edges, with the sign representing the regression coefficient. Regressions for different taxons are shown using different line colors and widths (black is for all species), and white dots show areas in which the regression is significant (p-value<0.01). The dashed red line represents for regression against the standard deviation for each ΔLFE profile (i.e., the scaling factor). A Genomic-ENc' vs. ΔLFE, CDS start. B Genomic-ENc' vs. ΔLFE, CDS end. C Genomic-GC vs. ΔLFE, CDS start. D Genomic-GC vs. ΔLFE, CDS end.

Figure S14. Endosymbionts have weaker ΔLFE
Numeric regression results for GLS multiple regression using genomic-GC, genomic-ENc' and intracellular classification in 4 regions of the CDS, for several taxonomic groups (which contain a sufficient number of intracellular species). p-values shown for GLS are for the categorical Is-intracellular classification factor (determined using t-test), indicating this factor improves upon the predictions made using the two numerical factors in some cases (even after controlling for evolutionary relatedness using GLS), but not in others. R 2 values are shown for the regression without and with intracellular classification.
Explanation of table columns: CDS Reference -point in CDS (start/end) for defining relative positions within all CDSs. Positions: range of positions within CDS (relative to the reference) for which ΔLFE values are averaged. OLS p-value: p-value (using t-test) for Is-intracellular factor, in single regression using OLS (uncorrected for phylogenetic distances). This regression includes all available species (including those which are not contained in the phylogenetic tree so are not used in GLS regression). GLS p-value: p-value (using t-test) for Is-intracellular factor, in multiple regression (including factors GenmoicGC, GenomicENc') using GLS. R 2 without Is-intracellular: coefficient of determination (R 2 ) for regression using the factors GenmoicGC+GenomicENc', as baseline for comparing improvement from the additional factor Is-intracellular. R 2 with Is-intracellular: coefficient of determination (R 2 ) for regression using the factors GenmoicGC+GenomicENc'+Is-intracellular. Slope: direction of slope for factor Is-intracellular (positive or negative). This indicates intracellular species have weaker ΔLFE in  Genes having weak SD sequence may require stronger contribution of other initiation-promoting mechanisms to ensure efficient translation initiation, and therefore might have stronger ∆LFE at the CDS start (feature [A]). This effect, previously reported in the 5'UTRs of S. sp. PCC6803 in (70), is also observed here.

Isintracellular GLS GenmoicGC + GenomicENc' + Is-intracellular
CDS that overlap with a previous CDS may have biased ∆LFE results close to the overlapping region (see e.g. (35) for an explanation of this phenomemon in E. coli). As a simple control for this, we show the difference between genes with 5' intergenic distances shorter than 50nt (including overlapping genes) and other genes. Results show significant but small differences near the CDS start in some but not all species (see e.g. S. sp. And E. coli, panels B,C). Additional differences observed at other points in the CDS may be related to operonic structure. In E. coli, for example, a large decrease in mean ∆LFE is observed in genes with long intergenic distances, but the distributions of the two groups remain similar (inset on the right shows the distributions at the position 40nt from CDS start, where the effect is strongest).
For both controls, determination of each symbol (+/−) was based on results of a Mann-Whitney U test between the two groups of genes across the appropriate region, once for each direction (with the null hypothesis being that a value sampled from one group is not likely to be greater than an item from the other group). Fraction of positive species and total number of species are shown below for each evidence type.
SD strength was calculated according to (71), using the minimum anti-SD hybridization energy in the 20nt upstream of the start codon. The "weak SD" group includes genes with minimum energy greater than -1 kcal/mol.

ΔLFE (kcal/mol/window)
A Correlation between ΔLFE calculated using standard temperature ( ) and native temperature (see methods), at each position relative to CDS start, for species grouped by native temperature range. Correlations were calculated for a random sample (N=71) of species (bacteria and archaea) for which native temperature data is available. B Comparison of individual mean ΔLFE profiles using calculated using standard temperature ( ) and native temperature.