Quantitative principles of cis-translational control by general mRNA sequence features in eukaryotes

Background
General translational cis-elements are present in the mRNAs of all genes and affect the recruitment, assembly, and progress of preinitiation complexes and the ribosome under many physiological states. These elements include mRNA folding, upstream open reading frames, specific nucleotides flanking the initiating AUG codon, protein coding sequence length, and codon usage. The quantitative contributions of these sequence features and how and why they coordinate to control translation rates are not well understood.

Results
Here, we show that these sequence features specify 42–81% of the variance in translation rates in Saccharomyces cerevisiae, Schizosaccharomyces pombe, Arabidopsis thaliana, Mus musculus, and Homo sapiens. We establish that control by RNA secondary structure is chiefly mediated by highly folded 25–60 nucleotide segments within mRNA 5′ regions, that changes in tri-nucleotide frequencies between highly and poorly translated 5′ regions are correlated between all species, and that control by distinct biochemical processes is extensively correlated, as is regulation by a single process acting in different parts of the same mRNA.

Conclusions
Our work shows that general features control a much larger fraction of the variance in translation rates than previously realized. We provide a more detailed and accurate understanding of the aspects of RNA structure that direct translation in diverse eukaryotes. In addition, we note that the strongly correlated regulation between and within cis-control features will cause more even densities of translational complexes along each mRNA and therefore more efficient use of the translation machinery by the cell.

Electronic supplementary material
The online version of this article (10.1186/s13059-019-1761-9) contains supplementary material, which is available to authorized users.


[Figure: distributions of translation rates; axis: translation rate (log10)]

Figure S4. The number of nucleotide pairs controlling translation is similar across diverse eukaryotes. Metrics for "min" windows in the 10% most highly translated mRNAs (high TR) were subtracted from metrics for "min" windows in the 10% most poorly translated mRNAs (low TR), y-axis. The metrics are the means of minimum free energy (MFE); log10 translation rate (TR); number of unpaired nucleotides linking the longest stem (loop); number of nucleotide pairs in the part of the longest stem that contains no mismatches or single nucleotide bulges (contig. stem); number of all nucleotide pairs in the longest stem (max stem); and the total number of nucleotide pairs in the "min" window (all). See Additional file 6 for the primary data. The distributions of the total number of "min"-window nucleotide pairs in high and low TR cohorts are shown in Additional file 1: Figure S5. The results show that, while the low TR cohort contains on average more paired nucleotides than the high TR cohort, there is considerable overlap in structures between the two cohorts.

Figure S6. Fine mapping the 5' and 3' boundaries of APEs in M. musculus and H. sapiens. The R² coefficients of determination between log10 translation rates (TR) and PWM scores. The figure is as described in Fig. 4 except that PWMs were chosen to more precisely map APE boundaries. The results show that the APEs extend from -6 to +13 in M. musculus and from -6 to +7 in H. sapiens.

Figure S7. AUG proximal elements (APEs) comprise sequences upstream and downstream of the iAUG and are best described by Position Weight Matrices (PWMs) and di- and tri-nucleotide frequencies. The R² coefficients of determination between log10 translation rate (TR) and feature(s) describing the iAUG upstream portion of the APE (uAPE); the iAUG downstream portion of the APE (dAPE); and the complete APE. Results are shown for a model employing only the PWM score and for a multivariate model combining the PWM score with a BIC selected subset of di- and tri-nucleotide frequencies. The models are described in Additional file 5.

Figure S8. A 5' cap element. (a) PWMs for the 10% of mRNAs with the highest translation rate (high TR cohort) and the 10% with the lowest rate (low TR cohort). Sequence logos show the frequency of each nucleotide at each position relative to the first nucleotide of the transcript (i.e. the 5' cap). Only 5'UTR sequences that lie 5' of the APE were included in the analysis. (b) The R² coefficients of determination between log10 TR and PWM scores. PWMs of varying lengths were built from the sequences of the high TR cohort. Log odds scores were then calculated for all mRNAs that completely contained a given PWM. PWMs extending 3' from the 5' cap in 1 nucleotide or 5 nucleotide increments were tested (x-axis, right to left). A local maximum in R² values is seen between 3 and 15 nucleotides from the 5' cap, depending on the species. Because this 5' cap element controls less than 1.2% of the variance in TR, and to simplify our models, the PWM score of the first 5 nucleotides was used in a model for the 5'ofAPE region for all species, together with a BIC selected subset of di- and tri-nucleotides from the entire 5'ofAPE region. The nucleotide frequencies shown in the sequence logos are given in Additional file 7.
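The PWM scoring described for Figure S8(b) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_pwm` and `log_odds_score`, the pseudocount of 1, the uniform background of 0.25 per nucleotide, and the use of log base 2 are all assumptions made for illustration, since the caption does not specify them.

```python
import math

BASES = "ACGU"


def _normalize(seq):
    """Upper-case the sequence and write DNA-style T as U."""
    return seq.upper().replace("T", "U")


def build_pwm(seqs, length, pseudocount=1.0):
    """Per-position nucleotide frequencies over the first `length` nt.

    Only sequences at least `length` nt long contribute. The pseudocount
    (an assumption) keeps every frequency above zero.
    """
    usable = [_normalize(s)[:length] for s in seqs if len(s) >= length]
    if not usable:
        raise ValueError("no sequence is long enough for this PWM length")
    pwm = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}
        for s in usable:
            if s[pos] in counts:
                counts[s[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm


def log_odds_score(seq, pwm, background=None):
    """Sum of log2(PWM frequency / background frequency) over positions.

    Returns None when the sequence does not completely contain the PWM.
    Ambiguous bases contribute 0 (an assumption).
    """
    if len(seq) < len(pwm):
        return None
    seq = _normalize(seq)
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumed uniform background
    score = 0.0
    for i, column in enumerate(pwm):
        base = seq[i]
        if base in column:
            score += math.log2(column[base] / background[base])
    return score


# Toy sequences (made up for illustration, not data from the study).
high_tr = ["AAAAAGC", "AAAACGG", "AAAAAUU"]
pwm5 = build_pwm(high_tr, 5)
print(log_odds_score("AAAAAGGG", pwm5))  # resembles the high TR cohort
print(log_odds_score("CCCCCGGG", pwm5))  # does not resemble it
```

In an analysis of this kind, scores computed for every mRNA would then be regressed against log10 TR to obtain the R² values plotted for each PWM length.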

Figure S9. AUG proximal elements (APEs) and the sequences 5' of these elements (5'ofAPEs) differ in importance between mammals and non-mammalian eukaryotes. The R² coefficients of determination between log10 translation rate (TR) and models describing the APE; the 5'ofAPE; and the combination of these two models (5'motifs). The models are described in Additional file 5.

Figure S10. Tri-nucleotides regulating translation. The heat maps show the frequency of each tri-nucleotide in the most highly translated 10% of genes divided by its frequency in the most poorly translated 10% of genes ((TR high / TR low) ratios). Results are presented for four mRNA regions of each species. The (TR high / TR low) ratios are provided in Additional file 8.
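The (TR high / TR low) ratio defined for Figure S10 can be sketched as follows. This is an illustrative Python example rather than the authors' code: the function names, the counting of overlapping tri-nucleotides, and the pseudocount of 1 (used to avoid division by zero) are assumptions, and the input sequences are toy values.

```python
from itertools import product

# All 64 tri-nucleotides over the RNA alphabet.
ALL_TRI = ["".join(p) for p in product("ACGU", repeat=3)]


def trinucleotide_frequencies(seqs, pseudocount=1.0):
    """Pooled tri-nucleotide frequencies over a set of sequences.

    Overlapping windows are counted (an assumption); windows containing
    characters other than A, C, G, U are skipped.
    """
    counts = {tri: pseudocount for tri in ALL_TRI}
    for seq in seqs:
        seq = seq.upper().replace("T", "U")
        for i in range(len(seq) - 2):
            tri = seq[i:i + 3]
            if tri in counts:
                counts[tri] += 1
    total = sum(counts.values())
    return {tri: counts[tri] / total for tri in ALL_TRI}


def high_low_ratios(high_seqs, low_seqs, pseudocount=1.0):
    """Frequency in the high TR cohort divided by that in the low TR cohort."""
    high = trinucleotide_frequencies(high_seqs, pseudocount)
    low = trinucleotide_frequencies(low_seqs, pseudocount)
    return {tri: high[tri] / low[tri] for tri in ALL_TRI}


# Toy cohorts (made up for illustration).
ratios = high_low_ratios(["AAAAAACG", "AAAUAAA"], ["GCGCGCGC", "CCGCGG"])
enriched = sorted(ratios, key=ratios.get, reverse=True)[:3]
print(enriched)  # tri-nucleotides most enriched in the toy high TR cohort
```

Ratios above 1 mark tri-nucleotides that are more frequent in the high TR cohort, and ratios below 1 mark those more frequent in the low TR cohort.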

Figure S11. Correlations between translational cis-regulatory elements within a species. The Pearson correlation coefficients between the (TR high / TR low) tri-nucleotide ratios for different portions of mRNAs are shown. The correlations were calculated from pairwise comparisons such as those shown in Fig. 5.

Figure S12. Multivariate models for the five general features. Scatter plots show the relationship between measured translation rates (y-axis) and translation rates explained by multivariate models for the five general features (x-axis). The R² coefficients of determination are given. The measured and explained log10 translation rates plotted are provided in Additional file 2.

Figure S14. Arabidopsis genes that express only a single isoform mRNA behave similarly to the collection of all genes. The subset of genes for which only a single mRNA isoform has been detected in whole seedlings or plants was identified (s.i.) and its behavior compared to that of the complete set of genes (all) for each tissue (Materials and Methods). The number of genes in each set is given to the right. (a) For each feature separately, its R² coefficient of determination vs log10 translation rate is given as a percentage of the sum of the R² coefficients for linear models for each of the five features. (b) The R² coefficients between log10 TR and a linear multivariate model that combines the five features. The results show that the s.i. genes are broadly representative of all genes; thus, any differences in translation rates between mRNA isoforms do not substantially alter our conclusions.

Figure S16. Control by amino acid frequencies and synonymous codon preferences correlates with tRNA abundances. The relationship between tRNA abundances and control by amino acid content and synonymous codon preferences was tested in S. cerevisiae because estimates of effective tRNA abundances are particularly well established for this species.
The frequencies of amino acids (AA), or codons (codon), or the preferences for synonymous codons (syn.codon) were determined separately for the most highly translated 10% of genes (high TR) and for the most poorly translated 10% of genes (low TR). Genes not encoding all 20 amino acids were excluded from the analysis. (a) The coefficient of determination (R²) for the high TR cohort or low TR cohort AA, syn.codon or codon frequencies vs their cognate tRNA abundances. For AA, the frequencies of all cognate tRNAs for each amino acid were summed to give a combined tRNA abundance. p-values testing if the correlation of tRNA abundance with the high TR cohort is greater than that with the low TR cohort are given, with significant p-values shown in red. The high TR mean frequencies correlate more strongly with tRNA abundances than do the low TR frequencies, indicating that translation of high TR mRNAs uses the cellular population of aminoacylated tRNAs more efficiently than translation of low TR mRNAs. (b) The ratios between AA, syn.codon or codon frequencies in the high TR cohort divided by those in the low TR cohort were determined. Ratios > 1 thus indicate a larger frequency in high TR cohorts than in low TR cohorts. Scatter plots are shown between these (high TR / low TR) ratios and tRNA abundance along with the Pearson correlation coefficients (r) and p-values testing if the correlations are significant (significant p-values in red). Dashed vertical lines indicate a ratio of 1. The Pearson correlation coefficients range from +0.43 to +0.70, establishing that codons for high abundance tRNAs are more prevalent in highly translated mRNAs, whereas codons for low abundance tRNAs are more prevalent in poorly translated mRNAs.
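The comparison in panel (b) amounts to computing a (high TR / low TR) ratio per codon or amino acid and correlating those ratios with tRNA abundance. The Python sketch below illustrates that calculation only. The function names are hypothetical, the frequencies and tRNA abundances are invented toy numbers rather than values from the study, and the p-value calculation is omitted.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def cohort_ratios(high_freq, low_freq):
    """(high TR / low TR) frequency ratio for every key present in both."""
    return {k: high_freq[k] / low_freq[k] for k in high_freq if k in low_freq}


# Hypothetical codon frequencies and tRNA abundances (toy values only).
high_freq = {"GCU": 0.030, "GCC": 0.020, "GCA": 0.010, "GCG": 0.004}
low_freq = {"GCU": 0.020, "GCC": 0.018, "GCA": 0.015, "GCG": 0.008}
trna_abundance = {"GCU": 11.0, "GCC": 7.0, "GCA": 5.0, "GCG": 1.0}

ratios = cohort_ratios(high_freq, low_freq)
codons = sorted(ratios)
r = pearson_r([ratios[c] for c in codons], [trna_abundance[c] for c in codons])
print(f"r = {r:.2f}, R^2 = {r * r:.2f}")
```

A positive r in such a comparison would correspond to the pattern described in the caption, with codons read by abundant tRNAs being relatively more frequent in the high TR cohort.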