Generation of thousands of dATG yeast variants
Previous systems studies have observed reduced protein abundance upon the addition of an out-of-frame uATG [17,18,19], indicating that PICs scan continuously along the mRNA in the 5′–3′ direction [4, 5, 15, 16]. By the same logic, a reduction in GFP intensity following the addition of an out-of-frame dATG will indicate that ribosomes can also scan in the 3′–5′ direction (Fig. 1A). In this study, we investigated the occurrence and prevalence of 3′–5′ PIC movement by inserting ATGs downstream of the aATG of a GFP reporter and then detecting the impacts on GFP synthesis (i.e., through differences in GFP intensity). To avoid reaching conclusions that are caused by some specific flanking sequences (i.e., confounding factors), as in viruses that may use specific sequences to regulate PIC scanning for overlapping ORFs in their bicistronic mRNAs [29, 30], we generated a large number of sequence variants, each with a dATG inserted in various sequence contexts (Fig. 1B).
Specifically, we introduced dATGs by chemically synthesizing a 39-nt DNA oligo, with six upstream and thirty downstream doped (i.e., random) nucleotides (N = 25% A + 25% T + 25% G + 25% C) around a fixed ATG triplet (designated as the aATG, Fig. 1B, Additional file 1: Fig. S1A). ATG triplets (either in-frame or out-of-frame) could then form randomly within the 30-nt downstream region of each individual construct, ultimately resulting in a randomly sampled variant library (from a huge number of all possible variants) containing dATGs at each successive position downstream of the aATG in various sequence contexts. To increase the fraction of dATG-containing variants, we further synthesized 28 additional DNA oligos, each with a dATG fixed at one of the 28 possible downstream triplet positions (Fig. 1B, Additional file 1: Fig. S1A). We fused these DNA oligos with the full-length GFP sequence (with its initiation codon omitted, Additional file 1: Fig. S1A), and integrated the fusion constructs individually into the same locus in Chromosome II of the yeast genome. We also inserted dTomato, encoding a red fluorescent protein, into a nearby genomic region to normalize GFP intensity (Fig. 1B, Additional file 1: Fig. S1A).
We measured the GFP intensities en masse through fluorescence-activated cell sorting (FACS)-seq for individual variants, as described in a previous study [31]. Briefly, we sorted yeast cells into eight bins according to GFP intensity (here and elsewhere in this study, normalized by dTomato intensity). Based on the variant frequencies in high-throughput sequencing reads of the eight bins, and the median GFP intensity and the number of cells belonging to each bin, we calculated the GFP intensity for each variant as the weighted average GFP intensity across the eight bins (Additional file 1: Fig. S1A).
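The weighted average described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function name, the argument names, and the exact weighting (variant read frequency in a bin multiplied by the number of cells sorted into that bin) are assumptions based on the description in the text.

```python
def variant_gfp_intensity(read_counts, total_reads, bin_cells, bin_median_gfp):
    """Weighted-average GFP intensity of one variant across sorting bins.

    read_counts    : reads of this variant in each bin (hypothetical input)
    total_reads    : total sequencing reads obtained for each bin
    bin_cells      : number of cells sorted into each bin
    bin_median_gfp : median dTomato-normalized GFP intensity of each bin
    """
    weights = []
    for reads, total, cells in zip(read_counts, total_reads, bin_cells):
        # Assumed weight: estimated number of sorted cells in this bin
        # that carry the variant (read frequency x cells in the bin).
        freq = reads / total if total > 0 else 0.0
        weights.append(freq * cells)
    weight_sum = sum(weights)
    if weight_sum == 0:
        raise ValueError("variant not observed in any bin")
    return sum(w * g for w, g in zip(weights, bin_median_gfp)) / weight_sum
```

Scaling the read frequency by the number of cells per bin keeps a bin that was sequenced deeply but contained few cells from dominating the estimate.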
To verify the accuracy of GFP intensities measured en masse, we randomly isolated 20 clones from the yeast library and individually measured their GFP intensities by flow cytometry. The GFP intensities measured en masse and individually were highly consistent (Pearson’s correlation coefficient r = 0.99, P = 1 × 10⁻¹⁹, Additional file 1: Fig. S1B). We measured GFP intensity of the yeast variants in two biological replicates, and the values were highly correlated for the 18,950 variants shared between both experiments (Pearson’s correlation coefficient r = 0.99, P < 2.2 × 10⁻¹⁶, Additional file 1: Fig. S1C). Consequently, unless otherwise specified, we pooled dATG variants from both replicates in subsequent data analyses (Additional file 1: Fig. S1D; the average GFP intensity was used for variants shared by the two replicates).
We performed two positive control analyses to examine the data quality. First, the GFP intensity of variants with in-frame stop codons formed in the 30-nt region downstream of the aATG was lower than that of variants without in-frame stop codons (P < 2.2 × 10⁻¹⁶, Mann-Whitney U test; Additional file 1: Fig. S1E). Second, the variants containing in-frame uATGs showed elevated GFP intensity compared to variants without uATGs (P < 2.2 × 10⁻¹⁶, Mann-Whitney U test; Additional file 1: Fig. S1F), most likely because the additional in-frame AUG could function as an auxiliary initiation site for GFP translation [32]. In contrast, the variants containing out-of-frame uATGs showed reduced GFP intensity (P < 2.2 × 10⁻¹⁶, Mann-Whitney U test; Additional file 1: Fig. S1F), likely because they can prevent translation in the reading frame of GFP. These observations bolstered our confidence in comparing GFP intensities among the dATG variants in our study. Note that we excluded the variants containing in-frame stop codons or uATGs from the subsequent analyses, to avoid their potential impacts on GFP intensity (remaining variants n = 21,598, Additional file 1: Fig. S1D).
Both seminal studies analyzing the consensus sequence across genes [15, 33,34,35] and the recent structural analysis of the late-stage 48S initiation complexes [36] led to the hypothesis that some flanking sequences could facilitate translation initiation (known as the Kozak sequence). To determine if the sequences flanking the aAUGs exerted any detectable influence on the GFP intensities measured in our yeast library, we grouped the 1805 variants that had only one ATG (i.e., the designed aATG) in the 39-nt region, according to the nucleotide type at each position and estimated the average GFP intensity for each of the four variant groups at each position (Additional file 1: Fig. S2A). Briefly, placing different nucleotides at the −3 position (relative to the A[+1] in the aATG codon) led to the highest variation in GFP intensity compared to variation related to different nucleotides at other positions (from −6 to +15, Additional file 1: Fig. S2A). At the −3 position, “A” conferred the highest GFP intensity, followed by G, C, and finally T. This observation is qualitatively consistent with the prevalence of A at the −3 position among 96 yeast genes investigated in a previous study [33] or among the 500 genes with the highest protein synthesis rate in the yeast genome (Additional file 1: Fig. S2B). For simplicity, we hereafter refer to the ATG context using the nucleotide at the −3 position; in the order from “strong” to “weak” are the A, G, C, and T contexts. The observed differences in the strength of the sequence context are likely related to the frequency of leaky scanning, according to previous studies [15].
Frame- and distance-dependent translational inhibition by dAUGs
Prior to measuring the effects of dAUGs on GFP intensity, we considered the variation in the number, position, and context of ATGs among the variants in the yeast library, to establish a standardized and clear nomenclature for these variants. Some variants had only one ATG in the 39-nt region (i.e., the designed aATG) and were therefore denoted as “Solo” variants. Some variants had one additional ATG in the 30-nt downstream region (i.e., the dATG) and were thus designated as “Duo” variants. In addition, the names of variants include the position and context of the aATG and dATG (if present). For example, Duo(1N, 4A) represents variants with two ATGs: the aATG having any one of the four nucleotides (N) at the −3 position and a dATG at the +4 position with an A in its −3 position (Fig. 1C). We subsequently focused on the analysis of 1805 Solo and 13,437 Duo variants.
In our design, dATGs were introduced at a total of 28 positions, among which 10 were in-frame and 18 out-of-frame, relative to the GFP reading frame (Fig. 1C, Additional file 1: Fig. S1A). To investigate whether out-of-frame dAUGs can inhibit translation initiation from the aAUG, we grouped the Duo variants according to the reading frames of their dATGs. The Duo variants containing in-frame dATGs showed elevated GFP intensity compared to Solo variants (P < 2.2 × 10⁻¹⁶, Mann-Whitney U test; Fig. 1D), as did variants containing in-frame uATGs (Additional file 1: Fig. S1F). In sharp contrast, Duo variants harboring an out-of-frame dATG showed reduced GFP intensity compared to Solo variants (P = 3.9 × 10⁻⁵, Mann-Whitney U test; Fig. 1D), strongly suggesting that out-of-frame dAUGs can inhibit translation initiation at the aAUG in a frame-dependent manner. The fractional reduction in GFP intensity for Duo variants relative to Solo variants is hereafter termed the “inhibitory effect”.
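Written out, the inhibitory effect of a group of Duo variants takes the following form. The symbols are introduced here for illustration only, and the use of group averages (rather than, for example, medians) is an assumption.

```latex
\[
  \text{inhibitory effect}
  \;=\; \frac{\bar{I}_{\mathrm{Solo}} - \bar{I}_{\mathrm{Duo}}}{\bar{I}_{\mathrm{Solo}}}
  \;=\; 1 - \frac{\bar{I}_{\mathrm{Duo}}}{\bar{I}_{\mathrm{Solo}}}
\]
% \bar{I}_{Solo}: average GFP intensity of the Solo variants
% \bar{I}_{Duo} : average GFP intensity of the Duo group being compared
```

A value of 0 means no reduction relative to Solo variants, and a negative value corresponds to elevated GFP intensity, as seen for in-frame dATGs.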
To test if these inhibitory effects of out-of-frame dAUGs were dependent on the distance between aATG and dATG, we grouped the Duo variants according to the position of their dATG and then estimated the average GFP intensity for each group. The inhibitory effect gradually declined with increasing aATG-dATG distance (Fig. 1E, Additional file 1: Fig. S2C), and no inhibitory effects were evident at aATG-dATG distances of ~17 nt or greater (Fig. 1E, Additional file 1: Fig. S2C). These observations indicated that translation initiation decisions involving two proximal, potential AUGs were not strictly sequential, but competitive. Note that the placement of dATGs at various positions did not significantly alter the synonymous codon usage or the formation of mRNA secondary structure in the 30-nt variable sequence downstream of the aAUG (Additional file 1: Fig. S3), two factors known to affect translation initiation or elongation, and therefore, protein synthesis [37, 38].
We then performed an additional, small-scale experiment that strictly controlled the flanking sequence to further characterize the distance-dependent inhibitory effect of dAUGs. Specifically, we introduced an out-of-frame dATG at +8, +14, +20, or +26 positions downstream of the aATG (Fig. 1F). To exclude any potential impacts of the peptide sequence on GFP intensity, we used only synonymous mutations to introduce these out-of-frame dATGs. The results showed that proximal out-of-frame dATGs indeed reduced GFP intensity, while increases in distance between the two ATGs resulted in a gradual increase in GFP intensity. Beyond 20 nt, the negative impacts on translation initiation were no longer detectable (Fig. 1F). Collectively, these results established that out-of-frame dATGs could inhibit GFP synthesis and that these inhibitory effects decreased with increasing distance from the aATG.
Context-dependent translational inhibition by dAUGs
The frame- and distance-dependent inhibitory effects of dAUGs suggested that ribosomes could sometimes scan in the 3′–5′ direction, which was compatible with the Brownian ratchet scanning process wherein PICs oscillate in both 5′–3′ and 3′–5′ directions, scanning each successive triplet multiple times. An aAUG that is not recognized by the PIC in the first scan may be recognized in a subsequent scan. When a dAUG is inserted near the aAUG, a PIC that misses the aAUG may instead be retained by that nearby dAUG if it is recognized, thereby reducing the likelihood that the PIC will oscillate 3′–5′ and recognize the aAUG. As the aAUG-dAUG distance increases, there is an increased probability that a given PIC will turn to the 3′–5′ direction before encountering the dAUG, explaining why the inhibitory effect of out-of-frame dAUGs diminishes as the dAUG is placed farther from the aAUG.
The Brownian ratchet scanning model further predicted that the aAUG-dAUG competition depended on the leaky scanning at the aAUG. To test if the observed inhibitory effect of proximal out-of-frame dAUGs is indeed related to the leaky scanning at the aAUG, we divided the Duo variants into four groups based on their aATG −3 context. We found that the inhibitory effect of dATGs was greater when the aATG was in a weaker context (i.e., higher leakage rate, Fig. 2A, Additional file 1: Fig. S2D), which indicated that leaky scanning at aAUGs contributed to dAUG inhibition of translation initiation. To then determine whether these inhibitory effects were due to translation initiation at the dAUG, we also divided the Duo variants into four groups according to their dATG −3 context. We found that the inhibitory effect was greater when the dATG was in a stronger context, indicating the competition of translation initiation between the two AUGs (Fig. 2B, Additional file 1: Fig. S2D).
To confirm this apparent competition between aAUGs and dAUGs for translation initiation, we performed an experiment using a reporter construct carrying two fluorescent proteins, GFP and dTomato, encoded in different reading frames (hereafter referred to as a dual-frame reporter). In this reporter, GFP was translated from an aAUG in a weak context, and dTomato was translated from a proximal out-of-frame dAUG (+8 position, Fig. 2C). Furthermore, six “frame +1” stop codons were removed from the GFP coding sequence (mainly via synonymous mutations, see “Methods”) to avoid premature termination during dTomato translation. Placing the dATG in two different contexts, we measured both green and red fluorescence intensities with flow cytometry. We observed that dTomato intensity increased with increasing strength of dATG context (i.e., lower leakage rate) while GFP intensity was substantially reduced (Fig. 2C). Meanwhile, the mRNA levels did not significantly vary (Additional file 1: Fig. S4). These results confirmed that translation initiation decisions between two closely spaced AUGs were determined in a competitive manner.
Proximal out-of-frame dAUGs lead to reduced mRNA levels via nonsense-mediated mRNA decay (NMD)
Our findings above thus suggested that proximal out-of-frame dAUGs could compete with the aAUG for translation initiation. Since out-of-frame termination codons are abundant in the GFP coding sequence (see “Methods”), we predicted that if translation indeed initiated at a proximal out-of-frame dAUG, a long distance should remain between its (also out-of-frame) termination codon and the poly(A) tail, a signal for mRNA degradation by the NMD pathway [39,40,41]. To test if the insertion of proximal out-of-frame dAUGs can result in lower GFP mRNA stability, we measured the mRNA levels en masse for each variant in the library, as described in previous work [31]. Briefly, we used Illumina sequencing to determine the mRNA level of each variant, which was normalized by the number of cells for that variant (as reflected by its fraction of sequencing reads in the DNA-seq, Fig. 3A). Since the mRNA levels of dATG variants were highly correlated between two biological replicates (Pearson’s correlation coefficient r = 0.86, P < 2.2 × 10⁻¹⁶, Additional file 1: Fig. S5A), we pooled dATG variants from both replicates in subsequent data analyses. We grouped the Duo variants according to the position of their dATGs, as well as by the aATG and dATG contexts. The results showed that mRNA levels were lower in the Duo variants when the out-of-frame dATG was closer to the aATG, particularly when the aATG resided in a weaker context (Additional file 1: Fig. S5B) and the dATG resided in a stronger context (Fig. 3B), suggesting competition for translation initiation between closely spaced AUGs.
To then determine whether the reduction in the mRNA level was caused by NMD activity, we knocked out UPF1, the gene encoding an RNA helicase required for initiating NMD in eukaryotes [42, 43], and created a new yeast library containing a total of 15,256 variants in the background of upf1Δ (Fig. 3A). Note that in an effort to control for the potential cellular effects of the selective marker used for knocking-out UPF1, a yeast strain with a pseudogene (HO) deleted using the same selective marker was used as the wild type for yeast library construction throughout this study. We measured the mRNA levels of these variants and found that the reduction in mRNA levels we previously observed in Duo variants with proximal out-of-frame dAUGs was nearly abolished in the absence of UPF1 (Fig. 3C, Additional file 1: Fig. S5C). These observations are consistent with the idea that the NMD pathway activated by translation initiation at out-of-frame dAUGs could reinforce the inhibitory effect of proximal dAUGs at the translational level.
To exclude the possibility that the distance-dependent inhibitory effect of out-of-frame dATGs is associated with variation in the activation efficiency for NMD, which has been reported to depend on the position of the premature stop codon [41], we further computationally excluded variants that contained out-of-frame stop codons in the variable region in the same reading frame as the corresponding dATGs. After this filtering, all Duo variants containing a frame +1 (or +2) dAUG would terminate translation at the same location in the coding sequence of GFP (+56 or +60, see “Methods”). The NMD activity induced by proximal out-of-frame dATGs was still observed (Additional file 1: Fig. S5D), excluding the variation in NMD efficiency among dATG variants as a confounding factor.
To examine if the inhibitory effects of proximal out-of-frame dAUGs can be detected without the impact of NMD-related variation in mRNA stability, we used FACS-seq to measure GFP intensities in the genetic background of upf1Δ (Additional file 1: Fig. S6A). Despite full rescue at the mRNA level (Fig. 3C), NMD inactivation via UPF1 deletion did not result in a full restoration of GFP intensity in these out-of-frame dATG variants (Additional file 1: Fig. S6B, C). These findings were further confirmed in small-scale experiments using the same dATG constructs as those shown in Fig. 1F (Additional file 1: Fig. S6D). Taken together, the inhibitory effects of proximal out-of-frame dAUGs persisted even after controlling for the impact of NMD-related variation in mRNA stability, indicating direct competition between an aAUG and its proximal dAUG for translation initiation on the transcripts that have escaped NMD.
Surprisingly, we noticed that dATGs in frames +1 and +2 exhibited slightly different inhibitory effects, in both hoΔ and upf1Δ backgrounds (Additional file 1: Fig. S2D and Fig. S6C). This difference was not observed at the mRNA level (Fig. 3B, C, Additional file 1: Fig. S5B, C), suggesting that it was unlikely to be caused by a difference in translation initiation between dAUGs in these two frames. Instead, we hypothesized that this phenomenon was related to specific amino acids encoded in frame 0, given that dATGs in frames +1 and +2 will lead to the overrepresentation of different amino acids in the N-terminus of the GFP reporter. To reduce the possible effects of sequence variation in the N-terminal peptide on GFP folding and fluorescence, we inserted a DNA sequence encoding a 2A self-cleaving peptide [44] upstream of the GFP coding sequence (Additional file 1: Fig. S7A). We controlled the sequence context of both aATG and dATG, generated 3402 Solo variants and 32,140 Duo variants, and performed the FACS-seq and the en masse RNA-seq experiments on this 2A-inserted dATG library (Additional file 1: Fig. S7B, C). The difference in GFP intensity between frame +1 and frame +2 dATGs was no longer detectable, and GFP intensity continued to increase with the aAUG-dAUG distance (Additional file 1: Fig. S7B). These observations further confirmed the inhibitory effect of proximal out-of-frame dAUGs.
Distance-dependent translational inhibition by uAUGs
In general, proximity to the 5′-cap grants an AUG triplet an advantage in the competition to initiate translation, since it is scanned first [6, 15]. Consistent with this hypothesis, it has been widely reported that out-of-frame uAUGs can inhibit translation at the aAUG because the uAUG can retain a proportion of PICs that would otherwise initiate translation at the aAUG [15, 19, 20]. Given our results showing competition for initiation between a closely spaced aAUG-dAUG pair, we further predicted that a closely spaced uAUG-aAUG pair would also compete for translation initiation. That is, when a uAUG is near the aAUG, a PIC that misses the uAUG (due to leaky scanning) may be retained by the nearby aAUG, thereby reducing the likelihood that the PIC will oscillate 3′–5′ and recognize the uAUG. Therefore, the Brownian ratchet scanning model further predicted that the inhibitory effect of an out-of-frame uAUG should diminish with decreasing uAUG-aAUG distance (Fig. 4A).
To test if the inhibitory effects of an out-of-frame uAUG indeed depend on its distance to the aAUG, we synthesized a uATG variant library (Fig. 4B) similar to the dATG variant library. Specifically, we introduced uATGs by chemically synthesizing a 30-nt DNA oligo with doped nucleotides (N) upstream of a fixed aATG triplet. To increase the proportion of variants carrying a uATG, we synthesized 28 additional DNA oligos, each with a uATG fixed at one of the 28 possible upstream triplet positions. We fused these DNA oligos with the full-length GFP sequence and integrated the fusion constructs individually into the yeast genome. GFP intensity and mRNA level of individual variants were then measured by FACS-seq and en masse RNA sequencing of the variable region, respectively, following the same protocol as used for the dATG library. GFP intensity and mRNA level were quantified in two biological replicates, and since the values were highly correlated between replicates (Additional file 1: Fig. S8), the data from both replicates were pooled in subsequent analyses.
The 3112 variants containing stop codons in the frame of uATGs and at a position upstream of the aATG were excluded from the subsequent analyses to avoid the potential effects of translation reinitiation (i.e., the ability of some short upstream open reading frames to retain the 40S subunit on the mRNA post-termination and then reinitiate translation at a downstream AUG). We confirmed that the 6553 Duo variants containing in-frame uATGs indeed showed higher GFP intensities than the 2872 Solo variants, and that the 9033 Duo variants containing out-of-frame uATGs indeed had lower GFP intensities (Fig. 4C, Additional file 1: Fig. S9). These results led us to further examine the impacts of uATG position relative to the aATG, as well as uATG sequence context, on GFP intensities among the Duo variants.
To this end, Duo variants were grouped according to the position of their inserted uATGs, in a manner similar to that used for grouping dATGs in Fig. 1E. The results showed that GFP intensities increased with decreasing uATG-aATG distance, a trend which was especially apparent when the distance between the two ATGs was relatively small (Fig. 4C). We also observed that the inhibitory effect of a proximal, out-of-frame uATG was reduced in the variants harboring the aATG in a strong context (Additional file 1: Fig. S9A) or with a uATG in a weak context (Additional file 1: Fig. S9B). We then performed an additional, small-scale experiment in which the flanking sequence was strictly controlled in order to further characterize the distance-dependent inhibitory effects of uAUGs. Specifically, we introduced an out-of-frame uATG in a weak context (with a T in the −3 position) at positions −25, −19, −13, or −7 upstream of the aATG in a strong context (with an A in the −3 position, Fig. 4D). We observed that decreasing distance between the two ATGs indeed resulted in a gradual increase in GFP intensity (Fig. 4D). Taken together, these results showing distance- and context-dependent inhibitory effects by out-of-frame uATGs suggested that aAUGs compete with proximal uAUGs to initiate translation.
Translation initiation at out-of-frame uAUGs would result in the activation of the NMD pathway. Therefore, if the reduced inhibitory effect of proximal out-of-frame uAUGs did result from competition for translation initiation between the aAUG and a proximal uAUG, we predicted that the GFP mRNA level should increase with decreasing uAUG-aAUG distance. En masse quantification of mRNA levels for the out-of-frame uATG variants in the hoΔ background revealed that the GFP mRNA level was higher in the variants with a smaller uATG-aATG distance, a weaker uATG context, and/or a stronger aATG context (Additional file 1: Fig. S9C, D). Upon UPF1 deletion, the GFP transcripts of Duo variants carrying an out-of-frame uATG were restored to levels comparable with those of Solo variants (Additional file 1: Fig. S10A, B), regardless of the uATG-aATG distance and the sequence context. Similar to the observation for the dATG variants, NMD inactivation did not result in a full restoration of GFP intensity in the out-of-frame uATG variants of the uATG library (Additional file 1: Fig. S10C, D) or of the small-scale experiment (Additional file 1: Fig. S10E). These results thus indicated that the distance- and context-dependent inhibitory effect of out-of-frame uAUGs was indeed a consequence of competition for translation initiation between a uAUG and the aAUG.
Computational modeling reveals that each successive triplet is on average scanned by the PIC approximately ten times
The competition for translation initiation we observed between closely spaced AUGs (either between an aAUG-dAUG pair or between a uAUG-aAUG pair) is qualitatively consistent with a scanning process in which the PIC is tethered to mRNA and progresses toward 3′-end under a Brownian ratchet mechanism and is inconsistent with a strictly unidirectional scanning process. It is worth noting that this observation would also be qualitatively compatible with other scanning models as long as PIC movement in both 5′–3′ and 3′–5′ directions is invoked. For example, some researchers have proposed that the PIC can move to the initiation codon via ATP-independent PIC “diffusion” along the mRNA [10, 45, 46]. Notably, the quantification of GFP intensity we conducted for thousands of variants in this study provided us with an opportunity to estimate the parameters of PIC scanning, such as the number of scans for each triplet, the frequency that a pawl (i.e., the 5′-block) is placed along the mRNA, and the efficiency of AUG recognition by the PIC. If the frequency of pawl placement is estimated to be zero, the diffusion model will be supported. On the contrary, the Brownian ratchet model will be supported if this frequency is greater than zero.
To this end, we simulated the scanning process using a modified random walk model, as the PIC movement consists of a succession of random steps on the discrete positions along the “one-dimensional space” of a linear mRNA (Fig. 5A). We specified the following parameters in our random walk model. During scanning, the 13th–15th positions of a PIC-bound mRNA fragment constitute the P-site [47], where the inspection for complementarity to the Met-tRNAi anticodon occurs. The PIC starts at the 5′-cap and moves 1 nt per step in either the 5′–3′ or 3′–5′ direction, with equal probability (i.e., 50% each). However, the PIC cannot move further upstream if its 5′-trailing side hits a pawl or the complex of the 5′-cap and its binding protein eIF4E. A pawl is stochastically placed along the mRNA at the 5′-trailing side of the PIC (depending on the PIC location at the time) with probability p.Pawl (Fig. 5A). When an AUG enters the P-site of the PIC, the AUG escapes recognition with probability p.Leakage and is recognized, initiating translation, with probability (1 − p.Leakage). Note that in our model AUG triplets can be recognized during either 5′–3′ or 3′–5′ PIC movement. Considering that the NMD pathway can reduce mRNA levels and consequently amplify the impact of translation initiation at out-of-frame dAUGs on protein abundance, we used the parameter p.NMD to set the probability of activating NMD when an out-of-frame dAUG is recognized by the PIC (Fig. 5A).
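A single PIC trajectory under these rules can be sketched as below. This is a simplified illustration, not the simulation code of the study: it tracks only the P-site coordinate and ignores the PIC footprint (so a pawl is dropped at the P-site position itself), it inspects a triplet only when the P-site newly moves onto it, and the function names, outcome labels, and `max_steps` guard are all assumptions.

```python
import random

def simulate_pic(mrna, a_pos, p_pawl, p_leak, p_nmd, rng, max_steps=1_000_000):
    """Simulate one PIC; return 'aAUG', 'other', 'other+NMD', or 'none'.

    mrna  : sequence read 5'->3' (T is treated as U)
    a_pos : 0-based index of the A of the annotated AUG
    """
    seq = mrna.upper().replace("T", "U")
    last = len(seq) - 3          # last index at which a full triplet fits
    x = 0                        # P-site triplet starts at the 5' end
    barrier = 0                  # cap/eIF4E complex blocks upstream movement
    for _ in range(max_steps):
        if rng.random() < p_pawl:
            barrier = x          # pawl dropped at the current location
        step = 1 if rng.random() < 0.5 else -1
        if x + step < barrier:
            continue             # blocked by the pawl or the cap
        x += step
        if x > last:
            return "none"        # ran off the 3' end without initiating
        if seq[x:x + 3] == "AUG" and rng.random() >= p_leak:
            if x == a_pos:
                return "aAUG"
            out_of_frame = (x - a_pos) % 3 != 0
            if out_of_frame and rng.random() < p_nmd:
                return "other+NMD"
            return "other"
    return "none"

def fraction_at_aaug(mrna, a_pos, p_pawl, p_leak, p_nmd, n=100, seed=0):
    """Fraction of n simulated PICs that initiate at the annotated AUG."""
    rng = random.Random(seed)
    hits = sum(
        simulate_pic(mrna, a_pos, p_pawl, p_leak, p_nmd, rng) == "aAUG"
        for _ in range(n)
    )
    return hits / n
```

For instance, `fraction_at_aaug("CCTCCCATGCACCATG" + "C" * 23, 6, 0.001, 0.77, 0.62)` estimates how often the aAUG wins against an out-of-frame dAUG at +8. How such outcome counts are converted into a predicted GFP intensity, including the NMD penalty, is not reproduced here.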
We employed a Markov Chain Monte Carlo (MCMC) algorithm [48, 49] to calculate numerical approximations for the probability parameters in the Brownian ratchet scanning model. To compare with experimental measurements of GFP intensity (Fig. 1E), we simulated the Brownian ratchet scanning process for 25 Duo variants with N-context dAUGs (representing an “average” dAUG) positioned between +7 and +31. To explore the relevant parameter space by the MCMC sampler, we first tested 1000 parameter sets for p.Pawl, p.Leakage, and p.NMD (each with ten values ranging from 0.001 to 0.8, Additional file 1: Fig. S11A). For each parameter set, we generated 100 simulations of the PIC scanning process for each Duo variant (two examples are shown in Fig. 5B) and estimated GFP intensity based on the number of simulations in which translation was initiated at the aAUG. We identified the ten parameter sets that showed the smallest residual sum of squares (RSS) for the 25 Duo variants and initiated the MCMC simulation using the median value for each parameter among the ten parameter sets: p.Pawl = 0.001, p.Leakage = 0.75, and p.NMD = 0.55 (Additional file 1: Fig. S11B).
We then ran the MCMC algorithm for 300 iterations, in which we sequentially replaced each of the three probability parameters with a random number generated from a uniform distribution (see “Methods”). For each iteration, we calculated the RSS from the simulated and observed GFP intensities in our yeast library, and used the RSS as a proxy to optimize the parameters. If the RSS decreased, the previous set of parameters was replaced by the new parameters, whereas if the RSS increased, the previous set of parameters remained unchanged (Fig. 5C).
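The update rule described above (resample one parameter at a time and keep the proposal only if the RSS decreases) can be sketched as follows. The function names, the proposal range of 0 to 1, and the `predict` callback that maps a parameter set to simulated GFP intensities are assumptions made for illustration; the actual proposal distribution is given in “Methods”.

```python
import random

def rss(simulated, observed):
    """Residual sum of squares between simulated and observed values."""
    return sum((s - o) ** 2 for s, o in zip(simulated, observed))

def optimize(params, predict, observed, n_iter=300, seed=0, low=0.0, high=1.0):
    """Sequentially resample each parameter; keep proposals that lower the RSS.

    params  : dict of starting values, e.g. p.Pawl, p.Leakage, p.NMD
    predict : hypothetical callback returning simulated GFP intensities
    """
    rng = random.Random(seed)
    best = dict(params)
    best_rss = rss(predict(best), observed)
    trace = [best_rss]
    for _ in range(n_iter):
        for name in list(best):           # update one parameter at a time
            trial = dict(best)
            trial[name] = rng.uniform(low, high)
            trial_rss = rss(predict(trial), observed)
            if trial_rss < best_rss:      # accept only improvements
                best, best_rss = trial, trial_rss
        trace.append(best_rss)
    return best, best_rss, trace
```

Because only improvements are accepted, the RSS trace is non-increasing by construction. A textbook Metropolis sampler would additionally accept some proposals that increase the RSS.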
The parameters reached a stationary distribution by the end of 300 iterations (Fig. 5C), and the GFP levels observed in the experiments were largely recapitulated by our simulated ratchet-and-pawl mechanism of PIC scanning (Fig. 5D). To obtain a reliable estimation of the parameters, we repeated the MCMC 30 times and found that the estimated parameters were robust after 300 iterations in all runs (Fig. 5E, Additional file 1: Fig. S11C, Fig. S12). The average parameter values that resulted in the smallest RSS were as follows: the probability of adding a pawl to the mRNA was ~1 out of 1000 PIC steps; the average leakage rate for every single scan was 77%; and the NMD rate was 62% (Fig. 5F). Based on these values, we estimated that, on average, each triplet was scanned approximately ten times by the PIC (95% confidence interval, 6 to 14), resulting in a net leakage rate of 8% for a single AUG triplet (i.e., on average 8% of PICs eventually miss an AUG after multiple scans).
So far, we simulated using an N-context dAUG and neglected any difference in the leakage rate among sequence contexts of ATG triplets. To individually estimate p.Leakage for ATGs in the A, G, C, or T contexts, we fixed the values of p.Pawl (= 0.001) and p.NMD (= 0.62) and optimized context-specific p.Leakage by running the MCMC algorithm for another 100 iterations, based on the RSS estimated from the GFP intensities of variants Solo(1A), Solo(1G), Solo(1C), and Solo(1T) observed in our experiments (Fig. 2A). The average p.Leakage values across 30 MCMC runs were 0.49, 0.68, 0.85, and 0.92 (Fig. 5G, Additional file 1: Fig. S11D), corresponding to net leakage rates of 0.02, 0.08, 0.12, and 0.21, for ATGs in the A, G, C, or T context, respectively (Fig. 5G).
The Brownian ratchet scanning model and the ATP-independent diffusion model can be distinguished by determining whether the probability of adding a pawl (p.Pawl) is equal to zero. p.Pawl was estimated to be significantly greater than zero (0.10%, with a standard error of 0.02%, Fig. 5F) in our MCMC analyses, supporting a Brownian ratchet model for PIC scanning rather than a diffusion model. Note that a linear relationship between the length of the 5′-untranslated region (5′-UTR) and the time required for the first round of translation products was reported in previous studies [26, 50]; this relationship is also consistent with the Brownian ratchet scanning model rather than a diffusion model, which predicts a quadratic relationship.
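As a back-of-the-envelope check on how a high per-scan leakage rate translates into a low net leakage rate, one can assume that every scan of a triplet is an independent trial with the same leakage probability. This independence assumption and the helper name `net_leakage` are ours; the estimate reported above presumably comes from the full simulation rather than from this shortcut.

```python
def net_leakage(p_leak_per_scan, n_scans):
    """Probability that an AUG is missed in every one of n independent scans."""
    return p_leak_per_scan ** n_scans

# Per-scan leakage of 77% over the estimated range of scans per triplet.
for n in (6, 10, 14):
    print(n, round(net_leakage(0.77, n), 3))
```

With a per-scan leakage of 0.77 and ten scans this gives roughly 0.07, in the same range as the net leakage rate of 8% quoted above.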
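The contrast in length dependence can be made concrete with an idealized calculation. The sketch below treats pure diffusion as an unbiased 1-nt random walk that starts at the cap (position 0), stays in place whenever a step would move it upstream of the cap, and must first reach position L; this idealization and the symbols h_k and T(L) are assumptions introduced for illustration.

```latex
% h_k : expected number of steps to first reach position k+1 from position k
% T(L): expected number of steps to first reach position L from the cap
\[
  h_0 = 2, \qquad
  h_k = 1 + \tfrac{1}{2}\,\bigl(h_{k-1} + h_k\bigr)
  \;\Longrightarrow\; h_k = h_{k-1} + 2 = 2(k+1),
\]
\[
  T(L) \;=\; \sum_{k=0}^{L-1} h_k \;=\; \sum_{k=0}^{L-1} 2(k+1) \;=\; L(L+1).
\]
```

The expected scanning time of a purely diffusive PIC therefore grows with the square of the 5′-UTR length, whereas any mechanism that imposes a net 5′–3′ drift yields a time that grows approximately linearly with length.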
Depletion of proximal out-of-frame dATGs in yeast and human genomes
Given the reduced efficiency for translation of canonical ORFs and the possibility of enhanced synthesis of potentially cytotoxic peptides, we predicted that proximal out-of-frame dATGs would be generally deleterious. Therefore, we sought to test if proximal out-of-frame dATGs have been purged from the yeast genome by purifying selection. To this end, we counted the number of genes with ATGs at various positions downstream of the aATG across the yeast genome (Fig. 6A). The results showed that the number of out-of-frame dATGs increased gradually with distance from the aATG. The trend was statistically more significant in frame +1, probably because ~80% of out-of-frame dATGs are located in frame +1 due to the preferred usage of some amino acids or codons in frame 0. Moreover, the paucity of frame +1 proximal dATGs was particularly apparent for dATGs in the stronger context, suggesting that this paucity is related to translation at dAUGs (Fig. 6B; for results of frames 0 and +2, see Additional file 1: Fig. S13).
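The genome-wide count described above can be sketched as follows. The function name, the `max_pos` cutoff, and the input format (a list of coding sequences that each begin with the annotated ATG) are assumptions; parsing of genome annotations and the context-specific breakdown are omitted.

```python
from collections import Counter

def count_datgs(cds_list, max_pos=100):
    """Count genes with a downstream ATG at each position, split by frame.

    Positions are 1-based with the A of the aATG at +1, so an ATG starting
    at 0-based index i is at position i + 1. Frame 0 is in-frame with the
    annotated ORF; frames 1 and 2 correspond to frames +1 and +2.
    """
    counts = {0: Counter(), 1: Counter(), 2: Counter()}
    for cds in cds_list:
        cds = cds.upper()
        if not cds.startswith("ATG"):
            continue                      # skip entries lacking an aATG
        for i in range(3, min(len(cds) - 2, max_pos)):
            if cds[i:i + 3] == "ATG":
                counts[i % 3][i + 1] += 1
    return counts

# Toy example: one frame +1 dATG at +5 and one in-frame dATG at +4.
example = count_datgs(["ATGCATGCC", "ATGATGCCC"])
print(example[1][5], example[0][4])
```

Each position can hold at most one ATG per gene, so the counters give the number of genes carrying a dATG at that position in each frame.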
The detrimental impacts of proximal out-of-frame dATGs (e.g., synthesis of toxic peptides) should scale with the gene expression level. Therefore, mutations that generate proximal out-of-frame dATGs should be subject to stronger purifying selection in more highly expressed genes, leading to fewer proximal out-of-frame dATGs in these genes. Our sequence analyses showed that the 2000 genes with the highest expression levels (or transcription rates) in the yeast genome indeed harbored fewer proximal out-of-frame dATGs than the 2000 genes with the lowest expression levels (or transcription rates; Additional file 1: Fig. S14A, B). Collectively, the paucity of proximal out-of-frame dATGs, especially those in stronger contexts and in more highly expressed genes, suggested that purifying selection against proximal out-of-frame dATGs has played a role in yeast genome evolution.
To test whether proximal out-of-frame dATGs were also purged from other eukaryotic genomes, we similarly counted the number of frame +1 dATGs at various positions in the human genome. As in the yeast genome, the number of frame +1 dATGs increased with distance from the aATG, consistent with the observation in a previous study [51]. The trend was also more pronounced for dATGs in a stronger context (Fig. 6C) and in more broadly expressed genes (Additional file 1: Fig. S14C). As a negative control, prokaryotes, which do not use the scanning mechanism to search for the initiation codon [52, 53], did not show depletion of proximal out-of-frame dATGs (Fig. 6D, E). These observations implied that the Brownian ratchet scanning process has probably shaped the evolution of eukaryotic genomes in general.
Although the translation machinery of yeast and humans is largely identical, some differences have been reported in the components of eIFs [6]. To test whether proximal out-of-frame dAUGs can indeed inhibit translation initiation at the aAUG in humans, we constructed a firefly and Renilla dual-luciferase reporter system and designed three additional variants, each with a frame +1 ATG introduced by synonymous mutations at a different location (+8, +14, or +20) downstream of the firefly luciferase aATG (Fig. 6F). We transfected the reporters individually into HeLa cells and measured the Renilla-normalized firefly luciferase activity and mRNA level. In agreement with our findings in yeast, proximal out-of-frame dATGs reduced firefly luciferase activity, and this reduction was partly attributable to a reduced mRNA level (which was likely caused by the NMD pathway activated by translation initiation at the out-of-frame dATG). Moreover, increasing the distance between the two ATGs resulted in a gradual increase in firefly luciferase activity, which was no longer distinguishable from that of the wild type in the variant with the ATG located 20 nt downstream of the aATG (Fig. 6F). These results confirmed that the inhibitory effects of proximal out-of-frame dATGs on protein synthesis are evolutionarily conserved between humans and yeast.
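The normalization behind these comparisons can be written out explicitly. The sketch below uses hypothetical function names and hypothetical readings, not data from this study. Dividing the relative activity by the relative mRNA level, as a way to separate the mRNA-level contribution from the translational one, is an assumption made here for illustration.

```python
def renilla_normalized(firefly, renilla):
    """Firefly signal divided by the co-transfected Renilla signal."""
    return firefly / renilla


def relative_to_wild_type(variant_ratio, wild_type_ratio):
    """Express a Renilla-normalized value as a fraction of the wild type."""
    return variant_ratio / wild_type_ratio


def per_mrna_output(relative_activity, relative_mrna):
    """Relative activity per unit of relative mRNA (assumed, illustrative measure)."""
    return relative_activity / relative_mrna


if __name__ == "__main__":
    # Hypothetical luminescence readings in arbitrary units.
    wild_type = renilla_normalized(firefly=1000.0, renilla=1000.0)
    variant = renilla_normalized(firefly=500.0, renilla=1000.0)
    activity = relative_to_wild_type(variant, wild_type)
    # Hypothetical Renilla-normalized mRNA level of the variant vs. wild type.
    print(activity, per_mrna_output(activity, relative_mrna=0.8))
```

A per-mRNA value below 1 would indicate that the drop in activity is not explained by the mRNA level alone.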