RNA G-quadruplexes at upstream open reading frames cause DHX36- and DHX9-dependent translation of human mRNAs

Background RNA secondary structures in the 5′-untranslated regions (5′-UTR) of mRNAs are key to the post-transcriptional regulation of gene expression. While it is evident that non-canonical Hoogsteen-paired G-quadruplex (rG4) structures somehow contribute to the regulation of translation initiation, the nature and extent of human mRNAs that are regulated by rG4s is not known. Here, we provide new insights into a mechanism by which rG4 formation modulates translation. Results Using transcriptome-wide ribosome profiling, we identify rG4-driven mRNAs in HeLa cells and reveal that rG4s in the 5′-UTRs of inefficiently translated mRNAs associate with high ribosome density and the translation of repressive upstream open reading frames (uORF). We demonstrate that depletion of the rG4-unwinding helicases DHX36 and DHX9 promotes translation of rG4-associated uORFs while reducing the translation of coding regions for transcripts that comprise proto-oncogenes, transcription factors and epigenetic regulators. Transcriptome-wide identification of DHX9 binding sites shows that reduced translation is mediated through direct physical interaction between the helicase and its rG4 substrate. Conclusion This study identifies human mRNAs whose translation efficiency is modulated by the DHX36- and DHX9-dependent folding/unfolding of rG4s within their 5′-UTRs. We reveal a previously unknown mechanism for translation regulation in which unresolved rG4s within 5′-UTRs promote 80S ribosome formation on upstream start codons, causing inhibition of translation of the downstream main open reading frames. Our findings suggest that the interaction of helicases with rG4s could be targeted for future therapeutic intervention. Electronic supplementary material The online version of this article (10.1186/s13059-018-1602-2) contains supplementary material, which is available to authorized users.


Ribosome profiling quality control
Analysis of ribosome-protected mRNA fragments (RPFs) yields a quantitative and detailed map of ribosome occupancy that reveals translation with singlenucleotide resolution. 1 Most ribosome footprints fall within known coding sequences where they showed three-nucleotide periodicity reflecting the triplet nature of the genetic code. The periodicity plots reported in Fig. S1 show that all the libraries prepared in this study display a robust triplet periodicity suitable for analysis of translation.

Characterisation of ribosome protected fragments
In this work we focus on ribosome protected fragments that were mapped within the 5'-UTRs of translated mRNAs. In order to assess whether reads aligning to 5'-UTRs are true ribosome footprints, rather than nonribosomal contaminants such as RNA regions that are protected by protein complexes or stable RNA secondary structure, we applied the fragment length organization similarity score (FLOSS) pipeline developed by Ingolia et al. 2 The FLOSS measures the magnitude of disagreement between the distribution of fragments lengths mapping a region of interest to a reference distribution from all annotated nuclear protein-coding transcript. FLOSS was computed from a histogram of read lengths for footprints on an annotated CDS or within a given 5'-UTR. A reference histogram was produced using raw counts on all annotated nuclear protein-coding transcript. The FLOSS was calculated as  (red) display reads that are true ribosome footprints rather than nonribosomal contamination. c) The FLOSS pipeline was found reproducible as shown by the correlations between values obtained from duplicates.

Principal component analysis (PCA) and Statistical Modelling
Features collection. 5'-UTR sequences using annotation from the version 26 of the human transcriptome from Gencode were recovered. These sequences were used to calculate the quantitative parameters used to describe the different mRNA features discussed in this manuscript. A complete and comprehensive list of these features is reported in the Table   S1.  mRNA subsets. We used the PCA to select the subset of transcripts that is characterised by discrete rG4 predicted structures marking their 5'-UTR. This set of transcripts, referred to as 'rG4-containing transcripts', were defined by Dim.1 ≥ 0 and Dim.2 ≤ 0 and consisted of 1,841 transcripts. We also considered all the transcripts expressed in HeLa cells, referred to as 'all transcripts', with fully annotated 5'-UTR (with a length ≥ 10 nt) and 3'-UTR. This sets represents 8,024 transcripts.
Features selection. To identify which features explain the highest amount of variation in ribosome distribution (RPFdist), we used a 10-fold cross validation (CV) scheme to select a subset of features with good predictive power. To this end, each feature (see Table S1) value was centred and scaled, i.e. calculated as z-score. To penalize for model complexity, predictor selection was performed using the LASSO (least absolute shrinkage and selection operator) procedure ('glmnet' method from the R 'caret' package) optimising penalty parameters over the internal cross validation steps. The procedure (using final alpha and lambda parameters of 0.05 and 1 respectively) selected 32 predictors with good predictive power. It is noteworthy that PRTE and TOP-like elements were discarded at this stage. We then assessed the correlations (using a threshold of |correlation| ≤ 0.85) and linear dependencies (using QR decomposition) in between the selected predictors, and found that all selected predictors were independent.
Model selection. We then used the selected list of predictors to build regression models predicting ribosome distribution on both transcript data sets ('all transcripts' and 'rG4containing transcripts'). To this end, each feature was centred and scaled, both data sets were randomly portioned into 4 sets: 70% of the sets were used for training and the remaining 30% were equally portioned providing 3 testing sets. The training sets were used to select models using a gradient boosting approach ('gbm' model from the caret package). Models were optimised by tuning the gradient boosting parameters over a 10fold cross validation scheme. The optimised parameters were the number of iterations, the complexity of the tree, the learning rate and the minimum number of training set samples in a node to commence splitting. To assess the overall performance of the models, we then challenged them against the three test sets (see Additional File 1: Fig. S5).
Models explaining the highest amount of variation on both the training and test sets were selected. This procedure was used to select models predicting the ribosome distribution (RPFdist) over the two sets of transcripts while considering all predictors or only a subset of predictors according to their category. Seven categories of predictors were studied independently and each predictor was assigned to one of the category (see Supplementary The performance of the best models selected for each set of transcripts and each subset of predictors in reported in the Table S1. Model comparison. To characterise the differences between models (generated using different categories of predictors) and quantify the contribution of each predictor category, we compared their resampling distributions. We generate resampling distributions (10 CV repeated 10 times) for each model, using the 'resamples' function of the 'caret' package.
Since models were fit on the same versions of the training data, the differences between the resampling distributions reflect the differences between model performances rather that the correlations that may exist within-resamples. Differences between model performances were assessed using a non-parametric Kolmogorov-Smirnov test and are reported Fig. 2f.

5'-UTR relative length
Ratio between the length of 5'-UTR sequences over the length of the corresponding CDS.

Base frequency
The frequency of A, C, G and U nucleotides in 5'-UTR sequences. The frequency of nucleotide N was defined as N/(A+C+G+U) Length-normalised ΔG 0 rG4 rG4 5'-UTR MFE normalised to 5'-UTR length.
All transcripts: 0.14 ± 0.03 rG4-containing transcripts: 0.32 ± 0.07 rG4 cap structures rG4 MFE of sequences comprising the first 30 nucleotides of 5'-UTR. Gscore Quantitative estimation of G-richness and G-skewness of 5'-UTR sequences. Gscore calculation is based on the G4Hunter algorithm developed by Bedrat et al. 2 Briefly, each position in a sequence is given a score between -4 and 4. To account for Grichness, a single G is given a score of 1, in a GG sequence each G is given a score of 2; in a GGG sequence each G is given a score of 3; and in a sequence of 4 or more Gs each G is given a score of 4. To account for G-skewness, Cs are scored similarly but values are negative. The Gscore is the maximum value obtained while scanning 5'-UTR sequences using a 35 nt window and averaging the score of each nucleotides over the considered window.