Learning the language of post-transcriptional gene regulation

A large-scale RNA in vitro selection study systematically identified RNA recognition elements for 205 RNA-binding proteins belonging to families conserved in most eukaryotes.

Messenger RNAs (mRNAs) are regulated at every stage of their life cycle. All cellular RNA, including mRNA, is packaged into distinct ribonucleoprotein (RNP) complexes to orchestrate RNA maturation and turnover processes summarized as post-transcriptional gene regulation. The most relevant processes involving mRNAs include pre-mRNA splicing, 5' and 3' end modification, editing, transport, translation and degradation. Among the challenges for decoding post-transcriptional gene regulation is the elucidation of the mRNP composition, which changes as mRNAs mature or are translated. This is a prerequisite for understanding the consequences of dysregulation and/or mutation of RNA-binding proteins (RBPs) and/or their target RNA-binding sites in disease.
The human genome encodes 1,500 RBPs, and 600 microRNAs targeting mRNAs [1]. Most RBPs contain at least one RNA-binding domain (RBD), and frequently a combination of several distinct RBDs. At least 800 distinct RBDs are known [2]; among the most frequent in humans are the single-stranded-RNA-binding RRM, KH, zf-CCCH and zf-CCHC domains, and the double-stranded-RNA-binding DSRM domain. Recent proteomic analysis consolidated the number of mRNA-binding proteins (mRBPs) to 700 proteins and revealed at least 20 previously unknown RBDs [1,3].
Following or coinciding with the determination of the composition of mRNPs is the identification of the precise binding site(s) located within the mRNA targets of RBPs and the derivation of the underlying RNA recognition element(s) (RRE(s)). This task is non-trivial considering that RBDs generally recognize short and degenerate sequences of three to eight nucleotides, sometimes involving additional RNA secondary structure. In addition, in vivo binding is modulated by competition with other RBPs for the same or adjacent sites [1]. Since the implementation of high-throughput methods in RNA biology, various protocols for the experimental identification of RBP binding sites have been developed. A recent study by Ray et al. [4] used a single-cycle RNA in vitro selection approach to characterize the binding specificities for 205 recombinant RBPs and, in doing so, has brought us an important step closer to solving the post-transcriptional RBP regulatory code.

Experimental methods for determining RREs
RREs are traditionally determined by sequence comparison and/or conservation analysis from known RNA targets, and validated by biochemical interaction analysis (such as electrophoretic mobility shift assays, filter binding or surface plasmon resonance). For RBPs with unknown RNA-binding sites, in vitro evolutionary methods (primarily SELEX) that identify high-affinity RNA ligands within pools of randomized sequences have been employed with some success [5]. The RRE is then derived by comparing multiple independently sequenced RNA ligands. Alternatively, various crosslinking and immunoprecipitation (CLIP) methods have been introduced that rely on covalent crosslinking of an RBP to its RNA targets in live cells, followed by the isolation of crosslinked RBP-RNA segments (Figure 1) [1,6]. Coupled with deep sequencing of the crosslinked RNAs, CLIP methods allow for the comprehensive determination of in vivo RNA target sites and their underlying RRE. Until recently, knowledge of RREs was rather scant and experimental binding data from SELEX, CLIP and other methods were available for fewer than 10% of the known RBPs in humans [1,6,7].
Figure 1. Overview of in vitro and in vivo methods for RRE determination. (a) SELEX and RNAcompete start with the preparation of a diverse DNA sequence pool, which is in vitro transcribed into RNA. The protein of interest is incubated with the random sequence RNA pool, followed by RBP pulldown and recovery of the bound RNA. In SELEX, high-affinity ligands are enriched by several rounds of reverse transcription, (mutagenic) PCR and selection, before sequencing of the RNA ligands. In RNAcompete, the recovered RNA is directly quantified on a microarray, rather than sequenced, and enrichment for each individual sequence over the initial pool is calculated. Enrichment scores, which directly correlate with the binding affinity of the RNA sequence, are used to derive the RRE, which serves as input for computational prediction of in vivo RNA targets. (b) CLIP-based methods use in vivo crosslinking to covalently link RBPs to their RNA targets by UV light. After cell lysis, limited RNase treatment and immunoprecipitation of the RBP, the crosslinked RNA segments are recovered, converted into cDNA libraries and deep sequenced. CLIP methods directly identify in vivo RNA targets and binding sites, and motif-finding algorithms are used to deduce the RRE from the crosslinked RNA sequences. dsDNA, double-stranded DNA; HITS-CLIP, high-throughput sequencing of RNA isolated by CLIP; iCLIP, individual-nucleotide resolution CLIP; PAR-CLIP, photoactivatable-ribonucleoside-enhanced CLIP; ssDNA, single-stranded DNA; XL-RBP, crosslinked RBP.
To increase throughput and identify the highest-affinity RREs, Ray et al. [4,8] introduced a SELEX-based method termed RNAcompete (Figure 1). In contrast to random sequence pools used in SELEX, which contain up to 10^14 different molecules of 20- to 80-nucleotide random sequence flanked by constant primer binding sites, RNAcompete pools were designed to contain only 240,000 different sequences of 30 to 40 nucleotides in length. These RNA sequences were predicted to be only weakly structured, with each possible 9-mer represented at least 16 times in the RNAcompete sequence pool. To prepare this RNA sequence pool, oligodeoxynucleotides printed on a microarray were amplified, transcribed into RNA, and subsequently incubated with a recombinantly expressed, affinity-tagged RBP of interest. The RNA pool was present in 75-fold molar excess over protein to ensure efficient competition between the various sequences during binding, so that at equilibrium the proportion of each sequence bound to the RBP reflected its affinity. The protein was then recovered and the enrichment of bound RNAs over the initial pool RNA was quantified on microarrays. In contrast to SELEX, the bound RNA was directly analyzed after the first competitive binding reaction without further cycles of amplification and mutagenesis. The RRE for the protein was inferred by combining the calculated Z- and E-values for each possible 7-mer.
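The step from per-sequence enrichment to per-7-mer scores can be illustrated with a minimal Python sketch. It is not the published RNAcompete pipeline: it assumes that a 7-mer is summarized by the median log2 enrichment of all pool sequences containing it, and that the Z-score standardizes this summary across 7-mers. The E-value is not computed. The function names, the pseudocount and the toy intensities are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, median, pstdev


def log_enrichment(bound, pool, pseudocount=1.0):
    """log2 ratio of bound over input-pool signal (pseudocount is an assumption)."""
    return math.log2((bound + pseudocount) / (pool + pseudocount))


def kmer_z_scores(probe_scores, k=7):
    """Map each k-mer to a Z-score of its median probe enrichment."""
    per_kmer = defaultdict(list)
    for seq, score in probe_scores.items():
        # Count each k-mer once per probe, even if it occurs repeatedly.
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            per_kmer[kmer].append(score)
    summary = {kmer: median(vals) for kmer, vals in per_kmer.items()}
    mu = mean(summary.values())
    sigma = pstdev(summary.values())
    if sigma == 0:
        return {kmer: 0.0 for kmer in summary}
    return {kmer: (val - mu) / sigma for kmer, val in summary.items()}


# Toy data: (bound signal, pool signal) per probe; values are invented.
raw = {
    "ACGUGUAAAUAC": (900, 100),
    "CCUGUAAAUAGG": (800, 100),
    "ACGCGCGCGCAC": (100, 100),
    "GGGACCCAUCGG": (90, 100),
}
scores = {seq: log_enrichment(b, p) for seq, (b, p) in raw.items()}
z = kmer_z_scores(scores)
```

In this toy pool, 7-mers found only in the enriched probes (such as UGUAAAU) receive positive Z-scores, while 7-mers from unenriched probes fall below zero. In a real pool, each 7-mer is covered by many probes, which is what makes the median a robust summary.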

Evolutionary insights and global patterns in protein-RNA sequence recognition
In their recent study, Ray et al. [4] applied RNAcompete to determine RREs for a collection of 205 different RBPs distributed across 24 species and representing approximately 60 conserved families of RBPs. The parallel processing of samples using a single method facilitated comparison of the RREs and specificities of various RBPs. Most RBPs were expressed in truncated forms comprising all constituent RBD(s) with 30 to 50 flanking amino acid residues to enhance solubility. The selected RBPs contained at least one of nine well-characterized RBDs (RRM, KH, S1, YTH, Pumilio repeats (PUF), zf-CCCH, zf-CCHC, zf-RanBP and SAM), and the majority of RBPs contained multiple RBDs. Approximately 90% of the RBPs tested recognized sequence motifs five to seven nucleotides long and did not require structured RNA for binding, which is expected given that predominantly single-strand-specific RBDs were included in this study. For 52 proteins, RNAcompete RREs were compared with RREs previously determined by CLIP or other methods. Of these, 35 were highly similar, 6 matched partially and 11 were dissimilar to the RNAcompete RRE. For example, for PUM1/2 or ELAVL1/HuR the RREs agreed perfectly, while for proteins such as FMR1 only one of two established RREs was identified. The discrepancies may mirror technical differences between the methods or differences between in vivo and in vitro specificities of RBPs. Enrichment of an RRE by RNAcompete depends on affinity. For multi-RBD proteins, the affinities of individual RBDs for RNA can vary by orders of magnitude, so contributions of weaker-binding RBDs, which can be detected in in vivo data, may be overlooked. In addition, in vivo, the highest-affinity sites may not always be accessible due to competition with other RBPs, the cell-type- and subcellular-compartment-dependent concentration of RBP and RNA targets, modulation of RNA affinities by protein cofactors, and the secondary structure of RNA.
Of importance was the validation of the intuitive notion that RBPs with high sequence identity bind to similar RREs. It was found that RBPs with 70% sequence identity have close to identical RREs, and RBPs with 50% identity share related binding specificities. Based on this notion, the authors predicted RREs for a total of 8,056 RBPs in humans and other metazoans, as well as in plants and protists. Specifically, this number amounted to 159 RBPs in humans belonging to 62 protein families, of which approximately 90% were putative or experimentally validated mRNA-binding proteins (mRBPs). Estimating that 700 of the 1,500 RBPs are mRNA-binding, this study elucidated RRE motifs for 20% of all human mRBPs, and 53% of proteins containing canonical single-strand RBDs (Figure 2). The results are available as a public database and represent a valuable resource for researchers interested in prediction of RBP binding sites.
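The coverage percentages can be checked against the counts given in the Figure 2 legend (n = 146, 127 and 444). The short Python calculation below assumes that "proteins containing canonical single-strand RBDs" corresponds to the studied mRBPs plus those that carry one of the selected RBDs but remain unstudied. This reading is an inference from the legend, and the variable names are illustrative.

```python
# Counts of human mRBPs taken from the Figure 2 legend.
studied = 146             # RREs available from RNAcompete
unstudied_same_rbd = 127  # carry a selected RBD, not yet studied
other_rbd = 444           # carry RBDs outside the RNAcompete selection

total_mrbps = studied + unstudied_same_rbd + other_rbd
canonical = studied + unstudied_same_rbd  # assumed denominator for the 53% figure

share_of_all = 100 * studied / total_mrbps
share_of_canonical = 100 * studied / canonical

print(f"total mRBPs: {total_mrbps}")
print(f"studied, of all mRBPs: {share_of_all:.1f}%")
print(f"studied, of mRBPs with selected RBDs: {share_of_canonical:.1f}%")
```

The three counts sum to 717, in line with the estimate of roughly 700 mRBPs, and the two ratios round to the 20% and 53% quoted in the text.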

Conservation of motifs and functional implications
RNAcompete-derived RREs demonstrated predictive power for anticipating regulatory functions of RBPs [4]. Evolutionary conservation analysis showed that sequence elements containing these RREs were frequently under positive selection pressure in 5' UTRs, coding regions, 3' UTRs and intronic regions flanking alternative exons. The location of conserved RREs correlated well with previously elucidated RBP binding patterns, with a few surprising twists; for example, conserved RREs for several splicing factors were unexpectedly frequent in the 3' UTR of mRNAs. RNA sequencing experiments from diverse cell lines and tissues with different RBP expression levels allowed correlation of RBP levels with predicted target RNA levels or splicing patterns. This analysis confirmed known RBP functions in some cases (ELAVL1/HuR, RBM4), but also hinted at unanticipated roles for others (PUM1/2, RBFOX1). A study of RNA knockdown data confirmed that RBFOX1, a splicing regulator, also had a positive effect on RNA stability of putative targets with predicted RBFOX1 sites in the 3' UTR, consistent with previous reports that some RBPs may have multiple functions in post-transcriptional gene regulation.
Some of the regulatory effects predicted by the evolutionary conservation analysis of RNAcompete RREs, however, are difficult to reconcile with other available data, such as the implied negative effect of the FMR1 protein on target mRNA levels. An effect of FMR1 on RNA abundance was explicitly ruled out in two recent studies, although FMR1 was shown at the same time to negatively regulate protein abundance of targeted mRNAs [9,10]. As discussed above, these discrepancies may reflect differences between in vivo and in vitro preferences of multi-RBD proteins, including FMR1. Analysis of CLIP-derived motifs showed that the FMR1 RG-rich region bound WGGA with higher affinity than its KH domains bound ACUK [9]. The RNAcompete motifs GACAAG and ANGGAC more likely reflected contributions of the RG-rich region to binding. The implicit assumption that the highest-affinity RRE also reflects the optimal in vivo RRE may prove inaccurate in some cases, because of varying accessibility of a motif.
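Degenerate motifs such as WGGA and ACUK are written in IUPAC nucleotide codes (W = A or U, K = G or U, N = any base). As a minimal illustration of how such an RRE can be used to list candidate sites in a transcript, the Python sketch below converts an IUPAC motif into a regular expression and reports all matches, including overlapping ones. The function names and the test sequence are hypothetical. A plain pattern match ignores affinity, RNA structure and competition, so the hits are candidates only.

```python
import re

# IUPAC nucleotide codes expressed as RNA character classes.
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]", "Y": "[CU]", "S": "[CG]", "W": "[AU]",
    "K": "[GU]", "M": "[AC]", "B": "[CGU]", "D": "[AGU]",
    "H": "[ACU]", "V": "[ACG]", "N": "[ACGU]",
}


def motif_to_regex(motif):
    """Compile an IUPAC motif; the lookahead lets overlapping sites be found."""
    body = "".join(IUPAC_RNA[base] for base in motif.upper())
    return re.compile("(?=(" + body + "))")


def find_sites(rna, motif):
    """Return (0-based start, matched sequence) for every motif occurrence."""
    rna = rna.upper().replace("T", "U")  # accept DNA-style input
    return [(m.start(), m.group(1)) for m in motif_to_regex(motif).finditer(rna)]


sequence = "GGACUGAUGGAACUU"  # invented example sequence
wgga_sites = find_sites(sequence, "WGGA")
acuk_sites = find_sites(sequence, "ACUK")
print(wgga_sites)  # [(7, 'UGGA')]
print(acuk_sites)  # [(2, 'ACUG'), (11, 'ACUU')]
```

The same function accepts the RNAcompete motifs quoted above, for example `find_sites(sequence, "ANGGAC")`.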

From RRE identification to elucidation of post-transcriptional gene regulatory networks
The systematic analysis and identification of RREs, together with in vivo RNA targets of regulatory proteins, will remain one of the main focuses in post-transcriptional gene regulation research. Ray et al. [4] have compiled the largest catalog of experimentally derived RREs to date, and this resource may be used to understand evolutionary relationships between RBPs. It also allows researchers to find putative binding sites for RBPs of interest, and gives computational biologists the opportunity to integrate RREs as predictors into statistical learning methods that model the transcriptional and post-transcriptional control of gene expression in concert with microRNA binding sites, transcription factor recognition elements and epigenetic marks.
To capture the physiological role of RBPs, we still need to dissect the target gene network for each RBP individually in various cellular contexts, and then integrate the knowledge into computational approaches that are able to recapitulate quantitatively the regulatory effects of RBPs. This includes understanding protein and target RNA levels in different cell types and tissues, gaining insights into RRE occupancy and competition among RBPs, and accounting for redundancies in protein families or regulatory pathways. SELEX- or CLIP-based methods add to the growing compendium of RREs and contribute to the goal of characterizing post-transcriptional regulation in a comprehensive manner.
Figure 2. [...] [1]), 20% were studied by RNAcompete (red, n = 146). Another 18% of mRBPs contain one or more of the selected RBDs but remain to be studied (dark gray, n = 127). Sixty-two percent of mRBPs contained RBDs (for example, DSRM, La, PWI, and so on) not studied by RNAcompete (light gray, n = 444).