Principles of RNA processing from analysis of enhanced CLIP maps for 150 RNA binding proteins

Background A critical step in uncovering rules of RNA processing is to study the in vivo regulatory networks of RNA binding proteins (RBPs). Crosslinking and immunoprecipitation (CLIP) methods enable mapping RBP targets transcriptome-wide, but methodological differences present challenges to large-scale analysis across datasets. The development of enhanced CLIP (eCLIP) enabled the mapping of targets for 150 RBPs in K562 and HepG2, creating a unique resource of RBP interactomes profiled with a standardized methodology in the same cell types. Results Our analysis of 223 eCLIP datasets reveals a range of binding modalities, including highly resolved positioning around splicing signals and mRNA untranslated regions that associate with distinct RBP functions. Quantification of enrichment for repetitive and abundant multicopy elements reveals 70% of RBPs have enrichment for non-mRNA element classes, enables identification of novel ribosomal RNA processing factors and sites, and suggests that association with retrotransposable elements reflects multiple RBP mechanisms of action. Analysis of spliceosomal RBPs indicates that eCLIP resolves AQR association after intronic lariat formation, enabling identification of branch points with single-nucleotide resolution, and provides genome-wide validation for a branch point-based scanning model for 3′ splice site recognition. Finally, we show that eCLIP peak co-occurrences across RBPs enable the discovery of novel co-interacting RBPs. Conclusions This work reveals novel insights into RNA biology by integrated analysis of eCLIP profiling of 150 RBPs with distinct functions. Further, our quantification of both mRNA and other element association will enable further research to identify novel roles of RBPs in regulating RNA processing.


Background
RNA can act as a carrier of information from the nucleus to the cytoplasm in the processing of proteincoding genes, as a regulatory molecule that can control gene expression, and even as an extracellular signal to coordinate trans-generational inheritance [1][2][3]. RNA binding proteins (RBPs) interact with RNA through a wide variety of primary sequence motifs and RNA structural elements to control all processing steps [3]. Furthermore, with the increase in the number of RBPs that are becoming associated with human diseases, identifying their RNA targets and how they are regulated has become an unmet, urgent need.
To identify direct RNA targets of RBPs, RNA immunoprecipitation (RIP) and crosslinking and immunoprecipitation (CLIP) methods are frequently used. CLIPbased methods utilize UV crosslinking to covalently link an RBP with its bound RNA in live cells, enabling both stringent immunoprecipitation washes and denaturing SDS-PAGE protein gel electrophoresis and nitrocellulose membrane transfer which serves to remove background unbound RNA [4]. Analyses of single RBP binding profiles by CLIP have provided unique insights into basic mechanisms of RNA processing, as well as identified downstream effectors that drive human diseases [5][6][7]. Further efforts to profile multiple human RBPs in the same family or regulatory function by CLIP illustrated coordinated and complex auto-and cross-regulatory interactions among RBPs and their targets [8][9][10]. Rising interest in organizing public deeply sequenced CLIP datasets to enable the community to extract novel RNA biology is apparent from newly available computational databases and integrative methods [11,12]. However, methodological differences between CLIP approaches, combined with simple experimental variability between labs and variation in acceptable quality control metrics, add significant challenges to interpretation of differences observed.
The field of transcription regulation observed similar challenges and opportunities in integrating transcription factor target profiles [13]. To address this challenge, the ENCODE consortium piloted large-scale profiling of transcription factor targets using a single standardized chromatin immunoprecipitation (ChIP-seq) protocol [14]. The initial effort to profile 119 factors generated a unified dataset for creating and assaying robust quality assessment standards [15], and led to insights into modeling transcription factor complexes, binding modalities, and regulatory networks [16]. More critically, however, this has served as an invaluable resource for researchers to annotate potential functional variants [17] and generate hypotheses across a variety of fields of interest. This success suggested that a similar effort to profile RBP targets using a standardized methodology could similarly drive significant insights in RNA biology.
To this end, we introduced the enhanced CLIP (eCLIP) methodology featuring a size-matched input control [18] and characterized hundreds of immunoprecipitation-grade antibodies with a standardized workflow [19] to generate 223 eCLIP datasets profiling targets for 150 RBPs in K562 and HepG2 cell lines [20]. Along with orthogonal data types, this study provided insights into localized RNA processing, studied the interplay between in vitro binding motifs and RBP association (and factor-responsive targets) in live cells, and identified novel effectors of RNA stability and alternative splicing [20].
In this companion work, we provide further insight into how integrative analysis of RBP target profiles by eCLIP can reveal both general principles of RNA processing as well as specific mechanistic insights for individual RBPs. Although most CLIP analysis typically focuses on binding to mRNAs (both intronic and exonic), we find that for 70% of RBPs, the dominant enrichment signature is instead a variety of multicopy and non-coding elements (including structural RNAs such as ribosomal RNAs and spliceosomal snRNAs, retrotransposable and other repeat elements, and mitochondrial RNAs). These analyses can be then used to generate hypotheses about RBP function, as enrichment for the ribosomal RNA precursor corresponds with RBPs regulating ribosomal RNA maturation whereas enrichment for retrotransposable elements corresponds to both regulation of retrotransposition itself as well as suppression of improper RNA processing due to cryptic elements contained within these elements. Binding maps across meta-profiles of mRNAs and exon-intron junctions similarly show that RBP binding patterns correlate with RBP functional roles, and analysis of spliceosomal components indicates that eCLIP can be used to identify branch points and provides evidence for a 3′ splice site scanning model. In summary, these results provide further validation of the power of integrated analyses of RBP target maps generated by eCLIP in identifying novel principles of RNA biology, as well as generating RBPspecific hypothesis for further functional validation.

Results
Large-scale profiling of RNA binding protein binding sites with eCLIP The eCLIP methodology enabled highly efficient identification of RBP binding sites [18], leading to the generation of the first large-scale database of RNA binding protein targets profiled in the same cell types using a standardized workflow [20]. This dataset contains 223 eCLIP profiles of RNA binding sites for 150 RNA binding proteins (120 in K562 and 103 in HepG2 cells), covering a wide range of RBP functions, subcellular localizations, and predicted RNA binding domains ( Fig. 1a; Additional files 1 and 2) [20]. Each experiment contains biological duplicate immunoprecipitation libraries along with a paired size-matched input from one of the two experimental biosamples (Fig. 1b). For each experiment, raw sequencing data, processed data (including read mapping and identified binding sites), and experimental meta-data (including antibody and immunoprecipitation validation documentation, biosample information, and additional related ENCODE datasets) were deposited at the ENCODE data coordination center (https://www. encodeproject.org) [20].
Many CLIP methods included radioactive labeling of the 5′ end of RNA fragments with 32 P to visualize protein-RNA complexes after SDS-PAGE electrophoresis and membrane transfer in order to query whether RNA bound to co-purified RBPs of different size is present [4]. However, the eCLIP protocol we utilized above did not include this direct visualization of proteinassociated RNA due to the complexity of incorporating radioactive labeling at this scale, preferring validation of eCLIP signal with orthogonal approaches (such as comparison with in vitro-derived motifs or overlap with knockdown/RNA-seq changes). To address this question for future large-scale eCLIP profiling, we pursued alternative labeling approaches. We found that ligation of biotinylated cytidine (instead of the normal RNA adapter) enabled visualization similar to that observed with 32 P while using commercially available chemiluminescent detection reagents for biotin-labeled nucleic acids (Additional file 3: Fig. S1a-c) [21]. We note that unlike 32 P labeling (which is done as a 5′ phosphorylation reaction with T4 Polynucleotide Kinase), this labeling uses the standard eCLIP RNA adapter ligation reaction and thus may more accurately reflect true protein-coupled RNA positioning.
Surprisingly, when expanding this approach across RBPs, we observed detectable transfer of RNA from non-crosslinked cells to nitrocellulose membranes in a supplier-dependent manner (Additional file 3: Fig. S1df). We had previously noted that certain sourced nitrocellulose membranes contained greater amounts of RNA, which would then be recovered during library preparation (particularly in input libraries, which lack adapter addition prior to membrane transfer) [22]. However, we now observed that the recommended (lower contaminant, membrane I) membrane from that effort showed increased transfer of RNA than our previous supplier (membrane G) (Additional file 3: Fig. S1d-f). Although the signal observed in crosslinked samples was typically significantly higher (median 12.5-fold across 17 RBPs tested), with 88% (15 out of 17) RBPs greater than 5-fold (Additional file 3: Fig. S1d), for 2 out of 17, we observed within 5-fold RNA transfer in non-crosslinked samples (Additional file 3: Fig. S1d,f).
To directly query whether this led to artifactual eCLIP peak identification, we chose seven eCLIP experiments performed with membrane I and performed replicate experiments with membrane G. Using MATR3 as an example, we observed that peak fold-enrichment compared Fig. 1 Two hundred twenty-three eCLIP datasets profile targets for 150 RNA binding proteins. a Colors indicate RBPs profiled by eCLIP, with manually annotated RBP functions, subcellular localization patterns from immunofluorescence imaging, and predicted RNA binding domains indicated (Additional file 1). b Schematic overview of eCLIP as performed in the datasets described here. Two biological replicates (defined as biosamples from separate cell thaws and crosslinked more than a week apart) were performed for each RBP, along with one size-matched input taken from one of the two biosamples prior to immunoprecipitation across membranes was similar to that observed for within-membrane replicates (Additional file 3: Fig. S1g). Extending this to all seven RBPs, only one (FXR2) out of seven showed notably lower replication of peak significance using membrane G (Additional file 3: Fig. S1h), and even in that case, we observed high overall correlation in peak fold-enrichment (Additional file 3: Fig. S1i). Conservation of signal was not limited to peak calls, as we observed similar enrichments for retrotransposable and other RNA elements as well (Additional file 3: Fig. S1j). Thus, although our data indicates that whether RNA that is not crosslinked to protein will transfer to nitrocellulose membranes is supplier-and product-dependent, but that it does not generally appear to add significant background to the eCLIP profiles studied here.
Recovering RNA binding protein association to retrotransposons and other multicopy RNAs Standard peak analysis revealed a wide variety of binding modes to mRNAs, with RBPs enriched for coding sequences, 3′ and 5′ untranslated regions, proximal and distal intronic regions, and non-coding RNAs (Additional file 3: Fig. S2a) [20]. Notably, we observed that RNA binding protein mRNAs were 1.4-fold enriched (p = 2.1 × 10 −22 by one-sample t test) among all peakcontaining genes (median 13.5% per dataset, relative to 9.4% of all genes with at least one peak). In particular, well-studied splicing regulators (e.g., SRSF7 and TRA2A) were more than 3-fold enriched for binding to RBPs (Additional file 3: Fig. S2b-c). In contrast, transcription factors were unchanged (1.0-fold depleted), suggesting that RNA processing regulators are particularly likely to themselves be the target of RNA processing regulation. In total, RBPs profiled in this study bound a median of 107 RBPs and 34 transcription factors, confirming the presence of a highly complex regulatory network of RNA and DNA processing (Additional file 3: Fig. S2c).
In addition to single-copy RNA transcripts, the human genome contains many high-copy regions that are expressed as functional RNAs but present a substantial challenge to standard short read mapping strategies. These include RNAs such as the large and small ribosomal RNA (rRNA), 7SK snRNA, and others that have one or few expressed primary transcripts but dozens to hundreds of pseudogenes throughout the genome, as well as retrotransposable elements including LINE and Alu elements with thousands of moderately divergent sense and antisense copies throughout transcribed genes [23]. We found that simply including non-uniquely mapped reads in standard analysis created thousands of peaks in introns, in intergenic regions, and at pseudogenes that typically lacked standard peak shapes (likely reflecting sequencing errors relative to the main expressed transcript), indicating the need for improved methods to properly quantify RBP binding to such loci.
In order to include these RNA types in eCLIP analysis, we developed a "family-aware mapping" approach in which adapter-trimmed reads are first mapped against a database of sequences for primary transcripts and pseudogenes for 82 families (Fig. 2a) (Additional file 4). Reads mapping to reference transcripts contained within a family (e.g., LINE, YRNA, or 18S rRNA) are used for quantitation, but reads that map to multiple families are masked (discarding an average of 1.1% of reads). These results are then integrated with standard unique genomic mapping in order to incorporate reads that uniquely map to regions annotated as repetitive elements by RepeatMasker [24] into the final family quantitation (Fig. 2a). Confirming the success of this approach, we observed that in eCLIP replicates of YRNAassociating factor TROVE2/RO60 in K562, only 3.7 and 6.8% (replicate 1 and 2, respectively) of usable reads uniquely mapped to YRNA transcripts with standard processing (2.9 and 5.1% to RNY1/2/4/5, with another 0.7% and 1.8% to YRNA pseudogenes) (Fig. 2b). In contrast, for these same datasets, 14.2% and 21.7% of reads mapped uniquely to the YRNA family using the familyaware mapping approach, making use of hundreds of thousands of additional reads that did not uniquely map to individual transcripts (Fig. 2b). Performing this analysis for all RBPs, we observed a wide range of read recovery and enrichment for particular elements (Fig. 2c, Additional file 5). For some RBPs such as RPS11 (K562), an average of 95.2% of reads were only recovered using family mapping (68.1% mapping to RNA18S with an additional 24.1% to RNA28S). In contrast, only 10.4% of reads in KHSRP (K562) eCLIP mapped to multicopy family elements, with 58.9% uniquely mapping to the genome (including 41.1% uniquely mapping to introns outside of RepeatMasker elements) (Fig. 2c).
At the element level, our family-aware mapping strategy recovers many known processing or interacting factors, including RBPs enriched for the mature 18S (RPS3, RPS11) and 28S rRNA (DDX21, NOL12) as well as the 45S rRNA precursor (UTP18, WDR43), tRNAs (NSUN2), RN7SK (LARP7), YRNA (TROVE2), and others (Fig. 2d). To validate this approach, we considered 17 RNA elements with well-studied direct links to either RBP function (such as snoRNA binding with rRNA processing and snRNA binding with snRNA processing and the spliceosome) or specific RBP regulators (e.g., snRNA RN7SK with LARP7 [25] and YRNAs with TROVE2/Ro60 [26]) (Additional file 3: Fig. S2d). We observed that 140 eCLIP datasets had one of these 17 elements as the most highly enriched (by relative information, which we observed to better enable comparison across elements versus fold-enrichment), and in Fig. 2 Quantification of repetitive elements and other non-uniquely mapped reads. a Graphical representation of repetitive element mapping. Reads are mapped to human genome (requiring unique mapping) and a database of repetitive element families. Reads are then associated with RNA element families based on mismatch score, with (red) reads discarded if mapping equally well to more than one family. b Stacked bars indicate the number of reads from TROVE2 eCLIP in K562 that map either uniquely to one of four primary Y RNA transcripts, map uniquely to Y RNA pseudogenes (identified by RepeatMasker), or (for family-aware mapping) map to multiple Y RNA transcripts but not uniquely to the genome or to other repetitive element families. c Stacked bars indicate the fraction of reads (averaged between replicates) of all 223 eCLIP experiments, separated by whether they map (red) uniquely to the genome, (purple) uniquely to the genome but within a repetitive element identified by RepeatMasker, or (gray) to repetitive element families. Datasets are sorted by the fraction of unique genomic reads. d Heatmap indicates the relative information for 26 elements and 168 eCLIP datasets, requiring elements and datasets to have at least one entry meeting a 0.2 relative information cutoff (based on Additional file 3: Fig. S2d). See Table 1 for RBP:element enrichments meeting this criteria and Additional file 5 for all enrichments 84 (60%) of these cases, the RBP was previously characterized as having the element-paired RBP function, indicating that this approach is highly successful at recovering targets that reflect annotated functions of profiled RBPs. To set a cutoff for analysis, we found that an information cutoff of 0.2 maximized predictive accuracy, at which 70% (74 out of 105 RBPs with the most enriched RNA element meeting this cutoff) had annotated functions matching the known role for this element (Additional file 3: Fig. S2e). Using this cutoff, 235 RBP-element pairings were identified with large numbers of RBPs associated with mRNA regions (42 with CDS, 24 with 3′UTR, 40 with distal intronic, and 23 with proximal intronic regions) and rRNA (24 with RNA28S and 15 with RNA18s, as well as 12 with precursor 45S rRNA), and smaller numbers associated with other specific RNA classes (Fig. 2d, Table 1).

Characterization of ribosomal RNA interactors and processing factors
Ribosomal RNA (rRNA) is the most abundant RNA found in eukaryotic cells and plays essential roles in defining the structure and activity of the ribosome. In humans, the 5S rRNA is separately transcribed, whereas the 18S, 28S, and 5.8S rRNAs are transcribed as one 45S precursor transcript that then undergoes a complex series of cleavage and RNA modification steps to process the mature rRNAs, which then form complex structures that scaffold the assembly of~80 proteins to create the functional ribosome [27]. Unbiased approaches have characterized over 250 additional factors as playing critical roles in processing pre-rRNA, indicating that rRNA processing and function represent a major function of RBPs in humans [28].
Considering the 150 RBPs profiled, we observed that different subsets of RBPs showed enrichment to specific rRNAs ( Fig. 3a), suggesting that the incorporation of normalization against paired input was successful in removing general background at abundant transcripts. Although we are unable to distinguish between mapping to mature 18S, 28S, and 5.8S transcripts versus those regions in the precursor, the~10-fold lower read density we observe for 45S (median 281 reads per million (RPM)) versus 18S (2715 RPM) or 28S (1983 RPM) in eCLIP input samples (Additional file 3: Fig. S3a-c) suggests that the majority of 18S and 28S reads reflect mature rRNA transcripts. Considering 30 RBPs previously shown to effect pre-rRNA processing [28], we found that 16 had enrichment for one of the three (18S, 28S, or 45S) rRNAs (42.1% of RBPs meeting a 0.101 positionwise information cutoff) relative to 12.5% of others (3.4fold enriched, p = 0.00025 by Fisher's exact test) (Additional file 3: Fig. S3d). Despite high and relatively even read density overall on the abundant rRNA transcripts (Additional file 3: Fig. S3a-c), we observed that these rRNA-enriched RBPs showed a number of specific enrichment patterns: two on the 45S precursor (one situated around the 01 and A0 early processing sites, and a second located~2000 nt further downstream that is discussed below), a cluster at position~4200 of the 28S, and a cluster at~1150 of the 18S, along with other profiles unique to individual RBPs (Fig. 3a). Distinct ribosomal components RPS3 and RPS11 had different positional enrichments, as expected given their different positioning within the 18S ribosome (Additional file 3: Fig. S3e).
Our data on rRNA precursor position-specific enrichment confirms and provides further resolution to proteins previously characterized to play roles in ribosomal RNA processing. Some factors had specific positioning, including DDX51 which had specific enrichment at the 3′ end of 28S as well as the 3′-ETS precursor region, consistent with previous characterization of the role of DDX51 in 3′ end maturation of 28S [29], and UTP18 which had specific enrichment at the 5′ end, matching its roles in early cleavages at the 01, A0, and 1 sites suggested from large-scale screening data [28] (Fig. 3b, c, Additional file 3: Fig. S3f-g). Others, such as WDR3, had broader enrichment patterns that suggest participation in multiple maturation steps (Fig. 3d, Additional file 3: Fig. S3h).
Surprisingly, we observe a cluster of RBP association in the 45S precursor around position 2100, a region located between the A0 and 1 processing sites which lacks a well-defined processing role (Fig. 3a) [27]. Two of these factors have previous links to nucleolar activity, as ILF3 (also known as NF90) was previously shown to associate with pre-60S ribosomal particles in the nucleolus and knockdown of ILF3 gives defects in rRNA biogenesis [28,30], and LIN28B has been shown to repress let-7 processing by sequestering pri-let-7 in the nucleolus [31]. In this region, multiple sites of ILF3 and SSB enrichment flank a more specific region enriched in LIN28B eCLIP (Fig. 3e, Additional file 3: Fig. S3i) which has previously been described to contain a potential rRNA-encoded microRNA, rmiR-663a [32]. As rmiR-663a shares similar sequence to genomic-encoded miR-663a on chromosome 20 (and would have the same mature miRNA sequence), it has been challenging to isolate expression of the ribosomal-encoded transcript in isolation [33], and indeed, the majority of LIN28B eCLIP reads mapping to pri-miRNA map equally to both variants (Sup Fig. 3j). However, when we used sequence variants in the pri-miR sequence as well as the more variable flanking sequence to estimate their separate expression (Fig. 3f), we observed that reads unique to the rmiR outnumbered those unique to genomic homologs by more than 400-fold ( Fig. 3g and Additional file 3: Fig.  S3j-k), indicating that the observed signal is likely derived from 45S rather than other genomic homologs.
Finally, we considered binding to snoRNAs, a class of highly structured small RNAs that play essential roles in guiding modification of ribosomal RNAs. We found that enrichment for C/D-box snoRNAs, which canonically guide methylation of RNA, was highly correlated to enrichment for the 45S precursor (R 2 = 0.67, p = 1.6 × 10 −54 ) (Fig. 3h), providing further confirmation that these 45S-enriched RBPs are likely playing key roles in rRNA processing. Surprisingly, however, we observed that enrichment for H/ACA-box snoRNAs showed far lower correlation with enrichment for either C/D-box snoRNAs (R 2 = 0.42) or the 45S precursor (R 2 = 0.17) Uniquely mapped to genome (intronic) Spliceosomal small nuclear RNAs Ribosomal RNA function and processing Other unique regulatory RNA classes  Fig. S3l). Thus, this data confirms the ability of eCLIP with input normalization to specifically isolate enrichment between abundant snoRNA classes, and suggests that (at least for the RBPs profiled to date here) we see stronger overlap between rRNA precursor and C/D-box versus H/ACA-box snoRNAs.
Repetitive elements define a significant fraction of the RBP target landscape Repetitive elements constitute a large fraction of the non-coding genome [34], and elements annotated by RepBase constitute an average of 12.2% of reads observed in eCLIP input experiments (Additional file 3: Fig. S4a). In particular, as retrotransposable L1/LINE and Alu elements constitute 10.8% and 0.4% of intronic sequences, respectively (Additional file 3: Fig. S4b), they represent a significant fraction of the pool of nuclear transcribed pre-mRNAs available for RBP interactions. Although some RBPs have been shown to play roles in regulation of active retrotransposition [35], the majority of intronic elements have accumulated mutations or deletions and are no longer capable of active retrotransposition, leaving the question of their function relatively poorly understood. However, recent analyses of RBP targets identified by CLIP (including early releases of the eCLIP data considered here) have shown that both antisense Alu and antisense LINE elements contain cryptic splice sites that can lead to improper splicing and polyadenylation, suggesting that a major yet unappreciated role for many RBPs may be to suppress the emergence of inappropriate cryptic RNA processing sites introduced upon retrotransposition [36,37]. Querying for RBPs with enriched eCLIP signal at retrotransposable and other repetitive elements, we surprisingly observed that only a small subset of elements (notably including L1 and Alu elements both in sense and antisense orientation) showed high RBP specificity, whereas most elements showed extremely highly correlated enrichments across RBPs (Fig. 4a, Additional file 3: Fig. S4c). This group of elements showed enrichment in a small subset of eCLIP experiments, notably including multiple members of the highly abundant HNRNP family (HNRNPA1, HNRNPU, HNRNPC, and HNNRPL), indicating that they may be coordinately regulated to prevent inappropriate RNA processing.
Analysis of Alu elements recapitulated a previously described interaction of HNRNPC with antisense Alu elements [36], but additionally revealed two RBPs with more than 5-fold enrichment: ILF3 (enriched for both sense and antisense Alu elements) and RNA Polymerase II component POLR2G (antisense) (Fig. 4b, Additional file 3: Fig. S4d). Both of these factors have previous links to RNA processing through Alu elements, as ILF3 association was suggested to repress RNA editing in Alu elements [39] and Alu elements have been shown to effect RNA Polymerase II elongation rates [40]. In total, 19 datasets showed more than 2-fold enrichment for either Alu or antisense Alu elements (Fig. 4b).
Considering L1/LINE elements, we observed enrichment with far more RBPs, with 26 datasets showing 5fold enrichment (Fig. 4c). Interestingly, we observed generally distinct sets for sense versus antisense L1 enrichment, with only HNRNPC (in K562, but not HepG2) and ZC3H8 showing enrichment for both (Fig. 4c, Additional file 3: Fig. S4e). The RBPs identified here align well with those identified in an independent analysis of L1-associated RBPs which used a subset of these datasets along with independent iCLIP and other datasets, confirming robustness of this analysis across different approaches to quantify enrichment to L1 elements [37]. To query the role of L1 association, we first considered whether binding could specifically act to repress L1 retrotransposition itself. Of the 15 RBPs with more than 5fold enrichment at sense L1 elements, SAFB (p = 0.002), PPIL4 (0.06), and TRA2A (p = 0.05) were all identified as candidate suppressors of L1 retrotransposition in a recent genome-wide CRISPR screening assay [38], suggesting that this eCLIP enrichment approach identifies functional regulators of retrotransposition (Fig. 4d).
However, we observed that while enriched signal was centered at L1 sense and antisense elements, the signal often extended for multiple kilobases on either side (Additional file 3: Fig. S4f), indicating that despite the overlap with functional regulators of active lines, the majority of eCLIP signal is likely coming from inactive L1 elements contained within pre-mRNAs rather than independently transcribed active L1 elements in the cell lines studied here. Thus, we next assayed whether these RBPs showed evidence for silencing cryptic RNA processing sites created upon retrotransposition, as previously described [36,37]. To do this, we hypothesized that knockdown of such RBPs would lead to inclusion of premature stop codons that signal nonsense-mediated decay, ultimately decreasing abundance of target mRNA transcripts. For MATR3, we indeed observed that genes containing one or more antisense L1 elements overlapped by peaks showed significantly decreased expression upon RBP knockdown (Fig. 4e), consistent with recent findings that MATR3 binding blocks both cryptic poly(A)-sites and splice sites within LINEs [37]. Interestingly, we observed a similar pattern for 3 other RBPs with antisense L1 enrichment, HNRNPM (which has been identified in complexes with MATR3 [41]), SUGP2, and EXOSC5 (Fig. 4e). These four RBPs also showed particular enrichment for reference L1 sequences as opposed to unique genomic mapping to more degenerate elements, suggesting that this specifically segregates

Meta-gene binding profiles reveal RBP functions
Next, we turned to the question of whether eCLIP peak distributions could reveal RBP roles in mRNA processing. To better separate RBP association patterns, we considered the distribution peaks across a meta-gene generated by size-normalizing binding across all proteincoding transcripts relative to transcription start and stop sites and start and stop codons, and then averaging across all expressed genes (Fig. 5a). Considering binding relative to the coding region (CDS) and 5′ and 3′ untranslated regions of spliced mRNA, we observed an overall average of approximately one peak per gene across the entire mRNA (Additional file 3: Fig. S5a), with a variety of patterns of individual RBP association ( Fig. 5b). At a global level, the most striking observation was clear delineation points at the start and stop codon positions (Fig. 5b, c), likely reflecting the fact that translation initiation is unique to the 5′UTR whereas the 3′UTR is the only region where bound RBPs will not be removed by translating ribosomes. However, more subtle clustering revealed distinct subgroups within the broader 5′ UTR-, CDS-, and 3′UTR-enriched classes (Fig. 5b, d). For example, we observed two distinct classes of 5′UTR binding that appear to correlate with distinct RBP functions. The first (5UTR.TSS) showed greater enrichment closer to the transcription start site and included nuclear 5′ end processing factors such as cap-binding protein NCBP2 (Fig. 5b, d). In addition to 5′ end enrichment, this class also contained RBPs with substantial 3′UTR signal, such as 3′ end processing factor CSTF2T (which also showed significant signal extending past annotated transcription termination sites (Additional file 3: Fig.  S5b), consistent with previous CLIP studies [42]). A second set (5UTR.SC) showed biased peak presence closer to the start codon and included both canonical translational initiation factors (such as EIF3G, EIF3D, and EIF3H) as well as RBPs previously shown to play translational regulatory roles (including DDX3X, SRSF1, and FMR1) (Fig. 5b).
Similarly, we also observed distinctions within CDS binding, with either uniform (CDS.UN) density or biased towards the 5′ (CDS.5P) or 3′ (CDS.3P) end. We observed that 13 out of 15 spliceosomal RBPs showed CDS enrichment (10 of which fell into the CDS.UN category), likely reflecting the general lack of introns in 5′UTRs (due to their small size) and 3′UTRs (as they would create targets for nonsense-mediated decay) (Fig. 5b, d).
Finally, we observed multiple modalities of 3′UTR peak distribution. The 3UTR.Un class showed relatively uniform density and contained many well-characterized 3′UTR binding proteins, including NMD factor UPF1 and stress granule factor TIA1. In contrast, RBPs in the 3UTR.5P class had peak density enriched closer to (and continuing 5′ of) the stop codon, including the wellstudied IGF2BP family of RBPs (Additional file 3: Fig.  S5c). Finally, we observed a number of RBPs with increased enrichment towards the transcription termination site (3UTR.TTS).
Next, we considered whether these patterns corresponded to different RNA processing functions. Although the number of RBPs is limited for some functions, we observed that many clusters had significant overlaps with distinct RBP functional annotations (Fig. 5e, Additional file 3: Fig. S5d). In particular, RBPs associated with nuclear RNA processing steps showed little change (median 1.2-fold decrease in peak density around the stop codon), whereas RBPs with cytoplasmic roles showed a significant 1.6-fold increase (Additional file 3: Fig. S5e), consistent with a stronger role for the stop codon as a delineation point for cytoplasmic RBP association. In all, our results suggest that the pattern of relative enrichment in different gene regions is predictive of the regulatory role that the RBPs play. Shown are all RBPs with average enrichment of at least 2 (for Alu elements) or 5 (for L1 elements). d Bars indicate L1 retrotransposition casTLE effect score (positive score indicates increased retrotransposition upon RBP knockout), with error bars indicating 95% minimum and maximum credible interval estimates (data from Liu et al. [38]). e (left) Each point indicates significance (from two-sided Kolmogorov-Smirnov test) between fold changes observed in RNA-seq of RBP knockdown for the set of genes with one or more RBP-bound L1 (or antisense L1) elements versus the set of genes containing one or more L1 (or antisense L1) elements but lacking RBP binding (defined as overlap with an IDR peak). RBPs were separated based on requiring 5-fold enrichment for L1 elements as in c. (right) Cumulative distribution plots for (top) MATR3 in HepG2 and (bottom) SUGP2 in HepG2. Significance shown is versus the set of genes containing one or more L1 (or antisense L1) elements but lacking RBP binding (red line). f Points indicate the fraction of antisense L1-assigned reads that map to canonical (RepBase) elements for six expression-altering antisense L1-enriched eCLIP datasets (from e), five other antisense-L1 enriched eCLIP datasets, and 11 paired input samples. Significance is from the two-sided non-parametric Kolmogorov-Smirnov test. See Additional file 3: Fig. S4g for the full distribution of read assignments

Splicing regulatory roles revealed by intronic meta-gene profiles
Next, we performed regional analysis to query binding to exons (specifically 50 nt bordering the splice sites) and 500 nt of proximal introns flanking both the 3′ and 5′ splice sites. As an example, we observed that out of 89,265 introns present in highly expressed transcripts (TPM > 1), 2699 had a significant IDR peak from eCLIP of U2AF2 in K562 cells (Additional file 3: Fig. S6a). These peaks had a stereotypical positioning at the 3′ splice site (extending into the downstream exon due to the use of full reads rather than just read 5′ ends for analysis), matching the well-characterized role of U2AF2 in 3′ splice site recognition (Fig. 6a). These matrices were then summed across all introns to calculate a meta-intron plot representing the average peak coverage at each position, with confidence intervals estimated by bootstrapping (Fig. 6b).
Performing this analysis for 130 RBPs with sufficient peaks (see the "Methods" section), we observed that the profiles recapitulated many known binding patterns, including U2AF1 and U2AF2 at the 3′ splice site, SF3B4 and SF3A3 at the branch point, PRPF8 at the 5′ splice site, and RBFOX2 and PTBP1 at proximal introns (Fig. 6c). Clustering analysis indicated a number of distinct RBP association patterns. In addition to a large group of exclusively exonic datasets, we observed clusters for the canonical splicing features (5′ splice site, 3′ splice site, and branch point), and two additional clusters: one where RBPs showed enrichment for peaks at proximal introns flanking both the 5′ and 3′ splice sites, and one with dominant enrichment in the 5′ splice site proximal intron only (Fig. 6c, right). We also observed a wide range of peak frequency; canonical splicing machinery components such as U2AF2, SF3B4, and PRPF8 had significantly enriched peaks at many introns (with a position maximum of 3.6%, 7.8%, and 5.3% of queried abundant introns respectively in K562), whereas factors such as PTBP1 and RBFOX2 were less commonly enriched at specific positions (0.1% and 0.5%, respectively) (Fig. 6c).

Insights into spliceosomal association and core splicing regulation
The breadth of RBPs profiled provided a unique opportunity to explore their interactions with the spliceosome and their impacts on splicing regulation. In addition to contacting the intron, many spliceosomal and splicing regulatory proteins also interact with the spliceosomal small nuclear RNAs (snRNAs). The overall snRNA family includes five specific RNA families (U1, U2, U4, U5, and U6, which also have variant isoforms that differ slightly in sequence) that play essential roles in canonical GT-AG RNA splicing, as well as four (U11, U12, U4atac, U5atac) specific to the minor AT-AC spliceosome, each of which plays specific mechanistic roles during splicing [43]. Thus, RBP association with a particular snRNA can help to map its function to a particular step in splicing. Quantitating snRNA enrichment using the family-aware mapping described above, we recapitulated many known associations between RBPs and the spliceosome, including interactions of SF3B4 with U2 snRNA (47-and 32fold enriched in HepG2 and K562, respectively) [44] and GEMIN5 with U1 (11.2-fold enriched in K562) [45] (Fig. 7a). In some cases, these dominated overall RNA recovery; for example, an average of 41% of reads from SF3A3 eCLIP and 17% and 20% of SF3B4 eCLIP reads in HepG2 and K562 respectively mapped to the U2 snRNA, whereas U2 reads averaged only 0.7% in input samples.
Interestingly, while many factors showed similar association between analogous snRNAs in the major and minor spliceosomes (such as PRPF8 and SMNDC1 with U6 and U6atac, and SF3B1 and SF3B4 with U2 and U12), some RBPs were specifically associated with either the major (SF3A3, which was 29.5-fold enriched for U2 but 1.2-fold depleted for U12 in HepG2, and QKI, 118.6-fold enriched for U6 but 2.4-fold depleted for U6ATAC) or minor spliceosome (HNRNPM, which was 8.1-fold enriched in K562 and 7.6-fold in HepG2 for U11 but 5.3-and 4.2-fold depleted for U1) (Fig. 7a, Supplemental Fig. 7a-d). Although preliminary analysis did not show altered splicing upon HNRNPM knockdown specifically at U11/U12 introns, previous studies have suggested that HNRNPM may contribute to minor intron splicing through interactions with FUS [46].
In the first catalytic step of intron splicing, a transesterification step joins the 5′ splice site with the branch point to create an intron lariat structure (Additional file 3: Fig. S7e). This is an essential step in splicing and helps to define 3′ splice site choice, but identification of branch points has remained challenging due to variable positioning (ranging from 20 to 40 nucleotides upstream of the 3′ splice site) and a degenerate sequence motif [47]. Recent efforts to use either specialized library preparation protocols or focused analysis of deep sequencing to identify branch points via lariat junction-spanning reads have enabled the identification of tens of thousands of branch points, but the regulation of branch point recognition and its role in splicing regulation remains poorly understood. Considering the RBPs profiled here, we observe multiple RBPs showing specific enrichment at branch points, including both known regulators (such as SF3 complex components SF3B4 and SF3A3), as well as novel factors (including RBM5). Indeed, analysis of these datasets coupled with focused iCLIP profiling of purified spliceosomes recently indicated distinct patterns of RBP association at branch points and 5′ and 3′ splice sites, which yielded unique insights into how branch point strength defines RBP association and splicesomal assembly dynamics [48].
However, we were particularly intrigued by the observation of a striking pattern of both 5′ splice site and branch point enrichment for the RBP AQR (Fig. 7b). Knockdown of AQR yielded over 30,000 altered alternative splicing events, by far the most of any knockdown performed by the ENCODE consortium to date (including canonical splicing components including U2AF1/2 and SF3B4) [20], consistent with previous studies that indicate a role for AQR in pre-mRNA splicing [49]. However, closer inspection revealed that unlike the canonical peak shape in the branch point region observed for SF3B4 and SF3A3, the 5′ end of AQR eCLIP reads often piled up at specific positions (Fig. 7b). Using simple criteria to identify candidate branch points as positions with more than 50% of read 5′ ends within the overall − 15 to − 50 region, out of 2475 introns with at least 20 reads mapping to the entire branch point region, we identified 1018 candidate branch points in K562 (Fig. 7c). Motif analysis of these positions yielded the canonical branch point motif signal (with 92% containing an A at the base prior to read starts) (Fig. 7c). Thus, these results suggest that AQR eCLIP signal is derived from introns after lariat formation, where reverse transcription is incapable of reading through the branch point adenosine (Additional file 3: Fig. S7e), and that deeper sequencing of AQR eCLIP (potentially with improved methodology to enrich reads at the 3′ rather than 5′ splice site) will provide direct identification of branch points in human. Next, we considered eCLIP signal at alternatively spliced cassette exons. Considering "native" cassette exons in wild-type K562 and HepG2 cells, we observed that branch point factors SF3B4 and SF3A3 showed decreased signal at alternative exons relative to constitutive exons, consistent with U2AF2 and other spliceosomal components and potentially reflecting overall lower spliceosomal occupancy (Additional file 3: Fig. S7f). However, at alternative 3′ splice sites with the proximal site increased upon knockdown of branch point components SF3B4 and SF3A3, we observed that average eCLIP enrichment for SF3B4 and SF3A3 was decreased at the typical branch point location but increased towards the 3′ splice site (compared to eCLIP signal at native A3SS events which utilize both distal (upstream) and proximal 3′ splice sites in control shRNA datasets) (Fig. 7d, Additional file 3: Fig. S7g). Consistent with previous minigene studies showing that 3′ splice site scanning and recognition originates from the branch point and can be blocked if the branch point is moved too close to the 3′ splice site AG [50], these results provide further evidence that use of branch point complex association to restrict recognition by the 3′ splice site machinery may be a common regulatory mechanism [51] (Additional file 3: Fig. S7h).

Clustering of RBP binding identifies known and novel coassociating factors
Large-scale RBP target profiling using a consistent methodology enables cross-comparison between datasets. Considering simple overlap between peak sets for all profiled RBPs, we observed significant overlap for many pairs of RBPs, which often formed co-associating groups (Fig. 8a, left). These groups of RBPs with highly overlapping peaks generally segregated into four major categories. First, we observe high similarity between the same RBP profiled in HepG2 and K562 (including QKI, PTBP1, and LIN28B) (Fig. 8a, green). Indeed, we observe an average peak overlap of 30.0% between the same RBP in K562 and HepG2 versus 4.9% for random RBP pairings (6.1-fold increased), confirming the broad reproducibility of binding across cell types (Fig. 8b). Second, we observe many cases of high overlap between eCLIP for homologous RBPs within the same family, including TIA1 and TIAL1, IGF2BP1/2/3, and fragile X-related FMRP, FXR1, and FXR2 (Fig. 8a, yellow). Third, we observe clusters containing known co-regulating RBPs, including recognition and processing machinery for the 3′ splice site (U2AF1 and U2AF2), branch point (SF3B4 and SF3A3), and 5′ splice site (EFTUD2, RBM22, PRPF8, and others), as well as a group of RBPs that play general roles in binding the 5′UTR of nearly all genes to regulate translation (DDX3X, EIF3G, and NCBP2) (Fig. 8a, red).
Interestingly, we observe unexpected clusters that suggested potential novel complexes or co-interacting partners (Fig. 8a, blue). Some clusters likely reflect overlapping targeting to specific types of RNAs: for example, one cluster contains three RBPs we described above to show specific enrichment at antisense L1/LINE elements (HNRNPM, BCCIP, and EXOSC5). The patterns of other clusters are often less clear, with some containing both well-studied RBPs as well as those with no known RNA processing roles (for example, high overlap between HNRNPL and AGGF1 across both cell types). To consider whether these likely reflected true instances of RBP co-interaction, we asked whether RBPs that had higher peak overlap were more likely to have interactions from large-scale IP-mass spectrometry experiments. Using the BioPlex 2.0 database of~56,000 interactions [52], we observed that RBPs with IP-MS interactions showed an average 2.3-fold increase in eCLIP peak overlap (11.4% versus 4.9% for RBPs without interactions), suggesting that there is a general correlation between peak overlap and RBP interactions (Fig. 8c).
Finally, we performed co-immunoprecipitation (co-IP) studies focusing on one predicted novel interaction group involving HNRNPL and AGGF1. We observed that AGGF1 co-immunoprecipitated HNRNPL, unlike unrelated factors RBFOX2 or FMR1 (Additional file 3: Fig. S8a). We note that this co-IP was observed using less stringent co-IP wash buffers, but was not observed using the high-salt wash buffers present in eCLIP (Additional file 3: Fig. S8b), indicating that the overlap in eCLIP binding likely reflects independent crosslinking events to the distinct RBPs. Thus, these results indicate (See figure on previous page.) Fig. 7 Insights from eCLIP of spliceosome-associated RBPs. a Heatmap indicates fold-enrichment for individual snRNAs within eCLIP datasets. Shown are all RBPs with greater than 5-fold enrichment for at least one snRNA. b Browser shows read density for eCLIP of AQR (K562), SF3B4 (K562), and SF3A3 (HepG2) for the NARF exon 11 3′ splice site region. Dotted line indicates position of enriched reverse transcription termination at crosslink sites. c (left) Pie chart shows all (n = 2475) introns with > 20 reads in the − 50 to − 15 (branch point) region in AQR K562 eCLIP. Blue indicates putative branch points (the subset with more than 50% of read 5′ ends at one position). (right) Motif information content for 11-mers centered on the putative branch points. Image generated with seqLogo package in R. d Lines indicate mean normalized eCLIP enrichment in IP versus input for SF3B4 and SF3A3 at (red/purple/green) alternative 3′ splice site extensions in RBP knockdown or (black) alternative 3′ splice site events in control HepG2 or K562 cells. The region shown extends 50 nt into exons and 300 nt into introns that the eCLIP data resource reveals many novel RBP interactions that are likely to reflect previously unidentified regulatory complexes.

Discussion
The ENCODE RNA binding protein resource contains 1223 replicated datasets for 356 RBPs, including in vivo targets by eCLIP, in vitro binding motifs by RNA Bind-N-Seq, subcellular localization by immunofluorescence, factor-responsive expression and splicing changes by knockdown/RNA-seq, and DNA associations by ChIPseq [20]. This unique resource has already proven useful in characterizing allele-specific RBP interactions [53,54], identifying candidate regulators of miRNA processing [55], predicting whether RNAs are protein-coding or non-coding [56], and identifying novel factors which act to suppress improper RNA processing caused by retrotransposable elements [37], and will continue to enable researchers to ask broad questions about basic RNA processing mechanisms, deeply consider the functional roles of an individual RBP, or even query an RNA of interest in order to gain insight into potential regulators. Here, we describe examples how integrated analyses of binding profiles obtained from eCLIP can yield novel insights into both processing of standard mRNAs as well as other RNA families, including identifying new characteristics of ribosomal RNA processing and the role of RBP interactions with retrotransposable elements.

Inference of RBP function based on eCLIP enrichment patterns
Deep profiling of RBPs associated with a specific RNA processing pathway can yield unique insights into the specialization of RBPs. For example, profiling of 30 RBPs associated with RNA degradation gave insights into specific RNP complex variants with roles targeting specific subtypes of RNAs, providing a comprehensive view of how the wide array of RNAs in the cell are turned over [8]. In contrast, the relatively unbiased selection of 150 RBPs profiled here enabled us to query across a wide variety of RBP functions and binding modalities and, at a broad level, address the basic question of whether RNA targets identified by CLIP can generally predict the likely function of the RBP of interest. This analysis confirmed overlap at both the RNA transcript class level, where eCLIP enrichment for ribosomal RNA or retrotransposable elements correlated with specific RBP functions focused around these element types, and the regulatory region level, where enrichment at 5′UTRs or branch point regions corresponded to specific RBP functional roles. Although these overall patterns match well with our existing understanding of RBP functions, the validation of distinctive profiles for different functions enables deeper interpretation of RBPs based solely on eCLIP. For example, we observed specific enrichment for GEMIN5 beginning in the 5′UTR and peaking at the start codon, providing further genome-wide validation for the role of GEMIN5 in translation regulation [57] (Fig. 2b). Similarly, the association of ZC3H11A at the 3′ end is consistent with iCLIP signal observed for TREX complex component ALYREF [58] and provides further transcriptome-wide evidence to support the observation that ZC3H11A plays an essential role in export of polyadenylated mRNA through interaction with the TREX complex [59]. As we continue to profile additional RBPs, these results suggest it should become possible to predict RBP function with increasing resolution based on association patterns.
Considering meta-exon plots focused on exon/intron boundaries, we observe expected patterns of eCLIP enrichment at canonical splicing elements (5′ and 3′ splice sites and branch points). We also observe classes of RBPs with broader patterns of enrichment, with a particularly interesting group showing a stereotypical pattern of high enrichment at the 5′ end of introns (extending hundreds of bases into the intron). Notably, this cluster contains multiple factors with links to cotranscriptional RNA processing, including CSTF2T [60], XRN2 [61], and Nono [62], suggesting that this group may reflect interactions that mark the time period between 5′ splice site transcription and splicing. Interestingly, this cluster also contains FET family proteins FUS and EWSR1, consistent with previous CLIP-seq studies which identified a similar "sawtooth" pattern for FUS [63] and suggesting that co-transcriptional deposition may be a general regulatory principle for this family of neurodegenerative disease-associated RBPs.

Enrichment patterns reveal insights into ribosomal RNA processing
The enrichment for previously identified rRNA processing factors suggests that many additional factors here may represent unexplored regulators. Indeed, building upon rRNA enrichments observed from the analyses described here, further research has led to validation of NOL12 [64] and AATF [65] as novel regulators of ribosomal RNA processing, indicating that there remain more RBPs with unexplored roles in ribosomal RNA processing.
Another benefit of the unbiased approach presented here is that it enables identification of novel potential sites of regulatory activity, as our analysis of the 45S ribosomal RNA precursor indicates a surprising cluster of substantial RBP eCLIP enrichment at an uncharacterized region located between the A0 and 1 processing sites. This region (particularly the sharp peak observed in LIN28B eCLIP) is centered on a putative ribosomal-encoded microRNA (rmiR-663) [32], and our analysis indicates that the reads do appear to be derived from ribosomal RNA rather than paralogous genomic-encoded microRNAs. However, we do not observe enrichment in DROSHA or DGCR8 eCLIP in this region (Fig. 3e), suggesting that rmiR-663 does not progress through the normal miRNA maturation pathway. Thus, it remains unclear whether this represents a bona fide microRNA, or more complex regulation of either ribosomal RNA processing or maturation of other microRNAs. Indeed, LIN28B has previously been shown to inhibit let-7 biogenesis by sequestering primary let-7 transcripts in the nucleolus away from DROSHA processing [31]. Although one model could be that LIN28B association to this region simply is an artifact of nucleolar localization, the high abundance of 45S rRNA overall (and nearly 500-fold enrichment for LIN28B at this site) suggests that the rmiR-663 region might instead act to sequester LIN28B, thereby coupling LIN28B inhibition of let-7 microRNA biogenesis to ribosomal RNA transcription and abundance. Similarly, although SSB has previously been associated with microRNA processing through interactions with pre-and pri-miRNAs [66], SSB traditionally interacts with RNA Polymerase III transcripts [67], potentially suggesting distinct Polymerase III transcription of this region in addition to Polymerase I transcription of the entire 45S transcript. Further work will be required to fully confirm whether rmiR-663 is actually processed from the 45S to maturity as a functional miRNA incorporated into the RISC complex for mRNA targeting, or whether these other potential regulatory modalities act to control other aspects of rRNA or microRNA processing.

Retrotransposable element suppression: a major function for many RBPs
Analysis of Alu elements identified 3 RBPs with at least 4-fold enrichment, each of which appears to reflect a different underlying mechanism. The most enriched RBP, HNRNPC, has previously been shown to suppress cryptic 3′ splice site signals in antisense Alu elements [36]. In contrast, ILF3 (enriched for both sense and antisense Alu elements) has previously been shown to interact with RNA editing mediator ADAR1 [68], and the majority of ADAR1 targets and edited sites throughout the genome occur at Alu elements [69]. Further research has now revealed that ILF3 knockdown induces RNA editing, and suggested that ILF3 binding to Alu elements generally acts to repress RNA editing at these sites [39]. The third RBP, RNA Polymerase II subunit POLR2G, may reflect previous observations of antisense L1 and (particularly inverted tandem) Alu elements repressing PolII progression [40,70]. Indeed, we observe that POLR2G eCLIP shows enrichment for sense Alu (2.3-fold), sense L1 (1.8-fold), and antisense L1 (4.0-fold) elements as well as antisense Alu (5.0-fold), providing further evidence that the high propensity for such regions to form structural elements may generally inhibit polymerase progression through these regions, leading to increased dwell time for POLR2G.
Similarly, analysis of L1 element enrichment revealed multiple modalities of regulatory activity. One function of RBP association to L1 is to suppress retrotransposition activity, and indeed, we observed that three RBPs (PPIL4, SAFB, and TRA2A) showed both eCLIP enrichment for sense L1 elements and act to suppress L1 retrotransposition activity in genome-wide screening data. For RBPs enriched for antisense L1 elements, we instead see signatures of RBPs acting to increase RNA expression, extending a similar analysis recently published (that included an earlier release of the ENCODE eCLIP resource along with other iCLIP datasets) that revealed widespread association with L1 elements by RBPs [37]. From these and other works, it is now becoming clear that suppression of aberrant RNA processing due to retrotransposable elements is a major responsibility of many RNA binding proteins, suggesting that the genome has evolved to devote substantial resources to this effort.

Large-scale RBP target maps provide unique opportunities for further specialized insights
It is notable that the above enriched RNA element classes often reflected a substantial fraction of eCLIP reads, suggesting that they may represent dominant functions of the RBP. For example, antisense L1 elements constituted 19-27% of eCLIP reads for HNRNPM and MATR3 and antisense Alu elements were 13-18% of reads in HNRNPC eCLIP. Similarly, 42-56% of UTP18, 27-31% of WDR43, and 16% of HepG2 LIN28B eCLIP reads mapped to the 45S ribosomal RNA precursor. Thus, these results strongly argue that analysis of CLIP data should include proper quantitative analysis of reads mapping to non-mRNA regions, as they can in some cases represent the dominant binding modality of the RBP and should be considered in interpreting potential functional roles of the RBP in regulating RNA processing.
Intriguingly, we even observed significant differences even between RBP components of the same RNP complex. For example, 41.0% of SF3A3 HepG2 eCLIP reads mapped to RNU2 snRNA versus only 8.5% mapping to proximal intronic regions; in contrast, SF3B4 was far more even (23.1% proximal intronic in HepG2 and 17.8% in K562, versus 17.0% and 19.7% RNU2 in HepG2 and K562, respectively). Although we cannot rule out that this difference in crosslinking to snRNA versus intron reflects underlying amino acid biases in UV crosslinking efficiency, it does confirm that CLIP profiling of multiple RBP members of an RNP complex can yield distinct insights into interaction patterns and regulatory roles of the complex, suggesting that it is critical to assay multiple independent proteins to gain a full understanding of the target repertoire of an RNP complex.
In addition to specific insights into the RBPs themselves, we anticipate that the broad diversity of RBPs profiled and RNA elements and features bound will spur further development of methods targeted towards specific RNA processing steps. For example, the peak distribution pattern of the CDS.5P class (and RPS3 in particular) resembles the average profile observed using ribosome profiling [71], suggesting that RPS3 eCLIP may capture ribosome association on translating mRNAs and could be used as a general approach to assay translation. Similarly, our metaexon analysis of AQR (followed by further analysis of crosslink-induced termination sites) showed that AQR eCLIP could identify branch points for a set of highly abundant introns, suggesting that further development of profiling of AQR binding targeted to 3′ splice site regions could yield a highly specific approach to identification of branch points transcriptome-wide. Recent work using iCLIP to specifically purify spliceosomeassociated RNAs further showed that other eCLIP datasets analyzed here also showed highly stereotypical crosslinking patterns around branch points, which could also broadly map branch point locations and reveal unique insights into the combinatorial effect of branch point and splice site strength on spliceosomal assembly and dynamics [48].
The diversity of distinct RBP association patterns can also be flipped to predict features of a queried RNA. For example, recent work used the ENCODE eCLIP resource to identify UPF1 as one of many RBPs with specific enrichment at 3′UTRs [56]. This finding enabled improved prediction of whether a queried transcript was a protein coding versus long non-coding RNA by incorporating presence (or absence) of UPF1 eCLIP signal as a biomarker for translation [56]. Similarly, our unbiased analysis of foci of enrichment on the 45S rRNA precursor suggested two regions as notably highly enriched across multiple RBPs, one of which matches a well-characterized region (between the canonical 01 and A0 processing sites) with another suggesting interesting regulatory mechanisms linking ribosomal RNA and microRNA processing. Similar analysis identifying eCLIP datasets with enrichment on regulatory non-coding RNAs Xist and Malat1 also suggested that the patterns of RBP enrichment often correlate with specific structural and functional domains on these non-coding RNAs [18]. With the continuing release of profiles for additional RBPs, we expect that identification of these distinct RBP "states" may serve as a useful method for independent prediction of key regulatory domains within these non-coding RNAs.

Conclusions
The maturation of methods to profile the in vivo targets of regulatory proteins at both the DNA and RNA level has enabled unparalleled large-scale efforts to map the human gene expression regulatory network [16,20]. In this work, we describe how integrated analysis of targets for 150 RBPs identified by eCLIP, coupled with analysis tools to quantify enrichment to multicopy and other RNA elements beyond standard pre-mRNAs, provides a unique perspective into RNA processing regulation. Through analysis of rRNA processing, linkages between RBP target modalities and mechanistic functions, and RBP complexes, we show that analysis of such largescale, unbiased views of the RNA processing landscape can yield unique insights into RNA regulation, suggesting that there remains much to learn about how RBPs control gene expression in humans.

eCLIP datasets used
Enhanced CLIP (eCLIP) datasets used were obtained from the ENCODE data coordination center (https://www. encodeproject.org) with accession identifiers listed in Additional file 1. Unless otherwise indicated, standard peak analysis used the set of peaks identified as irreproducible discovery rate (IDR) reproducible and meeting foldenrichment (≥ 8-fold) and significance (p value ≤10 −3 ) in immunoprecipitation versus paired size-matched input. RNA binding protein function annotations and localizations were obtained from [20] (Additional file 2). The list of RNA binding proteins was obtained from [3]. The list of transcription factors was obtained from [72], using the "a," "b," and "other" classes.

Biotin-based visualization of RBP-coupled RNA
A step-by-step version of the biotin-based labeling protocol is available at https://www.protocols.io/view/biotin-labelling-of-immunoprecipitated-na-v1pre-7z4hp8w. In brief, for visualization experiments, HepG2 or K562 cells were prepared identically to eCLIP experiments up until the first RNA adapter ligation: 20 million cells were lysed in 1 mL 4°C eCLIP lysis buffer, fragmented for 5 min at 37°C with 40 U RNase I (Ambion), centrifuged at 15k RPM for 3 min at 4°C (with supernatant kept) to clear lysate, and incubated with rotation overnight with antibody coupled to species-specific secondary beads (10 μg primary antibody as indicated coupled to 125 μL of Sheep anti-Rabbit or anti-Mouse Dynabeads; ThermoFisher). After incubation, samples were washed once with eCLIP wash buffer, washed twice with high-salt wash buffer, and washed three times with wash buffer. FastAP and T4PNK reactions were performed on-bead as previously described for eCLIP, followed by one wash with high-salt wash buffer and 3 washes with wash buffer. At this point, a modified RNA linker ligation was performed with standard eCLIP ligation conditions (buffer and High ?A3B2 show $132#?>Concentration T4 RNA Ligase) but with 500 pmol pCp-Biotin (Jena Bioscience) in place of the RNA adapter, and samples were incubated at 16°C. For some experiments, immunoprecipitations were performed on 4 million cells; for these experiments, half reactions were used for the pCp-biotin ligation step. After ligation, samples were washed once with high-salt wash buffer and three times with wash buffer, followed by standard SDS-PAGE electrophoresis and transfer to nitrocellulose membranes. Visualization was performed using the Chemiluminescent Nucleic Acid Detection Module Kit (ThermoFisher), following the manufacturer's instructions for blocking, washes, and labeling. Imaging was performed on the Azure C600 platform. For 32 P experiments, radiolabeling was performed as previously described [73].

Family-aware mapping to multicopy elements
The software pipeline used to quantify enrichment for retrotransposable and other multicopy elements is available at https://github.com/YeoLab/repetitive-elementmapping, and was initially described in [20] but is described in more complete detail below. This release includes scripts, detailed documentation, and database files necessary to perform the described analyses.
A database of multicopy elements was generated based on 5606 transcripts obtained from GENCODE v19 covering 34 families of abundant non-coding, multicopy, and other types of RNA refractory to standard peak analysis, including families within the broader rRNA (RNA18S, RNA28S, RNA5S, RNA5-8S), snoRNA (SNORD, SNORA, RNU105, RNU3, RNU7, snoU13, snoU109, U8), snRNA (RNU1, RNU2, RNU4, RNU4ATAC, RNU5A, RNU5B, RNU5D, RNU5E, RNU5F, RNU6, RNU6ATAC, RNU11, RNU12), vault RNA (VTRNA1, VTRNA2, VTRNA3), non-coding RNA (H1RNA, RN7SK, RN7SL, MRP, YRNA), and small Cajal body-specific RNA (SCARNA) broader classes (Additional file 4). Each family contained GENCODE v19 annotated transcripts as well as their pseudogenes. To this set were added a family for tRNAs (606 tRNA transcripts were obtained from GtRNAdb [74], and each tRNA was included in two versions: one variant including 50 nt of genome flanking sequences, and one mature variant that included the canonical CCA tail), mitochondrial transcripts (which were initially added as one class of 37 annotated genes, but ultimately counted as two families based on H-or L-strand position that included not only gene-mapping reads, but also intergenic reads mapping uniquely to the mitochondrial genome), the rRNA RNA45S precursor transcript (NR_046235.1, obtained from GenBank), a "simple repeat" class containing 501 60-mer sequences containing simple repeats of all 1-to 6-nt k-mers, and 49 families comprising 705 total human repetitive elements obtained from the RepBase database (v. 18.05) [75]. Within each family, transcripts were given a priority value, with primary transcripts prioritized over pseudogenes. Mapping to the reverse strand of a transcript was counted separately from forward strand mapping, creating a second "antisense" family for each RNA family above (which utilized the same element priority order), with the exception of simple repeats (which were all combined into one family).
To quantify eCLIP signal, paired-end sequencing reads were first adapter trimmed as previously described [18]. Next, reads were mapped against the repetitive element database using bowtie2 (v. 2.2.6) with options "-q --sensitive -a -p 3 --no-mixed -reorder" to output all mappings. Read mappings were then processed as follows. length) were kept. Next, the mapping to the transcript with the highest priority within a RNA family (as listed above) was identified as the "primary" match mapping. At this stage, read pairs which had equal best alignments to multiple repeat families were discarded, with only reads mapping to a single repeat family considered for further quantification.
Next, these RNA family mappings were integrated with unique genomic mapping from the standard eCLIP processing pipeline (using read mapping prior to PCR duplicate removal). For read pairs that mapped both to an RNA family above as well as uniquely to the genome, the mapping scores (as defined above) were compared. If the unique genome mapping was more than 2 mismatches per read (24 alignment score for the read pair) better than to the repeat element, the unique genomic mapping was used; otherwise, it was discarded and only the repeat mapping was kept. Next, PCR duplicates were removed by comparing all read pairs based on their mapping start and stop position (either within the genome or within the mapped primary repeat) and unique molecular identifier sequence, and all but one read pair for read pairs sharing these three values were defined as PCR duplicates and removed. At this stage, RepeatMasker-predicted repetitive elements in the hg19 genome were additionally obtained from the UCSC Genome Browser [24]. Element counts for RepBase elements were therefore determined as the sum of repeat family-mapped read pairs (described above) plus the number of reads that mapped uniquely to the genome at positions which overlapped (by at least one base) RepeatMasked RepBase elements. Reads uniquely mapping to non-RepBase genomic regions were then annotated into one of 11 additional classes in the following priority order (based on GENCODE v19 annotations): CDS, 5′UTR and 3′UTR, 3′UTR, 5′UTR, proximal intronic (within 500 nt of splice sites), distal intronic (remaining intronic regions), non-coding exonic, noncoding proximal intronic, non-coding distal intronic, antisense to GENCODE transcripts, and intergenic.
Finally, the number of post-PCR duplicate removal read pairs mapping to each class was counted in both IP and paired input sample and normalized for sequencing depth (using the total number of post-PCR duplicate read pairs from both unique genomic mapping as well as repeat mapping as the denominator to calculate fraction of reads). Significance was determined by Fisher's exact test or Pearson's chi-square test if all expected and observed values were five or more. Relative information content of each element in each replicate was calculated as p i Â log 2 ð p i q i Þ , where p i and q i are the fraction of total reads in IP and input respectively that map to element i. To combine two biological replicates, the average reads per million (RPM) was calculated across two IP samples and compared against the paired input experiment to calculate one overall fold-enrichment and relative information value per dataset.

Validation of RNA element links with RBP functional annotations
To quantify whether RNA element enrichment matched with RBP functions, a set of positive control pairings were generated between RNA elements with known links to either RBP function or known RBPs contained within a well-characterized ribonucleoprotein complex (Additional file 3: Fig. S2a). One hundred forty datasets for which the RBP had at least one of these annotated functions were selected, and datasets were sorted by relative information of the most-enriched class. Accuracy (defined as (TP + TN)/(TP + TN + FP + FN)) was then calculated, where true positives (TP) were RBPs for which the most-enriched RNA element was greater than the cutoff value and the RBP has published evidence for the function associated with the most-enriched RNA element, false positives (FP) were RBPs that had an RNA element meeting the relative information cutoff but the RBP lacked publication evidence for the linked function, false negatives (FN) were RBPs lacking an RNA element meeting the relative information cutoff but the RBP had published evidence for functions associated with at least one RNA element class, and true negatives (TN) were RBPs lacking annotated functions or RNA elements meeting the relative information cutoff. Accuracy was calculated for each possible relative information cutoff, and the maximum point (0.2) was chosen.

Ribosomal RNA analysis
RBPs with roles in ribosomal RNA processing were obtained from [28]. Position-wise relative information was calculated as above, using the number of reads overlapping the position in IP versus input for each dataset (using paired-end read 2 only, as was done for genomic mapping). To obtain a cutoff for further analysis, RBPs were sorted by the maximum position-wise relative information on the 45S rRNA precursor, and at each value, the F1 score was calculated (defined as (2 × TP)/ (2 × TP + FP + FN)) using the definitions described above. The maximum point at 0.101 was used for further analysis.
To quantify enrichment at the rmiR-663 ribosomal versus genomic paralog loci, sequences of rmiR-663 and four genomic-encoded paralogs (miR-663a, miR-663b, AC010970.1, and AC136932.1) were obtained from the UCSC Genome Browser, along with 100 nt of flanking sequence. Only reads that perfectly aligned (with zero mismatches or gaps) to these sequences were counted for further analysis.

Retrotransposable element analysis
L1 retrotransposition genome-wide CRISPR screening data was obtained from Liu et al. [38], using Combo cas-TLE Effect scores from K562 cells. Bonferroni correction was performed on uncorrected casTLE p values using n = 15 (the number of L1 (sense)-enriched RBPs queried).
To calculate change in expression of L1-containing bound genes, DESeq-calculated gene expression fold changes for RBP knockdown/RNA-seq data were obtained from the ENCODE DCC (http://www.encodeproject.org) for all RBPs with both eCLIP and RNA-seq performed in the same cell type. L1 sense and anti-sense elements were taken from RepeatMasker-predicted repetitive elements in the hg19 genome obtained from the UCSC Genome Browser [24]. For each gene in GENCODE v19, the transcript with the highest abundance in rRNA-depleted total RNA-seq in HepG2 (EN-CODE accession ENCFF533XPJ, ENCFF321JIT) and K562 (ENCFF286GLL, ENCFF986DBN) was chosen as the representative transcript, and the set of expressed genes (10,247 in HepG2 and 9162 in K562 with TPM ≥ 1) were considered. Next, genes were separated into three classes: "≥ 1 bound L1(as)" genes with at least one antisense L1 element that overlapped a significant peak identified in eCLIP, "bgd with ≥ 1 L1(as)" genes with at least 1 antisense L1 element but did not have an element that overlapped with an eCLIP peak, or "Bgd" which contained all expressed genes. Significance was determined by the Kolmogorov-Smirnov test with no multiple hypothesis testing correction.
To compare reference versus divergent L1 elements, we defined "canonical" reads as those which mapped best (and were assigned) to sequences present in RepBase, whereas "divergent" reads mapped better to unique genomic loci than to the reference sequence.
Calculation of overall element coverage (Additional file 3: Fig. S4b) was based on the above set of 9162 reference transcripts in K562 expressed with TPM ≥ 1.

Meta-gene and meta-exon peak density maps
To generate meta-gene and meta-exon maps, for each gene in GENCODE v19, the transcript with the highest abundance in rRNA-depleted total RNA-seq in HepG2 (ENCODE accession ENCFF533XPJ, ENCFF321JIT) and K562 (ENCFF286GLL, ENCFF986DBN) was chosen as the representative transcript, and the set of expressed genes (10,247 in HepG2 and 9162 in K562 with TPM ≥ 1) were considered. Datasets with fewer than 100 mRNA-overlapping peaks were discarded, leaving 205 datasets. Next, each gene was split into 162 bins (13 for 5′UTR, 100 for CDS, 49 for 3′UTR), based on the median 5′UTR, CDS, and 3′UTR lengths of highly expressed (TPM ≥ 10) GENCODE v19 transcripts in K562 cells. For each eCLIP dataset, the average peak coverage for each bin was calculated for each gene and then averaged over all genes to generate final meta-gene plot. To generate confidence intervals, bootstrapping was performed by randomly selecting (with replacement) the same number of transcripts and calculating the average position-level peak coverage as above, with the 5th and 95th percentiles (out of 100 permutations) shown. For further visualization and analysis, only 104 RBPs where the 5th percentile was at least 0.002 peaks per gene (~20 peaks in at least one bin) were considered. Normalized coverage was then calculated by setting the maximum position to one and minimum position to zero for each eCLIP dataset. Cross-position correlations were calculated using normalized coverage for across all 104 RBPs at each position. Odds ratios and significance (determined by Fisher's exact test or Yates' chisquare test if observed and expected values were greater than five) utilized RBP annotations (Additional file 3) from [20].
To generate meta-exon plots for each eCLIP dataset, for all internal exons (excluding the first and last exons), the region from 500 nt upstream to 500 nt downstream (for introns less than 1000 nt, the region was split with half assigned to the upstream exon and half to the downstream exon) was queried for the presence of significant (IDR) peaks. Finally, the number of peaks at each position was averaged over all events to obtain the final meta-exon value. To generate confidence intervals, bootstrapping was performed by randomly selecting (with replacement) the same number of transcripts and calculating the average position-level peak coverage as above, with the 5th and 95th percentiles (out of 100 permutations) shown. For further analysis, only datasets with at least 100 IDR peaks were considered. Next, after calculating meta-exon profiles and confidence intervals as above, datasets that did not have at least one position with the 5th percentile bootstrap value above a minimal cutoff of 0.0005 (~5 peaks observed at that position) were discarded to leave 133 datasets for further consideration. Finally, for visualization of comparison across RBPs (Fig. 6), an additional normalization was performed by dividing each position by the maximum meta-exon value for that dataset, in order to scale the meta-exon profiles between 0 and 1.

Analysis of AQR enrichment at branch points
To identify points of enriched read termination in AQR eCLIP, regions from − 50 nt to − 15 nt from annotated 3′ splice sites were obtained from GENCODE v19, and the subset of regions with at least 20 overlapping reads in AQR eCLIP in K562 cells were taken for further analysis. Points of enrichment were identified as those where more than half of reads overlapping the overall region terminated at the same position. Motif analysis was performed by counting the frequency of 11-mers centered on the read start position with 5 nt flanking on either side. Motif logos were generated with seqLogo (R).

Enrichment of branch point factors at alternative 3′ splice site events
Splicing maps profiling normalized enrichment for SF3B4 and SF3A3 at RBP knockdown-responsive alternative 3′ splice site events were generated as previously described [20,76]. In brief, the set of differential 3′ splice site events for RBP-knockdown/RNA-seq was identified from rMATS analysis between RBP knockdown and paired non-target control. Normalized read density in eCLIP was then calculated for each differential event by subtracting input read density from IP read density (each normalized per million mapped reads). To weigh each event equally, position-wise subtracted read density was then normalized to sum to one across the entire event region (composed of 50 nt of exonic and 300 nt of flanking intron), including a pseudocount of one read (normalized by total mapped read density) at each position. The highest 2.5% and lowest 2.5% values at each position across all events were then removed, and the mean was then calculated across all other events to define the final splicing map. As a control, a set of "native" alternative 3′ splice site events was defined as those which showed alternative usage (0.05 < inclusion < 0.95) in control K562 or HepG2 cells, respectively. Confidence intervals were generated by randomly sampling the number of events in the RBP-responsive class from the native alternative 3′ splice site set 1000 times, processing this sampled set as described above, and plotting the 0.5th to 99.5th percentiles.

Co-occurrence of RBP eCLIP peaks and validation of subcomplexes of RBPs
Overlap between eCLIP datasets A and B was determined by calculating the fraction of significant and reproducible peaks in dataset A that overlapped (by at least one base) a peak in dataset B, and vice versa the fraction of peaks in B that overlapped a peak in A, and taking the maximum of those fractions as the overall pairwise fraction overlap. Only datasets with at least 100 reproducible and significant peaks were used for this analysis. Gene Set Enrichment Analysis was performed using the GSEA software package [77]. RBP interaction data was obtained from the BioPlex 2.0 dataset [52].