Positionally enriched k-mer analysis (PEKA) overview
PEKA includes several features that are designed to examine and minimize the impact of technical biases of CLIP, and thereby to obtain enriched motifs that mediate RBP binding specificity. PEKA can perform motif discovery either across the full transcriptome or within defined transcriptomic regions, with the provided options including introns, 3′-UTR, remaining exonic regions of protein-coding genes (coding sequence (CDS) combined with 5′-UTR), non-coding RNAs (ncRNAs), and the rest of non-annotated intergenic regions (Fig. 1a). PEKA also provides an option to include or exclude repetitive regions in the analysis.
Importantly, PEKA implements an approach of background normalization that aims to minimize the technical biases at crosslink sites. Crosslink sites are determined by the first nucleotide of aligned sequencing reads, which can include nucleotide preferences of UV crosslinking or sequence biases of cDNA ligation [6, 14]. To normalize for these biases, PEKA extracts the background sequences that are centered on low-scoring crosslink sites (out-of-peak crosslinks, oXn), which are located outside the peaks (i.e., areas with high crosslink density). The peak file is provided by the user, and peaks can be identified by any peak calling tool. For the current study, peaks have been identified with our newly developed peak caller Clippy [19] (“Methods”) due to its short runtime and the ability to call peaks in an annotation-aware manner, to correct for biases of highly expressed genes and exons. Conversely, the foreground sequences are centered on high-scoring crosslink sites located inside the crosslinking peaks (thresholded crosslinks, tXn). The first step of PEKA is thus to split crosslink sites into thresholded (tXn) and out-of-peak (oXn) sites based on whether the sites overlap with a peak and whether the cDNA count at the site meets a minimum threshold that is defined in a region-specific manner (Fig. 1b, “Methods”). Notably, tXn and oXn sites are expected to be affected by the same technical biases since they are part of the same cDNA library.
PEKA derives background and foreground sequences from user-defined genomic windows centered on tXn and oXn, respectively (Fig. 1c). In the current study, a distance of 20 nt was used to define proximal windows used to collect the sequences (41 nt long, i.e., ±20 nt distance from Xn). The script allows users to adjust the size of the proximal window if they wish to narrow down the search for motifs closer or expand it further away from the crosslink sites. By scanning for the presence of k-mers across sequences, PEKA obtains k-mer counts at each sequence position and then transforms them into k-mer occurrences by normalizing with the number of evaluated sequences (Fig. 1d). Next, the “relevant positions” around crosslink sites are defined at which the enrichment is calculated for each k-mer (Fig. 1e). For this purpose, “relative k-mer occurrence” is calculated by normalizing occurrence at each proximal position with the mean occurrence in a distal window (defined as −150…−100 and 100…150 nt around tXn). Relevant positions are the ones where “relative k-mer occurrence” of each specific k-mer around tXn is higher than a threshold value that is determined based on relative occurrences of all k-mers (see “Methods” for more detail, Fig. 1d). The relevant positions are then used to calculate a PEKA score that represents motif enrichment. Alternatively, PEKA also offers users an option to calculate enrichment by using all positions within a proximal window without the process of selecting the “relevant positions.”
PEKA score conveys the extent of k-mer enrichment at relevant positions around thresholded crosslink sites (tXn) relative to the same relevant positions around out-of-peak crosslink sites (oXn), which represent the intrinsic background of the studied dataset (Fig. 1e,f). PEKA score is a derivative of standard score and measures the number of standard deviations separating the approximated motif occurrence around tXn (ARtXn) from the mean approximated occurrence around oXn (μ(ARoXn)) (Fig. 1f, “Methods”). For further analyses, the motifs are ranked based on PEKA score in a descending order and the top n k-mers are selected to visualize their positioning around tXn with occurrence profiles. By default, PEKA separates the top 20 k-mers into up to 5 clusters based on their similarity in sequence and occurrence profiles, and then visualizes the profiles of k-mers within each cluster on the same graph (Additional file 1: Fig. S1a). For heatmap visualizations, the top 40 k-mers are clustered based on their sequence, and their relative occurrence profiles are shown (Fig. 1g). The use of relative occurrence, which normalizes the raw k-mer occurrences with distal window (Fig. 1d), improves the capacity to compare the enriched positions of different k-mers, as otherwise the regional genomic differences of k-mers would decrease the visibility of the less-abundant k-mers, as shown by comparing the two approaches for LIN28B (Additional file 1: Fig. S2a,d).
PEKA detects multiple binding modes of RBPs
To get preliminary insights into the distribution and sequence characteristics of the enriched k-mers around crosslink sites, we first analyzed the occurrence profiles of 5-mers for the well-studied protein TIA1 in eCLIP (Fig. 1g). The 5-mer profiles for TIA1 eCLIP in the HepG2 cell line show the most enriched motifs to be U-rich (Fig. 1g), in agreement with the known sequence specificity of TIA proteins [20]. In particular, the top 18 k-mers are all U-rich, followed by “GGGGG” and several other G-rich k-mers. Interestingly, U-rich motifs are enriched directly at crosslink sites, whereas G-rich motifs are depleted at crosslink sites and enriched primarily downstream of the crosslink sites. These non-overlapping positional patterns of the two groups of motifs indicate either that they represent different modes of binding by TIA1, or that the G-rich motifs are derived from other co-purified RBPs that bind to the G-rich motifs. To examine whether G-rich k-mers indeed represent a novel binding mode of TIA1, we examined whether the enrichment of these k-mers could be reproduced by other CLIP methods. For this, we compared occurrence profiles of enriched k-mers for TIA1 iCLIP and TIA1 PAR-CLIP and found that enrichment of G-rich k-mers was specific to eCLIP (Additional file 1: Fig. S1b,c). This case study of TIA1 data from three different CLIP methods demonstrates that k-mers with similar sequences tend to have similar positional profiles around crosslink sites, the applicability of PEKA to various CLIP datasets, and the ability of PEKA to report multiple motif types with distinct profiles, which can yield insights into data specificity or multiple binding modes of RBPs.
To corroborate the ability of PEKA to detect multiple binding modes of RBPs, we plotted positional profiles of enriched motifs for two RBPs that are known to bind several types of motifs (Additional file 1: Fig. S2a-c). Indeed, the top 20 k-mers for TARDBP eCLIP form three groups (UG-repeat, GUAU-motifs, and UGAA-motifs) which are consistent with those previously found to have distinct binding preferences to mutant variants of TARDBP in iCLIP experiments [21]. Moreover, LIN28 is known to bind (U)GAU k-mer, as well as GGAG-like sequences [22], which are also both recovered among the top 20 enriched k-mers in eCLIP data. Finally, PEKA is able to detect and visualize complex binding patterns, such as bipartite motifs bound by QKI [23, 24] which are not apparent from the PWM-based visualizations provided by other methods for motif discovery. Given the useful insights derived from this visualization, we provide heatmaps of the top 40 k-mers determined by PEKA for all 223 ENCODE eCLIP datasets on the web interface we developed to share our results with the community [18].
Benchmarking of PEKA against mCross and in vitro data
A major challenge in CLIP data analysis is the identification of motifs that correspond to the RNA-binding specificity of the isolated RBP, as opposed to motifs enriched for other confounding reasons, such as the technical biases of CLIP, associated features of genomic regions and repetitive elements bound by the protein, or motifs bound by other RBPs that may be co-purified if IP stringency was insufficient [6]. To evaluate the specificity of enriched motifs, these can be cross-compared with the motifs obtained by in vitro methods, in particular the high-throughput methods RNA-Bind-n-Seq (RBNS) [25] and RNAcompete (RNAC) [26]. In both RNAC and RBNS, a specific RBP or an assembly of its RNA-binding domains is mixed with a random pool of RNAs, followed by the isolation and sequencing of bound RNA molecules. These in vitro methods are expected to have different biases than CLIP data, and their limitations arise mainly from the use of short RNA fragments and the absence of full-length proteins and their cofactors. Thus, data produced with these methods is well suited to examine the biological specificity of motifs derived from CLIP data.
To examine the capacity of PEKA to recognize the top-ranking motifs obtained from in vitro data, we compared the top 20 k-mers recovered from in vitro RBNS/RNAC datasets with the motifs identified by PEKA in the corresponding eCLIP data (Additional file 1: Fig. S3a). We also compared the performance of PEKA with that of a previously published mCross method on the same data [11]. In total, 41 eCLIP datasets (Additional file 2: Table S1) were compared for 28 distinct proteins as they contain both mCross and in vitro data (Additional file 1: Fig. S3a). In addition, we analyzed the recall achieved by PEKA in eCLIP datasets which have orthogonal in vitro data available but did not yield significant motif enrichment in mCross (Additional file 1: Fig. S3b). As RBNS provides 5-mer scores, PEKA was also run for 5-mers, and to ensure a valid comparison, the mCross and RNAC z-scores were converted from 7-mer to 5-mer scores (“Methods”) prior to ranking the motifs. Ranking k-mers by normalized mCross 5-mer z-scores resulted in k-mer logos (Additional file 1: Fig. S3c) that were very similar to the mCross sequence logos downloaded from the mCrossBase [11, 27] (Additional file 1: Fig. S3d), confirming that the relevant k-mers were retained during this conversion. Importantly, PEKA overall showed a similar performance to mCross in recovering the top 20 k-mers from in vitro data (Additional file 1: Fig. S3a, Additional file 8: Table S7) in addition to being able to achieve high recall for several datasets that did not yield significant enrichment in mCross (Additional file 1: Fig. S3b, Additional file 8: Table S7).
To compare PEKA and mCross in more detail, we returned to the G-rich motifs observed for TIA1 eCLIP in HepG2 cell line (Fig. 1g). Because these motifs are unexpected, and have not previously been reported to be bound by TIA1, they could be considered a potential artifact of PEKA. Therefore, we performed a direct comparison of the top 20 5-mers obtained for TIA1 eCLIP in HepG2 cell line, using either PEKA score, raw mCross z-scores, or normalized mCross z-scores, respectively. Based on our observation of the two k-mer groups in the data (U-rich and G-rich, Fig. 1g), the top 5-mers were divided into two clusters based on their sequence (“Methods”) and visualized as k-mer logos (Additional file 1: Fig. S3c). Reassuringly, G-rich k-mers were also detected by the raw mCross z-scores to a similar extent as by the PEKA scores. However, they were largely removed in the normalized mCross z-scores, which were obtained by relativizing the z-score of each k-mer in a given dataset with a distribution of z-scores for the same k-mer across all experiments, thereby penalizing the k-mers with high z-scores across many experiments [11]. PEKA does not perform any such inter-RBP normalization, and thus detection of G-rich k-mers in TIA1 eCLIP by PEKA is in line with similar enrichment seen in raw mCross z-scores.
Insights from variable motif enrichments across CLIP variants
To learn more about the motif preferences of different CLIP protocols, we performed a more detailed comparison of PEKA with the results obtained by in vitro method RBNS [3], which are available along with eCLIP data for 21 RBPs, iCLIP data for 4 of these RBPs, and PAR-CLIP data for 9 RBPs (Additional files 2 and 3). PEKA analysis was performed in the “protein-coding gene” region, which combines intron, CDS, and the untranslated regions (UTRs). Repeat sequences were filtered out in motif detection. Visualization of k-mer ranking across groups, generated by sequence-based clustering, forms a unique binding signature for each dataset (Fig. 2). We observed informative variations in the ranking of top k-mers between data produced by these three CLIP methods.
In the case of TARDBP, FUBP3, KHSRP, TIA1, HNRNPC, HNRNPL, PCBP1, and PCBP2, eCLIP as well as most of the available iCLIP and PAR-CLIP experiments show high agreement with RBNS (i.e., recall, Fig. 2). The G-rich motifs, which were previously observed to be enriched for TIA1 eCLIP, but not iCLIP or PAR-CLIP, are also enriched in RBFOX2 and IGF2BP1/2 eCLIP, but not in RBNS, or IGF2BP1/2 PAR-CLIP. In the case of IGF2BP1/2, additional divergence can also be attributed to the enrichment of C-rich motifs in eCLIP, which RBNS and PAR-CLIP do not exhibit. In addition to IGF2BP1/2, eCLIP experiments for PUM1, SFPQ, RBM22, and EIF4G2 also showed poor agreement with RBNS data. In the case of PUM1 and IGF2BP1/2, the data from PAR-CLIP are in much better agreement with RBNS than eCLIP. Instead of the expected motifs, G-rich motifs were enriched in the PUM1 eCLIP, suggesting that these motifs tend to be enriched when an eCLIP experiment fails to identify the expected signal (Fig. 2). Considering the known similarity of motif specificity of PUM1 and PUM2 [28], we compared the PUM2 eCLIP experiment with the in vitro data of PUM1, which showed high agreement, as reported previously [3]. This analysis demonstrates a substantial variance in the reliability of eCLIP datasets, and the value of CLIP meta-analyses to identify the datasets that are likely to be the most reliable for further studies.
We found that when CLIP datasets differ in enriched motifs, they tend to also differ in the regional distribution of crosslink sites (Fig. 2). Specifically, compared to eCLIP and iCLIP, PAR-CLIP has a generally higher proportion of tXn in the 3′-UTR relative to introns, and a lower coverage of repetitive elements. tXn distribution also varied between eCLIPs of homologous proteins PUM1 and PUM2, where we found PUM1 to have a lower proportion of tXn in the 3′-UTR and a higher proportion of tXn in CDS and 5′-UTR. Thus, a combined analysis of several CLIP features, such as motif enrichment and regional binding, may be particularly valuable for data quality assessment, and for understanding the potential generic biases of each CLIP variant.
Peak filtering by external background yields limited benefit
PEKA was developed for general use on any type of nucleotide-resolution genomic data, and therefore, it models motif enrichment relative to the intrinsic background of a single dataset, without the need for additional data to model extrinsic background. Nevertheless, we wished to understand if the performance of PEKA would improve if the input peaks were called taking into account the extrinsic background signal. The ENCODE consortium provides narrowPeaks, which are generated by retaining only those peaks with a significant enrichment of reads over the size-matched input (SMInput) control, which is expected to decrease the extent of extrinsic background in the data [3, 6]. Surprisingly, when we ran PEKA using only the crosslink sites located within the narrowPeaks as the foreground, we observed no overall improvement in motif specificity as compared to our Clippy peak calling approach that does not include SMInput analysis (Fig. 3a). A major increase in agreement with in vitro data when using narrowPeaks was found only for a couple of RBPs, the most prominent being IGF2BP1, but decreased agreement was seen for other RBPs, with the most prominent effect observed for hnRNPC.
We find that the approach used to derive narrowPeaks generally results in an increased adenosine content of enriched motifs (Fig. 3b), which can in turn lead to decreased content of other nucleotides, such as uridine and cytidine. As a result, the use of narrowPeaks tends to improve motif specificity for RBPs that show binding to A-rich motifs by in vitro data, but decrease specificity for proteins that bind to motifs that do not contain adenosine (Fig. 3a). To better understand this phenomenon, we visualized motif coverage of the top 20 in vitro k-mers for hnRNPC (Additional file 1: Fig. S4a) in hnRNPC eCLIP and SMInput in both cell lines. We found ~25% of cDNA start sites in SMInput have these k-mers enriched around them, in comparison to ~50% (HepG2 cells) or ~35% (K562 cells) in eCLIP (Additional file 1: Fig. S4b), with a very similar profile around crosslink sites in both cases, indicating that hnRNPC likely contributes to a major part of the foreground signal in its SMInput control (Additional file 1: Fig. S4b). Despite eCLIP having a higher proportion of these motifs enriched at cDNA start sites, this indicates that using SMInput as a control for hnRNPC eCLIP likely results in the foreground signal becoming erroneously assigned to the background, precluding the identification of relevant binding sites, an issue that likely affects many other datasets. This could explain the lack of general improvement in motif specificity when using narrowPeaks.
To evaluate whether the varying effect of SMInput on recall could be caused by the background model used in PEKA, we selected ten eCLIP datasets with the highest difference in recall observed between the two sets of peaks and analyzed them with STREME [29], which uses shuffled peak sequences as background, thus providing an independent approach to investigate the effects of using SMInput for analysis of eCLIP data. We provided STREME either with sequences from Clippy peaks or narrowPeaks, and then compared the ranking of significantly enriched k-mers retrieved by PEKA or STREME (motifs with p < 0.05) in the corresponding in vitro data. In the cases of RBPs where PEKA analysis of narrowPeaks gave better ranking than Clippy peaks, STREME analysis also achieved better ranking with narrowPeaks for all datasets (Additional file 1: Fig. S4c). Furthermore, in the cases of RBPs where PEKA analysis of Clippy peaks gave better ranking than narrowPeaks, STREME analysis also recovered motifs with either higher or similar ranking for Clippy peaks compared to narrowPeaks (Additional file 1: Fig. S4d). This indicates that the lack of general motif improvement from narrowPeaks as compared to Clippy peaks in PEKA is reliable, since the alternative motif analysis approach follows the same trends. We also found that the median number of tXn that overlap with narrowPeaks is around 10-fold lower compared to the Clippy peaks used with PEKA (Fig. 3c). Thus, in spite of using SMInput filtering, we do not find the use of narrowPeaks to improve the general specificity of identified motifs; however, it greatly reduces the sensitivity of the analysis.
Taken together, this analysis demonstrates that the value of SMInput-based correction depends on the specificity of the studied RBP, for example, it can be beneficial for proteins binding to A-rich motifs. We speculate that this might be because RBPs binding to A-rich motifs are minor contributors to the SMInput data, which is rather dominated by proteins binding U-rich motifs. As a result, for proteins such as hnRNPC SMInput likely represents a mixture of foreground and background signal, leading to depletion of the foreground signal from narrowPeaks. Taken together, this analysis demonstrates that it is appropriate for PEKA to perform motif analysis by relying on the intrinsic background modelled from each individual CLIP dataset, without the need to use SMInput filtering.
Insights into the technical biases of CLIP
To illustrate how PEKA controls for technical biases, we compared the motif ranking between PEKA, mCross, and a simple “local” approach that only examines the local relative motif occurrence in a narrow (−3...3nt) window around tXn sites, normalized by the average occurrence within distal windows (−150…−100 and 100…150nt around tXn), rather than the intrinsic background (see “Methods”). We first analyzed recall across 57 eCLIP datasets for PEKA and the “local” approach (Fig. 4a) to find that PEKA significantly improved recall as compared to the “local” approach. While both PEKA and the “local” approach decrease the biases of regional genomic sequences, the normalization with the intrinsic background used by PEKA can also correct for technical biases at crosslink sites such as those resulting from crosslink and ligation biases, which likely explains its higher agreement with the in vitro data.
To gain insights into the reasons why certain motifs are more or less recoverable from eCLIP than in vitro data, we next compared motif enrichments obtained by the “local” approach, PEKA, mCross or in vitro across a subset of 41 eCLIP datasets (representing 28 distinct RBPs, Additional file 2: Table S1) that had both in vitro and mCross data available. We identified groups of k-mers that were differentially ranked in each method relative to in vitro data (confidence interval > 95% and a fold-change greater than 1.5 or less than 0.66) (Fig. 4a, “Methods”). Interestingly, the “local” approach produced most motifs that were differentially enriched in eCLIP vs in vitro (~27%), and PEKA produced the least (~10%) (Fig. 4b, Additional file 1: Fig. S5a,b).
To investigate the general sequence characteristics of differential k-mers, we visualized the mean rank of each k-mer across datasets for each approach and colored the k-mers according to their most common dinucleotide (Fig. 4c). As expected, the “local” approach was strongly enriched for k-mers in which the predominant dinucleotide is UU. The ranking of such motifs was strongly decreased by PEKA, while they still tend to be among highest ranking motifs, whereas such motifs tend to be the lowest ranking in mCross. Conversely, the “local” approach was strongly depleted for CU, CG, and CC dinucleotides, all of which are efficiently corrected by PEKA and mCross where they rank at levels comparable to in vitro. Interestingly, AA-rich motifs are depleted in all CLIP analysis approaches as compared to the in vitro data, with the strongest depletion seen in mCross. These k-mer sequence imbalances in eCLIP data demonstrate the importance of assessing global trends of enriched motifs across large numbers of RBPs to understand both the biases of experimental and computational approaches, as well as the potentially true differences in the binding preferences of RBPs between in vitro and in vivo conditions.
To investigate if the same biases are observed also in other CLIP variants, we expanded our analysis to 19 PAR-CLIP datasets for which in vitro data was available. In PAR-CLIP, crosslinking is performed at a higher wavelength and with the addition of a photoactivatable nucleotide 4-SU, which, upon sequencing, allows for the precise locations of crosslink sites to be identified, based on observing T-to-C transitions. We first performed a recall analysis and observed slightly improved recall in PEKA compared to the “local” approach; however, the change was not significant. Furthermore, the recall of PEKA motifs obtained from PAR-CLIP data was generally lower compared to the RBP-matched eCLIP datasets (Additional file 1: Fig. S5d). We then analyzed the proportion of k-mers that exhibit significant differences in ranking between CLIP and in vitro data to again find that PEKA greatly reduces the number of differential k-mers as compared to the local approach (Additional file 1: Fig. S5e). When we analyzed k-mer ranking with respect to their sequence composition, we observed relatively similar trends as in eCLIP, with U-rich k-mers being highly enriched and C-rich k-mers depleted in the local approach, while the effect is diminished in PEKA k-mers. Interestingly, the bias against AA k-mers that was observed in eCLIP data is not apparent in PAR-CLIP, and instead some bias against GG-containing k-mers is seen in PAR-CLIP (Additional file 1: Fig. S5f).
19 k-mers were differentially enriched in eCLIP by all three analytic approaches (the UUCG group, Fig. 4d) and 24 were differentially depleted (CAUA group, Fig. 4e, Additional file 1: Fig. S5b). Interestingly, these motifs are evenly enriched up to 150nt around tXn sites as compared to oXn sites (Fig. 4d). One of the proteins with the highest ranking of these k-mers (in eCLIP) is the DEAD-box helicase DDX3X, which is a major regulator of cellular RNA condensates and itself contains intrinsically disordered domains (IDRs) with strong condensation propensity [30]. A study reported that DDX3X binds in the vicinity of a motif composed in large part of CG and CGU subsequences, similar to the k-mers found in the UCG-group [31]. It could be speculated that such patterns are better detected by CLIP than in vitro binding data due to their need to assemble “binding-region condensates” on long RNA regions as was observed for TDP-43 [21].
Both PEKA and mCross use a strategy to control for the technical biases of CLIP, and it is thus reassuring that their differential motifs are more similar to each other than to the “local” approach (Fig. 4d,e, Additional file 1: Fig. S5a). To further understand the unique features of each approach, we examined the motifs that were enriched or depleted uniquely in each approach (Fig. 4d,e), or in two approaches (Additional file 1: Fig. S5a,b). This showed that differential motifs identified either by mCross and PEKA tend to be broadly enriched or depleted at over 100-nt region around crosslink sites, whereas as expected, those identified in the “local approach” are enriched or depleted directly at crosslink sites (Fig. 4d,e). We also visualized the extent of the differences in ranks for each motif group, which shows that the greatest drop is seen for the CAUA group for all CLIP analysis approaches as compared to in vitro data, and for the CCCC group for the “local approach” to CLIP analysis (Fig. 4f, d).
Importantly, for all motif groups, we show that PEKA is able to identify such motifs as top-ranking for the RBPs that show strong enrichment (Fig. 4d, e). For example, even for the CAUA group motifs that are found depleted in eCLIP by all approaches as compared to in vitro data, KHDRBS1 is correctly identified as a strong binder of the motif, with a broad enrichment pattern seen around crosslink sites (Fig. 4e). Moreover, PEKA can identify enriched motifs for hnRNPL, TRA2A, and PABPN1 even though they are depleted directly at crosslink sites (Fig. 4e). Thus, in spite of generic imbalances observed in the motifs enriched across all CLIP datasets and a great variety of their positional enrichment patterns, PEKA is able to identify all types of motifs when they mediate high-affinity RBP binding.
Many eCLIP datasets share similar enriched motifs
To understand how specific the motifs enriched in each eCLIP dataset are, we performed a systematic cross-motif and cross-RBP comparison of the whole set of eCLIP datasets available from ENCODE [4, 5]. We clustered the data based on the regional crosslinking profiles within protein-coding genes (CDS/UTR/intron) and by the ranks of 5-mers (see “Methods”), which identified groups of eCLIPs with similar motif enrichment signatures (Fig. 5). The first visually apparent feature of this analysis is that regional crosslinking preferences are accompanied by trends towards certain motif preferences. For instance, if crosslinks are primarily in 5′-UTR and CDS, the largest cluster of data shows enrichment in GC-rich motifs. In case of datasets with primary crosslinking in introns, motif enrichment is dominated by three clusters, two large clusters of datasets dominated by G-rich motifs, and a smaller cluster dominated by U-rich motifs. Datasets that crosslink primarily to 3′-UTRs also show enrichment of U-rich or UA-rich motifs (see the blue and yellow cluster, respectively). These clusters likely include binders of the AU-rich elements (AREs) that are common regulators of RNA stability in 3′-UTRs [32]. Moreover, both CDS and intronic datasets are often enriched in C/G-rich motifs. This analysis demonstrates that the common motif preferences in eCLIP data are closely linked to the regional crosslinking profiles within protein-coding genes.
As expected, eCLIP datasets within the largest clusters share highly similar motif preferences and regional profiles. Interestingly, we noticed that RBPs falling within these large clusters generally lack orthogonal in vitro binding data (indicated by the absence of recall) and are often poorly detected in the mRNA interactome proteomics (enhanced RNA interactome capture, i.e., eRIC) [33], and their top 50 ranked k-mers are often G-, U- or GC-rich. It has been reported previously that such k-mers tend to be overrepresented in eCLIP compared to RBNS [3], but the scale of their presence across eCLIP data was not yet examined. Strikingly, several G-rich k-mers were enriched among the top 50 k-mers in more than 50% of all eCLIP datasets.
Specificity of eCLIP datasets relates to RBP domain types and compositional biases
To further understand how data similarity relates to the various features of RBPs, we divided the eCLIP datasets according to whether or not they had available in vitro data, and then clustered each group based either on a combination of inter-data similarity (similarity score, see “Methods”) and recall, or just on inter-data similarity where no in vitro data was available, obtaining 7 groups of data (Fig. 6a, “Methods”). Most apparent differences are seen between the group of proteins for which in vitro data are available (groups 1–4, i.e., “in vitro set”) and the group for which no data are available (groups 5–7, i.e., “eCLIP-only set”). The in vitro set tends to have high eRIC values, high number of thresholded crosslinks (n tXn), and high mean PEKA scores across the top 50 ranked k-mers, whereas the eCLIP-only set tends to have low eRIC values, lower number of tXn, and low mean PEKA scores, indicating that the proteins in the eCLIP-only set do not crosslink well and have low motif enrichment, respectively. The great majority of proteins in the in vitro set contain a K-homology (KH) domain or RNA recognition motif (RRM) and a low-complexity sequence (Fig. 6b), which are the common characteristics of RBPs [34]. Conversely, proteins in the eCLIP-only set rarely contain KH or RRM domain. Higher prevalence of canonical RNA-binding domains in the in vitro set is not surprising, as the great majority of RBNS and RNAC data are in proteins which contain such domains [12, 35]. Interestingly, the in vitro set contains only a small group 4 with a high similarity score, while the eCLIP-only set contains a large group 7 with a high similarity score, accounting for a third (33%) of all eCLIP experiments. Taken together, it is clear that proteins lacking orthogonal in vitro data generally have different features from the rest, and their eCLIP data tends to have lower inter-data specificity (high similarity index) and motif enrichment (low mean PEKA score, eCLIP-only set). This indicates that cross-validation of eCLIP with in vitro data cannot be extrapolated to warrant the specificity of eCLIP data without available in vitro data, which must be taken into account when performing meta-analyses on the whole set of eCLIP data.
The most reliable eCLIP experiments are expected to be in group 2, which includes ~5% of datasets with unique k-mer signatures and high agreement with corresponding RBNS or RNAC data, as indicated by their low similarity index and high recall, respectively. This group of RBPs generally ranked highest in the eRIC experiments, indicating that they crosslink efficiently with RNA, which is consistent with the high number of thresholded crosslinks identified by PEKA. These RBPs contain a median of 3 RRM or KH domains (Fig. 6b). Thus, the canonical RBPs that crosslink well and contain many RNA-binding domains tend to yield specific and reliable eCLIP datasets. In addition to the high cross-validation, RBPs in group 2 have the highest mean PEKA scores across the top 50 ranked k-mers, implying that the coverage of top k-mers around tXn is much higher than around oXn. In other words, binding affinity of these RBPs is strongly sequence-dependent, requiring the presence of one or more high-affinity binding motifs.
The least reliable eCLIP experiments are expected within group 7, containing ~33% of eCLIP datasets with high inter-data similarity, which lack orthogonal in vitro data (Fig. 6a). ~39% of datasets in group 7 are undetected in eRIC experiments, which indicates that they crosslink poorly or do not crosslink at all to RNA. These proteins lack annotated features of RBPs, such as KH or RRM domains (Fig. 6a,b). This increases the likelihood of the signal being dominated by the most common contaminants of eCLIP experiments, which are likely the abundant and well-crosslinking RBPs. Low-specificity datasets in group 7 predominantly enrich for G-rich motifs, with most crosslinking sites originating from introns (Fig. 6a). We note that the same features are also prevalent in group 4, which contains eCLIP datasets that have reasonable agreement with in vitro data, indicating that many of these RBPs directly interact with the G-rich motifs. Nevertheless, these experiments have high inter-data similarity due to the fact that very similar motifs are enriched in groups 4 and 7. However, RBPs in group 4 generally have much higher mean PEKA scores than those in group 7, and thus even though both show enrichment of similar motifs, the extent of enrichment is stronger in group 4 (Fig. 6b).
It is interesting to find many groups with strong regional binding preferences, even though regional preferences had no direct role in clustering the groups (Fig. 6a). For example, groups 1, 2, 4, and 7 all contain predominantly intronic binding. However, G-rich motifs dominate groups 4 and 7, whereas groups 1 and 2 mainly show enrichment of A-rich, C-rich, or U-rich motifs, likely due to its higher data specificity. It is notable that proteins in group 2 contain a median of 3 KH or RRM domains, whereas those in group 4 contain only a median of 1 domain, and do not ever contain a KH domain. Moreover, group 5 commonly shows predominant binding in CDS and 5′-UTR, which tends to be associated with a higher proportion of A-rich motifs. Since A-rich motifs are otherwise rare in eCLIP experiments, their enrichment contributes to the low similarity index of datasets in group 5. We propose two possible explanations why datasets with similar specificity tend to have similar regional binding. First, the signature might reflect similar contaminants; for example, RBPs that bind G-rich motifs in introns might be the common contaminants of datasets from group 7. Second, the link between RBP specificity and regional binding could reflect regional sequence biases.
In addition to the aforementioned differences in the content of RRM/KH domains, we found that our clusters of proteins also differed in other domains. The proteins in the in vitro set very rarely contain domains other than RRM/KH, whereas the proteins in the eCLIP-only set frequently contain helicase domains, TNase-like domain, Tudor domain, and dsRNA-binding domains (DRBM) (Fig. 6c). These domains are less sequence specific, which likely contributes to the generally low eRIC scores of these proteins, and the high similarity scores and low PEKA scores of their eCLIP data.
We also observed differences in the number and types of compositional biases between groups (Fig. 6d). In the in vitro set, group 4 stands out as its proteins generally have higher numbers of intrinsically disordered domains (IDRs) compared to other groups. Interestingly, group 4 generally binds G-rich motifs and has decent recall and high similarity score. Thus, these proteins generally produce specific data, but their IDRs might be prone to formation of condensates or aggregates that can co-purify in immunoprecipitations of other proteins, which could explain high similarity of their data with many other proteins. Finally, we observed differences in the median PEKA score of each group, which is highest in groups 2 and 4, which have the highest recall. Thus, the extent of motif enrichment (as quantified by PEKA score) is related to the extent of cross-validation with in vitro data (Fig. 6a,b).
To understand the generality of insights obtained from eCLIP meta-analysis, we performed the same systematic analysis for the large multi-protein PAR-CLIP resource (Additional file 1: Fig. S6). Unlike the eCLIP analyses, we observed no correlation between unique motif enrichments (low similarity index) and high numbers of RRM/KH domains, and instead we observed a slightly reverse trend in PAR-CLIP, with the most unique data obtained for RBPs that lack or have few KH/RRM domains. Since the purification procedure in PAR-CLIP employs stringent Flag-tag immunoprecipitation and quality control on SDS-PAGE, it is expected that the extent of contaminating signal is lower in PAR-CLIP, and thereby indicates that the high similarity (and common dominance of G-rich motifs) in eCLIP data for RBPs that lack KH/RRM domains is unlikely just a signature of their lower binding specificities. Thus, further studies will be needed to distinguish datasets where high similarity of enriched motifs reflects biological relationships between studied RBPs from those where it reflects common contaminating RBPs or other technical biases in the data.