GraphProt: modeling binding preferences of RNA-binding proteins
© Maticzka et al.; licensee BioMed Central Ltd. 2014
Received: 7 August 2013
Accepted: 22 January 2014
Published: 22 January 2014
We present GraphProt, a computational framework for learning sequence- and structure-binding preferences of RNA-binding proteins (RBPs) from high-throughput experimental data. We benchmark GraphProt, demonstrating that the modeled binding preferences conform to the literature, and showcase the biological relevance and two applications of GraphProt models. First, estimated binding affinities correlate with experimental measurements. Second, predicted Ago2 targets display higher levels of expression upon Ago2 knockdown, whereas control targets do not. Computational binding models, such as those provided by GraphProt, are essential for predicting RBP binding sites and affinities in all tissues. GraphProt is freely available at http://www.bioinf.uni-freiburg.de/Software/GraphProt.
Recent studies have revealed that hundreds of RNA-binding proteins (RBPs) regulate a plethora of post-transcriptional processes in human cells [1–3]. The gold standard for identifying RBP targets are experimental cross-linking immunoprecipitation-high-throughput sequencing (CLIP-seq) protocols [4–6]. Despite the great success of these methods, there are still some problems to overcome: (1) the data may contain many false positives due to inherent noise [7, 8]; (2) a large number of binding sites remain unidentified (a high false-negative rate), because CLIP-seq is sensitive to expression levels and is both time and tissue dependent  and (3) limited mappability  and mapping difficulties at splice sites lead to further false negatives, even on highly expressed mRNAs. To analyze the interaction network of the RBPome and thus to find all binding sites of a specific RBP, a CLIP-seq experiment is only the initial step. The resulting data requires non-trivial peak detection to control for false positives [7, 8]. Peak detection leads to high-fidelity binding sites; however, it again increases the number of false negatives. Therefore, to complete the RBP interactome, computational discovery of missing binding sites is essential. The following describes a typical biological application of computational target detection. A published CLIP-seq experiment for a protein of interest is available for kidney cells, but the targets of that protein are required for liver cells. The original CLIP-seq targets may have missed many correct targets due to differential expression in the two tissues and the costs for a second CLIP-seq experiment in liver cells may not be within the budget or the experiment is otherwise not possible. We provide a solution that uses an accurate protein-binding model from the kidney CLIP-seq data, which can be used to identify potential targets in the entire transcriptome. Transcripts targeted in liver cells can be identified with improved specificity when target prediction is combined with tissue-specific transcript expression data. Generating expression data is likely cheaper than a full CLIP-seq experiment.
Computational target detection requires large numbers of highly reliable binding sites for training a binding model. Modern experimental methods such as RNAcompete[3, 11] and CLIP-seq[4–6] give a better characterization of RBP-binding specificities due to two important aspects: (1) the number of binding sites available for model training is increased from tens to thousands of sequences and (2) detection of exact binding locations is more precise, ranging from about 30 nucleotides for RNAcompete and high-throughput sequencing of RNA isolated by CLIP (HITS-CLIP)  to measurements at the nucleotide level for individual-nucleotide resolution CLIP (iCLIP)  and photoactivatable-ribonucleosideenhanced CLIP (PAR-CLIP) . A major qualitative difference between CLIP-seq and RNAcompete data is that the latter determines relative binding affinities in vitro, whereas CLIP-seq detects binding events in vivo.
There is a clear deficit of computational tools suited to detecting RBP binding sites to date; however, a multitude of sequence-motif discovery tools have been developed to detect DNA-binding motifs of transcription factors . Popular examples are MEME, MatrixREDUCE and DRIMust. In the past, some of these methods have also been applied to the analysis of RBP-bound RNAs [16–18].
It has been established that not only sequence, but also structure, is imperative for detecting RBP binding [17, 19]. The first tools to introduce structural features into target recognition were BioBayesNet for transcription factor binding sites and MEMERIS for the recognition of RBP targets. MEMERIS is an extension of MEME using RNA accessibility information to guide the search towards single-stranded regions. A recent approach and the current state of the art for learning models of RBP binding preferences is RNAcontext[17, 22]. RNAcontext extends accessibility information to include the type of unpaired regions (external regions, bulges, multiloops, hairpins and internal loops). RNAcontext was shown to outperform MEMERIS and a sequence-based approach, MatrixREDUCE, on an RNAcompete set of nine RBPs .
Available approaches that introduce a secondary structure into motif detection have two weaknesses. First, a single-nucleotide-based structure profile is used, that is, a nucleotide is considered paired or unpaired (or part of a specific loop). Second, the main assumption behind these models is that nucleotide positions are scored independently. While this assumption seems to work well for RBP motifs located within single-stranded regions, positional dependencies arise when structured regions (that is base-pairing stems) are involved in binding recognition: binding to double-stranded regions involves dependencies between base pairs, which lead to distant stretches of nucleotides in the sequence that can affect the binding affinity [23–27].
The general requirements for accurate binding models are thus manifold. First, training data nowadays comprise several thousands of RBP-bound sequences, therefore, identification of sequence and structure similarities must be computationally efficient. This excludes the use of conventional alignment-based methods (such as LocaRNA [28, 29] and RNAalifold ). Second, both sequence and structure interdependencies should be modeled, which cannot be achieved by structure-profile-based approaches [17, 21, 31]. Third, models should be robust with respect to noisy data and be able to take quantitative binding affinities into account.
Results and discussion
We present GraphProt, a flexible machine-learning framework for learning models of RBP binding preferences from different types of high-throughput experimental data such as CLIP-seq and RNAcompete. Trained GraphProt models are used to predict RBP binding sites and affinities for the entire (human) transcriptome, regardless of tissue-specific expression profiles. We start with a schematic overview of the GraphProt framework and highlight the advantages of this approach. For the first time, in spite of the huge amount of data, we make use of the full secondary structure information by relying on an efficient graph-kernel approach.
We establish that GraphProt has robust and improved performance in comparison to the state of the art by evaluating prediction performance for 24 sets of CLIP-seq and nine sets of RNAcompete data. Prediction performance was clearly improved in comparison to RNAcontext[17, 22] and even more clearly in comparison to a sequence-only-based approach, MatrixREDUCE, which was added to accentuate the importance of considering secondary structure. To gain further insight into the binding preferences learned by GraphProt models, we devised a procedure to extract simplified sequence and structure binding motifs that could be visualized as well-known sequence logos. We compared our motifs with current data on binding specificities and found substantial agreement.
Finally, we showcase two possible applications that consolidate the biological relevance of GraphProt models. First, we estimated affinities for PTB binding sites when training on CLIP-seq data without access to affinity measurements. As a control, we compared these estimated affinities with additional experimental measurements and observed a significant correlation. Thus, our binding models can learn from simple binding and non-binding information to differentiate between strong and weak binding sites. Second, using a GraphProt model trained on a set of Ago2 HITS-CLIP sites, we verified that predicted Ago2 targets are in agreement with changes in transcript expression levels upon Ago2 knockdown. The same trend was not observed for the original HITS-CLIP-detected sites, clearly indicating that GraphProt identifies binding sites missed by the high-throughput experiment.
The flexible GraphProt framework
In the following, we highlight special features of GraphProt that are not found in RBP-binding prediction tools in the literature.
A natural encoding for RNA-binding protein binding sites
Conventional feature encoding in RNA-binding models uses aggregate probabilities per nucleotide to characterize RNA structure, that is, models integrate a structure profile of the bound sequence [17, 31, 32]. The most common measurement is accessibility, which is the probability that a nucleotide is unpaired [33, 34]. Accessibility is used by MEMERIS. In addition, RNAcontext extends accessibility as the probability that an unpaired nucleotide is located within a specific type of loop (for example, a hairpin, bulge or multiloop). These single-nucleotide structure profiles allow encoding of the RBP target sites in sequential data structures, which guarantees higher computational efficiency. The downside of structure profiles is that the original structure information of the RNA molecule is severely compressed: instead of storing exact base-pairing information, only the marginal binding propensity of one nucleotide towards all other nucleotides is considered.
We propose a representation that is more natural and fully preserves base-pairing information (Figure 2). The key idea is to use a small set of stable structures to represent probable folding configurations on the mRNA in the surrounding context of RBP binding sites. These structures are then encoded as graphs with additional annotations for the type of substructure, that is, multiloops, hairpins, bulges, internal loops, external regions and stems (see Figure 2B).
Advantages of graph-kernel features
To efficiently process RNA structures encoded as graphs, we propose a method based on graph kernels. The main idea is to extend the k-mer similarity for strings (which counts the fraction of common small substrings) to graphs and finally to fit a predictive model using algorithms from the Support Vector Machine (SVM) family  for classification problems and Support Vector Regression (SVR)  when affinity information is available.
Using a graph-kernel approach, we extract a very large number of features (that is small disjoint subgraphs, see Figure 2C and Materials and methods for details) in a combinatorial manner and assess their importance in discriminating between bound and unbound regions on an mRNA. The use of disjoint subgraphs gives a binding motif that is more expressive than the one offered by traditional position specific scoring matrices  because it takes the simultaneous interdependencies between sequence and structure information at different locations into account. Feature importance information can be used, not only to build accurate predictors, but can be subsequently processed to identify sequence and structure binding preferences.
GraphProt learns binding preferences from CLIP-seqdata to predict new target sites
Computational approaches for predicting RBP binding sites require large amounts of training data. The current increase in the number of available CLIP-seq data sets make these a valuable data source of target sites bound by specific RBPs. To benchmark the ability of GraphProt to detect binding preferences of RBPs from human CLIP-seq data, we used 24 sets of HITS-CLIP-, PAR-CLIP- and iCLIP-derived binding sites: 23 were curated by doRiNA and an additional set of PTB HITS-CLIP binding sites was taken from  (Additional file 1). The Ago1-4 and IGF2BP1-3 sets contain combined binding sites of several proteins; four of the sets consist of ELAVL1 binding sites derived by both HITS-CLIP and PAR-CLIP. Other proteins included are ALKBH5, C17ORF85, C22ORF28, CAPRIN1, EWSR1, FUS, HNRNPC, MOV10, PTB, PUM2, QKI, SFRS1, TAF15, TDP-43, TIA1, TIAL1 and ZC3H7B.
The ability of a computational method to detect RBP target sites is assessed using the well-known tenfold cross-validation technique. The data is subdivided into ten segments. A model of binding preferences is trained on nine segments and target sites are predicted using the remaining segment (see Additional file 2 for details). Results are averaged over ten different train-and-test experiments. This technique assesses the ability of a method to predict RBP target sites that were not seen during training (this is analogous to the prediction of novel sites). The performance is measured as the area under the receiver operating characteristic curve (AUROC).
We compared the performance of GraphProt to RNAcontext and MatrixREDUCE. MatrixREDUCE was added to the benchmark comparison because it is a sequence-based method that previously displayed promising results in a comparison with RNAcontext (the current state of the art). GraphProt uses an extended sequence context for structure prediction, but centers on the CLIP-seq sites using the viewpoint technique (Figure 2A). For a fair comparison, the same context sequences (for structure prediction) and viewpoint information (for target sites) were used by RNAcontext and MatrixREDUCE (see Materials and methods).
GraphProt learns binding preferences from RNAcompete data
The affinity of an RBP to its target site is important for the effectiveness of the subsequent regulation. This implies that a classification into bound and unbound sequences is only a coarse approximation. Instead, a regression approach that can distinguish target sites according to their binding strength is more suitable. To model this binding strength, we require a training set with the affinities for different sequences instead of just a list of bound regions. Such measurements are provided by RNAcompete, an in vitro assay used to analyze recognition specificities of RBPs . To measure affinities, a pool of short RNAs, designed to include a wide range of k-mers in both structured and unstructured contexts, is exposed to a tagged RBP. The resulting RNA-protein complexes are pulled down and the abundance of bound RNA is measured. Relative binding affinity is then defined as the log ratio between the amount of pull-down RNA and the amount of RNA in the starting pool. Although a modified version of the RNAcompete protocol was published recently , the data were not suitable for evaluating GraphProt as the experiment was designed in such a way that it uses only unstructured sequences.
We evaluated the ability of GraphProt to predict binding affinities accurately in a regression setting using the RNAcompete sets for nine RBPs from the initial RNAcompete assay: Vts1p, SLM2, YB1, RBM4, SFRS1, FUSIP1, ELAVL1, U1A and PTB . All sets included both structured and unstructured sequences. The performance of affinity predictions was measured using the mean average precision (APR).
GraphProt models capture known binding preferences
Kernel-based methods allow the use of more complex features and thus an improved prediction performance. On the downside, kernel approaches usually do not provide an insight into what the model has learned. Since this insight is useful for assessing the biological relevance of the CLIP-seq models, we devised a novel post-processing step to identify the sequence and structure preferences learned by the models (see Materials and methods). Note that these logos are a mere visualization aid and do not represent the full extent of the information captured by GraphProt models.
RNA structure improves prediction of RNA-binding protein binding
Previous benchmarking analyses (Figures 3 and 4) established that the full GraphProt models (with secondary structure information) are superior to those gained by state-of-the-art methods. Now we assess the importance of secondary structure in RBP binding models. The encoding of RBP target sites is flexible, such that it is easy to remove all structural detail to leave only sequence information. This enables a direct comparison of the full structure to sequence-only models in a controlled setting (that is, the only difference in the comparison is the encoding of the target site). Thus, the added value of structure information for RBP target site prediction can be determined.
RNAcompete data are optimal for comparing models, since the initial sequences in the library were designed to be either unstructured or to form a stem-loop structure consisting of a single hairpin; therefore, a clear distinction of structure contribution is possible. The results are plotted in Figure 6A. Three of the four proteins from the RNAcompete set showing significant improvements over the sequence models (PTB, RBM4 and U1A) are known to recognize stem-loop structures [51–53]. For PTB, it was determined by ITC, gel shift assays and NMR studies that the two RRM domains bind a stem-loop structure of U1 snRNA . For RBM4, information about possible targets is scarce; however, in one case it was reported that the target of RBM4 is a cis-regulatory element that was predicted to be a stem-loop structure . This finding was supported by several mutations that were predicted to disrupt the RNA structure that led to a decreased interaction with RBM4. U1A is also known to bind to a stem-loop structure .
The large-scale analysis of double-stranded RNA-binding proteins (dsRBPs) is slightly lagging behind that of single-stranded RNA-binding proteins (ssRBPs). To the extent of the authors’ knowledge, the first and only genome-wide studies of dsRBPs were performed for MLE, MSL2 (two members of the Male-Specific Lethal complex)  and Staufen . The data from these studies, however, is not suitable for training GraphProt models. MLE and MSL2 bind very specifically to only a few sites in the roX1 and roX2 RNAs  and for Staufen, only target mRNA was available instead of exact target sites . Therefore, we could not evaluate the performance of GraphProt for dsRBPs binding predominantly to stems; however, the previously mentioned improved performance when studying RBPs binding to mixed structured and accessible regions indicate that GraphProt is well equipped for, and should perform well when, learning binding preferences of dsRBPs.
In summary, for ssRBPs binding to accessible regions, GraphProt sequence models may provide results comparable to the full structure models at increased processing speed. In contrast, the study of proteins binding to structured regions, benefits strongly from the full structure models provided by GraphProt, with larger than average increases in performance over structure-profile-based models. Since full structure models never performed significantly worse than sequence-only models, they should be used as the default.
Showcase 1: GraphProt learns binding affinities without affinity data
Biologically, it is more important to predict the binding affinity of an interaction than to categorize a potential target site as binding or non-binding. The bottleneck of this computational task is the availability of large data sets of quantitative, experimental measurements of affinities. Although CLIP-seq experiments are becoming increasingly popular, the data from them does not inherently provide a quantification of the binding affinity. In principle, the number of reads mapping to a binding site could be used as a proxy for its affinity, provided there is suitable expression data to normalize read counts. Even if these data exist, which is often not the case, normalization is non-trivial. We therefore ask whether binding affinities can be predicted while learning from only bound vs unbound information, as can be derived from CLIP-seq data.
These results clearly show that, in addition to predicting binding affinities in a regression setting, GraphProt can also be applied to the prediction of binding affinities when only sets of bound sites for a binary classification task are available, as is the case when analyzing CLIP-seq data. This allows the evaluation of putative binding sites with a meaningful score that reflects the biological functionality.
Showcase 2: Differential expression upon Ago2 knockdown is explained by GraphProt predictions but not by published CLIP-seqbinding sites
In detail, two independent factors influence the efficiency of downregulating a target mRNA. First, the binding affinity of an RBP to its target site regulates the binding frequency and strength. Second, the number of proteins bound to the same target can increase the signal for subsequent steps in the regulation process . The effect of cooperative regulation when the same element binds multiple times has been especially well studied for Ago2–microRNA interactions [58–61]. Here, Ago2 generally associates with a microRNA and other proteins (together a miRNA-induced silencing complex (miRISC)) to target mRNAs for degradation and/or translational inhibition. A common observation is that several miRISC complexes bind to the same mRNA and the cooperative effect is that the downregulation is stronger [59, 61].
In previous work, Schmitter and colleagues established that the mean number of microRNA seed sites per 3′ UTR increased significantly between unchanged and weakly upregulated as well as strongly upregulated mRNAs in human HEK293 cells upon Ago2 knockdown . Using their expression data and the same fold-change categories, we investigated the influence of both affinity and cooperative effects based on GraphProt predictions of Ago2 binding sites in comparison to the available CLIP-seq data. The GraphProt sequence-only model was trained on the Ago2-HITS-CLIP set (the use of structure did not improve prediction results for Ago2) and was applied to 3′ UTRs with measured fold changes to predict high-scoring target sites.
In showcase 1 (Figure 8), we established that GraphProt prediction margins correlate with measured affinities. Therefore, we estimated high-affinity Ago2 binding sites by only considering the highest-scoring predictions. We compared these predictions to reliable binding sites derived by peak calling on the Ago2-HITS-CLIP read profiles. The overall regulatory effect was investigated by comparing the fraction of 3′ UTRs that contain binding sites between the fold-change categories (Figure 9A). An interaction with higher affinity should cause a greater upregulation upon Ago2 knockdown. In a second analysis, cooperative effects were estimated by counting the number of Ago2 binding sites per 3′ UTR (Figure 9B) in each fold-change category. For binding sites predicted by GraphProt, both the fraction of 3′ UTRs with at least one GraphProt hit (Figure 9A) and the number of GraphProt hits per 3′ UTR (Figure 9B) showed a significant increase between unchanged and weakly upregulated transcripts. While there was no major difference in the fraction of UTRs containing UTRs with at least one hit, we saw a clear enrichment for the number of hits in UTRs that are highly regulated, indicating the cooperative effect of multiple miRISC target sites (Figure 9B). In contrast, no correlation was observed for binding sites taken from the Ago2-HITS-CLIP set in both cases (Figure 9).
Since microRNAs guide Ago2 binding, we also looked at computational approaches for detecting microRNA binding sites. To this end, we repeated the analysis from  using the same microRNA seeds found to be over-represented in upregulated transcripts and extracted PicTar 2.0 microRNA target predictions from doRiNA to compare against GraphProt (Additional file 4). Both microRNA detection approaches showed some agreement within the differential expression upon Ago2 knockdown; however, the differences between fold-change categories are not as significant in comparison to GraphProt. These results prove the necessity of computational target prediction in addition to performing CLIP-seq experiments. We proved the capacity of GraphProt to predict RBP target sites reliably and even to detect sites missed by experimental high-throughput methods.
GraphProt is an accurate method for elucidating binding preferences of RBPs and it is highly flexible in its range of application. We used a novel and intuitive representation of RBP binding sites that, in combination with an efficient graph kernel, is able to capture binding preferences of a wide range of RBPs. Depending on the input data, GraphProt models can solve either a regression or a classification task and are thus suitable for learning binding preferences from the two current major sources of experimental data: RNAcompete and CLIP-seq. Trained models are used to predict functional RBP target sites on any transcript from the same organism.
GraphProt had a robust and much improved performance in comparison to the existing state of the art. The full RNA structure representations used by GraphProt were shown to be especially suitable for modeling preferences for binding sites within base-pairing regions. For RBPs known not to be influenced by RNA structure, GraphProt provides very fast sequence-only models that perform as well as the full structure models. RBP sequence and structure preferences learned by GraphProt can be visualized using well-known sequence logos. Beyond the mere elucidation of binding preferences, GraphProt models have been successfully used for diverse tasks such as predicting RBP affinities and scanning for RBP target sites. GraphProt is applicable on a genome-wide scale and can thus overcome the limitations of CLIP-seq experiments, which are time and tissue dependent. We showed that when GraphProt is applied to all transcripts, missing targets are identified in a setting different to the one where the original CLIP-seq experiment was performed.
Materials and methods
Graph encoding of RNA sequence and structure
We have proposed an easy-to-adapt method to encode information about RNA sequence and structure in a natural way. The key idea is to use a generic hypergraph formalism to annotate different types of relations: (1) relations between nucleotides, such as sequence backbone or structure base pairs and (2) relations between abstract structure annotations, such as loops or stems, and the corresponding subsequences.
In this paper, we started from the representation used in GraphClust, and provide several useful extensions. In GraphClust, an RNA sequence is encoded, together with its folding structure, as a graph, where vertices are nucleotides and edges represent either a sequence backbone connection or a bond between base pairs. We do not require a single best-folding structure (for example, the one achieving minimum free energy) because this is known to be error prone. Instead, we sample the population of all possible structures and retain highly probable, representative candidates. The sampling strategy was implemented via the shape abstraction technique introduced by RNAshapes. RNAshapes categorizes all secondary structures according to a simplified representation, called the shape, which abstracts certain structural details. Different abstraction levels, which ignore various structure details, are possible, for example, ignoring all bulges, or all bulges and all internal loops. Stem lengths are always ignored. Out of all possible structures that have identical shapes, RNAshapes considers the one with minimum free energy as representative and calls it the shrep. We calculated shreps using shifting windows of 150 nucleotides with a step size of 37 nucleotides and predicted up to three shreps that are required to be within 10% of the minimum free energy of the sequence for each window.
In this work, we extended the representation used in GraphClust  in three ways: (1) we added a layer of abstract structure information to the secondary structure representation (see Figure 2B); (2) we considered an oriented version of the graphs and (3) we imposed a restriction on the graph, termed the viewpoint, so that features are only extracted from the informative part, that is, the part where RBP binding is hypothesized to occur (see Figure 2A).
Encoding abstract structure information
It is possible to use GraphProt in pure sequence mode, which ignores the RNA secondary structure by discarding base-pairing edges and abstract RNA structures. In this case, GraphProt behaves like an efficient, string kernel machine with gaps in the spirit of .
The graph kernel used by GraphProt is the Neighborhood Subgraph Pairwise Distance kernel (NSPD Kernel) . In this approach a graph is decomposed into a set of small overlapping subgraphs (see Figure 2C). Every subgraph is then assigned a numerical identifier using an efficient hash-based technique. The identifier is used to solve the isomorphism detection problem in an approximate but extremely fast way and it is used to build the final explicit feature encoding. In this way we build representations that can effectively use millions of features. The type of subgraph chosen in NSPD Kernel is the conjunction of two neighborhood subgraphs at a small distance from each other. Two parameters determine the characteristics of these subgraphs (and are thus related to the complexity and size of the entire feature set): (1) the maximum size of the neighborhood, called the radius R, and (2) the maximum distance between any two root nodes, called the distance D. Features are extracted for all combinations of values r≤R and d≤D.
In this work, the NSPD Kernel was extended in the following way: (1) we upgraded the encoding from graphs to hypergraphs to annotate the RNA abstract structure elements, (2) we considered directed graphs rather than undirected graphs and (3) we introduced a way to select subsets of features using the viewpoint.
A kernel for hypergraphs
In the NSPD Kernel of , shortest paths can access all vertices and edges in the graph. When the graph contains vertices with a large degree (that is, it is not sparse), however, the shortest path distance becomes degenerate and many vertices are immediate neighbors of each other. Under these conditions, the NSPD Kernel would generate uninformative features corresponding to extremely large subgraphs that are unlikely to occur in more than one instance. Thus, effective learning or generalization would be impossible. This situation would occur if we used the incident graph representation for hypergraphs as shown in Figure 10A (left). Hyperedges (that is, relations) would yield vertices with a large degree. For example, a hairpin loop relation would produce a vertex connected to all nucleotides belonging to the respective hairpin loop. This would effectively remove the nucleotide order of the RNA sequence, since there would exist a shortest path of length two between any two nucleotides in the original hairpin sequence. To deal with this issue, we extended the NSPD Kernel to work on the incident graph as visualized in Figure 10 by (1) considering the relation vertices as non-traversable by paths and (2) creating additional features (that is, pairs of subgraph decompositions), where the root vertices of the two paired neighborhoods are on the two end points of the hyperedge relation (Figure 10B). In intuitive terms, this yields features that are aware of the nucleotide composition of a substructure and, at the same time, of the position of that substructure in the global abstract structure annotation. Consider Figure 10B. Without the abstract structure annotation, the two occurrences of the subsequence UUC would be indistinguishable. With the abstract annotation, we generate two independent features, one that is aware that UUC is located in an internal loop (the vertex labeled I surrounded by two stems), and another feature that is aware that UUC is located in a hairpin loop (the vertex labeled H, preceded by a stem).
By making the relation vertex non-traversable, we have separated the basic from the abstract part of the graph. The NSPD Kernel features in this case can be divided into three separate sets: one set for the basic part, which corresponds to the features used in GraphClust, a set of novel features for the abstract part and finally a hybrid set of features that relate the nucleotide composition to the abstract part. Note that the features for the abstract part are independent of the exact nucleotide composition of the underlying substructures and therefore allow a better generalization for distantly related RNA sequences.
Using undirected graphs for RNA sequences (as in GraphClust) means that the order imposed by the 5′→3′ asymmetry is lost. Hence, a sequence and its reversed counterpart (not the complement) would yield the same feature representation. To overcome this limitation, we extended the NSPD Kernel to use directed graphs. For this, we required an unambiguous definition of edge direction: (1) the sequence backbone edges reflect the natural 5′→3′ direction, (2) the base-pair edges are directed away from the nucleotide closer to the 5′ end and towards the nucleotide closer to the 3′ end and (3) edges in the abstract part are directed by starting at the sequence ends and traveling from the inner annotations towards the outer limbs, that is, starting from multiloops and ending at hairpin loops. Finally, to capture all relevant information, while still maintaining the consistency with the chosen direction, we duplicated the graph, relabeled all vertices by adding a distinguishing prefix, and reversed the direction of all edges (see Figure 10C).
Selection of kernel viewpoints
In the NSPD Kernel of GraphClust, all vertices are considered in the generation of features. This is suitable when global RNA sequences are being compared. For RBP binding sites on mRNA, however, only the local target region could be informative and considering all vertices would lead to a substantial amount of noise and decrease the overall predictive performance. Thus, without losing discriminative power, we reduced the number of vertices considered to a fixed subregion of the sequence called the viewpoint (see Figures 2 and 10). In a supervised setting, the viewpoint area is selected randomly for negative examples and, for the positive examples, around the region covered by the RBP-bound sequence identified by the respective high-throughput experimental technique. In a genome-wide scanning setting, it would be selected with a moving window approach. Note that we cannot simply reduce the graph encoding to fit exactly this reduced area, since in so doing, we would lose the information needed to estimate the folding structure of the mRNA. We require that the root vertex of at least one of the two neighborhoods is localized in the viewpoint area. This way we still allow accurate folding of the mRNA, by considering 150 nucleotides upstream and downstream of the viewpoint , but we only select features that are local to the area of interest. The other hyper-parameters of the NSPD Kernel, namely the distance D and the radius R, determine the area of influence around the putative target region, that is, the portion of the mRNA used to extract relevant information for the discriminative task (see Figure 10D). The viewpoint technique was first introduced in .
Preparation of training and test data
Binding sites for PTB-CLIP  were taken from [GEO:GSE19323] (downloaded from the Gene Expression Omnibus ). Sites for all other proteins were downloaded from doRiNA (Additional file 1). Binding sites of more than 75 nucleotides were excluded from all training sets. iCLIP sites were extended by 15 nucleotides upstream and downstream. For each set of CLIP-seq sites, we created a set of unbound sites by shuffling the coordinates of bound sites within all genes occupied by at least one binding site, thus enabling the training of models using a binary classification.
To enable accurate prediction of secondary structures , we extended the binding sites in both directions by 150 nucleotides or until reaching a transcript end. Core binding-site nucleotides, but not the additional context for folding, were marked as viewpoints. All expansions were done using genomic coordinates.
Secondary structure profiles for RNAcontext were calculated using a modified version of RNAplfold that calculates separate probabilities for stacking base pairs (that is stems), external regions, hairpins, bulges, multiloops and internal loops. Profiles for RNAcontext were calculated using the full sequences. Training and testing were performed on the same core binding sites that were marked as viewpoints for GraphProt. This ensures that RNAcontext still has access to the full sequence context required for structure prediction while providing the same concise binding sites as used by GraphProt. MatrixREDUCE was also evaluated using only the viewpoints.
Next 3′ UTRs for Ago2 binding-site predictions were prepared by selecting a non-overlapping set of transcripts with associated fold changes for Ago2 knockdown on day 2, preferring longer over shorter UTRs and with at least 100 but no more than 3,000 nucleotides.
The predictive performance of GraphProt models trained on CLIP-seq data was evaluated by a tenfold cross-validation. Classification performance is given as the AUROC using the SVM margins as the diagnostic results of classification. GraphProt has three main components: the graph encoding part, the graph kernel feature part and the predictive model part. These are parametrized. The main parameter in the graph encoding part is the abstraction level of the shape category. In the graph kernel feature part, the main parameters are the maximal radius R and the maximal distance D, which define the neighborhood subgraph features. In the predictive model part during classification, the SVM models were trained using a stochastic gradient descent approach  and the main parameters are the number of training epochs and parameter λ, which control the trade-off between the fitting accuracy and the regularization strength (Additional files 5 and 6). For the RNAcompete regressions, the main parameters are c and ε, which control the trade-off between the fitting accuracy and the regularization strength (Additional file 7). The optimal values for all these parameters were determined jointly via a line search strategy. All of the parameters were kept fixed except one, which was chosen for optimization in a round-robin fashion.
Given the amount of computation required for the optimization phase, all GraphProt parameters and RNAcontext motif widths were evaluated on a set of 1,000 sequences or 10% of the available data, whichever was smaller (Additional files 5, 6 and 8). The sequences used to determine the optimal parameter values were then discarded for the cross-validated performance assessment procedure. MatrixREDUCE automatically selects appropriate motif widths during training. For each fold of the MatrixREDUCE cross-validation, we evaluated a single motif, setting max_motif to 1 (Additional file 9). RNAcontext and MatrixREDUCE were trained using values 1/-1 for positive/negative class sequences and using motif widths ranging from 4 to 12 nucleotides.
Model evaluation for the RNAcompete data was essentially as published for RNAcontext. Models were evaluated through converting them to binary-classification tasks using the published thresholds. Classification performance is given as the APR, which is better suited than AUROC for unbalanced classes (which have few bound sequences and many unbound sequences). For each of the nine proteins, models were created for the two independent sets and in each case tested on the corresponding sets. We report the mean score of the two evaluations. GraphProt parameters were determined using subsets of 5,000 training sequences (Additional file 7). Support vector regressions were performed using libSVM. RNAcontext motif widths were determined using all training sequences (Additional file 8).
We report the improvement in predictive performance as the relative error reduction, defined as (x′-x)/(1-x) where x is the baseline performance and x′ is the improved performance. The performance is a function with codomain in the interval [ 0,1] and is 1 when the prediction corresponds exactly to the desired target. The (generalized) error is consequently defined as e=1-x.
Predicting RNA-binding protein binding sites
A trained GraphProt model is applied to any transcript (or 3′ UTRs) to predict (novel) binding sites from the same organism (across-species compatibility may exist, but was not tested). Two options for prediction are available. First, an entire sequence window, representing a potential binding site, is assigned a score that reflects the likelihood of binding. The score is the prediction margin as given by the machine-learning software, for example, the SVM. Positive values indicate a true binding site and negative values indicate that no binding occurs. Second, to generate prediction profiles on a nucleotide level, we process the prediction margins reported by the software per feature (that is, the importance of that feature for predicting RBP binding), not per window. Profiles are calculated per nucleotide by summing over all features for which the corresponding nucleotide is a root (central) node (in the feature, that is subgraph, Figure 2C). High-affinity binding sites can be extracted from prediction profiles as we exemplified for Ago2.
Prediction of Ago2 target sites
To predict Ago2 target sites, we calculated binding profiles for the 3′ UTRs of genes with corresponding fold changes from the Ago2 knockdown experiment in  using the GraphProt sequence-only model, trained on the Ago2 HITS-CLIP set. Since proteins do not only bind to single nucleotides, binding scores were averaged for all 12-mer windows. To gain high-affinity Ago2 binding sites we considered the 1% highest-scoring 12-mers and merged overlapping and abutting sites.
Logos of sequence and structure binding preferences
To provide visual representations for both sequence and structural preferences encoded by the GraphProt models, we predicted and scored the approximately 25,000 folding hypotheses of up to 2,000 CLIP-seq-derived binding sites. For each folding hypothesis per binding site, we extracted only the highest-scoring 12-mer, where the score is the average prediction margin per nucleotide from the binding profile, analogous to the method of predicting the Ago2 binding sites. To visualize structure preferences, we compressed full secondary structure information into structure profiles. A nucleotide is assigned to the structure element it occurs in: stem (S), external region (E), hairpin (H), internal loop (I), multiloop (M) or bulge (B). The 1,000 highest-scoring 12-mer nucleotide sequences and structure profiles were converted into sequence and structure logos, respectively (using WebLogo; all logos are in Additional file 10).
The GraphProt software, models, parameters and sequences (CLIP-seq sequences used for training, and PTB and 3′ UTR sequences used for predictions) are available for download . GraphProt is included as Additional file 11 for archival purposes.
area under the receiver operating characteristic curve
cross-linking and immunoprecipitation
double-stranded RNA-binding protein
electrophoretic mobility shift assay
high-throughput sequencing of RNA isolated by cross-linking immunoprecipitation
individual-nucleotide resolution cross-linking and immunoprecipitation
isothermal titration calorimetry
miRNA-induced silencing complex
nuclear magnetic resonance
Neighborhood Subgraph Pairwise Distance Kernel
photoactivatable-ribonucleoside-enhanced cross-linking and immunoprecipitation
RNA immunoprecipitation followed by microarray analysis
RNA recognition motif
Systematic Evolution of Ligands by Exponential Enrichment
small nuclear RNA
Support Vector Machine
Support Vector Regression
single-stranded RNA-binding protein
Funding was received from the German Research Foundation (DFG) (grants BA 2168/4-3 SPP 1395 InKoMBio, BA 2168/3-3 and SFB 992/1 to RB). The article processing charge was funded by the German Research Foundation (DFG) and the Albert Ludwigs University Freiburg in the funding programme Open Access Publishing.
- Baltz AG, Munschauer M, Schwanhausser B, Vasile A, Murakawa Y, Schueler M, Youngs N, Penfold-Brown D, Drew K, Milek M, Wyler E, Bonneau R, Selbach M, Dieterich C, Landthaler M: The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts. Mol Cell. 2012, 46: 674-690. 10.1016/j.molcel.2012.05.021.View ArticleGoogle Scholar
- Castello A, Fischer B, Eichelbaum K, Horos R, Beckmann BM, Strein C, Davey NE, Humphreys DT, Preiss T, Steinmetz LM, Krijgsveld J, Hentze MW: Insights into RNA biology from an atlas of mammalian, mRNA-binding proteins. Cell. 2012, 149: 1393-1406. 10.1016/j.cell.2012.04.031.View ArticleGoogle Scholar
- Ray D, Kazan H, Cook KB, Weirauch MT, Najafabadi HS, Li X, Gueroussov S, Albu M, Zheng H, Yang A, Na H, Irimia M, Matzat LH, Dale RK, Smith SA, Yarosh CA, Kelly SM, Nabet B, Mecenas D, Li W, Laishram RS, Qiao M, Lipshitz HD, Piano F, Corbett AH, Carstens RP, Frey BJ, Anderson RA, Lynch KW, Penalva LOF, et al: A compendium of RNA-binding motifs for decoding gene regulation. Nature. 2013, 499: 172-177. 10.1038/nature12311.View ArticleGoogle Scholar
- Licatalosi DD, Mele A, Fak JJ, Ule J, Kayikci M, Chi SW, Clark TA, Schweitzer AC, Blume JE, Wang X, Darnell JC, Darnell RB: HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature. 2008, 456: 464-469. 10.1038/nature07488.View ArticleGoogle Scholar
- König J, Zarnack K, Rot G, Curk T, Kayikci M, Zupan B, Turner DJ, Luscombe NM, Ule J: iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nat Struct Mol Biol. 2010, 17: 909-915. 10.1038/nsmb.1838.View ArticleGoogle Scholar
- Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, Rothballer A, Ascano MJ, Jungkamp AC, Munschauer M, Ulrich A, Wardle GS, Dewell S, Zavolan M, Tuschl T: Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell. 2010, 141: 129-141. 10.1016/j.cell.2010.03.009.View ArticleGoogle Scholar
- Corcoran DL, Georgiev S, Mukherjee N, Gottwein E, Skalsky RL, Keene JD, Ohler U: PARalyzer: definition of RNA binding sites from PAR-CLIP, short-read sequence data. Genome Biol. 2011, 12: R79-10.1186/gb-2011-12-8-r79.View ArticleGoogle Scholar
- Uren PJ, Bahrami-Samani E, Burns SC, Qiao M, Karginov FV, Hodges E, Hannon GJ, Sanford JR, Penalva LOF, Smith AD: Site identification in high-throughput RNA-protein interaction data. Bioinformatics. 2012, 28: 3013-3020. 10.1093/bioinformatics/bts569.View ArticleGoogle Scholar
- Blencowe BJ, Ahmad S, Lee LJ: Current-generation high-throughput sequencing: deepening insights into mammalian transcriptomes. Genes Dev. 2009, 23: 1379-1386. 10.1101/gad.1788009.View ArticleGoogle Scholar
- Derrien T, Estelle J, Marco Sola S, Knowles DG, Raineri E, Guigo R, Ribeca P: Fast computation and applications of genome mappability. PLoS One. 2012, 7: e30377-10.1371/journal.pone.0030377.View ArticleGoogle Scholar
- Ray D, Kazan H, Chan ET, Pena Castillo L, Chaudhry S, Talukder S, Blencowe BJ, Morris Q, Hughes TR: Rapid and systematic analysis of the RNA, recognition specificities of RNA-binding proteins. Nat Biotechnol. 2009, 27: 667-670. 10.1038/nbt.1550.View ArticleGoogle Scholar
- Das MK, Dai HK: A survey of DNA motif finding algorithms. BMC Bioinformatics. 2007, 8: S21-View ArticleGoogle Scholar
- Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS: MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009, 37: W202-W208. 10.1093/nar/gkp335.View ArticleGoogle Scholar
- Foat BC, Morozov AV, Bussemaker HJ: Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics. 2006, 22: e141-e149. 10.1093/bioinformatics/btl223.View ArticleGoogle Scholar
- Leibovich L, Paz I, Yakhini Z, Mandel-Gutfreund Y: DRIMust: a web server for discovering rank imbalanced motifs using suffix trees. Nucleic Acids Res. 2013, 41: W174-W179. 10.1093/nar/gkt407.View ArticleGoogle Scholar
- Sanford JR, Wang X, Mort M, Vanduyn N, Cooper DN, Mooney SD, Edenberg HJ, Liu Y: Splicing factor SFRS1 recognizes a functionally diverse landscape of RNA transcripts. Genome Res. 2009, 19: 381-394.View ArticleGoogle Scholar
- Kazan H, Ray D, Chan ET, Hughes TR, Morris Q: RNAcontext: a new method for learning the sequence and structure binding preferences of RNA-binding proteins. PLoS Comput Biol. 2010, 6: e1000832-10.1371/journal.pcbi.1000832.View ArticleGoogle Scholar
- Gupta SK, Kosti I, Plaut G, Pivko A, Tkacz ID, Cohen-Chalamish S, Biswas DK, Wachtel C, Waldman Ben-Asher H, Carmi S, Glaser F, Mandel-Gutfreund Y, Michaeli S: The hnRNP F/H homologue of Trypanosoma brucei is differentially expressed in the two life cycle stages of the parasite and regulates splicing and mRNA stability. Nucleic Acids Res. 2013, 41: 6577-6594. 10.1093/nar/gkt369.View ArticleGoogle Scholar
- Hiller M, Zhang Z, Backofen R, Stamm S: Pre-mRNA secondary, structures influence exon recognition. PLoS Genet. 2007, 3: e204-10.1371/journal.pgen.0030204.View ArticleGoogle Scholar
- Pudimat R, Schukat-Talamazzini E, Backofen R: A multiple-feature framework for modelling and predicting transcription factor binding sites. Bioinformatics. 2005, 21: 3082-3088. 10.1093/bioinformatics/bti477.View ArticleGoogle Scholar
- Hiller M, Pudimat R, Busch A, Backofen R: Using RNA secondary structures to guide sequence motif finding towards single-stranded regions. Nucleic Acids Res. 2006, 34: e117-10.1093/nar/gkl544.View ArticleGoogle Scholar
- Kazan H, Morris Q: RBPmotif: a web server for the discovery of sequence and structure preferences of RNA-binding proteins. Nucleic Acids Res. 2013, 41: W180-W186. 10.1093/nar/gkt463.View ArticleGoogle Scholar
- Lee JH, Kim H, Ko J, Lee Y: Interaction of C5 protein with RNA, aptamers selected by SELEX. Nucleic Acids Res. 2002, 30: 5360-5368. 10.1093/nar/gkf694.View ArticleGoogle Scholar
- Gatignol A, Buckler C, Jeang KT: Relatedness of an RNA-binding motif in human immunodeficiency virus type 1 TAR RNA-binding protein TRBP to human P1/dsI kinase and Drosophila staufen. Mol Cell Biol. 1993, 13: 2193-2202.View ArticleGoogle Scholar
- Lange SJ, Alkhnbashi OS, Rose D, Will S, Backofen R: CRISPRmap: an automated classification of repeat conservation in prokaryotic adaptive immune systems. Nucleic Acids Res. 2013, 41: 8034-8044. 10.1093/nar/gkt606.View ArticleGoogle Scholar
- Hatoum-Aslan A, Maniv I, Marraffini LA: Mature clustered, regularly interspaced, short palindromic repeats RNA (crRNA) length is measured by a ruler mechanism anchored at the precursor processing site. Proc Natl Acad Sci USA. 2121, 108: 8-21222.Google Scholar
- Masliah G, Barraud P, Allain FHT: RNA recognition by double-stranded RNA binding domains: a matter of shape and sequence. Cell Mol Life Sci. 2013, 70: 1875-1895.Google Scholar
- Will S, Joshi T, Hofacker IL, Stadler PF, Backofen R: LocARNA-P: Accurate boundary prediction and improved detection of structural RNAs. RNA. 2012, 18: 900-914. 10.1261/rna.029041.111.View ArticleGoogle Scholar
- Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R: Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol. 2007, 3: e65-10.1371/journal.pcbi.0030065.View ArticleGoogle Scholar
- Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF: RNAalifold:, improved consensus structure prediction for RNA alignments. BMC Bioinformatics. 2008, 9: 474-10.1186/1471-2105-9-474.View ArticleGoogle Scholar
- Wang X, Juan L, Lv J, Wang K, Sanford JR, Liu Y: Predicting sequence and structural specificities of RNA binding regions recognized by splicing factor SRSF1. BMC Genomics. 2011, 12: S8-View ArticleGoogle Scholar
- Sturm M, Hackenberg M, Langenberger D, Frishman D: TargetSpy: a supervised machine learning approach for microRNA target prediction. BMC Bioinformatics. 2010, 11: 292-10.1186/1471-2105-11-292.View ArticleGoogle Scholar
- Bernhart SH, Mückstein U, Hofacker IL: RNA accessibility in cubic time. Algorithms Mol Biol. 2011, 6: 3-10.1186/1748-7188-6-3.View ArticleGoogle Scholar
- Lange SJ, Maticzka D, Möhl M, Gagnon JN, Brown CM, Backofen R: Global or local? Predicting secondary structure and accessibility in mRNAs. Nucleic Acids Res. 2012, 40: 5215-5226. 10.1093/nar/gks181.View ArticleGoogle Scholar
- Cortes C, Vapnik V: Support-vector networks. Machine Learning. 1995, Dordrecht: Kluwer Academic Publishers-Plenum Publishers, 273-297.Google Scholar
- Drucker H, Burges CJ, Kaufman L, Smola A, Vapnik V: Support vector regression machines. Adv Neural Info Process Syst. 1997, 9: 155-161.Google Scholar
- Gowri VS, Krishnadev O, Swamy CS, Srinivasan N: MulPSSM: a database of multiple position-specific scoring matrices of protein domain families. Nucleic Acids Res. 2006, 34: D243-D246. 10.1093/nar/gkj043.View ArticleGoogle Scholar
- Anders G, Mackowiak SD, Jens M, Maaskola J, Kuntzagk A, Rajewsky N, Landthaler M, Dieterich C: doRiNA: a database of RNA interactions in post-transcriptional regulation. Nucleic Acids Res. 2012, 40: D180-D186. 10.1093/nar/gkr1007.View ArticleGoogle Scholar
- Xue Y, Zhou Y, Wu T, Zhu T, Ji X, Kwon YS, Zhang C, Yeo G, Black DL, Sun H, Fu XD, Zhang Y: Genome-wide analysis of PTB-RNA interactions reveals a strategy used by the general splicing repressor to modulate exon inclusion or skipping. Mol Cell. 2009, 36: 996-1006. 10.1016/j.molcel.2009.12.003.View ArticleGoogle Scholar
- Tacke R, Chen Y, Manley JL: Sequence-specific RNA binding by an SR protein requires RS domain phosphorylation: creation of an SRp40-specific splicing enhancer. Proc Natl Acad Sci USA. 1997, 94: 1148-1153. 10.1073/pnas.94.4.1148.View ArticleGoogle Scholar
- Gao FB, Carson CC, Levine T, Keene JD: Selection of a subset of mRNAs, from combinatorial 3’ untranslated region libraries using neuronal RNA-binding protein Hel-N1. Proc Natl Acad Sci USA. 1120, 91: 7-11211.Google Scholar
- Perez I, Lin CH, McAfee JG, Patton JG: Mutation of PTB binding sites causes misregulation of alternative 3’ splice site selection in vivo. RNA. 1997, 3: 764-778.Google Scholar
- Colombrita C, Onesto E, Megiorni F, Pizzuti A, Baralle FE, Buratti E, Silani V, Ratti A: TDP-43 and FUS RNA-binding proteins bind distinct sets of cytoplasmic messenger RNAs and differently regulate their post-transcriptional fate in motoneuron-like cells. J Biol Chem. 1563, 287: 5-15647.Google Scholar
- Hoell JI, Larsson E, Runge S, Nusbaum JD, Duggimpudi S, Farazi TA, Hafner M, Borkhardt A, Sander C, Tuschl T: RNA targets of wild-type and mutant FET family proteins. Nat Struct Mol Biol. 2011, 18: 1428-1431. 10.1038/nsmb.2163.View ArticleGoogle Scholar
- Gorlach M, Burd CG, Dreyfuss G: The determinants of RNA-binding specificity of the heterogeneous nuclear ribonucleoprotein C proteins. J Biol Chem. 2307, 269: 4-23078.Google Scholar
- Wilusz J, Shenk T: A uridylate tract mediates efficient heterogeneous nuclear ribonucleoprotein C protein-RNA cross-linking and functionally substitutes for the downstream element of the polyadenylation signal. Mol Cell Biol. 1990, 10: 6397-6407.View ArticleGoogle Scholar
- Dember L, Kim N, Liu K, Anderson P: Individual RNA recognition motifs of TIA-1 and TIAR have different RNA binding specificities. J Biol Chem. 1996, 271: 2783-10.1074/jbc.271.5.2783.View ArticleGoogle Scholar
- Forch P, Puig O, Kedersha N, Martinez C, Granneman S, Seraphin B, Anderson P, Valcarcel J: The apoptosis-promoting factor TIA-1 is a regulator of alternative pre-mRNA splicing. Mol Cell. 2000, 6: 1089-1098. 10.1016/S1097-2765(00)00107-6.View ArticleGoogle Scholar
- Bauer WJ, Heath J, Jenkins JL, Kielkopf CL: Three RNA recognition motifs participate in RNA recognition and structural organization by the pro-apoptotic factor TIA-1. J Mol Biol. 2012, 415: 727-740. 10.1016/j.jmb.2011.11.040.View ArticleGoogle Scholar
- Kim HS, Headey SJ, Yoga YMK, Scanlon MJ, Gorospe M, Wilce MCJ, Wilce JA: Distinct binding properties of TIAR RRMs and linker region. RNA Biol. 2013, 10: 579-589. 10.4161/rna.24341.View ArticleGoogle Scholar
- Sharma S, Maris C, Allain FHT, Black DL: U1 snRNA directly interacts with polypyrimidine tract-binding protein during splicing repression. Mol Cell. 2011, 41: 579-588. 10.1016/j.molcel.2011.02.012.View ArticleGoogle Scholar
- Kojima S, Matsumoto K, Hirose M, Shimada M, Nagano M, Shigeyoshi Y, Hoshino Si, Ui-Tei K, Saigo K, Green CB, Sakaki Y, Tei H: LARK activates posttranscriptional expression of an essential mammalian clock protein, PERIOD1. Proc Natl Acad Sci USA. 2007, 104: 1859-1864. 10.1073/pnas.0607567104.View ArticleGoogle Scholar
- Law MJ, Rice AJ, Lin P, Laird-Offringa IA: The role of RNA structure in the interaction of U1A protein with U1 hairpin II RNA. RNA. 2006, 12: 1168-1178. 10.1261/rna.75206.View ArticleGoogle Scholar
- Ilik IA, Quinn JJ, Georgiev P, Tavares-Cadete F, Maticzka D, Toscano S, Wan Y, Spitale RC, Luscombe N, Backofen R, Chang HY, Akhtar A: Tandem stem-loops in roX RNAs act together to mediate X chromosome, dosage compensation in Drosophila. Mol Cell. 2013, 51: 156-173. 10.1016/j.molcel.2013.07.001.View ArticleGoogle Scholar
- Laver JD, Li X, Ancevicius K, Westwood JT, Smibert CA, Morris QD, Lipshitz HD: Genome-wide analysis of Staufen-associated mRNAs identifies secondary structures that confer target specificity. Nucleic Acids Res. 2013, 41: 9438-9460. 10.1093/nar/gkt702.View ArticleGoogle Scholar
- Karakasiliotis I, Vashist S, Bailey D, Abente EJ, Green KY, Roberts LO, Sosnovtsev SV, Goodfellow IG: Polypyrimidine tract binding protein functions as a negative regulator of feline calicivirus translation. PLoS One. 2010, 5: e9562-10.1371/journal.pone.0009562.View ArticleGoogle Scholar
- Zhang C, Lee KY, Swanson MS, Darnell RB: Prediction of clustered RNA-binding protein motif sites in the mammalian genome. Nucleic Acids Res. 2013, 41: 6793-6807. 10.1093/nar/gkt421.View ArticleGoogle Scholar
- Schmitter D, Filkowski J, Sewer A, Pillai RS, Oakeley EJ, Zavolan M, Svoboda P, Filipowicz W: Effects of Dicer and Argonaute down-regulation on mRNA levels in human HEK293 cells. Nucleic Acids Res. 2006, 34: 4801-4815. 10.1093/nar/gkl646.View ArticleGoogle Scholar
- Selbach M, Schwanhausser B, Thierfelder N, Fang Z, Khanin R, Rajewsky N: Widespread changes in protein synthesis induced by microRNAs. Nature. 2008, 455: 58-63. 10.1038/nature07228.View ArticleGoogle Scholar
- Schnall-Levin M, Rissland OS, Johnston WK, Perrimon N, Bartel DP, Berger B: Unusually effective microRNA targeting within repeat-rich coding regions of mammalian mRNAs. Genome Res. 2011, 21: 1395-1403. 10.1101/gr.121210.111.View ArticleGoogle Scholar
- Grimson A, Farh KKH, Johnston WK, Garrett-Engele P, Lim LP, Bartel DP: MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell. 2007, 27: 91-105. 10.1016/j.molcel.2007.06.017.View ArticleGoogle Scholar
- Heyne S, Costa F, Rose D, Backofen R: GraphClust: alignment-free structural clustering of local RNA secondary structures. Bioinformatics. 2012, 28: i224-i232. 10.1093/bioinformatics/bts224.View ArticleGoogle Scholar
- Steffen P, Voss B, Rehmsmeier M, Reeder J, Giegerich R: RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics. 2006, 22: 500-503. 10.1093/bioinformatics/btk010.View ArticleGoogle Scholar
- Leslie CS, Eskin E, Cohen A, Weston J, Noble WS: Mismatch string kernels for discriminative protein classification. Bioinformatics. 2004, 20: 467-476. 10.1093/bioinformatics/btg431.View ArticleGoogle Scholar
- Costa F, Grave KD: Fast neighborhood subgraph pairwise distance kernel. Proceedings of the 26th International Conference on Machine Learning. 2010, Haifa: Omnipress, 255-262.Google Scholar
- Frasconi P, Costa F, Raedt LD, Grave KD: kLog: a language for logical and relational learning with kernels. CoRR. 2012, abs/1205.3981Google Scholar
- Gene Expression Omnibus. [http://www.ncbi.nih.gov/geo/]
- Bottou L, LeCun Y: Large scale online learning. Advances in Neural Information Processing Systems 16. Edited by: Thrun S, Saul L, Schölkopf B. 2004, Cambridge: MIT PressGoogle Scholar
- Chang CC, Lin CJ: LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol. 2011, 2: 27-[Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm]View ArticleGoogle Scholar
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14: 1188-1190. 10.1101/gr.849004.View ArticleGoogle Scholar
- GraphProt. [http://www.bioinf.uni-freiburg.de/Software/GraphProt]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.