Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites

mirSVR is a new machine learning method for ranking microRNA target sites by a down-regulation score. The algorithm trains a regression model on sequence and contextual features extracted from miRanda-predicted target sites. In a large-scale evaluation, miRanda-mirSVR is competitive with other target prediction methods in identifying target genes and predicting the extent of their downregulation at the mRNA or protein levels. Importantly, the method identifies a significant number of experimentally determined non-canonical and non-conserved sites.


Comparison of context score values
Context score values were computed by parsing the AU composition, 3'-binding and UTR position features from predicted microRNA::target site alignments. Seed-specific scores were calculated using the regression parameters described in Grimson et al., Table S6 [8], using the source code downloaded from [39]. Because the predicted miRNA::target alignments in our study were generated by miRanda, the context score values computed from our predicted duplexes (mirSVR-CS) are not identical to those calculated by TargetScan (TargetScan-CS), and this difference may have some impact on the performance assessment.
To assess the significance of the differences between our computed context scores and those provided by TargetScan 5.0, we first calculated the Pearson correlation between the two context score values for the canonical sites in the Linsley data set [21]. We found a significant correlation between the two scores (Supplementary Table 1), and most of the difference can be attributed to variations in the 3'-binding scores. We also correlated the two scores with the observed log expression changes and found that mirSVR-CS values are in fact better correlated with the observed expression changes than the context scores provided by TargetScan. We conclude that differences in predicting the base-pairing in the miRNA::mRNA duplexes, and the resulting differences in context score values, did not adversely affect context score performance in our comparative analysis.

Spearman correlation of canonical and all-site mirSVR models. (a) mirSVR precision, defined as TP/(TP+FP), was compared to a set of 10,000 randomized predictions, using a mirSVR score cutoff of -0.1 to define the number of predictions made. The precision attained by mirSVR (marked by red arrow) satisfies p < 1.0e-4 relative to the randomized predictions. (b) Similarly, mirSVR sensitivity, defined as TP/(TP+FN), satisfies p < 1.0e-4 relative to random predictions.
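The comparison against randomized predictions can be illustrated with a short sketch. It is not the authors' code: the function names, the randomization scheme (drawing the same number of genes uniformly at random from all scored genes) and the add-one correction in the empirical p-value are assumptions made for illustration only.

```python
import random


def precision(predicted, positives):
    """TP / (TP + FP): fraction of predicted genes that are true targets."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & positives) / len(predicted)


def sensitivity(predicted, positives):
    """TP / (TP + FN): fraction of true targets that were predicted."""
    if not positives:
        return 0.0
    return len(set(predicted) & positives) / len(positives)


def empirical_p_value(scores, positives, metric=precision,
                      cutoff=-0.1, n_random=10_000, seed=0):
    """Compare a metric at a score cutoff against randomized predictions.

    scores    : dict mapping gene -> score (more negative = stronger
                predicted downregulation); hypothetical input format
    positives : set of genes observed to be downregulated
    Returns (observed metric, empirical p-value).
    """
    rng = random.Random(seed)
    genes = sorted(scores)
    # The cutoff fixes how many predictions are made.
    predicted = [g for g in genes if scores[g] <= cutoff]
    observed = metric(predicted, positives)

    # Assumed null model: the same number of genes drawn at random.
    as_good = 0
    for _ in range(n_random):
        sample = rng.sample(genes, len(predicted))
        if metric(sample, positives) >= observed:
            as_good += 1
    # Add-one correction (an assumption) keeps the estimate above zero.
    return observed, (as_good + 1) / (n_random + 1)
```

Passing `metric=sensitivity` applies the same test to panel (b). With 10,000 randomizations, the smallest p-value this estimator can report is 1/10,001, which is just below 1.0e-4.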

Figure S1: Regression coefficient values of mirSVR features. The learned weights of the various features used in the mirSVR canonical (red) and all-sites (blue) models. Negative values indicate that the feature contributes to downregulation of the target gene, whereas positive values indicate that the feature is disfavored. The features include: UTR length; secondary structure accessibility scores (SS) from position -20 to +20, averaged over a window of two; conservation [38]; AU content around the target site in a window of 30 bases upstream and downstream of the site; 3'-binding score; relative distance from the UTR end (UTRDis); and seed base pairing up to position 9, with an additional bit to represent the presence of an A across from the first microRNA base (1A). In the canonical model (red bars), the coefficients of the seed region bit vector (m2-m7) are zero, since all target sites are base-paired to the target mRNA between positions 2-7 and these bits therefore do not contribute to the regression model.

Figure S2: Comparison of target prediction methods. (a) Spearman rank correlation between target site prediction scores and measured downregulation. Genes with single canonical sites were ordered by mirSVR, context score, miRanda [11], PITA [15], PicTar (4way) [10] and Diana-microT [25]. All methods were required to rank the same set of target sites, and the ranking of the sites from each method was compared to the rank order of the Z-transformed log expression change. Due to the low number of predictions by PicTar for some of the Linsley et al. microRNA transfection experiments, not all experiments were included in the analysis. Overall, mirSVR outperforms all other methods in 8 out of 11 experiments, supporting our conclusion that mirSVR improves over the most commonly used target prediction methods. (b) The mean log expression change of the top 50 predicted targets from each prediction method was computed in each of the miRNA transfection experiments. The results are summarized in the box plots, where red lines represent the median value over the 17 experiments, 50% of the scores are within the blue box, and 75% are within the dashed lines. Overall, the top predictions from mirSVR are more downregulated than the top predictions of any other method.
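The two summary statistics used in this comparison, the Spearman rank correlation between prediction scores and the Z-transformed log expression change, and the mean log expression change of the top-ranked targets, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the evaluation code used in the study: the function names are hypothetical, ties receive average ranks, the Z-transform uses the population standard deviation, and more negative scores are assumed to mean stronger predicted downregulation.

```python
from statistics import mean, pstdev


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def z_transform(values):
    """Center to zero mean and scale to unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def mean_change_of_top(scores, log_changes, n=50):
    """Mean log expression change of the n best-scoring sites.

    Assumes the most negative score is the strongest prediction.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i])[:n]
    return mean(log_changes[i] for i in top)
```

Because the Z-transform is monotonic, it leaves the Spearman coefficient unchanged; it is included here only to mirror the procedure described in the caption. The `pearson` helper is the same statistic used above to compare mirSVR-CS with TargetScan-CS.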

Figure S4: Comparison of mirSVR canonical and all-sites models. The Spearman rank correlation of mirSVR scores for genes with single canonical sites from the Linsley et al. [21] and Selbach et al. [17] data sets (the last 5 experiments, labeled as proteomics) shows similar performance for both models, with a slight advantage for the canonical model.