mirMark: a site-level and UTR-level classifier for miRNA target prediction
© Menor et al.; licensee BioMed Central Ltd. 2014
Received: 18 July 2014
Accepted: 15 October 2014
Published: 25 October 2014
MiRNAs play important roles in many diseases including cancers. However computational prediction of miRNA target genes is challenging and the accuracies of existing methods remain poor. We report mirMark, a new machine learning-based method of miRNA target prediction at the site and UTR levels. This method uses experimentally verified miRNA targets from miRecords and mirTarBase as training sets and considers over 700 features. By combining Correlation-based Feature Selection with a variety of statistical or machine learning methods for the site- and UTR-level classifiers, mirMark significantly improves the overall predictive performance compared to existing publicly available methods. MirMark is available from https://github.com/lanagarmire/MirMark.
MicroRNA (miRNA or miR) is one type of non-coding RNA (ncRNA) that regulates gene expression post-transcriptionally . In mammals, the mature form of miRNA is about 22 nucleotide (nt) long and it forms the miRNA-induced silencing complex (miRISC) in combination with argonaute proteins. Using the miRNA sequence as a guide, this miRISC binds to messenger RNAs (mRNAs) to degrade targeted mRNAs or inhibit translation from mRNAs to proteins . There have been over 1,000 annotated miRNAs in humans, and due to the potential to target multiple mRNAs by each miRNA, it is speculated that as much as 60% of mammalian genes are affected by miRNAs ,. Thus abnormal changes in miRNA expression can cause dysregulation of important biological pathways, and are involved in many diseases such as cancers and cardiovascular disease -. Therefore determination of the target mRNAs of the variety of miRNAs will help understand the development of these diseases.
In mammals, the binding of the miRNA to the mRNA is not perfectly complementary and the underlying mechanism is not fully understood . This makes it a difficult task for computational prediction of the mRNA targets of a particular miRNA. Due to the small number of experimentally verified miRNA-mRNA pairs, early miRNA target prediction methods are rule-based expert systems, such as MiRanda .
Currently, a variety of tools have been proposed for miRNA target prediction, based on different methodologies. Among them, TargetScan is a popular method that removes the free energy component and looks for conservation of the 8mer and 7mer seed region (as opposed to conserved miRs in the original version) . TargetScan uses the context scores to rank the predicted targets, based on linear regression trained on microarray data that consider 3' compensatory pairing (13th to 16th nt), local AU composition, and position effects (distance to closest end of 3' UTR) . As an improvement, the revised context + score adds predicted seed-pairing stability and target-site abundance . On the other hand, RNAhybrid and PITA are based on thermodynamics. RNAhybrid computes scores based on secondary structure , whereas PITA assesses the accessibility of the site (seed match) by the difference between the minimum free energy (MFE) of the duplex and the energy required to unpair and open the target site . Additionally, some recent methods such as mimiRNA  and TarBase , intend to discover miRNA and mRNA correlations by incorporating large amounts of experimental data. As the number of experimentally verified pairs has increased significantly over the years, statistical or machine learning methods that are data-driven are becoming popular in the area of non-coding RNA classification, including miRNA target prediction ,. An example of the data-driven miRNA target predictor is SVMicrO  that combines a larger variety of features than those of rule-based systems with the popular Support Vector Machine (SVM) learning algorithm .
A miRNA can potentially bind to multiple sites in the targeted mRNA. Depending on the resolution, one can perceive miRNA target prediction at two levels. At the gene level, one can predict if a given miRNA will target a particular mRNA. At the finer level, we can predict the sites along the interested region of mRNA which a miRNA will interact with. Correspondingly, miRNA target prediction with the classification approach includes at least two types of classifiers: a site-level classifier that predicts the possible target sites along the mRNA, and a gene-level classifier that predicts potential target mRNA overall. For example, DIANA-microT  scores individual target sites of a miRNA along the mRNA 3' UTR and then computes a combined score for the miRNA-mRNA pair overall using an artificial neural network. The distinction between the two levels is important, since the availability of data for training and testing is very different: many experiments identify miRNA-mRNA pairs, but lack results in the locations of the target sites. As a result, the performance of classifiers in relevance to others needs to be evaluated differently.
Most of the experimentally verified target sites remain historically biased toward the 3' UTRs, although there are growing observations of target sites within coding sequences (CDSs). Therefore in this paper, only target sites in the 3' UTR will be considered, due to the abundance of training data in 3' UTR. UTR-level classification will be used in lieu of gene-level classification in this report. However, the method proposed in this study is adaptable to target site prediction within CDSs, when sufficient amount of training data in CDS are available .
Despite the considerable advances in miRNA target prediction, there is much room for improving the predictive performance of existing methods . This study aims to improve the predictive performance for miRNA target prediction at both the site and UTR level by considering an extensive list of over 700 predictive features and using the latest collection of experimentally verified miRNA target data. Feature selection is used to find the most relevant, yet least redundant, set of features for site- and UTR-level prediction. Several statistical or machine learning methods are used to integrate the selected features and their performances are compared. Finally, the resulting classifiers called mirMark are compared to existing publically available miRNA target prediction methods. mirMark is demonstrated to have significantly improved predictive performance at both the site and UTR levels.
The positive data are obtained from miRecords  and miRTarBase . At the site-level, only human miRNA-mRNA pairs with validated target site information are taken from miRecords. Since miRecords uses a mixture of older (pre-2011) and current mature miRNA nomenclature, the mature miRNA names are resolved using BLAST: the miRNA sequences given in miRecords are compared with the mature human miRNA sequences obtained from mirBase (v19), the most recent version available during the preparation of this manuscript . Similarly, the target region positions on the 3' UTR are inferred using BLAST: the target region sequences given in miRecords are compared to the target 3' UTR sequences obtained from UCSC Genome Browser. Any site-level records with unresolvable miRNA names or target region positions are omitted. The resulting list of 507 miRNA-target site pairs is used as the site-level positive set. This list is provided in Additional file 1: Table S1.
At the UTR level, experimentally validated human miRNA-gene pairs are combined from two sources: (1) all human gene and miRNA pairs from miRecords; and (2) the subset of miRNA-gene pairs that have strong experimental evidence (that is, those that are not labelled as weakly supported) from mirTarBase. In miRecords, again miRNA names that cannot be resolved by comparison with mirBase v16 (version prior to the change of nomenclature) or v19 are omitted, due to the mixture of nomenclature used. In miRTarBase, genes that have multiple distinct/overlapping UTR sequence are omitted, and longest UTRs are used to represent the genes that have some RefSeq UTRs contained within a longer UTR of the same gene. The resulting list of 2,937 miRNA-gene pairs are used as the UTR-level positive set. This list is provided in Additional file 2: Table S2.
The negative data are generated using mock miRNAs in a manner similar to the approaches used in ,. A mock miRNA is a random permutation of a real mature miRNA sequence that does not have any overlap with the seed sequences from known miRNAs. For each mature miRNA, we use the Fisher-Yates shuffle algorithm  to generate random permutations until we find a mock miRNA such that no 7mer in the seed region of the mock miRNA matches a 7mer of the seed region of any real mature miRNA listed in mirBase v19.
At the site level, mock miRNAs are generated for each real miRNA in the site-level positive dataset. For each real miRNA-gene pair in the positive dataset, a corresponding mock miRNA-gene pair is generated and replaces the positive miRNA in the miRNA-gene pair. Negative target regions are then generated for each mock miRNA-gene pair using MiRanda's alignment algorithm with a minimum alignment score of 155. Doing so allows us to find well aligned target sites and create a balanced set of positive and negative data.
At the UTR level, mock miRNA-gene pairs are generated for each real miRNA-gene pair in the UTR-level positive dataset. The mock miRNAs are generated by randomly permutations of the corresponding real miRNA sequences, as in the site-level negative set. Features in site-level are computed for the UTR level as well, and summary features on these sites are calculated for each pair, with additions of other UTR-level specific features (see the `UTR-level features' section below).
One hundred and fifty-one site-level features are considered and the full list is given in Additional file 3: Table S3. Below are the descriptions of the site-level features by category.
The total minimum free energy (Duplex_MFE) is computed using RNAduplex  on the mature miRNA and the candidate target site (CTS). Region specific minimum free energies are computed by using RNAduplex on the miRNA seed (Seed_MFE) or miRNA 3' region (3p_MFE) on the CTS. The local minimum free energy of the CTS (Local_target_MFE) is computed by RNAfold  on the 100 nt window surrounding the CTS. The local minimum free energy of the CTS whose bases are constrained to be unpaired (Local_cons_target_MFE) is also computed using RNAfold on the 100 nt window surround the CTS. The local opening energy of the CTS (Local_open_energy), a measure of CTS accessibility done in software PITA , is computed as the difference between Local_target_MFE and Local_cons_target_MFE.
Seed match type
Binary variables specifying the types of seed match in a CTS are computed using MiRanda's predicted alignment. The types of seed match considered are as follows:
Seed_match_8mer: p1-p8 Watson-Crick (WC) match
Seed_match_8merA1: p1 match/mismatch to A, p2-p8 WC match
Seed_mach7mer1: p1-p7 WC match
Seed_match7mer2: p2-p8 WC match
Seed_match7merA1: p1 match/mismatch to A, p2-p7 WC match
Seed_match6mer1: p1-p6 WC match
Seed_match6mer2: p2-p7 WC match
Seed_match6mer1GU: p1-p6 WC match allowing only one GU wobble
Seed_match6mer2GU: p2-p7 WC match allowing only one GU wobble
Information of the type of target duplex pairing for the first 20 nt of the miRNA (miR_match_P01 to miR_match_P20) is encoded as an integer-based categorical variable as follows:
1: G-C match
2: A-U match
3: G-U wobble
Furthermore, the miRNA pairing information is summarized over the seed region, 3' region, and total miRNA region. This includes the number of G-C matches (Seed_GC), A-U matches, (Seed_AU), GU wobbles (Seed_GU), mismatches (Seed_mismatch), bulges (Seed_bulge), and nucleotides in bulges (Seed_bulge_nt) in the seed region of the miRNA.
Target site accessibility
Position-wise and region accessibility values of CTSs are computed using RNAplfold in ViennaRNA package  with winsize 80, span 40, and ulength 10. The accessibility of the entire seed region (Seed_acc), the 5′ half of the seed region (Seed_5p_acc), the 3' half of the seed region (Seed_3p_acc), and position-wise accessibility of each seed position of the CTS (Seed_P01_acc to Seed_P08_acc) are considered. Furthermore, the accessibility of the regions 10 nt upstream (Up_seed_flank_acc) and 10 nt downstream of the seed region (Down_seed_flank_acc), as well as their corresponding position-wise accessibilities (Up_seed_P01_acc to Up_seedP10 and Down_seed_P01_acc to Down_seed_P10) are considered.
Target site composition
The nucleotide and dimer composition of the CTS (for example, Target_A_comp, Target_AU_comp), and the flanking 70 nt regions upstream and downstream of the CTS (for example, Up_C_comp, Down_GU_comp) are computed using BioPerl . The flanking AU score described by Grimson et al. in , which is a weighted count of AU composition flanking the seed region, is also considered.
Target site conservation
Per base conservation scores of the human 3' UTRs are taken from PhastCons46way . The average per base conservation score of the CTS' seed region (Seed_cons_score), the entire CTS (Target_cons_score), and the 70 nt upstream and downstream flanks of the CTS (Flank_cons_score) are considered.
Location of target site
The location of the CTS is considered by computing the distance of the CTS to the closest 3' UTR end point (Dist_to_end). This distance is scaled by dividing by the length of the 3' UTR.
A total of 624 UTR-level features are considered. These features include summary statistics of site-level features, 3'UTR related information, and CTSs in 3' UTRs. Below are the descriptions of the UTR-level features by category.
Summary of site-level features
Total, minimum, maximum, and mean values of the 151 site-level features of the CTSs of a miRNA-gene pair are computed. Also considered are the total, minimum, maximum, and mean values of the posterior probability from the random forest-based site-level classifier, MiRanda alignment score, and CTS start and end positions.
Other UTR-level features
The length of the 3' UTR (UTR_length) and the number of CTSs for a miRNA-gene pair (number_sites) are considered. The CTS density (site_density) is computed as number_sites/UTR_length, as done in SVMicrO . Finally, another measure of density is computed by counting the maximum number of CTSs that lie within 100 nt of each other (max_100_sites).
Two feature selection methods are considered: Correlation-based Feature Selection (CFS)  and minimum Redundancy Maximum Relevance (mRMR) . Both methods are based on mutual information, a non-linear measure of correlation. Mutual information values are normalized to be between 0 and 1 using the Linfoot's method .
Both CFS and mRMR seek to balance the relevance and redundancy of the features. Relevance is the correlation of a feature to the class (positive or negative miRNA target), as measured using mutual information. For a feature to be selected, it must be relevant to predicting miRNA targets. On the other hand, redundancy is the correlation between two features. Redundancy between selected features is minimized to keep the number of selected features small.
The key difference between CFS and mRMR is that CFS selects an approximately optimal subset of features that balance relevance and redundancy, whereas mRMR only provides a ranking of features and the number of top ranking features to use is left to be determined by other methods, such as cross-validation.
RNAduplex, RNAfold, and RNAplfold  in Vienna RNA package are used for energy and accessibility computations. Nucleotide composition is computed using BioPerl . Weka 3 data mining software  is used for CFS, classifier training, and evaluation. The entropy package  in R is used to compute mutual information with the recommended method of Freedman and Diaconis  to discretize features of continuous variables.
The detailed instruction and open source code of mirMark for miRNA target prediction are available at  and . Additionally, we include the site-level positive data in Additional file 1: Tables S1 and negative data in Additional file 4: Table S4, and the UTR-level positive data in Additional file 2: Table S2 and negative data in Additional file 5: Table S5.
We used the PAR-CLIP data from previous studies , to compare the performance between mirMark and TargetScan. We obtained one set of data from the supplementary material of Hafler et al. , and the other datasets from Kishore et al. , which have the following accession IDs and samples:
Structure of mirMark
Most of the identified locations of miRNA targets in the miRNA target database miRecords  are in the 3' UTR region of the mRNA due to historical reason. Although there is evidence that miRNA can also target the 5′ UTR and coding regions of the mRNA, the data are sparse and therefore the focus of this work is on the 3' UTR, in order to be comparable to the majority of target prediction tools.
Given the list of CTSs of the miRNA along with their predicted alignments (Figure 1B), the site-level classifier will assign a posterior probability that the given CTS is a target site of the miRNA. This prediction is made on the basis of features such as the presence of a seed match, free energy of the duplex, and the accessibility of the target site.
Finally, given the CTSs and their posterior probability of being a true target as computed by the site-level classifier, the UTR-level classifier will assign a posterior probability that the miRNA targets the mRNA overall. This prediction can be made on the basis of features such as the number of CTSs, the number of CTSs of a particular seed type, and the length of the 3' UTR. This step allows for the integration of the information provided by the set of CTSs to improve prediction accuracy (Figure 1A).
Site-level feature selection
Five hundred and seven human miRNA-site pairs are extracted from miRecords  along with their experimentally verified duplex structures. Random permutations of the miRNA sequences are used to generate mock miRNAs. To serve as a negative set, 520 mock miRNA-site pairs corresponding to the real miRNA-site pairs are generated using MiRanda's predicted alignments. The dataset is split in two with 80% for training and cross-validation, and the rest 20% reserved as a hold-out test set for independent evaluation.
Selected site-level features by correlation-based feature selection
Match status of miRNA position 1
Match status of miRNA position 3
Match status of miRNA position 4
Match status of miRNA position 8
Match status of miRNA position 15
Number of bulges in seed region
Number of AU matches in target site
Number of mismatches in target site
Number of bulges in target site
Number of nucleotides within bulges in target site
Accessibility score position 1 of seed region
Conservation score of seed region
The two remaining seed-focused features, Seed_cons_score and Seed_P01_acc, are related to conservation and accessibility of the target site (Table 1). Since miRNAs are involved in regulating many vital biological processes, it is not surprising that many target sites are conserved across species and that the average conservation score of the target site's seed region of the miRNA (Seed_cons_score) is selected. This is in agreement with SVMicrO that also selects the conservation score of the seed region computed using PhastCons28way . The accessibility of the first position of the target site's seed region (Seed_P01_acc) is selected. Corresponding to our result, it has been shown that the accessibility of the target site's seed region is highly predictive by analysis of HITS-CLIP data .
It is also experimentally observed that a group of target sites exist with relative weak binding in the seed region that is compensated for by strong binding on the target sites overall. Examples are centered and 3' compensatory sites that have strong pairing on positions 4 to 15 and 12 to 17 of the miRNA, respectively ,. Our CFS selected features provide evidence to support such observations (Figure 2, Table 1). A group of features examine the stability of the target site duplex overall (Total_AU, Total_mismatch, Total_bulge, Total_bulge_nt). Most impressively, a feature for matching at position 15 of the miRNA (miR_match_P15) that is critical to both centered and a 3' compensatory site is selected by CFS (Figure 2, Table 1).
Compared to the 12 CFS-selected features, the top 12 ranking features selected by another closely related feature selection method mRMR (Additional file 3: Table S3) show 75% agreement. The top 12 features of mRMR do not contain the miR_match_P15, Seed_bulge, Seed_P01_acc, and Total_mismatch features selected by CFS. Instead, mRMR chooses two seed match type features (Seed_match_6mer2GU and Seed_match_7mer2) and two 3' region features (3p_bulge and 3p_mismatch).
Evaluation of site-level classifiers
To evaluate different classifiers for site-level target prediction, we perform 10-fold cross-validation on the training set where CFS is performed per fold. We consider four types of linear classifiers: logistic regression (LR) , Fisher's linear discriminant analysis (FLDA) , naïve Bayes (NB) , and the support vector machine (SVM)  with a linear kernel. In addition, we include two non-linear classifiers: the random forest (RF)  of 100 random trees and the SVM with the Gaussian radial basis function kernel (Gaussian SVM). Increasing the number of random trees has diminishing returns and it was empirically observed through cross-validation results that 100 random trees are sufficient for site-level classification. The SVM complexity parameter and the Gaussian kernel's width parameter are selected through cross-validation. The SVM classifiers also include a LR model to approximate posterior probabilities using Platt's method  to improve performance.
In addition, Gaussian SVM, RF, and LR models are trained using the top ranking mRMR features and the AUC over varying number of features is given in Additional file 6: Figure S1. The AUC achieved is comparable to that of the classifiers using CFS-selected features. Given the practical issue that mRMR only provides a ranking of features and the number of top ranking features to use is left to be determined by other methods (such as cross-validation), we elect to use the features selected by CFS from now on.
Site-level comparison with other existing methods
Using the site-level independent hold-out test set, we compare the performance of RF, Gaussian SVM, and LR to existing publically available miRNA target prediction software: SVMicrO , TargetScan , MiRanda , RNAhybrid , and PITA . Only publically available software is considered in order to obtain predictions on the mock miRNAs. The RF and Gaussian SVM are chosen as they are the top performers in the cross-validation evaluation and LR is chosen as a representative of linear classifiers. Here we only consider site-level predictions in this comparison, and leave UTR-level predictions provided by SVMicrO, TargetScan, MiRanda, and PITA in UTR-level comparisons described later. Note that TargetScan and PITA are seed-level predictors; we extend the seeds to obtain 25 nt long target site regions, so that TargetScan and PITA can be compared with other site-level models.
Next, we investigate the biases that a method may have in the locations overlapping between the expected target sites from miRecords and the predicted target sites. Since the expected target sites vary in length, we divide the length of the expected target sites into 10 equal sized bins. We also include five flanking bins of equal size upstream and downstream of the expected sites from miRecords. The bins count the number of predicted sites that overlap with the coordinates located in the expected target site. We use the predicted target regions given by the methods at about 0.6 true positive rate, according to the 75% overlap ROC curve in Figure 4C. Therefore RNAhybrid is omitted form this analysis. The results are given in Figure 4F to H corresponding to RF, SVMicrO, and MiRanda, respectively. LR and Gaussian SVM have similar results as RF, and they are given in Additional file 7: Figure S2.
In Figure 4F we see that the RF classifier performs very well, as the majority of overlapping predicted target sites are mostly contained within the expected target sites, with only a common overhang at the 5′ end of the expected target site. It also yields most number of predicted regions among the three classifiers tested. In contrast, SVMicrO shows a large overhang at the 5′ end, but a relatively clean cut at the 3' end that pairs with miRNA seed region (Figure 4G). The clean cut at the 3' end is expected as potential target sites are first identified by seeking loose seed matches to the miRNA, and then extended by alignment using MiRanda to form a full target region prediction. The results for MiRanda in Figure 4H have relatively similar shape to those for RF in Figure 4F. This is due to the fact that the RF classifier uses MiRanda's alignment algorithm to identify CTSs. The key difference between MiRanda and the RF classifier is how the CTSs are filtered: MiRanda uses the minimum free energy of the CTS, whereas RF uses a posterior probability estimated from 12 CFS selected-features selected by CFS. This results in a different selection of predicted target sites and therefore some variation in the overlap plots in between the two (Figure 4F, H).
Finally, we also evaluate the performance of mirMark vs. TargetScan using published PAR-CLIP experiment results ,. PAR-CLIP results provide direct `finger-print' information on putative miRNA binding sites genome-wide by pulling down the nucleotide sequences associated with the RNA-binding proteins, making them good additional testing data to detect the sensitivity of the tools. However, note that the direct miRNA and target pairing information is missing in PAR-CLIP data which makes them un-desirable as the positive dataset in mirMark. We randomly selected about 1,000 sites from 100 UTRs detected in PAR-CLIP data as the truth measure, where the UTR sequences are determined by BLAST matching of cross-linked centered regions (CCRs) in PAR-CLIP results. We test the performance of mirMark site-level model on these sites, in comparison with the predictions from TargetScanHuman 6.2 on the same sites. We extend TargetScan predicted seed matching regions to 25 bp, to make it comparable to mirMark. As shown in Additional file 8: Figure S3, in the regions of the high overlap percentages (more than 90%) between prediction and PAR-CLIP results, mirMark with the stringent posterior probability threshold of 0.95 still predicts significantly (27%) more true-positive sites than TargetScan that has a loose threshold of 50% percentile Context + score. This result again shows that mirMark is a better performer at the site level, compared to TargetScan.
UTR-level feature selection
UTR-level positive training data are taken from miRecords and miRTarBase. These are experimentally validated human miRNA-gene pairs with high confidence. A negative set of mock miRNA-gene pairs associated with the real miRNA-gene pairs is generated by random permutations of the miRNA sequences paired with real UTRs. Like the site-level classification, the dataset is split in two with 80% for training and cross-validation and 20% reserved for independent evaluation.
A total of 624 UTR-level features are considered. They include 3' UTR level features, such as the density of the predicted targets sites within the 3' UTR, and summary statistics (maximum, minimum, mean, and summation) based on each of the previously mentioned site-level feature of a miRNA-gene pair (see Methods for details). Also included are the total, minimum, maximum, and mean of the posterior probabilities from the RF site-level classifier on the CTSs of the miRNA-gene pair. The RF site-level classifier is chosen due to its best performance in the site-level evaluation in the previous section.
Selected UTR-level features by correlation-based feature selection
Maximum alignment score between miRNA and target sites
Proportion of target sites with P02-P07 WC match
Match status of miRNA position 1
Proportion of target sites with P02-P08 WC match
Proportion of target sites with P01-P07 WC match
Minimum MFE of seed region of miRNA:site duplexes
Mean MFE of 3' region of miRNA:site duplexes
UC dimer composition of the CTS
Match status of miRNA position 9
Match status of miRNA position 2
Mean number of GU matches in target site seed regions
Match status of miRNA position 7
Minimum distance of target sites to the 5′ end of the 3' UTR
Match status of miRNA position 19
Match status of miRNA position 15
As mentioned in the results of site-level feature selection, there exist target sites with relative weak binding in the seed region that are compensated for by strong binding on the target site overall. There are three selected features focused on the 3' region of the miRNA (Table 2, Figure 5). The mean MFE (X3p_MFE.mean) in the 3' region of the CTSs is indicative of the stability of the duplex beyond the seed region. Furthermore, the existence of a CTS with good binding on positions 15 and 19 of the miRNA (miR_match_P15.min and miR_match_P19.min) provides strong evidence of the importance of central region and 3' compensatory pairing respectively, which were observed by others experimentally ,. Additionally, the maximum MiRanda alignment score (Miranda_score.max) of the CTSs provides evidence of the overall presence of stable CTS bindings.
Finally, the literature has shown that the location of a target site is biased toward the ends of the 3' UTR  and CFS has partially detected this bias through its selection of a feature indicating the distance to the 5′ end of the 3' UTR to the closest CTS (Start_position.min in Table 2).
UTR-level comparison with existing methods
Using the UTR-level hold-out test set that is independent of mirMark models, we compare the performances of RF, LR, and Gaussian SVM classifiers in mirMark to those of TargetScan, SVMicrO, MiRanda, RNAhybrid, and PITA. With the exception of RNAhybrid and TargetScan, all other methods provide predictions at the UTR level by integrating site-level evidence. RNAhybrid produces UTR-level predictions solely based on the minimum MFE in predicted target site.
Given the wide application of TargetScan, we next ask what is the sensitivity of mirMark in the false-negative UTR targets of TargetScan′ To answer this question, we obtain 19641 UTR targets detected by PAR-CLIP ,, among which 1757 UTR targets are not predicted by TargetScanHuman version 6.2. We randomly select 300 of these targets as inputs to mirMark's UTR-level classifier, in combination with miRNAs in miRBase. We choose the posterior probability from the best match for each UTR target, and plot the posterior probability density distribution in Additional file 9: Figure S4. Most of the UTR targets from PAR-CLIP are detected with high confidence from mirMark.
Mock miRNA based vs. real biologically negative dataset
It is debatable what type of negative datasets are the best for miRNA target predictions, with the machine learning approach. True negative target data are simply the complement of the true positive target data, whereas the set of `true positive target data' are an unknown entity. The true negative target dataset is quite large. To obtain a balanced classification design, a subset of negative data must be chosen in an unbiased manner as the negative training data.
We have taken the mock miRNA and real targets pairing approach to generate the negative data, whereas others used negative data with some experimental support, such as using real miRNAs and genes with no experimental evidence of being the targets of miRNAs. In order to investigate which approach applies better in partner with the positive data that we use, we compare mirMark's negative dataset composed of mock miRNA and real targets pairs, with another approach using real miRNAs and genes lacking experimental evidence of being miRNA targets, similar to Marin et al.  and Ritchie et al. . In this alternative approach, we combine all potential targets from miRecords, miRTarBase, and 7 PAR-CLIP datasets , to obtain a set of miRNA targets, and then exclude them from RefSeq genes to get the `biologically negative targets'. We then randomly pair the real positive miRNAs with these biologically negative targets to generate a biologically negative dataset. This biologically negative dataset is then split into training/testing data, with/without combination with the mirMark's mock miRNAs as the testing/training data. This results in four scenarios: (1) mock miRNAs for both model building and validation (the mirMark method); (2) biologically negative data for both model building and validation; and (3, 4) two more cases with mixed mock miRNAs and biologically negative data for model training and validation, and vice versa. We show that the mirMark mock miRNA negative data approach has the best predictive performance on testing dataset (Additional file 10: Figure S5) among the four combinations, including the scenario where biologically negative data are used for both model building and testing-data validation. This consolidates the suitability of using mock miRNA real targets pairs as negative data, in partner with the experimentally validated positive data from mirRecords and mirTarBase.
Due to the use of miRanda to align CTS for the site-level negative data, biases might be introduced for features relevant to the miRanda algorithm. To minimize possible biases, we use a stringent miRanda alignment score threshold for the negative data. Only CTS with a miRanda score of 155 or higher are allowed as part of the negative data. Such a stringent threshold potentially dampens or dilutes the importance of seed matching. Even so, the selected features still (Figure 2, Table 1) highlight the importance of the seed region in miRna targeting, as seven out of the twelve selected features are focused on the seed region.
Relevance of selected features to classes
The correlations of the site- and UTR-level features to the class in Figures 2 and 5, respectively, are not very strong, with values around 0.55 or below. This suggests that an individual feature by itself is not a very strong predictor of miRNA targets. Rather, a set of features is necessary to make target predictions reliably. This is supported by the prediction results given in Figures 4 and 6, where the target predictions based on duplex MFE (MiRanda and RNAhybrid) and accessibility (PITA) underperform the machine learning-based predictors that integrate multiple features (mirMark and SVMicrO).
Comparison of mirMark classification methods
Random forest and Gaussian SVM are two top classifiers that have very close predictive performances overall. At site level, the non-linear methods of mirMark are only marginally superior in cross-validation (Figure 3), regardless of metrics, suggesting that the decision boundary between target sites and non-target sites is nearly linear. Since the SVM learning algorithm approximately optimizes accuracy, it is not surprising the non-linear Gaussian SVM outperforms all other methods in accuracy. Also the training data are balanced between target site and non-target site examples, this may explain why Gaussian SVM yields better F-measure and MCC than other classifier. It has been shown that Platt's method is unreliable at estimating posterior probabilities from SVM outputs , which may explain the slight underperformance of Gaussian SVM compared to random forest in the AUC measure that relies on the posterior probability estimates. However, there is a drop in SVM performance from cross-validation to test results at site level. This may be due to selection bias in the cross-validation results, which was used to select the SVM parameters. The results on the test set suggest that some overfitting of the SVM models is caused by choosing the SVM model with the best observed AUC in cross-validation. On the other hand, random forest is an ensemble of decision trees where the classification is determined by most popular vote among all model trees. It is known to converge without the overfitting problem . This advantage of random forest is exhibited in the hold-out testing set at site-level.
Site-level vs. UTR-level predictions of mirMark
None of the posterior probability features, the outputs of the RF site-level classifier, are selected by CFS for UTR-level classification. This suggests that UTR-level target prediction can be largely independently of results out of the site-level target prediction. Indeed, the majority of the CFS-selected UTR-level features are highly correlated to the posterior features, as the heatmap in Additional file 11: Figure S6 shows. Thus we propose that necessary predictive information for the UTR-level is contained in the summary site-level features and the results of a site-level classifier are not an absolute requirement for UTR-level target prediction. In fact, two summary statistics of site-level features (miR_match_P01.min and miR_match_P15.min) are selected as the UTR-level features, confirming the importance of complementary matching in both the seed region and positions 13 to 16 in the central region that were observed by others experimentally ,. The finding that miR_match_P15 is an important feature has prompted us to conduct more detailed analysis on the type of binding between miRNA and targets at this position. As shown in Additional file 12: Figure S7, this position has slightly more match (23% G-C matches and31% A-U matches) than the no-match cases (15% G-U wobbles, 27% mismatches and 3% gaps), supporting the result that miR_match_P15 is an important feature in the model.
Comparison to SVMicrO
SVMicrO is a recent SVM based miRNA target prediction tool that showed superior performance to earlier methods such as TargetScan and PicTar . Compared to the other programs discussed in this paper, mirMark is algorithmically more similar to SVMicrO. Both methods start with a large variety of features and use feature selection methods to select a smaller subset of features for use in the site- and UTR-level classifiers. mirMark predictors and SVMicrO share a common structure of using MiRanda to identify CTSs and using machine learning methods to train site- and UTR-level classifiers.
Unlike the mock miRNA approach for generating negative data in mirMark, SVMicrO created a negative dataset based on genes that positively correlated to miRNAs in miRNA expression microarray experiments . However, compared to the mock miRNA approach, SVMicrO's approach may be biased to experimental conditions, as well as too restrictive since there may exist many true negative data that are not positively correlated in microarray experiments. These may also explain our observations that SVMicrO performs better than TargetScan at the site level (Figure 4), but not at the UTR level (Figure 6).
Besides the different datasets from SVMicrO, mirMark has improved method design, which may also lead to the significantly better performance than SVMicro. SVMicrO uses MiRanda to identify potential seed matches. This prevents SVMicrO from identifying target sites that have a weak seed match, such as 3' compensatory and centered sites ,. Thus one improvement of the mirMark predictors is the use of MiRanda to identify full CTSs in order to find strong binding regions overall but not just in the seed region. Another improvement arises from the much more features considered by mirMark. At the UTR level, the feature selection conducted for SVMicrO only consists of 60 features relating to the total number of predicted target sites of particular seed types, the top score of the sites provided by the site-level classifier, and the density of the predicted sites . Whereas the feature selection conducted for mirMark casts a wider net of 624 features, including summary statistics of every site-level feature considered. This allows the selection of a subset of UTR-level features that are more predictive than those of SVMicrO.
Potential limitations of mirMark and future work
As mentioned earlier, mirMark is built on the machine learning approach, thus the results of the model are dependent on input data, like all statistical models. Positive data for mirMark are obtained through the combined results from miRecords and a stringent selection of miRTarBase. Unavoidably, the miRNA and target interactions from these databases may simply reflect the miRNAs and genes of interest to the experimentalists who performed the validation , and they may not be representative of the landscape of target interactions in general. Furthermore, the choice of the negative dataset potentially introduces bias in the model. mirMark uses the mock miRNA approach for negative dataset generation. Mock miRNAs are in silico constructions and are not found in nature, according to current knowledge of miRNAs, it is possible that they have different sequences and properties compared to `true negative miRNAs', which we do not know the complete set yet. We used the mock miRNA to pair with true positive targets in the generation of negative data, thus any potential target bias from the positive dataset is carried over to the negative dataset as well. Additionally, miRanda is used to find candidate target sites with seed matching in both positive and negative datasets of mirMark, therefore the selected features may be biased against seed matching but favor other features that are not used by miRanda filtering. This could explain why Total_AU, the number of AU base matches between the miRNA and the target, is selected in our site-level classifier and has better relevance to the classification outcome than other features that are related to seed matching.
While recognizing the potential problems due to the input data, we are optimistic that the machine learning approach is the state-of-art methodology for more accurate miRNA target prediction. The quality and quantity of training data are continuously improving, as more and more miRNA-target interaction data are recorded by databases such as miRecords and miRTarBase. We plan to maintain and update mirMark regularly as new training data become available. Moreover, we will expand mirMark from predicting human to other species, such as mouse, in the near future.
A new site- and UTR-level miRNA target tool, mirMark, is proposed. It initially considers an extensive list of over 700 features. This list is narrowed down to find the sets of the most relevant and minimally redundant features using feature selection. Evaluation of mirMark at the site and UTR levels reveals the overall superior performance of the random forest classification method. Furthermore, mirMark shows significant improvement in predictive performance compared to existing publically available methods for human miRNA target prediction.
This work was supported by NIH/NIGMS P30 GM103341, P20 GM103457, and NIEHS K01 ES025434 to LXG. The authors would like to thank Dr. Guylaine Poisson and Dr. Kyungim Baek for giving MM the opportunity to collaborate with LXG.
- Wilbert ML, Yeo GW: Genome wide approaches in the study of microRNA biology. Wiley Interdiscip Rev Syst Biol Med. 2011, 3: 491-512. 10.1002/wsbm.128.PubMedPubMed CentralView ArticleGoogle Scholar
- Thomas M, Lieberman J, Lal A: Desperately seeking microRNA targets. Nat Struct Mol Biol. 2010, 17: 1169-1174. 10.1038/nsmb.1921.PubMedView ArticleGoogle Scholar
- Lewis BP, Burge CB, Bartel DP: Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005, 120: 15-20. 10.1016/j.cell.2004.12.035.PubMedView ArticleGoogle Scholar
- Friedman RC, Farh KK, Burge CB, Bartel DP: Most mammalian mRNAs are conserved targets of microRNAs. Genome Res. 2009, 19: 92-105. 10.1101/gr.082701.108.PubMedPubMed CentralView ArticleGoogle Scholar
- Hünten S, Siemens H, Kaller M, Hermeking H: The p53/microRNA Network in Cancer: Experimental and Bioinformatics Approaches. MicroRNA Cancer Regulation. 2013, Springer, Dordrecht, 77-101. 10.1007/978-94-007-5590-1_5.View ArticleGoogle Scholar
- Hata A: Functions of microRNAs in cardiovascular biology and disease. Annu Rev Physiol. 2013, 75: 69-93. 10.1146/annurev-physiol-030212-183737.PubMedView ArticleGoogle Scholar
- Wang KC, Garmire LX, Young A, Nguyen P, Trinh A, Subramaniam S, Wang N, Shyy JY, Li YS, Chien S: Role of microRNA-23b in flow-regulation of Rb phosphorylation and endothelial cell growth. Proc Natl Acad Sci U S A. 2010, 107: 3234-3239. 10.1073/pnas.0914825107.PubMedPubMed CentralView ArticleGoogle Scholar
- Witkos T, Koscianska E, Krzyzosiak W: Practical aspects of microRNA target prediction. Curr Mol Med. 2011, 11: 93-10.2174/156652411794859250.PubMedPubMed CentralView ArticleGoogle Scholar
- John B, Enright AJ, Aravin A, Tuschl T, Sander C, Marks DS: Human microRNA targets. PLoS Biol. 2004, 2: e363-10.1371/journal.pbio.0020363.PubMedPubMed CentralView ArticleGoogle Scholar
- Grimson A, Farh KK-H, Johnston WK, Garrett-Engele P, Lim LP, Bartel DP: MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell. 2007, 27: 91-105. 10.1016/j.molcel.2007.06.017.PubMedPubMed CentralView ArticleGoogle Scholar
- Garcia DM, Baek D, Shin C, Bell GW, Grimson A, Bartel DP: Weak seed-pairing stability and high target-site abundance decrease the proficiency of lsy-6 and other microRNAs. Nat Struct Mol Biol. 2011, 18: 1139-1146. 10.1038/nsmb.2115.PubMedPubMed CentralView ArticleGoogle Scholar
- Krüger J, Rehmsmeier M: RNAhybrid: microRNA target prediction easy, fast and flexible. Nucleic Acids Res. 2006, 34: W451-W454. 10.1093/nar/gkl243.PubMedPubMed CentralView ArticleGoogle Scholar
- Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E: The role of site accessibility in microRNA target recognition. Nat Genet. 2007, 39: 1278-1284. 10.1038/ng2135.PubMedView ArticleGoogle Scholar
- Ritchie W, Flamant S, Rasko JE: mimiRNA: a microRNA expression profiler and classification resource designed to identify functional correlations between microRNAs and their targets. Bioinformatics. 2010, 26: 223-227. 10.1093/bioinformatics/btp649.PubMedView ArticleGoogle Scholar
- Vergoulis T, Vlachos IS, Alexiou P, Georgakilas G, Maragkakis M, Reczko M, Gerangelos S, Koziris N, Dalamagas T, Hatzigeorgiou AG: TarBase 6.0: capturing the exponential growth of miRNA targets with experimental support. Nucleic Acids Res. 2012, 40: D222-D229. 10.1093/nar/gkr1161.PubMedPubMed CentralView ArticleGoogle Scholar
- Yue D, Liu H, Huang Y: Survey of computational algorithms for microRNA target prediction. Curr Genomics. 2009, 10: 478-10.2174/138920209789208219.PubMedPubMed CentralView ArticleGoogle Scholar
- Sun K, Chen XN, Jiang PY, Song XF, Wang HT, Sun H: iSeeRNA: identification of long intergenic non-coding RNA transcripts from transcriptome sequencing data. BMC Genomics. 2013, 14: S7-PubMedPubMed CentralGoogle Scholar
- Liu H, Yue D, Chen Y, Gao S-J, Huang Y: Improving performance of mammalian microRNA target prediction. BMC Bioinformatics. 2010, 11: 476-10.1186/1471-2105-11-476.PubMedPubMed CentralView ArticleGoogle Scholar
- Vapnik VN: Statistical learning theory. 1998, Wiley, New YorkGoogle Scholar
- Reczko M, Maragkakis M, Alexiou P, Papadopoulos GL, Hatzigeorgiou AG: Accurate microRNA target prediction using detailed binding site accessibility and machine learning on proteomics data. Front Genet. 2011, 2: Article 103:1-13-Google Scholar
- Marín RM, Šulc M, Vaníček J: Searching the coding region for microRNA targets. RNA. 2013, 19: 467-474. 10.1261/rna.035634.112.PubMedPubMed CentralView ArticleGoogle Scholar
- Yue D, Meng J, Lu M, Chen CP, Guo M, Huang Y: Understanding MicroRNA regulation: a computational perspective. IEEE Signal Process Mag. 2012, 29: 77-88. 10.1109/MSP.2011.943013.View ArticleGoogle Scholar
- Xiao F, Zuo Z, Cai G, Kang S, Gao X, Li T: miRecords: an integrated resource for microRNA-target interactions. Nucleic Acids Res. 2009, 37: D105-D110. 10.1093/nar/gkn851.PubMedPubMed CentralView ArticleGoogle Scholar
- Hsu SD, Tseng YT, Shrestha S, Lin YL, Khaleel A, Chou CH, Chu CF, Huang HY, Lin CM, Ho SY, Jian TY, Lin FM, Chang TH, Weng SL, Liao KW, Liao IE, Liu CC, Huang HD: miRTarBase update 2014: an information resource for experimentally validated miRNA-target interactions. Nucleic Acids Res. 2014, 42: D78-D85. 10.1093/nar/gkt1266.PubMedPubMed CentralView ArticleGoogle Scholar
- Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ: miRBase: tools for microRNA genomics. Nucleic Acids Res. 2008, 36: D154-D158. 10.1093/nar/gkm952.PubMedPubMed CentralView ArticleGoogle Scholar
- Maragkakis M, Alexiou P, Papadopoulos GL, Reczko M, Dalamagas T, Giannopoulos G, Goumas G, Koukis E, Kourtis K, Simossis VA: Accurate microRNA target prediction correlates with protein repression levels. BMC Bioinformatics. 2009, 10: 295-10.1186/1471-2105-10-295.PubMedPubMed CentralView ArticleGoogle Scholar
- Knuth DE: The Art of Computer Programming: Seminumerical Algorithms II. 2014, Addison-Wesley, Boston, MAGoogle Scholar
- Lorenz R, Bernhart SH, Zu Siederdissen CH, Tafer H, Flamm C, Stadler PF, Hofacker IL: ViennaRNA Package 20. Algorithm Mol Biol. 2011, 6: 26-10.1186/1748-7188-6-26.View ArticleGoogle Scholar
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H: The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002, 12: 1611-1618. 10.1101/gr.361602.PubMedPubMed CentralView ArticleGoogle Scholar
- Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A: Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010, 20: 110-121. 10.1101/gr.097857.109.PubMedPubMed CentralView ArticleGoogle Scholar
- Hall MA, Smith LA: Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. FLAIRS Conference. Edited by: Kumar AN, Russell I. 1999, AAAI, Orlando, FL, 235-239.Google Scholar
- Peng H, Long F, Ding C: Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005, 27: 1226-1238. 10.1109/TPAMI.2005.159.PubMedView ArticleGoogle Scholar
- Linfoot E: An informational measure of correlation. Inform Contr. 1957, 1: 85-89. 10.1016/S0019-9958(57)90116-X.View ArticleGoogle Scholar
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter. 2009, 11: 10-18. 10.1145/1656274.1656278.View ArticleGoogle Scholar
- Hausser J, Strimmer K: Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks. JMLR. 2009, 10: 1469-1484.Google Scholar
- Freedman D, Diaconis P: On the histogram as a density estimator: L 2 theory. Probab Theor Relat Field. 1981, 57: 453-476.Google Scholar
- MirMark github. , [https://github.com/lanagarmire/MirMark]
- MirMark Garmire group. , [http://www2.hawaii.edu/~lgarmire/software.html]
- Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, Rothballer A, Ascano M, Jungkamp AC, Munschauer M, Ulrich A, Wardle GS, Dewell S, Zavolan M, Tuschl T: Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell. 2010, 141: 129-141. 10.1016/j.cell.2010.03.009.PubMedPubMed CentralView ArticleGoogle Scholar
- Kishore S, Jaskiewicz L, Burger L, Hausser J, Khorshid M, Zavolan M: A quantitative analysis of CLIP methods for identifying binding sites of RNA-binding proteins. Nat Methods. 2011, 8: 559-564. 10.1038/nmeth.1608.PubMedView ArticleGoogle Scholar
- Marín RM, Voellmy F, von Erlach T, Vaníček J: Analysis of the accessibility of CLIP bound sites reveals that nucleation of the miRNA: mRNA pairing occurs preferentially at the 3'-end of the seed match. RNA. 2012, 18: 1760-1770. 10.1261/rna.033282.112.PubMedPubMed CentralView ArticleGoogle Scholar
- Shin C, Nam J-W, Farh KK-H, Chiang HR, Shkumatava A, Bartel DP: Expanding the microRNA targeting code: functional sites with centered pairing. Mol Cell. 2010, 38: 789-802. 10.1016/j.molcel.2010.06.005.PubMedPubMed CentralView ArticleGoogle Scholar
- Casella G, Berger RL: Statistical inference. 1990, Duxbury Press, Belmont, CAGoogle Scholar
- Duda H, Hart P, Stork DG: Pattern Classification. 2001, John Wiley & Sons, OxfordGoogle Scholar
- Zhang H: The optimality of naive Bayes. Ann Rev Mar Sci. 2004, 1: 3-Google Scholar
- Breiman L: Random forests. Mach Learn. 2001, 45: 5-32. 10.1023/A:1010933404324.View ArticleGoogle Scholar
- Platt J: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers. Edited by: Smola AJ, Bartlett P, Scholkopf B, Schuurmans D. 1999, MIT Press, Cambridge, MA, 61-74.Google Scholar
- Ritchie W, Gao D, Rasko JE: Defining and providing robust controls for microRNA prediction. Bioinformatics. 2012, 28: 1058-1061. 10.1093/bioinformatics/bts114.PubMedView ArticleGoogle Scholar
- Tipping ME: Sparse Bayesian learning and the relevance vector machine. JMLR. 2001, 1: 211-244.Google Scholar
- Gäken J, Mohamedali AM, Jiang J, Malik F, Stangl D, Smith AE, Chrois C, Kulasekararaj AG, Thomas NSB, Farzaneh F: A functional assay for microRNA target identification and validation. Nucleic Acids Res. 2012, 40: e75-e75. 10.1093/nar/gks145.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.