Benchmarking mutation effect prediction algorithms using functionally validated cancer-related missense mutations

Background Massively parallel sequencing studies have led to the identification of a large number of mutations present in a minority of cancers of a given site. Hence, methods to identify the likely pathogenic mutations that are worth exploring experimentally and clinically are required. We sought to compare the performance of 15 mutation effect prediction algorithms and their agreement. As a hypothesis-generating aim, we sought to define whether combinations of prediction algorithms would improve the functional effect predictions of specific mutations. Results Literature and database mining of single nucleotide variants (SNVs) affecting 15 cancer genes was performed to identify mutations supported by functional evidence or hereditary disease association to be classified either as non-neutral (n = 849) or neutral (n = 140) with respect to their impact on protein function. These SNVs were employed to test the performance of 15 mutation effect prediction algorithms. The accuracy of the prediction algorithms varies considerably. Although all algorithms perform consistently well in terms of positive predictive value, their negative predictive value varies substantially. Cancer-specific mutation effect predictors display no-to-almost perfect agreement in their predictions of these SNVs, whereas the non-cancer-specific predictors showed no-to-moderate agreement. Combinations of predictors modestly improve accuracy and significantly improve negative predictive values. Conclusions The information provided by mutation effect predictors is not equivalent. No algorithm is able to predict sufficiently accurately SNVs that should be taken forward for experimental or clinical testing. Combining algorithms aggregates orthogonal information and may result in improvements in the negative predictive value of mutation effect predictions. Electronic supplementary material The online version of this article (doi:10.1186/s13059-014-0484-1) contains supplementary material, which is available to authorized users.


Background
Massively parallel sequencing studies have demonstrated that tumors can be regarded as genetically heterogeneous populations of individual clones that accumulate mutations during the process of tumorigenesis and tumor progression [1].These mutations, likely the result of genetic instability, may confer a selective growth advantage and be causally implicated in carcinogenesis (that is, driver mutations), or are either selectively neutral (that is, passenger mutations) or deleterious for the cancer cells and eventually purged [2,3].
Recent advances in nucleic acid sequencing technologies now provide the means to explore whole genomes at base-pair resolution [4].The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC) and endeavors led by individual investigators have demonstrated that the repertoire of genes affected by highly recurrent mutations is limited and that there is a large collection of genes affected by mutations in 1% to 2% of cancers from a given anatomical site [2,4,5].Although defining driver mutations based on the presence of hotspot mutations and recurrence rates has resulted in the identification of bona fide oncogenes and tumor suppressor genes (TSGs) and a partial repertoire of genes significantly mutated in cancer [6][7][8], this strategy cannot be readily applied to the study of the genes affected by mutations in a minority of tumors of a given anatomical site.In fact, recent studies have demonstrated that some of these mutations are of functional significance and likely constitute bona fide drivers, therapeutic targets, or mechanisms of therapy resistance (for example, ERBB2 and ESR1 activating mutations in breast cancer) [9][10][11][12].
Defining whether a non-hotspot mutation is biologically and/or clinically relevant is by no means a trivial task, in particular for missense mutations, and often laborious functional assays need to be performed [9][10][11][12].Given the vast number of mutations being identified by massively parallel sequencing efforts, finding ways to prioritize which mutations should be subjected to functional analysis is crucial.Computational methods to discern which somatic mutations likely result in amino acid changes that could have biologic implications have been developed [13].Most of these algorithms rely on the assumption that protein sequences derived from existing living organisms have survived natural selection [14], and many also utilize sequence, structural information, and/or protein annotation (that is, whether a mutation affects an active site, ligand binding domain, disulfide bridges, or protein-protein interactions) to differentiate mutations that result in no or negligible impact on protein function from those that are likely pathogenic.Prediction is feasible because mutations that affect protein function tend to occur at evolutionarily conserved sites [15].Examples of such computational prediction methods (Additional file 1) are Sorted Intolerant From Tolerant (SIFT) [16], PolyPhen-2 [17], Mutation Assessor [18], CONsensus DELeteriousness score of missense mutations (Condel) [19], Cancer-specific High-throughput Annotation of Somatic Mutations (CHASM) [20], Protein Variation Effect Analyzer (PROVEAN) [14], Functional Analysis Through Hidden Markov Models (FATHMM) [21], Variant Effect Scoring Tool (VEST) [22], Mutation-Taster [23], Cancer Driver Annotation (CanDrA) [24], and others [25].Additionally, CHASM, FATHMM, and CanDrA were developed explicitly to differentiate mutations that are likely to constitute cancer drivers from passengers.In particular, FATHMM is a species-independent method, which incorporates pathogenicity weights and is capable of recognizing protein domains (species-independent/evolutionary units) sensitive to missense mutations [21].CHASM [20] is a machine-learning system trained using the information from the Catalogue Of Somatic Mutations In Cancer (COSMIC) [26] and other cancer-related databases, and utilizes a set of 49 predictive features, including the frequency of a given missense change type in COSMIC.Cancer-Related Analysis of VAriants Toolkit (CRAVAT) is a web-based application for CHASM that provides a simple interface to prioritize genes and variants important for specific cancer tissue types [22].CanDrA is a support vector machine method that renders predictions from a set of 95 features and scores computed by 10 other prediction algorithms [24].While most of these predictors are single/independent predictors, Condel and CanDrA make use of scores generated by other algorithms and, therefore, can be considered meta-predictors (Additional file 1).
These predictors provide a fast and inexpensive way to define functional annotation and to predict the effects of mutations, and could theoretically be employed to assist in the selection of mutations that would be worth exploring experimentally and clinically.Different predictors have been designed based on different algorithms (Additional file 1) and, most importantly, were trained using different sets of functional and neutral mutations.As a consequence of the differences in the underlying methodology, these predictors often return dissimilar or even contradictory results [27].Therefore, we sought to benchmark the performance of 15 mutation effect prediction algorithms comprehensively using a set of missense mutations whose functional effects have been experimentally validated and/or that have been shown to result in early onset breast and ovarian cancer syndrome, Li-Fraumeni syndrome or Li-Fraumeni-like syndrome.To generate a list of neutral and non-neutral mutations, we rigorously compiled a set of mutations in bona fide oncogenes, recently described cancer genes and bona fide TSGs by mining the literature and mutation databases (see Methods) [28][29][30].As a hypothesis-generating aim and using our 'gold standard' list of validated mutations, we sought to define whether the mutation effect predictions made by combinations of algorithms would outperform those made by individual predictors or meta-predictors.

Categorization of mutations based on functional evidence
We included known missense mutations in six bona fide oncogenes (BRAF, KIT, PIK3CA, KRAS, EGFR, ERRB2), six recently described cancer genes (ESR1, DICER1, MYOD1, IDH1, IDH2, SF3B1) and three bona fide TSGs (TP53, BRCA1, BRCA2) in this study.We next performed an exhaustive search in the literature and/or existing databases to gather functional evidence for each of the 3,706 mutations compiled for the 15 genes (see Methods; Additional file 2).Given that PolyPhen-2, MutationTaster, CanDrA, and Condel can only define the potential functional impact of single nucleotide variants (SNVs), dinucleotide and trinucleotide changes were excluded from this study.The final dataset employed consists of 3,591 SNVs (Table 1; Additional file 2).
SNVs with experimentally validated effects on the target protein function or proven to be causative of Li-Fraumeni syndrome, Li-Fraumeni-like syndrome, or early onset breast and ovarian cancer syndrome were considered non-neutral, those that have been experimentally validated as nonfunctional or proven not to be causative of Li-Fraumeni syndrome, Li-Fraumeni-like syndrome, and early onset breast and ovarian cancer syndrome as neutral, and those without definitive experimental validation or considered germline variants of unknown significance were considered uncertain (see Methods).
Using these criteria, 849 SNVs were categorized as nonneutral, 140 SNVs were assigned to the neutral category, and the remaining 2,602 were regarded as uncertain (Table 1; Additional file 2).Of the neutral and non-neutral mutations (n = 989), we collected a median of 28.5 SNVs (range, 23 to 38) for the bona fide oncogenes, a median of five SNVs (range, 1 to 11) for the new cancer genes, and a median of 81 SNVs (range, 62 to 640) for the bona fide TSGs (Table 1).
To evaluate the inter-rater agreement between prediction methods, we performed unsupervised clustering of the calls made for all 3,591 SNVs by each predictor and calculated pairwise unweighted Cohen's Kappa coefficients for each pair of predictors.Unsupervised clustering of the results of the mutation effect prediction algorithms revealed two main groups with an additional outlier CanDrA (breast) (Figure 1A).One of the clusters (referred to as 'Cluster 1') contained all but one of the cancer-specific predictors, namely CHASM (breast), CHASM (lung), CHASM (melanoma), FATHMM (cancer), CanDrA (lung) and CanDrA (melanoma), and the non-cancer-specific predictor FATHMM (missense) and its related metapredictor Condel (Figure 1A; Additional file 1).Their pairwise unweighted Kappa coefficients showed fair-toalmost perfect agreement (median unweighted κ = 0.5679, range κ = 0.3861 to 0.9004; Figure 1B; Additional file 3).CanDrA (breast) was the sole cancer-specific predictor that did not belong to this cluster.The best agreement within this group was between CHASM (breast) and CHASM (lung) (κ = 0.9004), which is not surprising considering that they share the underlying prediction engine.The second cluster (referred to as 'Cluster 2') was composed of non-cancer-specific mutation effect prediction algorithms, namely Mutation Assessor, MutationTaster, PolyPhen-2, PROVEAN, SIFT, and VEST (Figure 1A).The pairwise agreement between these predictors ranged from fair to moderate (median unweighted κ = 0.4347, range κ = 0.3355 to 0.5662, Figure 1B; Additional file 3).The modest agreement between these predictors was surprising, given that conservation is a feature employed by all algorithms (Additional file 1), and the sole feature employed by Mutation Assessor and SIFT [27].Overall, the agreement between predictors from distinct clusters A total of 3,591 single nucleotide variants (SNVs) in six bona fide oncogenes, six new cancer genes and three bona fide tumor suppressor genes were assessed for the evidence supporting a functional role for each of the mutations.These SNVs were classified as non-neutral, neutral or uncertain based on direct experimental/functional data in the literature and/or on the basis of causation of Li-Fraumeni syndrome and Li-Fraumeni-like syndrome (for TP53) or early onset breast and ovarian cancer syndrome (for BRCA1 and BRCA2), as recorded in dedicated mutation databases [28][29][30].For a detailed list, see Additional file 2. Single nucleotide variants (SNVs) were classified as non-neutral, neutral or uncertain based on functional/experimental data from the literature or mutation databases [28][29][30].Each SNV was classified by each of the mutation effect prediction algorithms independently.
ranged from no to fair agreement (median unweighted κ = 0.2062, range κ = -0.0520 to 0.4216, Figure 1B; Additional file 3).CanDrA (breast) was an outlier and displayed no-to-slight agreement with each of the other predictors (Figure 1B; Additional file 3).We repeated the same comparisons employing only the 989 SNVs considered to be non-neutral or neutral on the basis of functional analyses, which revealed similar results, with the exception of CanDrA (breast), which now belongs to Cluster 1 (Additional files 4 and 5).
As the cancer-specific CanDrA, CHASM, and FATHMM (cancer) are all trained using training sets consisting of canonical somatic SNVs and their frequencies, we evaluated the inter-rater agreement of all predictors after excluding from our dataset the SNVs found in the COSMIC database (v68).Of the 3,591 SNVs, 1,699 (47.3%) were not present in COSMIC, of which 297 were experimentally validated as either non-neutral or neutral (Additional files 6 and 7).Akin to the analysis including all SNVs, unsupervised clustering of the predictions made for non-COSMIC SNVs demonstrated that the two main clusters and their compositions were largely maintained (Figure 1C).In this analysis, not only CanDrA (breast) but also VEST emerged as outliers, clustering separately from the two main clusters (Figure 1C).Compared to the Kappa values obtained using all SNVs, when employing only non-COSMIC SNVs, we observed that median Kappa coefficients in Cluster 2 remained largely unchanged, whereas the median unweighted Kappa scores within Cluster 1 decreased from κ = 0.5679 to 0.4558 (Figure 1D; Additional file 3).These data provide evidence to suggest that the agreement between predictors in Cluster 1 is reduced when SNVs present in the COSMIC database were removed given that some mutation effect predictors from Cluster 1 were trained using SNVs included in COSMIC.
It could be hypothesized that the discrepancies in the predictions made by different mutation effect prediction algorithms would predominantly affect SNVs whose classifications are based on predictions of relatively poor confidence.CanDrA (breast), CanDrA (lung), CanDrA (melanoma), PolyPhen-2, and Mutation Assessor have pre-specified categories that identify SNVs whose predictions are based on limited confidence.For the other predictors, we employed a heuristic approach based on the original description of each predictor and additional online sources to define a category of SNVs whose predictions were of poor confidence (see Methods).We classified the 3,591 SNVs included in this study into non-neutral, neutral, and low confidence categories (Additional file 8), and observed that a median of 437 (range, 44 to 1,298) SNVs were classified as of low confidence.CanDrA (breast) classified only 44 SNVs as of low confidence, whereas PROVEAN classified 1,298 in this category.These marked differences may be a mere reflection of the different cutoffs chosen; however, when only the predictors that have a predefined low confidence category (that is, CanDrA (breast), CanDrA (lung), CanDrA (melanoma), PolyPhen-2, and Mutation Assessor) were assessed, the number of SNVs classified as such ranged from 44 (CanDrA (breast)) to 1,099 (Mutation Assessor).In fact, only 27 SNVs had a majority vote of low confidence (that is, eight or more predictors classifying a given SNV as of low confidence; Additional file 9).Hierarchical clustering of the predictions for the 3,591 SNVs including a low confidence category revealed a cluster structure similar to that obtained with only non-neutral and neutral categories (that is, Cluster 1 enriched for cancer-specific predictors and Cluster 2 exclusively composed of non-cancerspecific predictors), however PROVEAN clustered in a separate branch from Cluster 2. Noteworthy, the Cohen's Kappa coefficients were lower than those observed when SNVs were classified into two categories only (that is, as neutral or non-neutral; Additional files 10 and 11).When these analyses were repeated including only the non-COSMIC SNVs, the results of the clustering analysis were similar, however the agreement between predictors was reduced even further (Additional files 10 and 11).By focusing only on the 989 SNVs with functional evidence to classify them as neutral or non-neutral, the addition of a low confidence category resulted again in a similar cluster structure, however PROVEAN was an outlier.This observation was likely due to the fact that 132 of the 989 (13.3%)SNVs were classified as of low confidence by PROVEAN only (Additional file 12).As compared to the Cohen's Kappa coefficients obtained with two categories, the analysis of agreement between predictors for the classification of these 989 SNVs was generally lower when the low confidence category was included (Additional file 13).Similar results were obtained when the subset of 297 non-COSMIC SNVs were analyzed (Additional files 12 and 13).

Performance of 15 commonly used mutation effect prediction algorithms
Among the SNVs included in this study, 989 had sufficient functional evidence to support their classification as either non-neutral (n = 849) or neutral (n = 140) with respect to an effect on protein function (Table 1).Hence, the performance of the predictors was assessed using these validated SNVs.Accuracy, specificity, sensitivity, positive predictive value (PPV), negative predictive value (NPV), and composite score were calculated to evaluate the performance of each predictor (see Methods).This analysis revealed that the proportion of SNVs correctly classified by the different predictors varied considerably (median, 85.84%; range, 73.71% to 91.28%; Figure 2A, Table 3).
Intuitively, the best algorithm would have high and balanced values for each of the performance statistics.With this rationale, we calculated a 'composite score', ranging from 0 to 4, by summing up the sensitivity, specificity, PPV, and NPV of each predictor as an overall performance indicator for each predictor.The median composite score was 2.7866 (range, 1.8056 to 3.3413; Figure 2A, Table 3).Using this parameter, the best-performing single predictor was CHASM (lung; 3.2751, 95% CI 3.1418-3.3988)but it was not significantly different from the best metapredictor CanDrA (lung) in this set of SNVs (3.3413, 95% CI 3.2103-3.4747,P >0.05, Figure 2A, Table 3).
We next performed the same analysis using only the 297 SNVs not included in the COSMIC database.In this analysis, the median accuracy and sensitivity were 76.77% (range, 54.88% to 84.70%) and 79.26% (range, 51.06% to 97.85%), respectively (Figure 2B, Table 4).The most accurate single and meta-predictors in this context were MutationTaster (84.51%, 95% CI 80.13-88.55%)and CanDrA (lung; 84.70%, 95% CI 80.50-88.65%),respectively (Figure 2B, Table 4).As compared to the analysis including all SNVs, when excluding SNVs present in the COSMIC database, the accuracy of all predictors but Mutation Assessor and MutationTaster were statistically significantly reduced (P <0.05).Furthermore, eight of 15 mutation effect prediction algorithms showed statistically significant reduction in sensitivity and 10 of 15 showed a statistically significant reduction in PPV (Figure 2B; Table 4).
To assess whether the mutation effect prediction algorithms would have different performances when SNVs in bona fide oncogenes, bona fide TSGs or new cancer genes were considered, we selected from the set of 989 SNVs those found in bona fide oncogenes, bona fide TSGs or new cancer genes (n = 176, n = 783 or n = 30 SNVs, respectively; Additional file 14    Based on the prediction results of non-neutral and neutral single nucleotide variants not present in the COSMIC database, the accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and composite score for each predictor were computed.The 95% confidence intervals (CI) generated by bootstrapping are shown in parentheses.
files 15 and 16).On the other hand, FATHMM (cancer), MutationTaster, PolyPhen-2, PROVEAN, and VEST performed better for SNVs in oncogenes than in TSGs (P <0.05).These results suggest that some of the predictors showed preferential performance towards SNVs in either oncogenes or TSGs; alternatively, these differences in performance may stem from the fact that there was a statistically significant difference in the proportion of neutral SNVs in oncogenes as compared to TSGs (Fisher's exact test, two-tailed, P <0.0001).The same comparisons could not be performed for SNVs affecting new cancer genes, as no evidence to support a neutral classification for SNVs affecting these genes was obtained in the literature search, reflecting the relative novelty of these SNVs.Taken together, these results demonstrate that mutation effect prediction algorithms are not equivalent for the classification of individual SNVs, that the predictions from some algorithms may be tumor tissue dependent, and that some may have a better performance for the identification of neutral than non-neutral SNVs.

Combination of mutation effect prediction algorithms
To evaluate whether combinations of single predictors would result in an improvement of the predictions made in this dataset, we generated 11,253 combinations by using n (n =2, 3, … 11) single predictors at a time, with a given mutation being considered non-neutral if at least p (p =1, 2, … 11) predictors called it non-neutral for all combinations of n and p.We computed the performance statistics and their confidence intervals for each of these combinations from 1,000 random subsets comprising two-thirds and one-third of the total dataset (referred to as 'subset 1' and 'subset 2', respectively; Additional file 17).We ranked the combinations based on mean accuracy or based on mean composite score for subsets 1 and 2 separately, and compared their performance to the bestperforming single and meta-predictors, respectively.Furthermore, as excluding SNVs not present in the COSMIC database had a significant impact on the performance of many predictors and their pairwise agreement, we also performed the same experiment using only non-COSMIC SNVs (Additional file 18).
Of the 11,253 possible single predictor combinations we evaluated using all SNVs, 1,854 predictor combinations were found to have a numerically higher mean accuracy in subsets 1 and 2 than the most accurate single predictor (that is, FATHMM (cancer)).Six of these combinations were concurrently significantly more accurate than the most accurate single predictor (that is, FATHMM (cancer)) in both subsets (Additional files 19, 20, and 21).When ranking the predictor combinations according to composite score, 1,483 combinations had numerically higher mean composite scores in subsets 1 and 2 than the composite score of the best single predictor (that is, CHASM (lung)); five of these showed statistically significantly higher composite scores than those of CHASM (lung) in both subsets (Figure 3, Additional files 19 and 21).The mutation effect prediction algorithm combination that resulted in a significant increase in both accuracy and composite score as compared to the best single predictor and meta-predictor in both subsets was CHASM (breast) and MutationTaster.This predictor combination called a given SNV non-neutral if at least one of CHASM (breast) and MutationTaster called it non-neutral, and it was ranked first in terms of both accuracy and composite score in subsets 1 and 2 independently (in subset 1: 95.46%, 95% CI 94.54-96.51%and 3.6255, 95% CI 3.5584-3.6982; in subset 2: 95.43%, 95% CI 93.33-97.27%,and 3.6236, 95% CI 3.4841-3.7573,respectively; Additional file 21).Similar observations were made when SNVs found in the COSMIC database were excluded.Only the CHASM (breast) and MutationTaster predictor combination outperformed the most accurate single predictor MutationTaster consistently in both subsets (Figure 3; Additional files 19, 20, and 22).
Although mutation effect prediction algorithm combinations had a relatively limited impact on accuracy and composite score, some predictor combinations significantly improved the NPV as compared to the best single and meta-predictor (Figure 3, Additional files 20, 21, and 22).Again, the CHASM (breast) and MutationTaster predictor combination resulted in a significant improvement in NPV as compared to the NPV of the best single predictor or the best meta-predictor in all subsets.When analyzing the top 10, top 20, top 50, and top 100 combinations of mutation effect prediction algorithms, we noted that MutationTaster, CHASM (breast), and CHASM (lung) were consistently present in the top performing predictor combinations in subsets 1 and 2 using the 989 functionally validated SNVs, irrespective of whether the combination predictor performance was ranked according to accuracy or composite score (Figure 4; Additional file 23).When only the non-COSMIC SNVs were included in the analysis, the same mutation effect prediction algorithms were consistently present in the best performing mutation effect prediction algorithm combinations (Figure 4; Additional file 23).
While the most consistently accurate predictor combination called a given mutation non-neutral in this dataset if at least one of CHASM (breast) and Mutation-Taster called it non-neutral, we also evaluated whether there were optimal combinations of n and p.In both subsets, for any given n (1 ≤ n ≤4), the highest accuracy was achieved when p ≈ 2n (Additional files 24 and 25 for subset 1; data for subset 2 not shown).Similarly, for any given p (3 ≤ p ≤11), the optimal n was approximately p/2 (Additional files 24 and 25 for subset 1; data for subset 2 not shown).Similar observations could be made for the results generated by using only the non-COSMIC SNVs (Additional files 26 and 27 for subset 1; data for subset 2 not shown).
Taken together, the combination of mutation effect prediction algorithms resulted in a modest but significant improvement in accuracy and composite score.It should be noted, however, that selected combinations of mutation effect prediction algorithms significantly improved NPV.This information can be instrumental in ruling out SNVs that should not be followed up experimentally and/or clinically, given that SNVs considered neutral by these combinations have a higher probability of being genuinely neutral than those called neutral by single predictors or meta-predictors individually.

Discussion
Here we demonstrated, by using a set of extensively curated, 'gold-standard' list of mutations, that mutation effect prediction algorithms are not equivalent for the classification of individual mutations, and that the agreement between predictors is modest and dependent on the set of mutations and mutation effect type.The agreement between cancer-specific prediction algorithms, which define driver versus passenger mutations, was more consistent than that of non-cancer-specific predictors, which differentiate pathogenic versus non-pathogenic mutations.Furthermore, we observed that the predictions made by some algorithms may be tumor tissue dependent, and that others may have a better performance for the identification of neutral than non-neutral mutations.
Most of the single predictors and meta-predictors displayed very good PPVs, however their NPVs were found to be relatively low.Using a combination of single prediction algorithms, the combination of CHASM (breast) and MutationTaster resulted in a significant improvement in accuracy as compared to the accuracy obtained with the best single predictor and meta-predictor in the SNV dataset studied, however this increase was modest.Importantly, however, by using mutation effect predictor algorithm combinations, we achieved substantial statistically significant improvements in NPV.Different combinations of individual predictors including CHASM (breast) and MutationTaster were repeatedly found to have a significantly higher NPV than the best single predictor and the best meta-predictor in this dataset, while at least maintaining equivalent accuracy and composite score.In the effort to sift through lists of mutations to identify biologically interesting candidates to take forward for functional experiments, NPV is an often-overlooked measure.A high NPV allows for the exclusion of passenger or neutral alterations with greater confidence, without the risk of losing truly pathogenic mutations called neutral/passenger by a given algorithm.
Our analysis further revealed that some mutation effect prediction algorithms are dependent on the type of gene altered.In particular for the case of CanDrA metapredictor, as all but one TP53 SNVs were predicted to be 'drivers' by all CanDrA algorithms (that is, CanDrA breast/ lung/melanoma), whereas all BRCA1 and BRCA2 mutations were predicted to be 'drivers' by CanDrA (breast) but almost exclusively 'passengers/no-call' by CanDrA (lung/ melanoma).This suggests that some predictors are highly tissue-specific and users ought to employ predictors appropriate for the tumor tissue type analyzed.
Our study has several limitations, despite using a set of curated mutations in bona fide oncogenes, new cancer genes, and bona fide TSGs.First, the dataset we employed has a limited size, and neutral mutations were largely derived from TSGs, in particular BRCA1 and BRCA2, which may have caused biases in the estimation of specificity and NPV.Importantly, however, unlike previous comparisons of mutation effect prediction algorithms, this study has (See figure on previous page.)Figure 3 Performance statistics of the top five mutation effect prediction algorithm combinations as ranked by composite scores.Prediction results of the non-neutral (n = 849) and neutral (n = 140) single nucleotide variants (SNVs) in the entire dataset (A, B) and the non-neutral (n = 188) and neutral (n = 109) SNVs not present in the COSMIC dataset (C, D) are shown.Results are ranked according to the composite scores of each mutation effect prediction algorithm combination, and the corresponding accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and composite score of the top five prediction algorithm combinations in subset 1 (A, C) and subset 2 (B, D) are plotted.Error bars represent the 95% CIs generated by 1,000 random samples of subsets 1 and 2. Red bars represent predictor combinations, blue bars single/ independent predictors, and orange bars meta-predictors.Blue stars: statistically significant improvement in composite score as compared to that of the best performing single/independent predictor; orange stars: statistically significant improvement in composite score as compared to that of the best performing meta-predictor; blue triangles: statistically significant improvement in NPV as compared to that of the best performing single/independent predictor; orange triangles: statistically significant improvement in NPV as compared to that of the best performing meta-predictor.
employed rather rigorous standards to define the mutation set to be analyzed, by leveraging functional evidence from an extensive literature search and database mining.Second, mutations may be context-dependent, in that they would only elicit a phenotype under particular circumstances, such as genetic background.Hence, the number of mutations considered of unknown or indeterminate significance was high.Third, a substantial number of neutral and non-neutral mutations was obtained from datasets related to the impact of germline mutations.The actual impact of those mutations when they are found as somatic genetic alterations would require further investigation.Fourth, given that some predictors display distinct performances according to the tumor tissue type and that some SNVs included in this dataset are preferentially found in specific tumor types (for example, BRCA1 and BRCA2 mutations are more frequently found in breast and ovarian cancers), this dataset could theoretically favor mutation effect prediction algorithms that are breast cancer-specific.This was not observed, however, given that the best performing single predictors and metapredictors were not breast cancer-specific.Fifth, the agreement between mutation effect prediction algorithms employing a low confidence category in addition to the neutral and non-neutral categories of SNVs is dependent on the cut-points chosen to define predictions of low confidence.It should be noted, however, that similar results for prediction algorithms with a pre-defined low confidence category were observed as compared to those where a low confidence category was defined employing a heuristic approach based on the original descriptions and online sources of the predictors.Sixth, CHASM (breast/lung/melanoma), FATHMM (cancer/missense), and PolyPhen-2 are algorithms that employed training sets in their development and validation.Given the approach employed to create the 'gold standard' dataset used in this study, overlaps between our 'gold standard' dataset and the training sets employed for the development of these algorithms were inevitable.To minimize potential biases, we have performed all analyses after the removal of all functionally curated SNVs present in COSMIC; however, even after taking this step, a residual number of SNVs present in the original training sets remained (Additional file 28).When we compared the performance of these mutation effect predictors after the removal of SNVs present in COSMIC or in the original training sets, we either observed no significant differences in their accuracy or a lower accuracy when COSMIC SNVs were removed (Additional file 29).Finally, although experimental validation is informative, it is not often definitive.In particular for neutral effects, it is plausible that the results of such experiments are context dependent (that is, cell line or organism employed and the constellation of mutations already present in a given model).For instance, the true effect of some mutations may be conditioned by the genetic make-up (that is, quantitative trait loci, epistasis) [31,32], be only effective in a particular developmental stage [33], or be species-specific [34].On the other hand, non-neutral mutation effects do not necessarily imply causality with regards to an organismal level phenotype; many dozens of loss-of-function variants exist in healthy humans [35].

Conclusions
Our study demonstrates that the challenges researchers face at the time of analyzing massively parallel sequencing data to identify variants for further experimental studies are genuine.The information provided by mutation effect prediction algorithms is not equivalent.None of the algorithms analyzed here was found to deliver optimal accuracy, sensitivity, specificity, PPV, and NPV in the mutation dataset studied.Mutation effect predictors are not equivalent for the classification of individual mutations.The performance of some of these predictors may be dependent on tumor tissue and mutation type (that is, canonical versus non-canonical mutations, neutral versus non-neutral mutations).Combinations of mutation effect predictors, albeit providing only modest but significant improvements in the overall accuracy when compared to individual predictors or meta-predictors, were found to result in substantially improved NPVs without compromising accuracy.

Mutation sets
To standardize the procedure of compiling mutations that can be employed for the benchmarking of mutation effect predictors, mutations affecting six bona fide oncogenes (BRAF, KIT, PIK3CA, KRAS, EGFR, and ERRB2), whose mutations preferentially affect kinase domains, six recently described cancer genes (DICER1, ESR1, IDH1, IDH2, MYOD1, and SF3B1), whose mutations do not affect kinase domains, and three bona fide TSGs (TP53, BRCA1, and BRCA2) were retrieved from the TCGA Pan-Cancer dataset by Kandoth et al. [36] and from studies functionally testing mutations affecting these genes (Additional file 2).In addition, for TSGs, specific databases were employed; for TP53, the IARC database [29,37], and for BRCA1 and BRCA2, the Universal Mutation Database (UMD) [28,38,39].This mining exercise resulted in the identification of 3,706 mutations, of which 3,591 were SNVs (Table 1, Additional file 2).Given that some mutation effect prediction algorithms (that is, PolyPhen-2, MutationTaster, CanDrA, and Condel) do not process dinucleotide or trinucleotide missense mutations, and to have the same number of mutations successfully analyzed by each predictor, we have only included SNVs for the purpose of creating a mutation dataset to benchmark mutation effect predictors.SNVs were also annotated based on their presence in the COSMIC dataset v68 [26].

Literature search
Literature search was performed by four of the authors (LGM, MRDF, YZ, SP) to identify experimental evidence of functional effects of each mutation.This strategy entailed the use of Boolean logic in combination with search engines such as PubMed, ScienceDirect, Google Scholar, and MEDLINE.Search terms were combined using Boolean (that is, AND, OR) and Adjacency (that is, NEAR) operators to create search statements as well as to narrow and refine the search (for example, BRAF AND V600E AND validation/function, ERBB2 AND mutation NEAR kinase domain AND validation/function).In addition, the references listed in the papers found were also scrutinized to aid the search of additional literature in support of the findings.For TSGs, in addition to literature search, the IARC TP53 functional assessment dataset [29], UMD-BRCA1, and UMD-BRCA2 mutation databases [28] were employed to ascertain whether specific missense mutations in TP53 would be causative of Li-Fraumeni or Li-Fraumeni-like syndrome or have been functionally assessed, and mutations in BRCA1 and BRCA2 would be causative of early onset breast and ovarian cancer syndrome, respectively (see below).

Oncogenes and new cancer genes
The six bona fide oncogenes, namely BRAF, KIT, PIK3CA, KRAS, EGFR, and ERRB2 and the six new cancer genes, namely DICER1, ESR1, IDH1, IDH2, MYOD1, and SF3B1 used for the present analysis were selected on the basis of the presence and the absence of a kinase domain, respectively, and the availability of studies investigating functionally the impact of mutations affecting genes.In this literature search, direct functional evidence to determine whether a mutation was neutral (that is, passenger) or non-neutral (that is, pathogenic) was sought.The functional impact of SNVs affecting these 12 genes was categorized into six groups, namely: (1) Change in kinase, GTPase, or other enzymatic activity (for example, RNase); (2) Effect in response to ligand/ substrate, impact on downstream effectors/pathways, or cell proliferation/survival, differentiation, apoptosis; (3) Ability to immortalize or transform human or murine cells (for example, MCF10A, BaF3, NIH3T3 cell lines) and/or anchorage-independent growth; (4) Response to specific chemical/biological compounds, therapeutic agents, or temperature; (5) Tumor growth/induction in vivo (for example, xenografts, mouse/fish models), or changes in the rates of progression-free or overall survival in pre-clinical models; and (6) Changes in genome (that is, aneuploidy), epigenome (that is, methylation), transcriptome (that is, splicing), miRNA biogenesis, or DNA/RNA binding affinity (Additional file 2).SNVs affecting these genes were considered non-neutral if there was literature evidence to support their impact on at least one of the mentioned categories.When the functional testing demonstrated no significant impact on the wild-type function of the protein in at least one functional category, and/or no evidence was found for other categories, the SNVs were classified as neutral mutations.SNVs for which no reliable functional evidence or conflicting evidence was found for any of the six categories, were regarded as uncertain.
Tumor suppressor genes (TSGs) The IARC datasets 'TP53 germline mutations and family history' and 'Functional assessment of p53 mutant proteins in various experimental assays' (R17) [29,37] were employed to ascertain whether specific mutations affecting TP53 would be associated with the development of Li-Fraumeni syndrome or Li-Fraumeni-like syndrome, and/or whether specific TP53 SNVs would result in conserved wild-type function, loss of function, dominant negative activity, gain of function, and/or temperature sensitivity in various systems and cell lines [29,37].TP53 mutations strictly associated with Li-Fraumeni syndrome or Li-Fraumeni-like syndrome, and present in patients without additional germline mutations affecting cancer causing genes (for example, PTEN, BRCA1, or BRCA2), and/or meeting at least one of the above functional levels of evidence were considered non-neutral, whereas those mutations lacking an association with Li-Fraumeni syndrome or Li-Fraumeni-like syndrome or a functional impact were considered neutral.The remaining TP53 SNVs were considered uncertain.For BRCA1 and BRCA2, the UMD-BRCA1 (4 February 2014 update) and UMD-BRCA2 (22 January 2014 update) mutation databases [28,38,39] were used to collect information on specific BRCA1 and BRCA2 mutations.Mutations with validated functional evidence classified as '5 -Causal' in the database were classified as non-neutral in this study, '1 -Neutral' in the database were classified as neutral in this study, and '4 -Likely causal', '3 -UV', and '2 -Likely neutral' in the database were classified as uncertain in this study.
We next tested the set of 3,591 SNVs using 15 prediction algorithms by introducing a prediction category of low confidence.CanDrA (breast/lung/melanoma) ('nocall'), PolyPhen-2 ('possibly damaging'), and Mutation Assessor ('low') have pre-specified categories that identify SNVs whose predictions are based on limited confidence.For the other predictors, we employed a heuristic approach based on the original description of each predictor and additional online sources to define a category of SNVs whose predictions were of low confidence.For CHASM (breast/lung/melanoma), based on the histograms of CHASM scores for driver mutations and passenger mutations described in the original study [20], we called predictions with scores between 0.25 and 0.35 as low confidence.For FATHMM (cancer), based on the interactive prediction threshold graph (online documentation, [48]), prediction scores between -1.65 and 0.1 were considered as low confidence.For FATHMM (missense), based on the distribution of the predicted magnitude of effect for disease-associated and functionally neutral mutations using the weighted method in the original study [21], the region with the highest overlap of the disease-associated and functionally neutral scores (that is, between -2.5 and 0) was classified as low confidence.For MutationTaster, we considered predictions with probabilities of ≤95% as low confidence.For PRO-VEAN, based on the stringency score thresholds described in the online documentation [49], predictions with scores between -4.1 and -1.3 were considered low confidence.For SIFT, given the lack of reported stringency score thresholds in the original study [16] or in the online documentation, low confidence prediction scores between 0.04 and 0.1 were selected based on the distribution of SIFT scores in our dataset.In brief, these cutoffs were chosen on the basis of a density plot of the SIFT scores generated with the results from our dataset, which displayed a nonparametric distribution (left skewed) with a mode centered around 0.005 and a second much smaller mode centered around 0.999.For VEST, based on the density plots created from VEST score distributions reported in the original study [22], predictions with scores between 0.45 and 0.55 were considered low confidence.For Condel, based on the density of scores for the known neutral and deleterious mutations as reported in the online documentation [50] the region with the highest overlap of the neutral and deleterious mutation scores (that is, between 0.489 and 0.547) was classified as low confidence.
The mutation effect prediction algorithms were assessed using either all SNVs included in this study (n = 3,591) or using only the neutral and non-neutral SNVs (n = 989).In addition, we performed the analyses of these two sets of SNVs by removing either all SNVs present in the COSMIC dataset v68 (n = 1,699 and n = 297, respectively) or by removing all SNVs present in the training sets of CHASM (breast/lung/melanoma), FATHMM (cancer), FATHMM (missense) or PolyPhen-2.For CHASM (breast/lung/melanoma), training sets were retrieved from the drivers.tmps,null.tmps and passengers.tmps of the respective built classifiers [51] (Additional file 28).For FATHMM (cancer) and FATHMM (missense), training sets were retrieved from [52] (Additional file 28).For PolyPhen-2, training sets were retrieved from [53] (Additional file 28).

Analysis of agreement between mutation effect predictors
We converted the calls made by each predictor into neutral and non-neutral (as described above) and performed hierarchical clustering using complete linkage and Hamming distance metric.We assessed the agreement between the predictors using unweighted Cohen's Kappa coefficient with 95% confidence intervals (CIs).Inter-rater agreement was tested to determine the agreement between different mutation effect prediction algorithms.We considered Kappa coefficients <0 to be no agreement, between 0.01 and 0.20 to be slight agreement, between 0.21 and 0.40 to be fair agreement, between 0.41 and 0.60 to be moderate agreement, between 0.61 and 0.80 to be substantial agreement, and between 0.81 and 1 to constitute almost-perfect agreement [54].

Assessment of mutation effect predictor performance
The performance of each predictor was evaluated based on the concordance with our established, functionally validated neutral and non-neutral mutation categories.We evaluated the accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) for each predictor, and a composite score, ranging from 0 to 4, defined as the sum of sensitivity, specificity, PPV, and NPV.For the analyses performed, sensitivity measures the proportion of experimentally validated non-neutral mutations that were correctly identified, that is, sensitivity ¼ the number of non-neutral mutations correctly classified as non-neutral the number of all experimentally validated non-neutral mutations whereas specificity measures the proportion of experimentally validated neutral mutations that were correctly identified, that is, specificity ¼ the number of neutral mutations correctly classified as neutral the number of all experimentally validated neutral mutations PPV is defined as the proportion of mutations predicted to be non-neutral that were experimentally validated as non-neutral, that is, PPV ¼ the number of non-neutral mutations correctly classified as non-neutral the number of mutations predicted to be non-neutral and NPV is the proportion of mutations predicted to be neutral that were experimentally validated as neutral, that is, NPV ¼ the number of neutral mutations correctly classified as neutral the number of mutations predicted to be neutral 95% CIs for each of these measures were generated by performing resampling with replacement (that is, bootstrapping) for 1,000 iterations.For the CIs generated from bootstrapping, if CIs touch or do not overlap, the difference is considered statistically significant as the significance levels satisfies P <0.05 [55].Briefly, if the two standard errors are se 1 and se 2 , then the standard error of the difference is ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi se 1 2 þ se 2 2 p and the difference between the means is 2(se 1 + se 2 ), hence the P value can be calculated from z ¼ 2 se 1 þse 2 ð Þ ffiffiffiffiffiffiffiffiffiffiffiffiffiffi se 1 2 þse 2 2 p .This approach has been shown to be valid for a range of values and successfully employed in a previous study [55].

Combination of mutation effect predictors
To assess the effect of combining single predictors, we made use of the computed categories of individual predictors.For each possible combination of the 11 single predictors using n (n =2, 3, … 11) predictors at a time, a mutation would be considered non-neutral if at least p (p =1, 2, … n) predictors considered it non-neutral.To test the impact of the combinations of predictors, we used a split-sample approach by dividing the dataset of functionally validated mutations randomly into two sub-datasets (that is, subsets), each consisting of two-thirds ('subset 1') and one-third ('subset 2') of the mutations included in the entire dataset.We computed performance statistics (accuracy, sensitivity, specificity, PPV, NPV, and composite score) for each combination of n and p, resulting in 11,253 unique combinations.To define the confidence intervals for the predictions, we repeated the splitting of the dataset 1,000 times to create 1,000 random splits of the dataset and computed the performance statistics for each iteration.
Separately for subsets 1 and 2, we ranked the predictor combinations based on their accuracy or their composite scores.Differences in the performance statistics were considered statistically significant if their confidence intervals touched or did not overlap (see above).

Figure 1
Figure 1 Inter-rater agreement between 15 mutation effect prediction algorithms.Hierarchical clustering of the calls made by 15 mutation effect prediction algorithms using (A) all 3,591 single nucleotide variants (SNVs) included in this study, and (C) the 1,699 SNVs not present in the COSMIC database.The unweighted Cohen's Kappa coefficient was computed for each pair of predictors using (B) all 3,591 SNVs and (D) the 1,699 SNVs not present in the COSMIC database.The ranges of unweighted Kappa values and their corresponding colors are indicated in the color key.

Figure 2
Figure 2 Performance statistics of mutation effect prediction algorithms.Based on the prediction results of (A) the non-neutral (n = 849) and neutral (n = 140) single nucleotide variants (SNVs) included in this study and (B) the non-neutral (n = 188) and neutral (n = 109) SNVs not present in the COSMIC database, the accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and composite score for each predictor are plotted.Error bars represent the 95% CIs generated by bootstrapping.Single/independent predictors are shown in blue bars, and meta-predictors are shown in orange bars.

Figure 4
Figure 4  Recurrence of individual mutation effect prediction algorithms in the top performing mutation effect prediction algorithm combinations ranked by composite score.The top 10, top 20, top 50, and top 100 combinations of prediction algorithms were defined using the non-neutral (n = 849) and neutral (n = 140) single nucleotide variants (SNVs) included in the entire dataset and ranked according to composite score.The frequency of each single mutation effect predictor present in these top combinations was determined in subset 1 and subset 2 (A).The top 10, top 20, top 50, and top 100 combinations of prediction algorithms were defined using the non-neutral (n = 188) and neutral (n = 109) SNVs not present in the COSMIC database and ranked according to composite score.The frequency of each single mutation effect predictor present in these top combinations was determined in subset 1 and subset 2 (B).

Table 1
Single nucleotide variants included in this study stratified according to the evidence supporting their impact on protein function

Table 2
Predictions of 3,591 functionally validated single nucleotide variants by 15 mutation effect prediction algorithms

Table 3
Performance statistics of mutation effect prediction algorithms using all single nucleotide variants tested functionally (n = 989)Based on the prediction results of all non-neutral and neutral single nucleotide variants included in this study, the accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and composite score for each predictor were computed.The 95% confidence intervals (CI) generated by bootstrapping are shown in parentheses.

Table 4
Performance statistics of mutation effect prediction algorithms using only functionally tested single nucleotide variants not present in the COSMIC database (n = 297)