Evaluation of in silico algorithms for use with ACMG/AMP clinical variant interpretation guidelines

Background The American College of Medical Genetics and American College of Pathologists (ACMG/AMP) variant classification guidelines for clinical reporting are widely used in diagnostic laboratories for variant interpretation. The ACMG/AMP guidelines recommend complete concordance of predictions among all in silico algorithms used without specifying the number or types of algorithms. The subjective nature of this recommendation contributes to discordance of variant classification among clinical laboratories and prevents definitive classification of variants. Results Using 14,819 benign or pathogenic missense variants from the ClinVar database, we compared performance of 25 algorithms across datasets differing in distinct biological and technical variables. There was wide variability in concordance among different combinations of algorithms with particularly low concordance for benign variants. We also identify a previously unreported source of error in variant interpretation (false concordance) where concordant in silico predictions are opposite to the evidence provided by other sources. We identified recently developed algorithms with high predictive power and robust to variables such as disease mechanism, gene constraint, and mode of inheritance, although poorer performing algorithms are more frequently used based on review of the clinical genetics literature (2011–2017). Conclusions Our analyses identify algorithms with high performance characteristics independent of underlying disease mechanisms. We describe combinations of algorithms with increased concordance that should improve in silico algorithm usage during assessment of clinically relevant variants using the ACMG/AMP guidelines. Electronic supplementary material The online version of this article (doi:10.1186/s13059-017-1353-5) contains supplementary material, which is available to authorized users.


Background
Many in silico methods have been developed to predict whether amino acid substitutions result in disease. Use of this type of evidence has become a routine part of assessment of novel variants identified through genefocused projects or as a part of whole exome or genome annotation pipelines. In a clinical setting, predictions from in silico algorithms are included as one of the eight evidence criteria recommended for variant interpretation by the American College of Medical Genetics and Genomics (ACMG) and Association of Molecular Pathologists (AMP) [1]. The ACMG/AMP guideline for use of in silico algorithms specifically states: "If all of the in silico programs tested agree on the prediction, then this evidence can be counted as supporting. If in silico predictions disagree, however, then this evidence should not be used in classifying a variant." For a given missense variant, predictions by numerous algorithms are publicly available, e.g. via dbNSFP [2] or Variant Effect Predictor [3] from which a few algorithms are typically chosen for variant interpretation and are often used without additional calibration. Different testing laboratories use distinct combinations of in silico algorithms for variant interpretation and this can lead to discordant interpretations. For example, in a recent assessment of the ACMG/AMP guidelines by the Clinical Sequence Exploratory Research consortium (CSER), the frequency of use of in silico algorithm evidence for pathogenic and benign variant assertion were 39% and 18%, respectively [4]. The CSER study noted that use of in silico algorithms were one major source of discordance among different clinical laboratories and that the ACMG/AMP guideline for in silico algorithm usage may be aided by further recommendations [4].
Missense variants constitute a major set of variants of uncertain significance (VUS) in ClinVar [5]. An improved recommendation for use of in silico algorithms is important for reducing the VUS burden in clinical medicine and increasing concordance of variant interpretation. Currently, there is little consensus among clinical labs on how many and which algorithms to use for missense variant interpretation. For example, a recent exome sequencing study classified variants in 180 medically relevant genes for hereditary cancer according to ACMG/AMP guidelines. The authors found that the VUS rate was higher when requiring full concordance vs majority agreement among the 13 in silico algorithms used in their pipeline [6]. Other examples from the literature demonstrate requiring full concordance among three [7] to seven [8] different algorithms for variant interpretation. However, to our knowledge, no analysis has been conducted to assess the applicability of the current ACMG/AMP guideline for in silico algorithm usage. Here, using predictions from 25 in silico algorithms for 14,819 clinically relevant missense variants in the ClinVar database, we highlight several limitations of implementing the ACMG/AMP guideline for in silico algorithm usage. We find highly variable degree of concordance among different combinations of algorithms with particularly low concordance of the predictions of variants reported in ClinVar as benign. Using the ClinVar dataset, we identify algorithms with higher predictive power whose performances are robust to variables such as disease mechanism, level of constraint, and mode of inheritance.

Concordance among in silico algorithms
To identify the extent of concordance among in silico algorithms for known pathogenic and benign variants, we obtained 14,819 missense variants from ClinVar for which the rationale for pathogenic or benign assertion has been provided by at least one submitter (one-star status in ClinVar), primarily clinical laboratories, and annotated these variants with scores and predictions from 25 algorithms using dbNSFP (v3.2) [9] or the respective authors' websites. We generated a matrix of binary predictions (pathogenic or benign) for these variants with scores from the 18 algorithms, for which thresholds of pathogenicity were publicly available (Fig. 1a, Additional file 1: Table S1, see "Methods"). We found that when using this large number of algorithms only 5.2% of the benign and 39.2% of pathogenic variants had concordant assertions across all algorithms (Fig. 1a, Table 1). Some algorithms did not produce a prediction for all 14,819 variants (indicated by white spaces in Fig. 1). To ensure that missing data did not introduce a bias in our results, we analyzed concordance with a smaller dataset of 8386 variants (Additional file 2: Figure S1), where there were no missing data. Similarly, we found that for the dataset without missing values only 3.2% of the benign and 41.5% of pathogenic variants had concordant assertions across all of them (Additional file 1: Table S2, Additional file 2: Figure S1). We also obtained similar results when we restricted our analysis to benign and pathogenic variants in ClinVar that had identical assertions from at least two independent laboratories (two stars - Fig. 1b, Table 1, Additional file 1: Table S1, Additional file 2: Figure  S1B), suggesting that errors in ClinVar assertions by a Fig. 1 Concordance among predictions of 18 algorithms for variants in ClinVar. Binary predictions made by 18 algorithms for each pathogenic or benign variants in ClinVar are shown in the upper and lower panels. Each variant is along a row and an orange, green, or white tile depicts a pathogenic, benign, or missing data call, respectively, by the corresponding algorithm. A total of 14,819 variants with ClinVar review status one star or above (a) and 2966 variants with ClinVar review status two stars or above (b) are shown single submitter or missing data contributes little to the low level of concordance among algorithms.
We then computed the pairwise differences among all the algorithms separately for 7346 benign and 7473 pathogenic variants in our dataset (see "Methods"). We found that, on average, two algorithms tend to differ from each other significantly more in the interpretation of benign as opposed to pathogenic variants (p < 0.0001, Welch's two-sample t-test) (Fig. 2a). Our data suggest that while interpreting large number of variants, full concordance, as suggested by the ACMG/AMP guidelines, is less likely to be achieved even when using only two algorithms, particularly for benign variants, consistent with earlier observation of poor correlations among predictors by Thusberg et al. [10].
To assess the level of concordance among the most commonly used algorithms, we reviewed algorithm use in the medical genetics literature between January 2011 and January 2017 (see "Methods"). We found that Polyphen [11] and SIFT [12] are cited most frequently, followed by MutationTaster [13], CADD [14], PROVEAN [15], Mutpred [16], and Condel [17]. We did not detect any consistent pattern of combinations among these algorithms. In general, there was more frequent usage of some algorithms while others, especially the more recently developed algorithms, are used less frequently. Predictions from five commonly used algorithms (Polyphen, SIFT, CADD, PROVEAN and MutationTaster) resulted in higher concordance relative to all 18 algorithms, with 79% concordance for pathogenic variants and only 33% for benign variants (Table 1) with no significant departure in concordance when using the dataset without missing data (Additional file 1: Table S2).
In addition to lack of full concordance in prediction, we also identified "false concordance" across algorithms as a potential problem for variant classification. We identified 773 of 7346 (10.5%) variants classified as benign in ClinVar, for which all five commonly used algorithms predicted the variant to be pathogenic and conversely 64 of 7473 (0.8%) pathogenic variants in ClinVar were predicted benign by all five algorithms. These numbers grow to 815 and 68 variants, respectively, when we included concordance among algorithms for variants with some missing predictions (Table 1). Evaluating only three commonly used algorithms (Polyphen, SIFT, and CADD) resulted in higher concordance for pathogenic (84%) and benign (46%) variants, however, coupled with an increase in false concordances (Table 1, Additional file 1: Table S2). In fact, 22.5% (1653/7346) of benign ClinVar variants were assessed as pathogenic by ≥ 50% of the 18 algorithms including 87 variants where the benign classification of the variant had been reviewed by a ClinVar recognized expert panel (three-star review status) suggesting that these variants are benign and not misclassified in ClinVar (Additional file 1: Table S3). There were 57 benign variants spread among 42 genes, for which all 18 algorithms predicted the variant as pathogenic. In comparison, 5.2% (389/7473) of ClinVar pathogenic variants were deemed to be benign by ≥ 50% of algorithms (Additional file 1: Table S3). It is possible that some of these variants may result in splicing defects that may not be captured by these 18 algorithms. However we note, for example, that the variant NP_000234.1:p.Va-l726Ala in the MEFV gene interpreted as pathogenic by four independent clinical laboratories was predicted to be tolerant by 16 of 18 algorithms. The ClinVar record for this variant suggests a decrease in catalytic activity rather than a splicing defect as part of the reason why the clinical laboratories interpreted this variant as pathogenic. Conversely, the MSH2 variant, NP_000242.1:p.Glu198Gly, was classified as benign by an expert panel due to lack of an effect in functional assays including splicing but was predicted to be damaging by all 18 algorithms.
Not surprisingly, we failed to identify any combinations of algorithms that resulted in false concordance of zero and true concordance of 100% among the 18 algorithms whose default predictions are publicly available. We generated all possible combinations of three (n = 816), four (n = 3060), or five (n = 8568) algorithms and obtained their true and false concordance rates across the 14,819 variants. As before, there was a lower false concordance rate and a higher true concordance rate for pathogenic variants relative to the benign variants (Fig. 2b). Overall, the concordance among combinations was in the range of 83.4-64.5% for two to five algorithms, respectively (Table 2, Additional data 2-5, see Availability of data and materials section). We found that the best performing combinations of algorithms were different for benign and pathogenic variants (Additional file 2: Table S4, Additional data 2-5, see Availability of data and materials section). For example, for benign variants the best performing combinations of three algorithms consisted of VEST3 [18], REVEL [19], and MetaSVM [20] with a true concordance rate of 81.3% and a false concordance rate of 2.8%, whereas for pathogenic variants the same combination resulted in a 70% true concordance and a 5.4% false concordance. For pathogenic variants, the best performing trio combination consisted of MutationTaster, Mcap [21], and CADD (Additional file 1: Table S4), with a fairly high false concordance for benign variants (18.2% and 25.8% false concordance, Additional data 3, see Availability of data and materials section). We obtained similar results for combinations of four or five algorithms (Additional file 1: Table S4, Additional data 4 and 5, see Availability of data and materials section). Note that these best performing combinations are relevant only in the context of the particular dataset we used and may not be optimal across other designs (such as whole-genome analysis). In general, many different combinations performed better for pathogenic than benign variants (Fig. 2b). Taken together, our results suggest that a given combination of algorithms (using the publicly available threshold scores) will perform quite differently across benign and pathogenic variants with a significant chance of erroneous assertion due to false concordance among algorithms. These false concordances could potentially bias the variant interpretation towards a VUS classification if all the other available variant data suggests the opposite assertion.

Further analysis of algorithm prediction and concordance
For some algorithms such as Eigen [22], hEAt [23], GERP [24], etc., cut-offs defining pathogenic or benign assertion are either not recommended or inferred arbitrarily. We therefore used the actual output scores provided by all 25 algorithms as a continuous variable to identify algorithms whose predictions are likely to be concordant independent of the algorithms internal cutoffs. A hierarchical clustering of the normalized output scores of 14,819 missense variants for each of the algorithms revealed seven clusters (Fig. 2c). All the largely evolutionary conservation algorithms such as phyloP [25], phastCons [26], GERP [27], and Siphy [28] belong to different clusters from the metapredictors REVEL, MetaSVM, and MetaLR (Fig. 2c).

Comparison of performance of in silico algorithms
To identify well-performing algorithms when tested against variants in ClinVar disease genes with prediction abilities that are robust to the nature of a variant, gene constraint and underlying disease mechanism, we quantified performance of the in silico algorithms on multiple test datasets by determining the area under the receiver operator characteristic curve (AUC) (see "Methods").
We analyzed two overlapping datasets differing in the confidence of variant assertions. These were 14,819 ClinVar variants that are assigned at least one-star review status and 2966 ClinVar variants with concordant scores from at least two laboratories (two stars, see "Methods"). For both datasets, we observed wide variation in performance of the algorithms with AUCs in the range of 0.5-0.96 (Fig. 3a). We identified several algorithms with AUC ≥ 0.9 in these datasets that did not differ significantly in their performance between > 1or > -2-star datasets (Fig. 3a). In the above analyses, we collapsed the "likely" and "definite" categories into one category for the benign and pathogenic variants. To identify if there was a confound introduced by this  Fig. 3a are used in 3b for comparison. Any instance of ** represents variants with ClinVar review status of two stars or above. Ensemble predictors are indicated by dark green labels on the y-axis approach, we took a subset of 7766 variants excluding variants with "likely" assertions (resulting in pathogenic: 3293 variants and benign: 4473 variants) and conducted the performance analysis. There was a similar overall rank order of the algorithm performance with this dataset (Fig. 3a). We next sought to identify algorithms whose performance did not differ whether a given variant resulted in gain-of-function (GOF) or loss-of-function (LOF) of a gene product by analyzing datasets enriched in activating/ GOF mutations in oncogenes and LOF mutations in tumor suppressor genes (TSG) which are both pathogenic in cancer development [29] (see "Methods"). We also separately evaluated 1169 benign and 1427 pathogenic variants in genes linked to diseases with primarily recessive mode of inheritance as another proxy for a dataset enriched in LOF variants. These datasets were a subset of the larger ClinVar 1 star or above data (see "Methods"). We did not observe significant differences in performance of algorithms in the GOF and LOF datasets including across the high-performing algorithms (Fig. 3a). Additionally, we analyzed variants in genes that are primarily linked to diseases with dominant mode of inheritance. The latter dataset is likely a mixture of LOF and GOF variants. Again, there was no major departure from the rank-order of the top five performing algorithms that we observed in the other datasets (Fig. 3a).
Finally, we explored whether the performance of algorithms was affected by the level of constraint on a gene, as defined by comparing the expected and observed missense variants in ExAC [30] (missense Z scores). We obtained variants in genes with high, intermediate, or low levels of constraint by a missense Z score threshold of > 2.5, 0-2.49, or < 0, respectively. We did not observe any major changes in the rank order of the algorithms (Fig. 3a).
Taken together, our analyses suggest that the performance of the majority of algorithms in current use are unlikely to be affected significantly as a function of the nature of a variant or the level of constraint on the gene. The high-performing algorithms are robust to these variables and are more likely to give rise to consistent interpretation across different variant datasets in a variety of disease mechanism settings.

Evaluation of potential circularity in algorithm analysis
There is significant concern that the result of analyses such as that described here may arise from circularity in data used. For example, REVEL, a meta-predictor whose features include 13 out of the 25 algorithms that we analyzed [19] performed well in all nine datasets described above with high sensitivity and specificity (Fig. 3, Additional file 1: Table S6 and Additional file 2: Figure S2 and S3). It was possible that the variants we used to assess the performance of REVEL and other algorithms described here were also included in its training sets, which for REVEL included some of the ClinVar and HGMD variants available until October 2015. This type of circularity inflates the performance measures of some algorithms and is referred to as type I circularity [31]. To examine the possibility of type 1 circularity inflating the performance of algorithms that were trained on HGMD and ClinVar variants, we compared performance of all the algorithms in six additional datasets: The resulting predictions of these datasets which removed variants used in training different algorithms did not demonstrate a major change in the rank order of the top five algorithms that exhibited an AUC > 0.9 with REVEL performing the best in these datasets (Fig. 3b, Additional file 2: Figure S2). We next tested if the top performing algorithms suffered from type 2 circularity [31], which has been described as a caveat introduced due to the reliance of an algorithm's performance on the distribution of pathogenic or benign variants in a protein. Thus, in a variant dataset where there are proteins with only pathogenic variants or only benign variants (unbalanced dataset) some algorithms tend to perform better than in a dataset that have equal number of pathogenic and benign variants per gene (perfectly balanced), even if this is not what is biologically present. To this end, we compared performance on an unbalanced dataset (Varibenchselected [31]) and a balanced dataset which includes equal number of pathogenic and benign variants per protein in ClinVar (see "Methods"). Consistent with earlier results [31], we found that FATHMM [32] is particularly sensitive to this type of circularity. In other words, there is a drop in performance of FATHMM in analysis of a dataset that is perfectly balanced (balanced dataset * in Additional file 2: Figure  S3). We also detected evidence for potential type 2 circularity for algorithms such as MetaSVM/LR and MCAP (balanced vs varibench for MetaSVM, MetaLR, and MCAP, bootstrap p value < 0.0001; Additional file 2: Figure S3). Thus, caution should be used in interpreting scores using algorithms such as FATHMM as the prediction efficacy is partly dependent on pathogenic and benign variant distributions in any given gene.

Discussion
The ACMG/AMP guideline for use of in silico algorithms in a clinical setting suggests full concordance among multiple algorithms for this type of evidence to be used in missense variant classification without further clarification of the number or choice of algorithms. As we have shown, such usage leads to discrepancies arising mainly because of the lack of specification of the guideline. Our review of the literature reveals that the metric for concordance is not consistent across different laboratories. While some studies have adhered to the ACMG/ AMP guidelines for strict concordance, others have used a majority vote rule and the number of algorithms used by laboratories also varies widely. It has been reported that use of the strict ACMG criteria gave rise to a higher rate of VUS [6] and increased discrepancies among laboratory classifications [4]. The lack of a standard guidance for incorporating in silico algorithms could potentially lead to increased VUS burden and inter-lab discrepancies. As discussed in the following section, we identified certain limitations of using the ACMG/AMP guideline for in silico evidence and provide suggestions that may help optimize this guideline (Box 1).
In addition, based on review of abstracts in the medical literature, we find that frequently used algorithms are older and vary in performance. Our analyses identified several high-performance relatively newer algorithms which do not appear to have been incorporated into clinical laboratory pipelines such as REVEL, VEST3, etc. Many of these algorithms are ensemble predictors (e.g. REVEL, VEST3, MetaLR, MetaSVM, Condel, Mcap, Eigen, and CADD) incorporating many older algorithms as features. When tested across 1821 disease genes with variants in ClinVar, the performances of these algorithms are robust to technical artifacts, levels of constraint on genes, the underlying nature of variants, and Mendelian inheritance pattern. Moreover, we find that performance does not seem to be affected by restriction to definite pathogenic and benign categories. Thus, laboratories will potentially benefit from modifying pre-existing variant interpretation pipelines that currently use older algorithms.
The ACMG/AMP guideline encourages use of multiple algorithms. However, as expected, we observed an increase in the discordant calls as more algorithms are used to infer variant pathogenicity thus hindering the use of in silico evidence. An alternative is to use metapredictors that in effect combine multiple individual predictors to generate a score. These metapredictors satisfy the concept underlying the multiple algorithm criteria. However, combining metapredictors with their constituent predictors for variant interpretation may not be ideal given the duplication of analyses.
In general, we show, using author recommended thresholds for variant assertion, a substantial likelihood of discordance, particularly for benign variants. Many of the algorithms analyzed here were not necessarily designed for clinical classification purposes. They are often designed to predict whether a missense variant disrupts a protein domain, inferring that this is damaging to protein function, and were not intended to be a classifier of disease causality. However, as demonstrated in our review of the medical literature, clinical classification pipelines often use these damaging or deleterious predictions as a proxy for pathogenicity.
We also found several algorithms exhibit a bias towards calling a variant pathogenic (Additional file 1: Table S5) leading to incorrect inferences that may lead to relatively higher concordance for pathogenic variants. Consistent with this, we identified more false concordance (in which multiple algorithms made concordant assertions that were opposite to what is reported in ClinVar) for benign variants than pathogenic variants. Although this could be a result of misclassification in ClinVar, we found that a ClinVar designated expert panel has interpreted some of these benign variants. These false concordances by in silico prediction are another important source of error for variant interpretation. The problem of false concordance both increases the VUS rate and highlights why it may be inappropriate to increase the ACMG/AMP evidence strength for computational algorithms from "supporting" to "moderate" or "strong" without some further calibration of thresholds for the genes under consideration. We independently identified combinations of algorithms that tend to be more concordant via a hierarchical clustering of the output scores of all algorithms. The clustering pattern suggested that it is probably best to make inferences separately for evolutionary conservation algorithms, e.g. GERP and for metapredictors. Combining them is likely to result in discordant calls.
Our results analyzing variant data from a large number of disease genes are not designed to identify a single algorithm for use across all genes or for disease gene discovery. However, the data provided here suggests that high performing algorithms perform well across many different gene and mutation mechanism type. In addition, gene-specific algorithms or gene-specific calibration of algorithms using well-characterized sets of benign and pathogenic variants may perform better than the general approach described here. Several algorithms are very sensitive to the multiple sequence alignment being used [33]. The performance of SIFT and other algorithms within our analyses and others such as Align-GVGD [34] could potentially be improved if gene-specific curated alignments are provided to the classification pipeline.

Conclusions
The analyses and the data presented in this article highlights problems associated with the strict use of ACMG/ AMP guidelines for in silico algorithm usage. In particular, our results identify poor concordance among algorithms, particularly for variants classified as benign by clinical laboratories. We highlight the problem that the concordance rate varies substantially depending on the combination of algorithms utilized, which contributes to inter-clinical laboratory discrepancy in variant assertion. We also identify a previously unreported source of error in variant interpretation where in silico predictions are opposite to the evidence provided by other sources (false concordance). Finally, we identify high-performing algorithm combinations, many of which are based on recently developed algorithms and metapredictors that perform well independent of the underlying disease mechanism. Taken together, this analysis provides the necessary data and framework for optimization of the ACMG guidelines and offers methods to potentially reduce the burden of variants of uncertain significance in clinical variant interpretation.

Variant data and annotation
We downloaded the variant_summary.txt files from the ClinVar ftp site for variants used in the analysis. In this manuscript, we used the files ftp://ftp.ncbi.nlm.nih.gov/ pub/clinvar//tab_delimited/archive/2016/variant_sum-mary_2016-09.txt.gz and ftp://ftp.ncbi.nlm.nih.gov/pub/ clinvar//tab_delimited/archive/2016/variant_sum-mary_2016-12.txt.gz along with their corresponding.xml files. We removed all variants whose review status were "no assertion criteria provided." We also excluded any variants of uncertain significance from our analysis. We next considered only the missense variants and filtered out all the other classes of variants such as frameshift, termination, silent, non-coding, etc. Finally, we collapsed the Likely pathogenic and Pathogenic variants in one group and Likely Benign and Benign variants in another group. Thus, our final data of 14,819 variants had two levels of clinical significance: pathogenic and benign (Additional data 1, see Availability of data and materials section).

Algorithms and scores
We annotated these variants with 25 algorithm scores using dbNSFP and authors' publicly available websites (Additional file 1: Table S1). To generate binary predictions, we used the threshold recommended by dbNSFP3.2 or by the algorithms' authors. Certain algorithms such as MutationTaster, Mutation Assessor, and Polyphen have thresholds such that it generates more than two classes. We collapsed the "probably damaging" and "possibly damaging" classes variants of Polyphen into a single "damaging" class. For MutationTaster, we collapsed the "A" (disease causing automatic) and "D" (disease causing) classes into a single "damaging" class, while the "N" ("polymorphism") or "P" ("polymorphis-m_automatic") were collapsed into a single "Tolerated" class. For MutationAssessor that generates four predictions, the high ("H") or medium ("M") categories were treated as "Damaging" whereas the low ("L") or neutral ("N") categories were treated as "Tolerant." LRT predictions in dbNSFP gives three classes, namely "Damaging," "Neutral," and "Unknown". We treated the "Unknown" labels as no data available or NA in our analysis. For certain algorithms such as SIFT, MutationTaster, PROVEAN, and FATHMM, multiple scores for a given variant corresponding to different transcripts are provided by dbNSFP. We used the most damaging score predicted by the corresponding algorithm for a given variant in our analyses.
MutationTaster p values were converted as per dbNSFP3.2 for performance analysis. Note that not all algorithms produced a score for all the variants. These were treated as NAs in our analyses and are shown by the white color (instead of orange and green) in Fig. 1a and b. The number of variants included in our analyses are provided in Additional file 1: Table S1, columns G and H. We did the concordance analysis presented in Additional file 1: Table S2 and Additional file 2: Figure S1 with a smaller subset of the data (n = 8386) without any missing data (see "Datasets" section for additional information).

Literature search
To identify the frequency of usage of algorithm during the years 2011-2017, we conducted a literature search in PubMed (search date 19 January 2017) using Pubme-reporting discovery and comparative analysis of algorithms as defined by the above search term. We obtained 507 of articles that mentioned an algorithm in the title or abstract. The number of articles per algorithm term was used as a proxy for the usage of algorithms used in our analysis. Note that there could be a bias towards groups that use the algorithms names in the title or abstract, likely due to the importance of reaching the conclusion in a paper, and may not necessarily reflect routine use.

Concordance analysis
To determine concordance among algorithms, we obtained the publicly available thresholds (Additional file 1: Table S1) to define a dataset of pathogenic and benign prediction for each variant. We next generated all possible pairwise combinations of algorithms and determined the proportion of variants for which they agree with each other with a dataset with as well as without any missing values. Next, we also generated all possible combinations of algorithms with three, four, or five members and determined the concordance with ClinVar assertions for each of these pairs. We also determined the fraction of variants for which algorithms in each combination was concordant but the assertion was opposite to that designated in ClinVar. We refer to these instances as false concordances. A list of such combinations and their true and false concordances are provided in Additional data 2-6.

Clustering
The scores for 25 algorithms for each of the 14,819 variants were used to cluster the algorithms using the pvclust package in the R programming environment. We identified the most confident clusters by using 50,000 bootstrap replicates of the data, Euclidean distance as a measure of similarity, and ward's D2 method of hierarchical clustering as implemented in the pvclust function [35]. We called clusters as stable if they had a 0.99 or above probability of having the same members in the bootstrap replications. The final rendering of the plot was done using the dendextend [36] package in R.

Performance analysis
We compared the performance of each of the algorithms on all datasets separately by estimating the AUC of a receiver operator characteristic (ROC) curve and its 99% confidence interval using the OptimalCutpoints library in R. We estimated significant differences between any two AUCs by using 10,000 stratified bootstrap replicates of the datasets in question (where each replicate contained the same number of benign and controls than in the original sample), calculating AUC for each replicate for each and then testing for the statistical significance as implemented in the library pROC in R. We also estimated the sensitivity, specificity, positive and negative predictive values using cut-offs estimated from the ROC curve for the three datasets using the caret library in R (as indicated in Additional file 1: Table S6).

Datasets
All the data are available as additional data files or are available from the respective authors. We provide brief descriptions of the datasets that we used below.
ClinVar one star: 14,819 ClinVar variants (7346 benign and 7473 pathogenic variants) that are assigned one star or above (meaning at least one laboratory [primarily clinical laboratories] have provided their rationale for variant assertion).
ClinVar 2 star: 2966 (1914 benign and 1052 pathogenic) ClinVar variants with two-star status or above. These variants have concordant assertions from at least two independent laboratories.
ClinVar Oct 2015 to December 2016: this dataset contains 6949 (4093 benign and 2856 pathogenic) variants in ClinVar that were obtained from the variant_summary.txt file released in December 2016 after removing the variants that were present in the October 2015 data release.
ClinVar Sept 2016 to March 2017: this is a set of 3792 benign and 1310 pathogenic missense variants with one star or above ClinVar review status. These were obtained by filtering out the variants in the variant_summary.txt file in ClinVar from September 2016 from the variant_summary.txt files from March 2017 in ClinVar.
Oncogene variants: this dataset consists of 87 benign and 321 pathogenic variants in oncogenes as defined by genes having a high oncogene score and a low TSG score as described in [29].
TSG variants: this dataset consists of 502 benign and 532 pathogenic variants in TSGs as defined by genes having a high TSG score and a low oncogene score as described in [29].
Dominant: this dataset contains variants in genes that were associated with dominant mode of inheritance as determined by both [37] and [38]. There were 480 benign and 1591 pathogenic variants in this dataset.
Recessive: this dataset contains variants in genes that were associated with recessive mode of inheritance as determined by both [37] and [38]. There were 1169 benign and 1429 pathogenic variants in this dataset.
REVEL testset: this is the test dataset that contained ClinVar variants (Test Data 2) as described in [19].
MetaSVM/LR testset: this dataset consisted of 12,496 (6275 benign and 6221 pathogenic) ClinVar variants (with one or more review status in ClinVar) which did not include the variants used in the training sets of MetaSVM/LR. predictSNPdsel: this is a benchmark dataset as described in [31]. It does not contain CADD training data.