An ensemble approach to accurately detect somatic mutations using SomaticSeq

SomaticSeq is a somatic mutation detection pipeline that implements a stochastic boosting algorithm to produce highly accurate calls for both single nucleotide variants and small insertions and deletions. The workflow currently incorporates five state-of-the-art somatic mutation callers and extracts over 70 individual genomic and sequencing features for each candidate site. A training set is provided to an adaptively boosted decision tree learner to create a classifier that predicts mutation status. We validate our results with both synthetic and real data. We report that SomaticSeq achieves better overall accuracy than any of the individual tools it incorporates.

Electronic supplementary material: The online version of this article (doi:10.1186/s13059-015-0758-2) contains supplementary material, which is available to authorized users.
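The abstract describes combining the output of several callers into one set of candidate sites before features are extracted and a classifier is trained. The sketch below illustrates only that first merging step, under stated assumptions: the caller names appear elsewhere in this document, but the site coordinates, the `merge_calls` helper, and the use of simple per-caller flags as the only features are hypothetical. The actual pipeline extracts many more features per site.

```python
from typing import Dict, Set, Tuple

# A candidate site: (chromosome, position, reference allele, alternate allele)
Site = Tuple[str, int, str, str]


def merge_calls(calls_by_caller: Dict[str, Set[Site]]) -> Dict[Site, Dict[str, int]]:
    """Union candidate sites across callers and flag which callers reported each site."""
    callers = sorted(calls_by_caller)
    all_sites = set().union(*calls_by_caller.values())
    return {
        site: {caller: int(site in calls_by_caller[caller]) for caller in callers}
        for site in sorted(all_sites)
    }


# Hypothetical toy call sets, for illustration only
calls = {
    "MuTect": {("1", 100, "A", "T"), ("1", 200, "G", "C")},
    "VarDict": {("1", 200, "G", "C")},
    "SomaticSniper": {("2", 50, "C", "T")},
}
features = merge_calls(calls)
for site, flags in features.items():
    print(site, flags)
```

Each row of `features` could then be extended with further per-site measurements and passed, together with truth labels, to a boosted decision tree learner.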


Table 1. Training data used for each test data set in this study, and the data used for validation.

Test data: [not recovered]; training data: half of the test data; validation: cross validation with ground truth.
Test data: in silico Titration and Somatic-Spike; training data: half of the test data; validation: cross validation with constructed truth set.
Test data: [not recovered]; training data: original DREAM Challenge Stage 3; validation: compare to experimentally validated mutations.
Test data: CLL1; training data: original DREAM Challenge Stage 3; validation: compare to experimentally validated mutations.
• Stage 3 Tumor: https://dream.annailabs.com/cghub/data/analysis/download/8fe6fc33-2daf-4393-929f-7c3493d04bef

For our in silico titration, the two genomes can be downloaded at the following locations:

Table 2. VarScan+Filter and Sniper+Filter contain a subset of calls where the author-recommended false positive filters are applied to the original call sets. Setting A: the unmodified Stage 3 data. Setting B: the matched normal is contaminated with 5% tumor. Setting C: the tumor is contaminated with 30% normal, so the variant allele frequencies present in the tumor sample are 35%, 23%, and 14%. Setting D: combination of B and C, i.e., the normal is contaminated with 5% tumor, and the tumor is contaminated with 30% normal. The ensemble contained all calls from VarScan2 and SomaticSniper (without false positive filter), plus calls made with VarDict's internal filters relaxed.

Table 4. SomaticSpike. Tumor sequencing depth = 10X. The prior probability of a somatic mutation is set to 1 in a million to obtain a more realistic performance estimate.
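The variant allele frequencies quoted for Setting C are consistent with diluting the tumor sample with normal reads. The following is a minimal sketch of that arithmetic, assuming the normal reads carry no variant alleles and that the undiluted frequencies were 50%, 33%, and 20%. Those starting values are an assumption inferred from the quoted 35%, 23%, and 14%, not something stated in this text, and `diluted_vaf` is a hypothetical helper.

```python
def diluted_vaf(tumor_vaf: float, normal_fraction: float) -> float:
    """VAF after mixing a fraction of variant-free normal reads into the tumor sample."""
    if not 0.0 <= normal_fraction <= 1.0:
        raise ValueError("normal_fraction must be between 0 and 1")
    return tumor_vaf * (1.0 - normal_fraction)


# Assumed undiluted VAFs (inferred, see above), with 30% normal contamination
for vaf in (0.50, 0.33, 0.20):
    print(f"{vaf:.0%} -> {diluted_vaf(vaf, 0.30):.0%}")
# prints:
# 50% -> 35%
# 33% -> 23%
# 20% -> 14%
```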

Table 5. SomaticSpike. Tumor sequencing depth = 20X. The prior probability of a somatic mutation is set to 1 in a million to obtain a more realistic performance estimate.

Table 6. SomaticSpike. Tumor sequencing depth = 30X. Columns: recall, precision, and F1 score, each at VAF 5%, 10%, 20%, and 40%. The prior probability of a somatic mutation is set to 1 in a million to obtain a more realistic performance estimate.

Table 7. SomaticSpike. Tumor sequencing depth = 40X. The prior probability of a somatic mutation is set to 1 in a million to obtain a more realistic performance estimate.
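The SomaticSpike tables report recall, precision, and F1 score. For reference, a minimal sketch of how these three metrics follow from counts of true positives, false positives, and false negatives. This uses the standard definitions; the `precision_recall_f1` helper and the example counts are hypothetical and are not taken from the tables.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from call counts, using 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 true calls recovered, 20 false calls made, 20 true calls missed
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```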

Table 8. SomaticSpike. Tumor sequencing depth = 50X. The prior probability of a somatic mutation is set to 1 in a million to obtain a more realistic performance estimate.

Table 9.

Figure 1. The average F1 scores vs. the number of tools. The gain in accuracy with each addition is greatest when the data are the most challenging (i.e., DC3D and N_2.5 T_15), and smallest when the data are the simplest (i.e., DC3A and N_0 T_50). Returns also diminish as more tools are added.

Table 19. Reduced size of training data set for SomaticSeq INDELs. Size is defined by the number of true positives. The final row (T.P. fraction) is the fraction of true positives in the training data for each data set, i.e., for DC3A (DREAM Challenge Stage 3, Setting A), the fraction of true positives is 0.138. Thus, when there are 10 true positives, the total training data consisted of 10/0.212 = 47 calls (of which 10 are true positives and 37 are false positives). 0 means no SomaticSeq training. F1 scores of the individual tools are included for easy comparison.

Table 17. showing that almost all true somatic mutations have MQ above 57. (d) z-score of base quality rank sum between the reference and alternate reads in tumor reads. It is a measure of base quality bias between the reference and alternate reads. This is a weaker predictor than BQ, but it still holds value, as large-magnitude z-scores are enriched with false positives compared with z-scores close to 0. All figures here are generated from Stage 3 of DREAM Challenge data.

Table 18. For comparison, the best individual tool's F1 scores were 0.789 (MuTect), 0.624 (MuTect), 0.607 (SomaticSniper), and 0.296 (VarDict) for DC3A, DC3D, N_0 T_50, and N_2.5 T_15, respectively.

Table 19. For comparison, the best individual tool's F1 scores were 0.707 (VarDict), 0.525 (VarDict), 0.729 (Indelocator), and 0.165 (VarDict) for DC3A, DC3D, N_0 T_50, and N_2.5 T_15, respectively.
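The reduced-training-set caption for Table 19 derives the total number of training calls from a chosen number of true positives and the true positive fraction. A minimal sketch of that calculation, reproducing the caption's worked example of 10 true positives divided by 0.212. The `training_set_size` helper is hypothetical, and rounding to the nearest whole call is an assumption; the fraction is a parameter, so the same function applies to whichever value a given data set reports.

```python
from typing import Tuple


def training_set_size(true_positives: int, tp_fraction: float) -> Tuple[int, int]:
    """Return (total calls, false positives) for a training set with the given TP fraction."""
    if not 0.0 < tp_fraction <= 1.0:
        raise ValueError("tp_fraction must be in (0, 1]")
    # Assumption: the total is rounded to the nearest whole call
    total = round(true_positives / tp_fraction)
    return total, total - true_positives


total, false_positives = training_set_size(10, 0.212)
print(total, false_positives)  # prints: 47 37
```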