Somatic mutation callers
While many individual callers have optional input parameters that can improve prediction results for sequencing data with different characteristics (e.g., ploidy, purity, mutation rate, etc.), the appropriate values for those parameters are usually unknown to researchers. Thus, in most cases, we used default or recommended settings for each caller. For MuTect, we supplied dbSNP version 138 [27], Catalogue of Somatic Mutations in Cancer (COSMIC) version 69 [28], and a panel of normals based on Phase 1 of the 1000 Genomes Project as resource files for the real sequencing data. We did not supply COSMIC for the DREAM Challenge, because the synthetic mutations were randomly chosen and were not enriched at COSMIC sites. In our in silico titration and SomaticSpike experiments, none of these databases was used.
For SomaticSniper, we used a mapping quality cut-off of 25, a base quality cut-off of 15, and a prior somatic mutation probability of 10^−4. For VarScan2, we used a mapping quality cut-off of 25 and a base quality cut-off of 20. For JointSNVMix2, we used a convergence threshold of 0.01 in training, and only considered calls with a somatic probability ≥0.95. For VarDict, we relaxed some built-in filters to increase its sensitivity (at the expense of precision). Specifically, we relaxed the variant depth filter from 4 to 2, and the Fisher's exact test (FET) p-value cut-off from 0.05 to 0.15. In addition, we allowed each call to fail up to two of VarDict's 20 filters.
These parameters were chosen based on our experience with Stage 3 of the DREAM Challenge. Except for the different resource files supplied to MuTect, we used the same configuration for all analyses presented in this study.
SomaticSeq workflow
The complete SomaticSeq workflow is illustrated in Fig. 1. From the raw sequencing reads, BAM files for the normal and tumor samples are generated following GATK best practices [29]. From the tumor–normal BAM files, a union of candidate somatic SNVs is obtained from MuTect, SomaticSniper, VarScan2, JointSNVMix2, and VarDict. Somatic indel candidates are obtained from VarScan2, VarDict, and Indelocator (aka SomaticIndelDetector). For each candidate mutation position, we integrate and standardize the feature sets. We use SAMtools [30] and GATK HaplotypeCaller on the tumor and normal BAM files to obtain a number of independent sequencing features that have predictive value for somatic mutation status, e.g., mapping quality, base call quality, strand bias, depth of coverage, tail distance bias, etc. Features reported by the individual callers, e.g., somatic mutation scores based on each caller's own statistics, are also included. For the DREAM Challenge and real data, we also consider whether the site is in dbSNP. Two of the most important features in the adaptively boosted classifiers are the root-mean-square mapping quality score and the number of read mismatches relative to the reference. Histograms visualizing the predictive value of some of these features are shown in Additional file 1: Fig. S3.
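As a minimal sketch of the candidate-union step (not the actual SomaticSeq code), the following R snippet collects candidate SNV sites from several callers' VCF outputs and records, per site, which callers made the call; the file names are hypothetical and the VCF parsing is deliberately simplified:

```r
# Build the union of candidate SNV positions from several callers' VCFs.
# File names are hypothetical; header lines (starting with "#") are skipped.
read_vcf_sites <- function(vcf_file, caller) {
  vcf <- read.table(vcf_file, comment.char = "#", sep = "\t",
                    quote = "", stringsAsFactors = FALSE)
  data.frame(CHROM = vcf[[1]], POS = vcf[[2]], REF = vcf[[4]], ALT = vcf[[5]],
             caller = caller, stringsAsFactors = FALSE)
}

callers <- c(MuTect = "mutect.vcf", SomaticSniper = "sniper.vcf",
             VarScan2 = "varscan.snv.vcf", JointSNVMix2 = "jsm.vcf",
             VarDict = "vardict.vcf")

calls <- do.call(rbind, lapply(names(callers), function(cname)
  read_vcf_sites(callers[[cname]], cname)))

# One row per candidate site; one binary column per caller (these become
# the "classification by caller X" features used downstream).
union_sites <- unique(calls[, c("CHROM", "POS", "REF", "ALT")])
for (cname in names(callers)) {
  key_called <- paste(calls$CHROM, calls$POS, calls$REF, calls$ALT)[calls$caller == cname]
  key_union  <- paste(union_sites$CHROM, union_sites$POS, union_sites$REF, union_sites$ALT)
  union_sites[[cname]] <- as.integer(key_union %in% key_called)
}
```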
We use the R package ada to train the stochastic boosting machine-learning algorithm in SomaticSeq [13]. The stochastic boosting learner constructs a classifier consisting of a sequence of decision trees based on up to 72 genomic and sequencing features to discriminate true somatic mutations in the training set. The constructed classifier is then applied to a target set, where it calculates the probability that each candidate site is a true somatic mutation (Fig. 2). Individually, some features are stronger predictors than others, but each adds value to the model; combined, they yield a highly accurate classifier. For the results described in this study, we used P ≥ 0.7 as the cut-off for our SomaticSeq results, i.e., a candidate site with P ≥ 0.7 is considered a PASS call, whereas a candidate site with P < 0.7 is considered LowQual. The cut-off of 0.7 was chosen to slightly prioritize precision over sensitivity, though the accuracy tends to be robust across a wide range of cut-off values (Fig. 2).
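A minimal sketch of this training/classification step (not the exact SomaticSeq invocation) is shown below; it assumes `train` and `target` are data frames of candidate sites sharing the same feature columns, with `train$TrueVariant` being a 0/1 label from the training truth set, and the number of boosting iterations is illustrative:

```r
library(ada)

# Fit a boosted ensemble of decision trees on the labeled training candidates.
train$TrueVariant <- factor(train$TrueVariant, levels = c(0, 1))
fit <- ada(TrueVariant ~ ., data = train, iter = 500)

# Probability that each target candidate is a true somatic mutation;
# column 2 of the "probs" matrix corresponds to the positive class (level "1").
p_somatic <- predict(fit, newdata = target, type = "probs")[, 2]

# P >= 0.7 is a PASS call; P < 0.7 is LowQual.
target$FILTER <- ifelse(p_somatic >= 0.7, "PASS", "LowQual")
```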
Trained models
The importance of each feature differs from data set to data set, but the most important features overlap substantially. Across all four settings of the DREAM Challenge SNV data, 18 of the top 20 features were shared. They are listed below (unranked; forward and reverse read counts are counted as separate features):
- Classification by MuTect (binary values of 0 or 1)
- Classification by JointSNVMix2
- Classification by SomaticSniper
- Classification by VarDict
- FET somatic p-value reported by VarDict
- Average mapping quality in the tumor BAM file
- Average mapping quality in the normal BAM file
- Forward and reverse read counts supporting the variant in the tumor
- Forward and reverse read counts supporting the reference in the tumor
- Forward and reverse read counts supporting the reference in the normal
- Read depth in the tumor
- Read depth in the normal
- Number of mismatches (compared with the reference) in tumor reads
- dbSNP membership (binary values of 0 or 1)
- Strand bias odds ratio reported by VarDict
Four of the top 18 features were the simple classifications made by the individual tools (whether or not the caller called the site a somatic mutation). The only caller classification not on the list was VarScan2's, which fell outside the top 20 only in Setting D; for Settings A, B, and C, the VarScan2 classification was ranked 16, 20, and 17, respectively. Because VarScan2 applies a series of (tunable) filters prior to evaluation, e.g., a minimum variant allele frequency (VAF) of 10 % and a minimum coverage of 8 ×, its sensitivity is reduced in challenging data sets like Setting D. Other important features relate to the quantity of evidence (e.g., read counts and read depth), the quality of evidence (read mismatches and mapping qualities), and sequencing artifacts (strand bias). Prior knowledge (dbSNP membership) was also valuable. Since eight of the top 18 features relate directly to sequencing depth, it is important that the model be trained on data with a sequencing depth comparable to the target set. Thus, it would not be appropriate to use a model trained on 30 × whole-genome sequencing to predict somatic mutations in 500 × targeted sequencing: 3 variant reads in 30 × data imply a 10 % VAF, but 3 variant reads in 500 × data imply a 0.6 % VAF, which is very little evidence over the expected sequencing error rate.
A trained model consists of an ensemble of decision trees with different relative weights. For the model trained on the combined DREAM Challenge Settings A and B, which we used to predict somatic mutations in the publicly available data (Table 1), the number 1 decision tree splits on T_MQ, the root-mean-square mapping quality in the tumor BAM file, and on VarScan2_Score, the Phred-scaled FET p-value reported by VarScan2; a tree view of this decision tree is presented in Additional file 1: Fig. S4. This trained model, along with the indel model trained from the same data set, can be downloaded from our Git repository [14].
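For illustration only (the actual fitted tree is shown in Additional file 1: Fig. S4), a single tree of this kind amounts to a few nested threshold splits on those features; the sketch below uses entirely hypothetical thresholds, not the trained values:

```r
# Illustrative only: the general shape of one decision tree in the ensemble,
# written as nested splits. The thresholds (50, 30) are hypothetical
# placeholders, not the values of the actual trained tree.
classify_one_tree <- function(T_MQ, VarScan2_Score) {
  if (T_MQ < 50) {
    if (VarScan2_Score < 30) "non-somatic" else "somatic"   # terminal nodes
  } else {
    "somatic"                                               # terminal node
  }
}

classify_one_tree(T_MQ = 58.7, VarScan2_Score = 12)
```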
The prediction accuracy on the target set largely depends on the similarity between the training and target sets. Ideally, a randomly sampled subset of the same data is used for training, as described by Kim et al., e.g., randomly choosing 500–1000 mutation candidates from a large sequencing study for validation sequencing, to construct a training set with hundreds of true somatic mutations and confirmed true negatives [11]. It is important to keep the training and target sets similar in terms of sequencing depth, platform, and data processing. For instance, if different mappers are used, or if quality scores are calibrated differently, the predictions may be less accurate. In addition, it may not be appropriate to use a carcinogen-driven tumor such as lung cancer to predict somatic mutations in a pediatric cancer, because the mutation profiles of those two cancer types are vastly different. We tested the performance of SomaticSeq with small subsets of the data (i.e., random samplings that include 10–1000 true somatic mutations), and present the results in Additional file 1: Tables S18 and S19, and Additional file 1: Figs. S5 and S6, for somatic SNVs and indels, respectively. Unsurprisingly, accuracy improves with an increasing number of data points in the training set, with diminishing returns after the training set reaches around 100–200 true somatic mutations; however, improvements over the individual tools are already seen with just 10–20 true mutations in the training set. In the absence of such a training set, a synthetic data set with characteristics close enough to the target set also suffices, as we did for COLO-829 and CLL1 by training on DREAM Challenge Stage 3. In this study, we calculate the accuracy of SomaticSeq's predictions based on the known ground truth for the synthetic data (i.e., DREAM Challenge, in silico titration, and SomaticSpike) and published lists of mutations for the real data (Fig. 10).
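A minimal sketch of this sampling strategy (with hypothetical object names) is shown below; `candidates` is assumed to be the full feature table of candidate sites from a large sequencing study:

```r
# Randomly sample candidate calls to send for validation sequencing,
# as suggested above, and use them as the training subset.
set.seed(1)

n_validate <- 1000                                 # e.g., 500-1000 candidates
idx        <- sample(nrow(candidates), n_validate)
train      <- candidates[idx, ]

# After validation sequencing, train$TrueVariant would be set to 1 for
# confirmed somatic mutations and 0 for confirmed false positives; the
# remaining candidates form the target set to be classified.
target <- candidates[-idx, ]
```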
In silico titration
We mixed two human genomes (NA12878 and NS12911) in silico to create a virtual tumor–normal sequencing experiment, as illustrated in Fig. 4. To create any mixture of the two genomes, the two BAM files were randomly downsampled to the appropriate fractions and then merged using Picard tools. To construct the ground truth for the in silico experiment, a location that was homozygous reference in NA12878 (designated the normal) and heterozygous in NS12911 (designated the tumor) was considered a somatic mutation. A site that was homozygous reference in both genomes was considered a reference site, so any somatic call in those regions was considered a false positive. All other genotype combinations were considered ambiguous for somatic analysis and were ignored in the downstream analysis. The schematic for building the ground truth is illustrated in Fig. 5. One major difference between our in silico titration and SomaticSpike is that the former used two entirely different genomes as the tumor and the normal, whereas SomaticSpike used the same genome for both but selectively spiked in alternate reads from a different genome as virtual mutations. Our in silico titration therefore captured not only sequencing artifacts associated with the sequencers, but also artifacts associated with two separate sample preparations, as would be expected in real tumor–normal studies.
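The ground-truth genotype logic can be sketched as follows (hypothetical column names); `gt` is assumed to hold per-site genotypes for the designated normal (NA12878) and the designated tumor (NS12911), coded as "hom_ref", "het", or "hom_alt":

```r
# Label each site for the in silico ground truth.
truth_label <- function(normal_gt, tumor_gt) {
  if (normal_gt == "hom_ref" && tumor_gt == "het") {
    "somatic"     # virtual somatic mutation
  } else if (normal_gt == "hom_ref" && tumor_gt == "hom_ref") {
    "reference"   # somatic calls here count as false positives
  } else {
    "ambiguous"   # ignored in the downstream analysis
  }
}

gt$truth <- mapply(truth_label, gt$NA12878, gt$NS12911)
```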
To obtain the highest-confidence truth set, heterozygous variant calls in NS12911 had to be agreed upon by three germline callers: HaplotypeCaller, FreeBayes, and SAMtools. In addition, we only considered the high-confidence callable regions (minimum depth of 10, minimum mapping quality of 20, and minimum base quality of 10) in both NA12878 and NS12911 for this exercise.
On average, there is at least one single-nucleotide polymorphism per 1000 bp in a human genome, and this in silico titration created over 810,000 virtual somatic mutations, corresponding to a prior somatic mutation rate of about one in 2700 positions. This is orders of magnitude more frequent than the rule of thumb of one somatic mutation per million base pairs [31]. Thus, using all 810,000 virtual somatic mutations would be a poor evaluation of a somatic caller's precision, because true hits would unrealistically outnumber false positives. It was necessary to enforce a realistic somatic mutation rate to better evaluate the precision of each tool for real cancer analysis. Therefore, 2218 somatic SNVs and indels were randomly chosen, while the remainder were masked from our analysis. In SomaticSpike, where we investigated SomaticSeq's performance as a function of sequencing depth and VAF, we randomly chose 3000 somatic SNVs to enforce a one-in-a-million prior mutation probability. We also refrained from using dbSNP or COSMIC membership as features for the in silico experiments, because membership status in these databases does not reflect reality for real cancers. In other words, the results for the in silico titration presented in this study did not take advantage of any prior knowledge.
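A minimal sketch of this subsampling step (with hypothetical object names) is shown below; `truth_set` is assumed to be the table of all >810,000 virtual somatic mutations:

```r
# Enforce a realistic somatic mutation rate by keeping a random subset of the
# virtual mutations and masking the rest from the evaluation.
set.seed(1)

n_keep <- 2218                                   # as used for the in silico titration
idx    <- sample(nrow(truth_set), n_keep)
kept   <- truth_set[idx, ]                       # evaluated as true somatic mutations
masked <- truth_set[-idx, ]                      # excluded from precision/recall tallies
```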
In the in silico titration, we created impure tumors by mixing tumor and normal sequencing reads. Some of the reads in the virtual impure tumor were therefore identical to reads in the normal and lacked the experimental artifacts expected in practice, because the normal contamination in this case came directly from the normal data rather than from normal tissue. This inflates precision, since identical reads cannot "fool" the callers. Therefore, to obtain a realistic number of false positives, we used the false positives obtained from the pure normal/pure tumor analysis as the false positives for all in silico analyses. In SomaticSpike, a separate tumor/normal whole-genome analysis was done for every sequencing depth, and the false positives from that analysis were used for every VAF study at that sequencing depth.
Functional annotation
SNVs reported by SomaticSeq were annotated with CADD rank scores from dbNSFP v2.8, in which scores are provided for every non-synonymous SNV. The rank score is the rank of the raw score divided by the total number of raw scores in dbNSFP; larger values indicate relatively higher deleteriousness. The Wilcoxon rank-sum test was conducted using pairwise.wilcox.test in R, with multiple-testing correction by the Holm–Bonferroni method (see Fig. 9). All SNVs reported by SomaticSeq were also annotated with raw CADD scores generated using CADD v1.2, and the raw scores were converted to rank scores (see Additional file 1: Fig. S7).
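A minimal sketch of the rank-score conversion and the group comparison (with hypothetical object names) is shown below; `cadd_raw` is a numeric vector of raw CADD scores for the scored SNVs, and `group` labels each SNV (e.g., by call set):

```r
# Convert raw CADD scores to rank scores: rank of the raw score divided by
# the total number of raw scores (larger = relatively more deleterious).
rank_score <- rank(cadd_raw) / length(cadd_raw)

# Pairwise Wilcoxon rank-sum tests with Holm(-Bonferroni) correction,
# as done for Fig. 9.
pairwise.wilcox.test(rank_score, group, p.adjust.method = "holm")
```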