Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery

Accurate discovery of somatic variants is of central importance in cancer research. However, count statistics on discovered somatic insertions and deletions (indels) indicate that large numbers of discoveries are missed because of difficulties in quantifying the uncertainties related to gap and alignment ambiguities, twilight-zone indels, cancer heterogeneity, sample purity, sampling, and strand bias. We provide a unifying statistical model whose dependency structures enable accurate quantification of all inherent uncertainties within short time. As a consequence, the false discovery rate (FDR) in somatic indel discovery can now be controlled with utmost accuracy, increasing the number of true discoveries while safely suppressing the FDR.

S1 Why naive approaches to compute the likelihood function fail
To understand why efficient computation of equation (1) is difficult, consider that each of the reads Z^h_i, Z^t_j could (a) not stem from the particular variant locus, (b) stem from the locus but not be affected by the variant, or (c) stem from the locus and indeed be affected by the variant.
We recall that it can be particularly difficult to be certain about (a), (b), or (c) when dealing with reads associated with midsize indel loci (30-250 bp; sometimes termed the "NGS twilight zone"). Let k = |Z^t| and l = |Z^h| be the read coverage of the locus in the tumor and the healthy sample, respectively. Since there are 3 different possibilities, namely (a), (b), or (c), for each of the overall k + l reads, there are 3^(k+l) different scenarios that could reflect the truth, each of which applies with a particular probability. For computing equation (1) following a fully Bayesian approach to inverse uncertainty quantification [1], which is the approved and canonical way to quantify uncertainties in our setting, one needs to integrate over all possible choices for the k + l reads. In a naive approach, this translates into computing a sum with 3^(k+l) summands. Because k + l amounts to at least 60 to 70 in standard settings, naive approaches fail to compute the integral in humanly feasible runtime. This is further aggravated by the fact that one usually needs to consider hundreds of thousands of putative indel loci. So, methodical efforts are required for uncertainty quantification in our setting.
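To make the combinatorics concrete, the following sketch contrasts the naive enumeration of all 3^n scenarios with the collapse that becomes possible once reads are treated as conditionally independent given the latent allele frequencies. The function names and the toy per-read case probabilities are purely illustrative, not part of the model:

```python
import itertools
import random

def naive_likelihood(read_probs):
    """Sum over all 3**n scenarios; each read is case (a), (b), or (c).

    read_probs: one (p_a, p_b, p_c) tuple per read.
    Runtime is O(3**n) -- hopeless for n = k + l >= 60.
    """
    total = 0.0
    for scenario in itertools.product(range(3), repeat=len(read_probs)):
        product = 1.0
        for probs, case in zip(read_probs, scenario):
            product *= probs[case]
        total += product
    return total

def factorized_likelihood(read_probs):
    """If reads are conditionally independent, the exponential sum
    collapses into a product of per-read three-term sums: O(n)."""
    result = 1.0
    for probs in read_probs:
        result *= sum(probs)
    return result

random.seed(0)
reads = [tuple(random.random() for _ in range(3)) for _ in range(10)]
assert abs(naive_likelihood(reads) - factorized_likelihood(reads)) < 1e-9
```

The equality sum over scenarios of products = product of per-read sums is exactly what a dependency structure must license for the computation to become tractable.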

S2 Uniqueness and computation of the maximum likelihood estimate
The likelihood function of θ_h, θ_c, and β given the data Z^h and Z^t, as shown in equation (1), is a higher-order polynomial, which makes it infeasible to derive its maximum analytically. In this section, however, we show by proving Theorem 3.2 that under weak conditions the likelihood function attains a unique global maximum on the unit interval for each value of θ_h and β. In addition, we show that the log-likelihood function is strictly concave, which simplifies the numerical maximization.
Proof. With θ_h and β fixed, the likelihood function can be written in the form of equation (41), where C is a constant that does not depend on θ_c. In the case that theorem condition 1 is not met, C = 0. The likelihood L(θ_h, θ_c, β | Z^h, Z^t) then equals zero for all θ_c and, therefore, does not attain a unique global maximum.
Suppose theorem condition 1 is met (C > 0). Let us consider theorem condition 2. Note that L(θ_h, θ_c, β | Z^h, Z^t) = 0 when θ_c ∉ I, since for those values of θ_c there exists an observation for which P(Z^t_j | θ_h, θ_c, β) = 0. By the definition of I, the likelihood L is strictly larger than zero when θ_c ∈ I. Since the function in equation (41) is an l-th order polynomial and, therefore, continuous, it must attain a global maximum on the interval I.
Suppose theorem condition 2 is met. The point θ_c is a maximum of L(θ_h, ·, β | Z^h, Z^t) if and only if it is a maximum of the log-likelihood function (with θ_h, β fixed and θ_c ∈ I), since the logarithm is a monotonic transform. (Note that the log-likelihood is only defined on the subset I.) The second-order derivative of the log-likelihood with respect to θ_c is found to be non-positive, indicating that the log-likelihood function is concave. Note that it is strictly concave, i.e., ∂²/∂θ_c² < 0, iff there exists an observation z^t_j for which the respective inequality holds. This inequality holds only when α = 0, π^t_j = 0, and p^t_j = a^t_j, which constitutes theorem conditions 3 and 4. Suppose I is the non-empty closed interval [a, b] on the unit interval. Since the log-likelihood is strictly concave when theorem conditions 3 and 4 are met, it attains a unique global maximum θ_c on I. Because the logarithm is a monotonic transformation, θ_c must be a unique global maximum of the likelihood function as well.
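Strict concavity is what makes the numerical maximization routine: any standard one-dimensional optimizer converges to the unique optimum. As an illustrative sketch only (using a simple binomial log-likelihood as a stand-in for the full model of equation (1)), a golden-section search over the unit interval suffices:

```python
import math

def binomial_loglik(theta, successes, trials):
    # Toy stand-in for the model's log-likelihood: strictly concave
    # in theta on (0, 1) whenever 0 < successes < trials.
    return successes * math.log(theta) + (trials - successes) * math.log(1.0 - theta)

def maximize_concave(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximum of a strictly concave f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            # maximum lies in [a, d]; reuse c as the new d
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            # maximum lies in [c, b]; reuse d as the new c
            a, c = c, d
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)

# converges to the analytic MLE 30/100 = 0.3
theta_hat = maximize_concave(lambda t: binomial_loglik(t, 30, 100), 1e-9, 1.0 - 1e-9)
```

Without strict concavity, such a bracketing search could stall on a plateau or return one of several maxima; the theorem rules this out.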
A similar reasoning holds when I is open or half-open: the maximum must lie in the interior of I, since the likelihood function is zero at any endpoint not contained in I.

(Caption fragment, FDR calibration plot.) Below the diagonal, the control is conservative; above the diagonal, the FDR would be underestimated. Importantly, points below the diagonal mean that the true FDR is smaller than the provided threshold, so that FDR control is still established; in this sense, points below the diagonal are preferable to points above the diagonal.

Fig. S11: Recall and precision for calling somatic insertions on synthetic data (mixture rate 5%). Results are grouped by insertion length, denoted as an interval at the top of the plot. For our approach (Varlociraptor+*), curves are plotted by scanning over the posterior probability for having a somatic variant (for readability, each curve is terminated by a square mark). For other callers that provide a score to scan over (e.g. the p-value for Lancet), we plot a dotted line. Ad-hoc results are shown as single dots. Results are shown only if the caller provided at least 10 calls.
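The diagonal comparison above concerns FDR control derived from posterior probabilities. As a hedged sketch of the standard recipe (not necessarily Varlociraptor's exact implementation), one sorts calls by posterior probability of being somatic and grows the call set while the expected FDR, i.e. the mean posterior probability of being a false discovery, stays below the chosen threshold; the function name and toy posteriors are hypothetical:

```python
def control_fdr(posteriors, alpha):
    """Return indices of calls kept at expected FDR <= alpha.

    posteriors: posterior probability that each call is a true somatic
    variant; 1 - p is the local false discovery probability.
    """
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i], reverse=True)
    kept, fdr_mass = [], 0.0
    for i in order:
        fdr_mass += 1.0 - posteriors[i]
        # expected FDR of the enlarged call set
        if fdr_mass / (len(kept) + 1) > alpha:
            break
        kept.append(i)
    return kept

calls = [0.99, 0.95, 0.90, 0.60, 0.30]
print(control_fdr(calls, 0.05))  # -> [0, 1]
```

Because the expected FDR of the reported set is at most alpha by construction, the realized FDR tends to land on or below the diagonal in calibration plots, matching the conservative behavior described above.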