Fig. 4 | Genome Biology

Fig. 4

From: Combining accurate tumor genome simulation with crowdsourcing to benchmark somatic structural variant detection

Characteristics of prediction errors. Random forests assess the importance of 16 sequence-based variables for each caller’s false-negative (FN; a, c, e, g, i) and false-positive (FP; b, d, f, h, j) breakpoints. Each panel shows variable importance on the left, where each row represents the best-performing set of predictions by the given team/caller (on the given in silico tumor), and each column represents the indicated variable. Dot size reflects variable importance, i.e., the mean change in accuracy caused by removing the variable from the model (generated to predict erroneous breakpoints). Color reflects the directional effect of each variable (red and blue for greater and lower variable values, respectively, associated with erroneous breakpoints; black for categorical variables or insignificant directional associations, two-sided Mann-Whitney P > 0.01). Background shading indicates the accuracy of the model (see the color bar). Variable importance for FN and FP breakpoints in each of the three tumors is shown for the following structural variant (SV) callers: CREST (a, b), Delly (c, d), and Manta (e, f). Manta only called two FPs in IS1; thus, variable importance for FP breakpoints could not be computed (indicated by Xs in the plot). Variable importance for FN and FP breakpoints in IS2 (g, h) and IS3 (i, j) is shown for each team. In the right-hand plot of panels g–j, the first four columns indicate usage of the indicated algorithmic approaches by each team, and the last column indicates the aligner used. Gray indicates that algorithmic approaches and aligner are unknown for the given team. Abbreviations: Algm, algorithm; SNP, single-nucleotide polymorphism; INDEL, short insertion or deletion.
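
The color rule in the legend can be illustrated with a minimal sketch. It is not the authors' code: the function names `mann_whitney_two_sided` and `direction_color` are hypothetical, and the P value uses a tie-corrected normal approximation without continuity correction, which is an assumption, since the legend does not say how the test was computed. The sketch covers only the directional coloring and does not reproduce the random forest importance calculation.

```python
import math


def mann_whitney_two_sided(x, y):
    """Return (U for x, two-sided P) using a tie-corrected normal approximation.

    Assumption: the normal approximation is used here for simplicity; the
    figure legend does not specify the exact procedure.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a shift
    z = (u_x - n1 * n2 / 2) / math.sqrt(var_u)
    return u_x, math.erfc(abs(z) / math.sqrt(2))


def direction_color(erroneous, correct, categorical=False, alpha=0.01):
    """Map one variable to the legend's colors (hypothetical helper).

    red   -> greater values associated with erroneous breakpoints
    blue  -> lower values associated with erroneous breakpoints
    black -> categorical variable, or P > alpha
    """
    if categorical:
        return "black"
    u_err, p = mann_whitney_two_sided(erroneous, correct)
    if p > alpha:
        return "black"
    return "red" if u_err > len(erroneous) * len(correct) / 2 else "blue"
```

For example, calling `direction_color` with the values of one variable at erroneous breakpoints and at correctly handled breakpoints yields the color that variable's dot would receive under this rule.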
