Evaluation of SV detection algorithms using simulated and real WGS data
We accessed 79 publicly available SV detection algorithms that can handle the human WGS data but do not require multiple samples such as matched datasets (e.g., control and tumor samples). We excluded 10 algorithms that did not work in our computational environment. Completed results were obtained with 69 algorithms using simulated and real human WGS data (Additional file 1: Tables S1 and S2, please see Additional file 1: Table S1 for the reference for each algorithm described below and Additional file 1: Table S2 for the list of unworked algorithms) to calculate the precision and recall. A simulated short read dataset was generated using the VarSim simulator [37]: first, a simulated GRCh37 human diploid genome into which known SVs had been introduced at the known sites was generated, then this was used to generate simulated paired-end short reads (125 bp) with 500 bp insert size averaging 30× coverage of the simulated genome (Sim-A). The number of simulated SVs of each type was slightly larger than the mean numbers detected for an individual human genome in the 1000 Genome project [6] (e.g., 1.3-fold higher for DELs, Additional file 1: Table S4-A and S4-C). Four sets of the NA12878 Illumina short read data (data1, data2, data3, and data4) and three sets of PacBio long read data (PacBio-data1, PacBio-data2, and PacBio-data3) were used as real datasets and were acquired from different sources with different read lengths and/or insert sizes (Additional file 1: Table S3). A reference SV dataset for the real data was generated by merging the DGV dataset corresponding to NA12878 and the INS, DEL, and INV data detected from NA12878 long read assemblies (Additional file 1: Table S4; see the “Methods” section for details).
These datasets, including the simulated data and four or three NA12878 datasets, were aligned with the GRCh37d5 reference genome using bwa [38] or other specific alignment tools (see the “Methods” section). The alignment data or read data were then used for calling DELs, DUPs, INSs, and INVs in all but the Y chromosome for the real data. Translocations were not evaluated because there are few known translocations in the databases and VarSim cannot simulate translocations. For DELs and DUPs, SVs were divided into four and three categories, respectively, depending on their sizes (DEL-SS: 50–100 bp; DEL-S and DUP-S, 100 bp to 1 kb; DEL-M and DUP-M, 1–100 kb; DEL-L and DUP-L, 100 kb to 1 Mb). We defined true called SVs as the called SVs that significantly overlap with the reference SVs by proportions (≧ 50% [or ≧ 80% for the simulated data] reciprocal overlap for DELs, DUPs, and INVs; overlap with a BP ± 200 bp for INSs). The outline of the entire evaluation processes is presented in Figure S1 in Additional file 1.
We observed changes in precision and recall by using different filtering thresholds; the minimum number of reads supporting the called SVs, termed “RSS” (Reads Supporting SV) in this study (see Additional file 1: Figure S2 for representative examples). Thus, to compare the performance of each algorithm as objectively as possible, we selected an RSS for each call set at which the numbers of calls for an SV type approximates the simulated reference data or the expected number of SVs in an individual (see the “Methods” section for detail). Both precision and recall were calculated for each size range of DELs (Additional file 1: Figure S3), DUPs (Additional file 1: Figure S4), INSs, and INVs (Additional file 1: Figure S5); for the real data, the mean precision and recall from the four short read datasets are presented. The numerical data for all the results for the Sim-A and multiple NA12878 real datasets are presented in Tables S5-S9 in Additional file 3. The precision and recall values at the selected RSSs for the four NA12878 real datasets and the mean and the standard deviation (SD) are presented in Table S10 in Additional file 3.
The precision and recall for calling SVs varied greatly depending on the algorithm, the SV type, and the size of the SV. Figures 1 and 2 highlight a number of algorithms that specifically and/or sensitively detected SVs for each type of SV and for each size range of SV (also see Additional file 1: Figures S3–S5 for precision–recall plots). Figure 1 shows the combined statistics (F-measure) for the precision and recall of each algorithm for calling each SV type and highlights a subset of algorithms that can call many SVs with a high level of precision and recall for both simulated and real datasets, which include 1-2-3-SV [39], DELLY [32], GRIDSS [40], inGAP-sv [41], Lumpy [35], Manta [42], MetaSV [43], Pindel [34], SoftSV [44], SvABA [45], and Wham [46]. Although many of the algorithms that call DELs or DUPs covered all the size ranges (S, M, and L) for both the simulated and real datasets, a subset of algorithms exhibited a limited performance in a specific size range (Fig. 2). For example, CLEVER [47] less effectively detected large DELs, and depth-based algorithms (e.g., AS-GENESENG [48], Control-FREEC [49], CNVnator, OncoSNP-Seq [50], readDepth [51], and GenomeSTRiP [33]) less effectively detected small DELs and/or DUPs.
The algorithms benchmarked in this study are based on one of the 10 method classes, including RP, RD, SR, AS, or LR alone, or one of five combined methods (RP-RD, RP-SR, RP-AS, RP-RD-S, and RP-SR-AS) (Additional file 1: Table S1). For calling DEL and DUP, the SR, LR, and RP-SR-AS methods achieved relatively good performance both with the simulated and the real data as shown in the precision–recall plots for the 10 categorized SV detection methods (Additional file 1: Figure S6).
In addition, we determined potential false-positive calls for each algorithm using NA12878 pedigree data, NA12878 for child and NA12891 and NA12892 for parents (Additional file 1: Table S3). The variants present only in child but not in both parents are attributable to Mendelian inheritance errors or de novo variants. Because the occurrence of de novo SVs is quite low and is thus negligible [28], the SV calls from only child are derived from Mendelian inheritance errors or false-negative call in parents. We determined Mendelian inheritance error rate (MIER; the percentage of Mendelian inheritance errors in the total calls) for each algorithm in each SV type. We observed a weak correlation between “100 − MIER” and precision for each algorithm in each SV type (the Spearman rank correlation coefficients, 0.31~0.46 for each SV type) (Additional file 1: Figure S7 and Additional file 3: Tables S6–S10 for numerical data). The weak correlation may be due to false-negative calls in parents and/or the presence of false positives that are called commonly between parents and child.
Evaluation with HG00514 WGS data
We further evaluated SV detection algorithm using another WGS real data of a Han Chinese individual HG00514 (Additional file 1: Table S3), which is one of the data used in the Human Genome Structural Variation Consortium (HGSV). In HGSV, a HG00514 SV set had been generated using 13 short read-based SV detection algorithms and using an approach with long read-based assemblies [36]. We used this SV set as a reference SV set, although it was devoid of INVs (Additional file 1: Table S4; see the “Methods” section for detail). We showed the performance of each algorithm for each type of SV and for each size range of SV using F-measure (Additional file 1: Figures S8 and S9) and using precision–recall plots (Additional file 1: Figures S10 and S11, and Additional file 3: Table S11 for numerical data), as demonstrated for the NA12878 datasets in the previous section. Although the tendency of precision and recall between algorithms was similar to that of the NA12878 results, the overall precision values especially for DELs were lower than those of NA12878 (mean precision in HG00514: 53.6 for DEL, 22.5 for DUP, 42.9 for INS; mean precision in NA12878: 62.0 for DEL, 27.9 for DUP, 47.7 for INS).
We examined the correlation in the SV calling accuracies between the six datasets (the four NA12878 real datasets, one HG00514 real dataset, and one simulation dataset), by comparing the accuracy ranks of algorithms between SV types and/or datasets with the Spearman rank correlation coefficients (Additional file 1: Figure S12). The rank correlation coefficients for these algorithms were high (> 0.7 for almost all cases) for all types of SV between the five real datasets, suggesting that the determined SV calling accuracies for the tested algorithms were robust at least among the NA12878 and HG00514 datasets. The accuracy ranks between the simulated and NA12878 real datasets correlated reasonably well for DELs (0.72) and INSs (0.61) but weakly correlated for INVs (0.57) and DUPs (0.48). This result suggests that the simulated data fails to accurately model the mechanisms of SV formation, especially the properties of the real DUPs and INVs, which often involve complex SVs in which other types of SVs are integrated [24]. Alternatively, DUPs and INVs for NA12878 may be insufficiently represented in the reference databases. Exceptionally, the accuracy ranks for DUPs between the simulated and HG00514 real datasets (0.72) were considerably higher than those between the simulated and NA12878 real datasets (0.49). This high correlation is probably because HG00514 DUPs reported in HGSV have been detected mainly with short read-based SV detection algorithms [36], in contrast with NA12878 DUPs that are derived mainly from array-based detection. On the other hand, the high correlation between all the datasets observed for DELs was probably because the NA12878 reference DELs were covered with the datasets derived from both array-based and assembly-based SV detection.
Evaluation of algorithms that call MEIs, NUMTs, and VEIs
Based on the identity of the inserted sequence, some INSs can be classified into special classes including MEIs, NUMTs, and VEIs. Thus, we next evaluated the subset of computational algorithms that detect specific classes of INSs. We used three different simulated datasets (Sim-MEI, Sim-NUMT, and Sim-VEI, generated using only the chr17 sequence; see the “Methods” section) and the four NA12878 real datasets to evaluate the performances of 12 algorithms and an additional five derivatives of three algorithms (Fig. 3, and see Additional file 3: Tables S5–S10 for the numerical data). For the real data, the numbers of true positives (TPs) was determined in place of recall, because MEI, NUMT, and VEI have not been defined for the NA12878 INS reference. We added NUMT-compatible versions of Mobster [52], MELT [53], and Tangram [54] (Mobster-numt, MELT-numt, and Tangram-numt) and VEI-compatible versions of Mobster and Tangram (Mobster-vei, Tangram-vei) to NUMT- and VEI-detection algorithms, respectively (see Additional file 4: Supplementary methods for detail).
For MEI calling, MELT and Mobster achieved higher performances with both the simulated and real data than the other algorithms (> 88% in precision and > 50% in recall [> 900 TPs], Fig. 3a and b). Although MELT had the highest recall for MEI calling, RetroSeq, Tangram, and Mobster exhibited higher recall metrics in calling simulated LINE1 than MELT (Additional file 3: Table S5). For NUMT, MELT-numt exhibited the highest precision (> 92%) both with the simulated and the real data but exhibited only 20% recall with the simulated data (Fig. 3c and d). A more increased recall for NUMT calling may be achieved by a combination with Tangram-numt or DINUMT, because MELT-numt calls exhibited only 67% overlap with the Tangram-numt or DINUMT calls. For VEI, Mobster-vei had the highest precision (100%) and recall (~ 90%) in the simulated data (Fig. 3c).
Evaluation of algorithms with long read data
We evaluated the performances of three SV detection algorithms with long read data, including PBHoney [22], Sniffles [55], and pbsv [56]. We also added a modified PBHoney algorithm (PBHoney-NGM), which used NGM-LR as alignment tool (see the “Methods” section). To generate a simulated dataset of long reads, PacBio long reads (average 7.5–20 kb) aimed at 10× coverage were simulated with Sim-A using the PBSIM simulator [57] (Fig. 4, Additional file 1: Table S3). For real data, we used long read datasets from three individuals: NA12878 (PacBio-data1 to PacBio-data3), HG002 (PacBio-HG002), and HG00524 (PacBio-HG00524) to determine precision and recall (Additional file 1: Table S3). pbsv achieved the highest precision and recall in DEL calling with the simulated data (Fig. 4, Additional file 3: Tables S5-S10 for the numerical data). Overall, however, the three algorithms exhibited similar accuracy in the real data, especially in the HG002 data. Although the input datasets used for evaluation of short read-based and long read-based algorithms were different, we compared the evaluation results of these three detection algorithms with those of short read-based ones (Figs. 1 and 2, Additional file 1: Figures S3–S5 and S8–S11). The long read-based algorithms exhibited good performances in calling short DELs (DEL-SS and DEL-S) and INSs despite the lower coverage of the long read data (10×) than that of the short read data (30×).
Effect of different properties of read data on detection accuracy
We examined how read and library characteristics affect the precision and recall of SV calling among algorithms with relatively high precision and/or recall for each type and each size range. We generated datasets with different read lengths (100 bp, 125 bp, and 150 bp), read coverage (10×, 20×, 30×, and 60×), and library insert size (400 bp, 500 bp, and 600 bp) and evaluated the SV calling accuracies of the algorithms with these datasets (Additional file 2: Figure S13).
Changes in read coverage prominently affected recall and precision (see Additional file 1: Tables S12 and S13 for the summarized and statistical results). Data with higher coverage exhibited higher recall due to an increased number of signals including discordant reads and split reads. Interestingly, for many algorithms data with higher coverage resulted in lower precision than data with lower coverage when compared at the same threshold of RSS (as representative examples, see Additional file 2: Figure S13-A, S13-N, S13-X, S13-Z, S13-AJ, S13-AN, S13-AS, and S13-AU). In many cases, the precision using high-coverage data was comparable to that with lower coverage when the threshold values of RSS were increased (Additional file 2: Figure S13-M, S13-T, S13-X, S13-Y, S13-AB, S13-AD, S13-AH, S13-AL, S13-AN, S13-AP, S13-AR, and S13-AU). These results suggest that increasing the read coverage results in an increased number of spuriously aligned reads that lead to miscalling of SVs. In contrast to read coverage, neither read length nor insert size greatly affected recall and precision. We noted overall moderate effects on recall and precision for INS calling, while larger insert sizes led to greater than 10% decreased recall for DEL calling for several algorithms including BreakDancer [30], DELLY, inGAP-sv, Meerkat [58], and RAPTR-SV [59] (Additional file 1: Tables S12 and S13).
Accuracy for calling breakpoints, sizes, and genotypes of SVs
We evaluated the accuracy with which each algorithm called breakpoints (BPs) and SV length (both calculated in root mean squared errors, RMSEs) using the Sim-A data (Additional file 3: Table S14; also see the “Methods” section for RMSEs). BreakSeek [60], BreakSeq2 [61], CREST [62], DELLY, GRIDSS, PBHoney-NGM, pbsv, SvABA, SVseq2 [63], and Wham achieved the highest accuracy (< 60-bp RMSE) for calling BPs for all size ranges of the DELs and/or DUPs. CREST, Manta, FermiKit [64], Pamir [65], pbsv, SVseq2, SoftSearch [66], Wham, and the specific INS detection algorithms (MEI and NUMT algorithms) exhibited the highest accuracy (< 10-bp RMSE) for calling INS BPs. Most algorithms that called BPs accurately used the split reads-based or assembly-based methods whereas algorithms only using the read depth-based alone approach exhibited poor BP resolution. BreakSeek, BreakSeq2, CLEVER, CREST, DELLY, FermiKit, GASVPro [67], GRIDSS, inGAP-sv, laSV [68], Lumpy, Manta, PBHoney-NGM, pbsv, PRISM [69], SvABA, SVseq2, and Wham provided higher accuracy (< 100-bp RMSV) for lengths of called DELs and/or DUPs, and most of these algorithms used the read pair-based or assembly-based method. These results suggest that the basic method used in SV detection algorithms affects the resolution of the called BPs and sizes.
Twenty-two algorithms used in this study call the genotypes or copy number associated with the detected SVs. We determined the precision and recall of the SV genotypes called with these algorithms using the Sim-A and NA12878 real datasets (Additional file 1: Figure S14 and Table S15). In the real datasets, only 335 DELs and 120 DUPs with specified genotype information were available. For the real DEL data, most algorithms exhibited > 95% precision. In contrast, most of the called DUPs did not match the 120 reference DUPs, limiting interpretation (Additional file 1: Table S15). For the simulated DEL data, Manta, Lumpy, Pindel, and ERDS [70] exhibited top performance in terms of both precision (> 90%) and recall (> 1900 TPs). PennCNV-Seq, CNVnator, BICseq2 [71], and readDepth exhibited high precision (> 89%) and recall (> 800 TPs) for the DUP data. For the INS data, Manta achieved the best performance, with > 97% precision. We note that algorithms with high performance genotype calling are also algorithms with good SV detection precision and recall.
Run time and memory consumption
Figure 5 shows run time and maximum memory per CPU for each SV detection algorithm, which were determined with 30× short read data (10× for long reads) of the NA12878 data1 that were aligned to the NA12878 chromosome 8 (146 Mb). SV detection algorithms directly using fastq read files (FermiKit, laSV, MinTheGap, Pamir, ITIS, and VirusSeq), many of which use the assembly method, exhibited long run time and large memory consumption. Algorithms requiring specific alignment tools, including VariationHunter [72] and long read-based algorithms, took longer run time than the standard algorithms using BWA. Pindel, known as a popular algorithm, also took longer run time although it exhibited good SV calling accuracy. Many of algorithms using the read depth method or detecting viral element insertions consumed larger memory than the others.
Systematic identification of pairs of algorithms showing high accuracy in their overlapping, called SVs
The above results revealed that the precision and recall with which a given algorithm calls SVs varies widely and depends on the types and size ranges of the SVs. However, few algorithms could call SVs with high precision, especially for DUP, INS, and INV of the real data, although the real dataset is likely to be incomplete (i.e., there are unidentified true SVs not present in our reference SV set). Several studies have taken the strategy of selecting SVs that are commonly called by multiple algorithms to increase the precision of the called SVs [13, 14, 24,25,26,27,28,29]. However, there has been no systematic investigation into optimal strategies to combine the results of multiple algorithms. We selected a total of 51 algorithms (12–38 algorithms for each SV type and size range) that exhibited relatively high precision and recall [the sum of recall (or precision) of the simulated and the NA12878 real data is > 10 for INS and INV or > 30 for the other types of SVs] for each type and each size range, and determined the precision and recall of the SVs that were commonly called for each combination of pairs of algorithms (Fig. 6 for INS and Additional file 1: Figures S15–S22 for DEL, DUP, and INV, also see Additional file 3: Table S16). The set of SVs called in common by two algorithms was more precise than the SVs called with either algorithm alone, as expected, yet this came at the cost of decreased recall. The degree of increased precision and decreased recall was varied depending on the algorithm combination. Combinations of algorithms that yielded more precise calls for a given type and size range of SV in both the simulated and real data are highlighted (Fig. 6 and Additional file 1: Figures S15–S22). We calculated the mean precision and recall values of overlapped calls between pairs of algorithms for each SV category (Additional file 1: Figure S23, Additional file 3: Table S17). As expected, high precision in the overlapped calls was often observed in pairs containing an algorithm exhibiting high precision by itself. Interestingly, however, several algorithms with a moderate level of precision in an SV category yielded higher precision in their overlapped calls. Examples of such good “team players” include CREST and VariationHunter in the DEL category and BASIL-ANISE [73] and BreakSeek in the INS category, each of which showed over twofold increase in combination with another algorithm.
We then examined how precision and recall change when combining algorithms across the six SV detection methods, including RP, SR, RD, AS, LR, and CB (Fig. 7 and Additional file 3: Table S18). The DEL-calling precision increased less than the other types of SV because precision was already high. In general, combinations of algorithms from two different method class led to higher precision but lower recall than two algorithms using the same methods (mean fold change of precision: 1.63× for the same method and 1.82× for different methods; mean fold change of recall, 0.5× for the same method and 0.33× for different methods) (Fig. 7). These results suggest that combining algorithms from two different methods is a better strategy for obtaining an accurate representation of SV than using two algorithms of the same class. However, the results also suggest that the importance of obtaining overlapping SV calls with high precision and high recall to select good pairs of algorithms, irrespective of the combination of methods used in the algorithms.