Overall identification of variability
Each of the four family members was sequenced using NGS Illumina technologies over three replicates and across three different sequencing centers. Each resulting data set was processed using four mappers and subsequently marked for PCR duplicated reads (dedup) as well as base quality score recalibration (recal) (see the “Methods” section). This generated 72 data points per individual and a total of 288 for the entire family. We obtained a stringent set of SVs for each data point utilizing combined calls from MetaSV [25] and Parliament2 [26] to avoid generating high variability from individual SV callers (Fig. 1A, Methods for details) [17]. Overall, the call sets showed higher variability between the parents (LCL7 and LCL8) as expected than within the identical twins (LCL5 and LCL6) (Fig. 1C, D). On average, we identified 12,723.36 SVs across all 72 SV call sets from LCL5. The majority of these were deletions (7990.26 SVs) followed by insertions (2330.22 SVs) and duplications (997.38 SVs). Figure 1E shows a summary of the number, variability, and different types of SVs detected (see Additional file 1: Table S1 for details) across LCL5. The SV distribution followed the expected number and type of SVs observable by short reads (Fig. 1E) [12]. On average, 9479.35 SVs overlapped with the GIAB consortium high-confidence regions (see the “Methods” section). As expected, most SVs were in intergenic regions followed by intragenic regions (Additional file 2: Table S2).
For this study, we also produced a SV call set using long-read sequencing (see the “Methods” section) to examine the quality of the short-read SV calls; previous reports demonstrate that long-read sequencing approaches often improves the sensitivity and false positive rates compared to NGS [12, 14, 27, 28]. For PacBio LCL5, we detected 15,171 SVs of which 7465 (49.21%) were insertions and 6412 (42.26%) were deletions. This follows the expected SV type distribution that has been previously observed by GIAB [12]. For LCL5, 7,734 (50.98%) SV calls from Illumina SV call sets overlapped with the PacBio SV call set. The vast majority of missed SV calls in the Illumina call sets were insertions (4308) followed by deletions (1863). For LCL5, on average per Illumina call set, we only observed 2279.99 overlapping SVs, where the majority were deletions. These data further demonstrated the large variability across the Illumina call sets.
To investigate this variability further, we mapped the pairwise overlap of each data set across the entire family. Figure 1C shows this overlap across a heatmap. While the general trend is encouraging — we observe more similarity between the siblings (LCL5+LCL6) compared to the parents (LCL7+LCL8) — it becomes apparent that there are methodological differences within each sample that appear to be systematic.
Short-read SV variability detection
Investigating variability between Illumina SV call sets required preliminary analyses of SV prediction across short-read SV mappers, sequencing centers, and replicates. On average, 26.86% of the SV were supported by 1 out of 4 mappers. More specifically, mapper calls supported by one center represented 57.14% of SV calls and mapper calls supported by one replicate represented 48.88% of SV calls (Fig. 1B). These variabilities highlighted the contribution of different factors (mappers, sequencing centers, replicates, and dedup/recal) to variability in SV calling. This is also apparent when considering the family structure as singleton (called in one replicate) SV calls in the twins are often (23.13% on average) not supported by the call sets of the parents (Fig. 1D). This reaffirmed a general high variability between the different centers, replicates, and approaches. To investigate each source of variability independently, we stratified for other analytically accessible causes of variability and compared the outcomes across the family. Figure 2A illustrates the use of this analytical strategy in investigating the different sources of variability in this data set.. For example, concordant SV calls between centers, replicates, and dedup/recall SV call sets enable examining variability between mappers (see “Methods” section for further detail). Across strategies, we observed that most singleton SVs were deletions, and a minority were insertions; this was consistent with the pattern observed in the total number of SVs per stratification strategy (Additional file 3: Table S3). The majority of SVs including singleton SVs are 100–1000bp in size followed by 1000–10,000bp (Additional file 4: Table S4). Moreover, mapper variability was enriched with insertions (13.23%) compared to other strategies (less than 2%), which highlights the deficiencies in identifying insertions by mappers.
Variability due to mapping methods
First, we investigated the contribution of different mapping methods (BWA-MEM, Bowtie2, Isaac, and Stampy) to the variability of SV calling after stratifying for other sources (e.g., centers, replicates, and dedup/recal) of variability (Fig. 2A). The variability attributed to different mappers per sample was the largest in comparison to other sources of variability (Additional file 5: Table S5). This was exemplified by Bowtie2, which did not report split reads leading to the smallest number of SVs identified (Fig. 2B). Generally, most SVs were supported by one or two mappers (Fig. 2B). Stampy showed the largest number of SVs per sample followed by Isaac; accordingly, both Stampy and Isaac showed the largest overlap with other mapping methods (Fig. 2B). Bowtie2, BWA-MEM, Stampy, and Isaac SVs overlapped the call set of at least one other method on average at 99.23%, 92.76%, 89.95%, and 85.19%, respectively.
Next, we investigated LCL5 and identified 4257 SVs across mappers after stratifying for other variabilities (Fig. 2B). SVs supported by one mapper (i.e., singletons) represented 25.02% (1065) of the total SVs and most of them (903 SVs, 84.79%) did not overlap with the PacBio call set. This high rate of non-overlapping SVs with the PacBio set indicated a high false positive rate for the singletons across the different mappers. For the non-singleton SVs (3192 SVs, 74.98%), we observed a substantial increase in the overlap with the PacBio call set to 60.35% (1926 SVs). Thus, the majority of the SV calls are likely true positives. Discrepancies between the call sets demonstrated that not all mappers capture all SVs, leading to increased variability between the SV call sets. Moreover, SVs were likely missed due to the differences between the SV mapping methods at that specific location. Thus, SVs supported by one (15.21% overlap), two (31.81% overlap), three (43.09% overlap), or four (87.13% overlap) mappers overlapped the PacBio call set at an increasing rate, respectively.
We observed that 59.25% of the SVs (631 SVs) supported by one mapper for LCL5 overlapped with mapper singletons from other family members (Fig. 3A). Both family overlapping and non-overlapping SVs were highly discordant with PacBio at 12.68% and 19.35%, respectively (Additional file 6: Table S6). This was consistent with a recurrent noise pattern pointing to false positive SV calls based on different SV mapping methods.
Variability due to different sequencing centers
The second largest contribution to variability was observed across sequencing centers (Fig. 2C). This was likely due to differences in coverage and sequencing library preparation including insert size variability. We observed that Center 1 showed the lowest average coverage (24.37x) followed by Center 2 (26.33x) and Center 3 (36.33x) (Fig. 3F). The insert size also varied on average with Center 1 (321.55 bp), Center 2 (318.53 bp), and Center 3 (399.12 bp) (Fig. 3F). Overall, we identified a slightly positive correlation in the number of detectable SV with an increase in coverage or insert size. Using Pearson correlation, we observed only a slight correlation between total number of SVs and the insert size that was statistically significant (cor=0.26, p value=0.028). This was likely because the increase in insert size leads to a larger span of the read pairs, and thus, a higher likelihood that they spanned a breakpoint. Accordingly, more pairs supported a certain SV compared to a smaller insert size library. We observed no statistically significant correlation between total SVs and mean coverage (cor=0.12, p value=0.33). This is likely the result of standardized workflows between the labs as described in the “Methods” section.
Next, starting with LCL5, we stratified for other sources of variability as described in our analysis strategy and examined variability across centers (Fig. 2A). Post-stratification, we obtained 2750 SVs for the three sequencing centers (Fig. 2C). We observed that a majority of the center variability was attributed to SVs that were supported by one sequencing center (i.e., singletons) with 696 (25.31%) SVs. Center 3 supported the highest number of singleton SV calls per center followed by sequencing Centers 1 and 2. Interestingly, 45.12% (314 SVs) of these singleton SVs overlapped with the PacBio SV call set. Thus, they were likely true positives and indicated that these singletons represented false negatives based on other centers. The variability in coverage, insert size, and library preparation, discussed previously, contributed to the differences observed between the centers (Fig. 3F). Overlapping SVs between two of the three centers resulted in 525 (19.09%) SVs of which we observed an even higher rate of 59.23% (311) SVs overlapping with the PacBio-based SV calls. The increase in false negatives and the decrease in false positives indicated an expected higher rate of SVs missed in individual centers. After this stratification strategy, 1529 SVs (55.6%) overlapped between all three centers and were concordant with the PacBio SV call set (87.25% or 1334 SVs). Overall, and compared to mappers, center variability SVs were concordant at a much higher rate with the PacBio data set highlighting that these were not false positives as has often been believed. Rather, individual centers captured true positives representing false negatives for the other centers.
Next, we investigated if these singleton SVs per center were similar across the family members (Fig. 3B). For LCL5, 59.91% (417 SVs) of singleton SVs overlapped with LCL6, LCL7, and/or LCL8 singletons. Of these overlapping SVs, 47.72% (199 SVs) overlapped with the PacBio SV call set (Additional file 6: Table S6). This indicated a high recurrent incidence of missed SV calls (false negatives) in the call sets of multiple centers even across related samples (Additional file 7: Fig. S1). Singleton SVs that are non-overlapping with other family members amounted to 40.09% (279 SV) for LCL5 across the centers. Of these SVs, 41.94% (117 SVs) overlapped the PacBio call set and were likely false negatives.
Other minor variability impacting SV detection
Subsequently, we analyzed variability attributed to differences between replicates per sample (Fig. 2D). Starting again with LCL5, we stratified for other causes of variability to obtain an SV set representing replicate variability (Fig. 2A). This resulted in 2453 SVs including 18.79% (461) supported by a single replicate, 18.51% (454 SVs) agreeing between two replicates, and 62.70% (1538 SVs) agreeing among all three. Over half (54.88%) of the singleton SV calls in LCL5 did not overlap with the PacBio SV call set and were likely false positive calls. The concordance with PacBio increased as supporting replicates increased from two replicates (64.10%) to three (87.26%). This indicated a decrease in false positives and an increase in false negatives for SVs supported by two replicates. Overall, we observed similar trends for other samples (Additional file 5: Table S5).
Next, we investigated the overlap among family members by aggregating each individual’s detected SVs that were singletons in each replicate after stratification (Fig. 3C). For LCL5, we observed that 54.01% (249 SVs) of replicate singleton SVs were shared between family members and 45.99% (212 SVs) were unique to LCL5. Similar proportions of each overlapped with the PacBio SV call set at 44.98% and 49.06%, respectively (Additional file 6: Table S6). Singleton replicate variability across family members did not show an enrichment of false positives or negative SVs based on this analysis.
Lastly, we investigated the impact of duplicate reads marking (dedup) and base quality score recalibration (recal) on the SV calling reproducibility. Differences between these steps per sample were the smallest contributor to variability after stratifying for variability from centers, mappers, and replicates. For LCL5, we defined a working set of 1653 SVs of which 150 were singletons and 1503 were non-singletons. 68% of singleton and 86.96% of non-singleton SVs overlapped with the PacBio call set. Thus, most of the variability in the call sets was likely due to false negative calls attributed to the differences between the dedup or recal call sets. For singleton SVs, we observed that 33.33% (50 SVs) overlapped with family singletons while 66.67% (100 SVs) did not (Fig. 3D). Of singleton SVs overlapping family singletons, 62% overlapped with the LCL5 PacBio call set while 71% of singleton SVs not overlapping family singletons overlapped with the PacBio call set. The majority of the LCL5 singleton SVs (80 SVs) were attributed to the dedup strategy while a minority (70 SVs) were part of the recal call set.
Best practices for short-read-based SV calling
In the previous sections, we defined and ranked the analytically accessible sources of variability impacting the SV call sets. This included examining alignment methods, insert sizes/coverage, and variability attributed to centers and replicates. We further examined all the different short-read sequencing pipelines and results to identify the most robust settings and benchmarked its variability using the PacBio long-reads SV call set. We observed that Stampy consistently resulted in the highest number of SVs calls (~15,076.55 across LCL5), however, with the smallest ratio of 21.89% overlapping SVs (~3301.61 SVs across LCL5) with the Pacbio call set (Additional file 8: Table S7). In contrast, Bowtie2 resulted in the least SVs called (~7704.55 SVs across LCL5), but with a higher ratio of 32.83% overlapping SVs (2529.83 SVs on average) with the PacBio call set (Additional file 8: Table S7). For non-overlapping SV, we observed that the major difference was seen in an increase in duplications, inversions followed by insertion calls. For example, for LCL5, the portion of falsely detected SVs increases from Bowtie2 (~5174.72 SVs), BWA-MEM (~8370.67 SVs), and Isaac (~9308.78 SVs) to Stampy (~11774.94 SVs) in that order, respectively.
Next, we investigated if by taking a consensus between mappers the SV calling can be improved. We observed that using a consensus between mapping techniques increased precision (Additional file 8: Table S7). This is especially useful when utilizing SV mappers that have a high sensitivity such as Stampy and Isaac, but initially a lower precision compared to Bowtie2. For example, for LCL5, SVs identified by both alignments from Isaac and Stampy (~8384.83 SVs) had a 34.18% overlap with Pacbio (2866.72 SVs) compared to Isaac alone (26.57%) or Stampy alone (21.9%). Whereas an overlap between Bowtie2 and Stampy did not yield a high recall (~5190.33 SV with a recall of 15.66%) although it had higher precision (~2376.5 SVs or 45.79%).
In summary, we observed that Bowtie2 mappings lead to the SV sets with the highest precision, but lowest sensitivity. This might promote its usage in settings where precision is more important outside of the research setting or in the clinical setting. Nevertheless, Bowtie2 potentially leads to less precise breakpoints that could hinder the interpretation of the SV itself. Combined approaches using Stampy, Isaac, and BWA-MEM allow for a less precise albeit a more sensitive approach to detecting SVs, which lends itself better to research applications. These observed trends hold true across SV call sets from all family members across the different mappers (Additional file 8: Table S7).