SVelter algorithm
SVelter takes aligned Illumina paired-end sequence data in sorted BAM format as input, along with the reference genome against which the sequences were aligned, and reports all predicted SVs in both a custom format and VCFv4.1. Default parameters are chosen to best balance sensitivity and efficiency, though they are adjustable so users can best fit their own data. The SVelter framework consists of four major modules: null model determination, BP detection, random iterative rearrangement, and structure scoring (Fig. 2).
Null model determination
SVelter first filters the reference genome to exclude regions of low mappability from downstream analysis, which increases efficiency by avoiding regions where alignments are unreliable. Such regions include gaps and unknown bases (Ns) in the reference genome; these are integrated with genomic regions of low mappability previously reported by ENCODE [31] (wgEncodeDacMapabilityConsensusExcludable and DukeMapabilityRegionsExcludable, obtained from the UCSC Genome Browser) to form a final set of excluded regions. Next, the IS distribution (f_IS) is determined by calculating the mean (μ_IS) and standard deviation (σ_IS) of all RPs aligned to genomic regions that are either randomly sampled or collected from a set of copy neutral (CN2) genomic regions, defined as places in the genome where no polymorphic CNVs, segmental duplications, or repetitive elements have been annotated, thus providing a good estimate of baseline alignment characteristics [22]. A normal distribution is constructed (f_IS ∼ N(μ_IS, σ_IS²)). Normal distributions of RD (f_RD ∼ N(μ_RD, σ_RD²)) and physical coverage (f_PC ∼ N(μ_PC, σ_PC²)) are characterized by sliding a fixed-size window (default: 100 bp) across the same genomic regions and computing the sample mean and standard deviation. Alternatively, in situations where the RD is not high enough to be approximated as normal (empirically, <10X), SVelter provides options for more complex but less efficient models, i.e. a bimodal distribution (fitted by mixtools) for IS,
$$ f_{IS} \sim p \times N\left(\mu_{IS_1}, \sigma_{IS_1}^2\right) + \left(1-p\right) \times N\left(\mu_{IS_2}, \sigma_{IS_2}^2\right) $$
and negative binomial for RD and physical coverage:
$$ \begin{array}{l} f_{RD} \sim NB\left(r_{RD}, p_{RD}\right),\ \mathrm{where}\ r_{RD} = \frac{\mu_{RD}^2}{\sigma_{RD}^2 - \mu_{RD}},\ p_{RD} = 1 - \frac{\mu_{RD}}{\sigma_{RD}^2} \\ f_{PC} \sim NB\left(r_{PC}, p_{PC}\right),\ \mathrm{where}\ r_{PC} = \frac{\mu_{PC}^2}{\sigma_{PC}^2 - \mu_{PC}},\ p_{PC} = 1 - \frac{\mu_{PC}}{\sigma_{PC}^2} \end{array} $$
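For the normal models these are simply the sample mean and standard deviation, and the negative binomial parameters follow from the method-of-moments relations above. Below is a minimal sketch, assuming NumPy and assuming IS, RD, and physical coverage values have already been collected from CN2 regions; the names are illustrative and not taken from SVelter's code.

```python
import numpy as np

def fit_null_models(insert_sizes, window_rd, window_pc):
    """Fit null distributions from copy-neutral (CN2) regions.

    Returns normal-model parameters (mu, sigma) for IS, RD, and
    physical coverage, plus method-of-moments negative binomial
    parameters (r, p) for RD and PC, for use when depth is too low
    (<10X) for the normal approximation.
    """
    models = {}
    for name, values in (("IS", insert_sizes),
                         ("RD", window_rd),
                         ("PC", window_pc)):
        v = np.asarray(values, dtype=float)
        mu, sigma = v.mean(), v.std()
        models[name] = {"normal": (mu, sigma)}
        # The NB parameterization requires overdispersion (variance > mean)
        if name in ("RD", "PC") and sigma ** 2 > mu:
            r = mu ** 2 / (sigma ** 2 - mu)
            p = 1 - mu / sigma ** 2
            models[name]["negbin"] = (r, p)
    return models
```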
Detection and clustering of putative BPs
SVelter next scans the input alignment file to define putative BPs where the sample genome differs from the reference. These are defined through the identification of aberrant read alignments. Clusters of RPs showing abnormal insert length or aberrant mapping orientation may indicate BPs nearby, while reads with truncated (clipped) split read (SR) alignments are indicative of precise BP positions. SVelter specifically defines aberrant reads as follows:
1. RPs with IS outside the expected range (μ_IS ± s × σ_IS, where s is the number of standard deviations from the mean; default: 3);
2. RPs that do not have forward-reverse pair orientation;
3. SRs whose clipped portion has high average base quality (default: 20) and spans a minimum fraction of the overall read length (default: 10 %).
It should be noted that the default parameters used by SVelter were determined empirically and can be adjusted by the user. Discordant RPs of the same orientation that fall within a window of μ_IS + 2 × σ_IS of one another are clustered together. Next, split reads within this window and downstream in the read direction are collated, and the clipped position is taken as a putative BP. If no such reads exist, the rightmost site of forward read clusters or the leftmost site of reverse read clusters is assigned instead. For each cluster of aberrant RPs, a BP is assigned if the total number of split reads exceeds 20 % of the RD or the total number of all aberrant reads exceeds 30 % of the RD. For samples of poorer quality, higher cutoffs might be preferred. Each putative BP is then paired with other BPs defined by the mates of its supporting reads. BP pairs that intersect or are physically close (<1 kb) to each other are further grouped and reported as a BP cluster for the next step.
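As an illustration of the three aberrant-read criteria above, the sketch below labels individual reads using pysam under the default thresholds; it is a simplified approximation, not SVelter's implementation, and the BAM path in the usage comment is hypothetical.

```python
import pysam

S = 3                 # standard deviations around the mean IS (default: 3)
MIN_CLIP_FRAC = 0.10  # clipped portion must be >= 10 % of read length
MIN_CLIP_QUAL = 20    # minimum average base quality of clipped bases

def classify_read(read, mu_is, sigma_is):
    """Label a read 'discordant', 'clipped', or None (concordant)."""
    if read.is_unmapped or read.mate_is_unmapped:
        return None
    tlen = abs(read.template_length)
    # Criterion 1: IS outside mu_IS +/- S x sigma_IS
    if tlen == 0 or not (mu_is - S * sigma_is <= tlen <= mu_is + S * sigma_is):
        return "discordant"
    # Criterion 2: pair is not forward-reverse oriented
    if read.is_reverse == read.mate_is_reverse:
        return "discordant"
    # Criterion 3: long, high-quality soft-clipped segment (CIGAR op 4)
    cig, quals = read.cigartuples or [], read.query_qualities
    if quals is not None:
        clips = []
        if cig and cig[0][0] == 4:
            clips.append(quals[:cig[0][1]])
        if len(cig) > 1 and cig[-1][0] == 4:
            clips.append(quals[-cig[-1][1]:])
        for q in clips:
            if len(q) >= MIN_CLIP_FRAC * read.query_length and \
               sum(q) / len(q) >= MIN_CLIP_QUAL:
                return "clipped"
    return None

# Usage against a sorted, indexed BAM (path and region hypothetical):
# bam = pysam.AlignmentFile("sample.sorted.bam", "rb")
# labels = [classify_read(r, 500.0, 50.0) for r in bam.fetch("1", 0, 10**6)]
```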
Random iterative rearrangement
For each BP cluster containing n putative BPs, a randomized iterative procedure is then applied to the n-1 genomic blocks between adjacent BPs. SVelter has three different modules implemented for this step: a diploid module (default) that detects structural variants on both alleles simultaneously, a heterozygous module that reports only high-quality heterozygous SVs, and a homozygous module for high-quality homozygous SVs only. For the diploid module, a simple rearrangement (deletion, inversion, or insertion) is randomly proposed and applied to each block on one allele while the other allele is kept unchanged, and the newly formed structure is scored against the null models of expectation for each feature through the scoring scheme described below. A new structure is then selected probabilistically from the distribution of scores, such that higher scores are more likely but not assured. The same approach is then applied to the other allelic structure, representing a single iteration overall. For the heterozygous and homozygous modules, only one allele is iteratively rearranged while the other allele remains either unchanged or is mirrored, respectively.
The iterative process will terminate and report a final rearranged structure if one of the following configurable situations is met:
1. No changes to the structure after 100 consecutive iterations; or
2. The maximum number of iterations is reached (default: 100,000).
After the initial termination, the structure is reset and the process is repeated for another 100 iterations while avoiding the fixed structure, at which point the highest scoring structure overall is chosen.
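A rough sketch of this propose-score-select loop for a single allele under the defaults above; selecting among proposals with softmax weights is our reading of "selected probabilistically from the distribution of scores", and all names are illustrative.

```python
import math
import random

def propose(structure):
    """Apply one random simple rearrangement (deletion, inversion, or
    re-insertion) to a copy of the structure."""
    s = list(structure)
    if not s:
        return s
    i = random.randrange(len(s))
    move = random.choice(("del", "inv", "ins"))
    if move == "del":
        del s[i]
    elif move == "inv":
        s[i] = (s[i][0], not s[i][1])
    else:
        blk = s.pop(i)
        s.insert(random.randrange(len(s) + 1), blk)
    return s

def iterative_rearrange(blocks, score_fn, max_iter=100_000, patience=100):
    """Iteratively rearrange the n-1 blocks between adjacent BPs.

    `score_fn` maps a structure (list of (block, inverted) tuples) to
    the score S_j defined under 'Structure scoring'.
    """
    current = [(b, False) for b in blocks]
    unchanged = 0
    for _ in range(max_iter):
        proposals = [propose(current) for _ in range(10)] + [current]
        scores = [score_fn(p) for p in proposals]
        mx = max(scores)                          # stabilize the softmax
        weights = [math.exp(s - mx) for s in scores]
        chosen = random.choices(proposals, weights=weights, k=1)[0]
        unchanged = unchanged + 1 if chosen == current else 0
        current = chosen
        if unchanged >= patience:                 # stable for 100 iterations
            break
    return current, score_fn(current)
```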
Structure scoring
Let S_j denote the score of rearranged structure j. To estimate S_j, four characteristics of each RP i are calculated and integrated: IS (IS_ij), pair orientation (PO_ij), RD (RD_ij), and physical coverage through a BP (PC_ij). As the distributions of IS, RD, and physical coverage have already been defined, the corresponding density functions are evaluated and transformed to log scale:
$$ \begin{array}{l} Score\_IS_{ij} = \log\left(f_{IS}\left(IS_{ij}\right)\right) \\ Score\_RD_{ij} = \log\left(f_{RD}\left(RD_{ij}\right)\right) \\ Score\_PC_{ij} = \log\left(f_{PC}\left(PC_{ij}\right)\right) \end{array} $$
The pair orientation score is specified by an indicator function:
$$ Score\_PO_{ij} = \begin{cases} 1, & \text{if}\ PO = \text{Forward-Reverse} \\ 0, & \text{otherwise} \end{cases} $$
Assuming a total of n read pairs aligned to the targeted genomic region, for each structure j the individual scores of each RP are integrated to form the structure score:
$$ S_j = \sum_{i=1}^{n} Score\_IS_{ij} \times \left(1 + \frac{\sum_{i=1}^{n} Score\_PO_{ij}}{n}\right) + \tau \sum_{i=1}^{n} Score\_RD_{ij} \times \left(1 - \sum_{i=1}^{n} Score\_PC_{ij}\right) $$
where \( \tau = \frac{\log\left(f_{IS}\left(\mu_{IS}\right)\right)}{\log\left(f_{RD}\left(\mu_{RD}\right)\right)} \) serves as a factor that scales the two parts to the same magnitude.
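A direct transcription of this score, assuming the normal null models and using SciPy's log densities; per-read IS/RD/PC values and orientation flags are assumed to be precomputed for candidate structure j, and all names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def structure_score(is_vals, po_flags, rd_vals, pc_vals,
                    mu_is, sd_is, mu_rd, sd_rd, mu_pc, sd_pc):
    """Compute S_j for one candidate structure (normal-model case).

    po_flags[i] is 1 if RP i is forward-reverse under structure j.
    """
    n = len(is_vals)
    score_is = norm.logpdf(is_vals, mu_is, sd_is).sum()
    score_po = np.sum(po_flags) / n
    score_rd = norm.logpdf(rd_vals, mu_rd, sd_rd).sum()
    score_pc = norm.logpdf(pc_vals, mu_pc, sd_pc).sum()
    # tau rescales the RD term to the magnitude of the IS term
    tau = norm.logpdf(mu_is, mu_is, sd_is) / norm.logpdf(mu_rd, mu_rd, sd_rd)
    return score_is * (1 + score_po) + tau * score_rd * (1 - score_pc)
```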
Performance assessment
Both simulated and real data were used to evaluate the performance of SVelter. To produce simulated datasets, we altered the human GRCh37 reference genome to independently include both homozygous and heterozygous simple and complex SVs, adding micro-insertions and short tandem repeats around the junctions at frequencies consistent with previously reported BP characteristics [32]. Details about the specific types of SVs simulated are summarized in Additional file 1: Table S1, and specific details regarding the generation of the simulated data can be found in Additional file 3: Supplemental Methods. Paired-end reads of 101 bp with an IS of 500 bp mean and 50 bp standard deviation were simulated using wgsim (https://github.com/lh3/wgsim) across different RDs (10X, 20X, 30X, 40X, 50X).
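For concreteness, a sketch of how such read sets might be generated, assuming wgsim's standard flags (-d outer distance, -s standard deviation, -1/-2 read lengths, -N number of read pairs) and choosing the pair count to hit each target depth; the file names are hypothetical.

```python
import subprocess

READ_LEN, MU_IS, SD_IS = 101, 500, 50
GENOME_LEN = 3_100_000_000          # approximate GRCh37 length

def simulate(ref_fa, depth, out1, out2):
    """Simulate paired-end reads at `depth`X coverage with wgsim."""
    # Choose N so that 2 * READ_LEN * N / GENOME_LEN ~= depth
    n_pairs = depth * GENOME_LEN // (2 * READ_LEN)
    subprocess.run(["wgsim", "-d", str(MU_IS), "-s", str(SD_IS),
                    "-1", str(READ_LEN), "-2", str(READ_LEN),
                    "-N", str(n_pairs), ref_fa, out1, out2], check=True)

for depth in (10, 20, 30, 40, 50):
    simulate("altered_GRCh37.fa", depth,
             f"sim_{depth}x_1.fq", f"sim_{depth}x_2.fq")
```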
For comparisons using real sequence data, we adopted two previously published samples: CHM1 [24] and NA12878 [18]. CHM1 has been deep sequenced by Illumina whole-genome sequencing to 40X and by Single Molecule, Real-Time (SMRT) sequencing to 54X, and SVs in this sample have been detected and published by the same group (http://eichlerlab.gs.washington.edu/publications/chm1-structural-variation/). NA12878, together with the other 16 members of CEPH pedigree 1463, has been deep sequenced to 50X on the Illumina HiSeq2000 system (http://www.illumina.com/platinumgenomes/). Additionally, the GIAB Consortium has published PacBio sequencing data (20X) for NA12878 and has also provided a set of high-confidence SV calls [24, 27].
We assessed SVelter against four other algorithms with diverse approaches: Pindel, Delly, Lumpy, and ERDS. We applied these algorithms to both simulated and real data with default settings, except that SVelter's homozygous module was used for CHM1. All algorithms were compared using the same set of excludable regions and were run on the same computing cluster.
Assessment of simulated simple SVs
For simulated datasets, we compared the performance of each algorithm by calculating their sensitivity and FDR on each type of simple SV (deletion, dispersed duplication, tandem duplication, inversion). As Lumpy reports BPs as ranges, we calculated the median coordinate of each reported interval and considered it as the BP for downstream comparison. A reported SV was considered a TP if the genomic region it spanned overlapped a simulated SV of the same type by over 50 % reciprocally. As Delly and Lumpy do not differentiate tandem from dispersed duplication in their SV reports, we compared their reported duplications to both simulated tandem and dispersed duplications independently to calculate sensitivity, but used the entire set of simulated duplications together for the calculation of specificity. In this manner, the result will be biased towards higher TP and TN rates for these approaches. Dispersed duplications reported by Pindel were very rare and as such were processed in the same way as for Delly and Lumpy.
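The 50 % reciprocal overlap test itself is compact; the sketch below assumes calls and simulated truth are represented as hypothetical (chrom, start, end, svtype) tuples.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end, frac=0.5):
    """True if two intervals overlap by > `frac` of BOTH lengths."""
    ov = min(a_end, b_end) - max(a_start, b_start)
    return ov > frac * (a_end - a_start) and ov > frac * (b_end - b_start)

def is_true_positive(call, truth_set):
    """A reported SV is a TP if it reciprocally overlaps (>50 %) a
    simulated SV of the same type on the same chromosome."""
    return any(call[0] == t[0] and call[3] == t[3] and
               reciprocal_overlap(call[1], call[2], t[1], t[2])
               for t in truth_set)
```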
Assessment of real SVs
We initially made use of reported simple and complex SVs in CHM1 and NA12878 as gold standard sets; however, the FP rate of each algorithm was high compared to previously published performance. To augment this set, we therefore developed our own approach to validate simple and complex SVs using PacBio long reads. For each reported SV, we collect all PacBio reads that span the targeted region and hard clip each read prior to the start of the region. We then compare each read to the local reference and to an altered reference reflecting the structure of the reported SV by sliding a 10 bp window through the PacBio read and aligning it against the reference sequence. Coordinates of each window are plotted against its aligned position in the form of a recurrence plot. If a read was sampled from the reference genome, most of the matched points should be distributed close to the diagonal line. However, if a read was sampled from an altered genomic region, a continuous diagonal line would only appear when plotted against a correctly resolved sequence. In this manner, shorter SVs can be validated by assessing the deviation of all matched points from the diagonal. When aligning long read j against reference k, the deviation of point i (x_ijk, y_ijk) is defined as d_ijk = |x_ijk − y_ijk|, i.e. the vertical distance of the point to the diagonal. The deviation score of each read j is calculated by summing the deviations of all points:
$$ S_{jk} = \sum_{i=1}^{n} d_{ijk} $$
where n is the number of matches. For each long read j, the recurrence plot is made against the reference in both the original and altered forms, with corresponding scores S_j,k=orig and S_j,k=alt assigned, and the score for the read is defined as
$$ S_j = \frac{S_{j,k=orig}}{S_{j,k=alt}} - 1 $$
such that a positive score indicates that the altered genome is favored over the reference genome, and vice versa. The validation score of an SV is defined as the proportion of supportive reads among the m interrogated reads:
$$ S_{val} = \frac{\sum_{j=1}^{m} I\left(S_j > 0\right)}{m} $$

where I is the indicator function and m is the number of reads checked.
SVs with a validation score >0.5 for a haploid genome, or >0.3 for a diploid genome, were considered validated. We further assessed our ability to interrogate SVs in this fashion by scoring the reference sequence against itself at each region. In highly repetitive regions, the deviation scores will be higher overall, and we label such regions as non-assessable.
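A minimal sketch of this validation scheme: 10 bp windows of a hard-clipped read are matched against each reference haplotype and the deviations from the diagonal are summed. Exhaustive k-mer matching stands in here for the sliding-window alignment, and all names are illustrative.

```python
def deviation_score(read_seq, ref_seq, k=10):
    """Sum |x - y| over all k-mer matches between read and reference."""
    ref_kmers = {}
    for y in range(len(ref_seq) - k + 1):
        ref_kmers.setdefault(ref_seq[y:y + k], []).append(y)
    total = 0
    for x in range(len(read_seq) - k + 1):
        for y in ref_kmers.get(read_seq[x:x + k], ()):
            total += abs(x - y)       # vertical distance to the diagonal
    return total

def read_score(read_seq, ref_orig, ref_alt, k=10):
    """S_j = S_{j,orig} / S_{j,alt} - 1; positive favors the altered allele."""
    s_alt = deviation_score(read_seq, ref_alt, k)
    return deviation_score(read_seq, ref_orig, k) / s_alt - 1 if s_alt else 0.0

def validation_score(read_scores):
    """Proportion of interrogated reads with S_j > 0."""
    return sum(s > 0 for s in read_scores) / len(read_scores)
```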
For longer (>5 kb) SVs, PacBio reads spanning the whole targeted region are rarely observed in these data. In this situation, we scored each BP separately by adding 500 bp flanks and assessing each individually. The final validation score is then determined by collating matches from all BPs.
We reassessed our initial TP and FP simple calls from each algorithm by combining the PacBio-validated SVs from each algorithm with the reported calls. For simple SVs, we utilized a 50 % reciprocal overlap criterion. However, for CSVs we utilized a more complex comparison strategy to take into account that some algorithms often detect individual parts of a complex rearrangement as distinct events. For each CSV predicted by SVelter, we extracted SVs with over 50 % reciprocal overlap from the other algorithms and calculated a validation score for each using the PacBio validation approach described above. When multiple SVs were extracted from an algorithm, their scores were averaged. Validation scores of a CSV from all algorithms were then ranked and normalized from 0 to 1 for comparison.