We propose a breakpoint prediction framework that can accommodate multiple classes of evidence from multiple sources in the same analysis. Our framework makes use of an abstract breakpoint evidence type to define a set of functions that serve as an interface between specific evidence subtypes (for example, paired-end sequence alignments and split-read mappings) and the breakpoint type. Any class of evidence for which these functions can be defined may be included in our framework. To demonstrate the applicability of this abstraction, we defined three breakpoint evidence subtypes: read-pair, split-read, and a general breakpoint interface.
Since our framework combines evidence from multiple classes, it extends naturally to include evidence from multiple sources. The sources that can be considered in a single analysis may be any combination of evidence from different samples, different evidence subclasses from a single sample, or prior information about known variant positions. We refer to a given set of data as a breakpoint evidence instance, and assume that each instance contains only one evidence subtype and is from a single sample. To help organize the results of analysis with multiple samples or multiple instances for a single sample, each instance is assigned an identifier that can be shared across instances.
Breakpoint
A breakpoint is a pair of genomic coordinates that are adjacent in a sample genome but not in a reference genome. Breakpoints can be detected, and their locations predicted by various evidence classes such as paired-end sequence alignments or split-read mappings. To support the inclusion of different evidence classes into a single analysis, we define a high-level breakpoint type as a collection of the evidence that corroborates the location and variety (for example, deletion, tandem duplication, and so on) of a particular breakpoint. Since many evidence classes provide a range of possible breakpoint locations, we represent the breakpoint’s location with a pair of breakpoint intervals where each interval has a start position, an end position, and a probability vector that represents the relative probability that a given position in the interval is one end of the breakpoint. More formally, a breakpoint is a tuple b = ⟨E,l,r,v⟩, where b.E is the set of evidence that corroborates the location and variety of a particular breakpoint; b.l and b.r are left and right breakpoint intervals each with values b.l.s and b.l.e that are the start and end genomic coordinates and b.lp is a probability vector where |b.l.p| = b.l.e – b.l.s and b.l.p[i] is the relative probability that position b.l.s + i is one end of the breakpoint (similar for b.r); and b.v is the breakpoint variety. Within the context of this method, breakpoint variety determinations are based on the orientation of the evidence. It is important to note that while a breakpoint may be labeled as a deletion when it contains evidence from a paired-end sequence alignment with a +/-orientation, the breakpoint may in fact be the result of some other event or series of complex events.
If there exist two breakpoints b and c in the set of all breakpoints B where b and c intersect (b.r intersects c.r, b.l intersects c.l, and b.v = c.v), then b and c are merged into interval m, b and c are removed from B, and m is placed into B. The evidence set m.E is the union of the evidence sets b.E and c.E.
A straightforward method to define breakpoint intervals m.l and m.r would be to let m.l.s = max(b.l.s, c.l.s) and m.l.e = min(b.l.e, c.l.e), similar for m.r. However, if a spurious alignment is merged into a set of genuine breakpoints, the resulting breakpoint interval can be ‘pulled’ away from the actual breakpoint. The impact of an outlier can be minimized or eliminated once the full set of corroborating alignments is collected for a given breakpoint, but collecting the full set is complicated by the fact that alignments are considered in order and outliers typically occur first. To account for this, we define a liberal merge process where m.l.s is the mean start position for the left intervals in m.E, and m.l.e is the mean end position for the left intervals in m.E, similar for m.r.
Once all the evidence has been considered, an SV call s (also a breakpoint) is made for each breakpoint b ∈ B that meets a user-defined minimum evidence threshold (for example, four pieces of evidence). The boundaries of the breakpoint intervals s.l and s.r are the trimmed product of the distributions of the left and right intervals in b.E. Let s.l.s = max({e.l.s | e ∈ b.E}), s.l.e = min({e.l.e | e ∈ b.E}), and s.l.p[i] = ∏e∈b.Ee.l.p[i-o] where o is the offset value e.l.s - s.l.s (similar for s.r). The intervals s.l and s.r can then trimmed to include only those positions that are in the top percentile (for example, top 99.9% of values) based on a user-provided value. Given the liberal merge process, it is possible for b.E to contain non-overlapping distributions that would result in a zero-length product. In this case, we identify the maximum point among the sum of the distributions in b.E, any distribution not intersecting this point is removed, and the resulting subset processed normally. Regardless of the trimming value, LUMPY reports both the intervals that contain 95% of the resulting probability distribution and the maximum position of s.l.p and s.r.p. Summation is another option for calculating the combined distribution boundaries and values. In that case s.l and s.r are the trimmed mixture distributions of the left and right intervals in b.E. Let s.l.s = min({e.l.s | e ∈ b.E}), s.l.e = max({e.l.e | e ∈ b.E}), and s.l.p[i] = ∑e∈b.Ee.l.p[i-o]. The value at s.l.p[i] (or s.r.p[i]) represents the level of agreement among the evidence in b.E that position i is one end of the breakpoint. While summation will give less precise breakpoint predictions, it can be a useful option when considering low-quality data.
Breakpoint evidence
To combine distinct SV alignment signals such as read-pair and split-read alignments with the general breakpoint type defined above, we define an abstract breakpoint evidence type. This abstract type defines an interface that allows for the inclusion of any data that can provide the following functions: is_bp determines if a particular instance of the data contains evidence of a breakpoint; get_v determines the breakpoint variety (for example, deletion, tandem duplication, inversion, and so on); and get_bpi maps the data to a pair of breakpoint intervals.
To demonstrate the applicability of this abstraction, we defined three breakpoint evidence instances: paired-end sequencing alignments, split-read alignments, and a general breakpoint interface. Read-pair and split-read are the most frequently used evidence types for breakpoint detection, and the general interface provides a mechanism to include any other sources of information such as known variant positions or output from other analysis pipelines (for example, read-depth calls). As technologies evolve and our understanding of structural variation improves, other instances can be easily added.
Paired-end alignments
Paired-end sequencing involves fragmenting genomic DNA into roughly uniformly sized fragments, and sequencing both ends of each fragment to produce paired-end reads ⟨x, y⟩, which we will refer to as ‘read-pairs’. Each read is aligned to a reference genome R(x) = ⟨c, o, s, e⟩, where R(x).c is the chromosome that x aligned to in the reference genome, R(x).o = +| - indicates the alignment orientation, and R(x).s and R(x).e delineate the start and end positions of the matching sequence within the chromosome. We assume that both x and y align uniquely to the reference and that R(x).s < R(x).e < R(y).s < R(y).e. While in practice it is not possible to know the position of read x in the sample genome (in the absence of whole-genome assembly), it is useful to refer to S(x) = ⟨o, s, e⟩ as the alignment of x with respect to the originating sample’s genome.
Assuming that genome sequencing was performed with the Illumina platform, read-pairs are expected to align to the reference genome with a R(x).o = +, R(y).o = - orientation, and at distance R(y).e - R(x).s roughly equivalent to the fragment size from the sample preparation step. Any read-pair that aligns with an unexpected configuration can be evidence of a breakpoint. These unexpected configurations include matching orientation R(x).o = R(y).o, alignments with switched orientation R(x).o = -, R(y).o = +, and an apparent fragment length (R(y).e - R(x).s) that is either shorter or longer than expected. We estimated the expected fragment length to be the sample mean fragment length l, and the fragment length standard deviation to be the sample standard deviation s from the set of properly mapped read-pairs (as defined by the SAM specification) in the sample data set. Considering the variability in the sequencing process, we extend the expected fragment length to include sizes l + v
l
s, where v
l
is a tuning parameter that reflects spread in the data.
When x and y align to the same chromosome (R(x).c = R(y).c), the breakpoint variety can be inferred from the orientation of R(x) and R(y). If the orientations match, then the breakpoint is labeled as an inversion, and if R(x).o = - and R(y).o = + then the breakpoint is labeled as a tandem duplication. Any breakpoint with the orientation R(x).o = + and R(y).o = - is labeled as a deletion. When x and y align to different chromosomes (R(x).c ≠ R(y).c), the variety is labeled inter-chromosomal. At present, LUMPY does not explicitly support identification of insertions that are spanned by paired-end reads; however, if desired these can be identified in a post-processing step through assessment of ‘deletion’ calls.
To map ⟨x, y⟩ to breakpoint intervals l and r, the ranges of possible breakpoint locations must be determined and probabilities assigned to each position in those ranges. By convention, x maps to l and y to r, and for the sake of brevity we will focus on x and l since the same process applies to y and r. Assuming that a single breakpoint exists between x and y, then the orientation of x determines if l will be upstream or downstream of x. If the R(x).s = +, then the breakpoint interval begins after R(x).e (downstream), otherwise the interval ends before R(x).s (upstream).
The length of each breakpoint interval is proportional to the expected fragment length L and standard deviation s. Since we assume that only one breakpoint exists between x and y, and that it is unlikely that the distance between the ends of a pair in the sample genome (S(y).e - S(x).s) is greater than L, then it is also unlikely that one end of the breakpoint is at a position greater than R(x).s + L, assuming that R(x).o = +. If R(x).o = -, then it is unlikely that a breakpoint is at a position less than R(x).e - L. To account for variability in the fragmentation process, we extend the breakpoint to R(x).e + (L + v
f
s) when R(x).o = +, and R(x).s - (L + v
f
s) when R(x).o = -, where v
f
is a tuning parameter that, like v
l
, reflects the spread in the data.
The probability that a particular position i in the breakpoint interval l is part of the actual breakpoint can be estimated by the probability that x and y span that position in the sample. For x and y to span i, the fragment that produced ⟨x, y⟩ must be longer than the distance from the start of x to i, otherwise y would occur before i and x and y would not span i (contradiction). The resulting probability is P(S(y).e - S(x).s > i - R(x).s) if R(x).o = +, and P(S(y).e - S(x).s > R(x).e - i) if R(x).o = -. While we cannot directly measure the sample fragment length (S(y).e - S(x).s), we can estimate its distribution by constructing a frequency-based cumulative distribution D of fragment lengths from the same sample that was used to find L and s, where D(j) gives the proportion of the sample with fragment length greater than j.
Split-read alignments
A split-read alignment is a single DNA fragment X that does not contiguously align to the reference genome. Instead, X contains a set of two or more substrings x
i
…x
j
(X = x1x2…x
n
), where each substring aligns to the reference R(x
i
) = ⟨c, o, s, e⟩, and adjacent substrings align to non-adjacent locations in the reference genome R(x
i
).e ≠ R(x
i
+ 1).s + 1 or R(x
i
).c ≠ R(x
i
+ 1).c for 1 ≤ i ≤ n – 1. A single split-read alignment maps to a set of adjacent split-read sequence pairs (⟨x1, x2⟩,⟨x2, x3⟩,…,⟨x
n
-1,x
n
⟩), and each pair ⟨x
i
,x
i
+ 1⟩ is considered individually.
By definition, a split-read mapping is evidence of a breakpoint and therefore the function is_bp trivially returns true.
Both orientation and mapping location must be considered to infer the breakpoint variety for ⟨x
i
, x
i
+ 1⟩. When the orientations match R(x
i
).o = R(x
i
+ 1).o, the event is marked as either a deletion or a tandem duplication. Assuming that R(x
i
).o = R(x
i
+ 1).o = +, R(x
i
).s < R(x
i
+ 1).s indicates a gap caused by a deletion and R(x
i
).s > R(x
i
+ 1).s indicates a tandem duplication. These observations are flipped when orientations R(x
i
).o = R(x
i
+ 1).o = -. When the orientations do not match R(x
i
).o ≠ R(x
i
+ 1).o, the event is marked as an inversion and the mapping locations do not need to be considered. When x and y align to different chromosomes, the variety is marked as inter-chromosomal. LUMPY does not currently attempt to identify insertions that are completely contained within a long read, but this will be supported in future versions. We note that this capability requires long-read aligners to report the number and order of alignments within a read (which is not formally supported in the current SAM format specifications).
The possibility of errors in the sequencing and alignment processes creates some ambiguity in the exact location of the breakpoint associated with a split-read alignment. To account for this, each alignment pair ⟨x
i
, x
i
+ 1⟩ maps to two breakpoint intervals l and r centered at the split. The probability vectors l.p and r.p are highest at the midpoint and decrease exponentially toward their edges. The size of this interval is a configurable parameter v
s
and is based on the quality of the sample under consideration and the specificity of the alignment algorithm used to map the sequences to the reference genome.
Depending on the breakpoint variety, the intervals l and r are centered on either the start or the end of R(x
i
) and R(x
i
+ 1). When the breakpoint is a deletion l is centered at R(x
i
).e and r at R(x
i
+ 1).s, and when the breakpoint is a tandem duplication l is centered at R(x
i
).s and r at R(x
i
+ 1).e. If the breakpoint is an inversion, l and r are both centered either at the start positions or end positions of R(x
i
) and R(x
i
+ 1), respectively. Assuming that R(x
i
).s < R(x
i
+ 1).s, if R(x
i
).o = + then l and r are centered at R(x
i
).e and R(x
i
+ 1).e, otherwise they are centered at R(x
i
).s and R(x
i
+ 1).s. If R(x
i
).s > R(x
i
+ 1).s, then the conditions are swapped.
Generic evidence
The generic evidence subclass provides a mechanism to directly encode breakpoint intervals using the BEDPE format [17]. BEDPE is an extension of the popular BED format that provides a means to specify a pair of genomic coordinates; in this case the pair represents the two breakpoint positions in the reference genome. This subclass extends our framework to include SV signal types that do not yet have a specific subclass implemented. For example, the set of variants that are known to exist in the population can be included in the analysis of an individual or variants that are known to exist in a particular type of cancer can be included in the analysis of a tumor. This signal can be included in the analysis by expanding the edges of the predicted intervals to create breakpoint intervals, and encoding these intervals in BEDPE format. Each BEDPE entry is assumed to be a real breakpoint (is_bp), the variety is encoded in the auxiliary fields in BEDPE (get_v), and the intervals are directly encoded in BEDPE (get_bpi).
Performance comparisons
Both simulated and real datasets were used to compare the sensitivity and FDR of LUMPY to other SV detection algorithms (GASVPro, DELLY, and Pindel). Two types of simulations were performed: one in which homozygous variants of diverse varieties were introduced at random positions throughout the reference genome, and another in which a heterogeneous tumor sample was simulated by mixing reads from a modified ‘abnormal’ human reference genome (containing 1000 Genomes deletions) and an unmodified ‘normal’ human reference genome in varying proportions. We also used publicly available Illumina sequencing data of the NA12878, NA12891, and NA12892 individuals. Two scenarios were considered: the original 50X coverage files, and 5X subsamples of the original data sets.
In the case of the homozygous simulation, we used SVsim to create new versions of the human reference genome (build 37) containing 2,500 simulated variants of each variety. For deletions, tandem duplications and inversions we randomly placed 2,500 non-overlapping variants ranging from 100 bp to 10,000 bp in size. To simulate translocations, we randomly inserted 2,500 non-overlapping inter-chromosomal regions of 1,000 bp, derived from random donor sites in the reference genome. Although we note that the true variant variety in this case is actually an insertion, the inserted segment exceeds the insert size of the sequencing library as well as the read length, and thus the breakpoints formed by such insertions accurately simulate a translocation. Each simulated genome was sampled to 40X, 20X, 10X, 5X, and 2X coverage.
To simulate a heterogeneous tumor sample, we combined simulated reads from both a modified and unmodified version of the human reference genome (build 37). The modified genome was created using SVsim, and included 5,516 non-overlapping deletions identified by the 1000 Genomes Project. Each simulation combined reads from both the modified and unmodified genomes in varying proportions. We refer to the proportion of reads that were derived from the modified genome as the SV allele frequency. The simulated SV allele frequencies were 5%, 10%, 20% and 50%, and the simulated coverages were 10X, 20X, 40X, and 80X. For example, in the simulation with 5% SV allele frequency and 10X coverage, the modified genome was sampled at 0.5X coverage and the unmodified genome was sampled at 9.5X coverage. The two sets of reads are then pooled into a single 10X coverage sample.
For all simulations, WGSIM was used to sample paired-end reads with a 150 bp read length, a 500 bp mean outer distance with a 50 bp standard deviation, and default error rate settings. Paired-end reads were mapped to the reference genome with NOVOALIGN version V2.07.08, using the random repeat reporting and allowing only one alignment per read. From the NOVOALIGN output, all soft-clipped (≥20 bp clipped length) and unmapped reads were realigned with the split-read aligner YAHA using a word length of 11 and a minimum match of 15. The NOVOALIGN output was used as input to DELLY, GASVPro, and Pindel, and both NOVOALIGN and YAHA output were used as input to LUMPY. In all algorithms, the minimum evidence threshold was four. For LUMPY, the tuning parameters min_non_overlap was set to 150, discordant_z was set to 4, back_distance was set to 20, weight was set to 1, and min_mapping_threshold was set to 1. For GASVPro, LIBRARY_SEPARATED was set to all, CUTOFF_LMINLMAX was set to SD = 4, WRITE_CONCORDANT was set to true, and WRITE_LOWQ was set to true. For DELLY, map-qual was set to 1, and the inc-map flag was set. DELLY paired-end (pe) and split-read (sr) calls were combined into a single paired-end and split-read (pe + sr) callset by taking the union of the two sets where the split-read call was retained when a call was common to both sets. For Pindel, minimum_support_for_event was set to 4, all chromosomes were considered, and report_interchromosomal_events was set to true. Since the output of DELLY and Pindel are single-base-resolution intervals, we increased the size of those intervals to match the mean interval size of a similar LUMPY call. Any call that had split-read support (Pindel and DELLY sr calls) was expanded to a 28 bp interval, and any call that had only paired-end support (DELLY pe calls) was expanded to a 282 bp interval. The intervals for GASVPro were not modified since, like LUMPY, it reports an interval whose size is based on the supporting evidence.
For the real data, LUMPY, GASVPro, DELLY, and Pindel considered Illumina sequencing of the NA12878 individual. The original sequencing files were at 50X coverage and were used in the 50X experiments. The 5X experiments considered sequencing files that were created by subsamples 10% of the original paired-end alignments. For all the tools, only deletion predictions on chromosomes 1 though X were considered. All sequencing samples were retrieved from the European Nucleotide Archives (submission ERA172924), and were previously aligned using BWA. Soft-clipped (≥20 bp clipped length) and unmapped reads were realigned with the split-read aligner YAHA using a word length of 11 and a minimum match of 15. In addition to the NA12878 data, the LUMPY trio results also considered sequencing data from that individual’s parents, NA12891 and NA12892. The LUMPY prior result considered all 1000 Genomes variant calls using the generic evidence module. The LUMPY read-depth results considered all deletion calls made by CNVnator [13] for the NA12878 genome with a window size of 100 for the 50X coverage experiment and 1,000 for the 5X coverage experiment. The single-base-resolution regions predicted by CNVnator were extended upstream and downstream by one-half the window size before being considered by LUMPY. Each tool was run with the same options that were used in the simulation experiments, except the minimum mapping quality for LUMPY, GASVPro, and Pindel was increased to 10. Since Pindel uses paired-end reads differently than the other tools, the default mapping quality of 20 was used. Each call required support of at least four. In the LUMPY trio result, a call had to have support of four from at least one individual (NA12878, NA12891, or NA12892) and at least one piece of support from NA12878. The weight for the 1000 Genomes variant calls in the LUMPY prior result was set to 2. In the LUMPY read-depth result, a call had to have support from read-depth and paired-end or split-read.
For the identification of the first truth sets, the Mills et al. study [12] validated 14,012 deletions in NA12878 across 11 independent laboratories. Once duplicate predictions were removed, the first truth set contained 3,376 non-overlapping deletions. The SV breakpoints predicted by each algorithm were compared to the known variants. A true positive was defined as a variant call where the two breakpoint intervals reported by a given SV detection tool both intersect with the two breakpoints introduced in the reference genome by simulation, and where the SV types (for example, deletion) match. To account for varied spatial resolution among the tools, we pad the simulated breakpoint coordinates with 50 bp of bidirectional slop, such that each simulated variant is represented by two 100 bp breakpoint intervals.
The second truth set consisted of the 4,095 deletions called by at least one tool in the 50X dataset or by the 1000 Genomes Project [12], and that were also validated by long-read sequencing from PacBio or Illumina Moleculo data. Overlapping calls were merged by retaining the minimum shared interval. PacBio reads (median length, 880 bp; mean depth, 30X) were aligned to GRCh37 using BWA Smith-Waterman (bwa bwasw -b 5 -q 2 -r 1 -z 20 -w 500). Illumina Moleculo reads (median length, 3,012 bp; mean depth, 30X) were aligned to GRCh37 with BWA-MEM (bwa mem -t 1 -B 4 -O 6 -E 1 -M). These BAM files are available from the 1000 Genomes Project repository at [24, 25], respectively.
Deletion calls were considered to be validated if supported by at least two non-duplicate PacBio split reads or at least one Moleculo split read. A supporting long-read is defined by the following criteria: 1) the long-read is split by the aligner such that at least 20 bp aligned to the flanking sequence on either side of the reported breakpoint; 2) the left and right intervals of the split long-read both intersect with the respective left and right intervals of reported breakpoint, allowing 5 bp of slop space on either side of the long-read breakpoint to account for microhomology or inserted sequence at the novel adjacency; 3) the strand orientation of the split long-read is consistent with the strand orientation at the reported breakpoint; 4) in the case of PacBio, where at least two non-duplicate long-reads are required to validate a call, the two long-reads must not only fill criteria 1 to 3, but must also overlap with each other within 5 bp of slop on either side.
We performed Monte Carlo shuffling of the callsets to estimate the number of spurious long-read validations due to random chance. We shuffled each callset 100 times using BEDTools shuffle while retaining identical interval sizes, variant spans, and number of variants for each iteration [23]. Regions of the genome that were excluded from the original analyses were also excluded from the shuffling (see ‘Excluded regions’ section below). We then performed long-read validation as described above to each shuffled callset, with less than 3% of shuffled calls validating (Table 1).
Excluded regions
For structural variation detection with LUMPY and other tools, we excluded regions of the reference genome with consistently high sequencing depth over multiple individuals, since high depth is indicative of artifacts in the reference assembly. To define these regions, we first aligned the 17-member CEPH 1463 pedigree to the GRCh37 human reference genome using BWA-MEM 0.7.5a-r405 (bwa mem -t 32 -M -p) [26]. Each member of the pedigree was whole-genome sequenced from a PCR-free library to 50X coverage with 101 bp reads, and is publicly available through the Illumina Platinum Genomes project [27]. We used BEDTools v2.17.0 to generate a BED graph of aggregate per-base coverage from all 17 individuals [23]. The mode and standard deviation of the aggregate depth were calculated separately for the autosomes and sex chromosomes. Any regions with depth exceeding 2 * mode + 3 standard deviations were excluded from our analyses. (We chose to double the mode to allow inclusion of duplicated copy number variant regions.) Finally, the mitochondrial chromosome was excluded entirely. A BED graph of the excluded regions can be obtained at [28].