An integrative probabilistic model for identification of structural variation in sequencing data
 Suzanne S Sindi^{1, 2}Email author,
 Selim Önal^{3},
 Luke C Peng^{3},
 HsinTa Wu^{1, 3} and
 Benjamin J Raphael^{1, 3}Email author
DOI: 10.1186/gb2012133r22
© Sindi et al.; licensee BioMed Central Ltd. 2012
Received: 26 October 2011
Accepted: 27 March 2012
Published: 27 March 2012
Abstract
Pairedend sequencing is a common approach for identifying structural variation (SV) in genomes. Discrepancies between the observed and expected alignments indicate potential SVs. Most SV detection algorithms use only one of the possible signals and ignore reads with multiple alignments. This results in reduced sensitivity to detect SVs, especially in repetitive regions. We introduce GASVPro, an algorithm combining both paired read and read depth signals into a probabilistic model that can analyze multiple alignments of reads. GASVPro outperforms existing methods with a 50 to 90% improvement in specificity on deletions and a 50% improvement on inversions. GASVPro is available at http://compbio.cs.brown.edu/software.
Background
Structural variation, including duplications, deletions and rearrangements of large blocks of DNA sequence, is now recognized as an important contributor to the genetic differences between individual humans and the somatic differences between normal and cancer cells [1–7]. It is also prevalent in other organisms, including many model organisms [8–10]. Knowledge about the extent of structural variation has increased rapidly in the past few years with improvements in DNA microarray and sequencing technologies. In particular, sequencing approaches identify all types of structural variation, including copy number variants and balanced rearrangements like inversions and reciprocal translocations [11–13]. While next generation sequencing technologies are now widely used to assess both genetic variation in normal genomes [14–21] and somatic structural variation in cancer genomes [4, 7, 22, 23], the short reads and short inserts of these technologies make the identification of many structural variants (SVs) nontrivial. Since de novo assembly of mammalian genomes from nextgeneration sequencing technologies remains a challenge [24, 25], many SVs are identified using a resequencing approach where sequence reads from an individual genome are aligned to a reference human genome assembly. The resequencing approach thus leverages the extensive finishing efforts employed in the generation of the human reference genome.
Many strategies have been employed to predict structural variation using the resequencing approach [11–13]. First, read depth (RD), the density of mapped reads to an interval of the reference genome, has been used successfully to identify copy number variants [26–31]. However, RD is unable to detect copy neutral variants such as inversions and balanced translocations. Second, paired read (PR) approaches have been used to identify all types of SVs, both copy number variants and copyneutral variants [16, 28, 32–35]. These approaches analyze the collection of PR mappings and find clusters of aberrantly mapped PRs that suggest SVs distinguishing the two genomes. Third, split read (SR) methods have been employed to directly identify sequence reads that contain breakpoints of SVs [36]. However, the short reads produced by current secondgeneration sequencing technologies have limited the use of SRs for SV detection; for example, Ye et al. [36] rely on anchoring the search for SRs using a fulllength alignment of one read from a PR.
While there has been extensive development of methods for structural variation prediction, there remains room for improvement. First, most existing methods for SV prediction use only one of the possible signals (RD, PR or SR). A few methods employ a second signal in later postprocessing of predictions. Such a post hoc approach may improve specificity, but it does not increase sensitivity by combining multiple, weak signals. Although a few recent methods have begun to consider both RD and PR signals [37, 38], these methods have focused only on copy number variants. Second, most methods for structural variation prediction used only reads with unique highconfidence alignments to the reference genome, ignoring reads with lower quality alignments or multiple possible alignments [32, 33, 39]. As such, these methods have very low sensitivity to detect repeatassociated rearrangements. Since many SVs are associated with repetitive sequences, including segmental duplications [40], and mobile elements [2], a substantial improvement in sensitivity may be possible by including reads with multiple alignments. More recently, a few methods have been introduced that consider multiple or lower quality alignments of reads relying on various criteria to select among possible candidate alignments [34, 41, 42]. While these methods may predict more true variants, this increased sensitivity often comes at the cost of reduced specificity as these methods produce many false positive predictions. Thus, there is a need for additional improvements in sensitivity and specificity for SV prediction. For example, the pilot study of the 1000 Genomes Project did not report inversion SVs [43] even though such variants have been previously shown to be abundant in normal genomes [16].
Here, we introduce GASVPro, an algorithm for SV identification that integrates both RD and PR signals into a unified probabilistic model. We find that the likelihood of a predicted variant under our probabilistic model provides a better criteria for prioritizing predictions than the number of supporting PRs, a common heuristic for ranking predictions. In addition to combining both RD and PR signals, GASVPro explicitly reports uncertainty in each predicted breakpoint, which is useful information for identification of SRs or designing assays for experimental validation. This breakpoint localization is obtained using a computational geometric algorithm, Geometric Analysis of Structural Variants (GASV) [33], that represents all possible breakpoints, or breakends, that are consistent with the aligned reads as a polygon in twodimensional genome space. By carefully clustering only those PRs that genuinely support the same breakends, GASV avoids overcollapsing fragments into the same SV prediction, a problem demonstrated in other methods (see Results) and reports coordinates consistent with the true variant points.
Moreover, GASVPro exploits this explicit representation of the breakends to incorporate a subtle signal of highly localized drops in coverage at the variant breakends. We call this signal breakend read depth (beRD), and it occurs for both copy number variants as well as copyneutral SVs. Using this signal, GASVPro predicts whether a generic breakend is a homozygous or a heterozygous variant, even when relatively few PRs support the variant. Thus, GASVPro is the first method to utilize RD to predict generic SVs, including inversions and reciprocal translocations, and not just copy number variants. For deletions, GASVPro uses the stronger signal of RD across the entire deleted interval, and this combination of PR and RD leads to highly sensitive and specific deletion predictions. GASVPro also considers reads with multiple possible alignments, using a Markov chain Monte Carlo (MCMC) approach to sample over the space of possible mappings for each pairedend sequenced fragment. In this way, GASVPro does not select only a single 'best' alignment for each fragment, but rather computes a posterior probability of each variant over all possible alignments of each read.
We demonstrate the advantages of GASVPro on simulated data and Illumina sequencing data from two sequenced human genomes, NA18507 [14] and NA12878 [44] (1000 Genomes Project). We compare predictions to known variants with a novel metric, the 'double uncertainty' metric, developed to allow for unambiguous comparisons when there is uncertainty in the breakpoint locations. For deletions, GASVPro outperformed competing methods by attaining equal or greater sensitivity while making at least 50% and up to 90% fewer predictions. In addition, on a subset of deletions with known ploidy, GASVPro successfully classifies over 85% as homozygous or heterozygous. For inversions, GASVPro is up to twice as specific at maximum sensitivity than existing methods. In particular, because of GASVPro's use of the beRD signal, it is the only method to attain optimal specificity and sensitivity on our simulated data set. In other cases, GASVPro's use of the beRD signal at inversion breakpoints results in equal or better specificity than competing methods despite considering a larger set of possible alignments.
Results
A probabilistic model of structural variant breakends
Identifying structural variants from pairedread sequencing data
Although the length of each individual fragment is generally unknown, the size selection that is performed during the construction of the sequencing library yields an approximate distribution of fragment lengths. We assume that fragment lengths are between ${L}_{\text{min}}$and ${L}_{\text{max}}$; these values can be derived from the empirical distribution of mapped fragments. Fragments with both ends mapping uniquely to the reference with 'convergent orientation' and with mapped distance in the range $\left[{L}_{\text{min}},{L}_{\text{max}}\right]$ are called concordant fragments because their mapping indicates concordance (no SV) between the test and reference genome. (We note that the definition of convergent orientation depends on sequencing technology. For example, with Illumina pairedend data, the reads are obtained from opposite DNA strands and thus convergent orientation is defined as reads with opposite orientation, with the left read forward and the right reversed (+/). With SOLiD pairedend data, reads are obtained from the same DNA strand and thus should have the same orientation. In this case, convergent orientation is defined as reads with positive orientation when the first sequenced has smallest mapped coordinate (+/+) and negative orientation when the first sequenced read has largest mapped coordinate (/).) The remaining discordant fragments indicate potential SVs or sequencing/alignment errors.
Although researchers typically focus on common classes of SVs, such as deletions and inversions, more generally a SV corresponds to a rearrangement creating one or more novel adjacencies between pairs of locations in the reference genome. That is, two locations a and b, which were originally separated in the reference genome, are now adjacent in the test genome. For example, a deletion creates one novel adjacency while an inversion creates two (Figure 1a). Following the terminology of VCF (Variant Call Format) version 4.1 [45], we refer to locations a and b individually as breakends and as mated breakends when paired at either end of a SV created by a rearrangement.
We define a predicted SV $V$ as a pair $V=\left(F,B\right)$ where $F=\left\{{f}_{1},{f}_{2},\dots ,{f}_{k}\right\}$ is a set of k discordant fragments containing the novel adjacency, and $B$ is the breakend polygon, a region describing all possible mated breakends (a,b) determined by the discordant fragment mappings (Figure 1b). The breakend polygon is defined by the positions of the mapped ends of each fragment and the minimum (${L}_{\text{min}}$) and maximum (${L}_{\text{min}}$) length of fragments. If $V$ is a true SV, then there is an ordered pair $\left(a,b\right)\in B$ corresponding to a novel adjacency created by the rearrangement. That is, there is a $\left(a,b\right)\in B$ such that a and b are the breakends of the SV in the reference genome. (See Materials and methods and [33] for more information on how the breakend polygon is defined.)
Discordant and concordant fragments provide complementary information about a variant. Discordant fragments define the breakend polygon $B$ while concordant fragments (or lack thereof) provide additional information about the precise location of the breakends within the polygon. If a and b represent mated breakends created by a deletion, inversion or other rearrangement in the reference genome, we should see a decrease in the coverage by concordant fragments at these points. The type of signal we expect to see depends on the type of SV present (Figure 1c). For a deletion, we expect a drop in the coverage of concordant fragments throughout the genomic interval [a,b]. This is commonly known as the RD signal and has previously been exploited to reveal copy number variants [38]. For an inversion or reciprocal translocation, we expect a sharp drop in coverage in the regions immediately surrounding a and b as many of the fragments containing a or b in the test genome are discordant when mapped to the reference. However, there is no drop in coverage 'inside' the inversion or translocation. We define this highly local drop in coverage as the breakend read depth (beRD) signal.
We develop a probabilistic model based upon the mapped locations of all fragments, concordant and discordant, in the test genome. By doing so we integrate both the presence of discordant fragments (PR signal) and concordant coverage (RD signal) into a single probabilistic method, GASVPro. In addition, GASVPro directly estimates the location of the breakends a and b for a SV $V$ and classifies the prediction as homozygous or heterozygous. We first present our model in the restricted context where every fragment has a unique mapping to the reference genome. Then, we extend our model to fragments with multiple alignments by using an MCMC approach to sample over the possible mappings for each fragment.
Probability of a structural variant
We determine the probability of a potential SV $V$ by considering the number, k, of discordant fragments as well as the beRD, the depth of coverage at each candidate breakend. By doing so, we directly estimate the novel adjacency created by $V$ by considering all possible mated breakends consistent with the discordant fragments. Since our formulation depends only on the process of sampling fragments from the test genome, and not on the class of variant, our probabilistic model is applicable to generic rearrangements.
We follow the LangerWaterman model [46] of sequencing and assume that the starting positions of the fragments are sampled from the test genome uniformly so that the left positions of fragments follow a Poisson process with parameter λ. If all sequenced fragments had fixed length L, the number of fragments containing an arbitrary point p from the test genome, called the coverage of p, would simply be the number of fragments sampled with left endpoint in the interval [p  L + 1,p]. According to the Poisson process, the coverage of a point p follows a Poisson distribution with mean λL. In general, we do not know the size of any particular fragment and thus we use the average fragment length, ${L}_{avg}$, and model the coverage of p by a Poisson distribution with mean ${\lambda}_{c}=\lambda {L}_{avg}$.
If p is sufficiently far from all sites of structural variation, we expect all sequenced fragments containing p to be concordant with respect to the reference genome. However, if p is the breakend of an SV, coverage will be reduced, as there will be fewer concordant fragments containing the breakend. In particular, the distribution of the number of fragments containing a breakend p is approximated by a Poisson distribution with mean ${\lambda}_{d}=\left({L}_{avg}2\times readlength\right)\lambda $ (see Materials and methods and Figure A1 in Additional file 1).
Consider a candidate SV $V=\left(F,B\right)$. If $V$ is a true SV, then there is an ordered pair, $\left(a,b\right)\in B$, corresponding to a novel adjacency in the test genome created by the rearrangement. As such, the number of concordant fragments containing a or b should be lower than expected for an arbitrary point in the reference genome. Alternatively, if $V$ is not a true SV, then the coverage of points a and b by concordant fragments will follow the Poisson distribution with mean ${\lambda}_{c}$. We next describe the probability of a variant $V$ by conditioning on the choice of breakends and number of copies of the novel adjacency (a,b) in the test genome. Specifically, for a candidate novel adjacency $\left(a,b\right)\in B$, let C(a,b) = {0,1,2} indicate the number of copies of the novel adjacency in the test genome. (Here we are considering only copyneutral or copy number loss events (for example, deletions) and not duplications. The extension to the latter case is future work.) We consider three events: (1) a and b are breakends of a homozygous SV, (C(a,b) = 2); (2) a and b are breakends of a heterozygous SV (C(a,b) = 1); (3) a and b are not SV breakends (C(a,b) = 0).
where by C(B) = 2 we mean the breakpoint region $B$ defines a homozygous SV.
In practice we report variants $V$ where $\text{log}\Lambda \left(V\right)$ exceeds a prescribed threshold. In addition to assigning a likelihood to a SV, our formulation determines a maximum likelihood estimate for the novel adjacency (a,b) and if a variant is homozygous or heterozygous.
Probability of a deletion
There are several additional factors we consider when using our model on sequencing data. First, there are factors other than SVs that can impact the coverage of concordant fragments over an interval. As such, to adjust for differences in the ability to map reads throughout the genome, in our model for deletions we scale the number of concordant fragments by the local mapability of the putative deleted interval. Second, since in this study we are primarily interested in inversion and deletion SVs, in practice we utilize a heuristic to eliminate regions of the genome with extremely high coverage by concordant fragments. Further information on these practical details are given in the Materials and methods section.
Selecting a mapping for each fragment
In the previous sections, we assumed that there was a single highquality alignment for all reads and therefore one highquality alignment for each fragment. However, some reads may have multiple highquality alignments due to repetitive sequences in the reference or sequencing errors in the reads. Selecting one of the possible alignments for each read from the pair defines an alignment of the fragment. Since each fragment represents a unique contiguous region of the test genome, at most one alignment is the correct one and we refer to this as the mapping of the fragment.
Selecting a mapping for each fragment defines the set of concordant and discordant fragments and an associated set of SVs that could be evaluated using the model in the previous section. Although any such selection defines a fragment configuration consistent with the data, each selection has a different probability. Thus, rather than selecting a mapping for each fragment in advance, we consider the space of all possible mappings for all fragments and use a MCMC approach to sample from the space of possible mappings in proportion to their probability.
With these distinctions, we now revisit our notions of 'concordant' and 'discordant' from above. A concordant fragment is a fragment whose unique mapping is concordant. That is, both reads have a single highquality alignment to the reference and the alignments are concordant with respect to the sequencing process. A discordant fragment is a fragment whose entire set of alignments are discordant. (Note, this formulation ignores any fragment with multiple alignments, at least one of which is concordant.)
Let $F=\left\{{f}_{1},{f}_{2},\dots ,{f}_{m}\right\}$ be the set of all discordant fragments. Suppose that the two reads from a fragment $f\in F$ map to s and t locations, respectively. An alignment of a fragment corresponds to selecting an alignment for each read, and thus we define $A\left(f\right)=\left\{\left({x}_{i},{y}_{j}\right)\right\}$ where i = 1,2,...s and j = 1,2,...t as the set of all alignments for a fragment $f$, only one of which may be the true mapping. Let $A=\left\{\left(A({f}_{1}\right),A\left({f}_{2}\right),\dots ,A\left({f}_{m}\right)\right\}$ be the set of alignments for all fragments.
 1.
${m}_{ij}\le {a}_{ij}$; that is, ${m}_{ij}=1$ only if ${a}_{ij}=1$,
 2.
$\sum _{i}{m}_{ij}\le 1$for all i; that is, each row in $M$ has at most one nonzero entry.
Finally, as before, the probability of variants depends on the associated copy number, $C\left(B\right)$, of a variant. We explicitly distinguish between homozygous and heterozygous SVs by including a binary vector $C=\left({C}_{1},{C}_{2},\dots ,{C}_{n}\right)$ where ${C}_{j}=C\left({B}_{j}\right)$. If any discordant fragments are assigned to ${V}_{j}$, we require ${C}_{j}>0$. Together $C$ and $M$ define the differences between the test and reference genome.
Probability of a mapping matrix
Our data D consists of a set $F$ of discordant fragments, a set $A$ of alignments, a set $V$ of possible SVs, and the positions of all concordant mappings in the genome. We next generalize our probability model from the previous section to the probability of a mapping matrix based on the generation of the data D from a given genome.
Note that $M$ specifies a unique mapping for each fragment supporting a variant; thus, one solution would be to consider $P\left(MA\right)$ over all possible mapping matrices. However, because the number of possible mapping matrices $M$ grows exponentially with the number of fragments, we use a MCMC procedure to efficiently sample from the space possible mapping matrices $M$ (Figure 2; Section A2 and Figures A2, A3 in Additional file 1). Our MCMC procedure converges to the unique stationary distribution given in Equation 10.
Deriving the predicted structural variants
Our MCMC procedure samples mapping matrices in proportion to their probability $P\left(MA\right)$; however, our ultimate goal is to report a final set of SV predictions. One approach to SV prediction is to select a single $M$ according to some criteria; for example, the $M$ that minimizes the total number of SVs predicted. This approach is used by a number of SV detection methods that consider multiple assignments for fragments, such as VariationHunter [42] and Hydra [34]. We instead predict SVs by considering the entire space of mapping matrices $M$ according to $P\left(MA\right)$ as described in the Materials and methods. In practice, we found only minor differences in the receiver operating characteristic (ROC) curves for the different reporting methods we considered (Figure A4 in Additional file 1).
Results on sequenced data
We applied GASVPro to simulated pairedend data on the Venter Genome (HuRef) [47], as well as two previously sequenced human genomes, NA18507 [14] and a European individual, NA12878, from the 1000 Genomes study [44]. We also compared results from GASVPro to two previously published methods, Hydra [34] and BreakDancer [32], as well as the original GASV. (We also performed some comparisons with VariationHunter [42]. Since results were strikingly similar to Hydra, as previously noted in [34], and we were unable to process the full datasets for NA12878 and NA18507 using the current publicly available distribution of VariantionHunter, we present only the results for Hydra.) Finally, we compare to CNVer, a method combining RD and PR to detect copy number variants [38].
These methods, and other similar SV prediction programs, typically employ several steps, including alignment of reads to the reference genome, predicting SVs from alignments, postprocessing predictions (for example, pruning a set of predicted SVs to remove redundancy) and comparison to known variants. In an effort to directly compare the performance of the SV prediction algorithms, rather than the specific pre and postprocessing steps, we standardized the alignment, postprocessing and comparison steps. In particular, we used the same read alignments for all methods. (Note this involved modifying the source code for Breakdancer to consider only a userspecified set of discordant fragments.) For GASVPro and Hydra, the methods that allow fragments to have multiple possible alignments, we realigned reads to the reference genome with Novoalign [48] and distinguish results on the full set of alignments (GASVPro and Hydra) from results on only the highquality unique alignments (GASVProHQ or HydraHQ). Before comparing results, redundant predictions were removed with the same pruning procedure for each method (see Materials and methods).
We compare predictions to a known set of variants using the double uncertainty metric, a novel metric developed to represent uncertainties in the breakpoint locations for both the predictions and the known variants (see Materials and methods; Figures A5 and A6 in Additional file 1). We use a ROC type analysis to show the number of novel predictions and true positives for each method as a function of the number of supporting fragments (Hydra, Breakdancer, GASV), the predicted depth of coverage (CNVer) or the likelihood of a predicted variant (GASVPro). Note that in the results shown below, GASVProHQ and GASV consider the same set of high quality unique alignments and utilize the same clustering algorithm. As such, both methods have the same maximum sensitivity, but GASVProHQ has higher specificity due to our probabilistic model. On the other hand, GASVPro uses a larger set of alignments, including lower quality and ambiguous alignments, and as such GASVPro can achieve higher sensitivity than GASVProHQ and GASV.
Simulated data
We first test GASVPro on simulated data generated from the Venter genome [47]. We produced a synthetic dataset by inserting the list of annotated SVs on chromosome 17 of Venter's genome (8,801 deletions, 8,572 insertions and 4 inversions) into the human reference genome (hg18). These SVs varied in length from one to several thousands of bases. We simulated 100× coverage of this chromosome by 50bp PRs with a mean fragment length of 200 bp and a standard deviation of 20 bp using the SAMtools wgsim program [49]. For all methods, the resulting sets of predictions were pruned and compared to known variants with the double uncertainty metric with reference uncertainty set to 0 (see Materials and methods).
All methods had greater sensitivity than CNVer, which made 218 predictions but detected only 3 deletions with the double uncertainty metric. The lower sensitivity of CNVer can be explained in part by internal filtering: the published code of CNVer reports only copynumber events that are larger then 1 kb, which eliminates all but 9 out of 124 simulated deletions. In addition, the reported coordinates from CNVer lie farther from true breakends, although the predicted deletion interval typically contains the true deletion. We note that 16 of 218 CNVer predictions completely contained a true deletion, including 5 of 9 deletions larger than 1 kb. Thus, some of the difficulties with CNVer result from how it merges potential copynumber variants before reporting a final set of predictions (Section A3 in Additional file 1).
We next discuss GASV compared with Breakdancer, Hydra and HydraHQ. Before removing redundant predictions by pruning, GASV predicts 648 deletions with at least one supporting fragment, which detects 60 Venter deletions. Thus, the maximum sensitivity is 48%. A common method to increase specificity is to increase the minimum number of supporting fragments for a prediction. As discussed previously, however, many predictions from SV methods overlap. Removing these overlapping predictions (see Materials and methods) improves performance more than increasing the number of supporting fragments. For GASV, restricting the set of predictions to those with at least two supporting fragments results in 244 predictions but detects only 46 deletions. In comparison, pruning the 648 predicted deletions with at least one fragment retains 347 predictions that detect 57 true deletions. In comparison, HydraHQ and Hydra had slightly lower sensitivity, predicting only 44 deletions at maximum sensitivity, but had similar overall performance to GASV. Breakdancer had similar performance throughout with slightly higher sensitivity than Hydra/HydraHQ and GASV and equal specificity.
The integrative probabilistic model used by GASVPro greatly improves specificity. Analyzing only high quality unique mappings, GASVProHQ predicts only 64 deletions with positive log likelihood, $\text{log}\Lambda \left(V\right)$> 0, which include 50 true deletions. Note that these 64 predictions are a subset of those predicted by GASV. Thus, compared to GASV, GASVProHQ has a substantially lower false positive rate at highest sensitivity. The improved specificity of GASVProHQ over GASV is evidence that our likelihood statistic is a better predictor of true variants than the number of supporting fragments (see also Figure A7 in Additional file 1 for a comparison). Including lowquality and ambiguous alignments increases the space of possible variants substantially without significantly increasing the number of detectable deletions. That is, the full set of possible alignments suggest 1,051 potential deletion events that overlap, at most, 61 out of 124 true deletions. However, GASVPro has similar performance to GASVProHQ throughout. This suggests that the MCMC sampling method is able to successfully eliminate many false positive predictions even with a much larger number of initially possible variants.
Comparison of performance of methods with respect to identifying the four inversions on Venter chromosome 17
Method  Minimum number to detect 3  Minimum number to detect 4 

High quality alignments  
Breakdancer  3  NA 
GASV  3  NA 
GASVProHQ  3  NA 
HydraHQ  4  NA 
All alignments  
GASVPro  3  4 
Hydra  5  10 
Sequencing data
NA12878 deletions
We next compared the methods on Illumina sequencing data of a CEU individual, NA12878, from the 1000 Genomes Project. There are two sets of validated SVs available for this individual. First, deletions and inversions were validated from a previously published fosmid study [16] and deletions were separately validated as part of the 1000 Genomes Project [44]. In addition, the validated deletions from the 1000 Genomes data set were also annotated as homozygous or heterozygous.
Individual NA12878 was sequenced in both Pilot 1 (≈4× coverage) and Pilot 2 (≈40× coverage) of the 1000 Genomes Project. For Pilot 1, a single library was sequenced with a read length of 37 bp and an average fragment size of 230 bp. For Pilot 2, multiple libraries were sequenced with read lengths from 37 to 52 bp and an average fragment size of 150 to 350 bp. Thus, we analyzed both datasets to examine the effect of different coverage on the ability of methods to predict SVs.
We first consider the results on the higher coverage Pilot 2 data (Figure 5a,b). Four curves represent methods run on only uniquely mapped fragments: GASV, GASVProHQ, HydraHQ and Breakdancer. Breakdancer has slightly improved performance compared to HydraHQ and GASV, attaining equal sensitivity with up to 200 fewer predictions throughout. However, this may be an artifact of Breakdancer's aggressive clustering procedure (discussed in Section A3 of Additional file 1). GASVProHQ has the best overall performance with over a 85% reduction in novel predictions at highest sensitivity compared to Breakdancer, GASV and Hydra.
Of the three methods that use all alignments (GASVPro, GASVProMin2 and Hydra), GASVPro has the highest sensitivity, detecting 119 of 139 true deletions with 19,715 novel predictions on the set of validated deletions from the 1000 Genomes study. By increasing the minimum likelihood threshold, and thus reducing the number of predictions, GASVPro predicts 114 of 139 true deletions with only 907 novel predictions; this represents a 95% decrease in the number of novel predictions with only a 3% decrease in true positives. GASVProMin2 has higher specificity than GASVPro, making around 200 fewer predictions than GASVPro at equal sensitivity. Notice the addition of ambiguous mappings alone does not greatly improve performance as the behavior of Hydra and HydraHQ is very similar, with Hydra being slightly more sensitive. Thus, regardless of whether unique or ambiguous fragments are used, combining both read depth and PRs with our probabilistic model (GASVProHQ, GASVProMin2 or GASVPro) results in significant improvements to sensitivity and specificity.
In addition to improving the ability to successfully predict true deletions, our probabilistic model also accurately classifies these variants as homozygous or heterozygous. GASVProHQ correctly classified 104 out of the 119 known deletions with highest likelihood as homozygous or heterozygous according to the annotations in the 1000 Genomes data set. Remarkably, all 28 homozygous variants in this set were correctly classified even though some had fewer supporting discordant fragments than many correctly classified heterozygous variants.
On Pilot 1 data, we also compare the performance of CNVer, which uses both discordant mappings and read depth to predict copy number variants. In contrast to the simulated data set above, all known deletions analyzed here are larger than 1 kb and thus CNVer attains similar sensitivity to PR methods, like Hydra, GASV and Breakdancer. However, the number of discordant fragments per prediction, the criteria used to rank results for PR methods, provides a better trade off between true and false positive predictions than the estimated depth of coverage, which we use to rank CNVer predictions.
Even with the reduced coverage, compared to Pilot 2, the benefits of our probabilistic models are evident, and GASVPro outperforms all competing methods. GASVProHQ and GASVProMin2 have improved performance compared to Hydra, HydraHQ, Breakdancer and GASV. Note that the specificity for GASVPro drops below all other methods at the highest likelihood threshold (Figure 5c,d). This drop in performance is due to many predictions of GASVPro consisting of only a single discordant fragment mapping to a large region with very few concordant fragments. While it is possible these are true variants, it is more likely that most of them are false positives and, as such, eliminating these predictions (GASVProMin2) restores performance to that obtained by GASVProHQ. On this dataset, GASVProHQ correctly classifies 84 out of the 102 known deletions with highest likelihood as homozygous or heterozygous. As in the Pilot 2 data set, all 26 of 102 homozygous deletions were correctly classified, 3 of which have fewer than 3 supporting fragments.
We next evaluate the effect of increased coverage on each method by comparing the results from Pilot 2 (Figure 5a,b) with Pilot 1 (Figure 5c,d). For the methods utilizing only discordant mappings (Hydra, HydraHQ, GASV, and Breakdancer) performance is similar between Pilot 1 and Pilot 2 data. In contrast, performance of our probabilistic methods, GASVProHQ, GASVProMin2 and GASVPro, increases substantially with coverage. The maximum sensitivity of GASV Pro and GASVProHQ increases by about 20% on both data sets, from 97 to 119 and 96 to 114, respectively, for the fosmid validated set and 100 to 119 and 103 to 120 on the 1000 Genomes validated set. This improved performance results from integration of both discordant fragments (PR signal) and concordant fragments (RD signal). Increasing the sequencing coverage increases both discordant and concordant mappings throughout the genome. However, higher discordant coverage contributes to both true and false predictions, and thus methods that analyze only discordant fragments are less able to leverage the increased coverage to distinguish true from false predictions. In contrast, increased coverage by concordant fragments leads to sharper delineations between normal and deleted regions in the genome. Although it is possible that CNVer results would have also improved with the higher coverage data, a comparison was not possible as multiple libraries are not supported in the published CNVer implementation.
Finally, we remark on a practical difficulty in assessing the performance of methods on sequenced genomes. As indicated above, the complete set of SVs on these genomes is unknown. Thus, it is possible that predictions classified as 'novel predictions' could in fact be true, but yet unknown, variants. In addition, the set of validated variants that we use as true positives may not be representative of all SVs in these genomes. For example, we attained significant improvements in specificity for both inversions and deletions on NA12878 when we used a 'homozygousonly" model in GASVPro (Figure A8 in Additional file 1). This suggests that the set of known variants may underrepresent heterozygous deletions and inversions, which are presumably more difficult to detect and validate.
NA18507 deletions
As above, employing our integrative probabilistic model for discordant fragments with unique mappings, GASVProHQ greatly improves performance compared to the original GASV. Using GASV alone, at maximum sensitivity we predict 55 of 93 deletions from the fosmid study with 2,240 novel predictions. In comparison, GASVProHQ successfully predicts the same 55 of 93 deletions with only 573 novel predictions. Similarly, for the 1000 Genomes deletions, at maximum sensitivity GASV predicts 95 of 118 deletions with 2,201 novel predictions while GASVProHQ attains the same sensitivity with only 1,372 novel predictions. Thus, using our probabilistic framework provides a twofold increase in specificity at equal sensitivity. On the fosmid validated deletions, CNVer attains higher sensitivity than other methods and has overall higher specificity than GASV or Hydra at equal sensitivity (Figure 6a). However, this performance is not maintained on both sets of validated deletions (Figure 6b).
Overall, methods that analyze only unique mappings (Breakdancer, GASV, GASVProHQ, HydraHQ) outperformed those considering lower quality and ambiguous mappings. For this data set, including the full set of mappings (GASVPro and Hydra) greatly increases the number of predictions while, at best, modestly increasing the number of validated deletions that are correctly predicted. Indeed, running Hydra on only the unique mappings yields an 'ROC curve' similar to GASV alone. Although both GASVPro and original GASV match 70 of 93 variants from the fosmid study and 95 of 118 from 1000 Genomes Project, this is at the expense of predicting thousands of novel deletions on each data set, 5,535 and 21,523, respectively. We attain improved performance on the ambiguous data set by considering predictions with more than one supporting fragment, GASVProMin2; however, these results are still worse than GASV alone.
The decreased performance of GASVPro and Hydra on this data set, compared to NA12878 above, cannot be solely attributed to the read length as in both cases the sequenced reads were, on average, the same length, 37 bp. The differences seem likely due to difficulties in mapping uniquely to the reference. For NA12878, 31% of all mappings were unique while for NA18507, less than 1.5% of mappings were. In addition, there were more discordant fragments considered for NA18507, but fewer validated SVs. This combination may explain the substantial increase in 'novel' predictions, as compared to known deletions.
Inversions
In comparison to deletions, inversion SVs are more difficult to analyze for three reasons. First, there is no difference in read depth across the inversion, but only a change in read depth at the break ends (break end read depth). Second, there are few known inversion variants available for testing. Indeed, the 1000 Genomes SV paper [43] reports thousands of deletions but no inversions. Third, inversion SVs are known to have breakpoints with segmental duplications or other repetitive sequences, and aligning reads to these regions is complicated.
We compared predicted inversions for all methods to a set of validated inversions from a previous fosmid study [16] (see Materials and methods). The number of validated inversions is significantly smaller than the number of validated deletions; 23 inversions were validated in NA12878 and 10 in NA18507. All methods were far less sensitive in identifying inversions than deletions; maximum sensitivity over all methods was less than 20% on NA12878 and 70% on NA18507.
Inversion prediction in individual NA12878
Method  Minimum number to detect 1  Minimum number to detect 2  Minimum number to detect 3 

High quality alignments  
Breakdancer  47 (37)  80 (221)  NA (NA) 
GASV  34 (158)  76 (298)  5,028 (NA) 
GASVProHQ  11 (20)  116 (102)  206 (346) 
HydraHQ  61 (139)  108 (246)  284 (NA) 
All alignments  
GASVPro  28 (59)  394 (286)  550 (504) 
GASVProMin2  28 (59)  160 (334)  NA (NA) 
Hydra  159 (258)  NA (470)  NA (NA) 
Inversion prediction in individual NA18507
Method  Minimum number to detect 2  Minimum number to detect 3 

High quality alignments  
Breakdancer  138  NA 
GASV  72  NA 
GASVProHQ  61  NA 
HydraHQ  43  NA 
All alignments  
GASVPro  141  286 
GASVProMin2  141  286 
Hydra  551  752 
Discussion
We introduce GASVPro, a method for SV detection that: (1) integrates both the RD signal (including the more localized beRD) and PR signal of structural variation into a single probabilistic model; (2) analyzes multiple possible read alignments using an MCMC procedure; and (3) explicitly defines uncertainty in the breakends of a variant. GASVPro is the first method to utilize a probabilistic formulation to identify generic SVs and not only copy number variants. We demonstrated that, compared to the previously published methods Breakdancer, Hydra and GASV, GASVPro has significantly higher specificity at equal or greater sensitivity in detecting known variants. Finally, our method is easily generalized to include additional signals predictive of variants.
The increased specificity and sensitivity of GASVPro demonstrates the benefit of integrating multiple signals of structural variation into a probabilistic model. In particular, read depth provides a strong signal to detect deletions and classify them as homozygous or heterozygous. As previously noted, GASVProHQ successfully classifies 104 of 119 deletions with known ploidy on NA12878. In contrast, methods that consider only discordant fragments, including Breakdancer, GASV and Hydra, yield more false positive predictions than GASVPro. In addition, we show that beRD is useful in increasing specificity for predicting copyneutral inversions. Finally, our likelihood formulation provides more useful criteria for prioritizing predictions than the commonly used heuristic of the number of supporting fragments. We anticipate that including SRs will also aid in eliminating false positive predictions. In particular, the breakend polygon and beRD signal will suggest the sequence content of SRs. Thus, it will be possible to examine the data for SRs based on their sequence without exhaustive realignments to the reference.
The results of GASVPro demonstrate improved sensitivity when including reads with multiple possible alignments to the reference genome. However, this gain in sensitivity comes at a cost of reduced specificity as GASVPro makes many more predictions. On its surface, this is not too surprising as the inclusion of the additional lower quality alignments greatly increases the space of possible variants. The MCMC algorithm used in GASVPro is able to overcome the added ambiguity in part, with increased specificity over naïve inclusion of ambiguous alignments, but there remains a tradeoff in improved sensitivity versus reduced specificity. An important caveat of this conclusion is that it is not possible to compute the actual specificity for the two sequenced human genomes, as the set of experimentally validated SVs is likely not to be the complete list of SVs in these genomes. In particular, the SVs with breakpoints in repetitive regions  those where we expect GASVPro to have some advantage  are also the hardest to predict and experimentally validate, and are thus likely greatly underrepresented in the list of experimentally validated predictions. As the lists of validated SVs become more complete, it will be possible to perform more complete benchmarking of the sensitivity and specificity of prediction methods.
The increased specificity attained by GASVPro demonstrates the benefit of including concordant coverage. An important consideration when using concordant mappings is that distinct regions of the genome will have reduced coverage for reasons unrelated to structural variation. As discussed in the Materials and methods section, repetitive sequences in the reference genome will reduce the ability of alignment software to align concordant fragments. In addition, as previously noted, there is a bias in Illumina sequencing related to the GC content of a region [14]. For the probabilistic model for deletions, we found that scaling concordant coverage according to the local mapability from the Rosetta Uniqueness Track improved sensitivity for detection. However, the use of a specific track is not essential for our model; indeed, the GASVPro code is modular and allows the user to substitute alternative models for concordant coverage and scaling. Finally, it has been previously suggested that RD is better modeled by distributions other than Poisson [50] and these could be used in place of the Poisson distribution in Equations 1 to 9.
The probabilistic method of GASVPro is formulated for a 'generic breakend' and is thus applicable to any SV class since we expect a drop in the coverage by concordant fragments at the breakends of the SV. Although deletion SVs have a stronger signal of decreased coverage throughout the region, by carefully considering the uncertainty in the location of mated breakends we identify the subtle signal of highly local drops in concordant coverage consistent with copy neutral variants such as inversions and reciprocal translocations. In this formulation, we assume 'clean' breaks in the genome, meaning there is no gain or loss of additional bases at the rearrangement junction. In practice, however, ambiguity in breakend location is likely to cause difficulties in estimating the true location and likelihood of a variant. For example, on the simulated Venter genome, coverage around the true variant breakends was significantly reduced by short indels.
As presented, our probabilistic model considered only concordant and discordant mappings; however, the model is easily generalized to include additional information about the alignments of PRs. As stated above, the SR signal can be included as part of the expected coverage around a breakend. The distribution of fragment lengths can be included when computing the likelihood of mated breakends (a,b) as each choice imposes a length on the supporting discordant fragments. Similarly, the mapping quality (or alignment score) of each mapped fragment can be incorporated into the probability function by considering the probability a chosen mapping is the correct one. We experimented with including quality scores on our simulated Illumina data set, but found this had a marginal effect on the results. However, with the addition of thirdgeneration sequencing technologies with different error models [51], quality scores may be important.
Finally, because our probabilistic model is based on the generative processes of sequencing genomes, our model can be adapted to more general settings, such as detecting structural variation in cancer genomes. However, the extension to cancer genomes is nontrivial. In particular, to accurately analyze cancer genomes one would need to consider sample heterogeneity as the sequenced genomes are inevitably a mixture of normal and cancer genomes and possibly tumor subpopulations. In addition, our probabilistic model would need to incorporate aneuploidy by allowing more than two copies of the genomic region.
Conclusions
Structural variation  including duplications, deletions, insertions, inversions and translocations  is an important component of genetic variation in both human and cancer genomes. Current methods for SV detection typically consider only one of several signals from resequencing data when predicting structural variation. We introduced GASVPro, a probabilistic model for identification of structural variation integrating both RD and PR signals of SVs. Compared to existing methods, GASVPro has high sensitivity in predicting known variants while reducing the number of false positives by up to 90% for deletions and 50% for inversions.
Materials and methods
Defining breakpoint regions with GASV
GASVPro clusters discordant PRs using the previously published program GASV [33]. The GASV algorithm explicitly represents uncertainty in the location of the endpoints of the SV, the mated breakends, by a polygon and clusters discordantly mapped fragments by utilizing a computational geometric approach for intersecting polygons. We briefly overview the approach used in GASV; for a more detailed discussion of the GASV algorithm, refer to [33].
A discordant mapping indicates a SV in the test genome defined by a novel adjacency (a,b), where positions a and b are adjacent in the test genome, but not in the reference genome (Figure 1). A single fragment alone does not uniquely specify the pair of breakends (a,b) defining the rearrangement, but rather defines uncertainty in the location of the breakends. Formally, if we assume that a discordant fragment corresponds to exactly one SV, then the mapped locations, x and y, of the fragment endpoints (without loss of generality we restrict x < y), and the breakends a and b satisfy:
${L}_{\text{min}}\le sign\left(x\right)\left(ax\right)+sign\left(y\right)\left(by\right)\le {L}_{\text{max}}$, (11)
where sign(x) and sign(y) are 1 if the reads align to the positive strand and have convergent orientation and 1 otherwise. Here we assume convergent orientation is when reads have opposite orientation with the left read forward and the right read reversed as in the case for Illumina sequencing technology. The inequality (Equation 11) defines a trapezoid in the plane; discordant fragments corresponding to the same SV will have overlapping trapezoids and their intersection can be used to further refine the uncertainty in breakend location as in Figure 1b.
Concordant coverage and mapability
We consider concordant mappings when computing the likelihood of a variant because statistically significant changes in coverage indicate the presence of rearrangements relative to the reference genome. However, in addition to SVs, several local factors will affect coverage by concordant fragments.
Reads originating from duplications present in both the test and reference genome cannot be mapped to a unique position. Thus, such regions will have low coverage due to restrictions in local mapability. To adjust for variable mapability throughout, in the deletion model we scaled the number of concordantly mapped fragments using The Rosetta Uniqueness Track. The Rosetta Uniqueness Track, created by John Castle at Rosetta Inpharmatics (Merck; UCSC Genome Browser), quantifies mapability by considering a 35bp tiling of the genome and determining which 35mers will have a unique mapping to the reference genome with the BurrowsWheeler aligner (BWA) mapping tool.
where we use $\alpha =0.3$ and $\beta =0.7$. Notice, when the interval $I$ does not have compromised mapability, that is, $R\left(I\right)=1$, we do not adjust the number of observed fragments, $\widehat{n}\left(I\right)=n\left(I\right)$.
Note that in our analysis we do not scale the number of discordant fragments. In practice we found an abundance of discordant fragments mapping to regions of very lowmapability and scaling the number of discordant fragments led to an abundance of false positive predictions. Finally, we utilized a heuristic when computing the likelihood of SVs. If the concordant coverage for a breakpoint or interval was in the top 0.01% according to the Poisson model, we automatically assigned $C\left(B\right)=0$. Since under the Poisson model extremely high coverage by concordant fragments occurs with low probability, this threshold further restricts the region considered to represent SV endpoints. In such cases, we expect coverage by concordant fragments to be decreased compared to the rest of the genome.
Prediction uncertainty and the double uncertainty metric
Most studies determine if predictions match a known variant by overlapping a predicted genomic interval with the interval reported for the known variant. However, the criteria for 'overlap' differs among methods. For example, Chen et al. [32] considered a match when the intersection of the intervals is at least 50% of the union of the two or if the predicted interval entirely contains the known variant. While Hormozdiari F et al. [42] reported a deletion as matching a known variant if they had 50% reciprocal overlap and considered any overlap between an inversion and known variant. Although these criteria do eliminate some types of spurious identification, the inherent weakness of these metrics is that they do not unambiguously represent the underlying uncertainty in the predicted or reported variants.
 1.
$\left[x\in ,x+\in \right]\cap \left[a\delta ,a+\delta \right]\ne \varnothing $
 2.
$\left[y\in ,y+\in \right]\cap \left[b\delta ,b+\delta \right]\ne \varnothing $
We provide illustrations of the double uncertainty metric in Additional file 1. We illustrate the conversion from output formats from different SV programs to use in the comparison in Figure A4 and overlap in the double uncertainty metric in Figure A5 in Additional file 1.
In practice, the reference uncertainty, $\delta $, and prediction uncertainty, $\in $, reflect limitations on the technology used, such as fragment size, but may also include ambiguity inherent in a breakend within a repetitive region. We use prediction uncertainty $\in ={L}_{\text{max}}/2$ to reflect the sequencing process. We base the reference uncertainty on the specific data set and technology used to obtain the known variants. For the Venter simulated deletions, reference uncertainty is 0 because these variants are specified to the breakpoint. For variants from fosmid mappings of NA12878 or NA18507 we use fosmid mappings reported by Kidd et al. [16] to determine the breakend polygon with GASV, and use the uncertainty directly from these polygons.
Markov chain Monte Carlo procedure
results in convergence of $\mathcal{M}$ to the stationary distribution $P\left(MA\right)$. The first step in our MCMC procedure (Figure A2 in Additional file 1) is to stay at the same mapping matrix $M$ with probability 1/2. This selfedge guarantees aperiodicity, but irreducibility depends on the set $\Gamma $ of possible moves. We developed several classes of moves (Figure A3 in Additional file 1) to explore the space of mapping matrices that yield irreducibility. The first move consists of naively moving a fragment from one mapping to another:
Naive (N):
Select a row i with uniform probability:
If there is a j such that ${m}_{ij}=1$ set ${m}_{ij}=0$. If there exists k ≠ j such that ${a}_{ik}=1$, then with probability $\left(1{p}_{err}\right)$ select a k uniformly and set ${m}_{ij}=1$. Otherwise, leave ${m}_{ik}=0$ for all k.
If ${m}_{ij}=0$ for all j, with uniform probability select a j where ${a}_{ij}=1$ and set ${m}_{ij}=1$.
Notice that the Naive (N) move always changes the mapping matrix. If $\Gamma $ consists of only class N moves, then the chain $\mathcal{M}\left(\Gamma \right)$ satisfies irreducibility because any two mapping matrices $M$ and ${M}^{\prime}$ may be reached from one another by a series of class N moves. Thus, the Markov chain $\mathcal{M}\left(\Gamma \right)$ with $\Gamma $ equal to the all class N moves will yield the stationary distribution $P\left(MA\right)$. However, we found empirically that the mixing time of a Markov chain with only a class N move was long (Section A2 in Additional file 1). Thus, we define three additional moves, which empirically yielded improved mixing times. (Recall from the main text that, for an assignment matrix $A$ and associated mapping matrix $M$, $\mathcal{V}\left(M\right)$ is the number of SVs with positive support, ${\mathcal{R}}_{j}\left(A\right)$ is the set of rows in $A$ with a 1 in the column j and ${\mathcal{S}}_{j}\left(M\right)=\left{\mathcal{R}}_{j}\left(M\right)\right$ is the total support for a variant j.)
Remove a single column (Z): this move zeroes out a column of M:
With probability $\frac{1}{\mathcal{V}\left(M\right)}$, select a nonzero column j.
For all $i\in {\mathcal{R}}_{j}\left(A\right)$:
If ${m}_{ij}=1$, set ${m}_{ij}=0$. If there exists k ≠ j such that ${a}_{ik}=1$, with probability $\left(1{p}_{err}\right)$ uniformly select a column k and set ${m}_{ik}=1$. Otherwise, leave ${m}_{ik}=0$ for all k.
Revive a zero column ($\stackrel{\u0304}{Z}$): This move adds support to a zero column of A:
With probability $\frac{1}{\left(n\mathcal{V}\left(M\right)\right)}$, where n is the total number of variants in V, select a zero column j, that is, a column j with ${\mathcal{S}}_{j}\left(M\right)=0$.
While ${m}_{ij}=0$ for all $i\in {\mathcal{R}}_{j}\left(A\right)$:
For each i, with probability 1/2, set ${m}_{ij}=1$ and ${m}_{ik}=0$ for all k ≠ j.
Swap columns (S): this move swaps some entries of two columns of A:
select a pair of columns $\left(j,k\right)$ conditional on at least one column having nonzero entries.
For all $i\in {\mathcal{R}}_{ij}\left(A\right)$:
If ${m}_{ij}=1$ or ${m}_{ik}=1$, then with probability 1/2, swap ${m}_{ij}$ and ${m}_{ik}$.
Repeat if necessary to ensure at least one entry is swapped between columns j and k.
The last step in formalizing the Markov chain is to compute the acceptance probabilities. As described above, the acceptance probability depends on the proposal distribution and the probability of the mapping matrix. In fact, since the acceptance probability depends on only the ratio $\frac{P\left({M}^{\prime}A\right)}{P\left(MA\right)}$, the computation is simplified to considering only the columns (variants) and rows (fragments) that differ between the matrices. This ratio is simple for class N and S moves since a Naive move alters exactly one row and at most two columns and a Swap move alters exactly two columns. Although in the worst case Z and $\stackrel{\u0304}{Z}$ may alter every row and column of the mapping matrix, this is quite rare in practice.
In order to compute the proposal distribution, we need to consider all ways to transition between mapping matrices ${M}^{\prime}$ and $M$. First, note that all move classes result in a new matrix ${M}^{\prime}$. Thus, the probability of a selfloop is always fixed at 1/2. Second, note that in many cases different move types will create the same resulting mapping matrix. For example, a class Z move on a variant with only a single supporting fragment is the same as a class N move on that supporting fragment. Thus, the proposal distribution $q\left({M}^{\prime},M\right)$ and $q\left(M,{M}^{\prime}\right)$ must consider all possible move types. We use ${q}_{N},{q}_{S},{q}_{Z}$ and ${q}_{\stackrel{\u0304}{Z}}$ to distinguish between the proposal distribution conditional on a move class.
 1.
If the altered row had only one possible nonzero entry in A, that is $\left{\mathcal{R}}_{i}\left(A\right)\right=1$, ${q}_{N}\left(M,{M}^{\prime}\right)=\left(1/F\right)$,
 2.
If ${m}_{ij}=1$ and ${m}_{ik}^{\prime}=1$ for ${m}_{ij}\in M$ and ${m}_{ik}^{\prime}\in {M}^{\prime}$, then ${q}_{N}\left(M,{M}^{\prime}\right)=\left(1/F\right)\frac{\left(1{p}_{err}\right)}{\left{\mathcal{R}}_{i}\left(A\right)\right1}$,
 3.
If ${m}_{ij}=1$ and ${m}_{ik}^{\prime}=0$ for ${m}_{ij}\in M$ and ${m}_{ik}^{\prime}\in {M}^{\prime}$ for all k, then ${q}_{N}\left(M,{M}^{\prime}\right)=\left(1/F\right){p}_{err}$,
 4.
If ${m}_{ij}=0$ and ${m}_{ik}^{\prime}=1$ for ${m}_{ij}\in M$ for all i and ${m}_{ik}^{\prime}\in {M}^{\prime}$, then ${q}_{N}\left(M,{M}^{\prime}\right)=\frac{\left(1/F\right)}{\left{\mathcal{R}}_{i}\left(A\right)\right},$
and 0 if no class N move is possible.
and 0 if no class Z move is possible.
and 0 if no class $\stackrel{\u0304}{Z}$ move is possible.
The proposal distribution for a class S move is nearly identical to the class $\stackrel{\u0304}{Z}$ move, except we need only consider the probability of picking the two columns instead of picking a single nonempty column. As before we define ${q}_{S}\left(M,{M}^{\prime}\right)=0$if no class S move is possible.
where $\chi \left(M,{M}^{\prime}\right)$ is an appropriate weighting factor based on which moves are possible. For example, if the transition from $M$ to ${M}^{\prime}$ is possible with all move types, then $\chi \left(M,{M}^{\prime}\right)=1/4$.
Efficient sampling of mapping matrices
A practical difficulty in sampling from the space of mapping matrices is the high dimension of the sampling space with millions of discordant fragments and hundreds of thousands of potential variants in the genomes in this study. However, we are still able to efficiently explore the space of mapping matrices by subdividing potential variants and discordant fragments into independent subsets and sampling instead over submatrices of $M$ (Figure 3).
To see that Equation 20 holds as equality, recall that P(M'A)/P(MA) is computed over rows and columns of M and M' except the term $\eta {e}^{\eta \mathcal{V}\left(M\right)}$, which only depends on the total number of variants with positive support. Again, let c' be the component affected by the move. Thus, when computing $P\left({M}^{\prime}A\right)/P\left(MA\right)$, every term corresponding to rows or columns that belong to components other than c' are cancelled. Moreover the ratio $\frac{\eta {e}^{\eta \mathcal{V}\left({M}^{\prime}\right)}}{\eta {e}^{\eta \mathcal{V}\left(M\right)}}={e}^{\eta \left(\mathcal{V}\left(M\right)\mathcal{V}\left({M}^{\prime}\right)\right)}$ depends on only the number of columns whose support changes. Thus, this ratio also depends on only the columns affected by the move. Therefore, $P\left({M}^{\prime}A\right)/P\left(MA\right)$ also becomes $P\left({M}_{{c}^{\prime}}^{\prime}{A}_{{c}^{\prime}}\right)/P\left({M}_{{c}^{\prime}}{A}_{{c}^{\prime}}\right)$ and Equation 20 is satisfied.
Defining the predicted variants
As indicated in the main text, we considered several different procedures for reporting a final set of predictions from the mapping matrices M sampled during the MCMC procedure. The simplest method is to consider a single mapping matrix that maximizes P(MA) as the truth and report the resulting variants. However, we found a useful procedure was to consider the entire set of mapping matrices sampled during the Markov chain $\mathcal{M}$.
Notice that ${\stackrel{\u0304}{M}}^{\left(\tau \right)}$ has two favorable properties: we consider at most one mapping for each fragment, and we exclude fragments that do not strongly support a single SV. In the results we present we define our final set of predictions by ${\stackrel{\u0304}{M}}^{\left(\tau \right)}=\left[{{\stackrel{\u0304}{m}}_{ij}}^{\tau}\right]$ and report variants based on their likelihood ratio Λ according to this mapping matrix. However, over all datasets studied, we found only minor differences in the ROC curves for three different sets of predictions (see Figure A4 in Additional file 1 for a comparison).
Mapping reads
We analyzed alignments to two human genomes (NA18507 and NA12878) and a simulated human chromosome. In all cases, we used two sets of alignment data: a high quality data set and a low quality data set. The high quality data set consisted of fragments with a clear and unique mapping to the reference genome. For human genomes NA18507 and NA12878, the reported mappings (from [14] and [44], respectively), were taken as the highquality set. For the simulated data for Venter chromosome 17, we mapped reads to the reference chromosome 17 with BWA [52] to determine the high quality unique mappings.
The low quality alignments were obtained by using NovaAlign [48] to realign reads not belonging to a uniquely mapped pair. We allowed up to 100 alignments per read, but to eliminate fragments from highly repetitive data, we removed all fragments with more than 100 alignments genomewide. Although our low quality alignments contained ambiguous fragments, many were low quality unique mappings. For NA18507, 516,941 out of 888,868 fragments included in the low quality set had unique mappings. For NA12878, 69,388 out of 157,842 fragments had unique mappings. For both sets of alignments, we removed concordant fragments, and retained all alignments with mapped distance ≤500 kb and with mapping quality >10.
For Breakdancer and GASV, results were only given on the high quality mappings. For GASVPro and Hydra, the suffix 'HQ' specifies results on the high quality datasets; results without the suffix were on the combined high and low quality datasets.
Running GASVPro
Runtime analysis
We now discuss details of the GASVPro algorithm; after identifying the set of discordant and concordant mappings, there are three steps in the pipeline of GASVPro: (1) clustering discordant mappings with GASV (O(nlogn) in the number of discordant fragments); (2) determining concordant coverage over each breakend polygon, (O(ClogC) where C the number of concordant fragments); and (3) running the MCMC sampling procedure.
When running the MCMC procedure on each connected component (Figure 3), we utilize a fixed number of burnin iterations (10^{5}) and sampling steps (9 × 10^{5}) based on heuristics developed in analyzing the simulated data. The complexity of the MCMC depends on selecting a move and deciding whether to accept the proposed move, each of which depends on the size of the connected component considered. With m discordant fragments and n variants in a connected component, the time to select a move, over all possible move types, is $O\left(m+n+{n}^{2}\right)$. Determining if a move is accepted depends on the number of altered variants and fragments; in the worst case all variants and fragments could be altered O(n+m), but in practice the total number of modified cases is quite small.
MCMC parameters and considerations
The parameter λ varied with the coverage of the data; we used λ = 0.3 for the simulated Venter chromosome and NA12878, and λ = 0.6 for NA18507. Our final results were not sensitive to the exponential prior on the number of variants. However, this term may be useful in other analyses. As discussed in Additional file 1, when η is large, P(MA) is dominated by the exponential distribution $\eta {e}^{\eta \mathcal{V}\left(M\right)}$, which is maximized when the number of variants is minimized. In addition, for the genomes we studied we further restricted the space of mapping matrices by fixing the mapping for fragments with a unique assignment in the genome. As such, the MCMC procedure would transition between possible mappings for only truly fragments with multiple possible alignments. Such a heuristic greatly reduces the computation time of the MCMC by significantly reducing the total space to sample.
Finally, additional considerations were made in the analysis of NA18507. A combination of high coverage and short reads (37 bp) resulted in nearly half a million predicted deletions and several extremely large connected components ≥2 × 10^{5} clusters with over 10^{6} fragments. Because of computational difficulties in analyzing these clusters, along with difficulties in determining convergence of the MCMC procedure, we eliminated all mappings to the centromeres and retained only mappings that indicated a deletion larger than 1,000 bp. (In the results discussed in the main text, all methods were compared on this same reduced set of fragments.) After these measures, there remained six connected components where the number of edges in the graph exceeded 10^{6}. In analyzing these connected components, we simply assigned fragments with unique mappings.
Pruning predicted structural variants
Results from all methods were pruned in a postprocessing step to eliminate redundant predictions. First, predictions from all methods were converted to intervals. For GASV, a breakend polygon B was converted into an interval $I\left(B\right)=\left[{a}_{\text{max}},{b}_{\text{min}}\right]$, where ${a}_{max}=\underset{a}{\text{arg}\text{max}}\left\{\left(a,b\right)\in B\right\}$ and ${b}_{\text{min}}=\underset{b}{\text{arg}\text{min}}\left\{\left(a,b\right)\in B\right\}$. For Breakdancer we considered the reported interval [x,y] and for Hydra we considered the interval [IE,OS] (Figure A5 in Additional file 1).
Two predictions were said to be redundant if the intersection of their intervals was at least 50% of the union or if one interval contained the other. In such cases, for GASV, Breakdancer and Hydra, the prediction with more supporting fragments was retained. For GASVPro, the prediction with greater likelihood according to the respective model was retained.
Known variants
As discussed in the text, we compared predictions to sets of known SVs with the double uncertainty metric. Importantly, this metric considers uncertainty in the location of both the prediction and known variant.
For the simulated Venter genome, we compare predictions to the set of deletions and inversions detailed in [47]; there were 4 inversions and we used the 124 deletions with length ≥125 bp. When comparing predictions to known variants, we use reference uncertainty δ = 0 because the true location of the breakpoints is known exactly.
For both genomes, NA12878 and NA18507, we compared predictions from each method to two sets of validated variants. The first was from a fosmid sequencing study [16] and the second from the 1000 Genomes Project pilot study [44]. Although the combined set of variants likely contained duplicates, to maximize sensitivity we did not attempt to reduce these sets by eliminating predictions reported by both studies.
A fosmid sequencing study validated hundreds of inversions and deletions [16]. We considered only the subset of predictions that were validated in the same individuals (NA18507, NA12878). As previously reported, several of the predictions from the original study did not have a common breakpoint region defined by fosmid mappings [33]. Thus, we restricted the validated set to the 93 deletions and 10 inversions for NA18507 and 151 deletions and 23 inversions for NA12878 that corresponded to clusters of at least two fosmids. In comparisons, we utilized the inherent uncertainty in breakend polygons [33] as the reference uncertainty.
The 1000 Genomes Project pilot study reported validated deletion variants as well as the individuals to whom they belonged [44]. (Note that the pilot study did not report inversion SVs.) We separated validated variants that were identified by PR mapping and restricted to deletions that were larger than 5 kb. The final set represented a total of 118 deletions for NA18507 and 139 deletions for NA12878. Because many nextgeneration sequencing libraries were used in predicting these variants, we used δ = 200 as an approximation for the prediction uncertainty in these variants.
Abbreviations
 beRD:

breakend read depth
 bp:

base pair
 BWA:

BurrowsWheeler aligner
 GASV:

Geometric Analysis of Structural Variation
 MCMC:

Markov chain Monte Carlo
 PR:

paired read
 RD:

read depth
 ROC:

receiver operating characteristic
 SR:

split read
 SV:

structural variant.
Declarations
Acknowledgements
We would like to thank Paul Medvedev for his assistance with running CNVer and Anna Ritz for helpful discussions regarding the manuscript. This work is supported by National Institutes of Health (R01 HG5690) and Burroughs Wellcome Career Award at the Scientific Interface to BJR. Some read alignments were obtained using the free version of Novoalign [48].
Authors’ Affiliations
References
 Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, Fitzgerald T, Hu M, Ihm CH, Kristiansson K, Macarthur DG, Macdonald JR, Onyiah I, Pang AW, Robson S, Stirrups K, Valsesia A, Walter K, Wei J, Wellcome Trust Case Control Consortium, TylerSmith C, Carter NP, Lee C, Scherer SW, Hurles ME: Origins and functional impact of copy number variation in the human genome. Nature. 2009, 464: 704712.PubMedPubMed CentralView ArticleGoogle Scholar
 Xing J, Zhang Y, Han K, Salem AH, Sen SK, Huff CD, Zhou Q, Kirkness EF, Levy S, Batzer MA, Jorde LB: Mobile elements create structural variation: analysis of a complete human genome. Genome Res. 2009, 19: 15161526. 10.1101/gr.091827.109.PubMedPubMed CentralView ArticleGoogle Scholar
 Navin N, Krasnitz A, Rodgers L, Cook K, Meth J, Kendall J, Riggs M, Eberling Y, Troge J, Grubor V, Levy D, Lundin P, Månér S, Zetterberg A, Hicks J, Wigler M: Inferring tumor progression from genomic heterogeneity. Genome Res. 2010, 20: 6810.1101/gr.099622.109.PubMedPubMed CentralView ArticleGoogle Scholar
 Ding L, Ellis MJ, Li S, Larson DE, Chen K, Wallis JW, Harris CC, McLellan MD, Fulton RS, Fulton LL, Abbott RM, Hoog J, Dooling DJ, Koboldt DC, Schmidt H, Kalicki J, Zhang Q, Chen L, Lin L, Wendl MC, McMichael JF, Magrini VJ, Cook L, McGrath SD, Vickery TL, Appelbaum E, Deschryver K, Davies S, Guintoli T, Lin L, et al: Genome remodelling in a basallike breast cancer metastasis and xenograft. Nature. 2010, 464: 9991005. 10.1038/nature08989.PubMedPubMed CentralView ArticleGoogle Scholar
 Pleasance ED, Cheetham RK, Stephens PJ, McBride DJ, Humphray SJ, Greenman CD, Varela I, Lin ML, Ordóñez GR, Bignell GR, Ye K, Alipaz J, Bauer MJ, Beare D, Butler A, Carter RJ, Chen L, Cox AJ, Edkins S, KokkoGonzales PI, Gormley NA, Grocock RJ, Haudenschild CD, Hims MM, James T, Jia M, Kingsbury Z, Leroy C, Marshall J, Menzies A, et al: A comprehensive catalogue of somatic mutations from a human cancer genome. Nature. 2009, 463: 191196.PubMedPubMed CentralView ArticleGoogle Scholar
 Ding L, Wendl M, Koboldt D, Mardis E: Analysis of nextgeneration genomic data in cancer: accomplishments and challenges. Hum Mol Genet. 2010, 19: R18810.1093/hmg/ddq391.PubMedPubMed CentralView ArticleGoogle Scholar
 Wittler R, Chauve C: Consistencybased detection of potential tumorspecific deletions in matched normal/tumor genomes. BMC Bioinformatics. 2011, 12: S21PubMedPubMed CentralView ArticleGoogle Scholar
 Carreto L, Eiriz M, Gomes A, Pereira P, Schuller D, Santos M: Comparative genomics of wild type yeast strains unveils important genome diversity. BMC Genomics. 2008, 9: 52410.1186/147121649524.PubMedPubMed CentralView ArticleGoogle Scholar
 Cridland J, Thornton K: Validation of rearrangement break points identified by pairedend sequencing in natural populations of Drosophila melanogaster. Genome Biol Evol. 2010, 2: 8310.1093/gbe/evq001.PubMedPubMed CentralView ArticleGoogle Scholar
 Yalcin B, Wong K, Agam A, Goodson M, Keane TM, Gan X, Nellåker C, Goodstadt L, Nicod J, Bhomra A, HernandezPliego P, Whitley H, Cleak J, Dutton R, Janowitz D, Mott R, Adams DJ, Flint J: Sequencebased characterization of structural variation in the mouse genome. Nature. 2011, 477: 326329. 10.1038/nature10432.PubMedPubMed CentralView ArticleGoogle Scholar
 Medvedev P, Stanciu M, Brudno M: Computational methods for discovering structural variation with nextgeneration sequencing. Nat Methods. 2009, 6: S13S20. 10.1038/nmeth.1374.PubMedView ArticleGoogle Scholar
 Alkan C, Coe B, Eichler E: Genome structural variation discovery and genotyping. Nat Rev Genet. 2011, 12: 363376. 10.1038/nrg2958.PubMedPubMed CentralView ArticleGoogle Scholar
 Dalca A, Brudno M: Genome variation discovery with highthroughput sequencing data. Brief Bioinformatics. 2010, 11: 310.1093/bib/bbp058.PubMedView ArticleGoogle Scholar
 Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456: 5359. 10.1038/nature07517.PubMedPubMed CentralView ArticleGoogle Scholar
 Sudmant P, Kitzman J, Antonacci F, Alkan C, Malig M, Tsalenko A, Sampas N, Bruhn L, Shendure J, Eichler E: Diversity of human copy number variation and multicopy genes. Science. 2010, 330: 64110.1126/science.1197005.PubMedPubMed CentralView ArticleGoogle Scholar
 Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N, Teague B, Alkan C, Antonacci F, Haugen E, Zerr T, Yamada NA, Tsang P, Newman TL, Tuzun E, Cheng Z, Ebling HM, Tusneem N, David R, Gillett W, Phelps KA, Weaver M, Saranga D, Brand A, Tao W, Gustafson E, McKernan K, Chen L, Malig M, et al: Mapping and sequencing of structural variation from eight human genomes. Nature. 2008, 453: 5664. 10.1038/nature06862.PubMedPubMed CentralView ArticleGoogle Scholar
 Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM, Palejev D, Carriero NJ, Du L, Taillon BE, Chen Z, Tanzer A, Saunders ACE, Chi J, Yang F, Carter NP, Hurles ME, Weissman SM, Harkins TT, Gerstein MB, Egholm M, Snyder M: Pairedend mapping reveals extensive structural variation in the human genome. Science. 2007, 318: 420426. 10.1126/science.1149504.PubMedPubMed CentralView ArticleGoogle Scholar
 Iafrate A, Feuk L, Rivera M, Listewnik M, Donahoe P, Qi Y, Scherer S, Lee C: Detection of largescale variation in the human genome. Nat Genet. 2004, 36: 949951. 10.1038/ng1416.PubMedView ArticleGoogle Scholar
 Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E, Hayden H, Albertson D, Pinkel D, Olson MV, Eichler EE: Finescale structural variation of the human genome. Nat Genet. 2005, 37: 727732. 10.1038/ng1562.PubMedView ArticleGoogle Scholar
 Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Largescale copy number polymorphism in the human genome. Science. 2004, 305: 525528. 10.1126/science.1098918.PubMedView ArticleGoogle Scholar
 Abel H, Duncavage E, Becker N, Armstrong J, Magrini V, Pfeifer J: SLOPE: a quick and accurate method for locating nonSNP structural variation from targeted nextgeneration sequence data. Bioinformatics. 2010, 26: 268410.1093/bioinformatics/btq528.PubMedView ArticleGoogle Scholar
 Volik S, Zhao S, Chin K, Brebner JH, Herndon DR, Tao Q, Kowbel D, Huang G, Lapuk A, Kuo WL, Magrane G, De Jong P, Gray JW, Collins C: Endsequence profiling: sequencebased analysis of aberrant genomes. Proc Natl Acad Sci USA. 2003, 100: 76967701. 10.1073/pnas.1232418100.PubMedPubMed CentralView ArticleGoogle Scholar
 Hillmer AM, Yao F, Inaki K, Lee WH, Ariyaratne PN, Teo AS, Woo XY, Zhang Z, Zhao H, Ukil L, Chen JP, Zhu F, So JB, SaltoTellez M, Poh WT, Zawack KF, Nagarajan N, Gao S, Li G, Kumar V, Lim HP, Sia YY, Chan CS, Leong ST, Neo SC, Choi PS, Thoreau H, Tan PB, Shahab A, Ruan X, et al: Comprehensive longspan pairedendtag mapping reveals characteristic patterns of structural variations in epithelial cancer genomes. Genome Res. 2011, 21: 66510.1101/gr.113555.110.PubMedPubMed CentralView ArticleGoogle Scholar
 Alkan C, Sajjadian S, Eichler EE: Limitations of nextgeneration genome sequence assembly. Nat Methods. 2011, 8: 6165. 10.1038/nmeth.1527.PubMedPubMed CentralView ArticleGoogle Scholar
 Schatz MC, Delcher AL, Salzberg SL: Assembly of large genomes using secondgeneration sequencing. Genome Res. 2010, 20: 11651173. 10.1101/gr.101360.109.PubMedPubMed CentralView ArticleGoogle Scholar
 Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009, 19: 15861592. 10.1101/gr.092981.109.PubMedPubMed CentralView ArticleGoogle Scholar
 Chiang DY, Getz G, Jaffe DB, O'Kelly MJ, Zhao X, Carter SL, Russ C, Nusbaum C, Meyerson M, Lander ES: Highresolution mapping of copynumber alterations with massively parallel sequencing. Nat Methods. 2009, 6: 99103. 10.1038/nmeth.1276.PubMedPubMed CentralView ArticleGoogle Scholar
 McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu Y, Tsung EF, Clouser CR, Duncan C, Ichikawa JK, Lee CC, Zhang Z, Ranade SS, Dimalanta ET, Hyland FC, Sokolsky TD, Zhang L, Sheridan A, Fu H, Hendrickson CL, Li B, Kotler L, Stuart JR, Malek JA, Manning JM, Antipova AA, Perez DS, Moore MP, Hayashibara KC, Lyons MR, Beaudoin RE, et al: Sequence and structural variation in a human genome uncovered by shortread, massively parallel ligation sequencing using twobase encoding. Genome Res. 2009, 19: 15271541. 10.1101/gr.091868.109.PubMedPubMed CentralView ArticleGoogle Scholar
 Xie C, Tammi MT: CNVseq, a new method to detect copy number variation using highthroughput sequencing. BMC Bioinformatics. 2009, 10: 8010.1186/147121051080.PubMedPubMed CentralView ArticleGoogle Scholar
 Nord A, Lee M, King M, Walsh T: Accurate and exact CNV identification from targeted highthroughput sequence data. BMC Genomics. 2011, 12: 18410.1186/1471216412184.PubMedPubMed CentralView ArticleGoogle Scholar
 Abyzov A, Urban A, Snyder M, Gerstein M: CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011, 21: 97410.1101/gr.114876.110.PubMedPubMed CentralView ArticleGoogle Scholar
 Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley TJ, Wilson RK, Ding L, Mardis ER: BreakDancer: an algorithm for highresolution mapping of genomic structural variation. Nat Methods. 2009, 6: 677681. 10.1038/nmeth.1363.PubMedPubMed CentralView ArticleGoogle Scholar
 Sindi S, Helman E, Bashir A, Raphael BJ: A geometric approach for classification and comparison of structural variants. Bioinformatics. 2009, 25: i222230. 10.1093/bioinformatics/btp208.PubMedPubMed CentralView ArticleGoogle Scholar
 Quinlan A, Clark R, Sokolova S, Leibowitz M, Zhang Y, Hurles M, Mell J, Hall I: Genomewide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res. 2010, 20: 62310.1101/gr.102970.109.PubMedPubMed CentralView ArticleGoogle Scholar
 Hormozdiari F, Hajirasouliha I, Dao P, Hach F, Yorukoglu D, Alkan C, Eichler E, Sahinalp S: Nextgeneration VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics. 2010, 26: i35010.1093/bioinformatics/btq216.PubMedPubMed CentralView ArticleGoogle Scholar
 Ye K, Schulz M, Long Q, Apweiler R, Ning Z: Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from pairedend short reads. Bioinformatics. 2009, 25: 286510.1093/bioinformatics/btp394.PubMedPubMed CentralView ArticleGoogle Scholar
 Qi J, Zhao F: inGAPsv: a novel scheme to identify and visualize structural variation from paired end mapping data. Nucleic Acids Res. 2011, 39: W56710.1093/nar/gkr506.PubMedPubMed CentralView ArticleGoogle Scholar
 Medvedev P, Fiume M, Dzamba M, Smith T, Brudno M: Detecting copy number variation with mated short reads. Genome Res. 2010, 20: 161310.1101/gr.106344.110.PubMedPubMed CentralView ArticleGoogle Scholar
 Bashir A, Volik S, Collins C, Bafna V, Raphael B: Evaluation of pairedend sequencing strategies for detection of genome rearrangements in cancer. PLoS Comput Biol. 2008, 4: e100005110.1371/journal.pcbi.1000051.PubMedPubMed CentralView ArticleGoogle Scholar
 Antonacci F, Kidd J, MarquesBonet T, Ventura M, Siswara P, Jiang Z, Eichler E: Characterization of six human diseaseassociated inversion polymorphisms. Hum Mol Genet. 2009, 18: 255510.1093/hmg/ddp187.PubMedPubMed CentralView ArticleGoogle Scholar
 Lee S, Cheran E, Brudno M: A robust framework for detecting structural variations in a genome. Bioinformatics. 2008, 24: 5967. 10.1093/bioinformatics/btn176.View ArticleGoogle Scholar
 Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC: Combinatorial algorithms for structural variation detection in highthroughput sequenced genomes. Genome Res. 2009Google Scholar
 Mills R, Walter K, Stewart C, Handsaker R, Chen K, Alkan C, Abyzov A, Yoon S, Ye K, Cheetham R, Chinwalla A, Conrad D, Fu Y, Grubert F, Hajirasouliha I, Hormozdiari F, Iakoucheva L, Iqbal Z, Kang S, Kidd J, Konkel M, Korn J, Khurana E, Kural D, Lam H, Leng J, Li R, Li Y, Lin CY, Luo R, et al: Mapping copy number variation by populationscale genome sequencing. Nature. 2011, 470: 5965. 10.1038/nature09708.PubMedPubMed CentralView ArticleGoogle Scholar
 Altshuler D, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A, Clark AG, Collins FS, De La Vega FM, Donnelly P, Egholm M, Flicek P, Gabriel SB, Gibbs RA, Knoppers BM, Lander ES, Lehrach H, Mardis ER, McVean GA, Nickerson DA, Peltonen L, Schafer AJ, Sherry ST, Wang J, Wilson R, Gibbs RA, Deiros D, Metzker M, Muzny D, Reid J, Wheeler D, et al: A map of human genome variation from populationscale sequencing. Nature. 2010, 467: 10611073. 10.1038/nature09534.PubMedView ArticleGoogle Scholar
 VCF (Variant Calling Format) version 4.1. [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcfvariantcallformatversion41]
 Lander E, Waterman M: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988, 2: 231239. 10.1016/08887543(88)900079.PubMedView ArticleGoogle Scholar
 Levy S, Sutton G, Ng P, Feuk L, Halpern A, Walenz B, Axelrod N, Huang J, Kirkness E, Denisov G, Lin Y, MacDonald J, Pang A, Shago M, Stockwell T, Tsiamouri A, Bafna V, Bansal V, Kravitz S, Busam D, Beeson K, McIntosh T, Remington K, Abril J, Gill J, Borman J, Rogers Y, Frazier M, Scherer S, Strausberg R, Venter J: The diploid genome sequence of an individual human. PLoS Biol. 2007, 5: e25410.1371/journal.pbio.0050254.PubMedPubMed CentralView ArticleGoogle Scholar
 Novocraft: Novoalign. [http://www.novocraft.com/main/index.php]
 Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The sequence alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 207810.1093/bioinformatics/btp352.PubMedPubMed CentralView ArticleGoogle Scholar
 Xi R, Hadjipanayis A, Luquette L, Kim T, Lee E, Zhang J, Johnson M, Muzny D, Wheeler D, Gibbs R, Kucherlapati R, Park P: Copy number variation detection in wholegenome sequencing data using the Bayesian information criterion. Proc Natl Acad Sci USA. 2011, 108: E11281136. 10.1073/pnas.1110574108.PubMedPubMed CentralView ArticleGoogle Scholar
 Metzker M: Sequencing technologies  the next generation. Nat Rev Genet. 2009, 11: 3146.PubMedView ArticleGoogle Scholar
 Li H, Durbin R: Fast and accurate short read alignment with BurrowsWheeler transform. Bioinformatics. 2009, 25: 17541760. 10.1093/bioinformatics/btp324.PubMedPubMed CentralView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.