VALOR2: characterization of large-scale structural variants using linked-reads

Karaoğlanoğlu, Fatih; Ricketts, Camir; Ebren, Ezgi; Rasekh, Marzieh Eslami; Hajirasouliha, Iman; Alkan, Can

doi:10.1186/s13059-020-01975-8

Method
Open access
Published: 19 March 2020

VALOR2: characterization of large-scale structural variants using linked-reads

Fatih Karaoğlanoğlu¹^na1,
Camir Ricketts^2,4,
Ezgi Ebren¹,
Marzieh Eslami Rasekh³,
Iman Hajirasouliha^4,5 &
…
Can Alkan^1,6

Genome Biology volume 21, Article number: 72 (2020) Cite this article

4437 Accesses
10 Citations
14 Altmetric
Metrics details

Abstract

Most existing methods for structural variant detection focus on discovery and genotyping of deletions, insertions, and mobile elements. Detection of balanced structural variants with no gain or loss of genomic segments, for example, inversions and translocations, is a particularly challenging task. Furthermore, there are very few algorithms to predict the insertion locus of large interspersed segmental duplications and characterize translocations. Here, we propose novel algorithms to characterize large interspersed segmental duplications, inversions, deletions, and translocations using linked-read sequencing data. We redesign our earlier algorithm, VALOR, and implement our new algorithms in a new software package, called VALOR2.

Background

Alterations of DNA content and organization larger than 50 bp, commonly referred to as genomic structural variations (SVs) [1], are among the major drivers of evolution [2, 3] and diseases of genomic origin [4]. Despite decades of research, they remain difficult to accurately characterize contributing to our lack of full understanding of the etiology of complex diseases, termed missing heritability [5].

High-throughput sequencing (HTS) technologies are widely employed to discover and genotype various classes of SVs since their inception [6–13]. However, effectiveness has been limited by either very short read lengths (e.g., Illumina) or high error rates (e.g., PacBio and Oxford Nanopore). The human genome complexity further contributes to our lack of full characterization of structural variants, especially large-scale duplications and balanced rearrangements (inversions and balanced translocations) due to the repetitive and duplicated sequence at the SV breakpoints [14]. Despite high error rates and high requirement for DNA input, long reads offer improvement in complex SV discovery, either used alone [15, 16] or when integrated with standard short-read sequencing data [17].

Recently the linked-read sequencing method such as the 10x Genomics system (10xG), transposase enzyme linked long-read sequencing (TELL-Seq), and single-tube long fragment read (stLFR) was introduced as an alternative method to generate highly accurate Illumina short-read data with additional long-range information [18]. In linked-read sequencing, large DNA molecules (typically 10–100 kbp) are barcoded and randomly separated into a very large number of partitions (here, we term these partitions “pools”). For example, in the 10xG system, each pool contains roughly 2–30 large molecules, and the number of pools is typically over a million. These pools are then sequenced at very low coverage (∼0.1×) using the standard Illumina platform. Shared barcodes among Illumina read pairs show them as generated from the same pool. Since each pool is diluted to contain only a very small fraction of the input DNA, the probability of barcode collision is negligible [19]. For example, assuming 20 molecules per pool and an average size of 30 kbp per molecule, each pool on average contains only \(\frac {1}{5000}\) of the haploid human genome. Linked-reads then can be used to “reconstruct” large molecules that originate from the same haplotype. Furthermore, linked-read sequencing makes it possible to obtain very high physical coverage with the cost of generating moderate sequence coverage data^{Footnote 1}.

The ability of extracting long-range information from accurate and inexpensive but short-read sequencing data makes linked-read sequencing attractive for various applications [13]. It has been used for genome scaffolding [20], haplotype-aware assembly [18, 21, 22], metagenomics [23], single-cell transcriptome profiling [24, 25] and regulatory network clustering [26], haplotype phasing [18, 21, 27], and genome structural variation discovery [19, 28–30].

Linked-read techniques for genomic structural variation discovery include VALOR [28], Long Ranger [29], and GROC-SVs [30]. VALOR was the first algorithm that used “split molecule” signature, similar to the commonly used split read signature [31], together with traditional read pair signature [1, 8, 32] to characterize large (> 500 kbp) inversions. Split molecules are defined as large molecules that span an SV breakpoint, and therefore mapped as two disjoint intervals to the reference genome.

Long Ranger [29] is a comprehensive software package developed by 10x Genomics, for the purpose of barcode-aware read alignment (Lariat module) and resolving full-scale human germline genome variation, while GROC-SVs is an optimized tool for somatic and complex SVs in cancer genomes. Both Long Ranger and GROC-SVs employ a novel idea to utilize discordance in expected “barcode coverage” as well as barcode similarities across distant locations for potential large-scale SV signals. In addition, GROC-SVs [30] performs local assembly on barcoded reads to detect large complex events that are between 10 and 100 kbp with breakpoint resolution.

Despite the aforementioned advances in SV discovery using various technologies, detecting complex SV such as balanced rearrangements (i.e., inversions and translocations), and segmental duplications (SDs) remains challenging due to mapping ambiguity. Note that it is still possible to identify increase in SD copy number using read depth signature [33, 34]; however, no linked-read-based method yet exists to anchor a new SD (i.e., find their insertion locations). We note that the TARDIS algorithm [35] can locate new SDs; however, it is developed for short-read sequencing data only; therefore, it can find only short duplications (up to 10 kbp) copied to a distance of up to 50 kbp.

Here, we present novel algorithms to discover deletions, inversions, translocations, and large (> 40 kbp) direct and inverted interspersed SDs using linked-read sequencing data. We redesign and extend upon VALOR and use split molecule and read pair signatures to detect SDs and estimate the insertion sites of the new SD paralogs, and further include read depth signature to filter potential false positives caused by incorrect mappings. We implemented our new algorithms as the VALOR2 software package. Briefly, VALOR2 differs from the former version of VALOR through (1) it can characterize segmental duplications in both direct and inverted orientation, (2) it can discover translocations and deletions, (3) it incorporates read depth information to improve predictions and reduce false calls, (4) it provides full support to alignment files (i.e., BAM) generated from 10xG linked-read data sets, and (5) provides a 10-fold speed up in run time (data not shown).

Using simulated data sets, we show that VALOR2 achieves high precision and recall (85% and 83%, respectively) for segmental duplications, 83% and 60% for large inversions, 91% and 87% for deletions, and 100% and 71% for translocations. We also applied VALOR2 to the genomes of NA12878 and a Yoruban trio (NA19238, NA19239, NA19240) in addition to two haploid genomes (CHM1 [18], CHM13 [36]) sequenced with the 10xG platform.

Methods

We have previously described an earlier version of VALOR2 that uses split molecules and read pair signature to detect inversions [28]. Here, we describe novel formulations, algorithms, and optimizations to characterize large (> 80 kbp) inversions, deletions (> 100 kbp), translocations (> 100 kbp), and segmental duplications (> 40 kbp) in both direct and inverted orientations. We depict the split molecule and read pair sequence signatures for these types of large SVs in Fig. 1.

Glossary

Here, we define several terms that we use in this manuscript:

Molecule: a large molecule (30–50 kbp) that was barcoded and pooled using a linked-read platform. Here, we refer to as the physical entity.
Submolecule: a molecule identified in silico by the VALOR2 algorithm by analyzing the read map locations.
Candidate split: a pair of submolecules with the same barcode that potentially signal single breakpoint of an SV event.
Split molecule pair: a pair of candidate splits with different barcodes that potentially signal the different breakpoints of the same SV event.

Overview of the VALOR2 algorithm

VALOR2 depends on only the alignment files (i.e., BAM) with the necessary barcode information generated with Long Ranger/Lariat, BWA-MEM, or a similar read mapper. Briefly, VALOR2 first tries to identify the underlying large molecules separately for each barcode, which we call submolecules. In this step, we do not consider reads that map to satellite regions, and we discard very short submolecules. Two identified submolecules are paired together (called candidate splits) if the summation of their span is ≤μ_molecule+3σ_molecule where μ_molecule is the average and σ_molecule is the standard deviation of the inferred submolecule sizes. Next, VALOR2 removes those candidate splits with no read pair support. VALOR2 then (1) matches candidate splits with different barcodes that are likely to signal individual breakpoints of the same SV event; (2) filters out candidates with low read pair support, additionally it discards those that signal a deletion or duplication event without read depth support; and (3) models the split molecule pairs as vertices in a graph and approximately discovers the maximal quasi cliques for each connected component of this graph. In this graph, edges represent overlap (i.e., “agreement”) between two split molecule pairs. Finally, VALOR2 reports SVs that are supported by more than a threshold of split molecules.

Below, we present the details for each step in the VALOR2 algorithm.

Molecule recovery

The first step of the VALOR2 algorithm involves identification (or, recovery) of the large molecules from mapped data. Initially, we call the intervals returned by this recovery as submolecules. For this purpose, we use a sliding window approach to greedily group reads with the same barcode which are mapped in close proximity (Additional file 1: Algorithm S1). Here, we only consider concordantly mapped read pairs, and we take the full span of a read pair as a fragment. For each barcode, we scan each chromosome and merge together fragments if they are within a user-defined distance T, or if a new fragment is within distance Q from the leftmost fragment in a re-identified submolecule. We use Q=2·μ_molecule and T=μ_molecule/4 by default^{Footnote 2}, determined by parameter sweeping. Finally, we remove very short submolecules (< 3 kbp by default) that correspond to less than 10% of expected average molecule size from consideration.

Candidate split matching

We first record all pairs of submolecules that share the same barcode and map to the same chromosome as candidate splits and then compare all possible pairs of candidate splits across different barcodes (termed split molecule pairs) to find those that signal a structural variation (see Fig. 1 for the depiction of candidate splits and split molecule pairs). We limit inversion predictions and the duplication size by the largest inversion size we can find in the literature [37] (≈ 7 Mbp). Next, we test whether the split molecule pairs are supported by read pair signature (Fig. 1). Here, we require at least 3 read pairs to signal the same SV event, and we remove candidate splits with insufficient support from consideration.

Candidate splits for translocations

While it is possible to exhaustively test all pairs of candidate splits for intra-chromosomal events, it is infeasible to follow the same approach for inter-chromosomal variants. This is due to the relatively high number of distinct molecules sharing the same barcode (up to 30) and very high number of barcodes (up to 4 million). To overcome this issue, we first use discordant read pairs as anchors and attach two other submolecules with the same barcode that map close to each end (Additional file 1: Figure S2).

Clustering using SV graph

We construct an SV graph G as follows (Fig. 2). We denote each remaining split molecule pair as a vertex in G, and we create an edge between two vertices if their corresponding split molecule pairs signal the same SV event. Finally, on the resulting graph, we find clusters of read pair-supported split molecule pairs by approximately solving the maximal clique problem using the quasi-clique formulation [38]. Here, a quasi clique is defined as an approximate clique with V vertices and γ·(|V|2) edges, where γ is a user-defined parameter, which we set to γ=0.6 by default. Each quasi clique defines a putative SV event.

We identify inversion and deletion breakpoints with two coordinates, duplications, and translocations with three coordinates. Third breakpoint denotes the insertion coordinates given within a confidence interval.

Molecule depth filtering

Although there are only a small number of molecules that share the same barcode (2–30), it is still possible that two or more different molecules originate from the same chromosome. Additionally, the molecule sizes do not follow Gaussian, Poisson, or a similar distribution (Fig. 3); thus, it is not possible to distinguish true split molecules from “normal” but short molecules. The read pair sequence signature is not entirely reliable either due to the mismapping artifacts within or around repeats and duplications. We, therefore, apply additional filtering on duplication calls based on “molecule depth.” We reason that the number of molecules that originate from segmental duplications must be higher than the genome-wide average, similar to the traditional read depth signature [33, 39]. In this step, we first calculate the average molecule depth (μ_depth) and standard deviation (σ_depth) in the entire genome. We then discard segmental duplication predictions with molecule depth <μ_depth+σ_depth, deletion predictions with molecule depth >0.5μ_depth+0.5σ_depth, and translocation predictions with molecule depth outside μ_depth±1.5σ_depth at the source.

Results

We tested VALOR2 using both simulated and real data sets to compare the precision and recall rates of VALOR2 with the state-of-the-art tool that use linked-read sequencing (Long Ranger [29]), three tools that use only short-read WGS data sets (DELLY [40] LUMPY [11], TARDIS [12, 35]), and one that uses long read WGS data sets (Sniffles [41]). For LUMPY, we used the smoove wrapper as recommended by the authors. We also tried to run GROC-SVs; however, the tool crashed due to excessive memory usage.

Among these tools, VALOR2 and TARDIS are the only tools that can characterize interspersed duplications. However, the size range of variants that they can detect is complementary. VALOR2 aims to find duplications larger than 40 kb copied to > 80 kb away from the source, where TARDIS can only detect duplications that are copied within 50 kb from the source; therefore, we removed TARDIS from comparisons of segmental duplication predictions. Since there is no comparable tool to our knowledge, we only provide VALOR2 results on interspersed duplications. We compared inversion and deletion prediction performance of VALOR2 with LUMPY, DELLY, TARDIS, Sniffles, and Long Ranger. Similarly, we compared the translocation predictions with LUMPY, DELLY, and Long Ranger since TARDIS and Sniffles do not currently support translocation discovery. As we designed VALOR2 as a complementary method, we also provide results of union and intersection of VALOR2 and Long Ranger SV calls.

Simulation experiments

We used VarSim [42] to generate a simulated diploid human genome. We note that VarSim randomly selects SNVs, indels, and SVs from a database of known variants and inserts them into the simulated genome. Our simulation included variants of various lengths and types: 2.8 million SNPs, ≈ 195,000 indels, and ≈ 5000 SVs (> 50 bp, up to 6 Mbps). We found that VarSim only generates tandem duplications and does not simulate translocations; therefore, we randomly changed a subset of simulated tandem duplications to interspersed duplications and non-reciprocal translocations (by deleting the source copy) in the simulated VCF file, assigned random insertion breakpoints, and then applied changes to the reference. We then generated Illumina WGS reads using ART [43] and PacBio long reads using PBSim [44] at 40 × depth of coverage and 10xG linked-reads at 50 × coverage using LRSim [45]. The 10xG linked-read simulation has extra coverage to account for the barcode sequences that are part of the read and other losses as also described in [29].

Auxiliary files released with the current version of VarSim only support the human reference genome build 37 (GRCh37); therefore, we mapped the simulated reads to GRCh37 using BWA-MEM [46] for Illumina, NGMLR [41] for PacBio (as recommended by Sniffles authors), and Long Ranger for 10xG data sets. We then applied the standard BAM processing that includes sorting with SAMtools [47] and marking duplicates with Sambamba [48]. We used VALOR2 and Long Ranger to generate SV call sets from the 10xG simulation, and DELLY, LUMPY, and TARDIS to call variants using the Illumina simulation, and Sniffles using the PacBio simulation (see Additional file 1: Table S1 for version numbers for tools and respective command lines). We limited our comparison to only large SVs (> 80 kbp for inversions, > 40 kbp for duplications (> 100 kbp for deletions and translocations), and we required > 50% reciprocal overlap between the simulation and the prediction for SVs using BEDtools [49]. We also require the inferred insertion breakpoint is within a distance of μ_molecule/2 (in simulation experiments μ_molecule = 50 kbp) of the simulated breakpoint to consider a duplication to be correctly predicted.

We present the prediction performance of the tools we tested in Table 1. We found that VALOR2 is able to correctly predict > 82% of large duplications (inverted and direct combined) and 60% of large inversions, while maintaining 84–86% precision for duplications and 83% precision for inversions. Long Ranger, the other algorithm that used linked-reads, demonstrated the same recall rate (60%) of the inversions with lower precision (73%).

Table 1 Prediction performance evaluation using simulated structural variants

Full size table

Of the WGS-based tools, Sniffles achieved the highest sensitivity for inversions owing to its use of long reads as it was able to correctly predict 80% of large inversions; however, it suffered from very low precision (11%). On the contrary, using only short reads, TARDIS achieved high precision (97%), but it was able to discover only 38% of the simulated inversions. This is likely because none of the WGS-based tools was optimized to find such large inversion events. VALOR2 showed a very good precision/recall balance with an F1 score of 0.70, but overall, combination of Long Ranger and VALOR2 performed the best in terms of precision/recall for inversions in the simulation experiment.

For large deletions, once again, Long Ranger and VALOR2 combination performed the best, but VALOR2 by itself was able to correctly predict 87% of the simulated variants with a high precision rate (91%). As expected, WGS-based tools (based on both short and long reads) achieved low precision (15 to 46%), although they performed well in terms of recall (78 to 85%).

Finally, the translocation simulation experiment proved VALOR2 to be the best single algorithm in terms of precision with no false-positive calls, with a good recall rate (71%). Only DELLY surpassed VALOR2 in recall (79%), but it suffered from a high number of false positives (26% precision). As in the other experiments, using both Long Ranger and VALOR2 achieved the best F1 score of 95%.

Size detection spectrum for structural variation

As we have described above, our simulation included SVs with different sizes, starting from 50 bp to 6 Mbp. To understand the detection power of using different sequencing technologies, we investigated the size distribution of the correctly identified deletions and inversions in the simulation (Fig. 4). We observe that the both short read-based (TARDIS, DELLY, LUMPY) and long read-based (Sniffles) tools tend to capture similarly sized and relatively shorter SVs compared to the linked-read based (Long Ranger, VALOR2) algorithms. Among the linked-read-based tools, VALOR2 captures larger SVs than Long Ranger, demonstrating its complementary use to Long Ranger, and short- and long-read WGS analysis.

Biological data sets

Next, we evaluated VALOR2 and compared it to a linked-read-based method (Long Ranger) and three WGS-based tools (DELLY, LUMPY, and TARDIS) using biological data sets. We obtained both linked-read and WGS data from the genomes of a parent-child trio from Yoruba population (NA19238, NA19239, NA19240) [50], one individual of Northern European descent (NA12878) [51], and two haploid genomes (CHM1 and CHM13). The details of the data sources are given in the “Availability of data and materials” section, and we provide large deletion, inversion, and translocation calls generated by VALOR2 in Additional file 1: Tables S2, S3, and S4, respectively. We used DELLY, LUMPY, and TARDIS to generate SV call sets using the WGS data and VALOR2 using the linked-read data on the human reference genome GRCh38. For the haploid genomes, we used VALOR2 in haploid-aware mode, where the read depth and split molecule support thresholds are adjusted accordingly. We obtained the publicly available Long Ranger calls: Yoruba trio call set is available from the Human Genome Structural Variation Consortium [50], and NA12878 call set is available in the European Nucleotide Archive (accession number PRJEB28297), published by Marks et al. [29]. We have run Long Ranger on the CHM1 and CHM13 genomes.

Table 2 summarizes the prediction results of large deletions, segmental duplications (SDs), translocations, and inversions. We note that TARDIS predicts only smaller SDs (< 10 kb), and Long Ranger, DELLY, and LUMPY do not differentiate between tandem and interspersed SDs. We therefore merged different types of SD predictions generated by VALOR2. We also compared our predictions with two different gold standard data sets. For deletions and duplications, we used the non-redundant data set in dbVar [52], and for inversions and translocations, we used gnomAD SV calls [53]. Since gnomAD call set was only available in GRCh37, we used the UCSC liftOver tool to convert the coordinates to GRCh38.

Table 2 Large structural variants found in biological data sets

Full size table

Note that in the absence of complete and curated large SVs that are experimentally validated for these biological data sets, we cannot calculate precision and recall rates. However, assuming the dbVar and gnomAD resources are gold standard, deletion predictions of VALOR2 include no false positives (Table 2). Long Ranger and TARDIS also show low number of false positives for deletions. For inversions, we found that 28 to 70% of VALOR2 calls intersect with previously identified inversions. Although Long Ranger calls intersected better with the gnomAD calls, it also predicted only a handful of inversions. As expected, WGS-based tools showed a higher ratio of likely false positives.

VALOR2 predicts only interspersed segmental duplications (SDs), where Long Ranger, LUMPY, and DELLY can detect only tandem SDs, and TARDIS can detect both, although new location of interspersed SDs should be < 50 kb away from the source. The SDs reported in dbVar are detected using read depth-based methods; therefore, there is no discrimination between interspersed and tandem. Therefore, dbVar only includes the coordinates of the “source copy” of the duplicated segments. We thus compared the source coordinates of our interspersed SD calls with dbVar and found that 43 to 67% of SDs predicted by VALOR2 were previously reported. Only Long Ranger achieved a higher intersection with known data, however with fewer predictions.

Finally, none of the translocation calls predicted by either tool intersects with the gnomAD call set. This is in fact on par with the literature, since no translocations are expected to occur in the germline genomes of healthy individuals as they often play roles in cancer development [54]. Therefore, any translocation predictions are either false positives or could be caused by cell line artifacts [55].

Functional consequence of predicted variants

A majority of predicted translocations and duplications span regions that do not contain gene coding sequences. This is unsurprising since a large amount of disruptive variants are not expected to be in normal genomes. However, VALOR2 did identify events that potentially affect protein coding genes. A large segmental duplication event at chr1:16,728,420–16,797,669 is present in 5 of the 6 genomes analyzed and found to overlap the CROCC gene which encodes a structural component of ciliary motility [56]. Another duplication event covering CLEC18B was also found in 3 of 6 genomes. The human C-type lectin 18 is expressed abundantly in various cell contexts in the body [57]. VALOR2 calls also revealed deletion polymorphisms, some of which have been previously characterized, in the human genome (Additional file 1: Table S2). Deletion of UGT2B17 and UGT2B28, genes involved in the metabolism of sex steroid hormones, as well as OR4F5 (olfactory receptor) were found in at least 3 genomes. These have been previously described as null mutations within the genome [58]. Similarly, only 3 inversion calls overlap protein coding regions in these genomes (Additional file 1: Table S3) though further validation is necessary to confirm functional effect of these SVs on these genes.

Discussion

Linked-read sequencing techniques emerged very recently and are still developing. Many groups are already realizing the power of these techniques for SV detection and phasing. For example, the InPSYght Consortium has sequenced a schizophrenia case/control cohort of 545 individuals using the 10x Genomics Chromium linked-read technology with the aim to study complex structural variants in a large cohort [59].

While we used the 10xG linked-read datasets to demonstrate the utility of our SV discovery methods, several other linked-read platforms are available. BGI has recently developed a single-tube long fragment read (stLFR) technology (https://www.bgi.com/global/sequencing-services/dna-sequencing/lfr-whole-genome-sequencing/, essentially a linked-read method. The stLFR linked-read technique produces reads longer than 10 kb [60] and BGI plans to make the technique their standard of sequencing in the near future. Several other linked-read platforms are becoming commercially available. In particular, TELL-Seq by Universal Sequencing Technologies (https://www.universalsequencing.com/ is also a recent single-tube linked-read method. TELL-Seq does not require a 10xG-like Chromium instrument and offers a simpler and cheaper library prep routine. Loop Genomics (https://www.loopgenomics.com/) is another developing linked-read method.

PacBio with the release of their Sequel II method and Oxford Nanopore with their newest PromethION have reduced the cost of long-read methods. While it is not prohibitively expensive anymore to generate long reads, the error rate is still much higher compared to short reads and linked-reads. Moreover, long-read protocols cannot be utilized with very low input DNA (e.g., less than 10 ng), which makes ultra-low input linked-read method a very attractive alternative.

In this work, we presented novel algorithms to effectively utilize the encoded long-range information in linked-read data for the purpose of characterizing largescale structural variations. The current state of the art SV detection techniques using linked-read such as Long Ranger or GROC-SVs is optimized for certain range of SV sizes. For example, GROC-SVs achieves the best sensitivity for events in the range of (30–100 kb). However, our technique, VALOR2, can detect events of a size larger than 100 kb, including segmental duplications and translocations. We also demonstrated that VALOR2 is a complementary approach to Long Ranger, and both short and long read-based WGS-based tools for deletion and inversion discovery (Fig. 4). Through simulations, we also showed that VALOR2 is a powerful tool for discovering interspersed segmental duplications and translocations, two of the most difficult and neglected forms of structural variation [13].

A future direction for our study is to integrate additional techniques such as local assembly to characterize smaller-scale SVs (i.e., starting from only 50 bp) and to resolve SV breakpoints more precisely by integrating split reads and local assembly. Local assembly was recently used for detection and assembly of novel sequence insertions using linked-reads [61]. Single-molecule sequencing techniques such as PacBio and Oxford Nanopore (ONT) and long-range genome mapping techniques at single-molecule resolution such as Bionano Genomics are becoming more developed and cost effective. We can explore single-molecule techniques not only for the purpose of further validation of our algorithms but also for devising integrative computational techniques to fully resolve the complexity of repetitive DNA common in mammalian genomes.

Notes

For example, 30 × sequence coverage corresponds to 150 × physical coverage when molecule coverage is only 0.2 ×.
Note that the empirical value of μ_molecule is calculated after the molecule recovery step. Therefore, here, we use μ_molecule as the expected value and set to 40,000 by default (can be changed by the user).

References

Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat Rev Genet. 2011; 12:363–76.
Article CAS PubMed PubMed Central Google Scholar
Marques-Bonet T, et al.A burst of segmental duplications in the genome of the African great ape ancestor. Nature. 2009; 457:877–81.
Article CAS PubMed PubMed Central Google Scholar
Prado-Martinez J, et al.Great ape genetic diversity and population history. Nature. 2013; 499:471–5.
Article CAS PubMed PubMed Central Google Scholar
Stankiewicz P, Lupski JR. Structural variation in the human genome and its role in disease. Annu Rev Med. 2010; 61:437–55.
Article CAS PubMed Google Scholar
Eichler EE, et al.Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010; 11:446–50.
Article CAS PubMed PubMed Central Google Scholar
Korbel JO, et al.Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007; 318:420–6.
Article CAS PubMed PubMed Central Google Scholar
Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res. 2009; 19:1270–8.
Article CAS PubMed PubMed Central Google Scholar
Medvedev P, Stanciu M, Brudno M. Computational methods for discovering structural variation with next-generation sequencing. Nat Methods. 2009; 6(11 Suppl):13–20.
Article CAS Google Scholar
Sindi S, Helman E, Bashir A, Raphael BJ. A geometric approach for classification and comparison of structural variants. Bioinformatics. 2009; 25:222–30.
Article CAS Google Scholar
Hajirasouliha I, et al.Detection and characterization of novel sequence insertions using paired-end next-generation sequencing. Bioinformatics. 2010; 26:1277–83.
Article CAS PubMed PubMed Central Google Scholar
Layer RM, Chiang C, Quinlan AR, Hall IM. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 2014; 15:84.
Article Google Scholar
Soylev A, Kockan C, Hormozdiari F, Alkan C. Toolkit for automated and rapid discovery of structural variants. Methods. 2017; 129:3–7.
Article CAS PubMed Google Scholar
Ho SS, Urban AE, Mills RE. Structural variation in the sequencing era. Nature Rev Genet. 2019. https://doi.org/10.1038/s41576-019-0180-9.
Kidd JM, et al.Mapping and sequencing of structural variation from eight human genomes. Nature. 2008; 453:56–64.
Article CAS PubMed PubMed Central Google Scholar
English AC, Salerno WJ, Reid JG. PBHoney: identifying genomic variants via long-read discordance and interrupted mapping. BMC Bioinformatics. 2014; 15:180.
Article CAS PubMed PubMed Central Google Scholar
Jain M, et al.Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotech. 2018; 36:338–45.
Article CAS Google Scholar
Ritz A, et al.Characterization of structural variants with single molecule and hybrid sequencing approaches. Bioinformatics. 2014; 30:3458–66.
Article CAS PubMed PubMed Central Google Scholar
Mostovoy Y, et al.A hybrid approach for de novo human genome sequence assembly and phasing. Nat Methods. 2016; 13:587–90.
Article CAS PubMed PubMed Central Google Scholar
Xia LC, et al.Identification of large rearrangements in cancer genomes with barcode linked reads. Nucleic Acids Res. 2018; 46:e19.
Article CAS PubMed Google Scholar
Yeo S, Coombe L, Warren RL, Chu J, Birol I. ARCS: scaffolding genome drafts with linked reads. Bioinformatics. 2018; 34:725–31.
Article CAS PubMed Google Scholar
Seo J-S, et al.De novo assembly and phasing of a Korean human genome. Nature. 2016; 538:243–7.
Article CAS PubMed Google Scholar
Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB. Direct determination of diploid genome sequences. Genome Res. 2017; 27:757–67.
Article CAS PubMed PubMed Central Google Scholar
Danko DC, Meleshko D, Bezdan D, Mason C, Hajirasouliha I. Minerva: an alignment and reference free approach to deconvolve linked-reads for metagenomics. Genome Res. 2019; 29:116–24.
Article CAS PubMed PubMed Central Google Scholar
Skelly DA, et al.Single-cell transcriptional profiling reveals cellular diversity and intercommunication in the mouse heart. Cell Rep. 2018; 22:600–10.
Article CAS PubMed Google Scholar
Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018; 19:15.
Article PubMed PubMed Central Google Scholar
Aibar S, et al.SCENIC: single-cell regulatory network inference and clustering. Nat Methods. 2017; 14:1083–6.
Article CAS PubMed PubMed Central Google Scholar
Zheng GXY, et al.Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotech. 2016; 34:303–11.
Article CAS Google Scholar
Eslami Rasekh M, et al.Discovery of large genomic inversions using long range information. BMC Genomics. 2017; 18:65.
Article PubMed PubMed Central Google Scholar
Marks P, et al.Resolving the full spectrum of human genome variation using linked-reads,. Genome Res. 2019; 29:635–45.
Article CAS PubMed PubMed Central Google Scholar
Spies N, et al.Genome-wide reconstruction of complex structural variants using read clouds. Nat Methods. 2017; 14:915–20.
Article CAS PubMed PubMed Central Google Scholar
Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009; 25:2865–71.
Article CAS PubMed PubMed Central Google Scholar
Tuzun E, et al.Fine-scale structural variation of the human genome. Nat Genet. 2005; 37:727–32.
Article CAS PubMed Google Scholar
Alkan C, et al.Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet. 2009; 41:1061–s7.
Article CAS PubMed PubMed Central Google Scholar
Sudmant PH, et al.Diversity of human copy number variation and multicopy genes. Science. 2010; 330:641–6.
Article CAS PubMed PubMed Central Google Scholar
Soylev A, Le TM, Amini H, Alkan C, Hormozdiari F. Discovery of tandem and interspersed segmental duplications using high-throughput sequencing. Bioinformatics. 2019; 35:3923–30.
Article PubMed PubMed Central Google Scholar
Miga KH, et al.Telomere-to-telomere assembly of a complete human x chromosome. bioRxiv. 2019. https://doi.org/10.1101/735928.
Antonacci F, et al.Characterization of six human disease-associated inversion polymorphisms. Hum Mol Genet. 2009; 18:2555–66.
Article CAS PubMed PubMed Central Google Scholar
Brunato M, Hoos HH, Battiti R. On effectively finding maximal quasi-cliques in graphs In: Maniezzo V, Battiti R, Watson J-P, editors. LION 2007 II, LNCS 5313. Berlin, Heidelberg: Springer: 2008. p. 41–55.
Google Scholar
Bailey JA, et al.Recent segmental duplications in the human genome. Science. 2002; 297:1003–7.
Article CAS PubMed Google Scholar
Rausch T, et al.DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012; 28:333–9.
Article CAS Google Scholar
Sedlazeck FJ, et al.Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018; 15:461–8.
Article CAS PubMed PubMed Central Google Scholar
Mu JC, et al.VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications. Bioinformatics. 2015; 31:1469–71.
Article CAS PubMed Google Scholar
Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012; 28:593–4.
Article CAS PubMed Google Scholar
Ono Y, Asai K, Hamada M. PBSIM: PacBio reads simulator—toward accurate genome assembly. Bioinformatics. 2013; 29:119–21.
Article CAS PubMed Google Scholar
Luo R, Sedlazeck FJ, Darby CA, Kelly SM, Schatz MC. LRSim: a linked-reads simulator generating insights for better genome partitioning. Comput Struct Biotech J. 2017; 15:478–84.
Article CAS Google Scholar
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013. https://arxiv.org/abs/1303.3997. Accessed 30 July 2019.
Li H, et al.The sequence alignment/map format and SAMtools. Bioinformatics. 2009; 25:2078–9.
Article CAS PubMed PubMed Central Google Scholar
Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P. Sambamba: fast processing of ngs alignment formats. Bioinformatics. 2015; 31:2032–4.
Article CAS PubMed PubMed Central Google Scholar
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010; 26:841–2.
CAS PubMed PubMed Central Google Scholar
Chaisson MJP, et al.Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Comm. 2019; 10:1784.
Article CAS Google Scholar
Eberle MA, et al.A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 2017; 27:157–64.
Article CAS PubMed PubMed Central Google Scholar
Lappalainen I, et al.dbVar and DGVa: public archives for genomic structural variation. Nucleic Acids Res. 2013; 41:936–41.
Article CAS Google Scholar
Karczewski KJ, et al.Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. bioRxiv. 2019. https://doi.org/10.1101/531210.
Rowley JD. Chromosome translocations: dangerous liaisons revisited. Nat Rev Cancer. 2001; 1:245–50.
Article CAS PubMed Google Scholar
Kaur G, Dufour JM. Cell lines: valuable tools or useless artifacts. Spermatogenesis. 2012; 2:1–5.
Article PubMed PubMed Central Google Scholar
Bahe S, et al.Rootletin forms centriole-associated filaments and functions in centrosome cohesion. J Cell Biol. 2005; 171:27–33.
Article CAS PubMed PubMed Central Google Scholar
Huang Y, et al.Human CLEC18 gene cluster contains C-type lectins with differential glycan-binding specificity. J Biol Chem. 2015; 290:21252–63.
Article CAS PubMed PubMed Central Google Scholar
Mccarroll S, et al., The International HapMap Consortium. Common deletion polymorphisms in the human genome. Nat Genet. 2006; 38:86–92.
Whelan CW, et al.Detecting inversion polymorphisms at population scale with linked read sequencing. In: ASHG Meeting: 2018. https://eventpilot.us/web/page.php?page=IntHtml&project=ASHG18&id=180123430.
McElwain MA, Zhang RY, Drmanac R, Peters BA. Long fragment read (LFR) technology: cost-effective, high-quality genome-wide molecular haplotyping. Methods Mol Biol. 2017; 1551:191–205.
Article CAS PubMed Google Scholar
Meleshko D, et al.Detection and assembly of novel sequence insertions using linked-read technology. bioRxiv. 2019. https://doi.org/10.1101/551028.
Karaoglanoglu F, et al.VALOR2: characterization of large-scale structural variants using linked-reads. GitHub. 2020. https://github.com/BilkentCompGen/valor. Accessed 5 May 2019.
Karaoglanoglu F, et al.VALOR2: characterization of large-scale structural variants using linked-reads. DockerHub. 2020. https://hub.docker.com/r/alkanlab/valor. Accessed 10 Dec 2019.
Karaoglanoglu F, et al.VALOR2: characterization of large-scale structural variants using linked-reads. Zenodo. 2020. https://doi.org/10.5281/zenodo.3380054. Accessed 10 Jan 2020.
Marks P, et al.Resolving the full spectrum of human genome variation using linked-reads. EBI ENA. 2019. https://www.ebi.ac.uk/ena/data/view/PRJEB28297. Accesssed 7 May 2019.
Eberle MA, et al.A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. https://www.illumina.com/platinumgenomes.html. Accessed 7 May 2019.
Chaisson MJP, et al.Multi-platform discovery of haplotype-resolved structural variation in human genomes. NCBI FTP. 2019. http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/data/YRI/. Accessed 9 Mar 2019.
Chaisson MJP, et al.Multi-platform discovery of haplotype-resolved structural variation in human genomes. NCBI SRA. 2019. https://www.ncbi.nlm.nih.gov/sra/PRJNA477862. Accessed 9 Mar 2019.
Marks P, et al.Linked-read whole genome sequencing of CHM1. https://support.10xgenomics.com/de-novo-assembly/datasets/2.0.0/chm. Accessed 5 May 2019.
Miga KH, et al.Telomere-to-telomere assembly of a complete human X chromosome. GitHub. 2019. https://github.com/nanopore-wgs-consortium/CHM13. Accessed 12 Jun 2019.

Download references

Acknowledgements

We thank H. İ. Özercan, A. Söylev, and D. Meleshko for the computational support. We also thank M. Chaisson for early access to the HGSV data.

Review history

The review history is available as Additional file 2.

Funding

This work was supported by a grant by TÜBİTAK (215E172) and an EMBO Installation Grant (IG-2521) to CA. This work was also supported by start-up funds (Weill Cornell Medicine) and a National Science Foundation (NSF) grant under award number IIS-1840275 to IH. CR received support from the Tri-Institutional Training Program in Computational Biology and Medicine (via NIH training grant 1T32GM083937). The authors also acknowledge the Computational Genomics Summer Institute funded by NIH grant GM112625 that fostered the international collaboration among the groups involved in this project.

Author information

Fatih Karaoglanoglu and Camir Ricketts contributed equally to this work.

Authors and Affiliations

Department of Computer Engineering, Bilkent University, Ankara, 06800, Turkey
Fatih Karaoğlanoğlu, Ezgi Ebren & Can Alkan
Tri-Institutional Computational Biology & Medicine Program, Cornell University, 1300 York Ave, New York, 10065, NY, USA
Camir Ricketts
Graduate Program in Bioinformatics, Boston University, 24 Cummington Mall, Boston, 02215, MA, USA
Marzieh Eslami Rasekh
Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medicine, 1300 York Ave, New York, 10065, NY, USA
Camir Ricketts & Iman Hajirasouliha
Englander Institute for Precision Medicine, The Meyer Cancer Center, Weill Cornell Medicine, 1300 York Ave, New York, 10065, NY, USA
Iman Hajirasouliha
Bilkent-Hacettepe Health Sciences and Technologies Program, Bilkent University, Ankara, 06800, Turkey
Can Alkan

Authors

Fatih Karaoğlanoğlu
View author publications
You can also search for this author in PubMed Google Scholar
Camir Ricketts
View author publications
You can also search for this author in PubMed Google Scholar
Ezgi Ebren
View author publications
You can also search for this author in PubMed Google Scholar
Marzieh Eslami Rasekh
View author publications
You can also search for this author in PubMed Google Scholar
Iman Hajirasouliha
View author publications
You can also search for this author in PubMed Google Scholar
Can Alkan
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Authors’ contributions

Fatih Karaoğlanoğlu and Camir Ricketts contributed equally to this work. IH and CA conceived the concept and initiated and supervised the project. FK, MEH, IH, and CA designed the VALOR2 algorithm. FK implemented the VALOR2. FK, CR, EE, and MEH evaluated VALOR2’s performance and carried out the analysis of the results. All authors contributed to the writing and read and approved the final manuscript.

Authors’ information

Twitter handles: @fkaraoglan_cs (Fatih Karaoğlanoğlu), @CamirRicketts (Camir Ricketts), @ezgiebren (Ezgi Ebren), @mzrasekh (Marzieh Eslami Rasekh), @hajirasouliha (Iman Hajirasouliha), and @calkan_cs (Can Alkan).

Corresponding authors

Correspondence to Iman Hajirasouliha or Can Alkan.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Iman Hajirasouliha and Can Alkan are co-senior authors.

Supplementary information

Additional file 1

Algorithm S1, Figure S1-S2, Table S1-S5.

Additional file 2

Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Karaoğlanoğlu, F., Ricketts, C., Ebren, E. et al. VALOR2: characterization of large-scale structural variants using linked-reads. Genome Biol 21, 72 (2020). https://doi.org/10.1186/s13059-020-01975-8

Download citation

Received: 17 December 2019
Accepted: 24 February 2020
Published: 19 March 2020
DOI: https://doi.org/10.1186/s13059-020-01975-8

VALOR2: characterization of large-scale structural variants using linked-reads

Abstract

Background

Methods

Glossary

Overview of the VALOR2 algorithm

Molecule recovery

Candidate split matching

Candidate splits for translocations

Clustering using SV graph

Molecule depth filtering

Results

Simulation experiments

Size detection spectrum for structural variation

Biological data sets

Functional consequence of predicted variants

Discussion

Notes

References

Acknowledgements

Review history

Funding

Author information

Authors and Affiliations

Contributions

Authors’ contributions

Authors’ information

Corresponding authors

Ethics declarations

Ethics approval and consent to participate

Competing interests

Additional information

Publisher’s Note

Supplementary information

Additional file 1

Additional file 2

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Genome Biology

Contact us