Long Read based Human Genomic Structural Variation Detection with cuteSV

Long-read sequencing technologies enable the comprehensive discovery of structural variations (SVs). However, it is still non-trivial for state-of-the-art approaches to detect SVs with high sensitivity, high performance, or both. Herein, we propose cuteSV, a sensitive, fast and lightweight SV detection approach. cuteSV uses tailored methods to comprehensively collect various types of SV signatures, and a clustering-and-refinement method to implement stepwise SV detection, which achieves high sensitivity without loss of accuracy. Benchmark results demonstrate that cuteSV has better yields on real datasets. Further, its speed and scalability are outstanding and promising for large-scale data analysis. cuteSV is available at https://github.com/tjiangHIT/cuteSV.

sample considering the lower demands on sequencing coverage. We anticipate that the widespread use of cuteSV will promote the vigorous development of genome research.

cuteSV identifies SVs in four major steps (a schematic illustration is shown in Figure 1):

(1) Mainstream TGS-based aligners are employed to perform long-read mapping, and Samtools [35] is used to carry out format conversion, coordinate sorting and index building of the alignments.

(2) cuteSV comprehensively identifies and collects various types of potential SV signatures from spatial features, such as coordinates, orientations and chromosomes, in intra- and inter-alignments.

(3) cuteSV completes a stepwise SV detection through clustering and refinement of SV signatures in each read to identify and screen possible candidates.
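Step (2) can be illustrated for the intra-alignment case, where the Methods state that a deletion or insertion over 30 bp found in the CIGAR field is recorded as a signature. The following is a minimal sketch under stated assumptions, not the cuteSV implementation: the `Signature` tuple, the function name `cigar_signatures`, and the use of `>=` for the 30 bp cutoff are all hypothetical choices made for illustration.

```python
import re
from typing import List, NamedTuple


class Signature(NamedTuple):
    """Hypothetical signature record; the field layout is an assumption."""
    sv_type: str  # "DEL" or "INS"
    ref_pos: int  # 0-based reference position where the event starts
    length: int   # event length in bp


# One CIGAR operation: a run length followed by a SAM operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_signatures(ref_start: int, cigar: str, min_size: int = 30) -> List[Signature]:
    """Collect large insertions and deletions from a single CIGAR string."""
    signatures = []
    ref_pos = ref_start
    for length_text, op in _CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "D" and length >= min_size:
            signatures.append(Signature("DEL", ref_pos, length))
        elif op == "I" and length >= min_size:
            signatures.append(Signature("INS", ref_pos, length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref_pos += length
    return signatures


example = cigar_signatures(1000, "100M40D50M35I20M5D10M")
```

In this example the 40 bp deletion and the 35 bp insertion are kept, while the 5 bp deletion is ignored as too small to count as an SV signature.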
respectively. Overall, cuteSV and PBSV had comparable power in SV detection on HiFi data and were one step ahead of Sniffles and SVIM.
The GIAB evidence callsets help construct a baseline of assessment, which describes the performance. To assess Mendelian consistency, we considered the percentage of structural variants present in the offspring that did not appear in either biological parent. To this end, we used the well-studied GIAB Ashkenazi Trio [36] (see Figure 2C). cuteSV had the lowest MDRs for almost all the SV types with

Moreover, we evaluated the runtime and memory consumption using the 69× HG002 CLR dataset to characterize computational performance (see Figure 2D and Figure 2E). For runtime in a single thread, Sniffles was the fastest (367 minutes), followed by SVIM (423 minutes), cuteSV (983 minutes), and PBSV (1715 minutes). When using multiple threads, the runtime of cuteSV decreased significantly as the number of threads increased.

Notably, cuteSV took less than an hour and a half for SV detection with 16 threads. For memory usage, state-of-the-art methods occupied roughly 9 GB to 18 GB, i.e. 17.61 GB for Sniffles, 10.57 GB for PBSV, and 9.41 GB for SVIM. In contrast, cuteSV used only 0.35 GB regardless of the number of threads in use; that is, cuteSV saved 96% to 98% of memory and was a memory-efficient method.

Figure 3B and Supplementary Table 2). For insertions and deletions shorter than 1000 bp, there was no significant distinction between the two employed aligners, since the FP calls differed only slightly between them (only 31 different FPs; i.e. 2 on cuteSV, 27 on Sniffles, and 13 on PBSV, respectively). However, 939 more FP calls on SVIM with NGMLR were found, which decreased the precision dramatically.
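As a check on the runtime and memory figures reported for the 69× HG002 CLR dataset, the arithmetic behind the "96% to 98%" memory saving can be reproduced directly from the stated values. The sketch below only uses numbers quoted in the text; the function name `memory_saving` is hypothetical, and the 16-thread speedup is a lower bound because the text gives only "less than an hour and a half".

```python
def memory_saving(cutesv_gb: float, other_gb: float) -> float:
    """Percentage of memory saved by cuteSV relative to another caller."""
    return 100.0 * (1.0 - cutesv_gb / other_gb)


# Peak memory values as reported in the text (GB).
peak_memory_gb = {"Sniffles": 17.61, "PBSV": 10.57, "SVIM": 9.41}
cutesv_gb = 0.35

savings = {tool: memory_saving(cutesv_gb, gb) for tool, gb in peak_memory_gb.items()}
for tool, pct in savings.items():
    print(f"{tool}: cuteSV uses {pct:.1f}% less memory")

# Single-thread runtime was 983 minutes; with 16 threads it was under 90 minutes,
# so the speedup is at least this ratio.
min_speedup_16_threads = 983 / 90
print(f"speedup with 16 threads: more than {min_speedup_16_threads:.1f}x")
```

The three savings fall between roughly 96.3% and 98.0%, consistent with the range stated in the text.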

In view of Mendelian consistency, we also compared the effect of aligner selection on MDR (see Figure 3C). For all approaches, using NGMLR resulted in significantly lower MDRs for deletions and translocations, with reductions of 0.35% to 7.52%. For the combination of insertions and duplications, and for inversions, we could not find a significant bias toward either aligner. In terms of total calls, NGMLR helped cuteSV and PBSV generate variants with higher Mendelian consistency, with a 2% to 3% MDR reduction. However, PBMM2 was a better choice for SVIM, given its MDR reduction of over 15%. Sniffles was compatible with both mappers because there was no obvious difference between them.
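The MDR used in these comparisons is described in the text as the percentage of SVs present in the offspring that appear in neither biological parent. A minimal sketch of that calculation follows, assuming that SV calls from the three samples have already been matched to shared keys; real SV matching tolerates breakpoint and size differences, which this simplification ignores. The function name and the example keys are hypothetical.

```python
from typing import Hashable, Set


def mendelian_discordance_rate(
    child: Set[Hashable], father: Set[Hashable], mother: Set[Hashable]
) -> float:
    """Percentage of child SV calls found in neither parent."""
    if not child:
        return 0.0
    discordant = child - (father | mother)
    return 100.0 * len(discordant) / len(child)


# Illustrative keys only: (chromosome, position, type); not data from the study.
child_calls = {("chr1", 1000, "DEL"), ("chr1", 5000, "INS"),
               ("chr2", 300, "DEL"), ("chr3", 42, "INV")}
father_calls = {("chr1", 1000, "DEL"), ("chr1", 5000, "INS")}
mother_calls = {("chr1", 5000, "INS"), ("chr2", 300, "DEL")}

mdr = mendelian_discordance_rate(child_calls, father_calls, mother_calls)
```

Here one of the four child calls is absent from both parents, so the rate is 25%.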

To determine the influence of parameter settings and provide advice on using cuteSV in practice, we assessed the performance of variant calling under two pivotal parameter settings, i.e. -min_support (-s) and -min_size (-l), on multiple coverages of HG002 CLR datasets.

The -min_support parameter indicates the minimal number of reads supporting an SV for it to be reported. From the results shown in Figure 4A and Supplementary Table 3, keeping -min_size constant (with 30 bp as the default), cuteSV achieved the best performance when -s was set from 1 to 10 at 5×, 10×, 20×, 30×, 40×, and 69× coverage, respectively. A clear trade-off between precision and recall was also observed for the -min_support setting: a lower value increased sensitivity but could decrease precision, and vice versa.

The other parameter, -min_size, is the minimal size of SV signatures considered in clustering. We compared the performance of two -min_size settings (i.e. -l 30 and -l 50) under the same -min_support (see Figure 4B). At various coverages, cuteSV achieved 0.38% to 1.71% higher accuracy using -min_size 50, while 0.06% to 2.19% higher recall was observed with -min_size 30. Hence, another obvious trade-off between precision and recall was implied: a smaller -min_size improved sensitivity and lowered accuracy, and vice versa. Nevertheless, the F-measure scores were close to each other at each depth, with a change of less than 1%.
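The trade-off described for both parameters can be made concrete with the standard definitions of precision, recall and F-measure. The sketch below uses these textbook formulas; the counts are purely illustrative and are not taken from the benchmarks in this study.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only: a permissive setting finds more true SVs but also
# more false positives; a strict setting does the opposite.
permissive = precision_recall_f1(tp=95, fp=20, fn=5)
strict = precision_recall_f1(tp=85, fp=5, fn=15)
```

With these made-up counts, the permissive setting has higher recall and the strict setting has higher precision, while the two F1 scores remain close, mirroring the pattern reported above.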

57.66%). PBSV, Sniffles, and SVIM were behind cuteSV, in turn. One point that could not be ignored was the relatively poor performance on inversions, which was less than 20% in F1. It was also noted that the evaluation results on NA19240 were far below those on HG002. This was mainly due to the relatively low quality of the ground truth and the lack of high-confidence regions for this sample for a highly accurate assessment. Nevertheless, cuteSV was still a promising approach owing to its superior sensitivities and F1 scores.

SVs yielded together on different sequencing data, 94.86% of calls in the ground truth were found by both datasets, and 3.10% of calls in the ground truth could be detected through only one dataset. Only 2.04% of calls in the ground truth could not be identified. We could confirm the finding in [30] that the vast majority (97%) of the ground truth could be identified by PacBio or ONT sequencing technologies.

Collection and interpretation of evidence from long-read alignments is a promising strategy for detecting high-quality SVs since it scans the signatures of fragment changes and constructs putative SV candidates. However, under the circumstances of complex SV events and high sequencing error rates, this task is non-trivial in practice.

(1) For some regions containing multiple types of SV, it is inefficient to restore a specific class of SV without considering the signature category, and doing so might sometimes generate erroneous predictions. (2) Similarly, for some regions containing multiple SV alleles, it is too hard to figure out the correspondence between the real allele and its supporting evidence, due to a large number of less varied signatures mixing in a limited genomic region. (3) Moreover, owing to the high sequencing errors, quite a lot of mistakes

All the above bottlenecks could decrease the performance of sensitive SV detection. To break through these challenges, we propose a sensitive, fast and lightweight computational method, cuteSV, for sensitively and robustly detecting SVs using long reads. The most critical factor of cuteSV in guaranteeing high sensitivity is its stepwise-refinement strategy on clustering through comprehensive signatures and a presupposed polymorphic alleles model.

that coming from complex SV, might be split into two parts or more. It is a challenging task to extract and reconstruct the continuous and complete variant-spanning fragment without using local assembly. cuteSV integrates signatures of the same type, which are distributed densely in a local area, to recover and enrich the real evidence, which might effectively enhance the discovery of SVs in complex sequence regions of the genome.

(2) In genome regions with multiple non-reference alleles, various allele-supporting signatures and errors, which share a similar breakpoint, are interwoven in a regional cluster. Some real SV alleles with fewer supporting signatures might be discarded due to insufficient evidence, especially for low-coverage sequencing data. To this end, cuteSV proposes a presumed polymorphic alleles model for identifying all potential non-reference alleles by setting a specific ratio of size partitioning to re-cluster each signature and purify the consistent features. This approach can further improve the sensitivity of recognizing SVs with multiple non-reference alleles.

cuteSV also has prominent computational performance owing to its block analysis design. One great benefit is that the maximum memory usage is kept under control by setting the amount of data in process. Another advantage is that cuteSV can perform signature extraction and clustering in parallel efficiently, owing to the independence of the sequencing data. Together, these two benefits make it feasible for cuteSV to perform SV calling even on a desktop computer.
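The block analysis idea can be sketched as follows: the reference is divided into fixed-size blocks, each block is processed independently, and only one block's worth of data per worker needs to be held at a time. This is an illustrative sketch, not the cuteSV implementation. The function names and the block size are hypothetical, the per-block work is reduced to counting signature positions, and a thread pool is used only to keep the example short and self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List, Tuple

Block = Tuple[int, int]  # half-open interval [start, end)


def make_blocks(chrom_length: int, block_size: int) -> Iterator[Block]:
    """Split a chromosome into consecutive fixed-size blocks."""
    for start in range(0, chrom_length, block_size):
        yield start, min(start + block_size, chrom_length)


def count_signatures_in_block(positions: List[int], block: Block) -> int:
    """Stand-in for per-block signature extraction: count positions in the block."""
    start, end = block
    return sum(1 for pos in positions if start <= pos < end)


def process_in_blocks(
    positions: List[int], chrom_length: int, block_size: int, threads: int = 4
) -> Dict[Block, int]:
    """Process every block independently and collect the per-block results."""
    blocks = list(make_blocks(chrom_length, block_size))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        counts = list(pool.map(lambda b: count_signatures_in_block(positions, b), blocks))
    return dict(zip(blocks, counts))


result = process_in_blocks([5, 15, 25, 26, 99], chrom_length=100, block_size=25)
```

Because blocks do not depend on each other, the block size bounds the data held per worker, and the number of workers can be raised without changing the result.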

With the assistance of high-confidence callsets released by several previous studies [37,41], both sensitivity and specificity are assessed objectively on state-of-the-art SV callers and cuteSV. In most of the datasets, cuteSV discovers more SV calls consistent with the ground truth without loss of accuracy and achieves better F-measure scores. Meanwhile, this advantage is significant regardless of the platform and SV type. Besides, cuteSV has better Mendelian consistency between offspring and parents than other methods. These results suggest that cuteSV has a promising capability to identify genomic SVs sensitively and efficiently.

Another important feature of cuteSV is that it can identify SVs with high sensitivity and accuracy on low-coverage datasets. TGS has been in development for more than a decade; however, it is still more expensive than NGS. cuteSV can improve SV detection on low-coverage data by refining signatures in each read and by self-adapting clustering, which ensures high specificity and sensitivity. It is certainly a great improvement which might save considerable sequencing expense.

Although cuteSV is a promising SV caller with high sensitivity, there are still some SV calls in high-confidence callsets that cannot be detected (see Supplementary Figures 1 and 2). From these two examples, we find that the false-negative calls are produced mainly due to the lack of corresponding evidence that supports the ground truth. More specifically, there is a huge difference in size or breakpoint of SV signatures from alignments, which causes cuteSV to generate absurd SV calls unexpectedly. These FN calls also occur in Sniffles, PBSV, and SVIM. So, these could be plausible as well even

generation of the consensus sequence of a specific allele and further optimization of the genotype assignment model. We will continue to refine these new functionalities and carry out progressively in

Long-read sequencing technologies provide us with an unprecedented view of genetic variation in the

In this paper, we provide a sensitive, fast and lightweight approach to discover high-quality SVs with long sequencing reads. It employs a stepwise refinement clustering algorithm to process the comprehensive signatures from inter- and intra-alignments, and constructs and screens all possible alleles, thus completing high-quality SV calling. Benchmark results show that cuteSV achieves superior sensitivity, accuracy, and power compared to existing methods even on low-coverage sequencing data. We expect cuteSV to bring great potential for advancing genome research.

As a sensitive SV caller, cuteSV infers SV candidates mainly from the signatures implied in intra- and inter-alignments. We investigate state-of-the-art aligners, which provide high-quality nucleotide-level alignment and abundant supplementary alignments, to perform long-read mapping.

After long-read mapping, Samtools is used to perform alignment format conversion, coordinate sorting and index building. The coordinate-sorted BAM file is then input into cuteSV to analyze SVs.
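The three Samtools steps named above (format conversion, coordinate sorting, index building) correspond to the `samtools view`, `samtools sort` and `samtools index` subcommands. The sketch below only assembles the command lines and does not run them; the helper name `samtools_commands` and the file names are hypothetical, and the exact options used in this study are not stated in the text.

```python
from typing import List


def samtools_commands(sam_path: str, prefix: str, threads: int = 1) -> List[List[str]]:
    """Build the conversion, sorting and indexing commands for one alignment file."""
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        # SAM -> BAM conversion
        ["samtools", "view", "-b", "-@", str(threads), "-o", bam, sam_path],
        # Coordinate sorting (the default sort order of samtools sort)
        ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, bam],
        # Index building on the sorted BAM
        ["samtools", "index", sorted_bam],
    ]


commands = samtools_commands("aln.sam", "aln", threads=4)
for command in commands:
    print(" ".join(command))
```

Each command list could be passed to `subprocess.run(command, check=True)` on a system where Samtools is installed; the resulting coordinate-sorted, indexed BAM is the input expected by cuteSV.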

Discovering SV signatures

cuteSV discovers signatures mainly from two types of alignment, i.e. intra-alignment and inter-alignment. For intra-alignment, if a deletion or insertion over 30 bp in length is found in the CIGAR field, a corresponding SV signature will be recorded as a triple, such as

(1) If two adjacent segments on the read level are aligned to the same chromosome and share the same orientation, cuteSV will compare a difference value, i.e. the distance on the genome level minus the distance on the read level, which is

Similarly, when < 30 and ≤ −30, which means a redundant sub-read with no mapping in a local reference, cuteSV recognizes an insertion signature as

(2) If two adjacent segments on the read level are mapped to the same position on the genome level

(5) For some more complicated events, e.g. a read with three segments or more and containing a

where the two positions are the breakpoints or breakends of the two signatures, and the threshold controls the compactness of clustering on different SV types. In general, the threshold is 50 to 500 bp.
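The clustering condition described here, where two signatures belong together when their breakpoints lie within a type-specific threshold, can be sketched as a single-linkage pass over sorted positions. This is an assumption-laden illustration, since the original formula is not recoverable from the text: the function name, the parameter `max_gap`, and the choice to compare each signature with the last member of the current cluster are all hypothetical.

```python
from typing import List


def cluster_by_breakpoint(positions: List[int], max_gap: int) -> List[List[int]]:
    """Group sorted breakpoints so that consecutive members differ by at most max_gap."""
    clusters: List[List[int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)  # close enough to the current cluster
        else:
            clusters.append([pos])    # start a new cluster
    return clusters


# Threshold chosen from the 50 to 500 bp range mentioned in the text.
clusters = cluster_by_breakpoint([100, 160, 130, 950, 900, 5000], max_gap=50)
```

A smaller threshold yields tighter clusters, which is how the compactness of clustering can be tuned per SV type.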

After the first round of clustering above, cuteSV filters each candidate SV group based on its number of signal reads. Groups with too few supporting reads, which might result from sequencing errors, are discarded. For the remaining clusters, cuteSV runs a new round of clustering to refine variants and each of their potential SV alleles as below:

(1) For deletions and insertions, we introduce a length parameter, defined as a specific percentage of the mean variant size of the signatures, which is

where the terms are the percentage parameter, a given cluster, and the variant size of a signature in that cluster. cuteSV clusters the signatures in each given cluster again, using this length as a standard to divide the various SV alleles.
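The second-round clustering can be sketched from the description alone: a length equal to a given percentage of the mean signature size in a cluster is used as the tolerance for splitting that cluster into size-consistent subclusters, each a candidate allele. The names `ratio` and `length_bias`, the value 0.3, and the single-linkage splitting rule are assumptions made for illustration; the source formula and its symbols are not recoverable here.

```python
from typing import List


def split_by_size(sizes: List[int], ratio: float) -> List[List[int]]:
    """Split one cluster's signature sizes into size-consistent subclusters."""
    if not sizes:
        return []
    # Tolerance: a fixed percentage of the mean variant size in the cluster.
    length_bias = ratio * sum(sizes) / len(sizes)
    ordered = sorted(sizes)
    groups = [[ordered[0]]]
    for size in ordered[1:]:
        if size - groups[-1][-1] <= length_bias:
            groups[-1].append(size)
        else:
            groups.append([size])
    return groups


# Two alleles at one locus: about 100 bp and about 300 bp (illustrative sizes).
groups = split_by_size([100, 102, 98, 300, 305, 101], ratio=0.3)
major_allele = max(groups, key=len)
```

The subcluster with the most supporting signatures is then the natural candidate for the major allele, and the remaining subclusters are candidates for additional non-reference alleles.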

For the subcluster with the most supporting reads, cuteSV recognizes a major allele when meeting

where the terms denote the first breakpoint, the next breakpoint, and the scope of extension

Availability of data and material

The human reference genome, raw sequencing data, alignments and gold standard truth callsets used in this study are available from the respective publications and websites listed in Supplementary Table 6.