Author Correction: SMURF-seq: efficient copy number profiling on long-read sequencers

We present SMURF-seq, a protocol to efficiently sequence short DNA molecules on a long-read sequencer by randomly ligating them to form long molecules. Applying SMURF-seq using the Oxford Nanopore MinION yields up to 30 fragments per read, providing an average of 6.2 and up to 7.5 million mappable fragments per run, increasing information throughput for read-counting applications. We apply SMURF-seq on the MinION to generate copy number profiles. A comparison with profiles from Illumina sequencing reveals that SMURF-seq attains similar accuracy. More broadly, SMURF-seq expands the utility of long-read sequencers for read-counting applications. Electronic supplementary material The online version of this article (10.1186/s13059-019-1732-1) contains supplementary material, which is available to authorized users.

Step Time RE digestion ∼30 min Spin-column clean-up ∼15 min Re-ligation ∼30 min Ampure XP beads clean-up ∼15 min Total ∼90 min Table 1: Steps involved and time needed for SMURF-seq protocol: SMURF-seq protocol involves digesting DNA molecules using restriction enzymes and re-ligating them with DNA Ligase. SMURF-seq protocol takes approximately 90 minutes to complete.

Fragmentation techniques for SMURF-seq
In the SMURF-seq protocol, we used restriction enzymes to fragment DNA molecules. We also tested acoustic shearing (Covaris, Cat. no. S220) and dsDNA fragmentase (NEB, Cat. no. M0348L) to fragment DNA. These methods increase the time needed for SMURF-seq because of the additional end-repair and wash steps that are required. Moreover, the ligated molecules from these methods were shorter compared with the molecules using the restriction enzyme protocol. For acoustic shearing, the machine was set to shear DNA to approximately 150 bp (Fig. 1a). After shearing, DNA was end-repaired to have blunt ends and then re-ligated with T4 DNA ligase (Fig. 1b). We found that repeating the end-repair and re-ligation steps increased the molecule lengths (Fig. 1c). Finally, the ligated molecules were washed with 0.5× volume Ampure XP beads to remove molecules less than 500 bp (Fig. 1d).
For dsDNA fragmentase, the standard protocol provided by the manufacturer was followed to fragment DNA molecules and the reaction was stopped after 30 minutes (Fig. 2a). The molecules were then endrepaired to have blunt ends and re-ligated with T4 DNA ligase (Fig. 2b). Finally, the ligated molecules were washed with 0.5× volume Ampure XP beads to remove molecules less than 500 bp (Fig. 2c).
2 Mapping SMURF-seq reads 2.1 Simulating SMURF-seq reads to evaluate mapping programs The basic functionality of identifying boundaries between fragments within reads already exists in several mapping tools. To test these, we chose to create simulated reads with the technical characteristics we expect in idealized SMURF-seq data. We first selected a fragment length and a number k of fragments per read. Then, for a given WGS nanopore data set, we took the set of mapped long reads as determined by BWA-MEM (with -x ont2d option). Each of the mapped reads was split into fragments of length (with a random offset of 0 to − 1 at the start of the long read). Each fragment was validated by requiring that it did not overlap a deadzone in the genome (as determined by the deadzone program available from https://github.com/smithlabcode/utils for 40 bp). The reason for excluding deadzones is that even when a short fragment has a "known" mapping location when it is part of a longer read, we cannot compare its reported mapping location as a short fragment with that known location, since we expect any good mapping algorithm to identify that the fragment maps ambiguously. Among these validated fragments, subsets of k were sampled uniformly at random and concatenated (in random order and orientation) to form simulated SMURF-seq reads.
The first and last fragments in a read should be slightly easier to identify and map than the rest, since one of their boundaries is known. Using the above procedure, we select k = 20 so that the simulated reads have a sufficient number of fragments to eliminate the influence of the first and last fragments in each read on the results. There is no need to have large k otherwise.
By lowering and making the fragments shorter, the task of mapping the fragments becomes more challenging. Real SMURF-seq reads have fragment lengths determined by restriction site density, size selection and other aspects of the experiments. But in testing mapping algorithms and optimizing parameters, there is no disadvantage to making the task more challenging. We only need to be able to distinguish the relative performance of different mapping tools and parameter combinations. Real SMURF-seq reads have varying fragment lengths, but in evaluating mapping tools, there is no need to randomize fragment lengths. None of the algorithms we evaluated are capable of either deducing or leveraging the fact that all simulated fragments have the same length. We selected = 100, which begins to challenge the various mapping strategies. These values of are slightly lower than the average in real SMURF-seq data.

Evaluating performance using simulated SMURF-seq reads
Within the simulated reads, the boundaries of each fragment are known a priori, as are their mapping locations. We used this information to evaluate mapping tools in terms of (1) how well they identify fragments purely for the purpose of counting molecules, which is the primary information used in CNV analysis, and (2) how well they identify individual mapping bases within reads. The latter criteria becomes important in challenging cases and will be increasingly important as fragment sizes are reduced.
Performance on identifying fragments: After mapping these simulated reads, each mapping result is called a predicted fragment. Each predicted fragment is considered a positive prediction, and we assume an arbitrary order over positive predictions. A positive prediction is a true positive if: • The predicted fragment maps uniquely.
• The mapping locations of at least half the bases in the predicted fragment are equal to the original mapping locations for those bases, and those bases are all part of the same original fragment (we assume that it is unlikely for two fragments on a simulated read to have the same mapping location but opposite orientation, and thus do not check for the orientation of a fragment). In this case, we say the predicted fragment is associated with that original fragment. • The predicted fragment is the first among predicted fragments associated the same original fragment. False positives are predicted fragments that are not true positives. Any original fragment with no associated predicted fragment is a false negative. These criteria penalize splitting one original fragment or merging two original fragments. By defining true positives, false positives and false negatives we are able to calculate precision, recall, and F-score for a particular mapping strategy.
Performance on identifying individual mapping bases: After mapping simulated reads, each mapping result is decomposed into individual nucleotides and associated with a location in the genome. Those locations are retained. We keep multiplicities, so when two mapped fragments overlap in the genome we count certain nucleotides twice. These are the predicted positive bases in the reference. The condition positive bases are those known a prior from the simulation. The original fragment mapping locations may overlap in the reference genome, leading to multiplicities in the condition positive bases, but with low probability. The true positives are the intersection of the condition positive and the predicted positive bases. When there are multiplicities of mapped fragments and simulated fragments overlapping the same bases in the reference genome, this is determined by taking the smaller of the two values. After removing the true positives bases, the remaining predicted positive bases are false positives, and the remaining condition positive bases are false negatives. These criteria penalize mapping approaches that do not cover the entire simulated SMURF-seq reads, and also penalize approaches that predict fragments that overlap within the read. The true positives, false positives, and false negatives here allow us to assign precision and recall in terms of individual bases and corresponding F-scores. Although the reference bases for both predicted positive and condition positive could involve multisets, since our simulations used relatively low coverage this almost never happened.

Initial selection of mapping tools
We tested the following mapping tools: BWA-MEM [3], Minimap2 [4], LAST [5], GraphMap [6], BLASR [7], rHAT [8], and LAMSA [9]. These were selected either because they are known to perform well on certain mapping tasks or have unique properties that plausibly could help in mapping SMURF-seq reads. We tested each of these using default parameters on simulated reads (see above) and downsampled real SMURFseq reads (data not shown). Among these BWA-MEM, Minimap2, and LAST had higher accuracy on simulated data, and the other tools identified at most 15 fragments per read on real data. Thus, we explored performance of BWA-MEM (0.7.17), LAST (963), and Minimap2 (2.15) in more detail, varying parameters to improve performance.
We remark that none of these tools were designed to map SMURF-seq reads; results we report here do not reflect the overall performance of the various mapping tools, only that the three aforementioned tools happened to perform relatively well on a task for which they were not directly designed for.

Detailed evaluation and parameter optimization
The selected mapping tools have variations on the following basic steps: • Identifying seeds: All tools have a step of identifying seeds, which are short exactly matching parts of the reads. Choices in how seeds are defined and used are often made for mapping speed. The total size of SMURF-seq data sets is currently (relatively) small, so speed is not our primary concern. We favor the most sensitive seed strategy, but depending on implementation too many seed hits could lead to ambiguity later in the mapping process. • Chaining seeds: The identified seeds are further extended and filtered to avoid aligning potentially false positive seed hits. • Aligning within the chains: In this stage a Smith-Waterman alignment is performed, typically allowing users to specify a mismatch penalty along with penalties for both gap-open and gap-extend. • Selecting best alignments: When high-scoring alignments overlap within a read, one of them (or both) could be trimmed or one is selected and the other discarded. The choices made here could lead to discarding entire fragments. These mapping tools have several parameter options, in general, these are related to: (1) the seeding and chaining algorithm used by the individual tool. (2) The Smith-Waterman alignment scores, i.e. the match score, and the mismatch and indel penalty. The seeding and chaining parameters control the number of proto alignments that are further refined by aligning parts of the read to the reference genome using the specified alignment scores.
The Smith-Waterman alignment score used to align fragments to the reference genome is crucial for determining the optimal fragment length. On one extreme, a match score of 1 with a mismatch and indel penalty of 0 will result in one identified fragment covering the entire read and mapping perfectly, but will always map ambiguously. On the other extreme, a match score of 1 with a mismatch and indel penalty of −∞ will result in any mismatch or indel on the read to be considered as a fragment boundary. Therefore to align SMURF-seq reads, we need to determine optimal alignment scores to use.
In order to determine the optimal alignment score, we kept the seeding related parameters constant, and varied the alignment score combinations to perform a grid search. We varied the mismatch penalty from 1 to 6, gap open penalty from 0 to 4, and gap extend penalty form 1 to 4. The match score was fixed at 1. Thus for each tool we tested 120 (6 × 5 × 4) combinations of alignment scores.
The seeding and chaining related parameters for each tool was set at follows (along with the four alignment scores): • BWA-MEM: -x ont2d -k 12 -W 12 -T 30 • Minimap2: -w 1 -m 10 -s 30 • LAST (NEAR): lastal -Q0 -e 20 and last-split -m 1 -s 30 We set the seeding and chaining parameters in a liberal manner to allow for higher sensitivity than the default parameter of each tool, and the minimum alignment score to output was set at 30.
After aligning the simulated reads, we calculated the average precision and recall, each for the mapped fragment locations and nucleotides, for the four datasets. The F-score was computed for each, and the mean of the F-scores was used to determine the optimal alignment parameter for each tool. These results for each parameter combinations are reported in Additional file 2: Additional table 1. Based on these results BWA-MEM outperformed other tools for aligning SMURF-seq reads. BWA-MEM performed best with a mismatch, open, and extension penalty of 2, 1, 1 respectively.
To further refine the optimal alignment parameter for BWA-MEM, we aligned the simulated reads with parameter values around the value described above with a higher resolution. We varied the mismatch penalty from 1.5 to 2.5, and open and extend penalties from 0.5 to 1.5 in increments of 0.25. However, BWA-MEM does not accept floating point values for alignment score parameters. To overcome this, we scaled the alignment score proportionately to have integer values, i.e we varied the mismatch penalty from 6 to 10, open and extend penalties from 2 to 6, and fixed the match score at 4 (125 combinations). These results are presented in Additional file 2: Additional table 2. Based on these results, the highest accuracy was obtained with the mismatch, open, and extension penalty of 2.5, 1.5, 0.75 respectively (corresponding scaled values are 10, 6 and 3). We used these optimal alignment scores for mapping real SMURF-seq read, and all the CNV profiles presented are based on these.
3 Short molecule sequencing with long-read sequencers 3 0.5 µg of restriction enzyme digested DNA in 45 µl of nuclease-free water was end-repaired and dAtailed (New England Biolabs (NEB), Cat. no. E7546), followed by elution in nuclease-free water after 1.5× volume Ampure XP beads clean-up. Sequencing adapters (AMX1D) were ligated with Blunt/TA Ligase Master Mix (NEB, Cat.no. M0367) and cleaned with 1.0× volume Ampure XP beads (manufacturer's protocol uses 0.4× volume XP beads, we increased to 1.0× to get as many short molecules as possible) and eluted using 15 µl Elution Buffer (ELB) following the manufacturer's protocol (Oxford Nanopore Technologies (ONT), 1D genomic DNA by ligation protocol).
The prepared library was loaded on R9.4 Flowcell following the manufacturer's protocol (ONT) and sequenced for 48 hours. Base-calling was performed using ONT Guppy (2.3.5). The sequencing run produced 2.58 million reads with a mean read length of 630.93 bp ( Figure S3).

Limitations of short molecule sequencing with long-read sequencers
There are several disadvantages of sequencing short molecules on a nanopore sequencer: 1. Short molecules have a higher ratio of "technical" to "biological" nucleotides (sequencing adapters and barcodes). In our sequencing run without SMURF-seq, the pores spent 6.1% of their sequencing time on adapters, in comparison to 0.7% when sequencing the diploid genome using SMURFseq. Defining d(i) as the total duration and d t (i) as template duration for read i, this is calculated as 1 − ( i d t (i))/( i d(i)) based on information from the sequencing summary.txt file. This factor will become worse when the samples are barcoded (for example, the barcode kit EXP-NBD103 uses 24 bp barcodes, adding 48 bp to each read), and as the molecules get shorter.
2. The speed of sequencing was lower when sequencing short DNA molecules, likely due to the nanopore requiring a certain number of bases to get to its maximum speed. For example, the average sequencing speed was 315.54 bases per second for sequencing the diploid genome without SMURF-seq, and 400.29 bases per second when sequencing using SMURF-seq (calculated as the mean of |t(i)|/d(i), where t(i) is the template sequence for read i, from the sequencing summary.txt file; Figure S13). Thus as the molecules get shorter, the nanopores do not sequence at the maximum speed. 3. Our sequencing run produced reads with a mean length of 630.9 bp, which is longer than required for read-counting applications. The continually evolving library preparation protocols would likely require ad-hoc modifications for sequencing molecules of length that are optimal for read-counting applications. Further, certain library preparation kits require long DNA molecules for an optimal library construction, and these kits are not optimal for sequencing short DNA molecules. For example, the rapid sequencing kit uses transposase to fragment DNA molecules and these are not optimal for short DNA molecules as they would fragment the short molecules further. PacBio machines using the SMRTbell template [10,11], sequences only one molecule per zero-mode waveguide irrespective of the molecule length. Although, we have not demonstrated this, using SMURFseq for read-counting applications on a PacBio sequencer will directly lead to an increase in the number of fragments sequenced per read and therefore increase the read-counts obtained from a sequencing run. Additional