Performance difference of graph-based and alignment-based hybrid error correction methods for error-prone long reads

Error-prone third-generation sequencing (TGS) long reads can be corrected by high-quality second-generation sequencing (SGS) short reads, a process referred to as hybrid error correction. Here we investigate the influence of the principal algorithmic factors of the two major types of hybrid error correction methods by mathematical modeling and by analysis of both simulated and real data. Our study reveals the distribution of accuracy gain with respect to the original long read error rate. We also demonstrate that an original error rate of 19% is the limit for perfect correction, beyond which long reads are too error-prone to be corrected by these methods.

According to the definitions of M and K, it follows directly that {K ≥ k, M ≤ m} ⊂ {K ≥ k, M ≤ m + 1} and {K ≥ k, M ≤ m} ⊃ {K ≥ k + 1, M ≤ m} when l and p are fixed. Therefore the first two conclusions hold.
We next prove that τ decreases with p. Denote by E_n (0 ≤ n ≤ l) the set of l-dimensional 0-1 vectors that contain exactly n 1s and whose maximal run of consecutive 0s has length at least k. According to the above steps,

τ = Σ_{n=0}^{m} #(E_n) p^n (1 − p)^{l−n}.

Taking the derivative of τ with respect to p gives

dτ/dp = Σ_{n=0}^{m} #(E_n) [n p^{n−1} (1 − p)^{l−n} − (l − n) p^n (1 − p)^{l−n−1}].

Denote e as an l-dimensional 0-1 vector and e_i as its i-th entry. For any n (0 ≤ n ≤ l − 1), define two sets B_n and B_{n+1} as

B_n = {(e, i) : e ∈ E_n, e_i = 0},  B_{n+1} = {(e, i) : e ∈ E_{n+1}, e_i = 1},

so that #(B_n) = (l − n) #(E_n) and #(B_{n+1}) = (n + 1) #(E_{n+1}). Let f_i be the operator that changes the i-th entry of e to 0. Setting an entry to 0 removes one 1 and can only lengthen the runs of 0s, so f_i maps E_{n+1} into E_n; define F(e, i) = (f_i(e), i). It can be verified that F : B_{n+1} → B_n is an injection, so that #(B_{n+1}) ≤ #(B_n). This implies (n + 1) #(E_{n+1}) ≤ (l − n) #(E_n). Substituting this inequality into the derivative above (after shifting the index of the first term from n to n + 1) shows that dτ/dp ≤ 0, so τ decreases with p.
Dependence of short read alignments

We study the distribution of N, the number of aligned short reads. Given the errors on the long read, the alignments of different short reads are not completely independent. When the shift distance of two short reads (i.e., the distance between their start positions) is small and one of them fails to be aligned, it is highly likely that the other cannot be aligned either (Fig. S1A), as the two short reads cover nearly the same set of errors on the long read. The correlation of mismatch numbers between two short reads decreases with their shift distance and reaches a low value when the shift distance is > 30 bp (Fig. S1B). As the short read length is 100 bp in this study, all short reads covering the base b of interest are divided into three clusters based on their start positions (Fig. S1C). Assume that (1) the start positions of the short reads are uniformly distributed; (2) whether the short reads in different clusters can be successfully aligned is independent; and (3) the short reads in each cluster are either all aligned or all not aligned. Denote D = (D_1, D_2, D_3) as the numbers of sequenced short reads in the three clusters, and d = (d_1, d_2, d_3) as a value of D. Let N_i be the random variable that represents the number of aligned short reads in the i-th cluster; the observed number of aligned short reads is N = N_1 + N_2 + N_3. The probability Pr(N = n) can be expressed as

Pr(N = n) = Σ_d Pr(D = d) Pr(N_1 + N_2 + N_3 = n | D = d).

According to assumption (1), Pr(D = d) can be calculated from the trinomial distribution Trinom(C; 1/3, 1/3, 1/3). According to assumptions (2) and (3), N_1, N_2 and N_3 are independent given D = d, and each N_i takes only the values 0 or d_i, with Pr(N_i = 0) = 1 − τ and Pr(N_i = d_i) = τ.
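Under assumptions (1)-(3), Pr(N = n) can be evaluated by direct enumeration over the cluster sizes and the aligned/unaligned status of each cluster. A minimal sketch in Python (the values of C and τ are hypothetical, chosen only for illustration):

```python
from math import comb

def cluster_pmf(C, tau):
    """PMF of N, the number of aligned short reads, under the
    three-cluster model: D ~ Trinom(C; 1/3, 1/3, 1/3) and each
    cluster is aligned as a whole with probability tau."""
    pmf = {}
    for d1 in range(C + 1):
        for d2 in range(C - d1 + 1):
            d3 = C - d1 - d2
            # trinomial probability of the cluster sizes
            pd = comb(C, d1) * comb(C - d1, d2) * (1 / 3) ** C
            # enumerate which clusters are successfully aligned
            for a1 in (0, 1):
                for a2 in (0, 1):
                    for a3 in (0, 1):
                        n = a1 * d1 + a2 * d2 + a3 * d3
                        p = pd
                        for a in (a1, a2, a3):
                            p *= tau if a else 1 - tau
                        pmf[n] = pmf.get(n, 0.0) + p
    return pmf

pmf = cluster_pmf(C=10, tau=0.7)
```

By linearity the model implies E[N] = Cτ, which serves as a sanity check on the enumeration.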

Estimation of mismatch rate p
Denote the short read error rate and long read error rate as β and γ, respectively. Let S_i be the real nucleotide on the molecule, and let X_i and Y_i be the corresponding nucleotides on the short read and long read. Since the short read and long read are sequenced independently, we can simply assume that Pr(X_i ≠ S_i) = β and Pr(Y_i ≠ S_i) = γ, independently. Under a uniform substitution model (a sequencing error changes the base to each of the other three nucleotides with equal probability), the mismatch rate is calculated as

p = Pr(X_i ≠ Y_i) = 1 − [(1 − β)(1 − γ) + βγ/3] = β + γ − (4/3)βγ.

Note that γ generally ranges between 5% and 30%, while β is only 0.1% to 1%. The value of βγ is therefore tiny, so we can simply set the mismatch rate p as β + γ, the sum of the short read error rate and the long read error rate. The ranges of γ and β also imply that the mismatch rate p is mainly determined by the long read error rate.
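The effect of neglecting the βγ term can be checked numerically. The sketch below assumes the uniform substitution model described above (each error changes the base to one of the other three nucleotides with equal probability), which is an assumption of this illustration:

```python
def mismatch_rate(beta, gamma):
    """Exact mismatch rate between a short read base and a long read
    base under a uniform substitution model: the two bases disagree
    unless both are correct, or both are wrong and happen to hit the
    same alternative base (probability beta * gamma / 3)."""
    return 1 - ((1 - beta) * (1 - gamma) + beta * gamma / 3)

# Over the typical ranges (beta in 0.1%-1%, gamma in 5%-30%) the
# approximation p ~ beta + gamma is accurate to a few parts per thousand.
for beta in (0.001, 0.01):
    for gamma in (0.05, 0.30):
        assert abs(mismatch_rate(beta, gamma) - (beta + gamma)) < 0.005
```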

Relationship between τ and parameters p, k, m
We further study how τ changes with p, k, and m (Fig. 1B). l is set as 100, the common length of Illumina short reads. τ shows an inverse sigmoidal tendency with respect to p: τ is near 1 when p is small, and decreases rapidly to 0 as p increases. This indicates that the mismatch rate, or approximately the long read error rate, is the dominant factor for τ. The start and end of the dramatic drop depend on k and m. Under the regular settings of k (from 15 to 20) and m (from 10 to 20), the drop occurs at 5% < p < 30% (Fig. 1B), which coincides with the typical range of TGS long read error rates. Thus, the alignment-based method is applicable to correcting TGS data. As m increases, both τ and the effect of k on τ increase. When m = 10, increasing k from 15 to 20 has little effect on τ, and the dramatic drop of τ starts at p = 5% and ends at p = 20% (Fig. 1B). As m increases to 15 and 20, the effect of k becomes more obvious, and the dramatic drop of τ shifts to higher mismatch rates, starting at p = 10% and ending at p = 25% and p = 30%, respectively. The probability that a short read is aligned to a random, i.e., incorrect, place can also be calculated using Theorem 1; the parameter p should then be replaced by the mismatch rate of a random alignment, which can be taken as 75%. Under a very loose criterion (k = 12, m = 40), the probability of successful alignment at a random place is τ(12, 40, 0.75, 100) = 7.63 × 10^−15. This implies that even with 300 Gb of Illumina short reads (10,000× coverage of the human genome), the expected number of randomly aligned short reads at each base is only 2.29 × 10^−3. We therefore neglect randomly aligned short reads. It should be noted that by doing so we also neglect false alignments of short reads generated from repetitive regions; in that case, the mismatch rate p is not fixed and the problem is hard to model.
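τ(k, m, p, l) can be evaluated exactly by a small dynamic program over the read positions, tracking the number of mismatches so far and the current run of exact matches. This is one way to compute the quantity under the model's i.i.d. assumption, not the authors' implementation:

```python
def tau(k, m, p, l):
    """Probability that a length-l read with i.i.d. per-base mismatch
    probability p has at most m mismatches (M <= m) and contains a
    run of at least k consecutive matches (K >= k)."""
    # state: (mismatches so far, current match run capped at k, run >= k seen)
    dp = {(0, 0, False): 1.0}
    for _ in range(l):
        nxt = {}
        for (e, r, hit), prob in dp.items():
            # mismatch at this position; states exceeding m can never succeed
            if e + 1 <= m:
                key = (e + 1, 0, hit)
                nxt[key] = nxt.get(key, 0.0) + prob * p
            # match at this position extends the current run
            nr = min(r + 1, k)
            key = (e, nr, hit or nr >= k)
            nxt[key] = nxt.get(key, 0.0) + prob * (1 - p)
        dp = nxt
    return sum(prob for (e, r, hit), prob in dp.items() if hit)
```

The computed τ is near 1 for small p and drops towards 0 as p grows, reproducing the inverse sigmoidal tendency described above.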

Monotonicity of consensus inference accuracy
g(N, β) is the consensus inference accuracy defined in the Methods part of the main text; it is the probability that a majority vote over n aligned short reads, each erroneous independently with probability β and with ties broken uniformly at random, recovers the true base:

g(n, β) = Σ_{j=0}^{⌈n/2⌉−1} C(n, j) β^j (1 − β)^{n−j} + (1/2) C(n, n/2) [β(1 − β)]^{n/2},

where the last term is present only when n is even. N is the random variable that represents the number of successfully aligned short reads covering the target base, and n denotes a specific value of N. β is the short read error rate. Next, we prove that g(n, β) increases with n and decreases with β. The conclusions follow from the two lemmas below.
Lemma 1. Suppose n is a positive even number; then g(n, β) = g(n − 1, β). If β < 1/2, we also have g(n, β) < g(n + 1, β).
Proof. We prove the first conclusion. When n = 2t is even, condition on the first 2t − 1 short reads and let X be the number of errors among them. Then

g(2t, β) − g(2t − 1, β) = (1/2)(1 − β) Pr(X = t) − (1/2) β Pr(X = t − 1).

Note that C(2t − 1, t) = C(2t − 1, t − 1), so that (1 − β) Pr(X = t) = C(2t − 1, t) β^t (1 − β)^t = β Pr(X = t − 1); thus the first conclusion holds.
When β < 1/2, condition on the first n = 2t short reads and let X be the number of errors among them; then g(n + 1, β) − g(n, β) = (1/2 − β) Pr(X = t) > 0, and therefore the second conclusion holds. Due to the nature of second-generation sequencing, β < 1/2 holds essentially always, so we can generally conclude that g(n, β) increases with the number of aligned short reads n.
Lemma 2. g(n, β) decreases with β.
Proof. According to Lemma 1, we only need to consider the case when n = 2t + 1 is odd, when g(n, β) = Σ_{j=0}^{t} C(n, j) β^j (1 − β)^{n−j}. Taking the derivative of g(n, β) with respect to β, the sum telescopes to

∂g/∂β = −(t + 1) C(n, t) β^t (1 − β)^t < 0,

so that the conclusion holds.
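The lemmas can be checked numerically. The sketch below implements g(n, β) as a majority vote with random tie-breaking, a form consistent with Lemma 1 (the exact definition is given in the main text):

```python
from math import comb

def g(n, beta):
    """Consensus accuracy of a majority vote over n aligned short
    reads, each wrong independently with probability beta; ties,
    possible only for even n, are broken uniformly at random."""
    # probability that strictly fewer than half of the reads are wrong
    acc = sum(comb(n, j) * beta ** j * (1 - beta) ** (n - j)
              for j in range(n) if j < n / 2)
    if n % 2 == 0:
        # tie: half the reads wrong, consensus correct with probability 1/2
        acc += 0.5 * comb(n, n // 2) * (beta * (1 - beta)) ** (n // 2)
    return acc
```

Numerically, g(4, β) = g(3, β), g(4, β) < g(5, β) for β < 1/2, and g decreases in β, matching Lemmas 1 and 2.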

Probability that a k-mer is present in DBG
We calculate p_DBG, the probability that a k-mer is present in the DBG, i.e., the probability that the k-mer is present on at least one short read. Suppose the number of short reads covering the middle point of the k-mer is C, the short read coverage, and assume that the start positions of these short reads are evenly distributed. Let H be the number of short reads that cover the entire k-mer. A read covering the midpoint also covers the whole k-mer with probability approximately 1 − k/l, so H follows the binomial distribution Binom(C, 1 − k/l); since each such read carries the k-mer without error with probability (1 − β)^k, we have

p_DBG = 1 − [1 − (1 − k/l)(1 − β)^k]^C.

p_DBG increases with C and decreases with β and k. We take l as 100 and k as 17, 19 and 21, three commonly used k values in software. When the short read error rate is 1%, only 8-9× short read coverage is required to achieve p_DBG > 99.99%. Even if the short read error rate is as high as 2%, the required short read coverage only increases to 10-12× for p_DBG > 99.99% (Fig. S3). This shallow coverage is satisfied in most current sequencing projects. Since the requirement on the short reads is so easily met, the above analyses imply that the long read length and error rate are the dominant factors for solid k-mer detection.
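The formula for p_DBG is straightforward to evaluate; a minimal sketch (using the approximate covering probability 1 − k/l described above):

```python
def p_dbg(C, beta, k, l=100):
    """Probability that a k-mer is present in the de Bruijn graph,
    i.e. carried error-free by at least one short read. A read that
    covers the k-mer midpoint also covers the entire k-mer with
    probability ~ 1 - k/l, and is then error-free over the k-mer
    with probability (1 - beta)**k."""
    per_read = (1 - k / l) * (1 - beta) ** k
    return 1 - (1 - per_read) ** C
```

With a 1% short read error rate and k = 19, 9× coverage already pushes p_DBG above 99.99%, consistent with Fig. S3.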

Generation of simulated data
The genome of E. coli (strain K-12 MG1655) is taken as the reference to generate simulated data. The long reads are generated using the PacBio read simulator SimLoRD (version 1.0.2) [1]. The parameter --fixed-readlength is set as 1 kb, 2 kb, 5 kb and 10 kb, so that reads of four lengths are generated. For each read length, we set --probability-threshold as 1%, 2%, and so on up to 30%, and generate 2,000 long reads for each value. The parameter --max-passes is set as 0 in all cases. The software fails to generate any read when --probability-threshold is 1%, so in the end 58,000 long reads are generated for each read length. Although SimLoRD automatically outputs the error rate and CIGAR string for each long read, we find that alternating insertions and deletions frequently occur at places where the long read and the reference genome actually match perfectly. This implies that the output error rate is likely to be overestimated. We therefore align the long reads to their true fragments on the reference genome and recalculate the long read error rate. The alignment is performed using the Needleman-Wunsch algorithm [2] with the scores of match, mismatch, gap open and gap extension set as 1, -4, -2 and -1, respectively. The long read error rate is calculated as (#mismatch + total indel size)/long read length, where #mismatch is the number of mismatched base pairs. After the recalculation, the numbers of long reads at the error rate levels are no longer exactly 2,000, but they are all greater than 100, so the sample size at each error rate level is still sufficient for reliable evaluation.
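The recalculated error rate is a simple function of a pairwise alignment. A minimal sketch that computes (#mismatch + total indel size)/long read length from a gap-padded alignment (the example strings are hypothetical):

```python
def long_read_error_rate(ref_aln, read_aln):
    """Compute the long read error rate from a pairwise alignment,
    given as two gap-padded strings of equal length ('-' = gap).
    The rate is (#mismatch + total indel size) / long read length,
    as in the recalculation described above."""
    assert len(ref_aln) == len(read_aln)
    mismatches = sum(1 for r, q in zip(ref_aln, read_aln)
                     if r != '-' and q != '-' and r != q)
    indels = sum(1 for r, q in zip(ref_aln, read_aln)
                 if r == '-' or q == '-')
    read_len = sum(1 for q in read_aln if q != '-')
    return (mismatches + indels) / read_len

# Example: one mismatch (T vs A) and one deletion over a 7 bp read.
rate = long_read_error_rate("ACGTACGT", "ACGAAC-T")
```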
The short reads are generated using ART (version 2.5.8) [3]. The parameter -ss (Illumina sequencing system) is set as HS25, so that the short reads are simulated with the profile of the HiSeq 2500. The short read length is set as 100 bp. We simulate short reads with three levels of coverage: 10×, 20× and 50×. We also notice that proovread [4] performs error correction over multiple iterations. In the first iteration, the short reads are split into 1,000 chunks and only 6 chunks in every 20 are used to correct the long reads. This implies that the effective coverage is only 30% of that of the input short reads. In order to mimic effective coverage of 5×, 10×, 20× and 50×, we simulate additional short read sets with coverage 17×, 33×, 67× and 167× for proovread.

Application of proovread
We correct the simulated long reads using proovread (version 5.14.0) [4], a typical and widely used alignment-based error correction tool. Since the model of the alignment-based method does not depend on the long read length, we arbitrarily select the 2 kb long read set. It should be mentioned that proovread performs iterative correction, with different alignment criteria applied in different rounds. To evaluate the model fairly, we only use the result of the first iteration. In version 5.14.0, bwa-proovread, a modified bwa program, is used to align the short reads; the minimum seed length -k is automatically set as 12 by proovread. Instead of inputting the whole long read set, we split it into multiple bins according to the error rates (0-3%, 3-6%, 6-9%, 9-12%, 12-15%, 15-18%, 18-21%, 21-24%, 24-27% and 27-30%) and correct the bins separately. The reason is that when long reads of high and low quality are mixed, the short reads tend to be aligned to the high quality ones, and the low quality long reads are likely to attract few or no short reads. Binning the long reads reduces this bias and leads to a fair evaluation of the model. We also notice that the option -a is set in the alignment command, meaning that all alignments are output when a short read can be aligned to multiple long reads.
We set k = 12 and C as 10, 20 and 50 in the model, which are respectively the seed length threshold and the short read coverage. l is set as 100, the short read length. Since proovread performs local score comparison rather than applying a single hard cut-off, it is difficult to set a general m accurately. We therefore set m through an approximate approach. Specifically, we calculate the average number of aligned short reads at each error rate level for the case of 10× coverage. We then divide this average by 200, the theoretical maximum number of aligned short reads, to obtain an alignment rate, and compare the rates at all error rate levels with τ under different values of m. The value of m that minimizes the total difference is selected, which is 20 in our case (Fig. S4). With all parameters set, the theoretical accuracy gain can be calculated.
For E. coli real sequencing data, we also bin the long reads according to their error rates and apply proovread with the default parameters, the same as for the simulated data.

Application of LoRDEC
As an evaluation of the model of the graph-based method, we use LoRDEC (version 0.5.3) [5], a typical graph-based hybrid error correction tool, to correct the simulated and real long reads. The common parameter setting is -a 50000 -s 1. For the simulated data, the k-mer size -k is set as 17, 19 and 21. For the E. coli real sequencing data, k is set as 19.

Processing of real data
We download the PacBio raw data of E. coli from PacificBiosciences/DevNet [6]. The bax file is then processed by smrtanalysis (version 2.3.0) with the command ConsensusTools.sh CircularConsensus --minPredictedAccuracy 75 --minFullPasses 0. The obtained 55,137 long reads are then aligned to the reference genome of E. coli (strain K-12 MG1655) with blasr (version 1.3.1) [7], with the parameters -bestn 1 -sam -clipping soft. The number of successful alignments is 53,642. For each alignment, we trim the soft-clipped ends of the long read and extract the region corresponding to the alignment from the reference genome. These are taken as the pre-correction long reads and their accurate counterparts in our tests.
The Illumina short reads are downloaded from NCBI with the accession number ERR022075. In order to mimic 10× short read coverage, we randomly sample 33× (see Note 7 for explanation) and 10× from the short read dataset for proovread [4] and LoRDEC [5], respectively.
The models in the main text rely on the assumption that all bases on a long read are independent and share a common error rate. When the models are applied to real data, additional considerations are needed. Suppose we have a long read of length L with error rate γ; the expectation of the total error number (i.e., the sum of the mismatch number and the total indel size) is then Lγ.
We consider two cases: (1) the errors are evenly distributed; (2) the errors cluster together. In the latter case, a perfectly matched k-mer is more likely to occur, which implies a greater probability that the long read is corrected by either the alignment-based or the graph-based method. If the latter case occurs substantially more often in real data than under the i.i.d. assumption, the accuracy gain would be improperly estimated by the model with the original value of γ, so an error rate adjustment is necessary for the analysis of real data.
Given a long read error rate γ, we calculate the distance between every two adjacent errors in the real long reads, and fit the distance values with a geometric distribution. We pool the values of γ and the rates of the fitted geometric distributions, and then fit a cubic curve h = h(γ) to the points (Fig. S6 shows examples for the real and simulated E. coli datasets). When analyzing the real data, we input h(γ) instead of γ into the models. The function h is attributed to the sequencing platform and can be learned from other independent real datasets. We also notice that the error patterns generated by the simulation software differ from those of the real dataset, as the fitted cubic curves do not overlap (Fig. S6).
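One simple way to carry out the per-read geometric fit is maximum likelihood on the distances between adjacent errors (for a geometric distribution supported on 1, 2, …, the MLE of the rate is the reciprocal of the mean distance). The original fitting procedure is not specified in detail, so this sketch is an assumption; the error positions are hypothetical:

```python
def fitted_geometric_rate(error_positions):
    """Fit a geometric distribution (support 1, 2, ...) to the
    distances between adjacent errors on a long read and return the
    MLE of its rate, i.e. the reciprocal of the mean distance."""
    distances = [b - a for a, b in zip(error_positions, error_positions[1:])]
    mean = sum(distances) / len(distances)
    return 1 / mean

# Example with hypothetical error positions on a read.
rate = fitted_geometric_rate([2, 5, 6, 10])  # distances 3, 1, 4
```

The pooled (γ, rate) pairs over many reads are then fitted with a cubic polynomial to obtain h(γ).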
For the alignment-based method, error correction can only be achieved on the parts of the long reads where short reads can be aligned. At each long read error rate level, we calculate for every long read the proportion of its length covered by aligned short reads, and use boxplots to show the distributions (Fig. S7). The overall distribution of the proportion shifts downward as the long read error rate increases. Specifically, entire long reads are likely to be covered by short reads when the long read error rate is < 10%; beyond this turning point of ~10%, the variance of the distribution increases dramatically.

Model application to transcriptome sequencing data
In principle, error correction corrects error-prone long reads using highly accurate short reads generated from the same source, regardless of whether that source is a genome or a transcriptome. In addition, long reads are corrected independently of one another. Therefore the algorithms and models discussed in our study are all applicable to transcriptome sequencing data.
However, two important differences between genome and transcriptome data should be considered: gene expression level and the upper limit of transcript length. The unequal expression levels of genes result in uneven short read coverage of transcriptome data, in contrast to the generally uniform coverage of genome sequencing. Therefore, our model based on a fixed short read coverage may not work well in this case. Alternatively, one can bin genes with similar expression levels and apply our model to every bin with the correspondingly tuned short read coverage.
In addition, only a small fraction of transcripts are longer than 10 kb, so transcriptome reads are in general shorter than genome reads. On the PacBio platform, a large proportion of transcriptome data can generate CCS reads, which have low error rates (< 10%) and can likely be further corrected "perfectly" by short reads (Fig. 1D and 1F) with both methods. For ONT data, the consensus reads (1D^2 or 2D) may still have relatively high error rates. When correcting ONT reads from short transcripts, the alignment-based method may work better (see Fig. 2G).

Suggestion on method selection
Long reads are widely used for genome assembly and structural variation detection, but different error correction strategies should be adopted depending on the data type and the research goal.
When a library with long molecules is sequenced, PacBio typically outputs CLR (continuous long read) data, which undergoes little self-consensus correction but has long read length. The long read length helps build long-range contigs and scaffolds in genome assembly and detect large structural variations. To fully exploit this advantage of CLR data, hybrid error correction can improve its quality. Considering the long read length, we suggest the graph-based method as the first choice for correcting CLR data, as shown in the main text. In addition, the graph-based method is much more computationally efficient than the alignment-based method, which helps handle the large data volumes in genome assembly. Moreover, the long reads that fail to be corrected by the graph-based method can be passed to the alignment-based method, and some of them can probably be rescued.
When short molecules are sequenced, PacBio typically outputs CCS (circular consensus sequencing) data, which has relatively high accuracy thanks to self-consensus correction but short read length. The high accuracy of CCS reads is useful for polishing the sequence of a genome assembly and for determining the breakpoints of structural variations at single-nucleotide resolution. In some cases, the accuracy of CCS data can reach 99% [8]; the accuracy is then very close to that of short reads, and error correction becomes optional.
For ONT data, the consensus reads (1D^2 or 2D) may still have relatively high error rates, and thus hybrid error correction can still be helpful. For longer ONT reads (> 2 kb), we suggest the graph-based method as the first choice, with the alignment-based method used to rescue the uncorrected long reads. For shorter ONT reads (< 2 kb), the alignment-based method can be taken as the first choice (Fig. 2G), since the solid k-mers required by the graph-based method are likely to be absent from shorter ONT reads, as shown in our analyses.

Parameter design for organisms with different genome complexity
One of the most important parameters of error correction methods is the k-mer size, which is used in both the alignment-based method (as the threshold for the perfectly matched seed length) and the graph-based method (as the solid k-mer size). Genomes of different organisms have distinct levels of repetitive sequence, and the parameter k should be tailored to each organism to ensure specificity. To this end, we evaluate the specificity as the proportion of unique k-mer patterns over the genome size (Fig. S8). The specificity increases with k, and a turning point is obvious on each curve: the left side of the turning point shows a steep slope, while the right side reaches a plateau of specificity that varies among organisms. Although a smaller k leads to a higher probability of short read alignment or solid k-mer detection (τ and φ in Theorem 1 and its extension to the graph-based method), k should be kept above a threshold to control the false positive rate. Based on the turning points, we suggest k = 14 for E. coli, k = 15 for S. cerevisiae, k = 17 for D. melanogaster, and k = 19 for M. musculus and H. sapiens (Fig. S8).
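The specificity curve can be computed directly by counting k-mer occurrences. The sketch below interprets "unique k-mer patterns" as k-mers occurring exactly once in the genome, which is our reading of the definition; the sequence is a toy example:

```python
from collections import Counter

def kmer_specificity(genome, k):
    """Proportion of k-mers that occur exactly once in the genome,
    relative to the genome size ('unique k-mer patterns' is read
    here as single-occurrence k-mers, an interpretation)."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / len(genome)

# On a real genome this rises steeply with k and then plateaus;
# the turning point suggests the smallest k that is still specific.
spec = kmer_specificity("ACGTACGTAA", 3)  # toy sequence
```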
Fig. S1: Independence of short read alignment and the related model. (A) A case showing the dependency of short read alignments. The lower short read has 10 mismatches (represented by red stars) with the long read. The upper short read has only a 1 bp shift from the lower one, so the difference between the mismatch numbers of the two short reads cannot exceed 1. Under a given alignment criterion, if one of the short reads can (cannot) be aligned, it is highly likely that the other can (cannot) be aligned. Therefore, it is not proper to assume that the successful alignments of all short reads are independent. (B) Relationship between the NMI (normalized mutual information) of mismatch numbers and the short read shift distance, revealed by simulated tests. The correlation of mismatch numbers decreases as the shift distance increases. (C) Proposed model to handle the dependence of short read alignments. The short reads covering a certain base are divided into three classes according to their positions. The numbers of short reads in the three classes are denoted as D_1, D_2 and D_3. It is assumed that the alignments of short reads in different classes are independent, while within a class all short reads are aligned if one of them is aligned, and none otherwise.

Fig. S2: Relationship between consensus accuracy and short read coverage.
The consensus accuracy refers to the probability that the consensus inferred at a base on the long read equals the true base. The short read coverage here is the number of aligned short reads covering that base on the long read. Different curves represent different short read error rates.