Supplementary material for “Lighter: fast and memory-efficient error correction without counting”

For a standard Bloom filter, each of the $h$ hash functions could map item $o$ to any element of the bit array. The bit array will often be very large, much larger than the processor cache. Thus, each probe into the bit array is likely to cause a cache miss. Putze et al. [5] propose a blocked Bloom filter. Given a block size $b$, the first hash function $H_0(o)$ is used to select a size-$b$ block of consecutive positions in the bit array. Then, $H_1(o), \ldots, H_{h-1}(o)$ map $o$ onto elements of that block. When $b$ is less than or equal to the size of a cache line, the $h$ accesses will tend to cause only one or two cache misses, rather than approximately $h$ cache misses.
The drawback is that $h$ and the bit array size $m$ must be somewhat larger to achieve the same false positive rate (FPR) as a corresponding standard Bloom filter. To estimate the FPR of the blocked Bloom filter, we can consider each of the $m - b + 1$ possible blocks. For the $i$-th block, the FPR within the block is $(b'_i/b)^h$, where $b'_i$ is the number of bits set to 1 in block $i$. So the overall FPR is:
$$\mathrm{FPR} = \frac{\sum_i (b'_i/b)^h}{m - b + 1}$$
Putze et al. also propose a pattern-blocked Bloom filter [5]. The difference is that instead of updating the $h$ positions in the block separately, we pre-compute a list of patterns, where each pattern is a bit-mask describing how to update $h$ positions in a block with a few bitwise operations. To perform such an update, we first find the appropriate pattern using a hash function, then update the corresponding positions simultaneously. In Lighter, 64-bit integers are used to form the mask. For example, if $b = 256$, the pattern is made up of four 64-bit integers, and we can update in four 64-bit operations, regardless of $h$. The FPR formula above still roughly estimates the FPR for the pattern-blocked Bloom filter.
If the error is located near the end of the read and some candidate substitutions are equally good, we extend the read using the k-mers reported in Bloom filter A for each candidate. Lighter extends the read base by base. For the new base beyond the read, Lighter tries all the four …

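As a rough illustration of the blocked Bloom filter and the FPR estimate described above, the following Python sketch selects a block with $H_0$, probes bits inside it with $H_1, \ldots, H_{h-1}$, and averages $(b'_i/b)^h$ over the $m - b + 1$ block start positions. The class name, the BLAKE2b-based hash family, and the byte-per-bit storage are assumptions made for readability; this is not Lighter's implementation, and the pattern-blocked variant is not shown.

```python
import hashlib


class BlockedBloomFilter:
    """Toy blocked Bloom filter: H_0 picks a block, H_1..H_{h-1} pick bits in it."""

    def __init__(self, m: int, b: int, h: int):
        assert 0 < b <= m and h >= 2
        self.m, self.b, self.h = m, b, h
        self.bits = bytearray(m)  # one byte per bit, for readability only

    def _hash(self, item: str, index: int) -> int:
        # Hypothetical hash family: BLAKE2b salted with the hash index.
        digest = hashlib.blake2b(
            item.encode(), digest_size=8, salt=index.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little")

    def _positions(self, item: str):
        # H_0 selects one of the m - b + 1 blocks of b consecutive positions.
        start = self._hash(item, 0) % (self.m - self.b + 1)
        # H_1 .. H_{h-1} select positions inside that block.
        return [start + self._hash(item, j) % self.b for j in range(1, self.h)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

    def estimated_fpr(self) -> float:
        # Average of (b'_i / b)^h over all block starts, as in the formula above.
        n_blocks = self.m - self.b + 1
        ones = sum(self.bits[: self.b])  # b'_0
        total = (ones / self.b) ** self.h
        for i in range(1, n_blocks):
            # Slide the window by one position to obtain b'_i.
            ones += self.bits[i + self.b - 1] - self.bits[i - 1]
            total += (ones / self.b) ** self.h
        return total / n_blocks


bloom = BlockedBloomFilter(m=4096, b=256, h=4)
for n in range(100):
    bloom.add(f"kmer{n}")
print("estimated FPR:", bloom.estimated_fpr())
```

Because all probes for one item fall within a single block of $b$ positions, a lookup touches at most one or two cache lines when $b$ does not exceed the cache-line size, at the cost of a somewhat higher FPR for the same $m$ and $h$.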
Lighter's accuracy is near-constant as the depth of sequencing $K$ increases and its memory footprint is held constant. The basic idea is that as $K$ increases, we adjust $\alpha$ in inverse proportion. That is, we hold $\alpha K$ constant. For concreteness, consider two scenarios: scenario I, where the total number of k-mers is $K_1$ and the subsampling fraction is $\alpha_1$, and scenario II, where the number is $K_2 = zK_1$ and the subsampling fraction is $\alpha_2 = \alpha_1/z$.
Contents of Bloom filter A.
The occupancy of Bloom filter A, as well as the fraction of correct k-mers in A, are approximately the same in both scenarios. This follows from the fact that $\kappa'_c \sim \mathrm{Pois}(\alpha K(1 - \epsilon)/G)$, $\kappa'_e \sim \mathrm{Pois}(\alpha K\epsilon/H)$, and $\alpha K$, $\epsilon$, $G$, and $H$ are constant across scenarios. This is also supported by our experiments, as seen in the main body of the manuscript. Because the occupancy does not change, we can hold the Bloom filter's size constant while achieving the same false positive rate.
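The invariance argument above can be checked numerically. The short Python sketch below computes the two Poisson rates for scenario I and scenario II and the probability that a k-mer is sampled at least once. All numeric values ($\epsilon$, $G$, $H$, $K_1$, $\alpha_1$, $z$) and the function names are hypothetical and chosen only for illustration.

```python
import math


def poisson_rates(alpha, K, eps, G, H):
    """Rates of kappa'_c ~ Pois(alpha*K*(1-eps)/G) and kappa'_e ~ Pois(alpha*K*eps/H)."""
    return alpha * K * (1 - eps) / G, alpha * K * eps / H


def prob_sampled(rate):
    """P(Pois(rate) >= 1): the k-mer appears at least once in the subsample."""
    return 1 - math.exp(-rate)


# Hypothetical parameter values, for illustration only.
eps, G, H = 0.01, 5e6, 1e9
K1, alpha1, z = 1.5e8, 0.1, 4

scenario_1 = poisson_rates(alpha1, K1, eps, G, H)
scenario_2 = poisson_rates(alpha1 / z, z * K1, eps, G, H)  # alpha*K held constant

print("scenario I :", scenario_1, [prob_sampled(r) for r in scenario_1])
print("scenario II:", scenario_2, [prob_sampled(r) for r in scenario_2])
```

Since both rates depend on $\alpha$ and $K$ only through the product $\alpha K$, the two scenarios give the same rates and therefore the same expected contents of Bloom filter A.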
Accuracy of trusted / untrusted classifications.
Also, if a read position and its neighbors within $k - 1$ positions on either side are error-free, then the probability it will be called trusted does not change between scenarios. We mentioned that when $\alpha$ is small, $P(\alpha_1) \approx P(\alpha_1/z) = P(\alpha_2)$. We also showed that the false positive rate of the Bloom filter is approximately constant between scenarios, so $P^*(\alpha_1) \approx P^*(\alpha_1/z) = P^*(\alpha_2)$. Thus, the thresholds $y_x$ will also remain unchanged.
The probability $p_c$ that a correct k-mer appears in the subsample is constant across scenarios since $\alpha K$, $\epsilon$, and $G$ are constant. Since $p_c$ is constant, the parameters of the $B_{e,x}$ distribution are constant and the probability a correct position will be called trusted is also constant.
Now we consider an incorrect read position. We ignore false positives from Bloom filter A for now. $p_e = p(\kappa'_e \ge 1)/p(\kappa_e \ge 1) = (1 - e^{-\alpha\epsilon K/H})/(1 - e^{-\epsilon K/H})$ is the probability an incorrect k-mer is in the subsample given that it was sequenced. Since $\epsilon K/H$ is close to 0, $e^{-\epsilon K/H} \approx 1 - \epsilon K/H$ and $p_e \approx (\alpha\epsilon K/H)/(\epsilon K/H) = \alpha$. Say an incorrect read position is covered by $x$ k-mers; if $B_{e,x}$ is a random variable for the number of k-mers overlapping the position that appear in Bloom filter A, then $B_{e,x} \sim \mathrm{Binom}(x, p_e) \approx \mathrm{Binom}(x, \alpha)$. The probability of falsely trusting a position is therefore $p(B_{e,x} \ge y_x)$, which can be bounded from above. The upper bound in scenario II is at most $1/z$ times the upper bound in scenario I. So an upper bound on the probability of labeling an incorrect position as trusted decreases by a factor of at least $z$. When $K$ increases, the number of distinct test cases for incorrect positions increases by a factor of at most $z$. Thus, we expect the total number of incorrect positions labeled as trusted to remain approximately constant.
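A small numeric check may make the two approximations above more concrete: that $p_e \approx \alpha$ when $\epsilon K/H$ is close to 0, and that the binomial tail $p(B_{e,x} \ge y_x)$ shrinks when $\alpha$ is divided by $z$ with the threshold unchanged. The Python sketch below uses hypothetical values for $\epsilon$, $K$, $H$, $x$, $y_x$, $\alpha_1$ and $z$; it evaluates the exact tail for one example rather than the upper bound discussed in the text.

```python
import math


def p_e(alpha, eps, K, H):
    """(1 - exp(-alpha*eps*K/H)) / (1 - exp(-eps*K/H)), computed with expm1."""
    lam = eps * K / H
    return math.expm1(-alpha * lam) / math.expm1(-lam)


def binom_tail(x, p, y):
    """P(B >= y) for B ~ Binom(x, p)."""
    return sum(math.comb(x, i) * p**i * (1 - p) ** (x - i) for i in range(y, x + 1))


# Hypothetical values, for illustration only.
eps, K, H = 0.01, 1e8, 1e12   # eps*K/H = 1e-6, i.e. close to 0
alpha1, z = 0.1, 2
x, y = 33, 6                  # position covered by x k-mers, threshold y_x

approx = p_e(alpha1, eps, K, H)
tail_1 = binom_tail(x, alpha1, y)       # scenario I
tail_2 = binom_tail(x, alpha1 / z, y)   # scenario II, same threshold

print("p_e:", approx, "vs alpha:", alpha1)
print("tail I:", tail_1, "tail II:", tail_2, "ratio:", tail_1 / tail_2)
```

For these example values the tail probability drops by more than a factor of $z$, which is consistent with the expectation that the total number of incorrect positions labeled as trusted stays roughly constant as $K$ grows by a factor of $z$.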
When $\alpha$ is small, the false positive rate $\beta$ may dominate the probability $p_e$. In practice, however, the false positive rate is usually small enough that the probability of an incorrect position being labeled as trusted due to false positives is extremely low. For example, when the k-mer length is $k = 17$, the false positive rate of Bloom filter A is $\approx 0.004$, the threshold is $y_{2k-1} = 6$, and $\alpha = 0.05$, we have $p(B_{e,x} \ge y_x) \approx 5 \times 10^{-11}$.
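To get a feel for how small the false-positive-driven probability is, the Python sketch below evaluates the binomial tail $p(B \ge y)$ with per-k-mer success probability $\beta = 0.004$ and threshold $y = 6$. The text does not state which coverage $x$ was used for its figure; $x = 17$ here is an assumption made for illustration, and the function name is hypothetical.

```python
import math


def binom_tail(x, p, y):
    """P(B >= y) for B ~ Binom(x, p)."""
    return sum(math.comb(x, i) * p**i * (1 - p) ** (x - i) for i in range(y, x + 1))


beta = 0.004   # false positive rate of Bloom filter A (from the example above)
y = 6          # threshold (from the example above)
x = 17         # ASSUMPTION: number of overlapping k-mers, not stated in the text

tail = binom_tail(x, beta, y)
print(f"P(B >= {y}) with x = {x}, beta = {beta}: {tail:.3e}")
```

Under this assumption the tail is on the order of $10^{-11}$, so false positives alone are very unlikely to push an incorrect position over the threshold.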
The above is not an exhaustive analysis, since we have not examined the case where a read position is error-free but not all of its neighbors within $k - 1$ positions on either side are error-free. In this case, whether the threshold is passed depends chiefly on the locations of the nearby errors.

Contents of Bloom filter B.
Given the analysis in the previous section, we expect that the collection of k-mers drawn from the stretches of trusted positions in the reads will not change much across scenarios and, therefore, the contents of Bloom filter B will not change much. This conclusion is also supported by our experiments, as seen in the main body of the manuscript.

Supplementary Table 1: Quality-free simulation results
Here we give accuracy results from an evaluation similar to that shown in Table 1, except with quality values omitted. Each of the tools was run on a FASTA file (with no qualities) instead of a FASTQ file. Quake and Bless are omitted as they require quality values. The simulated error rate is 1% in these experiments.

Supplementary Table 2: Simulation results using the Art simulator
Here we give accuracy results from an evaluation similar to that shown in Table 1, except that the simulation was conducted using art_illumina v2.[1] from the Art [2] package.

Supplementary Table 3: Simulation results with C. elegans genome
Evaluation for the simulated data set from the C. elegans genome with 35× coverage and 1% error rate using Mason [1] v0.1.2. The row labeled k gives the selected k-mer sizes.

Supplementary Table 4: Alignment statistics for E. coli dataset using --very-fast
Alignment statistics for the 75× E. coli data set using Bowtie 2 [3] v2.2.2 with --very-fast. The column labeled k gives the selected k-mer sizes.