Definitions
We use i to index the position in a string S and let S[i, k] denote a k-mer substring at position i in S covering the k positions \(i,\ldots ,i+k-1\) in s. We will consider 0-indexed strings. We let \(|\cdot |\) denote both the length operator applied to strings and the cardinality operator applied to sets. We refer to a subsequence of a string as a set of ordered letters that can be derived from a string by deleting some or no letters without changing the order of the remaining letters. A substring is a subsequence where all the letters are consecutive. Our fuzzy seeds produced from S are subsequences of S since they consist of two syncmers \(k_1\) and \(k_2\) that do not necessarily overlap on S. The syncmers are concatenated after their extraction to a string \(k_{12} \doteq k_1k_2\) that constitutes the strobemer seed.
We use 2-bit encoding to store nucleotides with 00, 01, 10, and 11 representing A, C, G, and T. With this bit encoding, each string is associated with a sequence of bits that can be interpreted as an integer. For example, \(TCA = 110100 = 52\). From now on, we will assume that all strings are manipulated by manipulating their sequence of bits. We use h to denote a hash function \(h:\{0,1\}^{*}\xrightarrow []{} \{0,1\}^{*}\) mapping a sequence of bits to another sequence of bits. Finally, we say that two seeds \(k_{12}\) and \(k'_{12}\) match if \(h(k_{12}) = h(k'_{12})\). We will denote a match as m and use \(m.q_s, m.q_e, m.r_s, m.r_e, m.o\) to denote the read (query) start and end positions, the reference start and end positions, and the orientation of the match, respectively.
Open syncmers
Open syncmers is a k-mer subsampling method described in [22]. They are sampled based on three parameters (k, s, t) where k, s, and t are positive integers and s≤k. The open-syncmer method compares the \(k - s+1\) consecutive s-mers within a k-mer and selects the k-mer as a syncmer if the smallest s-mer occurs at position \(t \in [0, k - s+1]\) within the k-mer. The smallest s-mer is inferred from the hash value that the s-mer produces. Similar to what is commonly performed in k-mer applications, we use a canonical representation of syncmers. A canonical representation means that the lexicographically smallest syncmer out of its forward and reverse-complement sequence is stored.
Strobemers
Strobemers are described in [23] and consists of several shorter k-mers, referred to as strobes. We use the randstrobe method [23] described below. Four parameters \((n, k, w_{min}, w_{max})\) are used to define a randstrobe where n is the number of strobes, k is the length of the strobes, and \(w_{min}\) and \(w_{max}\) is the lower and upper coordinate offset to the first strobe \(k_1\) for sampling the second strobe \(k_2\) on a string S. We will here consider randstrobes with \(n=2\), i.e., consists of two strobes. Let \(k_1\) have coordinate i on S and \(\mathcal {W}\) be the set of k-mers from the substring (window) \(S[i +w_{min}:i+w_{max}+k-1]\). Then, the randstrobe samples strobe \(k_2\) according to the following sampling function [23]:
$$\begin{aligned} k_2 = \underset{k' \in \mathcal {W}}{\text {argmin}} \; (h(k_1) + h(k')) \; \& \; p . \end{aligned}$$
(1)
where & is the bitwise AND operator and p is a 64-bit bitmask consisting of 1s at the p leftmost bits and remaining 0s.
Modifications to strobemers
Strobemers was first described as being produced over the set of all k-mers in a sequence [23]. The first modification we make to strobemers as originally described is that we will compute the strobemers over syncmers. Therefore, from now on \(k_1, k_2\), and \(k'\) are used to denote the subsampled set of k-mers that are open syncmers. We will let \([w_{min},w_{max}]\) refer to the lower and upper number of syncmers downstream from \(k_1\) where we sample \(k_2\) from. Assume that \(k_1\) has coordinate i, then we let \(\mathcal {W}_s\) denote the set of syncmers in the substring \(S[i +w_{min}:i+w_{max}+k-1]\).
A second modification to the strobemers [23] is that we use a skewed sampling function that selects nearby syncmers more frequently. The sampling skew for sampling the second syncmer \(k_2\) is produced from
$$\begin{aligned} k_2 = \underset{k' \in \mathcal {W}_s }{\text {argmin}} \; B( \; (h(k_1) ^\wedge h(k')) \; \& \; p ). \end{aligned}$$
(2)
Here, \(^\wedge\) is the bitwise XOR operator, and B counts the number of set bits. In other words, B returns the number of set bits among the \(p \in [1,64]\) leftmost bits in the 64-bit integer produced after the XOR operation of the hash values of the two strings. Function (2) maps the value space down to [0,p-1] and collisions are resolved by picking the first leftmost strobe. Therefore, a lower value on p results in more often picking nearby strobes. An example distribution is shown in Additional file 1: Fig. S1. We found that function (2) gave significantly improved performance on shorter reads (100–150nt) compared to function (1).
The third modification to strobemers is our stored hash value for the strobemer. We store for \(k_{12}\) the value \(h'(k_{12}) \doteq h(k_1)/2 +h(k_2)/2\). The hash function \(h'\) is symmetric (\(h'(k_{12}) = h'(k_{21})\)) and together with canonical syncmers, it produces the same hash value if the strobemer is created from forward and reverse complement direction. It is stated in [23] that symmetrical hash functions are undesirable for mapping due to unnecessary hash collisions. However, when masking highly repetitive seeds as commonly performed in aligners [18], it turns out that a symmetrical hash function helps to avoid sub-optimal alignments when using strobemers. The reason is the same as for using canonical k-mers in mapping and overlap detection algorithms. Namely, it allows for consistent masking and treatment of forward and reverse complement mapping locations. We will now describe why.
Assume we would use an asymmetric hash function, such as \(h''(k_{12}) = h(k_1)/2 + h(k_2)/3\) proposed in [23]. Also assume that strobemer seeds \(k_{12}\) and \(k_{21}\) are both found in forward orientation the reference due to, e.g., inversions. In this case, only \(k_{12}\) may be masked because of its distinct hash value to \(k_{21}\). Now, consider a read in which we extract \(k_{12}\) in forward direction and \(k_{21}\) in reverse complement direction. If the read has an optimal match to forward direction with seed \(k_{12}\) (masked on reference), we would still find the suboptimal match of \(k_{21}\) of the read in reverse complement orientation to the reference. By using a symmetric hash function, we guarantee to mask the same strobemers in both directions. We observed that such cases are common on, e.g., chromosome X in the human genome.
Another benefit of this symmetrical value which does not hold for exact seeds, is that we can use false symmetrical matches to our benefit. A false symmetrical match is when the forward seed starting with syncmer \(k_1\) and reverse complement starting with syncmer \(k_2\) become linked as seeds \(k_{12}\) and \(k_{21}\), respectively, hence \(h'(k_{12}) = h'(k_{21})\). This happens relatively frequently but is not guaranteed. That is, even if the minimizing syncmer for \(k_1\) is \(k_2\), \(k_1\) does not need to be minimizing syncmer \(k_2\). However, the beneficial scenario happens when we have a false symmetric match and, for example, the forward orientation seed is destroyed because of mutations. In this case, it is not guaranteed that the match in the other orientation is destroyed. Thus, we get a false symmetrical match that helps us locate the read location on the genome, which is useful for reads with very few matches. The event of false symmetrical matches was realized and implemented in 0.6.1 in strobealign, leading to slightly improved accuracy.
Indexing
We first construct open syncmers from the reference sequences and then link two open syncmers together using the randstrobe method with Eq. (2) as sampling function. A beneficial characteristic with open syncmers is that the the same syncmers will be created from the forward and reverse complement strand if \(k-s+1\) is odd and \(t= \lceil (k-s+1)/2\rceil\). Conveniently, it was shown that choosing \(t = \lceil (k-s+1)/2\rceil\) is the optimal parameter for sequence mapping [24]. We compute such canonical open syncmers (using \(k=20\), \(s=16\), \(t=3\) as default values) to produce a subsampling of roughly 20% of the k-mers sampled, which is similar to \(w=10\) in the minimizer sampling method. As for forming the strobemers, we compute the second strobe from a window of \([w_{min},w_{max}]\) downstream syncmers to the first strobe, where we set \(w_{min}\) and \(w_{max}\) dependant on the read length based on an experimental evaluation. See details in “Implementation details” section.
We store tuples \((h'(k_{12}), r_s, v)\) in a flat vector V where \(h'(k_{12})\) is the 64-bit integer hash value of the strobemer, \(r_s\) is the coordinate start of the first strobe (32-bit integer), and v is a 32-bit integer containing reference id (rightmost 24 bits) and the offset of the second strobe (leftmost 8 bits). We sort V by hash values and construct a hash table with hash values as keys pointing to offset and the number of occurrences of the hash value in V. By lookup of \(h'(k_{12})\), we know which segment in V to iterate over to find query matches. This type of index representation has been used previously [18, 21] and was suggested to us by Heng Li [43]. Finally, similarly to minimap2, we mask (ignore) a top fraction of repetitive strobemers in the reference. This value is a parameter to strobealign and is by default set to 0.0002, similarly to minimap2.
Finding candidate mapping sites
Strobealign computes canonical open syncmers from the read similarly to what is described above to index the reference. Since the created syncmers are canonical, we can compute forward and reverse complemented strobemers by iterating over the syncmers in forward and reverse order, respectively. Computing strobemers in forward and reverse orientation gives us a vector of tuples \((q_s,q_e,o)\) representing start coordinate \(q_s\), end coordinate \(q_e\), and orientation o (Boolean value with 0 representing forward and 1 representing reverse complement), of the strobemer on the read.
If a strobemer is found in the reference, it will have one or more coordinate-tuples \((r_{id},r_s, r_e)\) in the reference. We call \(m=(r_{id},r_s, r_e,q_s,q_e,o)\) a match between the query and the reference. Let d be the difference in length between the strobemer on the reference and query. If several matches are found, Strobealign iterates over the matches, stores the lowest observed d during iteration, and saves only the matches with the current lowest d. This approach is not the same as first computing the lowest d and iterating a second pass to store only matches with the lowest d. We chose the former as we tried both and observed close to no difference in accuracy while slightly increasing runtime in the latter case due to the extra iteration.
From the stored matches, strobealign constructs merged matches \(\mathcal {M}\) which are similar but slightly more stringent to Non-overlapping Approximate Matches (NAMs) that are defined in [23]. Merged matches are produced as follows. We iterate over all matches in ascending order in the read and join two matches m and \(m'\) into a merged match if it holds that
-
(i)
\((m.r_{id} == m'.r_{id}) \; \& \& \; (m.o == m'.o)\)
-
(ii)
\(m.q_s < m'.q_s \le m.q_e\)
-
(iii)
\(m.r_s < m'.r_s \le m.r_e\)
-
(iv)
and if one of the following holds; \((m.q_s< m'.q_s< m.q_e< m'.q_e) \wedge (m.r_s< m'.r_s< m.r_e < m'.r_e)\) or \((m.q_s< m'.q_s< m'.q_e< m.q_e) \wedge (m.r_s< m'.r_s< m'.r_e < m.r_e)\).
In other words, the matches need to (i) come from the same reference and have the same direction, (ii-iii) overlap on both query and reference, and (iv) two strobemer matches need to have a consistent ordering of the four strobes on the reference and the query. Specifically, a NAM requires only (i-iii). There are scenarios due to local repeats where, for example, \((m.q_s< m'.q_s< m.q_e < m'.q_e)\) is the order on the query but \((m.r_s< m'.r_s< m'.r_e < m.r_e )\) is the order on the reference invalidating (iv). We consider such cases as separate matches.
If m is the current considered match in the iteration over the matches in a query, we refer to all matches \(m''\) with \(m''.q_s < m.q_s \le m''.q_e\) as open matches and \(m''.q_e < m.q_s\) as closed matches. While iterating over the matches in a read, we keep a vector of currently open merged matches, and filter out the closed matches in this vector. In a merged match \(\mathcal {M}\) we keep information of how many matches were added, the position of the first and last strobe on query and reference, and the orientation on the reference genome. After the final match on the query, we close all merged matches. The closed matches are the final merged matches that constitute the candidate mapping locations.
Computing MAPQ score
After merging matches, each merged match \(\mathcal {M}\) consists of a number of matches \(|\mathcal {M}|\) and a total span-range of the merged match on both the query \(a = \mathcal {M}.q_e - \mathcal {M}.q_s\) and the reference \(b =\mathcal {M}.r_e - \mathcal {M}.r_s\). We define the score of \(\mathcal {M}\) as \(S_{\mathcal {M}}=( \min \{a,b\} - |a-b|) \cdot |\mathcal {M}|\), which acknowledges only the minimum span over the query and reference and penalizes if there is a difference in the span lengths. We compute the MAPQ score similarly to minimap2 but substitute minimap2’s chain score [18] to our merged match score \(S_{\mathcal {M}}\). That is, if \(S_1\) and \(S_2\) are the top two scoring merged matches for a read, the MAPQ is computed by
$$\begin{aligned} MAPQ =40(1-S_2/S_1) \cdot \min \{1, |\mathcal {M}|/10\} \cdot \log S_1. \end{aligned}$$
Single-end alignment
Merged matches are produced and scored as described above and constitute the candidate mapping regions. For each candidate region sorted order with respect to the score, we extract segments on the reference defined by coordinates \((\mathcal {M}.r_s - \mathcal {M}.q_e, \mathcal {M}.r_e + (|q| - \mathcal {M}.q_e))\), where |q| is the length of the read. If \(| \mathcal {M}.q_e - \mathcal {M}.q_s | = | \mathcal {M}.r_e - \mathcal {M}.r_s|\), we compute the Hamming distance between the read and the extracted reference segment. Otherwise, if the distance between merged match is different on the reference and query due to, e.g., indels, we send the sequences to alignment with ssw [44]. We use a match score of 1 and alignment penalties of 4, 6, and 1 for mismatch, gap open, and gap extension, respectively. Additionally, if the computed Hamming distance is larger than 0.05|r| where |r| denotes the read length, we perform an additional alignment with ssw as, theoretically, there may be more than one indel within the mapped location that would lead to the same match lengths on the read and the reference.
Rescue mode
A read could have few or zero matches if all the strobemers extracted from the read were masked due to being too abundant. The abundance cutoff, which we denote as A, is controlled with a parameter -f (default value 0.0002), as in minimap2. For example, A is between 30 and 50 for hg38 depending on the values we use for parameters k, \(w_{min}\), and \(w_{max}\), described in “Implementation details” section. If more than 30% of strobemers were masked when finding matches, strobealign enters a rescue mode where it considers a higher threshold. In the rescue mode, strobealign sorts the seeds according to the abundance on the reference. Then it selects all the seeds below an abundance of R (R is a positive integer parameter with default value 2) and, if this still produces fewer than 5 seeds, it uses 1000 as a hard abundance threshold. If there are still 0 matches, the read is treated as unmapped.
Paired-end alignment
Similar to the single-end alignment mode, strobealign computes merged matches for both mates within the read pair and employs an identical rescue mode if there are too many masked strobemer seeds, as described for the single-end alignment. There are, however, two additional components in the paired-end alignment mode. Firstly, strobealign employs a joint scoring of candidate location based on expected insert size (similar to BWA-MEM). Secondly, strobealign can enter a rescue mode even for a read with zero matches based on the mate’s location. We describe the two components below.
For the joint scoring, strobealign first sorts the candidate map locations based on the total seed count for the two mates in a read pair. Then, strobealign finds the best candidate locations from a combined MAPQ score described below. Let \(|\mathcal {M}_i^1|\) and \(|\mathcal {M}_j^2|\) be the number of matches in i-th and j-th merged match for the first and the second mate in the read-pair, respectively. If it holds for some i and j that the two merged matches are on the same chromosome in the correct relative orientation with a mapped distance \(< \mu + 10\sigma\), the joint map-location count \(C_{ij}\) is simply \(C_{ij} = |\mathcal {M}_i^1|\) + \(|\mathcal {M}_j^2|\). We also add the individual candidate map location counts obtained as \(C_{i-} = |\mathcal {M}_i^1|\) and \(C_{-j} = |\mathcal {M}_j^2|\) for the two mates individually.
For the scores in order of highest total seed count first, strobealign performs base-level alignment of each mate (as described in the “Single-end alignment” section). The alignment score \(S_{ij}\) of such aligned pair is then computed as
$$\begin{aligned} S_{ij} = SW_{i} + SW_{j} + \log N(d_{ij},\mu ,\sigma ) \end{aligned}$$
where SW denotes the Smith-Waterman alignment score, and \(d_{ij}\) denotes the distance between the mates on the genome. The individually mapped reads (e.g., if on different chromosomes) are given a score
$$\begin{aligned} S_{ij} = SW_{i} + SW_{j} -10. \end{aligned}$$
Since -10 corresponds to more than 4 standard deviations away, this is the score cutoff which prefers the reads to be mapped individually at their respective locations with the highest SW score.
Mate rescue mode
If one of the mates does not have any merged match, we perform base level alignment of the mate without merged match within a genomic segment of \([0,\mu +5\sigma ]\) nucleotides away in the expected direction from the location of the mate with a merged match.
Implementation details
Similar to the default parameters in minimap2, we consider the top 20 MAPQ (or joint MAPQ) scoring candidates for alignment, and we implement a drop-off score threshold of 0.5 (score to the highest score). In addition to these parameters, we employ two additional optimizations. First, if we encounter a perfect match (no mismatches), we stop and report the alignment even if there are remaining candidates above the drop-off parameter. Second, suppose we have encountered an alignment with an edit distance of 1. In that case, we do not call base-level alignment for remaining candidates since a call to base level alignment implies that the edit distance is at least 1, as we described in the “Single-end alignment” section. Strobealign also supports multithreading using openMP. If more than one thread is specified, strobealign will parallelize the alignment step by splitting the reads into batches of 1 million reads to be processed in parallel.
The median read length (\(\tilde{x}\)) is estimated from the first 500 reads in the read input. As for selecting values of k, s, p, \(w_{min}\) and \(w_{max}\), we set suitable parameters given \(\tilde{x}\) based on experimental evaluation of accuracy and runtime. We let \(w_{min} = k/(k-s+1)+ l\) and \(w_{min} = k/(k-s+1)+ u\) where l and u are integers that specify lower and upper offset. Then, we choose the following parameter tuples for (k, s, p, l, u) given \(\tilde{x}\).
$$\begin{aligned} (k,s,p,l,u)= \left\{ \begin{array}{ll} (20,4,8,-4,2),&{} \text {if } \tilde{x} \le 75\\ (20,4,8,-2,2), &{} \text {if } 75< \tilde{x} \le 125 \\ (20,4,8,1,7), &{} \text {if } 125< \tilde{x} \le 175 \\ (20,4,8,4,13), &{} \text {if } 175< \tilde{x} \le 275 \\ (22,4,8,2,12), &{} \text {if } 275< \tilde{x} \le 375 \\ (23,6,8,2,12), &{} \text {if } 375 < \tilde{x} \end{array}\right. \end{aligned}$$
With a read length of 200nt and the parameters in this study, roughly \((1/5)*200 = 40\) syncmers are produced for each read, and roughly 30 strobemers are created in each direction. Naturally, these values are reduced for shorter reads which impacts sensitivity. A read of 100 nt will have on average 20 syncmers and only 10 strobemers. We could consider lower \(w_{max}\) to produce more strobemers for shorter reads at the cost of memory.
Although it is exponentially less likely to have open syncmers sampled further away from the mean sampling density [22], they do not have a window guarantee and may be sparser sampled in some regions. Therefore, we have a hard limit on the maximum seed size as a parameter to strobealign (m), where it defaults to \(\tilde{x} - 50\). This means that, in some cases, the maximum seed size may be lower than the distance to the downstream syncmer corresponding to \(w_{max}\). With the parameters we use above, this happens in less than 0.1% of the seeds. Finally, on rare occasions there is no open syncmer within the downstream window, e.g., due to regions of N’s in centromeres on hg38. In these cases we use only the first syncmer as the strobemer seed. This happens on hg38 for 0.0007% of the \(\tilde{x} = 150\) seeds (3,963 out of 544M seeds).
The E-hits metric
Let N be the total number of seeds and \(M \le N\) the number of distinct seeds produced over a set of reference sequences by any seeding method. Let \(i \in [1,M]\) be an index variable over the set of distinct seeds, and \(x_i\) denote the number of times seed i is produced. Then, \(p_i = \frac{x_i}{N}\) corresponds to the probability that seed i is picked uniformly at random over the multiset of seeds produced by the seeding method. The expected number of occurrences of a randomly sampled seed i is
$$\begin{aligned} E[X] = \sum\limits_{i = 1}^{M} x_i p_i = \sum\limits_{i = 1}^{M} x_i \frac{x_i}{N} = \frac{1}{ N } \sum\limits_{i = 1}^{M} x_i^2 \end{aligned}$$
We refer to E[X] as E-hits. The connection to read alignment is immediate. A seed of length s is captured in a read of length r if the read starts at any of the \(r-s+1\) locations upstream to the start of the seed. Assuming uniformly sampled reads, the probability of the seed being captured is \((r-s+1)/|G|\) where |G| if the length of the genome. Since every seed is captured with the same probability \((r-s+1)/|G|\), this is a constant. Therefore, picking a read at a random location and extracting a seed preserves the relative frequency to directly picking a seed at random from the index. Note however that the probability \((r-s+1)/|G|\) does not hold near the end of the reference. Assuming the reference is large enough, this has negligible effect. Therefore, if reads are sampled uniformly at random over the reference sequences, E-hits measure the expected number of matches we obtain for error-free seeds extracted from the reads. The uniform distribution is a common assumption for, e.g., Illumina genome sequencing reads, albeit not fully accurate.
The E-hits metric is conceptually similar to the expected contig size covering a random position in the genome (E-size) as defined in [45], hence E-hits’ similar denotation. E-hits is also conceptually similar to Eq. 3 in [46] for calculating what the authors define as expected hit rate. The difference between E-hits and the expected hit rate is that the denominator is \(N^2\) in the latter metric. This denominator results in a difference both in the intuition and in the outcome when ranking seed constructs. First, E-hits has a direct interpretation, namely the expected number of hits for a randomly selected seed, while expected hit rate results in a typically very small fractional number (dividing E-hits with the total number of seeds). Second, there is a difference in ranking the outcome when comparing seeding methods of different densities. For example, for seeds of size 60, E-hits is about 30 for k-mers and 20 for minimizers with density 1/5 (Fig. 2A). To get the expected hit rate for k-mers we divide with N and for minimizers we divide with N/5. This results in a smaller expected hit rate for k-mers. Note also that E-hits is different from the popular E-value used in BLAST [47]. The E-value is a theoretical computation that measures the expected number hits that could be found by chance under random nucleotide distribution given a database size and an exact k-mer seed. E-hits is computed from the actual reference sequences and any seeding protocol.
Memory usage
Strobealign has a peak memory usage of about 25–33 Gb for hg38 (Additional file 1: Fig. S5). With the parameter settings we investigated in this evaluation for read lengths 50–300 nt on hg38 (syncmer subsampling rate of 1/5), strobealign stores roughly 544 million seeds in memory. Furthermore, the size of the index scales with the number of unique seeds. For example, strobealign uses only about 1.5 times more memory (50Gb) when aligning to rye (2.3 times larger than hg38) as rye is a repetitive genome with a lower fraction of unique seeds. Similarly, the maize index is only 15Gb (0.5 times the index of human). A detailed discussion on the memory and implementation is found in Suppl Note G.