 Method
Matchtigs: minimum plain text representation of k-mer sets
Genome Biology volume 24, Article number: 136 (2023)
Abstract
We propose a polynomial algorithm computing a minimum plain text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.
Background
Motivation
The field of k-mer-based methods has seen a surge of publications in recent years. Examples include alignment-free sequence comparison [1,2,3], variant calling and genotyping [4,5,6,7,8], transcript abundance estimation [9], metagenomic classification [10,11,12,13], abundance profile inference [14], indexing of variation graphs [15, 16], estimating the similarity between metagenomic datasets [17], species identification [18, 19] and sequence alignment to de Bruijn graphs [20,21,22,23]. All these methods are based mainly on k-mer sets, i.e. on the existence or non-existence of k-mers. They ignore further information such as the predecessor and successor relations between k-mers, which are represented by the topology of a de Bruijn graph [24,25,26].
On the other hand, many classical methods such as genome assemblers [26,27,28,29,30,31,32,33,34,35] and related algorithms [36,37,38] are based on de Bruijn graphs and their topology. To increase the efficiency of these methods, the graphs are usually compacted by contracting all paths whose inner nodes all have in- and outdegree one. These paths are commonly known as unitigs, and their first usage can be traced back to [39]. Since unitigs contain no branches at their inner nodes, they do not alter the topology of the graph, and in turn enable the exact same set of analyses. Highly engineered solutions are available to compute a compacted de Bruijn graph by computing unitigs from any set of strings in memory [23] or with external memory [33, 40, 41]. Incidentally, the set of unitigs computed from a set of strings is also a way to store a set of k-mers without repetition, and thus in reasonably small space. However, the necessity to preserve the topology of the graph makes unitigs an inferior choice for representing k-mer sets, as the sum of their lengths is still far from optimal, and they consist of many separate strings. The possibility of ignoring the topology for k-mer-based methods opens more leeway in their representation, which can be exploited to reduce the resource consumption of existing and future bioinformatics tools.
The need for such representations becomes apparent when observing the amount of data available to bioinformaticians. For example, the number of complete bacterial genomes available in RefSeq [42] more than doubled between May 2020 and July 2021, from around 9000 [43] to around 21,000. And with the ready availability of modern sequencing technologies, the amount of genomic data will increase further in the coming years. In turn, analysing this data requires an ever-growing amount of computational resources. This could be relieved by a smaller representation that reduces RAM usage and speeds up analysis tools, thereby allowing larger pipelines to be run with fewer computational resources. To fulfil this goal, a plain text representation would be the most useful: if the representation has to be decompressed before usage, then this likely erases the savings in RAM and/or adds runtime overhead. Formally, a plain text representation is a set of strings that contains each k-mer from the input strings (forward, reverse-complemented, or both) and no other k-mer. We denote such a set as a spectrum preserving string set (SPSS). Note that this definition differs from the SPSS defined by Rahman and Medvedev [44], who consider an SPSS to include the additional restriction that each k-mer must be present at most once. Such a plain text representation has the great advantage that some tools (like, e.g. Bifrost’s query [23]) can use it without modification. We expect that even those tools that require modifications would not be hard to modify (like, e.g. SSHash [45], which we modified here as an example).
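To make this definition concrete, the following sketch (our illustration, not part of any of the tools discussed) checks whether a candidate set of strings is an SPSS for some input strings, identifying each k-mer with its reverse complement as is standard for DNA:

```python
def canonical_kmers(strings, k):
    """Collect the canonical k-mer set of a collection of strings.
    A k-mer and its reverse complement count as the same object,
    represented by the lexicographically smaller of the two."""
    comp = str.maketrans("ACGT", "TGCA")
    kmers = set()
    for s in strings:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            rc = kmer.translate(comp)[::-1]
            kmers.add(min(kmer, rc))
    return kmers

def is_spss(representation, inputs, k):
    """An SPSS contains exactly the k-mers of the input, and no others."""
    return canonical_kmers(representation, k) == canonical_kmers(inputs, k)

# The strings ACGT and CGTA overlap in k-1 = 2 characters for k = 3,
# so the single string ACGTA is a smaller SPSS for the same k-mer set.
assert is_spss(["ACGTA"], ["ACGT", "CGTA"], 3)
```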
Related work
The concept of storing a set of k-mers in plain text without repeating k-mers, to achieve a smaller and possibly simpler representation, has recently been simultaneously discovered and named spectrum preserving string sets [without k-mer repetition] by Rahman and Medvedev [44] as well as simplitigs by Břinda, Baym and Kucherov [43]. To avoid confusion with our redefinition of the SPSS, we call this concept simplitigs in our work. Both Rahman and Medvedev and Břinda, Baym and Kucherov propose an implementation that greedily joins consecutive unitigs to compute such a representation. The UST algorithm by Rahman and Medvedev works on the node-centric de Bruijn graph of the input strings and finds arbitrary paths in the graph starting from arbitrary nodes. Each node is visited by exactly one path, and whenever a path cannot be extended forwards (because a dead-end was reached, or all successor nodes have already been visited), a new path is started from a new random node. Before a new path is started this way, if any successor node of the finished path marks the start of a different path, then the two paths are joined. During the traversal, the unitigs of the visited nodes are concatenated (without repeating the \(k-1\) overlapping characters) and those strings are the final output. Břinda, Baym and Kucherov’s greedy algorithm to compute simplitigs (for which the authors provide an implementation under the name ProphAsm [43]) does not construct a de Bruijn graph, but instead collects all k-mers into a hash table. It then extends arbitrary k-mers both forwards and backwards arbitrarily, until they cannot be extended anymore, without repeating any k-mers. The extended k-mers are the final output.
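The hash-table-based extension can be sketched as follows (a simplified illustration in the spirit of ProphAsm; for brevity it ignores reverse complements, which the real tools handle):

```python
def greedy_simplitigs(kmers, k):
    """Greedily extend arbitrary unused k-mers forwards and backwards
    until no unused k-mer continues them; no k-mer is used twice."""
    remaining = set(kmers)
    simplitigs = []
    while remaining:
        s = remaining.pop()
        extended = True
        while extended:  # extend forwards by one character at a time
            extended = False
            for c in "ACGT":
                nxt = s[-(k - 1):] + c
                if nxt in remaining:
                    remaining.remove(nxt)
                    s += c
                    extended = True
                    break
        extended = True
        while extended:  # extend backwards by one character at a time
            extended = False
            for c in "ACGT":
                prv = c + s[:k - 1]
                if prv in remaining:
                    remaining.remove(prv)
                    s = c + s
                    extended = True
                    break
        simplitigs.append(s)
    return simplitigs

# The chain ACG -> CGT -> GTA collapses into one string of length 5.
assert greedy_simplitigs({"ACG", "CGT", "GTA"}, 3) == ["ACGTA"]
```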
Both heuristics greatly reduce the number of strings (string count, SC) as well as the total number of characters in the strings (cumulative length, CL) required to store a k-mer set. The reduction in CL directly relates to a lower memory consumption for storing a set of strings, but the reduction in SC is also very useful. For example, when storing a set of strings, not only the strings themselves need to be stored, but also some index structure telling where they start and end. This structure can be smaller if fewer strings exist. There might even be cases where, by increasing CL and decreasing SC, the overall size of the representation of the string set (strings + index structure) can be decreased. However, to stay independent of any specific data structure, we only optimise CL. Břinda, Baym and Kucherov show that both SC and CL are greatly reduced for very tangled de Bruijn graphs, like graphs for single large genomes with small k-mer length and pangenome graphs with many genomes. Additionally, they show merits of using heuristic simplitigs in downstream applications, like an improvement in the run time of k-mer queries using BWA [46], as well as a reduction in the space required to store heuristic simplitigs compressed with general-purpose compressors compared to storing unitigs compressed in the same way. Rahman and Medvedev show a significant reduction in SC and CL on various data sets as well, and also show a reduction in the space required to store heuristic simplitigs over unitigs when compressed with general-purpose compressors. Khan et al. [40] also provide an overview of using heuristic simplitigs for various genomes, including a human gut metagenome.
The authors of both papers also give a lower bound on the cumulative length of simplitigs, and show that their heuristics achieve representations with a cumulative length very close to the lower bound for typical values of k (31 for bacterial genomes and 51 for eukaryotic genomes). Břinda, Baym and Kucherov also experiment with lower values of k (\(<20\) for bacterial genomes and \(<30\) for eukaryotic genomes), which make the de Bruijn graph denser, up to almost complete, and show that in these cases their heuristic does not get as close to the lower bound as for larger values of k. Further, the authors of both papers consider whether computing minimum simplitigs without repeating k-mers might be NP-hard. This has recently been disproven by Schmidt and Alanko [47]: in fact, simplitigs with minimum cumulative length can be computed in linear time using a subset of the matchtigs algorithm. Their algorithm constructs a bidirected arc-centric de Bruijn graph in linear time using a suffix tree, and then Eulerises it by inserting breaking arcs. It then computes a bidirected Eulerian circuit in the Eulerised graph and breaks it at all breaking arcs. The strings spelled by the resulting walks are the optimal simplitigs, named Eulertigs. Specifically, they leave out all parts of the matchtigs algorithm that relate to concatenating unitigs by repeating k-mers, and instead only concatenate consecutive unitigs in an optimal way. In line with the previous results about the lower bounds [43, 44], Eulertigs are only marginally smaller than the strings computed by previous heuristics. All this suggests that no further progress is possible when k-mer repetitions are not allowed in a plain text representation.
There are already tools available that use simplitigs. The compacted de Bruijn graph builder cuttlefish2 [40] has an option to output simplitigs instead of maximal unitigs. A recent proposal for a standardised file format for k-mer sets explicitly supports simplitigs [48]. Also, the k-mer dictionary SSHash [45] uses simplitigs to achieve a smaller representation and to answer queries more efficiently. Here, the higher efficiency is achieved both by reducing the space required to store the k-mers themselves, and by the lower string count reducing the size of the index data structures on top. Further, a recent proposal to index genomic sequences as opposed to k-mer sets works with simplitigs without modification [49], and with minor extra bookkeeping also for general SPSSs. In that work, the size of the SPSS is very minor compared to the size of the index; however, major components of the index may be smaller if the SPSS contains fewer strings, which can be achieved by using greedy matchtigs. Our algorithms were also integrated into the external-memory de Bruijn graph compactor GGCAT [41], which was easy to do (source: personal communication with Andrea Cracco).
For compressing multiple k-mer sets, in [50] the authors use an algorithm similar to ProphAsm and UST that separates the unique k-mer content of each set from the k-mer content shared with other sets. For compressing k-mer sets with abundances, the plain text representation REINDEER [51] uses substrings of unitigs with k-mers of similar abundance, called monotigs. In the wider field of finding small representations of k-mer sets that are not necessarily in plain text, there exists for example ESS-Compress [52], which uses an extended DNA alphabet to encode similar k-mers in smaller space.
Our contribution
In this paper, we propose the first algorithm to find an SPSS of minimum size (CL). Moreover, we show that a minimum SPSS with repeated k-mers is polynomially solvable, based on a many-to-many min-cost path query and a min-cost perfect matching approach. We further propose a faster and more memory-efficient greedy heuristic to compute a small SPSS that skips the optimal matching step, but still produces close-to-optimal results in CL, and even better results in terms of SC.
Our experiments on references and read datasets of large model organisms and bacterial pangenomes show that the CL decreases by up to 26% and the SC by up to 90% over UST or ProphAsm (on larger datasets, sometimes UST cannot be run because BCALM2 cannot be run, and sometimes ProphAsm cannot be run because it does not use external memory). Compared to unitigs, the CL decreases by up to 59% and the SC by up to 97%. These improvements often come at just minor costs, as computing our small representation (which includes a run of BCALM2) takes less than twice as long as computing unitigs with BCALM2, and takes less than 37% longer in most cases. Even if the memory requirements for large read datasets increase, they stay within the limits of a modern server.
Finally, we show that besides its smaller size, a minimum SPSS also has advantages in downstream applications. As an example of a k-mer-based method, we query our compressed representation with the tools SSHash [45] and Bifrost [23]. These are state-of-the-art tools supporting k-mer-based queries in genomic sequences, using a representation of a k-mer set as a set of unitigs. By simply replacing unitigs with the strings produced by our greedy heuristic, without modifications to Bifrost and with a minor modification to SSHash disabling features that require unique k-mers, we get a speedup of up to 4.26\(\times\) over unitigs, and up to 2.10\(\times\) over strings computed by UST and ProphAsm. We call the modified version of SSHash “SSHash-Lite”.
Results
Basic graph notation
We give detailed definitions of our notation below in the “Preliminaries” section, but already give an intuition of the notation required for the results section here. Note that our notation deviates from standard mathematical bidirected graph notations, but it is useful in practice, as it allows implementing bidirected graphs on top of standard graph libraries. We assume that the reader is familiar with the general concept of de Bruijn graphs.
Our bidirected graphs are arc-centric bidirected de Bruijn graphs. Arc-centric means that k-mers are on the arcs, and nodes represent \(k-1\) overlaps between k-mers. We represent the bidirected graph as a doubled graph, i.e. by having a separate forward and reverse arc for each k-mer and a separate forward and reverse node for each \(k-1\) overlap. In this graph, binodes are ordered pairs \((v, v^{-1})\) of nodes that are reverse complements of each other, and biarcs are ordered pairs \((e, e^{-1})\) of arcs that are reverse complements of each other. Two biarcs \((e, e^{-1})\) and \((f, f^{-1})\) are consecutive if the plain arcs e and f are consecutive, i.e. f leaves the node entered by e. A biwalk is a sequence of consecutive biarcs. If a biwalk visits a biarc, then it is considered to cover both directions of the biarc. See Fig. 1a for an example.
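The doubled-graph convention can be illustrated with a short sketch (ours, not the paper’s implementation): each k-mer yields a forward arc and its reverse complement a reverse arc, with \((k-1)\)-mers as nodes:

```python
def reverse_complement(s):
    """Reverse complement of a DNA string."""
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def doubled_graph(kmers):
    """Return the arc set of the doubled (bidirected) de Bruijn graph:
    each arc is a (tail node, head node, label) triple, where the nodes
    are the (k-1)-mer prefix and suffix of the arc's k-mer label."""
    arcs = []
    for kmer in kmers:
        for s in (kmer, reverse_complement(kmer)):
            arcs.append((s[:-1], s[1:], s))
    return arcs

# The k-mer ACG (k = 3) yields a forward arc AC -> CG labelled ACG and
# a reverse arc CG -> GT labelled CGT; together they form one biarc.
arcs = doubled_graph(["ACG"])
assert ("AC", "CG", "ACG") in arcs and ("CG", "GT", "CGT") in arcs
```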
Matchtigs as a minimum plain text representation of kmer sets
We introduce the matchtigs algorithm, which computes a character-minimum SPSS for a set of genomic sequences. While former heuristics (ProphAsm, UST) did not allow repeating k-mers, our algorithm explicitly searches for all opportunities to reduce the character count in the SPSS by repeating k-mers. Consider for example the arc-centric de Bruijn graph in Fig. 2a. When representing its k-mers without repetition as in Fig. 2b, we need 43 characters and 7 strings. But if we allow repeating k-mers as in Fig. 2d, we require only 39 characters and 5 strings. It turns out that structures similar to this example occur often enough in real genome graphs to yield significant improvements in both the character and the string count of an SPSS.
Similar to previous heuristics, our algorithm works on the compacted bidirected de Bruijn graph of the input sequences. However, we require an arc-centric de Bruijn graph; this can be easily constructed from the node-centric variant (see the “Building a compacted bidirected arc-centric de Bruijn graph from a set of strings” section). In this graph, we find a min-cost circular biwalk that visits each biarc at least once, and that can jump between arbitrary nodes at a cost of \(k-1\). This formulation is very similar to the classic Chinese postman problem [53], formulated as follows: find a min-cost circular walk in a directed graph that visits each arc at least once. This similarity allows us to adapt a classic algorithm by Edmonds and Johnson that solves the Chinese postman problem [54] (the same principle was applied in [55]). They first reduce the problem to finding a min-cost Eulerisation via a min-cost flow formulation, and then further reduce that to min-cost perfect matching using a many-to-many min-cost path query between unbalanced nodes. In a similar work [56], the authors solve the Chinese postman problem in a bidirected de Bruijn graph by finding a min-cost Eulerisation via a min-cost flow formulation. As opposed to [54, 55] and us, in [56] the authors propose to solve the min-cost flow problem directly with a min-cost flow solver. We believe this to be infeasible for our problem, since the arbitrary jumps between nodes require the graph in the flow formulation to have a number of arcs quadratic in the number of nodes.
Our resulting algorithm is polynomial, but while it runs fast for large bacterial pangenomes, building the matching instance proved practically infeasible for very large genomes (\(\ge 500\) Mbp). This is because each of the min-cost paths found translates into roughly one edge in the matching graph, and the number of min-cost paths rises quadratically as the graph gets denser. Thus, our algorithm ran out of memory when constructing the instance for larger genomes, and for those where we were able to construct the matching instance, the matcher itself suffered from integer overflows, since it uses 32-bit integers to store the instance. Hence, for practical purposes, we introduce a greedy heuristic to compute approximate matchtigs. This heuristic does not build the complete instance of the matching problem, but just greedily chooses the shortest path from each unbalanced node to Eulerise the graph. This reduces the number of paths per node to at most one, and as a result, the heuristic uses significantly less memory, runs much faster, and achieves near-optimal speedups when run with multiple threads (see Additional file 1: Supplemental figure S1). While it can in theory produce suboptimal results as in Fig. 2c, in practice the size of the greedily computed strings is very close to that of matchtigs, and the number of strings is always smaller.
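The core of the greedy Eulerisation step can be sketched on a plain directed graph (our simplified illustration: it ignores bidirectedness, compaction, and parallelism, and uses unit arc costs): for each node with more incoming than outgoing arcs, a Dijkstra search finds the nearest node with the opposite imbalance, falling back to a jump of cost \(k-1\) when no cheaper path exists:

```python
import heapq
from collections import defaultdict

def greedy_eulerise(arcs, jump_cost):
    """For each node needing extra outgoing arcs, add the cheapest path
    (arc cost 1) to a node needing extra incoming arcs, or a jump of
    cost jump_cost.  Returns the added (source, target, cost) triples;
    a target of None denotes a jump to an arbitrary node."""
    out_arcs = defaultdict(list)
    imbalance = defaultdict(int)  # indegree minus outdegree
    for u, v in arcs:
        out_arcs[u].append(v)
        imbalance[u] -= 1
        imbalance[v] += 1
    # one entry per missing outgoing arc
    sources = [u for u, b in imbalance.items() if b > 0 for _ in range(b)]
    added = []
    for src in sources:
        # Dijkstra from src, stopping at the nearest deficient node;
        # paths at least as expensive as a jump are pruned.
        dist, heap, seen = {src: 0}, [(0, src)], set()
        best_cost, best_target = jump_cost, None
        while heap:
            d, u = heapq.heappop(heap)
            if u in seen or d >= best_cost:
                continue
            seen.add(u)
            if imbalance[u] < 0:
                best_cost, best_target = d, u
                break
            for v in out_arcs[u]:
                if d + 1 < dist.get(v, d + 2):
                    dist[v] = d + 1
                    heapq.heappush(heap, (d + 1, v))
        imbalance[src] -= 1
        if best_target is not None:
            imbalance[best_target] += 1  # one deficiency consumed
        added.append((src, best_target, best_cost))
    return added

# Cycle a->b->c->a with a detour b->d->a: node a lacks an outgoing arc
# and node b an incoming one, so the existing arc a->b is duplicated.
assert greedy_eulerise(
    [("a", "b"), ("b", "c"), ("c", "a"), ("b", "d"), ("d", "a")], 30
) == [("a", "b", 1)]
```

In this example, the Eulerised graph admits the circular walk a, b, c, a, b, d, a, which traverses the arc a→b twice, mirroring how matchtigs save characters by repeating k-mers.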
Moreover, the minimality of matchtigs allows us to exactly compare, for the first time, how close the simplitigs computed by heuristic algorithms are to an optimal SPSS (on smaller genomes and on bacterial pangenomes, due to the resource-intensiveness of optimal matchtigs).
Our implementations are available on GitHub (https://github.com/algbio/matchtigs) as both a library and a command line tool, both written in Rust. They support the GFA and fasta file formats, with special optimisations for fasta files produced by BCALM2 or GGCAT. Additionally, our implementations support gzip-compressed input and output, as well as outputting an ASCII-encoded bitvector of duplicate k-mers.
Compression of model organisms
We evaluate the performance of our proposed algorithms on three model organisms: C. elegans, B. mori, and H. sapiens. We benchmark the algorithms both on sets of short reads (average length 300 for C. elegans and B. mori, and 296 for H. sapiens) and on the reference genomes of these organisms. On human reads, we filter the data during processing so that we keep only k-mers that occur at least 10 times (min abundance = 10). This is because with a min abundance of 1, H. sapiens has 114 billion unique k-mers. This extreme k-mer count causes ProphAsm and our tool to run out of memory even with 2 TiB of RAM.
We use the metrics cumulative length (CL) and string count (SC) as in [43]. The CL is the total number of characters in all strings in the SPSS, and the SC is the number of strings. We evaluate our algorithms on the same large genomes as in [43], using both the reference genome and a full set of short reads of the respective species (see Table 1 for the results). Since UST as well as matchtigs and greedy matchtigs require unitigs as input, and specifically UST needs some extra information in a format only output by BCALM2 [33], we run BCALM2 to compute unitigs from the input strings. We chose \(k = 31\), as it is commonly used in k-mer-based methods. While larger k are also used for larger genomes, we use the value \(k = 31\) throughout the main matter to allow for easier comparison between results. Furthermore, for all data sets but the C. elegans reference, the matchtigs algorithm ran out of the given 256 GiB of memory, so we only compute greedy matchtigs for those.
On read data sets where we keep all k-mers, our greedy heuristic achieves an improvement of up to 26% in CL and 82% in SC over the best competitor (UST-tigs). The human read data set shows smaller improvements; however, it was processed with a min abundance of 10, yielding longer unitigs with less potential for compression. On reference genomes, the improvement in CL is smaller, at up to 7%; however, the improvement in SC is much larger, at up to 90%.
For C. elegans, where computing matchtigs is feasible as well, we observe that they yield no significant improvement in CL, and are even slightly worse in SC than the greedy heuristic. The greedy heuristic actually optimises SC more than the optimal matchtigs algorithm. That is because the matching instance in the optimal algorithm is built to optimise CL, and whenever joining two strings does not alter the CL, the choice is made arbitrarily. The greedy heuristic, on the other hand, makes as many joins as possible, as long as a join does not worsen the CL. This way, the greedy heuristic actually prioritises joining two strings even if it does not alter the CL. For more details, see the “Solving the min-cost integer flow formulation with min-cost matching” and “Efficient computation of the greedy heuristic” sections. See Additional file 1: Supplemental figure S2 for more quality measurements with different k-mer sizes and min. abundances.
We assume that the improvements correlate inversely with the average length of the maximal unitigs of the data set. Our approach achieves a smaller representation by joining unitigs with overlapping ends, avoiding the repetition of those characters. This has a natural limit of saving at most \(k-1\) characters per pair of unitigs joined together, so at most \(k-1\) characters per unitig. In turn, the maximum fraction of characters saved is bounded by \(k-1\) divided by the average length of unitigs. In Additional file 1: Supplemental figure S2, we have varied the k-mer size and min. abundance for our data sets to vary the average length of unitigs. This gives us visual evidence for a correlation between average unitig length and decrease in CL.
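As a quick numeric illustration of this bound:

```python
def max_savings_fraction(k, avg_unitig_len):
    """Upper bound on the fraction of characters an SPSS can save over
    unitigs: at most k-1 characters are saved per unitig joined."""
    return (k - 1) / avg_unitig_len

# With k = 31, unitigs averaging 100 characters can shrink by at most
# 30%, while unitigs averaging 1000 characters by at most 3%.
assert max_savings_fraction(31, 100) == 0.3
assert max_savings_fraction(31, 1000) == 0.03
```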
Our improvements often come at negligible costs in terms of time and memory. Even for read sets, the run time at most doubles compared to BCALM2 in the worst case. However, the memory consumption rises significantly for read sets. This is due to the high number of unitigs in those graphs and the distance array of Dijkstra’s algorithm, whose size is linear in the number of nodes and the number of threads. See Additional file 1: Supplemental figure S3 for more performance measurements with different k-mer sizes and min. abundances.
Compression of pangenomes
In addition to model organisms with large genomes, we evaluate our algorithms on bacterial pangenomes of N. gonorrhoeae, S. pneumoniae, E. coli, and Salmonella, as well as a human pangenome. We use the same metrics as for model organisms. For the bacterial genomes, we choose \(k = 31\), and also for the human genome, for the reasons argued above and for easier comparability of the results on the different genomes. We show the results in Table 2. See Additional file 1: Supplemental figure S4 for more quality measurements with different k-mer sizes and min. abundances, and Additional file 1: Supplemental figure S5 for more performance measurements with different k-mer sizes and min. abundances. Neither of them includes Salmonella or the human pangenome, as they take too much time.
Our algorithms improve the CL by up to 19% (using greedy matchtigs) over the best competitor and the SC by up to 70% (using greedy matchtigs). Matchtigs always achieve a slightly lower CL and slightly higher SC than greedy matchtigs, but the CL of greedy matchtigs is always at most 2% worse than that of matchtigs. We again assume that the improvements correlate inversely with the average size of unitigs, as suggested by the experiments in Additional file 1: Supplemental figure S4. These improvements come at negligible costs, using at most 15% more time and 11% more memory than BCALM2 when computing greedy matchtigs, except for the large Salmonella pangenome, which took 50% more memory. The higher memory consumption is due to the graph being more tangled, owing to the high number of genomes in the pangenome. For matchtigs, the time increases by less than a factor of three and the memory by at most 12% compared to BCALM2.
k-mer-based short read queries
Matchtigs have further applications beyond merely reducing the size required to store a set of k-mers. Due to their smaller size and lower string count, they can make downstream applications more efficient. As a concrete example, in this section we focus on membership queries. As already explained, each SPSS (unitigs, UST-tigs, matchtigs, etc.) can be considered as a (multi-)set of k-mers. Given a k-mer, a membership query is to verify whether the k-mer belongs to the set or not. We focus on exact queries rather than approximate ones, i.e. if a k-mer does not belong to the set, then the answer to the query must be “false”. Assessing the membership of a string Q longer than k symbols is based on the answers for its constituent k-mers: only if at least \(\lfloor \theta \times (|Q|-k+1)\rfloor\) k-mers of Q belong to the set is Q considered to be present in the set. The threshold \(\theta\) is therefore an “inclusion” rate, which we fix to 0.8 for the experiments in this section.
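As a small illustration, the membership decision for a whole query string reduces to the following threshold check (our sketch of the rule above):

```python
import math

def string_is_present(kmer_hits, q_len, k, theta=0.8):
    """Decide membership of a query string Q of length q_len: Q is
    present if at least floor(theta * (|Q| - k + 1)) of its
    constituent k-mers are found in the set."""
    threshold = math.floor(theta * (q_len - k + 1))
    return kmer_hits >= threshold

# A 100 bp read has 100 - 31 + 1 = 70 k-mers for k = 31; with
# theta = 0.8, at least 56 of them must be found in the set.
assert string_is_present(56, 100, 31)
assert not string_is_present(55, 100, 31)
```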
To support fast membership queries in compressed space, we build an SSHash-Lite dictionary over each SPSS. SSHash-Lite is a relaxation of SSHash [45, 57] in that it supports membership queries without requiring each k-mer to appear exactly once in the underlying SPSS. It is available at https://github.com/jermp/sshash-lite. In short, SSHash is a compressed dictionary for k-mers — based on minimal perfect hashing [58] and minimizers [59] — which, for an input SPSS without duplicates having n (distinct) k-mers, assigns to each k-mer in the input a unique integer from 0 to \(n-1\) by means of a Lookup query. The result of Lookup for any k-mer that is not contained in the input SPSS is \(-1\). Therefore, SSHash serves the same purpose as a minimal perfect hash function over an SPSS, but it is also able to reject alien k-mers. Two variants of SSHash were proposed — a regular and a canonical one. The canonical variant uses some extra space compared to the regular one, but queries are faster to answer. (For all further details, we point the reader to the original papers [45, 57].)
Now, to let SSHash query SPSSs with possibly duplicated k-mers (e.g. matchtigs), it was only necessary to modify the return value of the Lookup query to just return “true” if a k-mer is found in the dictionary, rather than its unique integer identifier (respectively, “false” instead of \(-1\) if a k-mer is not found). Therefore, SSHash-Lite can be directly used to index and query unitigs, UST-tigs, and matchtigs alike.
We compare the performance of SSHash-Lite when indexing unitigs, UST-tigs, and matchtigs in Table 3. We build the SPSSs from three datasets: a ~309k× Salmonella enterica pangenome; a 300× coverage human short read dataset filtered to exclude k-mers with an abundance lower than 10; and a 2505× human pangenome. The Salmonella pangenome was queried with 3 million random Salmonella short reads with lengths between 70 and 502, and an N75 of 302. The human queries for both the human read dataset and the human pangenome are 3 million random short reads (296 bases each) from the human read dataset.
We see that matchtigs improve the performance of membership queries in both space and time compared to unitigs and UST-tigs. While the difference is more evident when compared to unitigs, matchtigs also consistently outperform UST-tigs, achieving the lowest space usage and the fastest query time across almost all combinations of dataset and index variant (regular/canonical).
Note again that the speedup in query time is more evident on the human reads dataset, since it is much larger than the Salmonella pangenome, and it is generally less evident for the canonical index variant of SSHash-Lite, because that variant is approximately \(2\times\) faster to query than the regular one. Remarkably, regular SSHash-Lite over matchtigs achieves a 26–59% reduction in space over unitigs while also being \(4.26\times\) faster to query on the human reads dataset. Compared to UST-tigs, matchtigs still retain a \(2.10\times\) faster query time while improving space by up to 8%. These results were achieved on a typical bioinformatics compute node with many logical cores (256) and a large amount of RAM (2 TB). In Additional file 1: Supplemental table S2, we performed the same experiment on a server with a focus on single-thread performance, achieving slightly smaller improvements.
The reduction in index space when indexing matchtigs is to be attributed to the lower string count and the fewer nucleotides in the collection. The speedups achieved by SSHash-Lite when indexing matchtigs instead of unitigs can be explained as follows. When querying, SSHash-Lite streams through the k-mers of the query. At the beginning, the tig containing the first k-mer of the query is determined using a minimal perfect hash function over the minimizers of the input SPSS, as well as the position of the k-mer in the tig. For the subsequent k-mers of the query, SSHash-Lite attempts to “extend” the matching of the k-mer against the identified tig by just comparing the single nucleotide following the previous k-mer in the tig. Extending a match in this way is extremely fast, not only because just a single nucleotide needs to be compared, but also because it is a very cache-friendly algorithm, dispensing with random accesses to the index entirely. However, each time an extension is not possible (either because we have a mismatch or we have reached the end of the current tig), a “full” new search is made in the index. The search consists of evaluating the minimal perfect hash function and locating the k-mer inside another tig. Clearly, a search is much more expensive than an extension due to cache misses. Now, using longer tigs with a lower tig count — the case for matchtigs — increases the chance of extension, or equivalently, decreases the number of full searches in the index. Compared to UST-tigs, matchtigs can be faster to query exactly because allowing repeated k-mers to appear in the tigs further helps in creating opportunities for extension. Therefore, by reducing the number of full searches, we reduce the overall runtime of the query.
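The extension-versus-search trade-off can be modelled with a toy counter (our illustration; SSHash-Lite itself locates k-mers via minimal perfect hashing over minimizers, not the linear scan used here):

```python
def count_full_searches(query, tigs, k):
    """Count how many expensive 'full' searches a streamed query
    triggers: after a k-mer is located in a tig, subsequent k-mers are
    matched by 'extension', comparing one nucleotide against the same
    tig, until a mismatch or the tig's end forces a new full search."""
    searches = 0
    tig = pos = None
    for i in range(len(query) - k + 1):
        kmer = query[i:i + k]
        if tig is not None and pos + k < len(tig) and tig[pos + k] == kmer[-1]:
            pos += 1     # cheap extension: a single character compared
            continue
        searches += 1    # expensive full search (cache misses)
        tig = pos = None
        for t in tigs:   # stand-in for the real index lookup
            j = t.find(kmer)
            if j != -1:
                tig, pos = t, j
                break
    return searches

# The same k-mer set as two strings vs. one longer string (k = 3):
# fewer, longer tigs turn one full search into a cheap extension.
assert count_full_searches("ACGTA", ["ACG", "CGTA"], 3) == 2
assert count_full_searches("ACGTA", ["ACGTA"], 3) == 1
```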
See Additional file 1: Section 3 for a similar query experiment with the slightly older tool Bifrost.
Discussion
k-mer-based methods have found widespread use in many areas of bioinformatics over the past years. However, they usually rely on unitigs to represent the k-mer sets, since these can be computed efficiently with standard tools [23, 33, 40, 41]. Unitigs have the additional property that the de Bruijn graph topology can easily be reconstructed from them, since they do not contain branching nodes other than in their first and last k-mer. However, this property is not usually required by k-mer-based methods, which has opened the question of whether a smaller set of strings other than unitigs can be used to represent the k-mer sets. If such a representation were in plain text, it should be usable in most k-mer-based tools, by simply feeding it to the tool instead of unitigs.
Previous work has relaxed the unitig requirement of the representation of the k-mer sets to arbitrary strings without k-mer repetitions. This resulted in a smaller representation, leading to improvements in downstream applications. Additionally, previous work conjectured that finding an optimal representation without repeated k-mers is NP-hard; this conjecture was later disproven, and the problem was shown to be solvable in linear time. We have shown that by allowing repetitions, there is a polynomial optimal algorithm that achieves better compression and improvements in downstream applications.
Conclusions
Our optimal algorithm compresses the representation significantly more than previous work. We also propose a greedy heuristic that achieves near-optimal results while being suitable for practical use in both runtime and memory. Specifically, our algorithms achieve a decrease of 26% in size and 90% in string count over UST. Additionally, we have shown that our greedy representation speeds up downstream applications, giving an example with a factor of 2.10 compared to previous compressed representations.
Our implementation is available as a standalone command-line tool and as a library. We hope that our efficient algorithms result in a widespread adoption of near-minimum plain-text representations of k-mer sets in k-mer-based methods, resulting in more efficient bioinformatics tools.
Methods
We first give some preliminary definitions in the “Preliminaries” section and define our problem in the “Problem overview” section. Note that, to stay closer to our implementation, our definitions of bidirected de Bruijn graphs differ from those in e.g. [56]. However, the concepts are fundamentally the same. Then, in the “Building a compacted bidirected arc-centric de Bruijn graph from a set of strings”, “Reduction to the bidirected partial-coverage Chinese postman problem”, “Solving the bidirected partial-coverage Chinese postman problem with min-cost integer flows”, “Solving the min-cost integer flow formulation with min-cost matching”, and “Efficient computation of many-to-many min-cost paths” sections, we describe how to compute matchtigs. The whole algorithm is summarised by an example in Fig. 3. For simplicity, we describe the algorithm using an uncompacted de Bruijn graph. In practice it is much more efficient to use a compacted de Bruijn graph, and our algorithm can be adapted easily: simply replace the cost of 1 for each original arc with the number of uncompacted arcs it represents. In the “Efficient computation of the greedy heuristic” section, we describe the greedy heuristic.
Preliminaries
We are given an alphabet \(\Gamma\), and all strings in this work have only characters in \(\Gamma\). Further, we are given a bijection \(\textrm{comp} : \Gamma \rightarrow \Gamma\). The reverse complement of a string s is \(s^{-1} := \textrm{rev}(\textrm{comp}^*(s))\), where \(\textrm{rev}\) denotes the reversal of a string and \(\textrm{comp}^*\) the character-wise application of \(\textrm{comp}\). For an integer k, a string of length k is called a k-mer. From here on, we only consider strings of length at least k, i.e. strings that have at least one k-mer as substring. We denote the prefix of length \(k - 1\) of a k-mer s by \(\textrm{pre}(s)\) and its suffix of length \(k - 1\) by \(\textrm{suf}(s)\). The spectrum of a set of strings S is defined as the set of all k-mers and their reverse complements that occur in at least one string \(s \in S\), formally \(\textrm{spec}_k(S) := \{r \in \Gamma ^k \mid \exists s \in S: r \text { or } r^{-1} \text { is a substring of } s\}\).
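These definitions can be made concrete with a short sketch, assuming the DNA alphabet with the usual complement bijection (function names are ours, not from the paper):

```python
# Sketch of the preliminaries: reverse complement and k-mer spectrum.

_COMP = str.maketrans("ACGT", "TGCA")  # the bijection comp

def revcomp(s):
    """Reverse complement s^{-1} = rev(comp*(s))."""
    return s.translate(_COMP)[::-1]

def spectrum(strings, k):
    """spec_k(S): all k-mers of the strings plus their reverse complements."""
    spec = set()
    for s in strings:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            spec.add(kmer)
            spec.add(revcomp(kmer))
    return spec
```

For example, `spectrum(["ACGT"], 3)` contains ACG and CGT only, since each is the reverse complement of the other.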
An arc-centric de Bruijn graph (or de Bruijn graph for short) \(\textrm{DBG}_k(S) = (V, E)\) of order k of a set of strings S is defined as a standard directed graph with nodes \(V := \{s \mid s \in \textrm{spec}_{k-1}(S)\}\) and arcs \(E := \{(\textrm{pre}(s), \textrm{suf}(s)) \mid s \in \textrm{spec}_k(S)\}\). On top of this, we use the following notions of bidirectedness. An ordered pair of reverse-complementary nodes \([v, v^{-1}] \in V \times V\) is called a binode, and an ordered pair of reverse-complementary arcs \([(a, b), (b^{-1}, a^{-1})] \in E \times E\) is called a biarc. Even though these pairs are ordered, reversing the order still represents the same binode/biarc, just in the other direction. A node v is called canonical if v is lexicographically smaller than \(v^{-1}\), and an arc (a, b) is called canonical if the k-mer corresponding to (a, b) is lexicographically smaller than or equal to the k-mer corresponding to \((b^{-1}, a^{-1})\). If an arc or a node is its own reverse complement (called self-complemental), then it is written as biarc [(a, b)] or binode [v]. See Fig. 1 for examples of different bigraphs.
Since de Bruijn graphs are defined as standard directed graphs, we use the following standard definitions. The set of incoming (outgoing) arcs of a node v is denoted by \(E^-(v)\) (\(E^+(v)\)), and the indegree (outdegree) is \(d^-(v) := |E^-(v)|\) (\(d^+(v) := |E^+(v)|\)). A walk in a de Bruijn graph is a sequence of adjacent arcs (followed in the forward direction), and a unitig is a walk in which all inner nodes (nodes with at least two incident walk-arcs) have exactly one incoming and one outgoing arc. The length \(|w|\) of a walk w is the length of the sequence of its arcs (counting repeated arcs as often as they are repeated). A compacted de Bruijn graph is a de Bruijn graph in which all maximal unitigs have been replaced by a single arc. A circular walk is a walk that starts and ends in the same node, and an Eulerian circuit is a circular walk that contains each arc exactly once. A graph that admits an Eulerian circuit is Eulerian.
Assuming the complemental pairing of nodes and arcs defined above, we can define the following bidirected notions of walks and standard de Bruijn graph concepts. Biwalks and circular biwalks are defined equivalently to walks, except that they are sequences of biarcs. A biwalk w in a de Bruijn graph spells a string \(\textrm{spell}(w)\) of overlapping visited k-mers. That is, \(\textrm{spell}(w)\) is constructed by concatenating the string a from w’s first biarc \([(a, b), (b^{-1}, a^{-1})]\) (or [(a, b)]) with the last character of b for the first and each following biarc. See Fig. 1 for examples of bidirected de Bruijn graphs and notable special cases.
A bidirected graph is connected if, between each pair of distinct binodes \([u, u^{-1}]\) and \([v, v^{-1}]\) that are not reverse complements of each other, there is a biwalk from \([u, u^{-1}]\) to \([v, v^{-1}]\) or from \([u, u^{-1}]\) to \([v^{-1}, v]\). We assume that our graph is connected; if it consists of multiple connected components, our algorithm can be executed on each component separately, yielding a minimum result.
Problem overview
We are given a set of input strings I where each string has length at least k, and we want to compute a minimum spectrum preserving string set, defined as follows.
Definition 1
A spectrum preserving string set (or SPSS) of I is a set S of strings of length at least k such that \(\textrm{spec}_k(I) = \textrm{spec}_k(S)\), i.e. both sets of strings contain the same k-mers, either directly or as reverse complement.
Note that our definition allows k-mers and their reverse complements to be repeated in the SPSS, both in the same string and in different strings. This is an important difference to the definition of an SPSS by Rahman and Medvedev [44]. Their definition is equivalent to simplitigs, defined by Břinda, Baym and Kucherov [43]; hence, we use the term simplitigs when we refer to an SPSS without k-mer repetitions in this paper. Simplitigs are an SPSS in which, for each k-mer present, its reverse complement is present only if the k-mer is self-complemental, and each k-mer occurs only once, in at most one string of the SPSS.
Definition 2
The size \(\Vert S \Vert\) of an SPSS S is defined as
$$\Vert S \Vert := \sum _{s \in S} |s|,$$
where \(|s|\) denotes the length of string s. A minimum SPSS is an SPSS of minimum size.
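Definitions 1 and 2 can be checked mechanically. A small sketch (helper functions from the “Preliminaries” section repeated for self-containment; names are illustrative):

```python
_COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    return s.translate(_COMP)[::-1]

def spectrum(strings, k):
    return {t for s in strings
            for i in range(len(s) - k + 1)
            for t in (s[i:i + k], revcomp(s[i:i + k]))}

def is_spss(candidate, inputs, k):
    """Definition 1: same k-mer spectrum; repetitions are allowed."""
    return spectrum(candidate, k) == spectrum(inputs, k)

def size(spss):
    """Definition 2: ||S|| is the total number of characters."""
    return sum(len(s) for s in spss)
```

For example, S = ["ACGT"] is an SPSS of I = ["ACG", "CGT"] for k = 3 with size 4, smaller than the size 6 of I itself.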
On a high level, our algorithm works as follows (see also Fig. 3).

1. Create a bidirected de Bruijn graph from the input strings (see the “Building a compacted bidirected arc-centric de Bruijn graph from a set of strings” section).
2. Compute the bi-imbalances of each node (see the “Solving the bidirected partial-coverage Chinese postman problem with min-cost integer flows” section).
3. Compute the min-cost bipaths of length at most \(k - 1\) from all nodes with negative bi-imbalance to all nodes with positive bi-imbalance (see the “Efficient computation of many-to-many min-cost paths” section).
4. Solve a min-cost matching instance with costs for unmatched nodes to choose a set of shortest bipaths with minimum cost, by reduction to min-cost perfect matching (see the “Solving the min-cost integer flow formulation with min-cost matching” section).
5. Bi-Eulerise the graph with the set of bipaths as well as arbitrary arcs between unmatched nodes.
6. Compute a bi-Eulerian circuit in the resulting graph (see the “Solving the bidirected partial-coverage Chinese postman problem with min-cost integer flows” section).
7. Break the circuit into a set of biwalks and translate them into a set of strings, which is the output minimum SPSS (see the “Reduction to the bidirected partial-coverage Chinese postman problem” section).
Note that in our implementation there is a substantial difference: we do not build the de Bruijn graph ourselves, but expect the input to be a de Bruijn graph already. For our experiments, we use a compacted de Bruijn graph computed with BCALM2. We motivate optimality while explaining our algorithm, but also give a formal proof in Additional file 1: Section 4.
Building a compacted bidirected arc-centric de Bruijn graph from a set of strings
When building the graph we first compute unitigs from the input strings using BCALM2. Then we initialise an empty graph and do the following for each unitig:

1. We insert the unitig’s first \((k - 1)\)-mer and its reverse complement as a binode, by inserting the two nodes separately and marking them as a bidirected pair, if it does not already exist. Existence is tracked with a hashmap, storing the two nodes corresponding to a \((k - 1)\)-mer and its reverse complement, if they exist.
2. We do the same for the last \((k - 1)\)-mer of the unitig.
3. We add a biarc between the two binodes by inserting one forward arc between the forward nodes of the binodes, and one reverse arc between the reverse-complement nodes of the binodes. The forward arc is labelled with the unitig, and the reverse arc is labelled with its reverse complement.
To save memory, we store the unitigs in a single large array, where each character is encoded as a two-bit number. The keys of the hashmap and the labels of the arcs are pointers into the array, together with a flag for the reverse complement. Nodes do not need a label, as their label can be inferred from the label of any of their incident arcs. Recall that in the description of our algorithm, we use an uncompacted graph only for simplicity.
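The construction above can be sketched as follows. For clarity, this sketch uses plain strings as labels instead of pointers into a shared two-bit-packed array, and it omits the special handling that self-complemental \((k - 1)\)-mers would need; the function names are illustrative.

```python
_COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    return s.translate(_COMP)[::-1]

def build_bidirected_dbg(unitigs, k):
    """Build a bidirected compacted de Bruijn graph from unitigs.
    Self-complemental (k-1)-mers are not handled in this sketch."""
    nodes = {}   # (k-1)-mer -> node id; plays the role of the hashmap
    pair = {}    # node id -> id of its reverse-complement partner
    arcs = []    # (from_id, to_id, label)

    def binode(km1):
        # Insert the (k-1)-mer and its reverse complement as a pair,
        # unless the binode already exists.
        if km1 not in nodes:
            fwd, rev = len(nodes), len(nodes) + 1
            nodes[km1], nodes[revcomp(km1)] = fwd, rev
            pair[fwd], pair[rev] = rev, fwd
        return nodes[km1]

    for unitig in unitigs:
        u = binode(unitig[:k - 1])        # first (k-1)-mer
        v = binode(unitig[-(k - 1):])     # last (k-1)-mer
        arcs.append((u, v, unitig))                        # forward arc
        arcs.append((pair[v], pair[u], revcomp(unitig)))   # reverse arc
    return nodes, pair, arcs
```

For the unitig ACGA with k = 3, this creates the binodes [AC, GT] and [GA, TC] and the biarc labelled ACGA/TCGT.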
Reduction to the bidirected partial-coverage Chinese postman problem
We first compute the arc-centric de Bruijn graph \(\textrm{DBG}_k(I)\) of the given input string set I as described in the “Building a compacted bidirected arc-centric de Bruijn graph from a set of strings” section. In \(\textrm{DBG}_k(I)\), an SPSS S is represented by a biarc-covering set of biwalks W (the reverse direction of a biarc does not need to be covered separately), that is, a set of biwalks such that each biarc is an element of at least one biwalk (see Fig. 1a). According to the definition of \(\textrm{spell}\), the size of S is related to W as follows:
$$\Vert S \Vert = \sum _{w \in W} (k - 1 + |w|). \qquad (1)$$
Each walk costs \(k - 1\) characters because it contains the node a from its first biarc \([(a, b), (b^{-1}, a^{-1})]\) (or [(a, b)]), and it additionally costs one character per arc.
To minimise \(\Vert S \Vert\), we transform the graph as follows:
Definition 3
(Graph transformation) Given an arc-centric de Bruijn graph \(DBG_k(I) = (V, E)\), the transformed graph is defined as \(DBG'_k(I) = (V, E')\), where \(E'\) is a multiset defined as \(E' := E \cup (V \times V)\). In \(E'\), arcs from E are marked as non-breaking, and arcs from \(V \times V\) are marked as breaking arcs. The cost function \(c(e), e \in E'\) assigns all non-breaking arcs cost 1 and all breaking arcs cost \(k - 1\).
The reverse complements of breaking arcs are breaking as well, and the same holds for non-breaking arcs. This means that biarcs are always either a pair of reverse-complemental breaking arcs, in which case we call them breaking biarcs, or a pair of reverse-complemental non-breaking arcs, in which case we call them non-breaking biarcs. By the same pattern, self-complemental biarcs are defined to be breaking or non-breaking biarcs depending on their underlying arc. Breaking arcs have cost \(k - 1\) because each breaking arc corresponds to starting a new walk, which by Eq. 1 costs \(k - 1\).
In the transformed graph, we find a circular biwalk \(w^*\) of minimum cost that covers at least all original biarcs (to cover a biarc, it is enough to traverse it once in one of its directions), as well as at least one breaking biarc. The reason for requiring at least one breaking biarc is that later we break the circular original-biarc-covering biwalk at all breaking biarcs to get a set of strings. If there were no breaking biarc, then we would get a circular string. Simply breaking a circular string at an arbitrary binode or a repeated non-breaking biarc does not produce a minimum solution in general, because there may be multiple circular original-biarc-covering biwalks with minimum cost, but with different repeated k-mers. When breaking the walk at a run of repeated k-mers, the more repeated k-mers are removed, the shorter the resulting string gets. Hence we make finding an optimal breaking point part of the optimisation problem. We define such a walk as:
Definition 4
Given a transformed graph \(DBG'_k(I) = (V, E)\), a circular original-biarc-covering biwalk is a circular biwalk w such that for each non-breaking arc \((a, b) \in E\) there is a biarc \([(a, b), (b^{-1}, a^{-1})]\), \([(b^{-1}, a^{-1}), (a, b)]\) or [(a, b)] in w. Additionally, w needs to contain at least one breaking biarc.
Definition 5
Given a transformed graph \(DBG'_k(I)\) and a circular original-biarc-covering biwalk w, consisting of biarcs \([(a, b), (b^{-1}, a^{-1})]\) and possibly self-complemental biarcs [(a, b)]. The cost c(w) of w is the sum of the costs of its biarcs and self-complemental biarcs, where the cost of a biarc \([(a, b), (b^{-1}, a^{-1})]\) is c((a, b)), and the cost of a self-complemental biarc [(a, b)] is c((a, b)).
This is similar to the directed Chinese postman problem (DCPP). In the DCPP, the task is to find a circular min-cost arc-covering walk in a directed graph. It is a classical problem, known to be solvable in \(O(n^3)\) time [60] with a flow-based algorithm, using e.g. [61] to compute min-cost flows. The partial coverage variant of the DCPP (requiring to cover only a subset of the arcs) is also known as the rural postman problem [62]. Further, the bidirected variant of the DCPP has been discussed before in [56], and the authors also solved it using min-cost flow in \(O(n^2 \log ^2(n))\) time.
We break the resulting min-cost circular original-biarc-covering biwalk \(w^*\) at all breaking arcs (hence we require it to contain at least one breaking biarc, as otherwise we would get a circular string). The returned SPSS is minimum because the metric optimised when finding \(w^*\) matches Eq. 1.
Solving the bidirected partial-coverage Chinese postman problem with min-cost integer flows
Edmonds and Johnson [54] introduced a polynomial-time flow-based approach that is adaptable to solve our variant of the DCPP. They show that finding a minimum circular arc-covering walk in a directed graph is equivalent to finding a minimum Eulerisation of the graph, and then any Eulerian circuit in the Eulerised graph. A Eulerisation is a multiset of arc-copies from the graph that makes the graph Eulerian if added, either by connecting nodes with missing outgoing arcs directly to nodes with missing incoming arcs, or by connecting them via a path of multiple arcs. A minimum Eulerisation is one whose sum of arc costs is minimum among all such multisets. To find such a minimum-cost set of arcs, they formulate a min-cost flow problem as an integer linear program as follows:
$$\text {minimise} \sum _{e \in E} c_e x_e \qquad (2)$$
$$\text {subject to } x_e \ge 0 \text { and } x_e \in \mathbb {Z} \text { for all } e \in E, \qquad (3)$$
$$d^-(v) + \sum _{e \in E^-(v)} x_e = d^+(v) + \sum _{e \in E^+(v)} x_e \text { for all } v \in V. \qquad (4)$$
The variable \(x_e\) is interpreted as the amount of flow through arc e, and the constant \(c_e\) denotes the cost of assigning flow to arc e. The costs are equivalent to the arc costs in the weighted graph specified by the DCPP instance. Objective (2) minimises the costs of inserted arcs, as required. To ensure that the resulting flow can be directly translated into added arcs, Condition (3) ensures that the resulting flow is non-negative and integral. Lastly, Eq. (4) is the balance constraint, ensuring that the resulting flow is a valid Eulerisation of the graph. This constraint makes nodes with missing outgoing arcs into sources, and nodes with missing incoming arcs into sinks, with demands matching the number of missing arcs. Note that, in contrast to classic flow problems, this formulation contains no capacity constraint. For a solution of this linear program, the corresponding Eulerisation contains \(x_e\) copies of each arc e.
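The demands appearing in the balance constraint come from the node imbalances. A minimal sketch of how they would be computed, assuming the graph is given as a plain arc list (function name illustrative):

```python
from collections import defaultdict

def imbalances(arcs):
    """i_v = d^-(v) - d^+(v) for every node of an arc list."""
    imb = defaultdict(int)
    for u, v in arcs:
        imb[u] -= 1   # one more outgoing arc at u
        imb[v] += 1   # one more incoming arc at v
    return dict(imb)
```

Nodes with a positive value have excess incoming arcs (missing outgoing arcs), nodes with a negative value the converse; balanced nodes have value zero.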
To adapt this formulation to our variant of the DCPP, we need to make the following modifications:

- compute the costs while treating self-complemental biarcs the same as other biarcs,
- allow for partial coverage,
- force covering at least one breaking arc (one of the arcs that are not required to be covered),
- adjust the balance constraint for biwalks, and
- ensure that the resulting flow is bidirected, i.e. the flow of each arc equals the flow of its reverse complement.
Bidirected costs
If we simply counted the cost of each arc separately, then self-complemental biarcs would cost 1 for each repetition, while other biarcs would cost 2 for each repetition, since other biarcs consist of two arcs. To circumvent this, we only count arcs corresponding to canonical k-mers in the cost function:
$$\text {minimise} \sum _{e \in E \text { canonical}} c_e x_e.$$
Partial coverage
In the partial coverage Chinese postman problem, we are additionally given a set \(F \subseteq E\) of arcs to be covered. In contrast to the standard DCPP, a solution walk only needs to cover all the arcs in F. In our case, the set F is the set of original arcs of the graph before Eulerisation. To solve the partial coverage Chinese postman problem, we define outgoing covered arcs \(F^+(v) := F \cap E^+(v)\) and incoming covered arcs \(F^-(v) := F \cap E^-(v)\) for a node v, as well as the covered outdegree \(d_F^+(v) := |F^+(v)|\) and the covered indegree \(d_F^-(v) := |F^-(v)|\). Then we reformulate the balance constraint as:
$$d_F^-(v) + \sum _{e \in E^-(v)} x_e = d_F^+(v) + \sum _{e \in E^+(v)} x_e \text { for all } v \in V.$$
The resulting set of arcs is a minimum Eulerisation of the graph (V, F), and an Eulerian walk in this graph is equivalent to a minimum circular F-covering walk in the original graph.
Cover one breaking arc
Next to the partial coverage condition, we additionally require covering at least one of the arcs that are not required to be covered. Since we forbid negative flows, we can express this as:
$$\sum _{e \in E \setminus F} x_e \ge 1. \qquad (6)$$
Bidirected balance
In contrast to Edmonds and Johnson, we are interested in a minimum circular biwalk that covers each original biarc. Analogously to the formulation for directed graphs, we make the following definitions:
Definition 6
(Bi-Eulerian circuits and graphs) A bi-Eulerian circuit in a bidirected graph is a bidirected circuit that visits each biarc exactly once. A bi-Eulerian graph is a graph that admits a bi-Eulerian circuit. A bi-Eulerisation is a multiset of biarc copies from the graph that makes the graph bi-Eulerian if added. A minimum bi-Eulerisation is one whose sum of arc costs is minimum among all bi-Eulerisations of the same graph.
We can compute a bi-Eulerisation in the same way as we compute a Eulerisation; the only change is in the balance constraint. Observe that for a Eulerian graph, the imbalance \(i_v := d^-(v) - d^+(v)\) is zero for each node [63], because each node is entered exactly as often as it is exited. For binodes, the definition of the bi-imbalance \(bi_v\) of a binode \([v, v^{-1}]\) or [v] follows the same idea. However, in contrast to directed graphs, there are two special cases. These are mutually exclusive, since the labels of an arc are of length k and those of a node are of length \(k - 1\), such that only one of the two can have even length, which is required for being self-complemental in DNA alphabets.
Binodes \([v, v^{-1}] \in V \times V\) with \(v \ne v^{-1}\) may have incident self-complemental arcs \([(v, v^{-1})]\) and/or \([(v^{-1}, v)]\) (see Fig. 1c for an example). If e.g. only \([(v, v^{-1})]\) exists, then to traverse it, a biwalk needs to enter v twice. First, it needs to reach \([v, v^{-1}]\) via some biarc, and after traversing \([(v, v^{-1})]\), it needs to leave \([v^{-1}, v]\) via a different biarc, whose reverse complement enters \([v, v^{-1}]\). Hence, a self-complemental biarc alters the bi-imbalance of a node by two. See Fig. 4 for an example of this situation. If only \([(v^{-1}, v)]\) exists, then the situation is symmetric. Therefore, for balance of \([v, v^{-1}]\), the self-complemental biarc \([(v, v^{-1})]\) requires two biarcs entering \([v, v^{-1}]\), and the self-complemental biarc \([(v^{-1}, v)]\) requires two biarcs leaving \([v, v^{-1}]\). If both self-complemental arcs exist (e.g. both [(ATA, TAT)] and [(TAT, ATA)] for a binode [ATA, TAT]), then a biwalk can traverse them consecutively from e.g. \([v, v^{-1}]\) by traversing first \([(v, v^{-1})]\) and then \([(v^{-1}, v)]\), ending up in \([v, v^{-1}]\) again, such that the self-complemental arcs have a neutral contribution to the bi-imbalance. As a result, the bi-imbalance of \([v, v^{-1}]\) is
$$bi_v := d^-(v) - d^+(v) - \mathbbm {1}_{[(v, v^{-1})] \in E} + \mathbbm {1}_{[(v^{-1}, v)] \in E},$$
where \(\mathbbm {1}_P\) is 1 if the predicate P is true and 0 otherwise.
For self-complemental binodes \([v] \in V\), there is no concept of incoming or outgoing biarcs, since any biarc can be used to either enter or leave [v] (see Fig. 1b for an example). Therefore, for balance, biarcs need to be paired arbitrarily, for which there needs to be an even number of biarcs. If there is an odd number, then there is one unpaired biarc, hence the bi-imbalance is 1. The following condition expresses this behaviour (using \(\bmod\) as the remainder operator, and noting that \(d^+(v) = d^-(v)\) counts the incident biarcs of a self-complemental binode):
$$bi_v := d^+(v) \bmod 2.$$
Finally, we include partial coverage in the above bi-imbalance formulations by limiting the incoming and outgoing arcs to F. Further, to distinguish between self-complemental binodes and others, we denote the set of self-complemental binodes as \(S \subseteq V\) and the set of binodes that are not self-complemental as \(T := V \setminus S\). If self-complemental biarcs are included in the flow, then they alter the bi-imbalance by two, in the same way as they do in the equation of the bi-imbalance. We encode this on the left side of the equation. Then we get the following modified coverage constraint:
Valid bidirected flow
To adapt Edmonds’ and Johnson’s formulation to biwalks, we additionally need to ensure that the resulting flow yields a set of biarcs, i.e. that each arc has the same flow as its reverse complement:
$$x_{(a, b)} = x_{(b^{-1}, a^{-1})} \text { for all } (a, b) \in E. \qquad (9)$$
Adapted flow formulation
With the modifications above, we can adapt the formulation of Edmonds and Johnson to solve the bidirected partial-coverage Chinese postman problem. We define F to be the arcs in the original graph, and set \(E := V \times V\). We further set \(c_e = 1\) for \(e \in F\) and \(c_e = k - 1\) otherwise. Lastly, we define S and T as above. Then we get the following modified formulation. Note that it is conceptually similar to the one proposed in [56], but different because the basic definitions differ, and we additionally allow for special arcs of which only one needs to be covered.
In this min-cost integer flow formulation of the bidirected partial-coverage Chinese postman problem, analogously to the formulation of Edmonds and Johnson, sources and sinks are nodes with missing outgoing or incoming arcs, with demands matching the number of missing arcs in F. Our formulation would not be solvable for practical de Bruijn graphs, because inserting a quadratic number of arcs into the graph is infeasible. However, most of the breaking arcs are not needed, since in a minimum solution they can only carry flow if they directly connect a source to a sink, by the following argument: Imagine a breaking arc that carries flow but is connected to a source or sink with at most one end. We can trace one unit of flow on the arc to a source and a sink, creating a path of flow one. By removing the flow from the path, and adding it to a breaking arc directly connecting the source to the sink, we get a valid flow. This flow has lower cost than the original, because it contains the same number of breaking arcs, but a lower number of non-breaking arcs. This can be repeated until only breaking arcs that directly connect sources to sinks are left.
But even reducing the number of breaking arcs like this might not be enough if the graph contains too many sources and sinks. We therefore reduce the linear program to a min-cost matching instance, similar to Edmonds and Johnson.
Solving the min-cost integer flow formulation with min-cost matching
To solve the bidirected partial-coverage Chinese postman problem with min-cost matching, we observe that flow traverses the graph from a source to a sink only via min-cost paths, since all arcs have infinite capacity. Due to the existence of the breaking arcs with low cost (\(k - 1\)), we can further restrict the flow to use only paths of length at most \(k - 2\) without affecting minimality. However, since we are also interested in a low number of strings in our minimum SPSS, we also allow paths of length \(k - 1\). We can precompute these min-cost paths efficiently in parallel (see the “Efficient computation of many-to-many min-cost paths” section below). Then it remains to decide which combination of min-cost paths and breaking arcs yields a minimum solution.
To simplify this problem, observe that the pairing of sources and sinks that are connected via breaking arcs does not matter. Any pairing uses the same number of breaking arcs, and therefore has the same cost. It only matters that these nodes are not connected by a lower-cost path that does not use any breaking arcs, and that there is at least one breaking arc. As a result, we can ignore breaking arcs when searching for a minimum solution, and instead introduce costs for unmatched nodes. However, we still need to enforce that there is at least one pair of unmatched nodes. We do this using a special construction described below. Note, though, that there can only be unmatched nodes if there are unbalanced binodes, i.e. if the graph was not bi-Eulerian in the beginning. However, if the graph is bi-Eulerian already, the whole matching step can be left out, and instead a bi-Eulerian circuit with one arbitrarily inserted breaking biarc can be returned. So we can safely assume here that the graph contains at least one pair of unbalanced binodes (it cannot contain a single unbalanced binode, see e.g. [47]).
We formulate a min-cost matching problem with penalty costs for unmatched nodes, which can be reduced to a min-cost perfect matching problem. For the construction of our undirected matching graph M, we define the set of sources \(A \subseteq T\) as all nodes with negative bi-imbalance, and the set of sinks \(B \subseteq T\) as all nodes with positive bi-imbalance. Then we add \(|bi_v|\) (the absolute value of the bi-imbalance of v) copies of each node from A, B and S to M. Further, for each min-cost path from a node \(a \in A \cup S\) to a node \(b \in B \cup S\), we add an edge (undirected arc) from each copy of a to each copy of b in M, with costs equal to the costs of the path. We ignore self-loops at nodes in S since they do not alter the imbalance, and nodes in A and B cannot have self-loops.
Then, to fulfil Condition (9) (valid bidirected flow) and to reduce the size of the matching problem, we merge all nodes and arcs with their reverse complements (the unmerged graph is built here only to simplify our explanations; in our implementation we directly build the merged graph). This additionally results in self-complemental biarcs forming self-loops in the merged graph, thus making them not choosable by the matcher. But this is correct, as self-complemental biarcs alter the bi-imbalance of a binode by two, and therefore they can only be chosen by matching two different copies of the same binode in M.
Additionally, to fulfil Condition (6) (cover one bidirected arc), we add a pair of nodes u, w to M. We connect u and w to each node in M (but do not add an edge between u and w) and assign cost 0 to all those edges. This forces u and w to be matched to other nodes \(u', w'\), which means that when bi-Eulerising, the bi-imbalances of \(u'\) and \(w'\) need to be fixed with at least one breaking arc.
Lastly, we assign each node other than u and w penalty costs of \((k - 1) / 2\) for staying unmatched, as each pair of unmatched nodes produces cost \(k - 1\) for using a breaking arc.
We reduce M to an instance of the min-cost perfect matching problem using the reduction described in [64]. For that, we duplicate the graph and add an edge with cost \(k - 1\) between each node and its copy, but not between u and w and their respective copies. The costs for edges between a node and its copy are double the costs of keeping a node unmatched, because choosing such an edge causes two nodes to stay unmatched.
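The duplication step of this reduction can be sketched as follows (node naming, the edge-list representation, and the set of nodes excluded from copy edges are illustrative):

```python
# Sketch of the reduction from min-cost matching with penalties for
# unmatched nodes to min-cost *perfect* matching: duplicate the graph
# and connect each node to its copy with an edge of cost k - 1.

def to_perfect_matching(nodes, edges, k, skip=()):
    """nodes: list of node names; edges: list of (a, b, cost).
    'skip' holds nodes (like u and w) that get no copy edge.
    Returns the nodes and edges of the doubled instance; copies are
    tagged with a trailing apostrophe."""
    copy = {v: v + "'" for v in nodes}
    new_nodes = list(nodes) + [copy[v] for v in nodes]
    new_edges = list(edges)
    # Mirror every original edge inside the copy.
    new_edges += [(copy[a], copy[b], c) for a, b, c in edges]
    # Penalty edges between a node and its copy: cost k - 1, i.e.
    # double the penalty (k - 1) / 2 of one unmatched node, because
    # choosing such an edge leaves two nodes unmatched.
    new_edges += [(v, copy[v], k - 1) for v in nodes if v not in skip]
    return new_nodes, new_edges
```

A perfect matching in the doubled instance then corresponds to a matching with penalties in the original instance.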
After this reduction, we use Blossom V [65] to compute a solution. Since all nodes were doubled in the reduction, we actually get two solutions that might even differ; however, both of them are minimum. We arbitrarily choose one of the solutions. This gives us a multiset of arcs, which we complete with the breaking arcs required to balance the unmatched nodes, to create a bi-Eulerisation of the input graph. Following the approach of Edmonds and Johnson, we find a bi-Eulerian circuit in the resulting graph, which is a solution to the bidirected partial-coverage Chinese postman problem, as required.
Note that our matching formulation only optimises CL (cumulative length), but does not optimise SC (string count). It optimises SC indirectly, because it decreases CL by joining strings, which also decreases SC by one each time. However, when joining two strings would not alter CL, Blossom V may output both variants, with and without joining the strings, while staying optimal. It then chooses an arbitrary option.
Efficient computation of many-to-many min-cost paths
Apart from solving the matching problem, finding the min-cost paths between sources and sinks is the most computationally heavy part of our algorithm.
We solve it using Dijkstra’s shortest path algorithm [66] in a one-to-many variant, executed in parallel for all sources. To be efficient, we create a queue with blocks of source nodes, and the threads process one block at a time. A good block size balances between threads competing for access to the queue and threads receiving a very imbalanced workload. Since our min-cost paths are short (at most \(k - 1\) arcs), in most executions of Dijkstra’s algorithm only a tiny fraction of the nodes in the graph are visited. But the standard variant of Dijkstra’s algorithm wastefully allocates an array storing, for all nodes, their distance from the source node (the distance array). To save space, we instead use a hashmap, mapping from node_index to distance from the source. This turned out to be faster than using a distance array, even if the distance array uses an epoch system to only do a full reset every \(2^{32}\) queries. An epoch system stores a second value for each entry in the distance array, indicating in which execution of Dijkstra’s algorithm that value is valid. The execution counter gets incremented on each execution, and only when it wraps around is the distance array reset normally. As another optimisation, we abort the execution early when Dijkstra’s algorithm reaches a cost greater than \(k - 1\), since we are only interested in paths of cost up to \(k - 1\).
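The bounded search can be sketched as follows; this is a simplified single-threaded model with a plain dictionary in place of the distance array (function name and graph representation are illustrative):

```python
import heapq

# Sketch of the bounded one-to-many Dijkstra of this section: distances
# live in a dict instead of a full per-node array, and the search stops
# once the smallest tentative cost exceeds the bound (k - 1 in the paper).

def bounded_dijkstra(adj, source, bound):
    """adj: node -> list of (neighbour, cost). Returns a dict of all
    min costs from source that are at most the bound."""
    dist = {source: 0}        # hashmap replaces the distance array
    done = {}
    queue = [(0, source)]
    while queue:
        d, v = heapq.heappop(queue)
        if d > bound:         # early abort: all remaining paths cost more
            break
        if v in done:
            continue
        done[v] = d
        for w, c in adj.get(v, []):
            nd = d + c
            if nd <= bound and nd < dist.get(w, nd + 1):
                dist[w] = nd
                heapq.heappush(queue, (nd, w))
    return done
```

Because only visited nodes appear in the dictionary, the memory used per execution is proportional to the small neighbourhood explored, not to the size of the whole graph.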
Finally, in our implementation, we do not compute the actual sequences of arcs of the paths. Instead of copying the path arcs when bi-Eulerising the graph, we insert special dummy arcs with a length equal to the length of the path. When breaking the final bi-Eulerian circuit, if there are no breaking arcs but there are dummy arcs, then we break at a longest dummy arc to produce a minimum solution. If there are neither breaking nor dummy arcs, we proceed as described above. Then, when reporting the final set of strings, we define \(\textrm{spell}(\cdot )\) to append the last \(\ell\) characters of b when encountering a dummy biarc \([(a, b), (b^{-1}, a^{-1})]\) (or [(a, b)]) of length \(\ell\).
Efficient computation of the greedy heuristic
The greedy heuristic bi-Eulerises the graph by greedily adding min-cost paths between unbalanced nodes, as opposed to choosing an optimal set via min-cost matching like our main algorithm. It then continues like the main algorithm: finding a bi-Eulerian circuit, breaking it into walks and spelling out the strings.
To be efficient, the min-cost paths are again computed in parallel, and we apply all optimisations from the “Efficient computation of many-to-many min-cost paths” section. The parallelism, however, poses a problem for the greedy computation: if a binode with one missing incoming biarc is reached by two min-cost paths in parallel, and both threads add their respective biarcs, we would overshoot the bi-imbalance of that binode. To prevent this, we introduce a lock for each node, and before inserting a biarc into the graph we lock all (up to) four incident nodes. By locking the nodes in order of their ids we ensure that no deadlock can occur. Since the number of threads is many orders of magnitude lower than the number of nodes, we assume that threads almost never need to wait for each other. In addition to the parallelisation, we abort Dijkstra’s algorithm early when we have enough paths to fix the imbalance of the binode. This sometimes requires executing Dijkstra’s algorithm again, if a potential sink node was used by a different thread in parallel. But again, since the number of threads is many orders of magnitude lower than the number of nodes, we assume that this case almost never occurs.
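The ordered locking discipline can be illustrated as follows (a minimal Python sketch with hypothetical names; the implementation itself is in Rust). Acquiring the per-node locks in increasing id order imposes a global order on lock acquisition, which rules out the circular wait required for a deadlock:

```python
import threading

class NodeLocks:
    """Per-node locks guarding parallel biarc insertion. Each thread
    locks the (up to four) nodes incident to the biarc it wants to
    insert, always in increasing id order, so two threads can never
    each hold one lock while waiting for the other's."""
    def __init__(self, num_nodes):
        self.locks = [threading.Lock() for _ in range(num_nodes)]

    def lock_for_insertion(self, incident_nodes):
        # Deduplicate and sort: the global id order prevents deadlock.
        ids = sorted(set(incident_nodes))
        for i in ids:
            self.locks[i].acquire()
        return ids

    def unlock(self, ids):
        for i in reversed(ids):
            self.locks[i].release()
```

A thread inserting a biarc would call `lock_for_insertion` with the four incident node ids, check and update the bi-imbalance, insert the biarc, and then call `unlock`.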
In practice, the greedy heuristic achieves better results in terms of SC than the optimal matchtigs algorithm (see the “Results” section). This is because the greedy heuristic always joins two strings if doing so does not alter CL, while the optimal algorithm does not, as explained in the “Solving the min-cost integer flow formulation with min-cost matching” section.
Minimising string count
In this paper, we studied SPSSes of minimum total length (minimum CL). In this section, we note that an SPSS with a minimum number of strings (minimum SC), and with no constraint on the total length, is also computable in polynomial time.
The high-level idea, ignoring reverse complements, is as follows. Given the arc-centric de Bruijn graph G, construct the directed acyclic graph \(G^*\) of strongly connected components (SCCs) of G. In \(G^*\), every SCC is a node, and there are as many arcs between two SCCs as there are pairs of nodes in the two SCCs with an arc between them. Clearly, all arcs in a single SCC are coverable by a single walk. Moreover, for two SCCs connected by an arc, their two covering walks can be connected via this arc into a single walk covering all arcs of both SCCs. Thus, the minimum number of walks needed to cover all arcs of G (i.e. a minimum SC SPSS) equals the minimum number of paths needed to cover all arcs of the acyclic graph \(G^*\). This is a classic problem solvable in polynomial time with network flows (see e.g. [67] among many).
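For illustration, the condensation step can be sketched in Python with Kosaraju’s algorithm (any linear-time SCC algorithm works equally well; the flow-based minimum path cover on \(G^*\) is omitted here):

```python
def strongly_connected_components(adj):
    """Kosaraju's algorithm: return the SCCs (as sets of nodes) of the
    directed graph `adj` (node -> list of successors). Condensing each
    SCC to a single node yields the DAG G*, on which a minimum path
    cover (computable with network flows) gives the minimum number of
    covering walks."""
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    # First pass: record DFS finishing order on G.
    order, visited = [], set()
    def dfs1(u):
        visited.add(u)
        for v in adj.get(u, []):
            if v not in visited:
                dfs1(v)
        order.append(u)
    for u in nodes:
        if u not in visited:
            dfs1(u)
    # Second pass: DFS on the reversed graph in reverse finishing order.
    radj = {u: [] for u in nodes}
    for u, vs in adj.items():
        for v in vs:
            radj[v].append(u)
    sccs, assigned = [], set()
    def dfs2(u, comp):
        assigned.add(u)
        comp.add(u)
        for v in radj[u]:
            if v not in assigned:
                dfs2(v, comp)
    for u in reversed(order):
        if u not in assigned:
            comp = set()
            dfs2(u, comp)
            sccs.append(comp)
    return sccs
```

For example, the graph 0 ⇄ 1 → 2 ⇄ 3 has the two SCCs {0, 1} and {2, 3}, and its condensation \(G^*\) is a single arc between them, so one walk covers all arcs.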
However, such an SPSS of minimum SC very likely has a large CL, because covering an SCC with a single walk might repeat quadratically many arcs, and connecting the covering walks of two adjacent SCCs might additionally require repeating many arcs to reach the arc between them.
Experimental evaluation
We ran our experiments on a server running Linux with two 64-core AMD EPYC 7H12 processors with 2 logical cores per physical core, 1.96 TiB of RAM and an SSD. We downloaded the genomes of the model organisms from RefSeq [42]: Caenorhabditis elegans with accession GCF_000002985.6 [68], Bombyx mori with accession GCF_000151625.1 [69] and Homo sapiens with accession GCF_000001405.39 [70]. These are the same genomes as in [43], except that we downloaded HG38 from RefSeq for citability. The short reads were downloaded from the sequence read archive [71]: Caenorhabditis elegans with accession SRR14447868.1 [72], Bombyx mori with accession DRR064025.1 [73] and Homo sapiens with accessions SRR2052337.1 to SRR2052425.1 [74].
We downloaded the 1102 Neisseria gonorrhoeae genomes from [75]. We downloaded the 616 Streptococcus pneumoniae genomes from the sequence read archive, using the accession codes provided in Table 1 in [76]. Up to here, the pangenomes were retrieved in the same way as in [43]. We additionally used grep to select 3682 Escherichia coli genomes from GenBank [77] using the overview file ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt. The genomes are listed in Additional file 2. The \(\sim\)309k Salmonella genome sequences were downloaded from the EnteroBase database [78] in February 2022. The included filenames are in Additional file 3. The 2505x human pangenome is from the 1000 genomes project [79], created by downloading a variant of GRCh37 from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz and downloading variant files for chromosomes 1–22 from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/. We then converted chromosomes 1–22 in the reference into 2505 sequences each, using the tool vcf2multialign published in [80].
For querying the human read dataset, we used 3 million reads randomly drawn from the reads used to construct the dataset. For querying the Salmonella pangenome, we used 3 million short reads randomly drawn from 10 read data sets from the sequence read archive, with the accessions listed in Additional file 4. For querying the human pangenome, we used 3 million short reads randomly drawn from one file of the sequence read archive with accession SRR2052337.1. This is one of the files from the human short read dataset described above. For querying the E. coli pangenome (in Additional file 1: Section 3), we used 30 short read data sets from the sequence read archive, with the accessions listed in Additional file 5.
We used snakemake [81] and the bioconda software repository [82] to build our experiment pipeline. The tigs were checked for correctness by checking their k-mer sets against unitigs. The Bifrost queries in Additional file 1: Section 3 were checked for correctness by verifying that the query results are equivalent to those obtained with unitigs. The SSHash queries were not checked for correctness, as SSHash was modified by its author himself. Whenever we measured the runtime of queries and builds for Additional file 1: Supplemental figure 5 (performance with different numbers of threads), we let only a single experiment run at a time, even if the experiment used only one core. When running the other builds we ran multiple processes at the same time, but never used more threads than the processor has physical cores (thus avoiding any effects introduced by sharing logical cores). When running a tool, we copied its input to the SSD and copied the results back to our main network storage, to prevent the network storage’s varying workload from affecting our runtime measurements. For experiments running on input reads or references as opposed to unitigs (BCALM2, ProphAsm), we copied the inputs to a RAID of HDDs instead, due to their large size. The copying was not part of the measurements. We made sure that the server was undisturbed, except that we monitored the experiment status and progress with htop and less. We limited each run to 256 GiB of RAM per process, which prevented us from running matchtigs on larger inputs. Further, ProphAsm supports only \(k \le 32\), so it was not run for larger k.
For an overview of our experiment pipeline for computing tigs, see Fig. 5. We ran ProphAsm directly on the input data, as it was introduced to do [43]. All other tools require unitigs to be computed first. UST specifically requires unitigs computed by BCALM2, as BCALM2 adds additional annotations to the fasta unitig file. Our tool matchtigs can also make use of these annotations to speed up the construction of the arc-centric de Bruijn graph. On the human pangenome, BCALM2 crashed because the input was too large. Hence, we used Cuttlefish 2 [40] to compute unitigs, and since UST runs only on unitigs computed by BCALM2, we instead ran ProphAsm to compute heuristic simplitigs.
For queries, we executed Bifrost or SSHash-Lite on the different tigs. The Bifrost query command handles both building the index and executing the query, while SSHash-Lite requires running a separate command to build the index first.
We measured runtimes and memory using the command /usr/bin/time -v. The runtimes are reported as wall clock time, and the memory is reported as the maximum resident set size.
See the “Availability of data and materials” section for the availability of our implementation and experiment code, which includes all the concrete commands we used to execute our experiments.
Availability of data and materials
matchtigs: https://github.com/algbio/matchtigs.
SSHash-Lite: https://github.com/jermp/sshash-lite.
The implementation of the matchtigs and greedy matchtigs algorithms is available on GitHub [83]. The name of the project is matchtigs. It is platform independent and can be compiled locally or installed from bioconda as described in the README of the project. It is licensed under the 2-clause BSD license. The version used for our experiments is available at [84], and the implementation together with all code to reproduce the experiments is available at [85]. The experiment code is also licensed under the 2-clause BSD license. SSHash-Lite is available on GitHub [86] and licensed under the MIT license. The version used for our experiments is available at [87].
The genomes of the model organisms are: Caenorhabditis elegans [68], Bombyx mori [69] and Homo sapiens [70]. The short reads of the model organisms are: Caenorhabditis elegans [72], Bombyx mori [73] and Homo sapiens [74].
The 616x Streptococcus pneumoniae pangenome is available in the sequence read archive [71], using the accession codes provided in Table 1 in [76]. The 1102x Neisseria gonorrhoeae pangenome is from [75]. The 3682x Escherichia coli pangenome is available from GenBank [77] using the accessions in Additional file 2. The \(\sim\)309kx Salmonella pangenome was created by downloading all Salmonella genomes from the EnteroBase database [78] in February 2022. The included filenames are in Additional file 3. The 2505x human pangenome is from the 1000 genomes project [79].
The queries for the Salmonella pangenome are 3 million randomly drawn short reads from 10 Salmonella read data sets from the sequence read archive, with the accessions listed in Additional file 4. The queries for the human pangenome are 3 million randomly drawn short reads from [74], restricted to file SRR2052337.1. The queries for the E. coli pangenome (in Additional file 1: Section 3) are 30 short read data sets from the sequence read archive, with the accessions listed in Additional file 5.
References
Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. 2017;18(1):1–17.
Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, Lau AK, Röhling S, Choi JJ, Waterman MS, et al. Benchmarking of alignment-free sequence comparison methods. Genome Biol. 2019;20(1):1–18.
Luhmann N, Holley G, Achtman M. BlastFrost: fast querying of 100,000s of bacterial genomes in Bifrost graphs. Genome Biol. 2021;22(1):1–15.
Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet. 2012;44(2):226–32.
Nordström KJ, Albani MC, James GV, Gutjahr C, Hartwig B, Turck F, Paszkowski U, Coupland G, Schneeberger K. Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nat Biotechnol. 2013;31(4):325–30.
Bradley P, Gordon NC, Walker TM, Dunn L, Heys S, Huang B, Earle S, Pankhurst LJ, Anson L, De Cesare M, et al. Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat Commun. 2015;6(1):1–15.
Shajii A, Yorukoglu D, Yu YW, Berger B. Fast genotyping of known SNPs through approximate k-mer matching. Bioinformatics. 2016;32(17):538–44.
Sun C, Medvedev P. Toward fast and accurate SNP genotyping from whole genome sequencing data for bedside diagnostics. Bioinformatics. 2019;35(3):415–20.
Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34(5):525–7.
Ames SK, Hysom DA, Gardner SN, Lloyd GS, Gokhale MB, Allen JE. Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics. 2013;29(18):2253–60.
Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15(3):1–12.
Břinda K, Salikhov K, Pignotti S, Kucherov G. ProPhyle: a phylogeny-based metagenomic classifier using the Burrows-Wheeler transform. Poster at HiTSeq 2017. 2017.
Corvelo A, Clarke WE, Robine N, Zody MC. taxMaps: comprehensive and highly accurate taxonomic classification of short-read data in reasonable time. Genome Res. 2018;28(5):751–8.
Ye SH, Siddle KJ, Park DJ, Sabeti PC. Benchmarking metagenomics tools for taxonomic classification. Cell. 2019;178(4):779–94.
Sirén J. Indexing variation graphs. In: 2017 Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments (ALENEX). SIAM; 2017. pp. 13–27.
Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, Jones W, Garg S, Markello C, Lin MF, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol. 2018;36(9):875–9.
Benoit G. Simka: fast k-mer-based method for estimating the similarity between numerous metagenomic datasets. In: RCAM. Le Chesnay Cedex: Inria Domaine de Voluceau Rocquencourt; 2015.
David S, Mentasti M, Tewolde R, Aslett M, Harris SR, Afshar B, Underwood A, Fry NK, Parkhill J, Harrison TG. Evaluation of an optimal epidemiological typing scheme for Legionella pneumophila with whole-genome sequence data using validation guidelines. J Clin Microbiol. 2016;54(8):2135–48.
Chattaway MA, Schaefer U, Tewolde R, Dallman TJ, Jenkins C. Identification of Escherichia coli and Shigella species from whole-genome sequences. J Clin Microbiol. 2017;55(2):616–23.
Clausen PT, Aarestrup FM, Lund O. Rapid and precise alignment of raw reads against redundant databases with KMA. BMC Bioinformatics. 2018;19(1):1–8.
Pandey P, Almodaresi F, Bender MA, Ferdman M, Johnson R, Patro R. Mantis: a fast, small, and exact large-scale sequence-search index. Cell Syst. 2018;7(2):201–7.
Marchet C, Kerbiriou M, Limasset A. Indexing de Bruijn graphs with minimizers. In: RECOMB-Seq 2019 - 9th RECOMB Satellite Workshop on Massively Parallel Sequencing. Le Chesnay Cedex: Inria Domaine de Voluceau Rocquencourt; 2019. pp. 1–16.
Holley G, Melsted P. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol. 2020;21(1):1–20.
Pevzner PA. l-Tuple DNA sequencing: computer analysis. J Biomol Struct Dyn. 1989;7(1):63–73.
Idury RM, Waterman MS. A new algorithm for DNA sequence assembly. J Comput Biol. 1995;2(2):291–306.
Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci. 2001;98(17):9748–53. https://doi.org/10.1073/pnas.171285098.
Chaisson MJ, Pevzner PA. Short read fragment assembly of bacterial genomes. Genome Res. 2008;18(2):324–30.
Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19(6):1117–23.
Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010;20(2):265–72.
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. SPAdes: A New Genome Assembly Algorithm and Its Applications to SingleCell Sequencing. J Comput Biol. 2012;19(5):455. https://doi.org/10.1089/cmb.2012.0021.
Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012;1(1):2047–217.
Chikhi R, Rizk G. Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol Biol. 2013;8(1):22.
Chikhi R, Limasset A, Medvedev P. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics. 2016;32(12):201–8.
Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, Birol I. ABySS 2.0: resourceefficient assembly of large genomes using a Bloom filter. Genome Res. 2017;27(5):768–777. https://doi.org/10.1101/gr.214346.116.
Ruan J, Li H. Fast and accurate long-read assembly with wtdbg2. Nat Methods. 2020;17(2):155–8.
Tomescu AI, Medvedev P. Safe and Complete Contig Assembly Through Omnitigs. J Comput Biol. 2017;24(6):590–602. https://doi.org/10.1089/cmb.2016.0141.
Acosta NO, Mäkinen V, Tomescu AI. A safe and complete algorithm for metagenomic assembly. Algorithms Mol Biol. 2018;13(1):1–12.
Cairo M, Khan S, Rizzi R, Schmidt S, Tomescu AI, Zirondelli EC. The hydrostructure: a universal framework for safe and complete algorithms for genome assembly. 2020. arXiv preprint arXiv:2011.12635.
Kececioglu JD, Myers EW. Combinatorial algorithms for DNA sequence assembly. Algorithmica. 1995;13(1):7–51.
Khan J, Kokot M, Deorowicz S, Patro R. Scalable, ultrafast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome Biol. 2022;23(1):1–32.
Cracco A, Tomescu AI. Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT. bioRxiv. 2022. https://doi.org/10.1101/2022.10.24.513174. https://www.biorxiv.org/content/early/2022/10/25/2022.10.24.513174.full.pdf
O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):733–45.
Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol. 2021;22(1):1–24.
Rahman A, Medvedev P. Representation of k-mer sets using spectrum-preserving string sets. J Comput Biol. 2021;28(4):381–94.
Pibiri GE. Sparse and skew hashing of k-mers. Bioinformatics. 2022;38(Supplement_1):185–194. https://doi.org/10.1093/bioinformatics/btac245.
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013. arXiv preprint arXiv:1303.3997.
Schmidt S, Alanko JN. Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time. In: Boucher C, Rahmann S, editors. 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), vol 242. Dagstuhl: Schloss Dagstuhl – Leibniz-Zentrum für Informatik; 2022. pp. 1–21. https://doi.org/10.4230/LIPIcs.WABI.2022.2. https://drops.dagstuhl.de/opus/volltexte/2022/17036.
Dufresne Y, Lemane T, Marijon P, Peterlongo P, Rahman A, Kokot M, Medvedev P, Deorowicz S, Chikhi R. The k-mer file format: a standardized and compact disk representation of sets of k-mers. Bioinformatics. 2022;38(18):4423–5.
Fan J, Khan J, Pibiri GE, Patro R. Spectrum preserving tilings enable sparse and modular reference indexing. bioRxiv. 2022. https://doi.org/10.1101/2022.10.27.513881. https://www.biorxiv.org/content/early/2022/10/28/2022.10.27.513881.full.pdf.
Kitaya K, Shibuya T. Compression of multiple k-mer sets by iterative SPSS decomposition. In: Carbone A, El-Kebir M, editors. 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), vol 201. Dagstuhl: Schloss Dagstuhl – Leibniz-Zentrum für Informatik; 2021. pp. 12:1–12:17. https://doi.org/10.4230/LIPIcs.WABI.2021.12. https://drops.dagstuhl.de/opus/volltexte/2021/14365.
Marchet C, Iqbal Z, Gautheret D, Salson M, Chikhi R. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics. 2020;36(Supplement_1):177–185. https://doi.org/10.1093/bioinformatics/btaa487. https://academic.oup.com/bioinformatics/articlepdf/36/Supplement_1/i177/33860751/btaa487.pdf.
Rahman A, Chikhi R, Medvedev P. Disk compression of k-mer sets. Algorithms Mol Biol. 2021;16(1):1–14.
Kwan MK. Graphic programming using odd or even points. Chin Math. 1962;1:273–7.
Edmonds J, Johnson EL. Matching, Euler tours and the Chinese postman. Math Program. 1973;5(1):88–124.
Kundeti V, Rajasekaran S, Dinh H. An efficient algorithm for Chinese postman walk on bidirected de Bruijn graphs. In: Wu W, Daescu O, editors. Combinatorial Optimization and Applications. Berlin, Heidelberg: Springer; 2010. p. 184–96.
Medvedev P, Georgiou K, Myers G, Brudno M. Computability of models for sequence assembly. In: Giancarlo R, Hannenhalli S, editors. Algorithms in Bioinformatics, 7th International Workshop, WABI 2007, Philadelphia, PA, USA, September 8-9, 2007, Proceedings. Lecture Notes in Computer Science, vol 4645. Berlin, Heidelberg: Springer; 2007. pp. 289–301. https://doi.org/10.1007/9783540741268_27.
Pibiri GE. On Weighted k-mer Dictionaries. In: Boucher C, Rahmann S, editors. 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), vol 242. Dagstuhl: Schloss Dagstuhl – Leibniz-Zentrum für Informatik; 2022. pp. 1–20. https://doi.org/10.4230/LIPIcs.WABI.2022.9. https://drops.dagstuhl.de/opus/volltexte/2022/17043
Pibiri GE, Trani R. PTHash: Revisiting FCH Minimal Perfect Hashing. In: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: Association for Computing Machinery; 2021. pp. 1339–1348.
Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004;20(18):3363–9.
Lenstra JK, Kan AR. Complexity of vehicle routing and scheduling problems. Networks. 1981;11(2):221–7.
Edmonds J, Karp RM. Theoretical improvements in algorithmic efficiency for network flow problems. J ACM (JACM). 1972;19(2):248–64.
Christofides N, Campos V, Corberán A, Mota E. In: Gallo G, Sandi C, editors. An algorithm for the Rural Postman problem on a directed graph. Berlin, Heidelberg: Springer; 1986. pp. 155–166. https://doi.org/10.1007/BFb0121091.
Even S. Graph Algorithms. Rockville: Computer Science Press; 1979.
Schäfer G. Weighted matchings in general graphs. Master’s thesis, Saarland University; 2000.
Kolmogorov V. Blossom V: a new implementation of a minimum cost perfect matching algorithm. Math Program Comput. 2009;1(1):43–67.
Dijkstra EW. A note on two problems in connexion with graphs. Numer Math. 1959;1(1):269–71.
Cáceres M, Cairo M, Mumey B, Rizzi R, Tomescu AI. Sparsifying, shrinking and splicing for minimum path cover in parameterized linear time. 2021. arXiv preprint arXiv:2107.05717. To appear in the Proceedings of the 2022 ACMSIAM Symposium on Discrete Algorithms (SODA 2022).
C. elegans Sequencing Consortium. Caenorhabditis elegans Bristol N2. 2013. https://www.ncbi.nlm.nih.gov/assembly/GCF_000002985.6/. Accessed 18 Apr 2023.
The international silkworm genome sequencing consortium. Bombyx mori p50T (= Dazao). 2008. https://www.ncbi.nlm.nih.gov/assembly/GCF_000151625.1/. Accessed 18 Apr 2023.
Genome Reference Consortium. Genome Reference Consortium Human Build 38 patch release 13 (GRCh38.p13). 2019. https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39/. Accessed 18 Apr 2023.
Leinonen R, Sugawara H, Shumway M. The sequence read archive. Nucleic Acids Res. 2010;39(suppl_1):19–21.
Institute of Genetics and Developmental Biology. Deep sequencing of Caenorhabditis elegans with transgenerational UPRmt. 2021. https://www.ncbi.nlm.nih.gov/sra/?term=SRR14447868. Accessed 18 Apr 2023.
University of Tokyo  Graduate School of Agricultural and Life Sciences (UTGALS). Illumina HiSeq 2500 paired end sequencing of SAMD00054089. 2016. https://www.ncbi.nlm.nih.gov/sra/?term=DRR064025. Accessed 18 Apr 2023.
NCBI. NIST Genome in a Bottle, 300X sequencing of HG001 (NA12878)131219_D00360_005_BH814YADXX. 2015. https://www.ncbi.nlm.nih.gov/sra/?term=SRR2052337 to https://www.ncbi.nlm.nih.gov/sra/?term=SRR2052425. Accessed 18 Apr 2023.
Grad Y. Data for “Genomic Epidemiology of Gonococcal Resistance to ExtendedSpectrum Cephalosporins, Macrolides, and Fluoroquinolones in the United States, 2000–2013’’. Zenodo. 2019. https://doi.org/10.5281/zenodo.2618836.
Croucher NJ, Finkelstein JA, Pelton SI, Parkhill J, Bentley SD, Lipsitch M, Hanage WP. Population genomic datasets describing the postvaccine evolutionary epidemiology of streptococcus pneumoniae. Sci Data. 2015;2(1):1–9.
Benson DA, Cavanaugh M, Clark K, KarschMizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Research. 2012;41(D1):36–42. https://doi.org/10.1093/nar/gks1195. https://academic.oup.com/nar/articlepdf/41/D1/D36/3680750/gks1195.pdf
Zhou Z, Alikhan NF, Mohamed K, Fan Y, Achtman M. The user’s guide to comparative genomics with EnteroBase, including case studies on transmissions of microclades of Salmonella, the phylogeny of ancient and modern Yersinia pestis genomes, and the core genomic diversity of all Escherichia. bioRxiv. 2019. https://doi.org/10.1101/613554. https://www.biorxiv.org/content/early/2019/11/25/613554.full.pdf.
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526(7571):68.
Norri T, Cazaux B, Dönges S, Valenzuela D, Mäkinen V. Founder reconstruction enables scalable and seamless pangenomic analysis. Bioinformatics. 2021;37(24):4611–9.
Köster J, Rahmann S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics. 2012;28(19):2520–2.
Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, TomkinsTinch CH, Valieris R, Köster J. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018;15(7):475–6.
Schmidt S. Matchtigs. GitHub. 2022. https://github.com/algbio/matchtigs. Accessed 18 Apr 2023.
Schmidt S. Matchtigs. Zenodo. 2022. https://doi.org/10.5281/zenodo.7371184.
Schmidt S. Matchtigs experiments. Zenodo. 2022. https://doi.org/10.5281/zenodo.7275990.
Pibiri GE. SSHash-Lite. GitHub. 2022. https://github.com/jermp/sshash-lite. Accessed 18 Apr 2023.
Pibiri GE. SSHash-Lite. Zenodo. 2022. https://doi.org/10.5281/zenodo.7277145.
Acknowledgements
We are very grateful to Paul Medvedev and Amatur Rahman for helpful initial discussions on this problem. We further wish to thank the Finnish Computing Competence Infrastructure (FCCI) for supporting this project with computational and data storage resources. We also wish to thank Andrea Cracco for providing us with the \(\sim\)309kx Salmonella pangenome. We further wish to thank the anonymous reviewers for their useful constructive feedback, which improved the presentation of the paper, the implementation and the experimental results. Finally, we wish to thank the Rust community (https://users.rust-lang.org) for explanations about language-specific details of parallel implementations.
Peer review information
Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Review history
The review history is available as Additional file 6.
Funding
Open Access funding provided by University of Helsinki including Helsinki University Central Hospital. This work was partially funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 851093, SAFEBIO), and by the Academy of Finland (grants No. 322595, 328877). This work was also partially supported by the project MobiDataLab (EU H2020 RIA, grant agreement No. 101006879). Funding for this research has also been provided by the European Union's Horizon Europe research and innovation programme (EFRA project, Grant Agreement Number 101093026).
Author information
Authors and Affiliations
Contributions
SK and AIT formulated the problem. AIT and SK designed an optimal algorithm, and SK implemented and evaluated its prototype. Thereafter, SS improved the algorithm’s design to its current form and provided the final implementation, as well as the optimisations required to make it practically relevant. SS developed the greedy heuristic and implemented it. JA, SS and AIT designed the experiments and interpreted the results. SS performed all experiments (except for those on SSHash-Lite) and developed all further code published in the context of this work. SS wrote the manuscript under the supervision of the other authors, except for the “k-mer-based short read queries” section. GEP created and described SSHash-Lite in the “k-mer-based short read queries” section and ran the query experiments on the machine with a focus on single-core performance. All authors reviewed and approved the final version of the manuscript.
Authors’ twitter handles
Sebastian Schmidt @Sebasti66652811, Jarno Alanko @jnalanko, Giulio E. Pibiri @giulio_pibiri, and Alexandru I. Tomescu @AlexTomescu0.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
This study only uses publicly available datasets, hence an ethics approval or consent to participate is not required.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Additional experiments and proof of optimality. Additional CL and SC data from our experiments, with varying k and min abundance. Also, the average unitig length and total unitig count are plotted. Performance measurements with varying numbers of threads. A query experiment with Bifrost, as well as an SSHash-Lite experiment on a machine with a focus on single-core performance. Proof of optimality for our algorithm.
Additional file 2.
Accessions of the genomes used for the E. coli pangenome. A list of links to the genomes in fasta format on NCBI.
Additional file 3.
File names of the genomes used for the Salmonella pangenome. A list of names of files downloaded from EnteroBase [78].
Additional file 4.
Accessions of the reads used for the Salmonella query experiment. A list of sequence read archive accession numbers for the 10 sets of Salmonella short reads used for querying.
Additional file 5.
Accessions of the reads used for the E. coli query experiment. A list of sequence read archive accession numbers for the 30 sets of E. coli short reads used for querying.
Additional file 6.
Review history.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Schmidt, S., Khan, S., Alanko, J.N. et al. Matchtigs: minimum plain text representation of k-mer sets. Genome Biol 24, 136 (2023). https://doi.org/10.1186/s13059-023-02968-z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13059-023-02968-z
Keywords
 k-mer sets
 Plain text compression
 Graph algorithm
 Sequence analysis
 Genomic sequences
 Minimum-cost flow
 Chinese postman problem