 Method
 Open Access
Scalable, ultrafast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2
Genome Biology volume 23, Article number: 190 (2022)
Abstract
The de Bruijn graph is a key data structure in modern computational genomics, and construction of its compacted variant resides upstream of many genomic analyses. As the quantity of genomic data grows rapidly, this often forms a computational bottleneck. We present Cuttlefish 2, significantly advancing the state of the art for this problem. On a commodity server, it reduces the graph construction time for 661K bacterial genomes, of size 2.58 Tbp, from 4.5 days to 17–23 h; and it constructs the graph for 1.52 Tbp white spruce reads in approximately 10 h, while the closest competitor requires 54–58 h, using considerably more memory.
Background
Rapid developments in the throughput and affordability of modern sequencing technologies have made the generation of billions of short-read sequences from a panoply of biological samples highly time- and cost-efficient. The National Center for Biotechnology Information (NCBI) has now moved the Sequence Read Archive (SRA) to the cloud, and this repository stores more than 14 petabytes worth of sequencing data [1]. Yet, this is only a fraction of the total sequencing data that has been produced, which is expected to reach exabyte-scale within the current decade [2]. In addition to the continued sequencing of an ever-expanding catalog of various types and states of tissues from reference organisms, metagenomic sequencing of environmental [3] and microbiome [4] samples is also expected to enjoy similarly immense growth.
Given the expansive repository of existing sequencing data and the rate of acquisition, the authors of [5] argue that the ability of computational approaches to keep pace with data acquisition has become one of the main bottlenecks in contemporary genomics. These needs have spurred method developers to produce ever more efficient and scalable computational methods for a variety of genomic analysis tasks, from genome and transcriptome assembly to pangenome analysis. Against this backdrop, the de Bruijn graph, along with its variants, has become a compact and efficient data representation of increasing importance and utility across computational genomics.
The de Bruijn graph originated in combinatorics as a mathematical construct devised to prove a conjecture about binary strings posed by Ir. K. Posthumus [6, 7]. In bioinformatics, de Bruijn graphs were introduced in the context of genome assembly algorithms for short reads [8, 9], although the graph introduced in this context adopts a slightly different definition than in combinatorics. Subsequently, the de Bruijn graph has gradually been used in an increasing variety of contexts within computational biology, including but not limited to: read correction [10, 11], genomic data compression [12], genotyping [13], structural variant detection [14], read mapping [15, 16], sequence-similarity search [17], metagenomic sequence analysis [18,19,20], transcriptome assembly [21, 22], transcript quantification [23], and long-read assembly [24,25,26].
In the context of fragment assembly—whether in forming contigs for whole-genome assembly pipelines [27, 28], or in encapsulating the read set into a summary representative structure for a host of downstream analyses [29,30,31,32]—de Bruijn graphs continue to be used extensively. The non-branching paths in de Bruijn graphs are uniquely-assemblable contiguous sequences (known as unitigs) from the sequencing reads. Thus, they are certain to be present in any faithful genomic reconstruction from these reads, have no ambiguities regarding repeats in the data, and are fully consistent with the input. As such, maximal unitigs are excellent candidates to summarize the raw reads, capturing their essential substance, and are usually the output of the initial phase of modern de novo short-read assembly tools. Collapsing a set of reads into this compact set of fragments that preserve their effective information can directly contribute to the efficiency of many downstream analyses over the read set.
When constructed from reference genome sequences, the unitigs in the de Bruijn graphs correspond to substrings in the references that are shared identically across subsets of the genomes. Decomposing the reference collection into these fragments retains much of its effective information, while typically requiring much less space and memory to store, index, and analyze, than processing the collection of linear genomes directly. The ability to compactly and efficiently represent shared sequences has led many modern sequence analysis tools to adopt the de Bruijn graph as a central representation, including sequence indexers [33], read aligners [15, 16], homology mappers [34, 35], and RNA-seq analysis tools [23, 36, 37]. Likewise, pangenome analysis tools [38,39,40,41,42,43] frequently make use of the maximal unitigs of the input references as the primary units upon which their core data structures and algorithms are built.
The vast majority of the examples described above make use of the compacted de Bruijn graph. A de Bruijn graph is compacted by collapsing each of its maximal, non-branching paths (unitigs) into a single vertex. Many computational genomics workflows employing the (compacted) de Bruijn graph are multi-phased, and typically, their most resource-intensive step is the initial one: construction of the regular and/or the compacted de Bruijn graph. The computational requirements for constructing the graph are often considerably higher than the downstream steps—posing major bottlenecks in many applications [13, 30]. As such, there has been a concerted effort over the past several years to develop resource-frugal methods capable of constructing the compacted graph [44,45,46,47,48,49,50,51]. Critically, solving this problem efficiently and in a context independent of any specific downstream application yields a modular tool [45, 47] that can be used to enable a wide variety of subsequent computational pipelines.
To address the scalability challenges of constructing the compacted de Bruijn graph, we recently proposed a novel algorithm, CUTTLEFISH [44], that exhibited faster performance than preexisting state-of-the-art tools, using (often multiple times) less memory. However, the presented algorithm is only applicable when constructing the graph from existing reference sequences. It cannot be applied in a number of contexts, such as fragment assembly or contig extraction from raw sequencing data. In this paper, we present CUTTLEFISH 2, a fast and memory-frugal algorithm for constructing compacted de Bruijn graphs, applicable both to raw sequencing short reads and to assembled references, that can scale to very large datasets. It builds upon the novel idea of modeling de Bruijn graph vertices as Deterministic Finite Automata (DFA) [52] from [44]. However, the DFA model itself has been modified, and the algorithm has been generalized, so as to accommodate all valid forms of input. At the same time, in the case of constructing the graph from reference sequences, it is considerably faster than the previous approach, while retaining its frugal memory profile. We evaluated CUTTLEFISH 2 on a collection of datasets with diverse characteristics, and assessed its performance compared to other leading compacted de Bruijn graph construction methods. We observed that CUTTLEFISH 2 demonstrates superior performance in all the experiments we considered.
Additionally, we demonstrate the flexibility of our approach by presenting another application of the algorithm. The compacted de Bruijn graph forms a vertex-decomposition of the graph, while preserving the graph topology [47]. However, for some applications, the vertex-decomposition alone is sufficient, and preservation of the topology is redundant. For example, for applications such as performing presence-absence queries for k-mers or associating information to the constituent k-mers of the input [53, 54], any set of strings that preserves the exact set of k-mers from the input sequences can be sufficient. Relaxing the defining requirement of unitigs, that the paths be non-branching in the underlying graph, and seeking instead a set of maximal non-overlapping paths covering the de Bruijn graph, results in a more compact representation of the input data. This idea has recently been explored in the literature, with the representation being referred to as a spectrum-preserving string set [55], and the paths themselves as simplitigs [56]. We demonstrate that CUTTLEFISH 2 can seamlessly extract such maximal path covers by simply constraining the algorithm to operate on some specific subgraph(s) of the original graph. We compared it to the existing tools available in the literature [57] for constructing this representation, and observed that it outperforms them in terms of resource requirements.
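To make the distinction between unitigs and path covers concrete, the following is a minimal illustrative sketch of a greedy maximal vertex-disjoint path cover over a node-centric de Bruijn graph. It is not CUTTLEFISH 2's actual algorithm: it works on the forward strand only, ignores canonical forms, and the function name and greedy extension strategy are our own assumptions for exposition.

```python
def greedy_path_cover(kmers):
    """Greedy maximal vertex-disjoint path cover of the node-centric
    de Bruijn graph over `kmers` (forward strand only, illustrative).
    Every k-mer ends up in exactly one output string, so the k-mer
    spectrum is preserved; unlike unitigs, a path may run straight
    through branching vertices."""
    K = set(kmers)
    if not K:
        return []
    k = len(next(iter(K)))
    cover = []
    while K:
        path = K.pop()
        # Extend forward through any remaining successor k-mer.
        while True:
            nxt = next((path[-(k - 1):] + c for c in "ACGT"
                        if path[-(k - 1):] + c in K), None)
            if nxt is None:
                break
            K.remove(nxt)
            path += nxt[-1]
        # Extend backward through any remaining predecessor k-mer.
        while True:
            prv = next((c + path[:k - 1] for c in "ACGT"
                        if c + path[:k - 1] in K), None)
            if prv is None:
                break
            K.remove(prv)
            path = prv[0] + path
        cover.append(path)
    return cover
```

Because each extracted path removes its k-mers from the working set, the output strings are vertex-disjoint and jointly spell exactly the input spectrum, which is the defining property of a spectrum-preserving string set.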
Results
CUTTLEFISH 2 overview
We present a highlevel overview of the CUTTLEFISH 2 algorithm here. A complete treatment is provided in the “Algorithm” section.
CUTTLEFISH 2 takes as input a set \(\mathscr {R}\) of strings that are either short reads or whole-genome references, a k-mer length k, and a frequency threshold f_{0} ≥ 1. As output, it produces the maximal unitigs of the de Bruijn graph \(G(\mathscr {R}, k)\). Figure 1 highlights the major steps in the algorithm.
CUTTLEFISH 2 first enumerates the set \(\mathscr {E}\) of edges of \(G(\mathscr {R}, k)\): the (k+1)-mers present at least f_{0} times in \(\mathscr {R}\). In this way, potential sequencing errors, which may be present when read sets are given as input, are discarded. Then the set \(\mathscr {V}\) of vertices of \(G(\mathscr {R}, k)\), which are the k-mers present in these (k+1)-mers, is extracted from \(\mathscr {E}\). Next, a Minimal Perfect Hash Function (MPHF) \(\mathfrak {f}\) over these vertices is constructed, mapping them bijectively to \([1, |\mathscr {V}|]\). This provides a space-efficient way to associate information to the vertices through hashing. Modeling each vertex \(v \in \mathscr {V}\) as a Deterministic Finite Automaton (DFA), a piecewise traversal over \(G(\mathscr {R}, k)\) is made using \(\mathscr {E}\), computing the state S_{v} of the automaton of each \(v \in \mathscr {V}\), associated to v through \(\mathfrak {f}(v)\). The DFA modeling scheme ensures the retention of just enough information per vertex that the maximal unitigs are constructible afterwards from the automata states. Then, with another piecewise traversal over \(G(\mathscr {R}, k)\) using \(\mathscr {V}\) and the states collection S, CUTTLEFISH 2 retrieves all the non-branching edges of \(G(\mathscr {R}, k)\), retained by the earlier traversal, and stitches them together in chains, constructing the maximal unitigs.
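The steps above can be sketched in simplified form. The toy Python below is an assumption for exposition, not the actual implementation: it works on the forward strand only and replaces the MPHF and DFA machinery with plain in-memory dictionaries. It enumerates (k+1)-mer edges with frequency filtering, derives the induced vertex set, and stitches non-branching edges into maximal unitigs.

```python
from collections import Counter, defaultdict

def maximal_unitigs(reads, k, f0=1):
    """Toy version of the pipeline: edges are (k+1)-mers seen >= f0
    times; vertices are the k-mers they spell; maximal unitigs are the
    non-branching paths. Forward strand only; pure cycles with no
    branching entry point are skipped in this sketch."""
    # Step 1: edge enumeration with frequency filtering.
    freq = Counter(r[i:i + k + 1] for r in reads for i in range(len(r) - k))
    succ, pred = defaultdict(set), defaultdict(set)
    for e, c in freq.items():
        if c >= f0:
            succ[e[:k]].add(e[1:])
            pred[e[1:]].add(e[:k])
    # Step 2: the vertex set induced by the retained edges.
    vertices = set(succ) | set(pred)

    def starts_unitig(v):
        # v begins a maximal unitig unless it has exactly one
        # predecessor p, and p has exactly one successor (namely v).
        if len(pred[v]) != 1:
            return True
        p = next(iter(pred[v]))
        return len(succ[p]) != 1

    # Step 3: walk forward from each start while the path stays
    # non-branching, spelling out one maximal unitig per walk.
    unitigs = []
    for v in sorted(vertices):
        if not starts_unitig(v):
            continue
        path, cur = v, v
        while len(succ[cur]) == 1:
            nxt = next(iter(succ[cur]))
            if starts_unitig(nxt):
                break
            path += nxt[-1]
            cur = nxt
        unitigs.append(path)
    return sorted(unitigs)
```

In the real algorithm, the per-vertex bookkeeping that `succ`/`pred` provide here is compressed into a few bits of DFA state addressed through the MPHF, which is what keeps the memory footprint frugal.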
Experiments
We performed a number of experiments to characterize the various facets of the CUTTLEFISH 2 algorithm, its implementation, and some potential applications. We evaluated its execution performance compared to other available implementations of leading algorithms on de Bruijn graphs solving (1) the compacted graph construction and (2) the maximal path cover problems, applicable on shared-memory multicore machines. Although potentially feasible, CUTTLEFISH 2 is not designed to be distributed across a cluster of compute nodes; therefore, we did not consider relevant tools operating in that paradigm. We assessed its ability to construct compacted graphs and path covers for both sequencing reads and large pangenome collections. By working on the (k+1)-mer spectrum, the new method performs a substantial amount of data reduction on the input sequences, yielding considerable speedups over the CUTTLEFISH algorithm [44], which instead requires multiple passes over the input sequences.
Next, we assess some structural characteristics of the algorithm and its implementation. Given an input dataset and a fixed internal parameter γ, the time- and the space-complexity of CUTTLEFISH 2 depend on k (see the “Asymptotics” section). We evaluated the impact of k on its execution performance, and also assessed some structural properties of the compacted graph that change with the parameter k. Moreover, we appraised the parallel scalability of the different steps of the algorithm, characterizing the ones that scale particularly well with increasing processor-thread count, as well as those that saturate more quickly.
A diverse collection of datasets has been used to conduct the experiments. We delineate the pertinent datasets for the experiments in their corresponding sections. The commands used for executing the different tools are available in Additional file 1: Sec. 1.10.
We compared the outputs of CUTTLEFISH 2 to those of the several other tools used throughout our experiments. A detailed discussion of this is presented in Additional file 1: Sec. 1.5.
Computation system for evaluation
All experiments were performed on a single server with two Intel Xeon E5-2699 v4 2.20 GHz CPUs (44 cores in total, supporting up to 88 threads), 512 GB of 2.40 GHz DDR4 RAM, and a number of 3.6 TB Toshiba MG03ACA4 ATA HDDs. The system runs Ubuntu 16.10 (GNU/Linux 4.8.0-59-generic). The running times and the maximum memory usages were measured with the GNU time command, and the intermediate disk usage was measured using the Linux commands inotifywait and du.
Compacted graph construction for sequencing data
We evaluated the performance of CUTTLEFISH 2 in constructing compacted de Bruijn graphs from short-read sequencing data compared to available implementations of other leading compaction algorithms: (1) ABYSS-BLOOM-DBG, the maximal-unitigs assembler of the ABYSS 2.0 assembly pipeline [27], (2) BIFROST [45], (3) DEGSM [46], and (4) BCALM 2 [47].
The performances were tested on a number of short-read datasets with varied characteristics: (1) Mammalian datasets: (i) a human read set (NIST HG004) from an Ashkenazi Jewish female Homo sapiens (paired-end 250 bp Illumina reads with 70× coverage, SRA3440461–95, 148 GB compressed FASTQ), from [58], and (ii) an RNA sequencing dataset (ENA PRJEB3365) of 465 human lymphoblastoid cell line samples from the 1000 Genomes project (single-end 36 bp small-RNA-seq Illumina reads, ERP001941, 140 GB compressed FASTQ), from [59]; (2) Metagenomic datasets: (i) a gut microbiome read set (ENA PRJEB33098) from nine individuals (paired-end 150 bp Illumina reads with high coverage, ERP115863, 45 GB compressed FASTQ), from [60], and (ii) a soil metagenome read set (Iowa Corn) from 100-years-cultivated Iowa agricultural corn soil (paired-end 76 bp and 114 bp Illumina reads with low coverage, SRX100357 and SRX099904–06, 152 GB compressed FASTQ), used by [61]; and (3) Large organism dataset: a white spruce read set (NCBI PRJNA83435) from a Canadian Picea glauca tree (paired-end 150 bp and 100 bp Illumina reads with high coverage, SRA056234, 1.14 TB compressed FASTQ), from [62]. Table 1 contains the summary results of the benchmarking.
The frequency threshold f_{0} of k-mers ((k+1)-mers in the case of CUTTLEFISH 2^{Footnote 1}) for the algorithms was approximated using k-mer frequency distributions, so as to roughly minimize the misclassification rates of weak and solid k-mers^{Footnote 2} in these experiments (see Additional file 1: Sec. 1.1). In many practical scenarios, it might be preferable to skip computing an (approximate) frequency distribution, setting f_{0} through some informed choice based on the properties of the input data (e.g., the sequencing depth and protocol). This can incorporate more weak k-mers into the graph. We present the results for such a scenario in Additional file 1: Table S2 on the human read set, setting f_{0} to just 2.
Across the different datasets and algorithms evaluated, several trends emerge, notable from Table 1. First, we observe that for every dataset considered, CUTTLEFISH 2 is the fastest tool to process the data, while simultaneously using the least amount of memory. If we allow CUTTLEFISH 2 to match the memory used by the second most memoryfrugal method (which is always BCALM 2 here), then it often completes even more quickly. We note that CUTTLEFISH 2 retains its performance lead over the alternative approaches across a wide range of different data input characteristics.
Among all the methods tested, CUTTLEFISH 2 and BCALM 2 were the only tools able to process all the datasets to completion under the different configurations tested, within the memory and disk-space constraints of the testing system. The rest of the methods generally required substantially more memory, sometimes over an order of magnitude more, depending on the dataset.
Of particular interest is CUTTLEFISH 2’s performance compared against BCALM 2. Relative to BCALM 2, CUTTLEFISH 2 is 1.7–5.3× faster on the human read set, while using 2.1–2.8× less memory. On the RNA-seq dataset, it is 5.9–8.3× faster, with 1.3× less memory. For the metagenomic datasets, it is 4.1–5.9× faster and uses 2.2–3.1× less memory on the gut microbiome data, and is 2.8–8.5× faster using 2.5–2.7× less memory on the soil data. On the largest sequencing dataset here, the white spruce read set, CUTTLEFISH 2 is 5.4–5.7× faster and uses 1.3–2.6× less memory—taking about 10 h, compared to at least 54 h for BCALM 2.
The timing-profiles of BCALM 2 and CUTTLEFISH 2, excluding their similar initial stages (k-mer and (k+1)-mer enumeration, respectively), are shown in Additional file 1: Table S4. We also note some statistics of the de Bruijn graphs and their compacted forms for these datasets in Additional file 1: Table S5.
Compacted graph construction for reference collections
We assessed the execution performance of CUTTLEFISH 2 in constructing compacted de Bruijn graphs from whole-genome sequence collections in comparison to the available implementations of the following leading algorithms: (1) BIFROST [45], (2) DEGSM [46], and (3) BCALM 2 [47]. TWOPACO [48] is another notable algorithm applicable in this scenario, but we did not include it in the benchmarking as its output step lacks a parallelized implementation, and we consider very large sequence collections in this experiment.
We evaluated the performances on a number of datasets with varying attributes: (1) Metagenomic collection: 30,691 representative sequences from the most prevalent human gut prokaryotic genomes, gathered by [66] (≈ 61B bp, 18 GB compressed FASTA); (2) Mammalian collection: 100 human genomes—7 real sequences from [49] and 93 sequences simulated by [48] (≈ 294B bp, 305 GB uncompressed FASTA); and (3) Bacterial archive: 661,405 bacterial genomes, collected by [67] from the European Nucleotide Archive (≈ 2.58T bp, 752 GB compressed FASTA). Table 2 conveys the summary results of the benchmarking.
Evaluating the performance of the different tools over these pangenomic datasets, we observe trends similar to those in Table 1, but with even more extreme differences than before. For a majority of the experiment configurations here, only BCALM 2 and CUTTLEFISH 2 were able to finish processing within the time and machine constraints. Again, CUTTLEFISH 2 exhibits the fastest runtime on all datasets, and the lowest memory usage on all datasets except the human gut genomes (where it consumes 1–2 GB more memory than BCALM 2, though taking 6–7 fewer hours to complete).
CUTTLEFISH 2 is 2.4–8.9× faster on the 30K human gut genomes compared to the closest competitors, using similar memory. On the 100 human reference sequences, CUTTLEFISH 2 is 4.3–4.7× faster, using 5.4–8.4× less memory; the only other tools able to construct this compacted graph successfully are DEGSM for k=27 (taking 4.3× as long and requiring 8.4× as much memory as CUTTLEFISH 2) and BCALM 2 for k=55 (taking over 4.7× as long and 5.4× as much memory as CUTTLEFISH 2). Finally, when constructing the compacted graph on the 661,405 bacterial genomes, CUTTLEFISH 2 is the only tested tool able to construct the graph for k=27. For k=55, BCALM 2 also completed, taking about 4.5 days, while CUTTLEFISH 2 finished in under a day, with a similar memory profile. Overall, we observe that for large pangenome datasets, CUTTLEFISH 2 is not only considerably faster and more memory-frugal than the alternative approaches, but is also the only tool able to reliably construct the compacted de Bruijn graph under all the different configurations tested, within the constraints of the experimental system.
Table S4 notes the timing-profiles for BCALM 2 and CUTTLEFISH 2 without their first step of k-mer and (k+1)-mer enumeration, and Table S5 shows some characteristics of the (compacted) de Bruijn graphs for these pangenome datasets.
Maximal path cover construction
The execution performance of CUTTLEFISH 2 in decomposing de Bruijn graphs into maximal vertex-disjoint paths was assessed compared to the only two tool implementations available in the literature [57] for this task: (1) PROPHASM [56] and (2) UST [55].
For sequencing data, we used (1) a roundworm read set (ENA DRR008444) from a Caenorhabditis elegans nematode (paired-end 300 bp Illumina reads, 5.6 GB compressed FASTQ); (2) the gut microbiome read set (ENA PRJEB33098) noted earlier; and (3) the human read set (NIST HG004) noted earlier. For whole-genome data, we used sequences from (1) a roundworm reference (Caenorhabditis elegans, VC2010) [68]; (2) a human reference (Homo sapiens, GRCh38); and (3) 7 real humans, collected from [49]. Table 3 presents the summary results of the benchmarking.
We note that CUTTLEFISH 2 outperforms the alternative tools for constructing maximal path covers in terms of the time and memory required. In the context of this task, CUTTLEFISH 2 also offers several qualitative benefits over these tools. For example, PROPHASM exposes only a single-threaded implementation. Further, it is restricted to values of k ≤ 32 and only accepts genomic sequences as input (and thus is not applicable to read sets). UST first makes use of BCALM 2 for maximal-unitig extraction—which we observed to be outperformed by CUTTLEFISH 2 in the earlier experiments—and then employs a sequential graph traversal on the compacted graph to extract a maximal path cover. For this problem, CUTTLEFISH 2 bypasses the compacted graph construction, and directly extracts a maximal cover.
We observe that CUTTLEFISH 2 is competitive with these tools in single-threaded execution. On moderate-sized datasets using multiple threads, it was 2–3.8× faster than UST while using 2.2–12.6× less memory on sequencing data, and for reference sequences it was 2.8–6.1× faster than UST while using 2.9–6.3× less memory.
We also provide a comparison of the maximal unitig-based and the maximal path cover-based representations of de Bruijn graphs in Additional file 1: Table S6. We observe that, for the human read set, the path cover representation requires 19–24% less space than the unitigs. For the human genome reference and the 7-humans pangenome references, these reductions are 14–22% and 20–25%, respectively. From the statistics of both representations on the gut microbiome read set, it is evident that the corresponding de Bruijn graphs are highly branching, as might be expected for metagenomic data. The space reductions with path cover in these graphs are 33–36%.
Structural characteristics
Given an input dataset \(\mathscr {R}\) and a fixed frequency threshold f_{0} for the edges (i.e., (k+1)-mers), the structure of the de Bruijn graph \(G(\mathscr {R}, k)\) is completely determined by the k-mer size—the edge- and the vertex-counts depend on k, and the asymptotic characteristics of the algorithm are dictated only by the k-mer size k and the hash function space-time tradeoff factor γ (see the “Asymptotics” section). We evaluated how CUTTLEFISH 2 is affected practically by changes in the k-value. The human read set (NIST HG004) noted earlier was used for these evaluations.
For a range of increasing k-values (and a constant γ), we measured the execution performance of CUTTLEFISH 2, and the following metrics of the maximal unitigs it produced: the number of unitigs, the average and the maximum unitig lengths, along with the N50^{Footnote 3} and the NGA50^{Footnote 4} scores for contig-contiguity. Across the varying k’s, Table 4 reports the performance and the unitig metrics.
The unitig metrics on this data convey what one might expect—as k increases, so do the average and the maximum lengths of the maximal unitigs, and the N50 and NGA50 metrics, since the underlying de Bruijn graph typically gets less tangled as the k-mer size exceeds repeat lengths [70]. It is worth noting that, since we consider here just the extraction of unitigs, and no graph cleaning or filtering steps (e.g., bubble popping and tip clipping), we expect the N50 to be fairly short.
Perhaps the more interesting observation to be gleaned from the results is the scaling behavior of CUTTLEFISH 2 in terms of k. While the smallest k-value leads to the fastest overall graph construction, the increase in runtime as k grows (and with it the machine-word count needed to encode each k-mer) is rather moderate, which follows the expected asymptotics (see the “Time complexity” section). On the other hand, we observe that this increase can be offset by allowing CUTTLEFISH 2 to execute with more memory (which helps in the bottleneck step, (k+1)-mer enumeration). We also note that, while the timing-profile exhibits reasonably good scalability over the parameter k, the effect on the required memory is rather small—it is not directly determined by the k-value, but rather is completely dictated by the distinct k-mer count (see the “Space complexity” section).
Parallel scaling
We assessed the scalability of CUTTLEFISH 2 across a varying number of processor-threads. For this experiment, we downsampled the human read set NIST HG004 from 70× to 20× coverage and used this as input. We set k to 27 and 55, and executed CUTTLEFISH 2 with thread-counts ranging from 1 to 32. For k=27, Fig. 2a shows the time incurred by each step of the algorithm, and Fig. 2b demonstrates their speedups (i.e., the factor of improvement in execution speed with different numbers of processor-threads). Additional file 1: Fig. S2 shows these metrics for k=55.
On the computation system used, we observe that all steps of CUTTLEFISH 2 scale well until about 8 threads. Beyond 8 threads, all steps except the minimal perfect hash construction continue to scale. Figure 2a shows that the most time-intensive step in the algorithm is the initial edge set enumeration. This step, along with vertex enumeration and DFA states computation, continues to show reasonably good scaling behavior until about 20 threads, then gradually saturates. The final step of unitig extraction scales well up to the maximum thread-count we tested with (32 in this case).
It is worth reiterating that all experiments were performed on standard hard drives, and that the most resource-intensive step of edge enumeration can be quite input/output (I/O) bound, while the rest of the steps also iterate through the on-disk set of edges or vertices, and are thus bound by disk-read speed. So one might expect different (and quite possibly better) scaling behavior for the I/O-heavy operations when executing on faster external storage, e.g., SSD or NVMe drives [71]. This is further evidenced by [72], who show that KMC 3, the method used for the edge and the vertex enumeration steps in CUTTLEFISH 2, can gain considerable speed on large datasets when executed on SATA SSDs.
Conclusion
In this paper, we present CUTTLEFISH 2, a new algorithm for constructing the compacted de Bruijn graph, which is very fast, memory-frugal, and highly scalable in terms of the extent of the input data it can handle. CUTTLEFISH 2 builds upon the work of [44], which already advanced the state of the art in reference-based compacted de Bruijn graph construction. CUTTLEFISH 2 simultaneously addresses the limitation and the bottleneck of CUTTLEFISH by substantially generalizing the work to allow graph construction from both raw sequencing reads and reference genome sequences, while offering a more efficient performance profile. It achieves this, in large part, by bypassing the need to make multiple passes over the original input for very large datasets.
As a result, CUTTLEFISH 2 is able to construct compacted de Bruijn graphs much more quickly, while using less memory—both often multiple times—than the numerous other methods evaluated. Since the construction of the graph resides upstream of many computational genomics analysis pipelines, and as it is typically one of the most resource-intensive steps in these approaches, CUTTLEFISH 2 could help remove computational barriers to using the de Bruijn graph in analyzing the ever-larger corpora of genomic data.
In addition to the advances it represents in the compacted graph construction, we also demonstrate the ability of the algorithm to compute another spectrum-preserving string set of the input sequences—maximal path covers, which have recently been adopted in a growing variety of applications in the literature [57]. A simple restriction on the considered graph structure allows CUTTLEFISH 2 to build this construct much more efficiently than the existing methods.
Though a thorough exploration of the potential benefits of improved compacted de Bruijn graph construction to the manifold downstream analyses is outside the scope of the current work, we present a proof-of-concept application (Additional file 1: Sec. 1.9), demonstrating the benefits of our improved algorithm to the task of constructing an associative k-mer index.
As the scale of the data on which the de Bruijn graph and its variants must be constructed increases, and as the de Bruijn graph itself continues to find ever-more widespread uses in genomics, we anticipate that CUTTLEFISH 2 will enable its use in manifold downstream applications that may not have been possible earlier due to computational challenges, paving the way for an even more widespread role for the de Bruijn graph in high-throughput computational genomics.
CUTTLEFISH 2 is implemented in C++17, and is available open-source at https://github.com/COMBINElab/cuttlefish.
Methods
Related work
Here, we briefly discuss the other compacted de Bruijn graph construction algorithms included in the experiments against which we compare CUTTLEFISH 2.
The BCALM algorithm [50] partitions the k-mers from the input that pass frequency filtering into a collection of disk-buckets according to their minimizers [73], and processes each bucket sequentially as per the minimizer-ordering—loading all the strings of the bucket into memory, joining (or compacting) them maximally while keeping the resulting paths non-branching in the underlying de Bruijn graph, and distributing each resultant string into some other yet-to-be-processed bucket for potential further compaction, or to the final output. As is, BCALM is inherently sequential. BCALM 2 [47] builds upon this use of minimizers to partition the graph, but it modifies the k-mer partitioning strategy so that multiple disk-buckets can be compacted correctly in parallel, and then glues the further-compactable strings from the compacted buckets.
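As an illustration of the minimizer-based partitioning that BCALM builds on, the sketch below buckets k-mers by their lexicographically smallest m-mer. This is a simplification: real implementations typically rank m-mers by a hash rather than plain string order (for better load balance), and the function names here are our own.

```python
from collections import defaultdict

def minimizer(kmer, m):
    """Smallest m-mer of `kmer` under plain lexicographic order;
    production tools usually substitute a hash-based ordering."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def bucket_by_minimizer(kmers, m):
    """Group k-mers so that each (disk-)bucket holds the k-mers
    sharing a minimizer — the unit BCALM compacts one at a time."""
    buckets = defaultdict(list)
    for km in kmers:
        buckets[minimizer(km, m)].append(km)
    return dict(buckets)
```

The key property the partitioning exploits is that consecutive k-mers of a string tend to share a minimizer, so most unitig joins can happen within a single bucket.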
ABYSS-BLOOM-DBG is the maximal-unitigs assembler of the ABYSS 2.0 assembly tool [27]. It first saves all the k-mers from the input reads into a cascading Bloom filter [63] to discard the likely-erroneous k-mers. Then it identifies the reads that consist entirely of retained k-mers, and extends them in both directions within the de Bruijn graph by identifying neighbors using the Bloom filter, while discarding the potentially false-positive paths based on their spans—producing the maximal unitigs.
DEGSM first enumerates all the (k+2)-mers of the input that pass frequency filtering. Then, using a parallel external sort over partitions of this set, it groups together the (k+2)-mers with the same middle k-mer, enabling it to identify the branching vertices in the de Bruijn graph. It then merges the k-mers from the sorted buckets in a manner that produces a Burrows–Wheeler Transform [74] of the maximal unitigs.
BIFROST [45] constructs an approximate compacted de Bruijn graph first by saving the k-mers from the input in a Bloom filter [63], and then, for each potential non-erroneous k-mer, it extracts the maximal unitig containing it by extending the k-mer in both directions using the Bloom filter. Then, using a k-mer-counting-based strategy, it refines the graph by removing the false edges induced by the Bloom filter.
Definitions
A string s is an ordered sequence of symbols drawn from an alphabet \({\Sigma }\). For the purposes of this paper, we assume all strings to be over the alphabet Σ={A,C,G,T}, the DNA alphabet where each symbol has a reciprocal complement—the complementary pairs being {A,T} and {C,G}. For a symbol c∈Σ, \(\overline {c}\) denotes its complement. |s| denotes the length of s. A k-mer is a string with length k. s_{i} denotes the ith symbol in s (with 1-based indexing). A substring of s is a string entirely contained in s, and s_{i..j} denotes the substring of s located from its ith to the jth indices, inclusive. pre_{ℓ}(s) and suf_{ℓ}(s) denote the prefix and the suffix of s with length ℓ respectively, i.e., pre_{ℓ}(s)=s_{1..ℓ} and suf_{ℓ}(s)=s_{|s|−ℓ+1..|s|}, for some 0<ℓ≤|s|. For two strings x and y with suf_{ℓ}(x)=pre_{ℓ}(y), the ℓ-length glue operation ⊙^{ℓ} is defined as x⊙^{ℓ}y=x·y_{ℓ+1..|y|}, where · denotes the append operation.
For a string s, its reverse complement \(\overline {s}\) is the string obtained by reversing the order of the symbols in s and replacing each symbol with its complement, i.e., \(\overline {s} = \overline {s_{|s|}} \cdot \ldots \cdot \overline {s_{2}} \cdot \overline {s_{1}}\). The canonical form \(\widehat {s}\) of s is defined as the string \(\widehat {s} = \min (s, \overline {s})\), according to some consistent ordering of the strings in Σ^{|s|}. In this paper, we adopt the lexicographic ordering of the strings.
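These string operations can be sketched concisely. The following is a minimal illustration of the reverse complement, the canonical form, and the glue operation as defined above; the function names are illustrative and not taken from the Cuttlefish 2 codebase.

```python
# Minimal sketches of the string operations defined above, over the DNA
# alphabet {A, C, G, T}. Names are illustrative, not from Cuttlefish 2.

_COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(s: str) -> str:
    """Reverse the order of the symbols of s and complement each one."""
    return "".join(_COMP[c] for c in reversed(s))

def canonical(s: str) -> str:
    """Canonical form: the lexicographic min of s and its reverse complement."""
    return min(s, reverse_complement(s))

def glue(x: str, y: str, l: int) -> str:
    """The l-length glue x ⊙^l y, defined when suf_l(x) == pre_l(y)."""
    assert x[-l:] == y[:l], "glue requires an l-length suffix-prefix overlap"
    return x + y[l:]

# e.g., "ACGT" is its own reverse complement, hence its own canonical form;
# gluing two 4-mers with a 3-symbol overlap yields a 5-mer.
canonical("ACGT")
glue("ACGT", "CGTA", 3)
```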
Given a set \(\mathscr {S}\) of strings and an integer k>0, let \(\mathscr {K}\) and \(\mathscr {K}_{+1}\) be respectively the sets of k-mers and (k+1)-mers present as substrings in some \(s \in \mathscr {S}\). The (directed) node-centric de Bruijn graph \(G_{1} (\mathscr {S}, k) = (\mathscr {V}_{1}, \mathscr {E}_{1})\) is a directed graph where the vertex set is \(\mathscr {V}_{1} = \mathscr {K}\), and the edge set \(\mathscr {E}_{1}\) is induced by \(\mathscr {V}_{1}\): a directed edge \(e = (u, v) \in \mathscr {E}_{1}\) iff suf_{k−1}(u)=pre_{k−1}(v). The (directed) edge-centric de Bruijn graph \(G_{2} (\mathscr {S}, k) = (\mathscr {V}_{2}, \mathscr {E}_{2})\) is a directed graph where the edge set is \(\mathscr {E}_{2} = \mathscr {K}_{+1}\): each \(e \in \mathscr {K}_{+1}\) constitutes a directed edge (v_{1},v_{2}) where v_{1}=pre_{k}(e) and v_{2}=suf_{k}(e), and the vertex set \(\mathscr {V}_{2}\) is thus induced by \(\mathscr {E}_{2}\)^{Footnote 5}.
In this work, we adopt the edge-centric definition of de Bruijn graphs. Hence, we use the terms k-mer and vertex, and the terms (k+1)-mer and edge, interchangeably. We introduce both variants of the graph here as we compare (in the “Results” section) our algorithm with some other methods that employ the node-centric definition.
We use the bidirected variant of de Bruijn graphs in the proposed algorithm, and redefine \(\mathscr {K}_{+1}\) to be the set of canonical (k+1)-mers \(\widehat {z}\) such that z or \(\overline {z}\) appears as a substring in some \(s \in \mathscr {S}\)^{Footnote 6}. For a bidirected edge-centric de Bruijn graph \(G(\mathscr {S}, k) = (\mathscr {V}, \mathscr {E})\) — (i) the vertex set \(\mathscr {V}\) is the set of canonical forms of the k-mers present as substrings in some \(e \in \mathscr {K}_{+1}\), and (ii) the edge set is \(\mathscr {E} = \mathscr {K}_{+1}\), where an \(e \in \mathscr {E}\) connects the vertices \(\widehat {{pre}_{k}({e})}\) and \(\widehat {{suf}_{k}({e})}\). A vertex v has exactly two sides, referred to as the front side and the back side.
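The bidirected, edge-centric definition above can be made concrete with a toy construction: the edges are the canonical (k+1)-mers of the input, and the vertices are the canonical forms of the k-mers those edges contain. This sketch is purely illustrative—CUTTLEFISH 2 never materializes the graph this way.

```python
# A toy construction of the bidirected, edge-centric de Bruijn graph:
# edges = canonical (k+1)-mers of the input, vertices = canonical k-mers
# contained in those edges. Illustrative only; Cuttlefish 2 does not
# build the graph explicitly like this.

_COMP = str.maketrans("ACGT", "TGCA")

def canon(s: str) -> str:
    """Canonical form: lexicographic min of s and its reverse complement."""
    rc = s.translate(_COMP)[::-1]
    return min(s, rc)

def edge_centric_graph(strings, k):
    edges = {canon(s[i:i + k + 1])
             for s in strings for i in range(len(s) - k)}
    # each edge contributes its prefix and suffix k-mers as vertices
    vertices = {canon(e[:k]) for e in edges} | {canon(e[1:]) for e in edges}
    return vertices, edges

# "ACGTC" with k = 3 has two edges (4-mers), connecting two vertices,
# since "ACG" and "CGT" share the same canonical form.
V, E = edge_centric_graph(["ACGTC"], 3)
```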
For a (k+1)-mer z such that \(\widehat {z} \in \mathscr {K}_{+1}\), let \(u = \widehat {{pre}_{k}({z})}\) and \(v = \widehat {{suf}_{k}({z})}\). z induces an edge from the vertex u to the vertex v, and it is said to exit u and enter v. z connects to u’s back iff pre_{k}(z) is in its canonical form, i.e., pre_{k}(z)=u, and otherwise it connects to u’s front. Conversely, z connects to v’s front iff suf_{k}(z)=v, and otherwise to v’s back. Concisely put, z exits through u’s back iff z’s prefix k-mer is canonical, and enters through v’s front iff z’s suffix k-mer is canonical. The edge corresponding to z can be expressed as ((u,ψ_{u}),(v,ψ_{v})): it connects from the side ψ_{u} of the vertex u to the side ψ_{v} of the vertex v.
We prove in Lemma 1 (see Additional file 1: Sec. 3) that the two (k+1)-mers z and \(\overline {z}\) correspond to the same edge, but in reversed directions: \(\overline {z}\) induces the edge ((v,ψ_{v}),(u,ψ_{u}))—opposite from that of z. The two edges for z and \(\overline {z}\) constitute a bidirected edge e={(u,ψ_{u}),(v,ψ_{v})}, corresponding to \(\widehat {z} \in \mathscr {E}\). While crossing e during a graph traversal, we say that e spells the (k+1)-mer z when the traversal is from (u,ψ_{u}) to (v,ψ_{v}), and it spells \(\overline {z}\) in the opposite direction.
A walk w=(v_{0},e_{1},v_{1},…,v_{n−1},e_{n},v_{n}) is an alternating sequence of vertices and edges in \(G(\mathscr {S}, k)\), where the vertices v_{i} and v_{i+1} are connected with the edge e_{i+1}, and e_{i} and e_{i+1} are incident to different sides of v_{i}. |w| denotes its length in vertices, i.e., |w|=n+1. v_{0} and v_{n} are its endpoints, and the v_{i} for 0<i<n are its internal vertices. The walks (v_{0},e_{1},…,e_{n},v_{n}) and (v_{n},e_{n},…,e_{1},v_{0}) denote the same walk but in opposite orientations. If the edge e_{i} spells the (k+1)-mer l_{i}, then w spells l_{1}⊙^{k}l_{2}⊙^{k}…⊙^{k}l_{n}. If |w|=1, then it spells v_{0}. A path is a walk without any repeated vertex.
A unitig is a path p=(v_{0},e_{1},v_{1},…,e_{n},v_{n}) such that either |p|=1, or in \(G(\mathscr {S}, k)\):

1.
each internal vertex v_{i} has exactly one incident edge at each of its sides, the edges being e_{i} and e_{i+1};

2.
and v_{0} has e_{1} as the only edge incident to the side of v_{0} at which e_{1} is incident; similarly, v_{n} has e_{n} as the only edge incident to its corresponding side.
A maximal unitig is a unitig p=(v_{0},e_{1},v_{1},…,e_{n},v_{n}) that cannot be extended any further while remaining a unitig: there exists no x,y,e_{0}, or e_{n+1} such that (x,e_{0},v_{0},…,e_{n},v_{n}) or (v_{0},e_{1},…,v_{n},e_{n+1},y) is a unitig.
Figure 3a illustrates an example of the de Bruijn graph and the relevant constructs defined.
A compacted de Bruijn graph \(G_{c} (\mathscr {S}, k)\) is obtained by collapsing each maximal unitig of the de Bruijn graph \(G(\mathscr {S}, k)\) into a single vertex, contracting its constituent edges [75]. Figure 3b shows the compacted form of the graph in Fig. 3a. Given a set \(\mathscr {S}\) of strings and an integer k>0, the problem of constructing the compacted de Bruijn graph \(G_{c} (\mathscr {S}, k)\) is to compute the maximal unitigs of \(G(\mathscr {S}, k)\)^{Footnote 7}.
A vertex-disjoint path cover \(\mathscr {P}\) of \(G(\mathscr {S}, k) = (\mathscr {V}, \mathscr {E})\) is a set of paths such that each vertex \(v \in \mathscr {V}\) is present in exactly one path \(p \in \mathscr {P}\). Unless stated otherwise, we refer to this construct simply as a path cover. A maximal path cover is a path cover \(\mathscr {P}\) such that no two paths in \(\mathscr {P}\) can be joined into one single path, i.e., there exist no \(p_{1}, p_{2} \in \mathscr {P}\) such that p_{1}=(v_{0},e_{1},…,e_{n},x), p_{2}=(y,e′_{1},…,e′_{n},v′_{n}), and there is some edge \(\big \{ (x, s_{x}), (y, s_{y}) \big \} \in \mathscr {E}\) connecting the sides of x and y that are not incident to e_{n} and \(e^{\prime }_{n}\), respectively. Figure 3a provides examples of such covers.
Algorithm
Given a set \(\mathscr {R}\), either of short-reads sequenced from some biological sample, or of reference sequences, the construction of the compacted de Bruijn graph \(G_{c}(\mathscr {R}, k)\) for some k>0 is a data reduction problem in computational genomics. A simple algorithm to construct the compacted graph G_{c} involves constructing the ordinary de Bruijn graph \(G(\mathscr {R}, k)\) at first, and then applying a graph traversal algorithm [76] to extract all the maximal non-branching paths in G. However, this approach requires constructing the ordinary graph and retaining it in memory for traversal (or organizing it in a way that it can be paged into memory for efficient traversal). Both of these tasks can be infeasible for large enough input samples. This prompts the requirement of methods to construct G_{c} directly from \(\mathscr {R}\), bypassing G. CUTTLEFISH 2 is an asymptotically and practically efficient algorithm performing this task.
Another practical aspect of the problem is that the sequenced reads typically contain errors [77]. This is offset through increasing the sequencing depth—even if a read \(r \in \mathscr {R}\) contains some erroneous symbol at index i of the underlying sequence being sampled, a high enough sequencing depth should ensure that some other reads in \(\mathscr {R}\) contain the correct nucleotide present at index i. Thus, in practice, these erroneous symbols need to be identified—usually heuristically—and the vertices and the edges of the graph corresponding to them need to be taken into account. CUTTLEFISH 2 discards the edges, i.e., (k+1)-mers, that each occur fewer than some threshold f_{0} times, and only operates with the edges having frequencies ≥f_{0}.
Implicit traversals over \(G(\mathscr {R}, k)\)
When the input is a set \(\mathscr {S}\) of references, the CUTTLEFISH algorithm [44] makes a complete graph traversal on the reference de Bruijn graph^{Footnote 8} \(G(\mathscr {S}, k)\) through a linear scan over each \(s \in \mathscr {S}\). Also, the concept of sentinels^{Footnote 9} in \(G(\mathscr {S}, k)\) ensures that a unitig cannot span multiple input sequences. In one complete traversal, the branching vertices are characterized through obtaining a particular set of neighborhood relations; and then using another traversal, the maximal unitigs are extracted.
For a set \(\mathscr {R}\) of short-reads, however, a linear scan over a read \(r \in \mathscr {R}\) may not provide a walk in \(G(\mathscr {R}, k)\), since r may contain errors, which break a contiguous walk. Additionally, the concept of sentinels is not applicable for reads. Therefore, unitigs may span multiple reads. For a unitig u spanning the reads r_{1} and r_{2} consecutively, it is not readily inferable that r_{2} is to be scanned after r_{1} (or the reverse, for an oppositely-oriented traversal) while attempting to extract u directly from \(\mathscr {R}\), as the reads are not linked in the input for this purpose. Hence, contrary to the reference-input algorithm [44]—where a complete graph traversal is possible with just \(|\mathscr {R}|\) different walks when \(\mathscr {R}\) consists of reference sequences—CUTTLEFISH 2 resorts to making implicit piecewise-traversals over \(G(\mathscr {R}, k)\).
For the purpose of determining the branching vertices, the piecewise-traversal is completely coarse—each walk traverses just one edge. Making such walks, CUTTLEFISH 2 retains just enough adjacency information for the vertices (i.e., only the internal edges of the unitigs) so that the unitigs can then be piecewise-constructed without the input \(\mathscr {R}\). Avoiding the erroneous vertices is done through traversing only the solid edges ((k+1)-mers occurring ≥f_{0} times in \(\mathscr {R}\), where f_{0} is a heuristically-set input parameter).
Stitching together the pieces of a unitig is accomplished by making another piecewise-traversal over \(G(\mathscr {R}, k)\), not by extracting them directly from the input sequences (as opposed to CUTTLEFISH [44]). Each walk comprises the extent of a maximal unitig—the edges retained by the earlier traversal are used to make the walk and to stitch together the unitig.
In fact, the graph traversal strategy of CUTTLEFISH for reference inputs \(\mathscr {S}\) is a specific case of this generalized traversal, where a complete graph traversal is possible through a linear scan over the input, as each \(s \in \mathscr {S}\) constitutes a complete walk over G({s},k). Moreover, the maximal unitigs are tiled linearly in the sequences \(s \in \mathscr {S}\), and determining their terminal vertices is the core problem in that case, as extraction of the unitigs can then be performed by walking between the terminal vertices through scanning the \(s \in \mathscr {S}\).
A deterministic finite automaton model for vertices
While traversing the de Bruijn graph \(G(\mathscr {R}, k) = (\mathscr {V}, \mathscr {E})\) for the purpose of determining the maximal unitigs, it is sufficient to only keep track of information for each side s_{v} of each vertex \(v \in \mathscr {V}\) that can categorize it into one of the following classes:

1.
no edge has been observed to be incident to s_{v} yet

2.
s_{v} has multiple distinct incident edges

3.
s_{v} has exactly one distinct incident edge, for which there are |Σ|=4 possibilities (see Lemma 2, Additional file 1: Sec. 3).
Thus there are six classes to which each s_{v} may belong, and since v has two sides, it can be in one of 6×6=36 distinct configurations. Each such configuration is referred to as a state.
CUTTLEFISH 2 treats each vertex \(v \! \in \!\! \mathscr {V}\) as a Deterministic Finite Automaton (DFA) \(M_{v} = (\mathscr {Q}, \Sigma ', \delta, q_{0}, \mathscr {Q}')\):
States
\(\mathscr {Q}\) is the set of the 36 possible states for the automaton. Let f and b denote the number of distinct edges at the front and at the back, respectively, of a vertex v with a state q. Based on the incidence characteristics of v, the states \(q \in \mathscr {Q}\) can be partitioned into four disjoint state-classes: (1) fuzzy-front fuzzy-back (f≠1,b≠1), (2) fuzzy-front unique-back (f≠1,b=1), (3) unique-front fuzzy-back (f=1,b≠1), and (4) unique-front unique-back (f=1,b=1).
Input symbols
Σ^{′}={(s,c):s∈{front,back}, c∈Σ} is the set of the input symbols for the automaton. Each incident edge e of a vertex u is provided as input to u’s automaton. For u, an incident edge e={(u,s_{u}),(v,s_{v})} can be succinctly encoded by its incidence side s_{u} and a symbol c∈Σ—the (k+1)-mer \(\widehat {z}\) for e is one of the following, depending on s_{u} and whether \(\widehat {z}\) is exiting or entering u: u·c, \(\overline {u} \cdot c\), c·u, or \(c \cdot \overline {u}\).
Transition function
\(\delta : \mathscr {Q} \times \Sigma ' \rightarrow \mathscr {Q} \setminus \{q_{0}\}\) is the function controlling the state-transitions of the automaton. Figure 4 illustrates the state-transition diagram for an automaton as per δ.
Initial state
\(q_{0} \in \mathscr {Q}\) is the initial state of the automaton. This state corresponds to the configuration of the associated vertex at which no edge has been observed to be incident to either of its sides.
Accept states
\(\mathscr {Q}' = \mathscr {Q} \setminus \{q_{0}\}\) is the set of the states corresponding to vertexconfigurations having at least one edge^{Footnote 10}.
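The 36-state space admits a compact encoding. The following sketch (with hypothetical names, not taken from the Cuttlefish 2 implementation) represents each vertex side by one of its six possible conditions—no incident edge, multiple distinct incident edges, or a unique incident edge labeled by one of the |Σ|=4 symbols—and packs the (front, back) pair into a single code, fitting in 6 bits since 6×6=36≤2^6.

```python
# Hypothetical encoding of the 36 automaton states described above. Each
# vertex side is in one of 6 conditions: no incident edge ("none"),
# multiple distinct incident edges ("multi"), or a unique incident edge
# labeled by one of the 4 symbols. A state is a (front, back) pair,
# packed into one 6-bit code (6 * 6 = 36 <= 2^6).

SIDE_CODES = {"none": 0, "multi": 1, "A": 2, "C": 3, "G": 4, "T": 5}

def pack_state(front: str, back: str) -> int:
    """Pack a (front, back) side-condition pair into a code in [0, 36)."""
    return SIDE_CODES[front] * 6 + SIDE_CODES[back]

def unpack_state(code: int):
    """Recover the (front, back) side-condition pair from a packed code."""
    names = {v: k for k, v in SIDE_CODES.items()}
    return names[code // 6], names[code % 6]

q0 = pack_state("none", "none")   # the initial state: no incident edges
```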
Algorithm overview
We provide here a highlevel overview of the CUTTLEFISH 2\((\mathscr {R}, k, f_{0})\) algorithm. The input to the algorithm is a set \(\mathscr {R}\) of strings, an odd integer k>0, and an abundance threshold f_{0}>0; the output is the set of strings spelled by the maximal unitigs of the de Bruijn graph \(G(\mathscr {R}, k)\).
CUTTLEFISH 2\((\mathscr {R}, k, f_{0})\) executes in five high-level stages, and Fig. 1 illustrates these steps. Firstly, it enumerates the set \(\mathscr {E}\) of edges, i.e., (k+1)-mers that appear at least f_{0} times in \(\mathscr {R}\). Then the set \(\mathscr {V}\) of vertices, i.e., k-mers, are extracted from \(\mathscr {E}\). Having the distinct k-mers, it constructs a minimal perfect hash function h over \(\mathscr {V}\). At this point, a hash table structure is set up for the automata—the hash function being h, and each hash bucket having enough bits to store a state encoding. Then, making a piecewise traversal over \(G(\mathscr {R}, k)\) using \(\mathscr {E}\), CUTTLEFISH 2 observes all the adjacency relations in the graph, making appropriate state transitions along the way for the automata of the vertices u and v for each edge {(u,s_{u}),(v,s_{v})}. After all the edges in \(\mathscr {E}\) are processed, each automaton M_{v} resides in its correct state. Due to the design characteristic of the state-space \(\mathscr {Q}\) of M_{v}, the internal vertices of the unitigs in \(G(\mathscr {R}, k)\), as well as the non-branching sides of the branching vertices, have their incident edges encoded in their states. CUTTLEFISH 2 retrieves these unitig-internal edges and stitches them together in chains until branching vertices are encountered, thus extracting the maximal unitigs implicitly through another piecewise traversal, with each walk spanning a maximal unitig.
These major steps in the algorithm are detailed in the following subsections. Then we analyze the asymptotic characteristics of the algorithm in the “Asymptotics” section. Finally, we provide a proof of its correctness in Theorem 1 (see Additional file 1: Sec. 3).
Edge set construction
The initial enumeration of the edges, i.e., (k+1)-mers from the input set \(\mathscr {R}\), is performed with the KMC 3 algorithm [72]. KMC 3 enumerates the ℓ-mers of its input in two major steps. Firstly, it partitions the ℓ-mers based on signatures—a restrictive variant of minimizers^{Footnote 11}. Maximal substrings of the input strings with all their ℓ-mers having the same signature (referred to as super ℓ-mers) are partitioned into bins corresponding to the signatures. Typically the number of bins is much smaller than the number of possible signatures, and hence each bin may contain strings from multiple signatures (set heuristically to balance the bins). In the second phase, for each partition, its strings are split into substrings called (ℓ,x)-mers—an extension of ℓ-mers. These substrings are then sorted using a most-significant-digit radix sort [78] to unify the repeated ℓ-mers in the partition. For ℓ=k+1, the collection of these deduplicated partitions constitutes the edge set \(\mathscr {E}\).
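The binning step can be illustrated with a simplified sketch. Note that KMC 3 uses restricted "signatures" together with heuristic bin balancing; the sketch below substitutes plain lexicographic m-minimizers and omits the super ℓ-mer and (ℓ,x)-mer machinery, so it conveys only the partitioning idea, not the actual algorithm.

```python
# A simplified sketch of minimizer-based binning, in the spirit of the
# partitioning step described above. KMC 3 actually uses restricted
# "signatures" and balances bins heuristically; plain lexicographic
# m-minimizers are used here purely for illustration.

from collections import defaultdict

def minimizer(s: str, m: int) -> str:
    """Lexicographically smallest m-mer of s."""
    return min(s[i:i + m] for i in range(len(s) - m + 1))

def bin_lmers(text: str, l: int, m: int):
    """Group the distinct l-mers of text into bins keyed by minimizer."""
    bins = defaultdict(set)
    for i in range(len(text) - l + 1):
        lmer = text[i:i + l]
        bins[minimizer(lmer, m)].add(lmer)   # repeats unify within a bin
    return bins

# e.g., the 4-mers of "ACGTACGT" fall into two bins, keyed "AC" and "CG"
bins = bin_lmers("ACGTACGT", 4, 2)
```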
Vertex set extraction
CUTTLEFISH 2 extracts the distinct canonical k-mers—vertices of \(G(\mathscr {R}, k)\)—from \(\mathscr {E}\) through an extension of KMC 3 [72] (see Additional file 1: Sec. 2.1). For such, taking \(\mathscr {E}\) as input, KMC 3 treats each (k+1)-mer \(e \in \mathscr {E}\) as an input string, and enumerates their constituent k-mers applying its original algorithm. Using \(\mathscr {E}\) instead of \(\mathscr {R}\) as input reduces the amount of work performed in this phase by utilizing the work done in the earlier phase—skipping another pass over the entire input set \(\mathscr {R}\), which can be computationally substantial.
Hash table structure setup
An associative data structure is required to store the transitioning states of the automata of the vertices of \(G(\mathscr {R}, k)\). That is, an association of some encoding of the states to each canonical k-mer is required. Some off-the-shelf hash table can be employed for this purpose. Due to hash collisions, general-purpose hash tables typically need to store the keys along with their associated data—the key set \(\mathscr {V}\) may end up taking \(k \log _{2} |\Sigma | = 2k\) bits/k-mer in the hash table^{Footnote 12}. In designing a more efficient hash table structure, CUTTLEFISH 2 makes use of the facts that (i) the set \(\mathscr {V}\) of keys is static—no alien keys will be encountered while traversing the edges in \(\mathscr {E}\), since \(\mathscr {V}\) is constructed from \(\mathscr {E}\), and (ii) \(\mathscr {V}\) has already been enumerated at this point.
A Minimal Perfect Hash Function (MPHF) is applicable in this setting. Given a set \(\mathscr {K}\) of keys, a perfect hash function over \(\mathscr {K}\) is an injective function \(h: \mathscr {K} \rightarrow \mathbb {Z}\), i.e., \(\forall \, k_{1}, k_{2} \in \mathscr {K}, \; k_{1} \neq k_{2} \Leftrightarrow h(k_{1}) \neq h(k_{2})\). h is minimal when its image is \([0, |\mathscr {K}|)\), i.e., an MPHF is an injective function \(h: \mathscr {K} \rightarrow [0, |\mathscr {K}|)\). By definition, an MPHF does not incur collisions. Therefore, when used as the hash function for a hash table, it obviates the requirement to store the keys with the table structure. Instead, some encoding of the MPHF needs to be stored in the structure.
To associate the automata to their states, CUTTLEFISH 2 uses a hash table with an MPHF as the hash function. An MPHF h over the vertex set \(\mathscr {V}\) is constructed with the BBHASH algorithm [79]. For the key set \(\mathscr {V}_{0} = \mathscr {V}\), BBHASH constructs h through a number of iterations. It maps each key \(v \in \mathscr {V}_{0}\) to \([1, \gamma |\mathscr {V}_{0}|]\) with a classical hash function h_{0}, for a provided parameter γ>0. The collision-free hashes in the hash codomain \([1, \gamma |\mathscr {V}_{0}|]\) are marked in a bit-array A_{0} of length \(\gamma |\mathscr {V}_{0}|\). The colliding keys are collected into a set \(\mathscr {V}_{1}\), and the algorithm is applied iteratively on \(\mathscr {V}_{1}\). The iterations are repeated until either some \(\mathscr {V}_{n}\) is found empty, or a maximum level is reached. The bit-arrays A_{i} for the iterations are concatenated into an array A, which, along with some other metadata, encodes h. A has an expected size of \(\gamma e^{{1}/{\gamma }} |\mathscr {V}|\) bits [79]. γ trades off the encoding size of h with its computation time. γ=2 provides a reasonable trade-off, with the size of h being ≈3.7 bits/vertex^{Footnote 13}. Note that the size is independent of k, i.e., of the size of the keys.
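The iterative construction can be sketched at toy scale: at each level, keys that land alone in a bit-array of size γ·|keys| are fixed there, and colliding keys fall through to the next level; a query probes the levels in order and returns the rank of the key's set bit. This is a didactic sketch of the idea only, not the production BBHASH implementation (salting and hashing details here are invented for illustration).

```python
# A toy illustration of the iterative BBHash idea sketched above: per
# level, keys hashing without collision into a bit array of size
# gamma * |keys| are fixed; colliding keys recurse to the next level.
# Didactic only; salting via hash((salt, key)) is an invented stand-in.

def build_mphf(keys, gamma=2.0, max_levels=32):
    levels = []                      # (size, salt, bit-array, rank offset)
    offset = 0
    remaining = list(keys)
    for salt in range(max_levels):
        if not remaining:
            break
        size = max(1, int(gamma * len(remaining)))
        slots = {}
        for key in remaining:
            slots.setdefault(hash((salt, key)) % size, []).append(key)
        bits = [0] * size
        next_round = []
        for pos, ks in slots.items():
            if len(ks) == 1:
                bits[pos] = 1        # exactly one key here: fixed at this level
            else:
                next_round.extend(ks)
        levels.append((size, salt, bits, offset))
        offset += sum(bits)
        remaining = next_round
    assert not remaining, "increase max_levels for this key set"
    return levels

def mphf_query(levels, key):
    for size, salt, bits, offset in levels:
        pos = hash((salt, key)) % size
        if bits[pos]:
            # rank of the set bit within this level, plus the level's offset
            return offset + sum(bits[:pos])
    raise KeyError(key)  # an MPHF is defined only on its build-time key set

keys = ["ACG", "CGT", "GTA", "TAC", "AAA"]
h = build_mphf(keys)
# the five keys map bijectively onto {0, 1, 2, 3, 4}
```

A key deferred to a later level necessarily collided at every earlier level, so its earlier bit positions are all zero and the query falls through correctly—this is the same argument that makes the real BBHASH encoding queryable from the concatenated bit-arrays alone.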
For the collection of hash buckets, CUTTLEFISH 2 uses a linear array [81] of size \(|\mathscr {V}|\). Since each bucket is to contain some state \(q \in \mathscr {Q}\), \(\lceil \log _{2} |\mathscr {Q}| \rceil = \lceil \log _{2} 36 \rceil = 6\) bits are necessary (and also sufficient) to encode q. Therefore CUTTLEFISH 2 uses 6 bits for each bucket. The hash table structure is thus composed of an MPHF h and a linear array S: for a vertex v, its (transitioning) state q_{v} is encoded at the index h(v) of S, and in total the structure uses ≈9.7 bits/vertex.
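The bit-packed state array can be sketched as follows; the class name and layout are illustrative, not Cuttlefish 2's exact code, but the arithmetic (6 bits per entry, entries straddling byte boundaries) is the essence of the ≈6 bits/vertex figure for S.

```python
# A sketch of the state array S described above: the MPHF maps each
# vertex to an index in [0, |V|), and a flat bit-packed array stores one
# 6-bit state per index. Layout and names here are illustrative only.

class PackedStateTable:
    BITS = 6                       # ceil(log2(36)) bits per state

    def __init__(self, n: int):
        self.n = n
        self.data = bytearray((n * self.BITS + 7) // 8)

    def get(self, i: int) -> int:
        bit = i * self.BITS
        byte, off = divmod(bit, 8)
        # an entry may straddle two bytes; read both when available
        word = self.data[byte] | (
            (self.data[byte + 1] << 8) if byte + 1 < len(self.data) else 0)
        return (word >> off) & 0x3F

    def set(self, i: int, state: int) -> None:
        assert 0 <= state < 64
        bit = i * self.BITS
        byte, off = divmod(bit, 8)
        word = self.data[byte] | (
            (self.data[byte + 1] << 8) if byte + 1 < len(self.data) else 0)
        word = (word & ~(0x3F << off)) | (state << off)   # clear, then write
        self.data[byte] = word & 0xFF
        if byte + 1 < len(self.data):
            self.data[byte + 1] = (word >> 8) & 0xFF

S = PackedStateTable(5)            # 5 states fit in 4 bytes, not 5 ints
S.set(0, 35)
S.set(4, 17)
```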
Automaton states computation
Given the set \(\mathscr {E}\) of edges of a de Bruijn graph \(G(\mathscr {R}, k)\) and an MPHF h over its vertex set \(\mathscr {V}\), the COMPUTEAUTOMATONSTATES\((\mathscr {E}, h)\) algorithm computes the state of the automaton M_{v} of each \(v \in \mathscr {V}\).
It initializes each automaton M_{w} with q_{0}—the initial state corresponding to no incident edges. Then for each edge \(e = \{ \, (\widehat {u}, s_{u}), (\widehat {v}, s_{v}) \, \} \in \mathscr {E}\), connecting the vertex \(\widehat {u}\) via its side s_{u} to the vertex \(\widehat {v}\) via its side s_{v}, it makes appropriate state transitions for M_{u} and M_{v}, the automata of \(\widehat {u}\) and \(\widehat {v}\) respectively. For each endpoint \(\widehat {w}\) of e, (s_{w},c_{w}) is fed to M_{w}, where c_{w}∈Σ. Together with \(\widehat {w}\), s_{w} and c_{w} encode e. The setting policy for c_{w} is described in the following. Technicalities relating to loops are accounted for in the CUTTLEFISH 2 implementation, but are omitted from discussion for simplicity.
e has two associated (k+1)-mers: z and \(\overline {z}\). Say that z=u⊙^{k−1}v. Based on whether \(u = \widehat {u}\) holds or not, e is incident to either \(\widehat {u}\)’s back or front. As defined (see the “Definitions” section), if it is incident to the back, then \(z = \widehat {u} \cdot c\); otherwise, \(z = \overline {\widehat {u}} \cdot c\), where c=z_{k+1}. In these cases respectively, \(\overline {z} = \overline {c} \cdot \overline {\widehat {u}}\), and \(\overline {z} = \overline {c} \cdot \widehat {u}\). For consistency, CUTTLEFISH 2 always uses a fixed form of e for \(\widehat {u}\)—either z or \(\overline {z}\)—to provide it as input to M_{u}: the one containing the kmer u in its canonical form. So if e is at \(\widehat {u}\)’s back, the \(\widehat {u} \cdot c\) form is used for e, and (back,c) is fed to M_{u}; otherwise, e is expressed as \(\overline {c} \cdot \widehat {u}\) and \(({front}, \overline {c})\) is the input for M_{u}. The encoding (s_{v},c^{′}) of e for \(\widehat {v}\) is set similarly.
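The per-endpoint encoding convention can be condensed into a few lines. The following sketch derives the (side, symbol) input for both endpoints of the edge of a (k+1)-mer z, using the rule above (feed each automaton the form of the edge in which that endpoint's k-mer is canonical); the helper names are illustrative, not from the Cuttlefish 2 codebase.

```python
# A sketch of the per-endpoint edge encoding described above: for a
# (k+1)-mer z with prefix k-mer u and suffix k-mer v, each endpoint's
# automaton receives (side, symbol) derived from the form of the edge in
# which that endpoint's k-mer appears canonically. Names are illustrative.

_COMP = str.maketrans("ACGT", "TGCA")

def rc(s: str) -> str:
    return s.translate(_COMP)[::-1]   # reverse complement (complement for len 1)

def canon(s: str) -> str:
    return min(s, rc(s))

def endpoint_inputs(z: str):
    """DFA inputs for the two endpoints of the edge of the (k+1)-mer z."""
    u, v = z[:-1], z[1:]              # prefix and suffix k-mers of z
    c_last, c_first = z[-1], z[0]
    # prefix endpoint: the edge exits through the back iff u is canonical,
    # and the fed symbol comes from the canonical-u form of the edge
    u_in = ("back", c_last) if u == canon(u) else ("front", rc(c_last))
    # suffix endpoint: the edge enters through the front iff v is canonical
    v_in = ("front", c_first) if v == canon(v) else ("back", rc(c_first))
    return canon(u), u_in, canon(v), v_in

# e.g., for z = "ACGG": prefix "ACG" is canonical, suffix "CGG" is not
endpoint_inputs("ACGG")
```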
Maximal unitigs extraction
Given the set \(\mathscr {V}\) of vertices of a de Bruijn graph \(G(\mathscr {R}, k)\), an MPHF h over \(\mathscr {V}\), and the statestable S for the automata of \(v \in \mathscr {V}\), the EXTRACTMAXIMALUNITIGS\((\mathscr {V}, h, S)\) algorithm assembles all the maximal unitigs of \(G(\mathscr {R}, k)\).
The algorithm iterates over the vertices in \(\mathscr {V}\). For some vertex \(\widehat {v} \in \mathscr {V}\), let p be the maximal unitig containing \(\widehat {v}\). p can be broken into two subpaths: p_{b} and p_{f}, overlapping only at \(\widehat {v}\). The EXTRACTMAXIMALUNITIGS\((\mathscr {V}, h, S)\) algorithm extracts these subpaths separately, and joins them at \(\widehat {v}\) to construct p. Then p’s constituent vertices are marked by transitioning their automata to some special states (not shown in Fig. 4), so that p is extracted only once.
The subpaths p_{b} and p_{f} are extracted by initiating two walks: one from each of \(\widehat {v}\)’s sides back and front, using the WALKMAXIMALUNITIG\((\widehat {v}, s_{v})\) algorithm. Each walk continues until a flanking vertex \(\widehat {x}\) is encountered. For a vertex \(\widehat {x}\), let q_{x} denote the state of \(\widehat {x}\)’s automaton and \(\mathscr {C}_{x}\) denote q_{x}’s state-class. Then \(\widehat {x}\) is noted to be a flanking vertex iff:

1)
either \(\mathscr {C}_{x}\) is not unique-front unique-back;

2)
or \(\widehat {x}\) connects to the side s_{y} of a vertex \(\widehat {y}\) such that:

(a)
\(\mathscr {C}_{y}\) is fuzzy-front fuzzy-back; or

(b)
s_{y}=front and \(\mathscr {C}_{y}\) is fuzzy-front unique-back; or

(c)
s_{y}=back and \(\mathscr {C}_{y}\) is unique-front fuzzy-back.

Lemma 3 (see Additional file 1: Sec. 3) shows that the flanking vertices in \(G(\mathscr {R}, k)\) are precisely the endpoints of its maximal unitigs.
The WALKMAXIMALUNITIG\((\widehat {v}, s_{v})\) algorithm initiates a walk w from \(\widehat {v}\), exiting through its side s_{v}. It fetches \(\widehat {v}\)’s state q_{v} from the hash table. If q_{v} is found not to belong to the state-class unique-front unique-back due to s_{v} having ≠1 incident edges, then \(\widehat {v}\) is a flanking vertex of its containing maximal unitig p, and p has no edges at s_{v}. Hence w terminates at \(\widehat {v}\). Otherwise, s_{v} has exactly one incident edge. The walk algorithm makes use of the fact that the vertex-sides s_{u} that are internal to the maximal unitigs in \(G(\mathscr {R}, k)\) contain their adjacency information encoded in the states q_{u} of their vertices \(\widehat {u}\)’s automata, once the COMPUTEAUTOMATONSTATES\((\mathscr {E}, h)\) algorithm has been executed. Thus, it decodes q_{v} to get the unique edge \(e = \{ (\widehat {u}, s_{u}), (\widehat {v}, s_{v}) \}\) incident to s_{v}. Through e, w reaches the neighboring vertex \(\widehat {u}\), at its side s_{u}. \(\widehat {u}\)’s state q_{u} is fetched, and if q_{u} is found not to be in the class unique-front unique-back due to s_{u} having >1 incident edges, then both \(\widehat {u}\) and \(\widehat {v}\) are flanking vertices (for different maximal unitigs), and w retracts to and stops at \(\widehat {v}\). Otherwise, e is internal to p, and w switches to the other side of \(\widehat {u}\), proceeding on similarly. It continues through vertices \(\widehat {v}_{i}\) in this manner until a flanking vertex of p is reached, stitching together the edges along the way to construct a subpath of p.
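The walk logic can be illustrated on an explicit toy graph. The sketch below stores per-side edge sets directly in a dictionary (whereas CUTTLEFISH 2 decodes the unique edge from a hashed 6-bit state); a side is "unique" when it has exactly one edge, the walk extends only across unique sides, and cycles are not handled in this toy.

```python
# A condensed sketch of the unitig walk described above, on an explicit
# toy graph rather than Cuttlefish 2's hashed automata. Each vertex keeps
# per-side edge sets; the walk extends while every crossed side has
# exactly one incident edge. Cycles are not handled in this toy.

def other(side):
    return "front" if side == "back" else "back"

def walk(adj, v, s):
    """Follow unique edges out of (v, s); return the chain of vertices."""
    chain = [v]
    while len(adj[v][s]) == 1:
        (e,) = adj[v][s]                  # the unique edge at this side
        (a, sa), (b, sb) = e
        u, su = (b, sb) if (a, sa) == (v, s) else (a, sa)
        if len(adj[u][su]) != 1:          # neighbor side is branching: stop
            break
        chain.append(u)
        v, s = u, other(su)               # continue out the opposite side
    return chain

# a 3-vertex path graph: x -(back/front)- y -(back/front)- z
e1 = (("x", "back"), ("y", "front"))
e2 = (("y", "back"), ("z", "front"))
adj = {
    "x": {"front": set(), "back": {e1}},
    "y": {"front": {e1}, "back": {e2}},
    "z": {"front": {e2}, "back": set()},
}
# walking out of x's back reaches z; the reverse walk spells it backwards
```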
A few constant-time supplementary procedures are used throughout the algorithm. ISFUZZYSIDE(q,s) determines whether a vertex with the state q has 0 or >1 edges at its side s. EDGEEXTENSION(q,s) returns an encoding of the edge incident to the side s of a vertex with state q. ENTRANCESIDE\((\widehat {v}, v)\) (and EXITSIDE\((\widehat {v}, v)\)) returns the side used to enter (and exit) the vertex \(\widehat {v}\) when its k-mer form v is observed.
Maximal pathcover extraction
We discuss here how CUTTLEFISH 2 might be modified so that it can extract a maximal path cover of a de Bruijn graph \(G(\mathscr {R}, k)\). For such, only the COMPUTEAUTOMATONSTATES step needs to be modified, and the rest of the algorithm remains the same. Given the edge set \(\mathscr {E}\) of the graph \(G(\mathscr {R}, k)\) and an MPHF h over its vertex set \(\mathscr {V}\), COMPUTEAUTOMATONSTATESPATHCOVER presents the modified DFA states computation algorithm.
The maximal path cover extraction variant of CUTTLEFISH 2 works as follows. It starts with a trivial path cover \(\mathscr {P}_{0}\) of \(G(\mathscr {R}, k)\): each \(v \in \mathscr {V}\) constitutes a single path, spanning the subgraph \(G'(\mathscr {V}, \emptyset)\). Then it iterates over the edges \(e \in \mathscr {E}\) (with \(|\mathscr {E}| = m\)) in arbitrary order. We will use \(\mathscr {P}_{i}\) to refer to the path cover after having visited i edges. At any given point of the execution, the algorithm maintains the invariant that \(\mathscr {P}_{i}\) is a maximal path cover of the graph \(G'(\mathscr {V}, \mathscr {E}')\), where \(\mathscr {E}' \subseteq \mathscr {E}\) (with \(|\mathscr {E}'| = i\)) is the set of the edges examined until that point. When examining the (i+1)-th edge e={(u,s_{u}),(v,s_{v})}, it checks whether e connects two different paths in \(\mathscr {P}_{i}\) into one single path: this is possible iff the sides s_{u} and s_{v} do not have any incident edges already in \(\mathscr {E}'\), i.e., the sides are empty in \(G'(\mathscr {V}, \mathscr {E}')\). If this is the case, the paths are joined in \(\mathscr {P}_{i+1}\) into a single path containing the new edge e. Otherwise, the path cover remains unchanged, so that \(\mathscr {P}_{i+1} = \mathscr {P}_{i}\). By definition, \(\mathscr {P}_{i+1}\) is a path cover of \(G'(\mathscr {V}, \mathscr {E}' \cup \{e\})\), as e could only affect the paths (at most two) in \(\mathscr {P}_{i}\) containing u and v, while the rest are unaffected and retain maximality—thus the invariant is maintained. By induction, \(\mathscr {P}_{m}\) is a path cover of \(G(\mathscr {V}, \mathscr {E})\) once all the edges have been examined, i.e., when \(\mathscr {E}' = \mathscr {E}\).
By making state transitions for the automata only for the edges present internal to the paths \(p \in \mathscr {P}_{m}\), the COMPUTEAUTOMATONSTATESPATHCOVER\((\mathscr {E}, h)\) algorithm seamlessly captures the subgraph \(G_{\mathscr {P}_{m}}\) of \(G(\mathscr {R}, k)\) that is induced by the path cover \(\mathscr {P}_{m}\). \(G_{\mathscr {P}_{m}}\) consists of a collection of disconnected maximal paths, and thus there exists no branching in \(G_{\mathscr {P}_{m}}\). Consequently, each of these maximal paths is a maximal unitig of \(G_{\mathscr {P}_{m}}\). The subsequent EXTRACTMAXIMALUNITIGS algorithm operates using the DFA states collection S computed at this step, and therefore it extracts precisely these maximal paths.
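The invariant-maintaining scan can be sketched compactly. The version below tracks per-side occupancy exactly as described; it additionally uses a union-find over paths to reject edges whose endpoints lie on the same path (which would close a cycle)—a guard standing in for the loop-handling technicalities that the discussion above omits. CUTTLEFISH 2 realizes the bookkeeping through DFA state transitions instead.

```python
# A minimal sketch of the path-cover scan described above: start from
# singleton paths, visit edges in arbitrary order, and accept an edge only
# when both endpoint sides are still empty AND the endpoints lie on
# different paths (the union-find cycle guard is our addition; the paper
# defers such technicalities). Illustrative only.

def maximal_path_cover(vertices, edges):
    used = {v: {"front": False, "back": False} for v in vertices}
    parent = {v: v for v in vertices}          # union-find over current paths

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]      # path halving
            v = parent[v]
        return v

    kept = []                                  # edges internal to cover paths
    for (u, su), (v, sv) in edges:
        if used[u][su] or used[v][sv] or find(u) == find(v):
            continue                           # would branch, or close a cycle
        used[u][su] = used[v][sv] = True
        parent[find(u)] = find(v)              # the two paths become one
        kept.append(((u, su), (v, sv)))
    return kept

edges = [(("a", "back"), ("b", "front")),
         (("b", "back"), ("c", "front")),
         (("a", "back"), ("c", "back")),       # rejected: a's back is taken
         (("c", "back"), ("a", "front"))]      # rejected: would close a cycle
cover = maximal_path_cover("abc", edges)       # keeps the first two edges
```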
Parallelization
CUTTLEFISH 2 is highly parallelizable on a shared-memory multicore machine. The ENUMERATEEDGES and the EXTRACTVERTICES steps, using KMC 3 [72], are parallelized in their constituent phases via parallel distribution of the input (k+1)-mers (and k-mers) into partitions, and sorting multiple partitions in parallel.
The COMPUTEMINIMALPERFECTHASH step, using BBHASH [79], parallelizes the construction by distributing disjoint subsets \(\mathscr {V}_{i}\) of the vertices to the processor-threads, which process the \(\mathscr {V}_{i}\)’s in parallel.
The next two steps, COMPUTE-AUTOMATON-STATES and EXTRACT-MAXIMAL-UNITIGS, both (piecewise) traverse the graph by iterating over \(\mathscr {E}\) and \(\mathscr {V}\), respectively. The processor-threads are provided disjoint subsets of \(\mathscr {E}\) and \(\mathscr {V}\) to process in parallel. Although the threads process different edges in COMPUTE-AUTOMATON-STATES, multiple threads may access the same automaton in the hash table simultaneously, due to edges sharing endpoints. Similarly, in EXTRACT-MAXIMAL-UNITIGS, though the threads examine disjoint vertex sets, multiple threads simultaneously constructing the same maximal unitig from its different constituent vertices can access the same automaton concurrently, at the meeting vertex of the walks. CUTTLEFISH 2 grants exclusive access to a vertex to one thread at a time through a sparse set \(\mathscr {L}\) of locks. Each lock \(l \in \mathscr {L}\) guards a disjoint set \(\mathscr {V}_{i}\) of vertices, roughly of equal size. With p processor-threads, and assuming that all p threads access the hash table at the same time, the probability of two threads concurrently probing the same lock at the same turn is \(1 - \big (1 - \frac {1}{\left |\mathscr {L}\right |}\big)^{\binom {p}{2}}\), which is minuscule with an adequate \(\left |\mathscr {L}\right |\)^{Footnote 14}.
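This birthday-style bound is cheap to evaluate; the following sketch illustrates it (the lock-count and thread-count values in the usage note are illustrative, not taken from the paper):

```python
from math import comb

def lock_collision_prob(num_locks: int, p_threads: int) -> float:
    """Probability that at least one of the C(p, 2) thread pairs probes the
    same lock simultaneously: 1 - (1 - 1/|L|)^C(p, 2)."""
    return 1.0 - (1.0 - 1.0 / num_locks) ** comb(p_threads, 2)
```

With 2^16 locks and 64 threads the bound is about 3%, and with 10^6 locks and 8 threads it drops below 10^-4, which is why a sparse lock set suffices in practice.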
Asymptotics
In this section, we analyze the computational complexity of the CUTTLEFISH 2\((\mathscr {R}, k, f_{0})\) algorithm when executed on a set \(\mathscr {R}\) of strings, given a k value and a threshold factor f_{0} for the edges in \(G(\mathscr {R}, k)\). \(\mathscr {E}\) is the set of the (k+1)-mers occurring ≥f_{0} times in \(\mathscr {R}\), and \(\mathscr {V}\) is the set of the k-mers in \(\mathscr {E}\). Let ℓ be the total length of the strings \(r \in \mathscr {R}\), n be the vertex-count \(\left |\mathscr {V}\right |\), and m be the edge-count \(\left |\mathscr {E}\right |\).
Time complexity
CUTTLEFISH 2 represents j-mers with 64-bit machine-words, packing 32 symbols into a single word. Let w_{j} denote the number of words in a j-mer, i.e., \(w_{j} = \lceil \frac {j}{32} \rceil\).
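For concreteness, this 2-bit packing into 64-bit words can be sketched as follows (assuming the usual A/C/G/T encoding; the actual word layout in Cuttlefish 2 may differ):

```python
def pack_kmer(kmer: str) -> list[int]:
    """Pack a DNA j-mer into w_j = ceil(j/32) 64-bit words, 2 bits/symbol."""
    code = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
    words = [0] * ((len(kmer) + 31) // 32)  # w_j machine-words
    for i, base in enumerate(kmer):
        words[i // 32] |= code[base] << (2 * (i % 32))
    return words
```

A 33-mer thus occupies two words, and word-wise operations on packed k-mers (comparison, radix-sort digits) touch only w_j words each.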
Note that the number of (k+1)-mers in \(\mathscr {R}\) is upper-bounded by ℓ. The ENUMERATE-EDGES step uses the KMC 3 [72] algorithm. At first, it partitions the (k+1)-mers into buckets based on their signatures. With a rolling computation, determining the signature of a (k+1)-mer takes amortized constant time. Assigning a (k+1)-mer to its bucket then takes time \(\mathcal {O} (w_{k + 1})\), and the complete distribution takes \(\mathcal {O}(w_{k + 1} \ell)\)^{Footnote 15}. As each (k+1)-mer consists of w_{k+1} words, radix-sorting a bucket of size B_{i} takes time \(\mathcal {O}(B_{i} w_{k + 1})\). So in the second step, for a total of b buckets for \(\mathscr {R}\), the cumulative sorting time is \(\sum _{i = 1}^{b} \mathcal {O}(B_{i} w_{k + 1}) = \mathcal {O} (w_{k + 1} \sum _{i = 1}^{b} B_{i}) = \mathcal {O}(w_{k + 1} \ell)\). Thus ENUMERATE-EDGES takes time \(\mathcal {O}(\ell w_{k + 1})\).
The EXTRACT-VERTICES step applies KMC 3 [72] with \(\mathscr {E}\) as input, so a similar analysis applies. Each \(e \in \mathscr {E}\) comprises two k-mers. So partitioning the k-mers takes time \(\mathcal {O}(2m w_{k})\), and radix-sorting the buckets takes \(\mathcal {O}(w_{k} \sum B_{i}) = \mathcal {O}(2m w_{k})\). Therefore EXTRACT-VERTICES takes time \(\mathcal {O} (m w_{k})\).
The CONSTRUCT-MINIMAL-PERFECT-HASH step applies the BBHASH [79] algorithm to construct an MPHF h over \(\mathscr {V}\). It is a multi-pass algorithm: each pass i tries to assign final hash values to a subset \(\mathscr {K}_{i}\) of the keys. Making a bounded number of passes over the key sets \(\mathscr {K}_{i}\), which shrink in size, it applies some classical hash h_{i} to the \(x \in \mathscr {K}_{i}\) in each pass. A key \(x \in \mathscr {K}_{i}\) is not propagated to \(\mathscr {K}_{i + 1}\) iff h_{i}(x) is free of hash collisions. Provided that the h_{i}'s are uniform and random, each key \({v} \in \mathscr {V}\) is hashed with the h_{i}'s an expected \(\mathcal {O}(e^{{1}/{\gamma }})\) times [79], an exponentially decaying function of the γ parameter. Given that the h_{i}'s take constant time on machine-words, computing h_{i}(v) takes time \(\mathcal {O}(w_{k})\). The expected time to assign a final hash value to a \({v} \in \mathscr {V}\) is then \(H(k) = \mathcal {O} \big (w_{k} e^{{1}/{\gamma }} \big)\). Therefore CONSTRUCT-MINIMAL-PERFECT-HASH takes expected time \(\mathcal {O} \big (n H(k) \big)\). Note that querying h, i.e., computing h(v), also takes time H(k), as the query algorithm is multi-pass and similar to the construction.
The COMPUTE-AUTOMATON-STATES step initializes the n automata with the state q_{0}, taking time \(\mathcal {O} (n)\). Then for each edge \(e \in \mathscr {E}\), it fetches the states of the two endpoints of e from the hash table in time 2H(k), updating them if required. In total there are 2m hash accesses, and thus COMPUTE-AUTOMATON-STATES takes time \(\mathcal {O} (n + m H(k))\).
The EXTRACT-MAXIMAL-UNITIGS step scans through each vertex \({v} \in \mathscr {V}\), and walks the entire maximal unitig p containing v. The state of each vertex in p is decoded to complete the walk, requiring \(|p|\) hash table accesses and taking time \(|p|H(k)\). If the flanking vertices of p are non-branching, then the walk also visits their neighboring vertices that are absent in p, at most once per endpoint. Once extracted, all the vertices in p are marked so that p is not extracted again later on; this takes time \(|p|H(k)\), and can actually be done in time \(\mathcal {O} (|p|)\) by saving the hash values of the path vertices while constructing p. Thus traversing all the u_{i}'s in the maximal unitigs set \(\mathscr {U}\) takes time \(\mathcal {O} \big (H(k) \sum \limits _{u_{i} \in \mathscr {U}} (|u_{i}| + 2) + \sum \limits _{u_{i} \in \mathscr {U}} |u_{i}| \big) = \mathcal {O} \big (n H(k) \big)\); here, \(\sum _{u_{i} \in \mathscr {U}} |u_{i}|\) equates to n because the set of the maximal unitigs \(\mathscr {U}\) forms a vertex decomposition of \(G(\mathscr {R}, k)\) [47]. Thus EXTRACT-MAXIMAL-UNITIGS takes time \(\mathcal {O} \big (n H(k) \big)\).
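As a quick sanity check on this bound (a hypothetical helper, not from the paper): the marking term depends only on \(n = \sum |u_{i}|\), while the walk term adds at most 2 extra accesses per unitig, so the total stays within a small constant factor of nH(k) regardless of how the unitigs decompose the vertex set.

```python
def extraction_accesses(unitig_lengths, H_k):
    """Hash-table accesses to extract a set of maximal unitigs:
    H(k) * sum(|u_i| + 2) for the walks, plus sum(|u_i|) for O(1) marking."""
    walk = H_k * sum(length + 2 for length in unitig_lengths)
    mark = sum(unitig_lengths)
    return walk + mark
```

For n = 100 vertices, one long unitig and ten short ones give comparable totals (406 vs. 460 accesses at H_k = 3), and both are bounded by H_k * 3n + n since the number of unitigs is at most n.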
In this brief analysis of the last three steps, we do not include the time to read the edges \(\big (\mathcal {O} (m w_{k + 1}) \big)\) and the vertices \(\big (\mathcal {O} (n w_{k}) \big)\) into memory, as these terms are subsumed by others.
Thus, CUTTLEFISH 2\((\mathscr {R}, k, f_{0})\) has an expected running time of \(\mathcal {O} \big (\ell w_{k + 1} + m w_{k} + (n + m)H(k) \big)\), where \(w_{j} = \lceil \frac {j}{32} \rceil\), \(H(k) = \mathcal {O} (w_{k} e^{{1}/{\gamma }})\), and γ>0 is a constant. Evidently, the bottleneck is the initial ENUMERATE-EDGES step, which asymptotically subsumes the running time.
Space complexity
Here, we analyze the working memory (i.e., RAM) required by the CUTTLEFISH 2 algorithm. The ENUMERATE-EDGES step with KMC 3 [72] can work within a bounded memory space. Its partitioning phase distributes the input k-mers into disk bins, and the k-mers are kept in working memory within a total space limit S before being flushed to disk. Radix-sorting the bins is done by loading the bins into memory within the size limit S, and larger bins are broken into sub-bins to facilitate bounded-memory sorting. As we discuss below, the graph traversal steps require a fixed amount of memory, determined linearly by n. As n is not computed until the completion of EXTRACT-VERTICES, we approximate it within the KMC 3 algorithm (see Additional file 1: Sec. 2.1), and then bound the memory for the KMC 3 execution appropriately. The next step, EXTRACT-VERTICES, is also performed similarly within the same memory bound.
The CONSTRUCT-MINIMAL-PERFECT-HASH step with BBHASH [79] processes the key set \(\mathscr {V}\) in fixed-sized chunks. Each pass i over key set \(\mathscr {V}_{i}\) uses a bit-array A_{i} to mark h_{i}(v) for all the \({v} \in \mathscr {V}_{i}\), along with an additional bit-array C_{i} to detect hash collisions. Both A_{i} and C_{i} have size \(\gamma \left |\mathscr {V}_{i}\right |\). The concatenation of the A_{i}'s is the final output data structure A of the algorithm, whereas each C_{i} is present only during pass i. A has an expected size of γe^{1/γ}n bits [79]. \(\left |C_{0}\right | = \gamma \left |\mathscr {V}_{0}\right | = \gamma n\), and this is the largest collision array over the algorithm's lifetime. Thus a loose upper bound on the expected memory usage of this step is \(\mathcal {O} \big (\left |A\right | + \left |C_{0}\right | \big) = \mathcal {O} \big ((e^{{1}/{\gamma }} + 1) \gamma n \big)\) bits.
At this point in the algorithm, a hash table structure is set up for the automata. Together, the hash function h and the hash bucket collection S take an expected \(\big (\gamma e^{{1}/{\gamma }} n + n \lceil \log _{2} \left |\mathscr {Q}\right | \rceil \big) = (\gamma e^{{1}/{\gamma }} + 6) n\) bits of space.
The COMPUTE-AUTOMATON-STATES step scans the edges in \(\mathscr {E}\) in fixed-sized chunks. For each \(e \in \mathscr {E}\), it queries and updates the hash table for the endpoints of e as required. Similarly, the EXTRACT-MAXIMAL-UNITIGS step scans the vertices in \(\mathscr {V}\) in fixed-sized chunks, and spells the maximal unitig containing some \(v \in \mathscr {V}\) by successively querying the hash table for the path vertices. The spelled paths are flushed to disk upon reaching a certain cumulative threshold size. Thus the only non-trivial memory usage in these steps is from the hash table, and therefore these graph traversal steps use \(\big ((\gamma e^{{1}/{\gamma }} + 6) n + \mathcal {O} (1) \big)\) bits.
When γ≤6, the hash table (i.e., the hash function and the bucket collection) is the dominant factor in the algorithm's memory usage, and CUTTLEFISH 2\((\mathscr {R}, k, f_{0})\) consumes expected space \(\mathcal {O} \big ((\gamma e^{{1}/{\gamma }} + 6) n \big)\) bits. If γ>6 is set, it could be possible for the hash function construction memory to dominate. In practice, we adopt γ=2, and the observed memory usage is ≈9.7n bits, translating to ≈1.2 bytes per distinct k-mer.
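A quick arithmetic check of these figures (illustrative, assuming the 6 state-bits per bucket given above):

```python
from math import exp

def hash_table_bits_per_vertex(gamma: float, state_bits: int = 6) -> float:
    """Expected bits per vertex for the automaton hash table:
    gamma * e^(1/gamma) for the BBHash MPHF, plus state_bits for the
    DFA-state bucket."""
    return gamma * exp(1.0 / gamma) + state_bits
```

At γ=2 this gives ≈9.30 bits per vertex, consistent with the observed ≈9.7n bits once runtime overheads are included; γ=1 yields a slightly smaller structure at the cost of slower hashing, as noted earlier.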
Availability of data and materials
All data generated or analyzed during this study are included in this published article (and its supplementary information files). The data supporting the findings of this study are publicly available, and the data sources are noted in the appropriate sections in the manuscript (see the “Results” section).
Cuttlefish 2 is implemented in C++17, and is released under the BSD 3-Clause “New” or “Revised” License. The latest source code is available at the GitHub repository: https://github.com/COMBINE-lab/cuttlefish [82]. The source code version as used in preparing the manuscript is available at https://doi.org/10.5281/zenodo.6897066 [83].
Notes
From our observations, the distributions of k-mer frequencies and of (k+1)-mer frequencies on real data tend to agree closely, resulting in the same f_{0} for these experiments for both Cuttlefish 2 and the rest of the algorithms, as per the setting-policy used.
k-mers occurring frequently enough in input NGS reads are said to be solid k-mers, and the other ones are said to be weak [65].
Length ℓ of the longest contig such that all the contigs having lengths ≥ℓ sum in size to at least 50% of the sum size of the contigs.
Analogous to N50, except for: (1) breaking the contigs into their constituent blocks that can be aligned to an associated reference sequence, and (2) replacing the sum size of contigs with the reference length.
As per this definition, \(\mathscr {V}_{2} = \mathscr {K}\). We describe in the Algorithm section a practical consideration that implies that \(\mathscr {V}_{2}\) need not necessarily be the same as \(\mathscr {K}\) when some filtering is applied on the input \(\mathscr {S}\) to generate \(\mathscr {K}_{+1}\).
This is to account for the DNA being doublestranded, and a genomic string may come from either of these oppositelyoriented complementary strands.
Discarding orientations: the two unitigs (v_{0},…,v_{n}) and (v_{n},…,v_{0}) are topologically the same.
Introduced by [44], based on the input to the de Bruijn graph constructions being either reference sequences or sequencing reads, the graphs are distinguished as either reference or read de Bruijn graphs.
A vertex v is a sentinel if the first or the last kmer x of an input string corresponds to v. Let v’s empty side in x be s_{v}. The graph \(G(\mathscr {S}, k)\) is modified such that s_{v} connects to a special branching vertex Υ—preventing unitigs containing v to span farther through s_{v}.
Formally, \(\mathscr {Q}'\) is the set of states reachable from q_{0} through transitions as per some definite patterns of input symbols. For our purposes, recognizing specific input patterns is not a concern—rendering this parameter redundant—we define it as the set of the final states an automaton can be in having fed all its inputs.
For a given j<ℓ, a j-minimizer of an ℓ-mer x is the smallest j-mer substring of x according to some specified function.
This can be improved by having 4^{p} different hash tables for \(\mathscr {V}\), for a fixed prefix length p≤k. Each hash table then accounts for keys of length (k−p).
It can be as low as ≈3 bits/vertex with γ=1, at the expense of slower hashing. The theoretical lower limit for MPHFs is ≈1.44 bits/key [80].
The optimal (in regard to probability) value \(\left |\mathscr {L}\right | = \left |\mathscr {V}\right |\) is not used due to the locks' memory usage.
This bound is not tight, as KMC 3 actually distributes sequences longer than (k+1)-mers, reducing computation (see the Edge set construction section).
References
US National Library of Medicine. NCBI insights: the entire corpus of the sequence read archive (SRA) now live on two cloud platforms! Natl Cent Biotechnol Inf. 2020. https://ncbiinsights.ncbi.nlm.nih.gov/2020/02/24/sracloud/. Accessed 8 Nov 2021.
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE. Big data: Astronomical or genomical? PLoS Biol. 2015; 13(7):1–11. https://doi.org/10.1371/journal.pbio.1002195.
Nayfach S, Roux S, Seshadri R, Udwary D, Varghese N, Schulz F, Wu D, PaezEspino D, Chen IM, Huntemann M, Palaniappan K, Ladau J, et al. A genomic catalog of Earth's microbiomes. Nat Biotechnol. 2021; 39(4):499–509. https://doi.org/10.1038/s4158702007186.
Gevers D, Knight R, Petrosino JF, Huang K, McGuire AL, Birren BW, Nelson KE, White O, Methé BA, Huttenhower C. The human microbiome project: A community resource for the healthy human microbiome. PLOS Biol. 2012; 10(8):1–5. https://doi.org/10.1371/journal.pbio.1001377.
Muir P, Li S, Lou S, Wang D, Spakowicz DJ, Salichos L, Zhang J, Weinstock GM, Isaacs F, Rozowsky J, et al. The real cost of sequencing: scaling computation to keep pace with data generation. Genome Biol. 2016; 17(1):1–9.
de Bruijn NG. A combinatorial problem. Nederl Akad Wetensch Proc. 1946; 49:758–64.
Good IJ. Normal recurring decimals. J Lond Math Soc. 1946; s1-21(3):167–9. https://doi.org/10.1112/jlms/s1-21.3.167.
Simpson JT, Pop M. The theory and practice of genome sequence assembly. Annu Rev Genomics Hum Genet. 2015; 16(1):153–72. https://doi.org/10.1146/annurevgenom090314050032.
Pevzner PA, Tang H, Waterman MS. An eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci. 2001; 98(17):9748–53. https://doi.org/10.1073/pnas.171285098.
Limasset A, Flot JF, Peterlongo P. Toward perfect reads: selfcorrection of short reads via mapping on de bruijn graphs. Bioinformatics. 2019; 36(5):1374–81. https://doi.org/10.1093/bioinformatics/btz102.
Salmela L, Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics. 2014; 30(24):3506–14. https://doi.org/10.1093/bioinformatics/btu538.
Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, Rizk G. Referencefree compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinformatics. 2015; 16(1):288. https://doi.org/10.1186/s1285901507097.
Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet. 2012; 44(2):226–32. https://doi.org/10.1038/ng.1028.
Cameron DL, Schröder J, Penington JS, Do H, Molania R, Dobrovic A, Speed TP, Papenfuss AT. GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Res. 2017; 27(12):2050–60. https://doi.org/10.1101/gr.222109.117.
Almodaresi F, Zakeri M, Patro R. PuffAligner: a fast, efficient and accurate aligner based on the pufferfish index. Bioinformatics. 2021. https://doi.org/10.1093/bioinformatics/btab408.
Liu B, Guo H, Brudno M, Wang Y. deBGA: read alignment with de bruijn graphbased seed and extension. Bioinformatics. 2016; 32(21):3224–32. https://doi.org/10.1093/bioinformatics/btw371.
Almodaresi F, Khan J, Madaminov S, Pandey P, Ferdman M, Johnson R, Patro R. An incrementally updatable and scalable system for largescale sequence search using LSM trees. BioRxiv. 2021. https://doi.org/10.1101/2021.02.05.429839.
Ye Y, Tang H. Utilizing de bruijn graph of metagenome assembly for metatranscriptome analysis. Bioinformatics. 2015; 32(7):1001–8. https://doi.org/10.1093/bioinformatics/btv510.
Bradley P, Gordon NC, Walker TM, Dunn L, Heys S, Huang B, Earle S, Pankhurst LJ, Anson L, de Cesare M, Piazza P, Votintseva AA, Golubchik T, Wilson DJ, Wyllie DH, Diel R, Niemann S, Feuerriegel S, Kohl TA, Ismail N, Omar SV, Smith EG, Buck D, McVean G, Walker AS, Peto TEA, Crook DW, Iqbal Z. Rapid antibioticresistance predictions from genome sequence data for staphylococcus aureus and mycobacterium tuberculosis. Nat Commun. 2015; 6(1):10063. https://doi.org/10.1038/ncomms10063.
Wang M, Ye Y, Tang H. A de Bruijn graph approach to the quantification of closelyrelated genomes in a microbial community. J Comput Biol. 2012; 19(6):814–25. https://doi.org/10.1089/cmb.2012.0058.
Peng Y, Leung HCM, Yiu SM, Lv MJ, Zhu XG, Chin FYL. IDBA-tran: a more robust de novo de bruijn graph assembler for transcriptomes with uneven expression levels. Bioinformatics. 2013; 29(13):326–34. https://doi.org/10.1093/bioinformatics/btt219.
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, LindbladToh K, Friedman N, Regev A. Fulllength transcriptome assembly from RNASeq data without a reference genome. Nat Biotechnol. 2011; 29(7):644–52. https://doi.org/10.1038/nbt.1883.
Bray NL, Pimentel H, Melsted P, Pachter L. Nearoptimal probabilistic RNAseq quantification. Nat Biotechnol. 2016; 34(5):525–7. https://doi.org/10.1038/nbt.3519.
Ekim B, Berger B, Chikhi R. Minimizerspace de Bruijn graphs: Wholegenome assembly of long reads in minutes on a personal computer. Cell Syst. 2021; 12(10):958–9686. https://doi.org/10.1016/j.cels.2021.08.009.
Ruan J, Li H. Fast and accurate longread assembly with wtdbg2. Nat Methods. 2020; 17(2):155–8. https://doi.org/10.1038/s4159201906693.
Lin Y, Yuan J, Kolmogorov M, Shen MW, Chaisson M, Pevzner PA. Assembly of long errorprone reads using de Bruijn graphs. Proc Natl Acad Sci. 2016; 113(52):8396–405. https://doi.org/10.1073/pnas.1604560113.
Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, Birol I. ABySS 2.0: resourceefficient assembly of large genomes using a bloom filter. Genome Res. 2017; 27(5):768–77. https://doi.org/10.1101/gr.214346.116.
Li D, Liu CM, Luo R, Sadakane K, Lam TW. MEGAHIT: an ultrafast singlenode solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics. 2015; 31(10):1674–6. https://doi.org/10.1093/bioinformatics/btv033.
Li X, Shi Q, Shao M. On bridging pairedend RNAseq data. BioRxiv. 2021. https://doi.org/10.1101/2021.02.26.433113.
Brown CT, Moritz D, O’Brien MP, Reidl F, Reiter T, Sullivan BD. Exploring neighborhoods in large metagenome assembly graphs using spacegraphcats reveals hidden sequence diversity. Genome Biol. 2020; 21(1):164. https://doi.org/10.1186/s13059020020664.
David L, Vicedomini R, Richard H, Carbone A. Targeted domain assembly for fast functional profiling of metagenomic datasets with S3A. Bioinformatics. 2020; 36(13):3975–81. https://doi.org/10.1093/bioinformatics/btaa272.
Schrinner SD, Mari RS, Ebler J, Rautiainen M, Seillier L, Reimer JJ, Usadel B, Marschall T, Klau GW. Haplotype threading: accurate polyploid phasing from long reads. Genome Biol. 2020; 21(1):252. https://doi.org/10.1186/s13059020021581.
Liu B, Liu Y, Li J, Guo H, Zang T, Wang Y. deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graphbased index. Genome Biol. 2019; 20(1):274. https://doi.org/10.1186/s1305901918959.
Minkin I, Medvedev P. Scalable multiple wholegenome alignment and locally collinear block construction with SibeliaZ. Nat Commun. 2020; 11(1):6327. https://doi.org/10.1038/s41467020197778.
Minkin I, Medvedev P. Scalable pairwise wholegenome homology mapping of long genomes with BubbZ. IScience. 2020; 23(6):101224. https://doi.org/10.1016/j.isci.2020.101224.
LopezMaestre H, Brinza L, Marchet C, Kielbassa J, Bastien S, Boutigny M, Monnin D, Filali AE, Carareto CM, Vieira C, Picard F, Kremer N, Vavre F, Sagot MF, Lacroix V. SNP calling from RNAseq data without a reference genome: identification, quantification, differential analysis and impact on the protein sequence. Nucleic Acids Res. 2016; 44(19):148. https://doi.org/10.1093/nar/gkw655.
Sacomoto GA, Kielbassa J, Chikhi R, Uricaru R, Antoniou P, Sagot MF, Peterlongo P, Lacroix V. KISSPLICE: de-novo calling alternative splicing events from RNA-seq data. BMC Bioinformatics. 2012; 13(6):5. https://doi.org/10.1186/1471210513S6S5.
Dede K, Ohlebusch E. Dynamic construction of pangenome subgraphs. Open Comput Sci. 2020; 10(1):82–96. https://doi.org/10.1515/comp20200018.
Lees JA, Mai TT, Galardini M, Wheeler NE, Horsfield ST, Parkhill J, Corander J, Ravel J. Improved prediction of bacterial genotypephenotype associations using interpretable pangenomespanning regressions. MBio. 2020; 11(4):01344–20. https://doi.org/10.1128/mBio.0134420.
Wittler R. Alignment and referencefree phylogenomics with colored de Bruijn graphs. Algoritm Mol Biol. 2020; 15(1):4. https://doi.org/10.1186/s13015020001643.
Cleary A, Ramaraj T, Kahanda I, Mudge J, Mumey B. Exploring frequented regions in pangenomic graphs. IEEE/ACM Trans Comput Biol Bioinforma. 2019; 16(5):1424–35. https://doi.org/10.1109/TCBB.2018.2864564.
Manuweera B, Mudge J, Kahanda I, Mumey B, Ramaraj T, Cleary A. Pangenomewide association studies with frequented regions. In: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB ’19). New York: Association for Computing Machinery: 2019. p. 627–32. https://doi.org/10.1145/3307339.3343478.
Sheikhizadeh S, Schranz ME, Akdel M, de Ridder D, Smit S. PanTools: representation, storage and exploration of pangenomic data. Bioinformatics. 2016; 32(17):487–93. https://doi.org/10.1093/bioinformatics/btw455.
Khan J, Patro R. Cuttlefish: fast, parallel and lowmemory compaction of de bruijn graphs from largescale genome collections. Bioinformatics. 2021; 37(Supplement_1):177–86. https://doi.org/10.1093/bioinformatics/btab309.
Holley G, Melsted P. Bifrost: highly parallel construction and indexing of colored and compacted de bruijn graphs. Genome Biol. 2020; 21(1):249. https://doi.org/10.1186/s13059020021358.
Guo H, Fu Y, Gao Y, Li J, Wang Y, Liu B. deGSM: memory scalable construction of large scale de bruijn graph. IEEE/ACM Trans Comput Biol Bioinforma. 2019; Early Access:1–1.
Chikhi R, Limasset A, Medvedev P. Compacting de bruijn graphs from sequencing data quickly and in low memory. Bioinformatics. 2016; 32(12):201–8. https://doi.org/10.1093/bioinformatics/btw279.
Minkin I, Pham S, Medvedev P. TwoPaCo: an efficient algorithm to build the compacted de bruijn graph from many complete genomes. Bioinformatics. 2016; 33(24):4024–32. https://doi.org/10.1093/bioinformatics/btw609.
Baier U, Beller T, Ohlebusch E. Graphical pangenome analysis with compressed suffix trees and the Burrows–Wheeler transform. Bioinformatics. 2015; 32(4):497–504. https://doi.org/10.1093/bioinformatics/btv603.
Chikhi R, Limasset A, Jackman S, Simpson JT, Medvedev P. On the representation of de bruijn graphs In: Sharan R, editor. Research in Computational Molecular Biology. Cham: Springer: 2014. p. 35–55.
Marcus S, Lee H, Schatz MC. SplitMEM: a graphical algorithm for pangenome analysis with suffix skips. Bioinformatics. 2014; 30(24):3476–83. https://doi.org/10.1093/bioinformatics/btu756.
Hopcroft JE, Motwani R, Ullman JD. Introduction to Automata Theory, Languages, and Computation (3rd Edition). USA: AddisonWesley Longman Publishing Co., Inc.; 2006.
Marchet C, Kerbiriou M, Limasset A. BLight: efficient exact associative structure for kmers. Bioinformatics. 2021; 37(18):2858–65. https://doi.org/10.1093/bioinformatics/btab217.
Pibiri GE. Sparse and skew hashing of kmers. bioRxiv. 2022. https://doi.org/10.1101/2022.01.15.476199.
Rahman A, Medvedev P. Representation of kmer sets using spectrumpreserving string sets In: Schwartz R, editor. Research in Computational Molecular Biology. Cham: Springer: 2020. p. 152–168.
Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol. 2021; 22(1):96. https://doi.org/10.1186/s1305902102297z.
Chikhi R, Holub J, Medvedev P. Data structures to represent a set of klong DNA sequences. ACM Comput Surv. 2021;54(1). https://doi.org/10.1145/3445967.
Zook JM, Catoe D, McDaniel J, Vang L, Spies N, Sidow A, Weng Z, Liu Y, Mason CE, Alexander N, Henaff E, McIntyre ABR, Chandramohan D, Chen F, Jaeger E, Moshrefi A, Pham K, Stedman W, Liang T, Saghbini M, Dzakula Z, Hastie A, Cao H, Deikus G, Schadt E, Sebra R, Bashir A, Truty RM, Chang CC, Gulbahce N, Zhao K, Ghosh S, Hyland F, Fu Y, Chaisson M, Xiao C, Trow J, Sherry ST, Zaranek AW, Ball M, Bobe J, Estep P, Church GM, Marks P, KyriazopoulouPanagiotopoulou S, Zheng GXY, SchnallLevin M, Ordonez HS, Mudivarti PA, Giorda K, Sheng Y, Rypdal KB, Salit M. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016; 3(1):160025. https://doi.org/10.1038/sdata.2016.25.
Lappalainen T, Sammeth M, Friedländer MR, ‘t Hoen PAC, Monlong J, Rivas MA, GonzàlezPorta M, Kurbatova N, Griebel T, Ferreira PG, Barann M, Wieland T, Greger L, van Iterson M, Almlöf J, Ribeca P, Pulyakhina I, Esser D, Giger T, Tikhonov A, Sultan M, Bertier G, MacArthur DG, Lek M, Lizano E, Buermans HPJ, Padioleau I, Schwarzmayr T, Karlberg O, Ongen H, Kilpinen H, Beltran S, Gut M, Kahlem K, Amstislavskiy V, Stegle O, Pirinen M, Montgomery SB, Donnelly P, McCarthy MI, Flicek P, Strom TM, Lehrach H, Schreiber S, Sudbrak R, Carracedo Á., Antonarakis SE, Häsler R, Syvänen AC, van Ommen GJ, Brazma A, Meitinger T, Rosenstiel P, Guigó R, Gut IG, Estivill X, Dermitzakis ET, Consortium TG. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013; 501(7468):506–11. https://doi.org/10.1038/nature12531.
MasLloret J, ObónSantacana M, IbáñezSanz G, Guinó E, Pato ML, RodriguezMoranta F, Mata A, GarcíaRodríguez A, Moreno V, Pimenoff VN. Gut microbiome diversity detected by highcoverage 16S and shotgun sequencing of paired stool and colon sample. Scientific Data. 2020; 7(1):92. https://doi.org/10.1038/s4159702004275.
Howe AC, Jansson JK, Malfatti SA, Tringe SG, Tiedje JM, Brown CT. Tackling soil diversity with the assembly of large, complex metagenomes. Proc Natl Acad Sci. 2014; 111(13):4904–9. https://doi.org/10.1073/pnas.1402564111.
Birol I, Raymond A, Jackman SD, Pleasance S, Coope R, Taylor GA, Yuen MMS, Keeling CI, Brand D, Vandervalk BP, Kirk H, Pandoh P, Moore RA, Zhao Y, Mungall AJ, Jaquish B, Yanchuk A, Ritland C, Boyle B, Bousquet J, Ritland K, MacKay J, Bohlmann J, Jones SJM. Assembling the 20 gb white spruce (picea glauca) genome from wholegenome shotgun sequencing data. Bioinformatics. 2013; 29(12):1492–7. https://doi.org/10.1093/bioinformatics/btt178.
Bloom BH. Space/Time tradeoffs in hash coding with allowable errors. Commun ACM. 1970; 13(7):422–6. https://doi.org/10.1145/362686.362692.
Marçais G, Kingsford C. A fast, lockfree approach for efficient parallel counting of occurrences of kmers. Bioinformatics. 2011; 27(6):764–70. https://doi.org/10.1093/bioinformatics/btr011.
Zhao L, Xie J, Bai L, Chen W, Wang M, Zhang Z, Wang Y, Zhao Z, Li J. Mining statisticallysolid kmers for accurate NGS error correction. BMC Genomics. 2018; 19(10):912. https://doi.org/10.1186/s128640185272y.
Hiseni P, Rudi K, Wilson RC, Hegge FT, Snipen L. HumGut: a comprehensive human gut prokaryotic genomes collection filtered by metagenome data. Microbiome. 2021; 9(1):165. https://doi.org/10.1186/s4016802101114w.
Blackwell GA, Hunt M, Malone KM, Lima L, Horesh G, Alako BTF, Thomson NR, Iqbal Z. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLOS Biology. 2021; 19(11):1–16. https://doi.org/10.1371/journal.pbio.3001421.
Yoshimura J, Ichikawa K, Shoura MJ, Artiles KL, Gabdank I, Wahba L, Smith CL, Edgley ML, Rougvie AE, Fire AZ, Morishita S, Schwarz EM. Recompleting the caenorhabditis elegans genome. Genome Res. 2019; 29(6):1009–22. https://doi.org/10.1101/gr.244830.118.
Li H. Aligning sequence reads, clone sequences and assembly contigs with bwamem. arXiv. 2013. https://doi.org/10.48550/arXiv.1303.3997.
Chikhi R, Medvedev P. Informed and automated kmer size selection for genome assembly. Bioinformatics. 2013; 30(1):31–7. https://doi.org/10.1093/bioinformatics/btt310.
Lee S, Min H, Yoon S. Will solidstate drives accelerate your bioinformatics? indepth profiling, performance analysis and beyond. Brief Bioinforma. 2015; 17(4):713–27. https://doi.org/10.1093/bib/bbv073.
Kokot M, Długosz M, Deorowicz S. KMC 3: counting and manipulating kmer statistics. Bioinformatics. 2017; 33(17):2759–61. https://doi.org/10.1093/bioinformatics/btx304.
Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004; 20(18):3363–9. https://doi.org/10.1093/bioinformatics/bth408.
Burrows M, Wheeler DJ. A blocksorting lossless data compression algorithm. Technical report, Systems Research Center, Digital Equipment Corp. 1994.
Gross J, Yellen J. Graph Theory and Its Applications. USA: CRC Press, Inc.; 1999, p. 264.
Kleinberg J, Tardos E. Graphs. In: Algorithm Design. USA: AddisonWesley Longman Publishing Co., Inc.: 2005.
Ma X, Shao Y, Tian L, Flasch DA, Mulder HL, Edmonson MN, Liu Y, Chen X, Newman S, Nakitandwe J, Li Y, Li B, Shen S, Wang Z, Shurtleff S, Robison LL, Levy S, Easton J, Zhang J. Analysis of error profiles in deep nextgeneration sequencing data. Genome Biol. 2019; 20(1):50. https://doi.org/10.1186/s1305901916596.
Kokot M, Deorowicz S, DebudajGrabysz A. Sorting data on ultralarge scale with RADULS. In: Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. Cham: Springer: 2017. p. 235–45.
Limasset A, Rizk G, Chikhi R, Peterlongo P. Fast and scalable minimal perfect hashing for massive key sets. In: 16th International Symposium on Experimental Algorithms (SEA 2017) (Leibniz International Proceedings in Informatics (LIPIcs)). Dagstuhl: Schloss Dagstuhl–LeibnizZentrum fuer Informatik: 2017. p. 25–12516. https://doi.org/10.4230/LIPIcs.SEA.2017.25.
Fredman ML, Komlós J. On the size of separating systems and families of perfect hash functions. SIAM J Algebraic Discret Methods. 1984; 5(1):61–68. https://doi.org/10.1137/0605009.
Marçais G. Compact vector: Bit packed vector of integral values. GitHub. 2020. https://github.com/gmarcais/compact_vector. Accessed 18 June 2020.
Khan J, Patro R. Cuttlefish: Building the compacted de Bruijn graph efficiently from references or reads. GitHub. 2022. https://github.com/COMBINE-lab/cuttlefish. Accessed 24 July 2022.
Khan J, Kokot M, Deorowicz S, Patro R. Software version used in the paper: Scalable, ultrafast, and lowmemory construction of compacted de Bruijn graphs with Cuttlefish 2. Zenodo. 2022. https://doi.org/10.5281/zenodo.6897066. Accessed 24 July 2022.
Mohamadi H, Khan H, Birol I. ntCard: a streaming algorithm for cardinality estimation in genomics data. Bioinformatics. 2017; 33(9):1324–30. https://doi.org/10.1093/bioinformatics/btw832.
Rizk G, Lavenier D, Chikhi R. DSK: k-mer counting with very low memory usage. Bioinformatics. 2013; 29(5):652–53. https://doi.org/10.1093/bioinformatics/btt020.
Patro R, Mount SM, Kingsford C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol. 2014; 32(5):462–4. https://doi.org/10.1038/nbt.2862.
Pandey P, Almodaresi F, Bender MA, Ferdman M, Johnson R, Patro R. Mantis: A fast, small, and exact large-scale sequence-search index. Cell Syst. 2018; 7(2):201–207.e4. https://doi.org/10.1016/j.cels.2018.05.021.
Marchet C, Iqbal Z, Gautheret D, Salson M, Chikhi R. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics. 2020; 36(Supplement_1):i177–85. https://doi.org/10.1093/bioinformatics/btaa487.
Acknowledgements
Not applicable.
Peer review information
Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Review history
The review history is available as Additional file 2.
Funding
This work has been supported by the US National Institutes of Health (R01 HG009937) (RP), the US National Science Foundation (CCF-1750472 and CNS-1763680) (RP), the National Science Centre, Poland (project DEC-2019/33/B/ST6/02040) (SD), and the Faculty of Automatic Control, Electronics and Computer Science at the Silesian University of Technology (statutory research project 02/080/BKM_21/0020) (MK). The funders had no role in the design of the method, data analysis, decision to publish, or preparation of the manuscript.
Author information
Contributions
JK, MK, SD, and RP designed the method. JK, MK, and RP implemented the method. JK designed and conducted the experiments. All authors contributed to and approved the final manuscript.
Authors’ Twitter handles
@scarecrow00007 (Jamshed Khan); @marekkoki (Marek Kokot); @sdeorowicz (Sebastian Deorowicz); @nomad421 (Rob Patro)
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
RP is a co-founder of Ocean Genomics, Inc. The other authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Khan, J., Kokot, M., Deorowicz, S. et al. Scalable, ultrafast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome Biol 23, 190 (2022). https://doi.org/10.1186/s13059-022-02743-6
DOI: https://doi.org/10.1186/s13059-022-02743-6
Keywords
 de Bruijn graph
 Compacted de Bruijn graph
 Data structures
 High-throughput sequencing
 Unitig
 Path cover