Skip to main content

Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2

Abstract

The de Bruijn graph is a key data structure in modern computational genomics, and construction of its compacted variant resides upstream of many genomic analyses. As the quantity of genomic data grows rapidly, this often forms a computational bottleneck. We present Cuttlefish 2, significantly advancing the state-of-the-art for this problem. On a commodity server, it reduces the graph construction time for 661K bacterial genomes, of size 2.58Tbp, from 4.5 days to 17–23 h; and it constructs the graph for 1.52Tbp white spruce reads in approximately 10 h, while the closest competitor requires 54–58 h, using considerably more memory.

Background

Rapid developments in the throughput and affordability of modern sequencing technologies have made the generation of billions of short-read sequences from a panoply of biological samples highly time- and cost-efficient. The National Center for Biotechnology Information (NCBI) has now moved the Sequence Read Archive (SRA) to the cloud, and this repository stores more than 14 petabytes worth of sequencing data [1]. Yet, this is only a fraction of the total sequencing data that has been produced, which is expected to reach exabyte-scale within the current decade [2]. In addition to the continued sequencing of an ever-expanding catalog of various types and states of tissues from reference organisms, metagenomic sequencing of environmental [3] and microbiome [4] samples is also expected to enjoy a similar immense growth.

Given the expansive repository of existing sequencing data and the rate of acquisition, [5] argue that the ability of computational approaches to keep pace with data acquisition has become one of the main bottlenecks in contemporary genomics. These needs have spurred methods developers to produce ever more efficient and scalable computational methods for a variety of genomics analysis tasks, from genome and transcriptome assembly to pan-genome analysis. Against this backdrop, the de Bruijn graph, along with its variants, has become a compact and efficient data representation of increasing importance and utility across computational genomics.

The de Bruijn graph originated in combinatorics as a mathematical construct devised to prove a conjecture about binary strings posed by Ir. K. Posthumus [6, 7]. In bioinformatics, de Bruijn graphs were introduced in the context of genome assembly algorithms for short-reads [8, 9], although the graph introduced in this context adopts a slightly different definition than in combinatorics. Subsequently, the de Bruijn graph has gradually been used in an increasing variety of different contexts within computational biology, including but not limited to: read correction [10, 11], genomic data compression [12], genotyping [13], structural variant detection [14], read mapping [15, 16], sequence-similarity search [17], metagenomic sequence analysis [18,19,20], transcriptome assembly [21, 22], transcript quantification [23], and long-read assembly [24,25,26].

In the context of fragment assembly—whether in forming contigs for whole-genome assembly pipelines [27, 28], or in encapsulating the read set into a summary representative structure for a host of downstream analyses [29,30,31,32]—de Bruijn graphs continue to be used extensively. The non-branching paths in de Bruijn graphs are uniquely-assemblable contiguous sequences (known as unitigs) from the sequencing reads. Thus, they are certain to be present in any faithful genomic reconstruction from these reads, have no ambiguities regarding repeats in the data, and are fully consistent with the input. As such, maximal unitigs are excellent candidates to summarize the raw reads, capturing their essential substance, and are usually the output of the initial phase of modern de novo short-read assembly tools. Collapsing a set of reads into this compact set of fragments that preserve their effective information can directly contribute to the efficiency of many downstream analyses over the read set.

When constructed from reference genome sequences, the unitigs in the de Bruijn graphs correspond to substrings in the references that are shared identically across subsets of the genomes. Decomposing the reference collection into these fragments retains much of its effective information, while typically requiring much less space and memory to store, index, and analyze, than processing the collection of linear genomes directly. The ability to compactly and efficiently represent shared sequences has led many modern sequence analysis tools to adopt the de Bruijn graph as a central representation, including sequence indexers [33], read aligners [15, 16], homology mappers [34, 35], and RNA-seq analysis tools [23, 36, 37]. Likewise, pan-genome analysis tools [38,39,40,41,42,43] frequently make use of the maximal unitigs of the input references as the primary units upon which their core data structures and algorithms are built.

The vast majority of the examples described above make use of the compacted de Bruijn graph. A de Bruijn graph is compacted by collapsing each of its maximal, non-branching paths (unitigs) into a single vertex. Many computational genomics workflows employing the (compacted) de Bruijn graph are multi-phased, and typically, their most resource-intensive step is the initial one: construction of the regular and/or the compacted de Bruijn graph. The computational requirements for constructing the graph are often considerably higher than the downstream steps—posing major bottlenecks in many applications [13, 30]. As such, there has been a concerted effort over the past several years to develop resource-frugal methods capable of constructing the compacted graph [44,45,46,47,48,49,50,51]. Critically, solving this problem efficiently and in a context independent from any specific downstream application yields a modular tool [45, 47] that can be used to enable a wide variety of subsequent computational pipelines.

To address the scalability challenges of constructing the compacted de Bruijn graph, we recently proposed a novel algorithm, CUTTLEFISH [44], that exhibited faster performance than pre-existing state-of-the-art tools, using (often multiple times) less memory. However, the presented algorithm is only applicable when constructing the graph from existing reference sequences. It cannot be applied in a number of contexts, such as fragment assembly or contig extraction from raw sequencing data. In this paper, we present a fast and memory-frugal algorithm for constructing compacted de Bruijn graphs, CUTTLEFISH 2, applicable both on raw sequencing short-reads and assembled references, that can scale to very large datasets. It builds upon the novel idea of modeling de Bruijn graph vertices as Deterministic Finite Automata (DFA) [52] from [44]. However, the DFA model itself has been modified, and the algorithm has been generalized, so as to accommodate all valid forms of input. At the same time, in the case of constructing the graph from reference sequences, it is considerably faster than the previous approach, while retaining its frugal memory profile. We evaluated CUTTLEFISH 2 on a collection of datasets with diverse characteristics, and assess its performance compared to other leading compacted de Bruijn graph construction methods. We observed that CUTTLEFISH 2 demonstrates superior performance in all the experiments we consider.

Additionally, we demonstrate the flexibility of our approach by presenting another application of the algorithm. The compacted de Bruijn graph forms a vertex-decomposition of the graph, while preserving the graph topology [47]. However, for some applications, only the vertex-decomposition is sufficient, and preservation of the topology is redundant. For example, for applications such as performing presence-absence queries for k-mer or associating information to the constituent k-mer of the input [53, 54], any set of strings that preserves the exact set of k-mer from the input sequences can be sufficient. Relaxing the defining requirement of unitigs, that the paths be non-branching in the underlying graph, and seeking instead a set of maximal non-overlapping paths covering the de Bruijn graph, results in a more compact representation of the input data. This idea has recently been explored in the literature, with the representation being referred to as a spectrum-preserving string set [55], and the paths themselves as simplitigs [56]. We demonstrate that CUTTLEFISH 2 can seamlessly extract such maximal path covers by simply constraining the algorithm to operate on some specific subgraph(s) of the original graph. We compared it to the existing tools available in the literature [57] for constructing this representation, and observed that it outperforms those in terms of resource requirements.

Results

CUTTLEFISH 2 overview

We present a high-level overview of the CUTTLEFISH 2 algorithm here. A complete treatment is provided in the “Algorithm” section.

CUTTLEFISH 2 takes as input a set \(\mathscr {R}\) of strings that are either short-reads or whole-genome references, a k-mer length k, and a frequency threshold f 0≥1. As output, it produces the maximal unitigs of the de Bruijn graph \(G(\mathscr {R}, k)\). Figure 1 highlights the major steps in the algorithm.

Fig. 1
figure 1

An overview of the CUTTLEFISH 2 algorithm. It is capable of constructing the compacted de Bruijn graph from a collection of either reference sequences or raw sequencing reads. The edges ((k+1)-mers) of the underlying de Bruijn graph are enumerated from the input, and optionally filtered based on the user-defined threshold. The edges are then used to enumerate the vertices (k-mer) they contain. An MPHF is constructed over the set of vertices, to associate the DFA-state of each vertex to it. Then the edge set is iterated over to determine the state of the DFA of each vertex in the graph, by transitioning the DFA through appropriate states, based on the edges in which the vertex is observed. Then an iteration over the original vertices to stitch together appropriate edges allows the extraction of the maximal unitigs

CUTTLEFISH 2 first enumerates the set \(\mathscr {E}\) of edges of \(G(\mathscr {R}, k)\), the (k+1)-mers present at least f0 times in \(\mathscr {R}\). This way the potential sequencing errors, present in case in which read sets are given as input, are discarded. Then the set \(\mathscr {V}\) of vertices of \(G(\mathscr {R}, k)\), which are the k-mer present in these (k+1)-mers, are extracted from \(\mathscr {E}\). Next, a Minimal Perfect Hash Function (MPHF) \(\mathfrak {f}\) over these vertices is constructed, that maps them bijectively to \([1, | \, \mathscr {V} \, |]\). This provides a space-efficient way to associate information to the vertices through hashing. Modeling each vertex \(v \in \mathscr {V}\) as a Deterministic Finite Automaton (DFA), a piecewise traversal on \(G(\mathscr {R}, k)\) is made using \(\mathscr {E}\), computing the state Sv of the automaton of each \(v \in \mathscr {V}\)—associated to v through \(\mathfrak {f}(v)\). The DFA modeling scheme ensures the retention of just enough information per vertex, such that the maximal unitigs are constructible afterwards from the automata states. Then, with another piecewise traversal on \(G(\mathscr {R}, k)\) using \(\mathscr {V}\) and the states collection S, CUTTLEFISH 2 retrieves all the non-branching edges of \(G(\mathscr {R}, k)\)—retained by the earlier traversal—and stitches them together in chains, constructing the maximal unitigs.

Experiments

We performed a number of experiments to characterize the various facets of the CUTTLEFISH 2 algorithm, its implementation, and some potential applications. We evaluated its execution performance compared to other available implementations of leading algorithms on de Bruijn graphs solving—(1) the compacted graph construction and (2) the maximal path cover problems, applicable on shared-memory multi-core machines. Although potentially feasible, CUTTLEFISH 2 is not designed as a method to leverage the capability of being distributed on a cluster of compute-nodes. Therefore, we did not consider relevant tools operating in that paradigm. We assessed its ability to construct compacted graphs and path covers for both sequencing reads and large pan-genome collections. By working on the (k+1)-mer spectrum, the new method performs a substantial amount of data reduction on the input sequences, yielding considerable speedups over the CUTTLEFISH algorithm [44] that, instead, requires multiple passes over the input sequences.

Next, we assess some structural characteristics of the algorithm and its implementation. Given an input dataset and a fixed internal parameter γ, the time- and the space-complexity of CUTTLEFISH 2 depend on k (see the “Asymptotics” section). We evaluated the impact of k on its execution performance, and also assessed some structural properties of the compacted graph that change with the parameter k. Moreover, we appraised the parallel scalability of the different steps of the algorithm, characterizing the ones that scale particularly well with increasing processor-thread count, as well as those that saturate more quickly.

A diverse collection of datasets has been used to conduct the experiments. We delineate the pertinent datasets for the experiments in their corresponding sections. The commands used for executing the different tools are available in Additional file 1: Sec. 1.10.

We compared the outputs of CUTTLEFISH 2 to those of several other tools used throughout our experiments. A detailed discussion of this is present at Additional file 1: Sec. 1.5.

Computation system for evaluation

All experiments were performed on a single server with two Intel Xeon E5-2699 v4 2.20 GHz CPUs having 44 cores in total and enabling up-to 88 threads, 512 GB of 2.40 GHz DDR4 RAM, and a number of 3.6 TB Toshiba MG03ACA4 ATA HDDs. The system is run with Ubuntu 16.10 GNU/Linux 4.8.0-59-generic. The running times and the maximum memory usages were measured with the GNU time command, and the intermediate disk-usages were measured using the Linux commands inotifywait and du.

Compacted graph construction for sequencing data

We evaluated the performance of CUTTLEFISH 2 in constructing compacted de Bruijn graphs from short-read sequencing data compared to available implementations of other leading compaction algorithms: (1) ABYSS-BLOOM-DBG, the maximal unitigs assembler of the ABYSS 2.0 assembly-pipeline [27], (2) BIFROST [45], (3) DEGSM [46], and (4) BCALM 2 [47].

The performances were tested on a number of short-read datasets with varied characteristics: (1) Mammalian dataset: (i) a human read set (NIST HG004) from an Ashkenazi white female Homo Sapiens (paired-end 250 bp Illumina reads with 70× coverage, SRA3440461–95, 148 GB compressed FASTQ), from [58], and (ii) an RNA sequencing dataset (ENA PRJEB3365) of 465 human lymphoblastoid cell line samples from the 1000 Genomes project (single-end 36 bp small-RNA-seq Illumina reads, ERP001941, 140 GB compressed FASTQ), from [59]; (2) Metagenomic datasets: (i) a gut microbiome read set (ENA PRJEB33098) from nine individuals (paired-end 150 bp Illumina reads with high coverage, ERP115863, 45 GB compressed FASTQ), from [60], and (ii) a soil metagenome read set (Iowa Corn) from 100-years-cultivated Iowa agricultural corn soil (paired-end 76 bp and 114 bp Illumina reads with low coverage, SRX100357 and SRX099904–06, 152 GB compressed FASTQ), used by [61]; and (3) Large organism dataset: a white spruce read set (NCBI PRJNA83435) from a Canadian Picea glauca tree (paired-end 150 bp and 100 bp Illumina reads with high coverage, SRA056234, 1.14 TB compressed FASTQ), from [62]. Table 1 contains the summary results of the benchmarking.

Table 1 Time- and memory-performance results for constructing compacted de Bruijn graphs from short-read sets

The frequency threshold f0 of k-mers ((k+1)-mers in case of CUTTLEFISH 2 Footnote 1) for the algorithms was approximated using k-mer frequency distributions so as to roughly minimize the misclassification rates of weak and solid k-mersFootnote 2 in these experiments (See Additional file 1: Sec. 1.1). In many practical scenarios, it might be preferable to skip computing an (approximate) frequency distribution, setting f 0 through some informed choice based on the properties of the input data (e.g., the sequencing depth and protocol). This can incorporate more weak k-mer into the graph. We present the results for such a scenario in Additional file 1: Table S2 on the human read set, setting f 0 to just 2.

Across the different datasets and algorithms evaluated, several trends emerge, notable from Table 1. First, we observe that for every dataset considered, CUTTLEFISH 2 is the fastest tool to process the data, while simultaneously using the least amount of memory. If we allow CUTTLEFISH 2 to match the memory used by the second most memory-frugal method (which is always BCALM 2 here), then it often completes even more quickly. We note that CUTTLEFISH 2 retains its performance lead over the alternative approaches across a wide range of different data input characteristics.

Among all the methods tested, CUTTLEFISH 2 and BCALM 2 were the only tools able to process all the datasets to completion under the different configurations tested, within the memory and disk-space constraints of the testing system. The rest of the methods generally required substantially more memory, sometimes over an order of magnitude more, depending on the dataset.

Of particular interest is CUTTLEFISH 2’s performance compared against BCALM 2. Relative to BCALM 2, CUTTLEFISH 2 is 1.7–5.3× faster on the human read set, while using 2.1–2.8× less memory. On the RNA-seq dataset, it is 8.3–5.9× faster, with 1.3× less memory. For the metagenomic datasets, it is 4.1–5.9× faster and uses 2.2–3.1× less memory on the gut microbiome data, and is 2.8–8.5× faster using 2.5–2.7× less memory on the soil data. On the largest sequencing dataset here, the white spruce read set, CUTTLEFISH 2 is 5.4–5.7× faster and is 1.3–2.6× memory-frugal—taking about 10 h, compared to at least 54 h for BCALM 2.

The timing-profile of BCALM 2 and CUTTLEFISH 2 excluding their similar initial stage: k-mer and (k+1)-mer enumeration, respectively, are shown in Additional file 1: Table S4. We also note some statistics of the de Bruijn graphs and their compacted forms for these datasets in Additional file 1: Table S5.

Compacted graph construction for reference collections

We assessed the execution performance of CUTTLEFISH 2 in constructing compacted de Bruijn graphs from whole-genome sequence collections in comparison to the available implementations of the following leading algorithms: (1) BIFROST [45], (2) DEGSM [46], and (3) BCALM 2 [47]. TWOPACO [48] is another notable algorithm applicable in this scenario, but we did not include it in the benchmarking as its output step lacks a parallelized implementation, and we consider very large sequence collections in this experiment.

We evaluated the performances on a number of datasets with varying attributes: (1) Metagenomic collection: 30,691 representative sequences from the most prevalent human gut prokaryotic genomes, gathered by [66] (≈ 61B bp, 18 GB compressed FASTA); (2) Mammalian collection: 100 human genomes—7 real sequences from [49] and 93 sequences simulated by [48] (≈ 294B bp, 305 GB uncompressed FASTA); and (3) Bacterial archive: 661,405 bacterial genomes, collected by [67] from the European Nucleotide Archive (≈ 2.58T bp, 752 GB compressed FASTA). Table 2 conveys the summary results of the benchmarking.

Table 2 Time- and memory-performance results for constructing compacted de Bruijn graphs from whole-genome reference collections

Evaluating the performance of the different tools over these pan-genomic datasets, we observe similar trends to what was observed in Table 1, but with even more extreme differences than before. For a majority of the experiment configurations here, only BCALM 2 and CUTTLEFISH 2 were able to finish processing within time- and machine-constraints. Again, CUTTLEFISH 2 exhibits the fastest runtime on all datasets, and the lowest memory usage on all datasets except the human gut genomes (where it consumes 1–2 GB more memory than BCALM 2, though taking 6–7 fewer hours to complete).

CUTTLEFISH 2 is 2.4–8.9× faster on the 30K human gut genomes compared to the closest competitors, using similar memory. On the 100 human reference sequences, CUTTLEFISH 2 is 4.3–4.7× faster, using 5.4–8.4× less memory. The only other tools able to construct this compacted graph successfully are DEGSM for k=27 (taking 4.3× as long and requiring 8.4× as much memory as CUTTLEFISH 2) and BCALM 2 for k=55 (taking over 4.7× as long and 5.4× as much memory as CUTTLEFISH 2). Finally, when constructing the compacted graph on the 661,405 bacterial genomes, CUTTLEFISH 2 is the only tested tool able to construct the graph for k=27. For k=55, BCALM 2 also completed, taking about 4.5 days, while CUTTLEFISH 2 finished under a day, with similar memory-profile. Overall, we observe that for large pan-genome datasets, CUTTLEFISH 2 is not only considerably faster and more memory-frugal than alternative approaches, but is the only tool able to reliably construct the compacted de Bruijn graph under all the different configurations tested, within the constraints of the experimental system.

Table S4 notes the timing-profiles for BCALM 2 and CUTTLEFISH 2 without their first step of k-mer and (k+1)-mer enumerations, and Table S5 shows some characteristics of the (compacted) de Bruijn graphs for these pan-genome datasets.

Maximal path cover construction

The execution performance of CUTTLEFISH 2 in decomposing de Bruijn graphs into maximal vertex-disjoint paths was assessed compared to the only two available tool implementations in literature [57] for this task: (1) PROPHASM [56] and (2) UST [55].

For sequencing data, we used (1) a roundworm read set (ENA DRR008444) from a Caenorhabditis elegans nematode (paired-end 300 bp Illumina reads, 5.6 GB compressed FASTQ); (2) the gut microbiome read set (ENA PRJEB33098) noted earlier; and (3) the human read set (NIST HG004) noted earlier. For whole-genome data, we used sequences from (1) a roundworm reference (Caenorhabditis elegans, VC2010) [68]; (2) a human reference (Homo sapiens, GRCh38); and (3) 7 real humans, collected from [49]. Table 3 presents the summary results of the benchmarking.

Table 3 Time- and memory-performance results for decomposing de Bruijn graphs into maximal vertex-disjoint paths

We note that CUTTLEFISH 2 outperforms the alternative tools for constructing maximal path covers in terms of the time and memory required. In the context of this task, CUTTLEFISH 2 also offers several qualitative benefits over these tools. For example, PROPHASM exposes only a single-threaded implementation. Further, it is restricted to values of k≤32 and only accepts genomic sequences as input (and thus is not applicable for read sets). UST first makes use of BCALM 2 for maximal unitigs extraction—which we observed to be outperformed by CUTTLEFISH 2 in the earlier experiments—and then employs a sequential graph traversal on the compacted graph to extract a maximal path cover. For this problem, CUTTLEFISH 2 bypasses the compacted graph construction, and directly extracts a maximal cover.

We observe that compared to the tools, CUTTLEFISH 2 is competitive on single-threaded executions. While on moderate-sized datasets using multiple threads, it was 2–3.8× faster than UST using 2.2–12.6× less memory on sequencing data, and for reference sequences it was 2.8–6.1× faster than UST using 2.9–6.3× less memory.

We also provide a comparison of the maximal unitig-based and the maximal path cover-based representations of de Bruijn graphs in Additional file 1: Table S6. We observe that, for the human read set, the path cover representation requires 19–24% less space than the unitigs. For the human genome reference and 7 humans pan-genome references, these reductions are 14–22%, and 20–25%, respectively. From the statistics of both the representations on the gut microbiome read set, it is evident that the corresponding de Bruijn graphs are highly branching, as might be expected for metagenomic data. The space reductions with path cover in these graphs are 33–36%.

Structural characteristics

Given an input dataset \(\mathscr {R}\) and a fixed frequency threshold f 0 for the edges (i.e., (k+1)-mers), the structure of the de Bruijn graph \(G(\mathscr {R}, k)\) is completely determined by the k-mer-size—the edge- and the vertex-counts depend on k, and the asymptotic characteristics of the algorithm are dictated only by the k-mer size k and the hash function space-time tradeoff factor γ (see the “Asymptotics” section). We evaluated how CUTTLEFISH 2 is affected practically by changes in the k-value. The human read set (NIST HG004) noted earlier was used for these evaluations.

For a range of increasing k-values (and a constant γ), we measured the execution performance of CUTTLEFISH 2, and the following metrics of the maximal unitigs it produced: the number of unitigs, the average and the maximum unitig lengths, along with the N50Footnote 3 and the N G A50Footnote 4 scores for contig-contiguity. Across the varying k’s, Table 4 reports the performance- and the unitig-metrics.

Table 4 Time- and memory-performance of CUTTLEFISH 2 for constructing the compacted de Bruijn graph from the human read set NIST HG004, and some corresponding metrics of the output maximal unitigs, over a range of k-mer sizes

The unitig-metrics on this data convey what one might expect—as k increases, so do the average and the maximum lengths of the maximal unitigs, and the N50 and N G A50 metrics, since the underlying de Bruijn graph typically gets less tangled as the k-mer size exceeds repeat lengths [70]. It is worth noting that, since we consider here just the extraction of unitigs, and no graph cleaning or filtering steps (e.g., bubble popping and tip clipping), we expect the N50 to be fairly short.

Perhaps the more interesting observation to be gleaned from the results is the scaling behavior of CUTTLEFISH 2 in terms of k. While the smallest k-value leads to the fastest overall graph construction, with increase in the machine-word count to encode the k-mer, the increase in runtime is rather moderate with respect to k, which follows the expected asymptotics (see the “Time complexity” section). On the other hand, we observe that this increase can be offset by allowing CUTTLEFISH 2 to execute with more memory (which helps in the bottleneck step, (k+1)-mer enumeration). We also note that, while the timing-profile exhibits reasonably good scalability over the parameter k, the effect on the required memory is rather small—it is not directly determined by the k-value, rather is completely dictated by the distinct k-mer count (see the “Space complexity” section).

Parallel scaling

We assessed the scalability of CUTTLEFISH 2 across a varying number of processor-threads. For this experiment, we downsampled the human read set NIST HG004 from 70× to 20× coverage and used this as input. We set k to 27 and 55, and executed CUTTLEFISH 2 with thread-counts ranging in 1–32. For k=27, Fig. 2a shows the time incurred by each step of the algorithm, and Fig. 2b demonstrates their speedups (i.e., factor of improvement in the speed of execution with different number of processor-threads). Additional file 1: Fig. S2 shows these metrics for k=55.

Fig. 2
figure 2

Parallel-scaling metrics for CUTTLEFISH 2 across 1–32 processor threads, using k=27 on the (downsampled) human read set NIST HG004, with the frequency threshold f 0=4

On the computation system used, we observe that all steps of CUTTLEFISH 2 scale well until about 8 threads. Beyond 8 threads, most steps but the minimal perfect hash construction continue to scale. Figure 2a shows that the most time-intensive step in the algorithm is the initial edge set enumeration. This step, along with vertex enumeration and DFA states computation, continue to show reasonably good scaling behavior until about 20 threads, then gradually saturating. The final step of unitigs extraction seems to scale well up to the maximum thread-count we tested with (32 in this case).

It is worth reiterating that all experiments were performed on standard hard drives, and that the most resource-intensive step of edge enumeration can be quite input-output (IO) bound, while the rest of the steps also iterate through the in-disk set of edges or vertices—bound by disk-read speed. So one might expect different (and quite possibly better) scaling behavior for the IO-heavy operations when executing on faster external storage, e.g., in the form of SSD or NVMe drives [71]. This is further evidenced by [72], who show that KMC 3, the method used for the edge and the vertex enumeration steps in CUTTLEFISH 2, could have considerable gains in speed on large datasets when executed on SATA SSDs.

Conclusion

In this paper, we present CUTTLEFISH 2, a new algorithm for constructing the compacted de Bruijn graph, which is very fast and memory-frugal, and highly-scalable in terms of the extent of the input data it can handle. CUTTLEFISH 2 builds upon the work of [44], which already advanced the state-of-the-art in reference-based compacted de Bruijn graph construction. CUTTLEFISH 2 simultaneously addresses the limitation and the bottleneck of CUTTLEFISH, by substantially generalizing the work to allow graph construction from both raw sequencing reads and reference genome sequences, while offering a more efficient performance profile. It achieves this, in large part, through bypassing the need to make multiple passes over the original input for very large datasets.

As a result, CUTTLEFISH 2 is able to construct compacted de Bruijn graphs much more quickly, while using less memory—both often multiple times—than the numerous other methods evaluated. Since the construction of the graph resides upstream of many computational genomics analysis pipelines, and as it is typically one of the most resource-intensive steps in these approaches, CUTTLEFISH 2 could help remove computational barriers to using the de Bruijn graph in analyzing the ever-larger corpora of genomic data.

In addition to the advances it represents in the compacted graph construction, we also demonstrate the ability of the algorithm to compute another spectrum-preserving string set of the input sequences—maximal path covers that have recently been adopted in a growing variety of applications in the literature [57]. A simple restriction on the considered graph structure allows CUTTLEFISH 2 to build this construct much more efficiently than the existing methods.

Though a thorough exploration of the potential benefits of improved compacted de Bruijn graph construction to the manifold downstream analyses is outside the scope of the current work, we present a proof of concept application (Additional file 1: Sec. 1.9), demonstrating the benefits of our improved algorithm to the task of constructing an associative k-mer index.

As the scale of the data on which the de Bruijn graph and its variants must be constructed increases, and as the de Bruijn graph itself continues to find ever-more widespread uses in genomics, we anticipate that CUTTLEFISH 2 will enable its use in manifold downstream applications that may not have been possible earlier due to computational challenges, paving the way for an even more widespread role for the de Bruijn graph in high-throughput computational genomics.

CUTTLEFISH 2 is implemented in C++17, and is available open-source at https://github.com/COMBINE-lab/cuttlefish.

Methods

Related work

Here, we briefly discuss the other compacted de Bruijn graph construction algorithms included in the experiments against which we compare CUTTLEFISH 2.

The BCALM algorithm [50] partitions the k-mer from the input that pass frequency filtering into a collection of disk-buckets according to their minimizers [73], and processes each bucket sequentially as per the minimizer-ordering—loading all the strings of the bucket into memory, joining (or, compacting) them maximally while keeping the resulting paths non-branching in the underlying de Bruijn graph, and distributing each resultant string into some other yet-to-be-processed bucket for potential further compaction, or to the final output. As is, BCALM is inherently sequential. BCALM 2 [47] builds upon this use of minimizers to partition the graph, but it modifies the k-mer partitioning strategy so that multiple disk-buckets can be compacted correctly in parallel, and then glues the further compactable strings from the compacted buckets.

ABYSS-BLOOM-DBG is the maximal unitigs assembler of the ABYSS 2.0 assembly tool [27]. It first saves all the k-mer from the input reads into a cascading Bloom filter [63] to discard the likely-erroneous k-mer. Then it identifies the reads that consist entirely of retained k-mer, and extends them in both directions within the de Bruijn graph through identifying neighbors using the Bloom filter, while discarding the potentially false-positive paths based on their spans—producing the maximal unitigs.

DEGSM first enumerates all the (k+2)-mers of the input that pass frequency filtering. Then using a parallel external sorting over partitions of this set, it groups together the (k+2)-mers with the same middle k-mer, enabling it to identify the branching vertices in the de Bruijn graph. Then it merges the k-mer from the sorted buckets in a strategy so as to produce a Burrows-Wheeler Transform [74] of the maximal unitigs.

BIFROST [45] constructs an approximate compacted de Bruijn graph first by saving the k-mer from the input in a Bloom filter [63], and then for each potential non-erroneous k-mer, it extracts the maximal unitig containing it by extending the k-mer in both directions using the Bloom filter. Then using a k-mer counting based strategy, it refines the graph by removing the false edges induced by the Bloom filter.

Definitions

A strings is an ordered sequence of symbols drawn from an alphabet \({\Sigma }\). For the purposes of this paper, we assume all strings to be over the alphabet Σ={A,C,G,T}, the DNA alphabet where each symbol has a reciprocal complement—the complementary pairs being {A,T} and {C,G}. For a symbol cΣ, \(\overline {c}\) denotes its complement. |s| denotes the length of s. A k-mer is a string with length k. si denotes the ith symbol in s (with 1-based indexing). A substring of s is a string entirely contained in s, and si..j denotes the substring of s located from its ith to the jth indices, inclusive. pre(s) and s uf(s) denote the prefix and the suffix of s with length respectively, i.e., pre(s)=s1.. and suf(s)=s|s|−+1..|s|, for some 0<≤|s|. For two strings x and y with suf(x)=pre(y), the ℓ-length glue operation is defined as xy=x·y+1..|y|, where · denotes the append operation.

For a string s, its reverse complement \(\overline {s}\) is the string obtained through reversing the order of the symbols in s, and replacing each symbol with its complement, i.e., \(\overline {s} = \overline {s_{|s|}} \cdot \ldots \cdot \overline {s_{2}} \cdot \overline {s_{1}}\). The canonical form\(\widehat {s}\) of s is defined as the string \(\widehat {s} = \min (s, \overline {s})\), according to some consistent ordering of the strings in Σ|s|. In this paper, we adopt the lexicographic ordering of the strings.

Given a set \(\mathscr {S}\) of strings and an integer k>0, let \(\mathscr {K}\) and \(\mathscr {K}_{+1}\) be respectively the sets of k-mer and (k+1)-mers present as substrings in some \(s \in \mathscr {S}\). The (directed) node-centric de Bruijn graph\(G_{1} (\mathscr {S}, k) = (\mathscr {V}_{1}, \mathscr {E}_{1})\) is a directed graph where the vertex set is \(\mathscr {V}_{1} = \mathscr {K}\), and the edge set \(\mathscr {E}_{1}\) is induced by \(\mathscr {V}_{1}\): a directed edge \(e = (u, v) \in \mathscr {E}_{1}\) iff sufk−1(u)=prek−1(v). The (directed) edge-centric de Bruijn graph\(G_{2} (\mathscr {S}, k) = (\mathscr {V}_{2}, \mathscr {E}_{2})\) is a directed graph where the edge set is \(\mathscr {E}_{2} = \mathscr {K}_{+1}\): each \(e \in \mathscr {K}_{+1}\) constitutes a directed edge (v1,v2) where v1=prek(e) and v2=sufk(e), and the vertex set \(\mathscr {V}_{2}\) is thus induced by \(\mathscr {E}_{2}\)Footnote 5.

In this work, we adopt the edge-centric definition of de Bruijn graphs. Hence, we use the terms k-mer and vertex and the terms (k+1)-mer and edge interchangeably. We introduce both variants of the graph here as we compare (in the “Results” section) our algorithm with some other methods that employ the node-centric definition.

We use the bidirected variant of de Bruijn graphs in the proposed algorithm, and redefine \(\mathscr {K}_{+1}\) to be the set of canonical (k+1)-mers \(\widehat {z}\) such that z or \(\overline {z}\) appears as substring in some \(s \in \mathscr {S}\)Footnote 6. For a bidirected edge-centric de Bruijn graph \(G(\mathscr {S}, k) = (\mathscr {V}, \mathscr {E})\) — (i) the vertex set \(\mathscr {V}\) is the set of canonical forms of the k-mer present as substrings in some \(e \in \mathscr {K}_{+1}\), and (ii) the edge set is \(\mathscr {E} = \mathscr {K}_{+1}\), where an \(e \in \mathscr {E}\) connects the vertices \(\widehat {{pre}_{k}({e})}\) and \(\widehat {{suf}_{k}({e})}\). A vertex v has exactly two sides, referred to as the front side and the back side.

For a (k+1)-mer z such that \(\widehat {z} \in \mathscr {K}_{+1}\), let \(u = \widehat {{pre}_{k}({z})}\) and \(v = \widehat {{suf}_{k}({z})}\). z induces an edge from the vertex u to the vertex v, and it is said to exitu and enterv. z connects to u’s back iff prek(z) is in its canonical form, i.e., prek(z)=u, and otherwise it connects to u’s front. Conversely, z connects to v’s front iff sufk(z)=v, and otherwise to v’s back. Concisely put, z exits through u’s back iff z’s prefix k-mer is canonical, and enters through v’s front iff z’s suffix k-mer is canonical. The edge corresponding to z can be expressed as ((u,ψ u),(v,ψv)): it connects from the side ψu of the vertex u to the side ψv of the vertex v.

We prove in Lemma 1 (see Additional file 1: Sec. 3) that the two (k+1)-mers z and \(\overline {z}\) correspond to the same edge, but in reversed directions: \(\overline {z}\) induces the edge ((v,ψv),(u,ψu))—opposite from that of z. The two edges for z and \(\overline {z}\) constitute a bidirected edge e={(u,ψu),(v,ψv)}, corresponding to \(\widehat {z} \in \mathscr {E}\). While crossing e during a graph traversal, we say that espells the (k+1)-mer z when the traversal is from (u,ψu) to (v,ψv), and it spells \(\overline {z}\) in the opposite direction.

A walkw=(v0,e 1,v1,…,vn−1,en,vn) is an alternating sequence of vertices and edges in \(G(\mathscr {S}, k)\), where the vertices vi and vi+1 are connected with the edge ei+1, and ei and ei+1 are incident to different sides of vi. |w| denotes its length in vertices, i.e., |w|=n+1. v0 and vn are its endpoints, and the vi for 0<i<n are its internal vertices. The walks (v0,e1,…,en,vn) and (vn,en,…,e1,v0) denote the same walk but in opposite orientations. If the edge ei spells the (k+1)-mer li, then w spells l1kl2kkln. If |w|=1, then it spells v0. A path is a walk without any repeated vertex.

A unitig is a path p=(v0,e1,v1,…,en,vn) such that either |p|=1, or in \(G(\mathscr {S}, k)\):

  1. 1.

    each internal vertex v i has exactly one incident edge at each of its sides, the edges being ei and ei+1

  2. 2.

    and v0 has only e1 and vn has only en incident to their sides to which e1 and en are incident to, respectively.

A maximal unitig is a unitig p=(v0,e1,v1,…,en,vn) such that it cannot be extended anymore while retaining itself a unitig: there exists no x,y,e0, or en+1 such that (x,e0,v0,…,en,vn) or (v0,e1,…,vn,en+1,y) is a unitig.

Figure 3a illustrates an example of the de Bruijn graph and the relevant constructs defined.

Fig. 3
figure 3

A (bidirected) edge-centric de Bruijn graph \(G(\mathscr {S}, k)\) for a set \(\mathscr {S} = \{ CTAAGAT, CGATGCA, TAAGAGG \}\) of strings and k-mer size k=3 in a, and its compacted form \(G_{c} (\mathscr {S}, k)\) in b. In the graphs, the vertices are denoted with pentagons—the flat and the cusped ends depict the front and the back sides respectively, and each edge corresponds to some 4-mer(s) in \(s \in \mathscr {S}\). In a, the vertices are the canonical forms of the k-mer in \(s \in \mathscr {S}\). The canonical string \(\widehat {t}\) associated to each vertex v is labeled inside v, to be spelled in the direction from v’s front to its back. Using \(\widehat {t}\), we also refer to v. The label beneath v is \(\overline {\widehat {t}}\), and is to be spelled in the opposite direction (i.e., back to front). For example, consider the 4-mer CGAT, an edge e in \(G(\mathscr {S}, 3)\). e connects the 3-mers x=pre3(e)=CGA and y=suf3(e)=GAT, the vertices being \(u = \widehat {x} = CGA\) and \(v = \widehat {y} = ATC\) respectively. x is canonical and thus e exits through u’s back; whereas y is non-canonical and hence e enters through v’s back. (CTA,TAA,AAG) is a walk, a path, and also a unitig (edges not listed). (CGA,ATC,ATG) is a walk and a path, but not a unitig—the internal vertex ATC has multiple incident edges at its back. The unitig (CTA,TAA,AAG) is not maximal, as it can be extended farther through AAG’s back. Then it becomes maximal and spells CTAAGA. There are four such maximal unitigs in \(G(\mathscr {S}, 3)\), and contracting each into a single vertex produces \(G_{c} (\mathscr {S}, 3)\), in b. There are two different maximal path covers of \(G(\mathscr {S}, 3)\): spelling {CTAAGATGC,CGA,CCTC} and {CCTCTTAG,CGATGC}

A compacted de Bruijn graph \(G_{c} (\mathscr {S}, k)\) is obtained through collapsing each maximal unitig of the de Bruijn graph \(G(\mathscr {S}, k)\) into a single vertex, through contracting its constituent edges [75]. Figure 3b shows the compacted form of the graph in Fig. 3a. Given a set \(\mathscr {S}\) of strings and an integer k>0, the problem of constructing the compacted de Bruijn graph \(G_{c} (\mathscr {S}, k)\) is to compute the maximal unitigs of \(G(\mathscr {S}, k)\)Footnote 7.

A vertex-disjoint path cover\(\mathscr {P}\) of \(G(\mathscr {S}, k) = (\mathscr {V}, \mathscr {E})\) is a set of paths such that each vertex \(v \in \mathscr {V}\) is present in exactly one path \(p \in \mathscr {P}\). Unless stated otherwise, we refer to this construct simply as path cover. A maximal path cover is a path cover \(\mathscr {P}\) such that no two paths in \(\mathscr {P}\) can be joined into one single path, i.e., there exists no \(p_{1}, p_{2} \in \mathscr {P}\) such that p1=(v0,e1,…,en,x), p2=(y,e1′,…,en′,vn′), and there is some edge \(\big \{ (x, s_{x}), (y, s_{y}) \big \} \in \mathscr {E}\) connecting the sides of x and y that are not incident to en and \(e^{\prime }_{n}\), respectively. Figure 3a provides examples of such.

Algorithm

Given a set \(\mathscr {R}\), either of short-reads sequenced from some biological sample, or of reference sequences, the construction of the compacted de Bruijn graph \(G_{c}(\mathscr {R}, k)\) for some k>0 is a data reduction problem in computational genomics. A simple algorithm to construct the compacted graph Gc involves constructing the ordinary de Bruijn graph \(G(\mathscr {R}, k)\) at first, and then applying a graph traversal algorithm [76] to extract all the maximal non-branching paths in G. However, this approach requires constructing the ordinary graph and retaining it in memory for traversal (or organizing it in a way that it can be paged into memory for efficient traversal). Both of these tasks can be infeasible for large enough input samples. This prompts the requirement of methods to construct Gc directly from \(\mathscr {R}\), bypassing G. CUTTLEFISH 2 is an asymptotically and practically efficient algorithm performing this task.

Another practical aspect of the problem is that the sequenced reads typically contain errors [77]. This is offset through increasing the sequencing depth—even if a read \(r \in \mathscr {R}\) contains some erroneous symbol at index i of the underlying sequence being sampled, a high enough sequencing depth should ensure that some other reads in \(\mathscr {R}\) contain the correct nucleotide present at index i. Thus, in practice, these erroneous symbols need to be identified—usually heuristically—and the vertices and the edges of the graph corresponding to them need to be taken into account. CUTTLEFISH 2 discards the edges, i.e., (k+1)-mers, that each occur less than some threshold parameter f0, and only operates with the edges having frequencies ≥f0.

Implicit traversals over \(G(\mathscr {R}, k)\)

When the input is a set \(\mathscr {S}\) of references, the CUTTLEFISH algorithm [44] makes a complete graph traversal on the reference de Bruijn graphFootnote 8\(G(\mathscr {S}, k)\) through a linear scan over each \(s \in \mathscr {S}\). Also, the concept of sentinelsFootnote 9 in \(G(\mathscr {S}, k)\) ensures that a unitig can not span multiple input sequences. In one complete traversal, the branching vertices are characterized through obtaining a particular set of neighborhood relations; and then using another traversal, the maximal unitigs are extracted.

For a set \(\mathscr {R}\) of short-reads, however, a linear scan over a read \(r \in \mathscr {R}\) may not provide a walk in \(G(\mathscr {R}, k)\), since r may contain errors, which break a contiguous walk. Additionally, the concept of sentinels is not applicable for reads. Therefore, unitigs may span multiple reads. For a unitig u spanning the reads r 1 and r2 consecutively, it is not readily inferrable that r2 is to be scanned after r1 (or the reverse, for an oppositely-oriented traversal) while attempting to extract u directly from \(\mathscr {R}\), as the reads are not linked in the input for this purpose. Hence, contrary to the reference-input algorithm [44] where complete graph traversal is possible with just \(|\mathscr {R}|\) different walks when \(\mathscr {R}\) consists of reference sequences, CUTTLEFISH 2 resorts to making implicit piecewise-traversals over \(G(\mathscr {R}, k)\).

For the purpose of determining the branching vertices, the piecewise-traversal is completely coarse—each walk traverses just one edge. Making such walks, CUTTLEFISH 2 retains just enough adjacency information for the vertices (i.e., only the internal edges of the unitigs) so that the unitigs can then be piecewise-constructed without the input \(\mathscr {R}\). Avoiding the erroneous vertices is done through traversing only the solid edges ((k+1)-mers occurring ≥f0 times in \(\mathscr {R}\), where f0 is a heuristically-set input parameter).

Stitching together the pieces of a unitig is accomplished by making another piecewise-traversal over \(G(\mathscr {R}, k)\), not by extracting those directly from the input sequences (opposed to CUTTLEFISH [44]). Each walk comprises the extent of a maximal unitig—the edges retained by the earlier traversal are used to make the walk and to stitch together the unitig.

In fact, the graph traversal strategy of CUTTLEFISH for reference inputs \(\mathscr {S}\) is a specific case of this generalized traversal, where a complete graph traversal is possible through a linear scan over the input, as each \(s \in \mathscr {S}\) constitutes a complete walk over G({s},k). Besides, the maximal unitigs are tiled linearly in the sequences \(s \in \mathscr {S}\), and determining their terminal vertices is the core problem in that case; as extraction of the unitigs can then be performed through walking between the terminal vertices by scanning the \(s \in \mathscr {S}\).

A deterministic finite automaton model for vertices

While traversing the de Bruijn graph \(G(\mathscr {R}, k) = (\mathscr {V}, \mathscr {E})\) for the purpose of determining the maximal unitigs, it is sufficient to only keep track of information for each side sv of each vertex \(v \in \mathscr {V}\) that can categorize it into one of the following classes:

  1. 1.

    no edge has been observed to be incident to sv yet

  2. 2.

    sv has multiple distinct incident edges

  3. 3.

    sv has exactly one distinct incident edge, for which there are |Σ|=4 possibilities (see Lemma 2, Additional file 1: Sec. 3).

Thus there are six classes to which each sv may belong, and since v has two sides, it can be in one of 6×6=36 distinct configurations. Each such configuration is referred to as a state.

CUTTLEFISH 2 treats each vertex \(v \! \in \!\! \mathscr {V}\) as a Deterministic Finite Automaton (DFA) \(M_{v} = (\mathscr {Q}, \Sigma ', \delta, q_{0}, \mathscr {Q}')\):

States

\(\mathscr {Q}\) is the set of the possible 36 states for the automaton. Letting the number of distinct edges at the front with f and at the back with b for a vertex v with a state q, and based on the incidence characteristics of v, the states \(q \in \mathscr {Q}\) can be partitioned into four disjoint state-classes: (1) fuzzy-front fuzzy-back (f≠1,b≠1), (2) fuzzy-front unique-back (f≠1,b=1), (3) unique-front fuzzy-back, (f=1,b≠1), and (4) unique-front unique-back (f=1,b=1).

Input symbols

Σ={(s,c):s{front,back}, cΣ} is the set of the input symbols for the automaton. Each incident edge e of a vertex u is provided as input to u’s automaton. For u, an incident edge e={(u,su),(v,sv)} can be succinctly encoded by its incidence side su and a symbol cΣ—the (k+1)-mer \(\widehat {z}\) for e is one of the following, depending on su and whether \(\widehat {z}\) is exiting or entering u: u·c, \(\overline {u} \cdot c\), c·u, or \(c \cdot \overline {u}\).

Transition function

\(\delta : \mathscr {Q} \times \Sigma ' \rightarrow \mathscr {Q} \setminus \{q_{0}\}\) is the function controlling the state-transitions of the automaton. Figure 4 illustrates the state-transition diagram for an automaton as per δ.

Fig. 4
figure 4

The state-transition diagram for an automaton \(M_{v} = (\mathscr {Q}, \Sigma ', \delta, q_{0}, \mathscr {Q}')\). Each node in the diagram represents a collection of states \(q \in \mathscr {Q}\), and q0 is the initial state of Mv. A node may represent multiple states collectively. For example, the node at the center of the figure with the symbols x and y at its flat and cusped ends respectively represents 16 states (all the ones from the state-class unique-front unique-back). Thus each node \(\mathscr {Q}_{k}\) represents a disjoint subset of \(\mathscr {Q}\). The pictorial shape of \(\mathscr {Q}_{k}\) is similar to that of a de Bruijn graph vertex (see Fig. 3), and denotes the incidence characteristics of the vertices having their automata in states in \(\mathscr {Q}_{k}\): for a vertex v with its automaton in state \(q_{k} \in \mathscr {Q}_{k}\), a unique symbol at side s of \(\mathscr {Q}_{k}\) means that v has a distinct edge at side s, ellipsis means multiple unique edges, and absence of any symbol means no edge has been observed incident to that side. A directed edge \((\mathscr {Q}_{i}, \mathscr {Q}_{j})\) labeled with (s,c) denotes transitions from a state \(q_{i} \in \mathscr {Q}_{i}\) to a state \(q_{j} \in \mathscr {Q}_{j}\), and (s,c) symbolizes the corresponding input fed to an automaton at state qi for that transition to happen. That is, δ(qi,(s,c))=qj. Thus these edges pictorially encode the transition function δ. For the automaton Mv of a vertex v, an input (s,c) means that an edge e is being added to its side s{f,b}; along with s and v, the character cΣ succinctly encodes e. f and b are shorthands for front and back, respectively. Self-transition is possible for each state \(q \in \mathscr {Q}'\), and are not shown here for brevity

Initial state

\(q_{0} \in \mathscr {Q}\) is the initial state of the automaton. This state corresponds to the configuration of the associated vertex at which no edge has been observed to be incident to either of its sides.

Accept states

\(\mathscr {Q}' = \mathscr {Q} \setminus \{q_{0}\}\) is the set of the states corresponding to vertex-configurations having at least one edgeFootnote 10.

Algorithm overview

We provide here a high-level overview of the CUTTLEFISH 2\((\mathscr {R}, k, f_{0})\) algorithm. The input to the algorithm is a set \(\mathscr {R}\) of strings, an odd integer k>0, and an abundance threshold f0>0; the output is the set of strings spelled by the maximal unitigs of the de Bruijn graph \(G(\mathscr {R}, k)\).

CUTTLEFISH 2\((\mathscr {R}, k, f_{0})\) executes in five high-level stages, and Fig. 1 illustrates these steps. Firstly, it enumerates the set \(\mathscr {E}\) of edges, i.e., (k+1)-mers that appear at least f0 times in \(\mathscr {R}\). Then the set \(\mathscr {V}\) of vertices, i.e., k-mer are extracted from \(\mathscr {E}\). Having the distinct k-mer, it constructs a minimal perfect hash function h over \(\mathscr {V}\). At this point, a hash table structure is set up for the automata—the hash function being h, and each hash bucket having enough bits to store a state encoding. Then, making a piecewise traversal over \(G(\mathscr {R}, k)\) using \(\mathscr {E}\), CUTTLEFISH 2 observes all the adjacency relations in the graph, making appropriate state transitions along the way for the automata of the vertices u and v for each edge { (u,su),(v,sv) }. After all the edges in \(\mathscr {E}\) are processed, each automaton Mv resides in its correct state. Due to the design characteristic of the state-space \(\mathscr {Q}\) of Mv, the internal vertices of the unitigs in \(G(\mathscr {R}, k)\), as well as the non-branching sides of the branching vertices have their incident edges encoded in their states. CUTTLEFISH 2 retrieves these unitig-internal edges and stitches them together in chains until branching vertices are encountered, thus extracting the maximal unitigs implicitly through another piecewise traversal, with each walk spanning a maximal unitig.

These major steps in the algorithm are detailed in the following subsections. Then we analyze the asymptotic characteristics of the algorithm in the “Asymptotics” section. Finally, we provide a proof of its correctness in Theorem 1 (see Additional file 1: Sec. 3).

Edge set construction

The initial enumeration of the edges, i.e., (k+1)-mers from the input set \(\mathscr {R}\) is performed with the KMC 3 algorithm [72]. KMC 3 enumerates the -mers of its input in two major steps. Firstly, it partitions the -mers based on signatures—a restrictive variant of minimizers Footnote 11. Maximal substrings from the input strings with all their -mers having the same signature (referred to as super ℓ-mers) are partitioned into bins corresponding to the signatures. Typically the number of bins is much smaller than the number of possible signatures, and hence each bin may contain strings from multiple signatures (set heuristically to balance the bins). In the second phase, for each partition, its strings are split into substrings called (ℓ,x)-mers—an extension of -mers. These substrings are then sorted using a most-significant-digit radix sort [78] to unify the repeated -mers in the partition. For =k+1, the collection of these deduplicated partitions constitute the edge set \(\mathscr {E}\).

Vertex set extraction

CUTTLEFISH 2 extracts the distinct canonical k-mer—vertices of \(G(\mathscr {R}, k)\)—from \(\mathscr {E}\) through an extension of KMC 3 [72] (See Additional file 1: Sec. 2.1). For such, taking \(\mathscr {E}\) as input, KMC 3 treats each (k+1)-mer \(e \in \mathscr {E}\) as an input string, and enumerates their constituent k-mer applying its original algorithm. Using \(\mathscr {E}\) instead of \(\mathscr {R}\) as input reduces the amount of work performed in this phase through utilizing the work done in the earlier phase—skipping another pass over the entire input set \(\mathscr {R}\), which can be computationally substantial.

Hash table structure setup

An associative data structure is required to store the transitioning states of the automata of the vertices of \(G(\mathscr {R}, k)\). That is, association of some encoding of the states to each canonical k-mer is required. Some off-the-shelf hash table can be employed for this purpose. Due to hash collisions, general-purpose hash tables typically need to store the keys along with their associated data—the key set \(\mathscr {V}\) may end up taking k log2|Σ|=2k bits/k-mer in the hash tableFootnote 12. In designing a more efficient hash table structure, CUTTLEFISH 2 makes use of the facts that (i) the set \(\mathscr {V}\) of keys is static—no alien keys will be encountered while traversing the edges in \(\mathscr {E}\), since \(\mathscr {V}\) is constructed from \(\mathscr {E}\), and (ii) \(\mathscr {V}\) has been enumerated at this point.

A Minimal Perfect Hash Function (MPHF) is applicable in this setting. Given a set \(\mathscr {K}\) of keys, a perfect hash function over \(\mathscr {K}\) is an injective function \(h: \mathscr {K} \rightarrow \mathbb {Z}\), i.e., \(\forall \, k_{1}, k_{2} \in \mathscr {K}, \; k_{1} \neq k_{2} \Leftrightarrow h(k_{1}) \neq h(k_{2})\). h is minimal when its image is \([0, |\mathscr {K}|)\), i.e., an MPHF is an injective function \(h: \mathscr {K} \rightarrow [0, |\mathscr {K}|)\). By definition, an MPHF does not incur collisions. Therefore when used as the hash function for a hash table, it obviates the requirement to store the keys with the table structure. Instead, some encoding of the MPHF needs to be stored in the structure.

To associate the automata to their states, CUTTLEFISH 2 uses a hash table with an MPHF as the hash function. An MPHF h over the vertex set \(\mathscr {V}\) is constructed with the BBHASH algorithm [79]. For the key set \(\mathscr {V}_{0} = \mathscr {V}\), BBHASH constructs h through a number of iterations. It maps each key \(v \in \mathscr {V}_{0}\) to \([\, 1, \, \gamma |\mathscr {V}_{0}| \,]\) with a classical hash function h0, for a provided parameter γ>0. The collision-free hashes in the hash codomain \([\, 1, \, \gamma |\mathscr {V}_{0}| \,]\) are marked in a bit-array A0 of length \(\gamma |\mathscr {V}_{0}|\). The colliding keys are collected into a set \(\mathscr {V}_{1}\), and the algorithm is iteratively applied on \(\mathscr {V}_{1}\). The iterations are repeated until either some \(\mathscr {V}_{n}\) is found empty, or a maximum level is reached. The bit-arrays Ai for the iterations are concatenated into an array A, which along with some other metadata, encode h. A has an expected size of \(\gamma e^{{1}/{\gamma }} |\mathscr {V}|\) bits [79]. γ trades off the encoding size of h with its computation time. γ=2 provides a reasonable trade-off, with the size of h being ≈3.7 bits/vertexFootnote 13. Note that, the size is independent of k, i.e., the size of the keys.

For the collection of hash buckets, CUTTLEFISH 2 uses a linear array [81] of size \(|\mathscr {V}|\). Since each bucket is to contain some state \(q \in \mathscr {Q}\), \(\lceil \log _{2} |\mathscr {Q}| \rceil = \lceil \log _{2} 36 \rceil = 6\) bits are necessary (and also sufficient) to encode q. Therefore CUTTLEFISH 2 uses 6 bits for each bucket. The hash table structure is thus composed of an MPHF h and a linear array S: for a vertex v, its (transitioning) state qv is encoded at the index h(v) of S, and in total the structure uses ≈9.7 bits/vertex.

Automaton states computation

Given the set \(\mathscr {E}\) of edges of a de Bruijn graph \(G(\mathscr {R}, k)\) and an MPHF h over its vertex set \(\mathscr {V}\), the COMPUTE-AUTOMATON-STATES\((\mathscr {E}, h)\) algorithm computes the state of the automaton Mv of each \(v \in \mathscr {V}\).

It initializes each automaton Mw with q0—the initial state corresponding to no incident edges. Then for each edge \(e = \{ \, (\widehat {u}, s_{u}), (\widehat {v}, s_{v}) \, \} \in \mathscr {E}\), connecting the vertex \(\widehat {u}\) via its side su to the vertex \(\widehat {v}\) via its side sv, it makes appropriate state transitions for Mu and Mv, the automata of \(\widehat {u}\) and \(\widehat {v}\) respectively. For each endpoint \(\widehat {w}\) of e, (sw,cw) is fed to Mw, where cwΣ. Together with \(\widehat {w}\), sw and cw encode e. The setting policy for cw is described in the following. Technicalities relating to loops are accounted for in the CUTTLEFISH 2 implementation, but are omitted from discussion for simplicity.

e has two associated (k+1)-mers: z and \(\overline {z}\). Say that z=uk−1v. Based on whether \(u = \widehat {u}\) holds or not, e is incident to either \(\widehat {u}\)’s back or front. As defined (see the “Definitions” section), if it is incident to the back, then \(z = \widehat {u} \cdot c\); otherwise, \(z = \overline {\widehat {u}} \cdot c\), where c=ek+1. In these cases respectively, \(\overline {z} = \overline {c} \cdot \overline {\widehat {u}}\), and \(\overline {z} = \overline {c} \cdot \widehat {u}\). For consistency, CUTTLEFISH 2 always uses a fixed form of e for \(\widehat {u}\)—either z or \(\overline {z}\)—to provide it as input to Mu: the one containing the k-mer u in its canonical form. So if e is at \(\widehat {u}\)’s back, the \(\widehat {u} \cdot c\) form is used for e, and (back,c) is fed to Mu; otherwise, e is expressed as \(\overline {c} \cdot \widehat {u}\) and \(({front}, \overline {c})\) is the input for Mu. The encoding (sv,c) of e for \(\widehat {v}\) is set similarly.

Maximal unitigs extraction

Given the set \(\mathscr {V}\) of vertices of a de Bruijn graph \(G(\mathscr {R}, k)\), an MPHF h over \(\mathscr {V}\), and the states-table S for the automata of \(v \in \mathscr {V}\), the EXTRACT-MAXIMAL-UNITIGS\((\mathscr {V}, h, S)\) algorithm assembles all the maximal unitigs of \(G(\mathscr {R}, k)\).

The algorithm iterates over the vertices in \(\mathscr {V}\). For some vertex \(\widehat {v} \in \mathscr {V}\), let p be the maximal unitig containing \(\widehat {v}\). p can be broken into two subpaths: pb and pf, overlapping only at \(\widehat {v}\). The EXTRACT-MAXIMAL-UNITIGS\((\mathscr {V}, h, S)\) algorithm extracts these subpaths separately, and joins them at \(\widehat {v}\) to construct p. Then p’s constituent vertices are marked by transitioning their automata to some special states (not shown in Fig. 4), so that p is extracted only once.

The subpaths pb and pf are extracted by initiating two walks: one from each of \(\widehat {v}\)’s sides back and front, using the WALK-MAXIMAL-UNITIG\((\widehat {v}, s_{v})\) algorithm. Each walk continues on until a flanking vertex\(\widehat {x}\) is encountered. For a vertex \(\widehat {x}\), let qx denote the state of \(\widehat {x}\)’s automaton and \(\mathscr {C}_{x}\) denote qx’s state-class. Then \(\widehat {x}\) is noted to be a flanking vertex iff:

figure d
figure e
  1. 1)

    either \(\mathscr {C}_{x}\) is not unique-front unique-back;

  2. 2)

    or \(\widehat {x}\) connects to the side sy of a vertex \(\widehat {y}\) such that:

    1. (a)

      \(\mathscr {C}_{y}\) is fuzzy-front fuzzy-back; or

    2. (b)

      sy=front and \(\mathscr {C}_{y}\) is fuzzy-front unique-back; or

    3. (c)

      sy=back and \(\mathscr {C}_{y}\) is unique-front fuzzy-back.

Lemma 3 (see Additional file 1: Sec. 3) shows that the flanking vertices in \(G(\mathscr {R}, k)\) are precisely the endpoints of its maximal unitigs.

The WALK-MAXIMAL-UNITIG\((\widehat {v}, s_{v})\) algorithm initiates a walk w from \(\widehat {v}\), exiting through its side sv. It fetches \(\widehat {v}\)’s state qv from the hash table. If qv is found to be not belonging to the state-class unique-front unique-back due to sv having ≠1 incident edges, then \(\widehat {v}\) is a flanking vertex of its containing maximal unitig p, and p has no edges at sv. Hence w terminates at \(\widehat {v}\). Otherwise, sv has exactly one incident edge. The walk algorithm makes use of the fact that, the vertex-sides su that are internal to the maximal unitigs in \(G(\mathscr {R}, k)\) contain their adjacency information encoded in the states qu of their vertices \(\widehat {u}\)’s automata, once the COMPUTE-AUTOMATON-STATES\((\mathscr {E}, h)\) algorithm is executed. Thus, it decodes qv to get the unique edge \(e = \{ (\widehat {u}, s_{u}), (\widehat {v}, s_{v}) \}\) incident to sv. Through e, w reaches the neighboring vertex \(\widehat {u}\), at its side su. \(\widehat {u}\)’s state qu is fetched, and if qu is found not to be in the class unique-front unique-back due to su having >1 incident edges, then both \(\widehat {u}\) and \(\widehat {v}\) are flanking vertices (for different maximal unitigs), and w retracts to and stops at \(\widehat {v}\). Otherwise, e is internal to p, and w switches to the other side of \(\widehat {u}\), proceeding on similarly. It continues through vertices \(\widehat {v}_{i}\) in this manner until a flanking vertex of p is reached, stitching together the edges along the way to construct a subpath of p.

A few constant-time supplementary procedures are used throughout the algorithm. IS-FUZZY-SIDE(q,s) determines whether a vertex with the state q has 0 or >1 edges at its side s. EDGE-EXTENSION(q,s) returns an encoding of the edge incident to the side s of a vertex with state q. ENTRANCE-SIDE\((\widehat {v}, v)\) (and EXIT-SIDE\((\widehat {v}, v)\)) returns the side used to enter (and exit) the vertex \(\widehat {v}\) when its k-mer form v is observed.

Maximal path-cover extraction

We discuss here how CUTTLEFISH 2 might be modified so that it can extract a maximal path cover of a de Bruijn graph \(G(\mathscr {R}, k)\). For such, only the COMPUTE-AUTOMATON-STATES step needs to be modified, and the rest of the algorithm remains the same. Given the edge set \(\mathscr {E}\) of the graph \(G(\mathscr {R}, k)\) and an MPHF h over its vertex set \(\mathscr {V}\), COMPUTE-AUTOMATON-STATES-PATH-COVER presents the modified DFA states computation algorithm.

The maximal path cover extraction variant of CUTTLEFISH 2 works as follows. It starts with a trivial path cover \(\mathscr {P}_{0}\) of \(G(\mathscr {R}, k)\): each \(v \in \mathscr {V}\) constitutes a single path, spanning the subgraph \(G'(\mathscr {V}, \emptyset)\). Then it iterates over the edges \(e \in \mathscr {E}\) (with \(\left |\mathscr {E}\right | = m\)) in arbitrary order. We will use \(\mathscr {P}_{i}\) to refer to the path cover after having visited i edges. At any given point of the execution, the algorithm maintains the invariant that \(\mathscr {P}_{i}\) is a maximal path cover of the graph \(G'(\mathscr {V}, \mathscr {E}')\), where \(\mathscr {E}' \subseteq \mathscr {E}\) (with \(\left |\mathscr {E}'\right | = i\)) is the set of the edges examined until that point. When examining the (i+1)’th edge e={(u,su),(v,sv)}, it checks whether e connects two different paths in \(\mathscr {P}_{i}\) into one single path: this is possible iff the sides su and sv do not have any incident edges already in \(\mathscr {E}'\), i.e., the sides are empty in \(G'(\mathscr {V}, \mathscr {E}')\). If this is the case, the paths are joined in \(\mathscr {P}_{i+1}\) into a single path containing the new edge e. Otherwise, the path cover remains unchanged so that \(\mathscr {P}_{i+1} = \mathscr {P}_{i}\). By definition, \(\mathscr {P}_{i+1}\) is a path cover of \(G'(\mathscr {V}, \mathscr {E}' \cup \{e\})\), as e could only affect the paths (at most two) in \(\mathscr {P}_{i}\) containing u and v, while the rest are unaffected and retain maximality—thus the invariant is maintained. By induction, \(\mathscr {P}_{m}\) is a path cover of \(G(\mathscr {V}, \mathscr {E})\) once all the edges have been examined, i.e., when \(\mathscr {E}' = \mathscr {E}\).

By making state transitions for the automata only for the edges present internal to the paths \(p \in \mathscr {P}_{m}\), the COMPUTE-AUTOMATON-STATES-PATH-COVER\((\mathscr {E}, h)\) algorithm seamlessly captures the subgraph \(G_{\mathscr {P}_{m}}\) of \(G(\mathscr {R}, k)\) that is induced by the path cover \(\mathscr {P}_{m}\). \(G_{\mathscr {P}_{m}}\) consists of a collection of disconnected maximal paths, and thus there exists no branching in \(G_{\mathscr {P}_{m}}\). Consequently, each of these maximal paths is a maximal unitig of \(G_{\mathscr {P}_{m}}\). The subsequent EXTRACT-MAXIMAL-UNITIGS algorithm operates using the DFA states collection S computed at this step, and therefore it extracts precisely these maximal paths.

Parallelization

CUTTLEFISH 2 is highly parallelizable on a shared-memory multi-core machine. The ENUMERATE-EDGES and the EXTRACT-VERTICES steps, using KMC 3 [72], are parallelized in their constituent phases via parallel distribution of the input (k+1)-mers (and k-mer) into partitions, and sorting multiple partitions in parallel.

The COMPUTE-MINIMAL-PERFECT-HASH step using BBHASH [79] parallelizes the construction through distributing disjoint subsets \(\mathscr {V}_{i}\) of the vertices to the processor-threads, and the threads process the \(\mathscr {V}_{i}\)’s in parallel.

The next two steps, COMPUTE-AUTOMATON-STATES and EXTRACT-MAXIMAL-UNITIGS, both (piecewise) traverse the graph through iterating over \(\mathscr {E}\) and \(\mathscr {V}\) respectively. The processor-threads are provided disjoint subsets of \(\mathscr {E}\) and \(\mathscr {V}\) to process in parallel. Although the threads process different edges in COMPUTE-AUTOMATON-STATES, multiple threads may access the same automaton into the hash table simultaneously, due to edges sharing endpoints. Similarly in EXTRACT-MAXIMAL-UNITIGS, though the threads examine disjoint vertex sets, multiple threads simultaneously constructing the same maximal unitig from its different constituent vertices can access the same automaton concurrently, at the walks’ meeting vertex. CUTTLEFISH 2 maintains exclusive access to a vertex to one thread at a time through a sparse set \(\mathscr {L}\) of locks. Each lock \(l \in \mathscr {L}\) guards a disjoint set \(\mathscr {V}_{i}\) of vertices, roughly of equal size. With p processor-threads and assuming all p threads accessing the hash table at the same time, the probability of two threads concurrently probing the same lock at the same turn is \(\big (1 - (1 - {^{1\!}}/_{\!|\mathscr {L}|})^{p \choose 2} \big)\)—this is minuscule with an adequate \(|\mathscr {L}|\)Footnote 14.

Asymptotics

In this section, we analyze the computational complexity of the CUTTLEFISH 2\((\mathscr {R}, k, f_{0})\) algorithm when executed on a set \(\mathscr {R}\) of strings, given a k value, and a threshold factor f0 for the edges in \(G(\mathscr {R}, k)\). \(\mathscr {E}\) is the set of the (k+1)-mers occurring ≥f0 times in \(\mathscr {R}\), and \(\mathscr {V}\) is the set of the k-mer in \(\mathscr {E}\). Let be the total length of the strings \(r \in \mathscr {R}\), n be the vertex-count \(|\mathscr {V}|\), and m be the edge-count \(|\mathscr {E}|\).

Time complexity

CUTTLEFISH 2 represents j-mers with 64-bit machine-words—packing 32 symbols into a single word. Let wj denote the number of words in a j-mer, i.e., \(w_{j} = \lceil \frac {j}{32} \rceil\).

Note that the number of (k+1)-mers in \(\mathscr {R}\) is upper-bounded by . The ENUMERATE-EDGES step uses the KMC 3 [72] algorithm. At first, it partitions the (k+1)-mers into buckets based on their signatures. With a rolling computation, determining the signature of a (k+1)-mer takes an amortized constant time. Assigning a (k+1)-mer to its bucket then takes time \(\mathcal {O} (w_{k + 1})\), and the complete distribution takes \(\mathcal {O}(w_{k + 1} \ell)\)Footnote 15. As each (k+1)-mer consists of wk+1 words, radix-sorting a bucket of size Bi takes time \(\mathcal {O}(B_{i} w_{k + 1})\). So in the second step, for a total of b buckets for \(\mathscr {R}\), the cumulative sorting time is \(\sum _{i = 1}^{b} \mathcal {O}(B_{i} w_{k + 1}) = \mathcal {O} (w_{k + 1} \sum _{i = 1}^{b} B_{i}) = \mathcal {O}(w_{k + 1} \ell)\). Thus ENUMERATE-EDGES takes time \(\mathcal {O}(\ell w_{k + 1})\).

The EXTRACT-VERTICES step applies KMC 3 [72] with \(\mathscr {E}\) as input, and hence we perform a similar analysis as earlier. Each \(e \in \mathscr {E}\) comprises two k-mer. So partitioning the k-mer takes time \(\mathcal {O}(2m w_{k})\), and radix-sorting the buckets takes \(\mathcal {O}(w_{k} \sum B_{i}) = \mathcal {O}(2m w_{k})\). Therefore EXTRACT-VERTICES takes time \(\mathcal {O} (m w_{k})\).

The CONSTRUCT-MINIMAL-PERFECT-HASH step applies the BBHASH [79] algorithm to construct an MPHF h over \(\mathscr {V}\). It is a multi-pass algorithm—each pass i tries to assign final hash values to a subset \(\mathscr {K}_{i}\) of keys. Making a bounded number of passes over sets \(\mathscr {K}_{i}\) of keys—shrinking in size—it applies some classical hash hi on the \(x \in \mathscr {K}_{i}\) in each pass. For some \(x \in \mathscr {K}_{i}\), iff hi(x) is free of hash collisions, then x is not propagated to \(\mathscr {K}_{i + 1}\). Provided that the hi’s are uniform and random, each key \({v} \in \mathscr {V}\) is hashed with the hi’s an expected \(\mathcal {O}(e^{{1}/{\gamma }})\) times [79], an exponentially decaying function on the γ parameter. Given that hi’s are constant time on machine-words, computing hi(v) takes time \(\mathcal {O}(w_{k})\). Then the expected time to assign its final hash value to a \({v} \in \mathscr {V}\) is \(H(k) = \mathcal {O} \big (w_{k} e^{{1}/{\gamma }} \big)\). Therefore CONSTRUCT-MINIMAL-PERFECT-HASH takes an expected time \(\mathcal {O} \big (n H(k) \big)\). Note that, querying h, i.e., computing h(v) also takes time H(k), as the query algorithm is multi-pass and similar to the construction.

The COMPUTE-AUTOMATON-STATES step initializes the n automata with the state q0, taking time \(\mathcal {O} (n)\). Then for each edge \(e \in \mathscr {E}\), it fetches its two endpoints’ states from the hash table in time 2H(k), updating them if required. In total there are 2m hash accesses, and thus COMPUTE-AUTOMATON-STATES takes time \(\mathcal {O} (n + m H(k))\).

The EXTRACT-MAXIMAL-UNITIGS step scans through each vertex \({v} \in \mathscr {V}\), and walks the entire maximal unitig p containing v. The state of each vertex in p is decoded to complete the walk—requiring |p| hash table accesses, taking time |p|H(k). If the flanking vertices of p are non-branching, then the walk also visits their neighboring vertices that are absent in p, at most once per each endpoint. Once extracted, all the vertices in p are marked so that p is not extracted again later on—this takes time |p|H(k), and can actually be done in time \(\mathcal {O} (|p|)\) by saving the hash values of the path vertices while constructing p. Thus traversing all the ui’s in the maximal unitigs set \(\mathscr {U}\) takes time \(\big (H(k) \sum \limits _{u_{i} \in \mathscr {U}} (|u_{i}| + 2) + \sum \limits _{u_{i} \in \mathscr {U}} |u_{i}| \big) = n H(k)\). \(\sum _{u_{i} \in \mathscr {U}} |u_{i}|\) equates to n because the set of the maximal unitigs \(\mathscr {U}\) forms a vertex decomposition of \(G(\mathscr {R}, k)\) [47]. Thus EXTRACT-MAXIMAL-UNITIGS takes time \(\mathcal {O} \big (n H(k) \big)\).

In the brief analysis for the last three steps, we do not include the time to read the edges \(\big (\mathcal {O} (m w_{k + 1}) \big)\) and the vertices \(\big (\mathcal {O} (n w_{k}) \big)\) into memory, as they are subsumed by other terms.

Thus, CUTTLEFISH 2\((\mathscr {R}, k, f_{0})\) has an expected running time \(\mathcal {O} \big (\ell w_{k + 1} + m w_{k} + (n + m)H(k) \big)\), where \(w_{j} = \lceil \frac {j}{32} \rceil\), \(H(k) = \mathcal {O} (w_{k} e^{{1}/{\gamma }})\), and γ>0 is a constant. It is evident that the bottleneck is the initial ENUMERATE-EDGES step, and it asymptotically subsumes the running time.

Space complexity

Here, we analyze the working memory (i.e., RAM) required by the CUTTLEFISH 2 algorithm. The ENUMERATE-EDGES step with KMC 3 [72] can work within a bounded memory space. Its partitioning phase distributes input k-mer into disk bins, and the k-mer are kept in working memory within a total space limit S, before flushes to disk. Radix-sorting the bins are done through loading bins into memory with sizes within S, and larger bins are broken into sub-bins to facilitate bounded-memory sort. As we discuss below, the graph traversal steps require a fixed amount of memory, determined linearly by n. As n is not computed until the completion of EXTRACT-VERTICES, we approximate it within the KMC 3 algorithm (see Additional file 1: Sec. 2.1), and then bound the memory for the KMC 3 execution appropriately. The next step of EXTRACT-VERTICES is also performed similarly within the same memory-bound.

The CONSTRUCT-MINIMAL-PERFECT-HASH step with BBHASH [79] processes the key set \(\mathscr {V}\) in fixed-sized chunks. Each pass i with key set \(\mathscr {V}_{i}\) has a bit-array Ai to mark hi(v) for all the \({v} \in \mathscr {V}_{i}\), along with an additional bit-array Ci to detect the hash collisions. Both Ai and Ci have the size \(\gamma |\mathscr {V}_{i}|\). The finally concatenated Ai’s is the output data structure A for the algorithm, and some Ci is present only during the pass i. A has an expected size of γe1/γn bits [79]. \(|C_{0}| = \gamma |\mathscr {V}_{0}| = \gamma n\), and this is the largest collision array in the algorithm’s lifetime. Thus, an expected loose upper-bound of the memory usage in this step is \(\mathcal {O} \big (|A| + |C_{0}| \big) = \mathcal {O} \big ((e^{{1}/{\gamma }} + 1) \gamma n \big)\) bits.

At this point in the algorithm, a hash table structure is set up for the automata. Together, the hash function h and the hash buckets collection S take an expected space of \(\big (\gamma e^{{1}/{\gamma }} n + n \lceil \log _{2} |\mathscr {Q}| \rceil \big) = (\gamma e^{{1}/{\gamma }} + 6) n\) bits.

The COMPUTE-AUTOMATON-STATES step scans the edges in \(\mathscr {E}\) in fixed-sized chunks. For each \(e \in \mathscr {E}\), it queries and updates the hash table for the endpoints of e as required. Similarly, the EXTRACT-MAXIMAL-UNITIGS step scans the vertices in \(\mathscr {V}\) in fixed-sized chunks, and spells the containing maximal unitig of some \(v \in \mathscr {V}\) through successively querying the hash table for the path vertices. The spelled paths are dumped to disk at a certain cumulative threshold size. Thus the only non-trivial memory usage by these steps is from the hash table. Therefore these graph traversal steps use \(\big ((\gamma e^{{1}/{\gamma }} + 6) n + \mathcal {O} (1) \big)\) bits.

When γ≤6, the hash table (i.e., the hash function and the bucket collection) is the dominant factor in the algorithm’s memory usage, and CUTTLEFISH 2\((\mathscr {R}, k, f_{0})\) consumes expected space \(\mathcal {O} \big ((\gamma e^{{1}/{\gamma }} + 6) n \big)\). If γ>6 is set, then it could be possible for the hash function construction memory to dominate. In practice, we adopt γ=2, and the observed memory usage is ≈9.7n bits, translating to ≈1.2 bytes per distinct k-mer.

Availability of data and materials

All data generated or analyzed during this study are included in this published article (and its supplementary information files). The data supporting the findings of this study are publicly available, and the data sources are noted in the appropriate sections in the manuscript (see the “Results” section).

Cuttlefish 2 is implemented in C++17, and is released under the BSD 3-Clause “New” or “Revised” License. The latest source code is available at the GitHub repository: https://github.com/COMBINE-lab/cuttlefish [82]. The source code version as used in preparing the manuscript is available at https://doi.org/10.5281/zenodo.6897066 [83].

Notes

  1. From our observations, the distributions of k-mer frequencies and of (k+1)-mer frequencies on real data tend to agree closely, resulting in the same f 0 for these experiments for both Cuttlefish 2 and the rest of the algorithms, as per the setting-policy used.

  2. k-mer occurring frequently enough in input NGS reads are said to be solid k-mer, and the other ones are said to be weak [65].

  3. Length of the longest contig such that all the contigs having lengths ≥ sum in size to at least 50% of the sum size of the contigs.

  4. Analogous to N50, except for: (1) breaking the contigs into their constituent blocks that can be aligned to an associated reference sequence, and (2) replacing the sum size of contigs with the reference length.

  5. As per this definition, \(\mathscr {V}_{2} = \mathscr {K}\). We describe in the Algorithm section a practical consideration that implies that \(\mathscr {V}_{2}\) need not necessarily be the same as \(\mathscr {K}\) when some filtering is applied on the input \(\mathscr {S}\) to generate \(\mathscr {K}_{+1}\).

  6. This is to account for the DNA being double-stranded, and a genomic string may come from either of these oppositely-oriented complementary strands.

  7. Discarding orientations: the two unitigs (v0,…,vn) and (vn,…,v0) are topologically the same.

  8. Introduced by [44], based on the input to the de Bruijn graph constructions being either reference sequences or sequencing reads, the graphs are distinguished as either reference or read de Bruijn graphs.

  9. A vertex v is a sentinel if the first or the last k-mer x of an input string corresponds to v. Let v’s empty side in x be sv. The graph \(G(\mathscr {S}, k)\) is modified such that sv connects to a special branching vertex Υ—preventing unitigs containing v to span farther through sv.

  10. Formally, \(\mathscr {Q}'\) is the set of states reachable from q0 through transitions as per some definite patterns of input symbols. For our purposes, recognizing specific input patterns is not a concern—rendering this parameter redundant—we define it as the set of the final states an automaton can be in having fed all its inputs.

  11. For a given j<, a j-minimizer of an -mer x is the smallest j-mer substring of x according to some specified function.

  12. This can be improved by having 4p different hash tables for \(\mathscr {V}\), for a fixed prefix length pk. Each hash table then accounts for keys of length (kp).

  13. It can be as low as ≈3 bits/vertex with γ=1, at the expense of slower hashing. The theoretical lower limit for MPHFs is ≈1.44 bits/key [80].

  14. The optimal (in regard to probability) value \(|\mathscr {L}| = |\mathscr {V}|\) is not used due to the locks’ memory usage.

  15. This bound is not tight, as KMC 3 actually distributes sequences longer than (k+1)-mers—reducing computation (see the Edge set construction section).

References

  1. US National Library of Medicine. NCBI insights: the entire corpus of the sequence read archive (SRA) now live on two cloud platforms! Natl Cent Biotechnol Inf. 2020. https://ncbiinsights.ncbi.nlm.nih.gov/2020/02/24/sra-cloud/. Accessed 8 Nov 2021.

  2. Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE. Big data: Astronomical or genomical?PLoS Biol. 2015; 13(7):1–11. https://doi.org/10.1371/journal.pbio.1002195.

    Article  CAS  Google Scholar 

  3. Nayfach S, Roux S, Seshadri R, Udwary D, Varghese N, Schulz F, Wu D, Paez-Espino D, Chen I-M, Huntemann M, Palaniappan K, Ladau J, et al.A genomic catalog of earth’s microbiomes. Nat Biotechnol. 2021; 39(4):499–509. https://doi.org/10.1038/s41587-020-0718-6.

    CAS  PubMed  Article  Google Scholar 

  4. Gevers D, Knight R, Petrosino JF, Huang K, McGuire AL, Birren BW, Nelson KE, White O, Methé BA, Huttenhower C. The human microbiome project: A community resource for the healthy human microbiome. PLOS Biol. 2012; 10(8):1–5. https://doi.org/10.1371/journal.pbio.1001377.

    Article  CAS  Google Scholar 

  5. Muir P, Li S, Lou S, Wang D, Spakowicz DJ, Salichos L, Zhang J, Weinstock GM, Isaacs F, Rozowsky J, et al. The real cost of sequencing: scaling computation to keep pace with data generation. Genome Biol. 2016; 17(1):1–9.

    Article  CAS  Google Scholar 

  6. de Bruijn NG. A combinatorial problem. Nederl Akad Wetensch Proc. 1946; 49:758–64.

    Google Scholar 

  7. Good IJ. Normal recurring decimals. J Lond Math Soc. 1946; s1-21(3):167–9. https://doi.org/10.1112/jlms/s1-21.3.167.

    Article  Google Scholar 

  8. Simpson JT, Pop M. The theory and practice of genome sequence assembly. Annu Rev Genomics Hum Genet. 2015; 16(1):153–72. https://doi.org/10.1146/annurev-genom-090314-050032.

    CAS  PubMed  Article  Google Scholar 

  9. Pevzner PA, Tang H, Waterman MS. An eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci. 2001; 98(17):9748–53. https://doi.org/10.1073/pnas.171285098.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  10. Limasset A, Flot J-F, Peterlongo P. Toward perfect reads: self-correction of short reads via mapping on de bruijn graphs. Bioinformatics. 2019; 36(5):1374–81. https://doi.org/10.1093/bioinformatics/btz102.

    Article  CAS  Google Scholar 

  11. Salmela L, Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics. 2014; 30(24):3506–14. https://doi.org/10.1093/bioinformatics/btu538.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  12. Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, Rizk G. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinformatics. 2015; 16(1):288. https://doi.org/10.1186/s12859-015-0709-7.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  13. Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet. 2012; 44(2):226–32. https://doi.org/10.1038/ng.1028.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  14. Cameron DL, Schröder J, Penington JS, Do H, Molania R, Dobrovic A, Speed TP, Papenfuss AT. GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Res. 2017; 27(12):2050–60. https://doi.org/10.1101/gr.222109.117.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  15. Almodaresi F, Zakeri M, Patro R. PuffAligner: a fast, efficient and accurate aligner based on the pufferfish index. Bioinformatics. 2021. https://doi.org/10.1093/bioinformatics/btab408.

  16. Liu B, Guo H, Brudno M, Wang Y. deBGA: read alignment with de bruijn graph-based seed and extension. Bioinformatics. 2016; 32(21):3224–32. https://doi.org/10.1093/bioinformatics/btw371.

    CAS  PubMed  Article  Google Scholar 

  17. Almodaresi F, Khan J, Madaminov S, Pandey P, Ferdman M, Johnson R, Patro R. An incrementally updatable and scalable system for large-scale sequence search using LSM trees. BioRxiv. 2021. https://doi.org/10.1101/2021.02.05.429839.

  18. Ye Y, Tang H. Utilizing de bruijn graph of metagenome assembly for metatranscriptome analysis. Bioinformatics. 2015; 32(7):1001–8. https://doi.org/10.1093/bioinformatics/btv510.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  19. Bradley P, Gordon NC, Walker TM, Dunn L, Heys S, Huang B, Earle S, Pankhurst LJ, Anson L, de Cesare M, Piazza P, Votintseva AA, Golubchik T, Wilson DJ, Wyllie DH, Diel R, Niemann S, Feuerriegel S, Kohl TA, Ismail N, Omar SV, Smith EG, Buck D, McVean G, Walker AS, Peto TEA, Crook DW, Iqbal Z. Rapid antibiotic-resistance predictions from genome sequence data for staphylococcus aureus and mycobacterium tuberculosis. Nat Commun. 2015; 6(1):10063. https://doi.org/10.1038/ncomms10063.

    CAS  PubMed  Article  Google Scholar 

  20. Wang M, Ye Y, Tang H. A de Bruijn graph approach to the quantification of closely-related genomes in a microbial community. J Comput Biol. 2012; 19(6):814–25. https://doi.org/10.1089/cmb.2012.0058.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  21. Peng Y, Leung HCM, Yiu S-M, Lv M-J, Zhu X-G, Chin FYL. IDBA-tran: a more robust de novo de bruijn graph assembler for transcriptomes with uneven expression levels. Bioinformatics. 2013; 29(13):326–34. https://doi.org/10.1093/bioinformatics/btt219.

    Article  CAS  Google Scholar 

  22. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011; 29(7):644–52. https://doi.org/10.1038/nbt.1883.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  23. Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016; 34(5):525–7. https://doi.org/10.1038/nbt.3519.

    CAS  PubMed  Article  Google Scholar 

  24. Ekim B, Berger B, Chikhi R. Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer. Cell Syst. 2021; 12(10):958–9686. https://doi.org/10.1016/j.cels.2021.08.009.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  25. Ruan J, Li H. Fast and accurate long-read assembly with wtdbg2. Nat Methods. 2020; 17(2):155–8. https://doi.org/10.1038/s41592-019-0669-3.

    CAS  PubMed  Article  Google Scholar 

  26. Lin Y, Yuan J, Kolmogorov M, Shen MW, Chaisson M, Pevzner PA. Assembly of long error-prone reads using de Bruijn graphs. Proc Natl Acad Sci. 2016; 113(52):8396–405. https://doi.org/10.1073/pnas.1604560113.

    Google Scholar 

  27. Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, Birol I. ABySS 2.0: resource-efficient assembly of large genomes using a bloom filter. Genome Res. 2017; 27(5):768–77. https://doi.org/10.1101/gr.214346.116.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  28. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics. 2015; 31(10):1674–6. https://doi.org/10.1093/bioinformatics/btv033.

    CAS  PubMed  Article  Google Scholar 

  29. Li X, Shi Q, Shao M. On bridging paired-end RNA-seq data. BioRxiv. 2021. https://doi.org/10.1101/2021.02.26.433113.

  30. Brown CT, Moritz D, O’Brien MP, Reidl F, Reiter T, Sullivan BD. Exploring neighborhoods in large metagenome assembly graphs using spacegraphcats reveals hidden sequence diversity. Genome Biol. 2020; 21(1):164. https://doi.org/10.1186/s13059-020-02066-4.

    PubMed  PubMed Central  Article  Google Scholar 

  31. David L, Vicedomini R, Richard H, Carbone A. Targeted domain assembly for fast functional profiling of metagenomic datasets with S3A. Bioinformatics. 2020; 36(13):3975–81. https://doi.org/10.1093/bioinformatics/btaa272.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  32. Schrinner SD, Mari RS, Ebler J, Rautiainen M, Seillier L, Reimer JJ, Usadel B, Marschall T, Klau GW. Haplotype threading: accurate polyploid phasing from long reads. Genome Biol. 2020; 21(1):252. https://doi.org/10.1186/s13059-020-02158-1.

    PubMed  PubMed Central  Article  Google Scholar 

  33. Liu B, Liu Y, Li J, Guo H, Zang T, Wang Y. deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index. Genome Biol. 2019; 20(1):274. https://doi.org/10.1186/s13059-019-1895-9.

    PubMed  PubMed Central  Article  Google Scholar 

  34. Minkin I, Medvedev P. Scalable multiple whole-genome alignment and locally collinear block construction with SibeliaZ. Nat Commun. 2020; 11(1):6327. https://doi.org/10.1038/s41467-020-19777-8.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  35. Minkin I, Medvedev P. Scalable pairwise whole-genome homology mapping of long genomes with BubbZ. IScience. 2020; 23(6):101224. https://doi.org/10.1016/j.isci.2020.101224.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  36. Lopez-Maestre H, Brinza L, Marchet C, Kielbassa J, Bastien S, Boutigny M, Monnin D, Filali AE, Carareto CM, Vieira C, Picard F, Kremer N, Vavre F, Sagot M-F, Lacroix V. SNP calling from RNA-seq data without a reference genome: identification, quantification, differential analysis and impact on the protein sequence. Nucleic Acids Res. 2016; 44(19):148. https://doi.org/10.1093/nar/gkw655.

    Google Scholar 

  37. Sacomoto GA, Kielbassa J, Chikhi R, Uricaru R, Antoniou P, Sagot M-F, Peterlongo P, Lacroix V. KIS SPLICE: de-novo calling alternative splicing events from RNA-seq data. BMC Bioinformatics. 2012; 13(6):5. https://doi.org/10.1186/1471-2105-13-S6-S5.

    Article  Google Scholar 

  38. Dede K, Ohlebusch E. Dynamic construction of pan-genome subgraphs. Open Comput Sci. 2020; 10(1):82–96. https://doi.org/10.1515/comp-2020-0018.

    Article  Google Scholar 

  39. Lees JA, Mai TT, Galardini M, Wheeler NE, Horsfield ST, Parkhill J, Corander J, Ravel J. Improved prediction of bacterial genotype-phenotype associations using interpretable pangenome-spanning regressions. MBio. 2020; 11(4):01344–20. https://doi.org/10.1128/mBio.01344-20.

    Article  Google Scholar 

  40. Wittler R. Alignment- and reference-free phylogenomics with colored de Bruijn graphs. Algoritm Mol Biol. 2020; 15(1):4. https://doi.org/10.1186/s13015-020-00164-3.

    CAS  Article  Google Scholar 

  41. Cleary A, Ramaraj T, Kahanda I, Mudge J, Mumey B. Exploring frequented regions in pan-genomic graphs. IEEE/ACM Trans Comput Biol Bioinforma. 2019; 16(5):1424–35. https://doi.org/10.1109/TCBB.2018.2864564.

    Article  Google Scholar 

  42. Manuweera B, Mudge J, Kahanda I, Mumey B, Ramaraj T, Cleary A. Pangenome-wide association studies with frequented regions. In: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB ’19). New York: Association for Computing Machinery: 2019. p. 627–32. https://doi.org/10.1145/3307339.3343478.

    Google Scholar 

  43. Sheikhizadeh S, Schranz ME, Akdel M, de Ridder D, Smit S. PanTools: representation, storage and exploration of pan-genomic data. Bioinformatics. 2016; 32(17):487–93. https://doi.org/10.1093/bioinformatics/btw455.

    Article  CAS  Google Scholar 

  44. Khan J, Patro R. Cuttlefish: fast, parallel and low-memory compaction of de bruijn graphs from large-scale genome collections. Bioinformatics. 2021; 37(Supplement_1):177–86. https://doi.org/10.1093/bioinformatics/btab309.

    Article  CAS  Google Scholar 

  45. Holley G, Melsted P. Bifrost: highly parallel construction and indexing of colored and compacted de bruijn graphs. Genome Biol. 2020; 21(1):249. https://doi.org/10.1186/s13059-020-02135-8.

    PubMed  PubMed Central  Article  Google Scholar 

  46. Guo H, Fu Y, Gao Y, Li J, Wang Y, Liu B. deGSM: memory scalable construction of large scale de bruijn graph. IEEE/ACM Trans Comput Biol Bioinforma. 2019; Early Access:1–1.

    Article  Google Scholar 

  47. Chikhi R, Limasset A, Medvedev P. Compacting de bruijn graphs from sequencing data quickly and in low memory. Bioinformatics. 2016; 32(12):201–8. https://doi.org/10.1093/bioinformatics/btw279.

    Article  CAS  Google Scholar 

  48. Minkin I, Pham S, Medvedev P. TwoPaCo: an efficient algorithm to build the compacted de bruijn graph from many complete genomes. Bioinformatics. 2016; 33(24):4024–32. https://doi.org/10.1093/bioinformatics/btw609.

    Google Scholar 

  49. Baier U, Beller T, Ohlebusch E. Graphical pan-genome analysis with compressed suffix trees and the Burrows–Wheeler transform. Bioinformatics. 2015; 32(4):497–504. https://doi.org/10.1093/bioinformatics/btv603.

    PubMed  Article  CAS  Google Scholar 

  50. Chikhi R, Limasset A, Jackman S, Simpson JT, Medvedev P. On the representation of de bruijn graphs In: Sharan R, editor. Research in Computational Molecular Biology. Cham: Springer: 2014. p. 35–55.

    Google Scholar 

  51. Marcus S, Lee H, Schatz MC. SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics. 2014; 30(24):3476–83. https://doi.org/10.1093/bioinformatics/btu756.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  52. Hopcroft JE, Motwani R, Ullman JD. Introduction to Automata Theory, Languages, and Computation (3rd Edition). USA: Addison-Wesley Longman Publishing Co., Inc.; 2006.

    Google Scholar 

  53. Marchet C, Kerbiriou M, Limasset A. BLight: efficient exact associative structure for k-mers. Bioinformatics. 2021; 37(18):2858–65. https://doi.org/10.1093/bioinformatics/btab217.

    CAS  Article  Google Scholar 

  54. Pibiri GE. Sparse and skew hashing of k-mers. bioRxiv. 2022. https://doi.org/10.1101/2022.01.15.476199.

  55. Rahman A, Medvedev P. Representation of k-mer sets using spectrum-preserving string sets In: Schwartz R, editor. Research in Computational Molecular Biology. Cham: Springer: 2020. p. 152–168.

    Google Scholar 

  56. Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol. 2021; 22(1):96. https://doi.org/10.1186/s13059-021-02297-z.

    PubMed  PubMed Central  Article  Google Scholar 

  57. Chikhi R, Holub J, Medvedev P. Data structures to represent a set of k-long DNA sequences. ACM Comput Surv. 2021;54(1). https://doi.org/10.1145/3445967.

  58. Zook JM, Catoe D, McDaniel J, Vang L, Spies N, Sidow A, Weng Z, Liu Y, Mason CE, Alexander N, Henaff E, McIntyre ABR, Chandramohan D, Chen F, Jaeger E, Moshrefi A, Pham K, Stedman W, Liang T, Saghbini M, Dzakula Z, Hastie A, Cao H, Deikus G, Schadt E, Sebra R, Bashir A, Truty RM, Chang CC, Gulbahce N, Zhao K, Ghosh S, Hyland F, Fu Y, Chaisson M, Xiao C, Trow J, Sherry ST, Zaranek AW, Ball M, Bobe J, Estep P, Church GM, Marks P, Kyriazopoulou-Panagiotopoulou S, Zheng GXY, Schnall-Levin M, Ordonez HS, Mudivarti PA, Giorda K, Sheng Y, Rypdal KB, Salit M. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016; 3(1):160025. https://doi.org/10.1038/sdata.2016.25.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  59. Lappalainen T, Sammeth M, Friedländer MR, ‘t Hoen PAC, Monlong J, Rivas MA, Gonzàlez-Porta M, Kurbatova N, Griebel T, Ferreira PG, Barann M, Wieland T, Greger L, van Iterson M, Almlöf J, Ribeca P, Pulyakhina I, Esser D, Giger T, Tikhonov A, Sultan M, Bertier G, MacArthur DG, Lek M, Lizano E, Buermans HPJ, Padioleau I, Schwarzmayr T, Karlberg O, Ongen H, Kilpinen H, Beltran S, Gut M, Kahlem K, Amstislavskiy V, Stegle O, Pirinen M, Montgomery SB, Donnelly P, McCarthy MI, Flicek P, Strom TM, Lehrach H, Schreiber S, Sudbrak R, Carracedo Á., Antonarakis SE, Häsler R, Syvänen A-C, van Ommen G-J, Brazma A, Meitinger T, Rosenstiel P, Guigó R, Gut IG, Estivill X, Dermitzakis ET, Consortium TG. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013; 501(7468):506–11. https://doi.org/10.1038/nature12531.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  60. Mas-Lloret J, Obón-Santacana M, Ibáñez-Sanz G, Guinó E, Pato ML, Rodriguez-Moranta F, Mata A, García-Rodríguez A, Moreno V, Pimenoff VN. Gut microbiome diversity detected by high-coverage 16S and shotgun sequencing of paired stool and colon sample. Scientific Data. 2020; 7(1):92. https://doi.org/10.1038/s41597-020-0427-5.

    PubMed  PubMed Central  Article  Google Scholar 

  61. Howe AC, Jansson JK, Malfatti SA, Tringe SG, Tiedje JM, Brown CT. Tackling soil diversity with the assembly of large, complex metagenomes. Proc Natl Acad Sci. 2014; 111(13):4904–9. https://doi.org/10.1073/pnas.1402564111.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  62. Birol I, Raymond A, Jackman SD, Pleasance S, Coope R, Taylor GA, Yuen MMS, Keeling CI, Brand D, Vandervalk BP, Kirk H, Pandoh P, Moore RA, Zhao Y, Mungall AJ, Jaquish B, Yanchuk A, Ritland C, Boyle B, Bousquet J, Ritland K, MacKay J, Bohlmann J, Jones SJM. Assembling the 20 gb white spruce (picea glauca) genome from whole-genome shotgun sequencing data. Bioinformatics. 2013; 29(12):1492–7. https://doi.org/10.1093/bioinformatics/btt178.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  63. Bloom BH. Space/Time trade-offs in hash coding with allowable errors. Commun ACM. 1970; 13(7):422–6. https://doi.org/10.1145/362686.362692.

    Article  Google Scholar 

  64. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011; 27(6):764–70. https://doi.org/10.1093/bioinformatics/btr011.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  65. Zhao L, Xie J, Bai L, Chen W, Wang M, Zhang Z, Wang Y, Zhao Z, Li J. Mining statistically-solid k-mers for accurate NGS error correction. BMC Genomics. 2018; 19(10):912. https://doi.org/10.1186/s12864-018-5272-y.

    PubMed  PubMed Central  Article  Google Scholar 

  66. Hiseni P, Rudi K, Wilson RC, Hegge FT, Snipen L. HumGut: a comprehensive human gut prokaryotic genomes collection filtered by metagenome data. Microbiome. 2021; 9(1):165. https://doi.org/10.1186/s40168-021-01114-w.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  67. Blackwell GA, Hunt M, Malone KM, Lima L, Horesh G, Alako BTF, Thomson NR, Iqbal Z. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLOS Biology. 2021; 19(11):1–16. https://doi.org/10.1371/journal.pbio.3001421.

    Article  CAS  Google Scholar 

  68. Yoshimura J, Ichikawa K, Shoura MJ, Artiles KL, Gabdank I, Wahba L, Smith CL, Edgley ML, Rougvie AE, Fire AZ, Morishita S, Schwarz EM. Recompleting the caenorhabditis elegans genome. Genome Res. 2019; 29(6):1009–22. https://doi.org/10.1101/gr.244830.118.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  69. Li H. Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arXiv. 2013. https://doi.org/10.48550/arXiv.1303.3997.

  70. Chikhi R, Medvedev P. Informed and automated k-mer size selection for genome assembly. Bioinformatics. 2013; 30(1):31–7. https://doi.org/10.1093/bioinformatics/btt310.

    PubMed  Article  CAS  Google Scholar 

  71. Lee S, Min H, Yoon S. Will solid-state drives accelerate your bioinformatics? in-depth profiling, performance analysis and beyond. Brief Bioinforma. 2015; 17(4):713–27. https://doi.org/10.1093/bib/bbv073.

    Article  Google Scholar 

  72. Kokot M, Długosz M, Deorowicz S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics. 2017; 33(17):2759–61. https://doi.org/10.1093/bioinformatics/btx304.

    CAS  PubMed  Article  Google Scholar 

  73. Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004; 20(18):3363–9. https://doi.org/10.1093/bioinformatics/bth408.

    CAS  PubMed  Article  Google Scholar 

  74. Burrows M, Wheeler DJ. A block-sorting lossless data compression algorithm. Technical report, Systems Research Center, Digital Equipment Corp. 1994.

  75. Gross J, Yellen J. Graph Theory and Its Applications. USA: CRC Press, Inc.; 1999, p. 264.

    Google Scholar 

  76. Kleinberg J, Tardos E. Graphs. In: Algorithm Design. USA: Addison-Wesley Longman Publishing Co., Inc.: 2005.

    Google Scholar 

  77. Ma X, Shao Y, Tian L, Flasch DA, Mulder HL, Edmonson MN, Liu Y, Chen X, Newman S, Nakitandwe J, Li Y, Li B, Shen S, Wang Z, Shurtleff S, Robison LL, Levy S, Easton J, Zhang J. Analysis of error profiles in deep next-generation sequencing data. Genome Biol. 2019; 20(1):50. https://doi.org/10.1186/s13059-019-1659-6.

    PubMed  PubMed Central  Article  Google Scholar 

  78. Kokot M, Deorowicz S, Debudaj-Grabysz A. Sorting data on ultra-large scale with RADULS. In: Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. Cham: Springer: 2017. p. 235–45.

    Google Scholar 

  79. Limasset A, Rizk G, Chikhi R, Peterlongo P. Fast and scalable minimal perfect hashing for massive key sets. In: 16th International Symposium on Experimental Algorithms (SEA 2017) (Leibniz International Proceedings in Informatics (LIPIcs)). Dagstuhl: Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik: 2017. p. 25–12516. https://doi.org/10.4230/LIPIcs.SEA.2017.25.

    Google Scholar 

  80. Fredman ML, Komlós J. On the size of separating systems and families of perfect hash functions. SIAM J Algebraic Discret Methods. 1984; 5(1):61–68. https://doi.org/10.1137/0605009.

    Article  Google Scholar 

  81. Marçais G. Compact vector: Bit packed vector of integral values. GitHub. 2020. https://github.com/gmarcais/compact_vector. Accessed 18 June 2020.

  82. Khan J, Patro R. Cuttlefish: Building the compacted de Bruijn graph efficiently from references or reads. GitHub. 2022. https://github.com/COMBINE-lab/cuttlefish. Accessed 24 July 2022.

  83. Khan J, Kokot M, Deorowicz S, Patro R. Software version used in the paper: Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Zenodo. 2022. https://doi.org/10.5281/zenodo.6897066. Accessed 24 July 2022.

  84. Mohamadi H, Khan H, Birol I. ntCard: a streaming algorithm for cardinality estimation in genomics data. Bioinformatics. 2017; 33(9):1324–30. https://doi.org/10.1093/bioinformatics/btw832.

    CAS  PubMed  PubMed Central  Google Scholar 

  85. Rizk G, Lavenier D, Chikhi R. DSK: k-mer counting with very low memory usage. Bioinformatics. 2013; 29(5):652–53. https://doi.org/10.1093/bioinformatics/btt020.

    CAS  PubMed  Article  Google Scholar 

  86. Patro R, Mount SM, Kingsford C. Sailfish enables alignment-free isoform quantification from rna-seq reads using lightweight algorithms. Nat Biotechnol. 2014; 32(5):462–4. https://doi.org/10.1038/nbt.2862.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  87. Pandey P, Almodaresi F, Bender MA, Ferdman M, Johnson R, Patro R. Mantis: A fast, small, and exact large-scale sequence-search index. Cell Syst. 2018; 7(2):201–2074. https://doi.org/10.1016/j.cels.2018.05.021.

    CAS  PubMed  Article  Google Scholar 

  88. Marchet C, Iqbal Z, Gautheret D, Salson M, Chikhi R. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics. 2020; 36(Supplement_1):177–85. https://doi.org/10.1093/bioinformatics/btaa487.

    Article  CAS  Google Scholar 

Download references

Acknowledgements

Not applicable.

Peer review information

Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Review history

The review history is available as Additional file 2.

Funding

This work has been supported by the US National Institutes of Health (R01 HG009937) (RP), US National Science Foundation (CCF-1750472, and CNS-1763680) (RP), Poland National Science Centre (project DEC-2019/33/B/ST6/02040) (SD), and Faculty of Automatic Control, Electronics and Computer Science at Silesian University of Technology (statutory research project 02/080/BKM_21/0020) (MK). The funders had no role in the design of the method, data analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and Affiliations

Authors

Contributions

Authors’ contributions

JK, MK, SD, and RP designed the method. JK, MK, and RP implemented the method. JK designed and conducted the experiments. All authors contributed to and approved the final manuscript.

Authors’ Twitter handles

Twitter handles: @scarecrow00007 (Jamshed Khan); @marekkoki (Marek Kokot); @sdeorowicz (Sebastian Deorowicz); @nomad421 (Rob Patro)

Corresponding authors

Correspondence to Marek Kokot or Rob Patro.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

RP is a co-founder of Ocean Genomics, inc. The other authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

13059_2022_2743_MOESM1_ESM.pdf

Additional file 1. Supplementary material for “Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2”. References [84,85,86,87,88] are cited in order in the supplementary material.

Additional file 2. Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Khan, J., Kokot, M., Deorowicz, S. et al. Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome Biol 23, 190 (2022). https://doi.org/10.1186/s13059-022-02743-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13059-022-02743-6

Keywords

  • de Bruijn graph
  • Compacted de Bruijn graph
  • Data structures
  • High-throughput sequencing
  • Unitig
  • Path cover