 Software
 Open Access
 Published:
Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs
Genome Biology volume 21, Article number: 249 (2020)
Abstract
Memory consumption of de Bruijn graphs is often prohibitive. Most de Bruijn graphbased assemblers reduce the complexity by compacting paths into single vertices, but this is challenging as it requires the uncompacted de Bruijn graph to be available in memory. We present a parallel and memoryefficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted graph. Bifrost features a broad range of functions, such as indexing, editing, and querying the graph, and includes a graph coloring method that maps each kmer of the graph to the genomes it occurs in.
Availability
Introduction
The de Bruijn graph is an abstract data structure with a rich history in computational biology as a tool for genome assembly [1, 2]. With the advent of high throughput sequencing, the Overlap Layout Consensus (OLC) framework frequently used to assemble Sanger sequencing data [3] was progressively replaced in favor of de Bruijn graphbased methods. Since 2008, a wide range of genome assemblers based on the de Bruijn graph have been released [4–10]. Although single molecule sequencing technologies [11, 12] have reintroduced the OLC framework as the method of choice to assemble long and erroneous reads [13–16], de Bruijn graphbased methods are nonetheless used to assemble and correct long reads [17, 18]. Overall, de Bruijn graphs have found widespread use for a variety of problems such as de novo transcriptome assembly [19], variant calling [20], short read compression [21], short read correction [22], long read correction [17], and short read mapping [23] to name a few. The colored de Bruijn graph is a variant of the de Bruijn graph which keeps track of the source of each vertex in the graph [24]. The initial application was for assembly and genotyping, but it has also found use in pangenomics [25], variant calling [26], and transcript quantification methods [27].
Despite serving as a building block for many methods in computational biology, the de Bruijn graph adoption is hindered by two factors. First, the memory usage and computational requirements for building de Bruijn graphs from raw sequencing reads are considerable compared to alignment to a reference genome, while only a handful of tools have focused on de Bruijn graph compaction [28–33]. Second, de Bruijn graph construction usually requires tight integration with the code. In the best case, software libraries for building and manipulating de Bruijn graphs are used [34, 35], but in most cases, data structures to index the de Bruijn graph are reimplemented. Those downsides are intensified in the colored de Bruijn graph for which the memory consumption of colors rapidly overtakes the vertices and edges memory usage [36]. For this reason, a lot of attention has been given to succinct data structures for building the colored de Bruijn graph [30, 31, 36–41] and data structures for multiset kmer indexing [42–47]. In the following, we focus on tools for constructing compacted de Bruijn graphs (cdBGs) with or without colors. We refer the reader to the survey of [48] for more details about kmerbased data structures as well as the reviews of [25] and [49] for data structures to index collections of kmer sets.
TwoPaCo [28] is a highly parallel construction tool for the cdBG. It builds progressively the cdBG from assembled genomes by identifying junctionkmers which are either branching or located at the extremities of unitigs. A Bloom filter is first used to approximate the graph and a hash table is subsequently employed to remove false positives. The approach taken by BCALM2 [29] is orthogonal to the one of TwoPaCo: rather than identifying junction kmers, BCALM2 incrementally assembles kmers into unitigs until junction kmers are reached. Kmers are partitioned according to their minimizers, and partitions are compacted independently in parallel. A final step glues the compaction of different partitions together. Note that BCALM2 can process assembled genomes as well as short read data. deGSM [50] performs an external sorting of the kmers from the input sequences and then constructs a BurrowsWheeler transform (BWT) [51] of the unitigs from which the final graph is extracted. SplitMEM [30] uses the suffix tree [52] to construct a ccdBG. Unitigs of the graph are derived from the set of Maximum Exact Matches in the input genomes, while colors are implicitly encoded in the suffix tree. SplitMEM is not adapted to short read data input and splits the unitigs to ensure all kmers of each unitig share the same set of colors. Baier et al. [31] provided two algorithms improving SplitMEM with a lower time complexity using a Compressed Suffix Tree and the BWT. PanTools [33] creates first an uncompacted kmer index from which are derived unitigs. By iterating over the input assembled genomes, kmers that have not been visited yet are extended to form unitigs, possibly leading to the merging and splitting of previously created unitigs. The graph index is maintained in a database providing edit operations such as updating the graph with additional data. PanTools was specifically designed for pangenomic applications with assembled genomes in input and allows gene annotations in the graph.
In this paper, we present Bifrost, a software for efficiently constructing, indexing, and querying the colored and compacted de Bruijn graph (ccdBGs), both in terms of runtime and memory usage. The data structures and algorithms implemented in Bifrost are specifically tailored for fast and lightweight construction, querying, and dynamic manipulation of compacted de Bruijn graphs, both regular and colored. The software is designed to take advantage of multiple cores and modern processors instruction sets (SIMD operations). Bifrost is also available as a C++11 software library with minimal external dependencies and allows developers to build on top of an efficient de Bruijn graph engine by using the Bifrost API. Bifrost has been successfully employed for alignment and referencefree phylogenomics [53] as well as bacterial genomes querying of genes linked to pathogenicity islands and fluoroquinolone resistance [54].
Results
We benchmarked Bifrost against stateoftheart software on publicly available dataset. We focus on three representative use cases: cdBG construction, cdBG querying, and cdBG coloring. All experiments were run of a server with an 16core Intel Xeon E52650 processor and 256G of RAM. Running time was measured as wall clock time using the time command, and peak memory was measured by ps.
cdBG construction
We constructed the cdBG of the NA12878 human genome short read dataset from the Genome In A Bottle consortium [55]. The dataset is downsampled from 300fold to 30fold coverage to reflect normal sequencing depth, resulting in about 696 million 150bp pairedend sequences.
We compared Bifrost to BCALM2 because of its low computational requirements and versatility as it can build a cdBG from short read data or assembled genomes. BCALM2 can be configured for different memory usage where a lower memory usage results in a longer running time. In our experiments, it was configured with the maximum memory usage of Bifrost for each kmer size tested. Additionally, BCALM2 uses by default up to 5 GB of disk space while Bifrost does not use any disk except for the final output. Results are shown in Table 1, and summaries of the unitig N50, kmer cardinality, and unitig cardinality in each graph built are reported in Table 2.
Bifrost was consistently faster than BCALM2, up to a factor 15.32, on all kmer sizes and number of threads tested. For increasing kmer sizes, Bifrost construction time kept decreasing while BCALM2 construction time increased. However, BCALM2 used up to 24.3% less memory than Bifrost. Memory usage for a fixed kmer size was fairly constant for both tools across different number of threads, except for BCALM2 using k=127.
cdBG querying
We compared Bifrost to two tools for querying dBGs based on the kmer composition of the queries, namely Blight [56] and Mantis [45]. The dataset used for the graph index was the NA12878 dataset from the Genome In A Bottle consortium described in the “cdBG construction” section. For querying, Bifrost takes as input the graph it constructed and builds an index for querying kmers. Mantis requires processing the unitigs of the graph with Squeakr [57] to produce a compressed table of all kmers present. Mantis then builds an index directly from the output of Squeakr for querying. Blight takes as input a graph created by BCALM2. All indexes were created using k=31 and 16 threads.
To query the graph, we used 30 million singleend reads from the NA12878 short read dataset that was used to construct the reference graph.
Note that both Bifrost and Mantis return query hits for every query while Blight only returns the total number of kmers found in the graph from all input queries. Furthermore, Mantis and Blight cannot be configured to return the presence or absence of a query based on different kmer inclusion rates. Hence, Bifrost was queried initially with parameter e=1.0 to indicate that an input query is returned present in the graph only if all of its composing kmers are present. This is done to ensure that all methods query the graph for all kmers in the read. Results are shown in Table 3. Finally, Bifrost enables graph querying based on kmers with up to one substitution or indel. Table 4 shows the performance of Bifrost with different kmer inclusion rates, where e=θ requires at least the presence of θ fraction of the kmers in the graph, both using exact or inexact kmers. Querying for inexact kmers, where an edit distance of 1 is allowed, increases the number of hits but requires more running time. However, even in the case where all kmers are queried, the inexact version is still competitive with Blight and Mantis which only perform exact kmer queries. Overall, the results show that Bifrost is the fastest at querying, while using 26.8 GB of memory, whereas Blight uses less memory at the expense of speed. The low memory usage of Blight is partially explained by the fact that Blight maintains its index in main memory but stores subsequences of the graph on disk.
cdBG coloring
We constructed ccdBGs with k=31 for a maximum of 117,913 assembled genomes of Salmonella. The input represents all publicly available Salmonella assemblies from the database Enterobase [58] as of August 2018. This is a 7.3 × increase in the number of colors compared to the work of [41] who reported the ccdBG construction for 16,000 Salmonella strains. We compared Bifrost to VARImerge [41] as both tools can construct the colored de Bruijn graph and update it without reconstructing the graph entirely. The main differences between the two tools is that VARImerge is mainly a diskbased method that produces a noncompacted colored de Bruijn graph. We only benchmarked VARImerge as it is currently the stateoftheart for colored de Bruijn graph construction. A comparison of VARImerge to other colored de Bruijn graph construction tools is given in [41]. Results are given in Table 5 for a variable number of strains. Note that the reported VARImerge time includes the time spent by KMC2 [59] to compute the kmers required in input of VARImerge.
In [41], the authors process 16,000 strains in batches of 4000, merging the batches to produce a colored de Bruijn graph of all strains. This required 254 GB of memory and 2.34 TB of external disk, with a total running time of 69 h. In comparison, Bifrost processed 117,913 strains using about 103 GB of memory, no external disk usage and a total running time of 93.35 h. While the running time is not directly comparable across different machines due to different processors, this is in line with Bifrost being about eight times faster than VARImerge. The graph built from the 117,913 strains contains 413,658,482 kmers: 39.19% of the kmers have only one color (singleton), less than 0.01% of the kmers have all the colors (core), and 60.80% of the kmers have more than one but not all colors (dispensable). Among the 26,324,369 unitigs, 98.72 % have a single set of colors shared by all their kmers.
Discussion
The de Bruijn graph has been widely used as a fundamental data structure in assemblers, but the memory requirements and focus on speed mean that the implementation has been tightly integrated into the project. Bifrost allows for the integration of the de Bruijn graph as a data structure into projects that work with short read sequencing datasets or assemblies of several genomes. Reusing assemblers can often lead to suboptimal results, e.g., genome assemblers often have coverage assumptions that are not valid for transcriptome assembly. By making minimal assumptions about the input, Bifrost enables researchers to extend our work rather than having to reimplement it.
Conclusion
We present Bifrost, a method for constructing, indexing, and querying compacted de Bruijn graphs, both regular and colored, with minimal computational requirements. Bifrost is competitive with the stateoftheart de Bruijn graph construction method BCALM2 and the unitig indexing tool Blight with the advantage that Bifrost is dynamic. For colored de Bruijn graphs, Bifrost is about eight times faster than VARImerge and uses about 20 times less memory with no external disk. The query capabilities of Bifrost are for both identifying colors for a given kmer and navigating the de Bruijn graph. The software was developed with the intention of being usable as a tool or a library wherever large de Bruijn graphs are needed with minimal external dependencies.
Methods
“Definitions” section details the concepts and data structures that will be used throughout this paper. “Approximating the de Bruijn graph” section describes how an approximation of the uncompacted de Bruijn graph is built from a set of sequencing reads. “Constructing the compacted de Bruijn graph” section shows how the approximate compacted de Bruijn graph is built from its uncompacted counterpart and subsequently converted to an exact compacted de Bruijn graph. “Coloring” section presents how the graph coloring is built efficiently on top of the compacted de Bruijn graph.
Definitions
A string s is a sequence of symbols drawn from an alphabet \(\mathcal {A}\). The length of s is denoted by s. A substring of s is a string occurring in s: it has a starting position i and a length l and is denoted by s(i,l). A substring of length l is also denoted an lmer. In the following, we assume \(\mathcal {A}\) is the DNA alphabet \(\mathcal {A} = \{A, C, G, T\}\) for which symbols have complements: (A,T) and (C,G) are the complementing pairs. The reversecomplemented string \(\overline {s}\) is the reverse sequence of complemented symbols in s. The canonical string \(\hat {s}\) is the lexicographically smallest of s and its reversecomplement \(\overline {s}\). The minimizer [60, 61] of an lmer x is a gmer y occurring in x such that g<l and y is the lexicographically smallest of all the gmers in x. The lexicographical order can be cumbersome to use since polyA gmers naturally occur in sequencing data and is often replaced by a random order. The simplest way to obtain a random order is to compute a hashvalue for each gmer in x and select the gmer with the smallest hashvalue as the minimizer. In this work, we will only consider minimizers generated by random orderings.
A de Bruijn graph (dBG) is a directed graph G=(V,E) in which each vertex v∈V represents a kmer. A directed edge e∈E from vertex v to vertex v^{′} representing kmers x and x^{′}, respectively, exists if and only if x(2,k−1)=x^{′}(1,k−1). Each kmer x has \(\mathcal {A}\) possible successors x(2,k−1)⊙a and \(\mathcal {A}\) possible predecessors a⊙x(1,k−1) in G with \(a\in \mathcal {A}\) and ⊙ as the concatenation operator. Note that in the original combinatorial definition of the dBG, all possible kmers for an alphabet \(\mathcal {A}\) are present in the graph, whereas in computational biology, the definition is restricted to a subset of the de Bruijn graph representing the kmers in the input. A path in the graph is a sequence of distinct and connected vertices p=(v_{1},...,v_{m}). We say that the path p is nonbranching if all its vertices have an in and outdegree of one with exception of the head vertex v_{1} which can have more than one incoming edge and the tail vertex v_{m} which can have more than one outgoing edge. A nonbranching path is maximal if it cannot be extended in the graph without being branching. A compacted de Bruijn graph (cdBG) merges all maximal nonbranching paths of η vertices from the dBG into single vertices, called unitigs, representing words of length k+η−1. Minimal examples of dBG and cdBG are provided in Fig. 1a and b respectively. A colored de Bruijn graph is a graph G=(V,E,C) in which (V,E) is a dBG and C is a set of colors such that each vertex v∈V maps to a subset of C; we extend the definition of a cdBG to a colored compacted de Bruijn Graph (ccdBG) to be a graph G=(V,E,C), where (V,E) is a cdBG, so the vertices represent unitigs, and each kmer of a unitig maps to a subset of C.
Introduced by [62], the Bloom filter (BF) is a space and timeefficient data structure that records the approximate membership of elements in a set. The BF is represented as a bitmap B of m bits initialized with 0s, coupled with a set of f hash functions h_{1},...,h_{f}. Inserting and querying an element e into B is performed with the functions
and
respectively, in which \(\bigwedge \) is the logical conjunction operator. Those functions require \(\mathcal {O}(1)\) time. The function MayContain may report false positives when querying for elements which were never inserted but are present in B as a result of independent insertions. Given n elements to insert, the optimal number of hash functions to use [63] is \(f = \frac {m}{n}\ln (2)\), for an approximate false positive rate of
Hence, the BF trades off memory usage and time complexity with a decreased false positive rate.
In order to accelerate BFs, [63] demonstrated that two hash functions combined in a double hashing technique can be applied in order to simulate more than two hash functions and obtain similar hashing performance. One main drawback of BFs is their poor data locality as bits corresponding to one element are scattered over B, resulting in several CPU cache misses when inserting and querying. This issue was addressed in [64], which presented the Blocked Bloom Filter (BBF), an array of smaller BFs individually fitting into one or multiple cache lines. To insert or lookup an element, a supplementary hash function is used to determine which BF to load. While BBFs are fast, their false positive ratios are usually higher than regular BFs due to the unbalanced load of each BF in the array.
As minimizers are used extensively throughout Bifrost, we use an efficient rolling hash function based on the work of [65] to select a gmer as the minimizer within a single kmer. Since overlapping kmers are likely to share minimizers, we use an ascending minima approach [66] to recompute minimizers with amortized O(1) costs, so that iterating over minimizers of adjacent kmers in a sequence is linear in the length of the sequence. Another optimization is to restrict the computation of minimizers to a subset of gmers of a kmer, namely, we exclude the first and last gmer as a candidate for being a minimizer. This ensures that for a given kmer, all of its forward, respectively backward, adjacent kmers necessarily share the same minimizer. While it is likely that a kmer x and its neighbor x^{′} share a minimizer, this neighbor hashing trick [38] guarantees that when searching all forward, respectively backward, neighbors of x, they will all have the same minimizer and will be stored within the same block of a BBF, thus minimizing cache misses.
Approximating the de Bruijn graph
The kmers extracted from the reads will be inserted into two BBFs: BBF_{1} will contain all kmers occurring at least once in the input read sets while BBF_{2} will contain all kmers occurring twice or more often. This separation allows us to filter out unique kmers which are likely to be sequencing errors [67]. Algorithm 1 starts by iterating over the reads and extracts all the canonical kmers. BBF_{1} is queried for the presence of each such kmer, and kmers already present in BBF_{1} are inserted into BBF_{2}. Finally, BBF_{1} is discarded as the cdBG will be built from the kmers of BBF_{2}.
In order to accelerate the insertions into the BBFs, the minimizer hashvalue of each kmer is used to determine the BBF block in which the kmer is inserted. This guarantees that overlapping kmers sharing the same minimizer position within a read are inserted into the same BBF block, thus improving the cache efficiency of BBFs. Furthermore, the neighbor hashing of the minimizers guarantees that all predecessors and successors of a kmer are hashing to the same block, thus improving graph traversal for the exact cdBG construction step. Finally, the BBFs in Bifrost use 2choice hashing [68] to balance the number of insertions per block and reduce the number of false positives. Instead of selecting a single BBF block when inserting a kmer, two blocks are selected. If none of the two blocks already contains the kmer, it is inserted into the block which has the fewest number of bits set. To enable parallel insertions, each BBF block is equipped with a spinlock to avoid multiple threads inserting at the same time within the same block. Algorithm 2 refines the insertion function introduced in the “Definitions” section to enable 2choice hashing and spinlocks usage with BBFs. Bifrost can make use of modern processors instruction sets to query simultaneously up to 16 bits within a block using AVX instructions.
Constructing the compacted de Bruijn graph
The following section describes the data structure indexing the unitigs. The “Unitig extraction” section details the unitig extraction procedure from the BBF and the insertion of unitigs into the cdBG data structure.
Data structure
The cdBG data structure D=(U,M) is composed of a unitig array U and a hash table of minimizers M. A unitig u is first inserted into U and gets a unique identifier id_{u}. Unitig u is then decomposed into its set of constituent kmers from which minimizers are extracted. Each minimizer is identified by a position p_{m} in u. While there can be as many minimizer positions as there are kmers in the unitig, it is likely that multiple overlapping kmers share the same minimizer position. The canonical gmers corresponding to the minimizers are inserted into M and associated with their position p_{m} in u and the identifier id_{u}. Note that a minimizer might have multiple occurrences, either within a unitig or in different unitigs of the graph. The cdBG data structure D is illustrated in Fig. 2. Algorithm 3 details the insertion of a unitig u in the cdBG data structure. Note that removing a unitig from the graph can be done in a reversedfashion to Algorithm 3: The tuples associated with unitig u are removed from M and unitig u is removed from U.
Lookingup a kmer x in the cdBG data structure is similar to inserting a unitig. The canonical gmer corresponding to the minimizer of x is extracted and used to query M. If the gmer is not in M, x does not occur in a unitig of the cdBG. However, if the gmer is present, the identifiers of the unitigs containing the gmer and the gmer positions within those unitigs are returned. Kmer x and its reversecomplement \(\overline {x}\) are then anchored in those unitigs at the given minimizer positions and compared. If the comparison is positive, a tuple with the unitig identifier and the kmer position in the unitig is returned. Algorithm 4 shows how to lookup D for a kmer.
Unitig extraction
The BBF returned by Algorithm 1 represents an approximation of the dBG: It contains the true positive kmers, namely all the kmers present in the unitigs of the cdBG, but also false positive kmers, which do not belong to the cdBG. The false positive kmers are either artifacts of BBF_{2} or single occurrence kmers that should have been filtered out by Algorithm 1 but were inserted into BBF_{2} as a result of their false occurrences in BBF_{1}. Although BBFs are efficient data structures, they do not allow to iterate over the contents. To get around this limitation, we iterate over the original set of reads and query BBF_{2} to identify kmers that are present.
Given a kmer x, Algorithm 5 extracts from the BBF the unitig from which x is a substring, conditioned upon the presence of x in the BBF. Kmer x is extended forward, respectively backward, by reconstructing iteratively the prefix, respectively suffix, of the unitig using function Extend. Note that a backward extension is performed by extending forward from the reversecomplement of x and the extracted suffix is reversecomplemented to obtain the unitig prefix. Forward extensions are made with function ExtendForward which iteratively concatenate the last character from the next kmer in the extension until no more kmer is found or the extracted kmer creates a cycle. Finally, kmer x is extended with x^{′} using function ExtendKmer if the two kmers belong to the same maximal nonbranching path, i.e, if x^{′} is the only successor of x in the BBF and x is the only predecessor of x^{′} in the BBF,
Given the read set, the BBF containing the filtered kmers, and an empty cdBG data structure, Algorithm 6 extracts the unitigs from the BBF and inserts them into the cdBG data structure. The algorithm iterates over the kmers of the reads and queries the BBF for their presence. A missing kmer in the BBF indicates the kmer was filtered out by Algorithm 1 and will not be part of a unitig, in which case the next kmer in the read is queried. However, in case of the kmer presence in the BBF, the cdBG is searched for the unitig containing this kmer using Algorithm 4. If the kmer is missing from the unitigs present in the cdBG data structure, it means its unitig has not been extracted yet from the BBF. The extraction using Algorithm 5 takes place, and the extracted unitig is inserted into the cdBG data structure with Algorithm 3.
Eliminating the false positive kmers
The cdBG constructed by Algorithm 6 is not exact as it contains false positive kmers of BBF_{2}. Those false positive kmers create two types of errors in the graph:

False connection: A false positive kmer connects a unitig with no successors to a unitig with no predecessors. Hence, one unitig is extracted from the BBF instead of two.

False branching: A false positive kmer connects as a successor, respectively predecessor, to a true positive kmer which already has a successor, respectively predecessor. Hence, three unitigs are extracted from the BBF instead of one.
An example of a cdBG containing the two types of errors is illustrated in Fig. 3: Kmer “CCG” creates a false branching and “ACT” creates a false connection.
In order to distinguish false positive from true positive kmers, a counter is maintained on each kmer of the unitigs and Algorithm 6 is modified to increment the counters of the kmers occurring in the reads. Hence, false positive kmers with no or one single occurrence are deleted from the graph. In the case of a false connection kmer, deleting the kmer splits a unitig. In case of a false branching, deleting the kmer joins one or multiple unitigs.
Ghost kmers
The false positive rate of the BBF will affect the length of the unitigs extracted by Algorithm 5. Consider a unitig of length k+η−1 in the true cdBG, consisting of ηkmers. For each internal kmer, the algorithm makes 8 queries to the BBF, two of which will return true and 6 of which should return false. If the BBF has a false positive rate of p, the algorithm will advance to the next kmer with probability (1−p)^{6}≈1−6p and stop prematurely with probability ≈6p. The number of kmers in the extracted unitig will then be limited by η on one hand and a geometric distribution with probability 6p, whose expected value is \(\frac {1}{6p}\). When p=10^{−3}, this would lead to an average unitig length of 167. While these errors are fixed with Algorithm 7, this leads to an increased memory usage. One way to increase the length would be to use more memory in the BBF which would reduce the false positive rate. However, we observe that the most likely configuration is that a single false positive kmer x^{′}, adjacent to a real kmer x in the unitig, causes a premature halt to the extraction of the true unitig. When x^{′} has no other neighbor in the BBF except for x, we call it a ghost kmer, insert it into a hash table to keep track of it in case we observe it later but do not stop the extraction of the unitig. In the rare case that x^{′} turns out to belong to the true cdBG, we identify the unitig containing x^{′} and fix the mistake. The probability that we halt can now be approximated as 42p^{2}, since this would require two adjacent false positive kmers to occur in the BBF. The use of ghost kmers greatly reduces fragmentation which improves memory usage and running time.
Recurrent minimizers
Even in the case of a minimizer random ordering as described in the “Definitions” section, some minimizers are expected to occur more often in unitigs than others, due to indels occurring in homopolymer and tandem repeat sequences. Those minimizers are likely to increase the running time as their lists of tuples in the minimizer hash table M will be much longer than for the other minimizers. We define a minimizer as recurrent if it occurs t times or more in the unitigs of the cdBG. In order to limit the impact of recurrent minimizers on the graph construction, lists of tuples in M have a maximum length t. When a kmer x and its corresponding minimizer y must be inserted into the cdBG data structure, the length of the list associated with y in M is verified first. If the length is greater or equals to t, y is a recurrent minimizer. In such case, a nonrecurrent minimizer y^{′}>y is extracted from x and inserted into M. If x does not contain a nonrecurrent minimizer y^{′}, the recurrent minimizer y is inserted into M instead. Whenever kmer x is searched, the list of tuples associated with its minimizer y is traversed and x is anchored on the instances of y in the unitigs of the graph until a match is found, as described in Algorithm 4. However, if no match is found for x and the list of tuples associated with y contains t or more tuples, the nonrecurrent minimizer y^{′} is extracted from x and the search continues using minimizer y^{′}.
Coloring
We denote as D^{′} the data structure of a ccdBG: It is composed of a unitig array U, a minimizer hash table M, an array O of color containers, an array H of hash functions, and a hash table K of kmers.
Container representation
In Bifrost, a color is represented by an integer from 1 to C. A unitig u composed of η=u−k+1kmers is associated with a binary matrix of size η×C: rows represent the different kmer positions in u and columns represent the colors from C. A bit set at row 1≤i≤η and column 1≤j≤C indicates that kmer u(i,k) occurs in dataset j. In order to limit the memory usage of colors, multiple compressed index is used to represent these binary matrices depending on their sparsity:

A 64bit word that can be either a tuple 〈positioni,colorj〉 or a binary matrix of size η×C≤62 (2 bits are reserved for the metadata)

A compressed bitmap adapted from a Roaring bitmap container [69]. This compressed bitmap stores up to 65488 tuples 〈positioni,colorj〉 and uses a maximum of 8 KB of memory. This container has 3 representations of the tuples it indexes: bit vector, sorted list of tuples, and runlength encoded list of sorted tuples. Compared to a Roaring bitmap, this compressed bitmap uses less memory for its metadata and incurs fewer cache misses to access the tuples.

A Roaring bitmap [69] to store more than 65488 tuples. Roaring bitmaps are SIMD accelerated and propose numerous functions to manipulate bitmaps such as set intersection and union.
Those representations have a logarithmic worstcase time lookup and insertion.
Container indexing
Color containers can become substantially large, and in order to avoid costly data transfer operations when the ccdBG data structure D^{′} is modified, color containers are not associated directly to unitigs in D^{′}. Instead, a solution derived from the MPHF (Minimal Perfect Hash Function) library BBHash [70] is used to link unitigs of array U to color containers of array O. The benefit of such a method is that operations which affect only the structure of the graph do not move the color containers in memory. Algorithm 8 describes how color containers are associated to their respective unitigs.
Availability of data and materials
We have made the source code of Bifrost available as open source software at https://github.com/pmelsted/bifrost[71]. The source code is released under a BSD2 license. The website contains details on installation, setup, and usage. The exact version used in this paper is archived at Zenodo under https://zenodo.org/record/3973373[72].
References
Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA. 2001; 98(17):9748–53.
Idury RM, Waterman MS. A new algorithm for DNA sequence assembly. J Comput Biol. 1995; 2(2):291–306.
Yang B, Liu B, Mu D, Zhang H, Yuan J, Gan J, Li N, Fan W, Hu X, Chen Y, Shi Y, Li Z. Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and debruijngraph. Brief Funct Genomics. 2011; 11(1):25–37.
Chaisson MJ, Pevzner PA. Short read fragment assembly of bacterial genomes. Genome Res. 2008; 18(2):324–30.
Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008; 18(5):821–9.
Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol İ. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009; 19(6):1117–23.
Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, Tang J, Wu G, Zhang H, Shi Y, Liu Y, Yu C, Wang B, Lu Y, Han C, Cheung DW, Yiu SM, Peng S, Xiaoqian Z, Liu G, Liao X, Li Y, Yang H, Wang J, Lam TW, Wang J. SOAPdenovo2: an empirically improved memoryefficient shortread de novo assembler,. GigaScience. 2012; 1(1):1–18.
Chikhi R, Rizk G. Spaceefficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol Biol. 2013;8(22).
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al.SPAdes: a new genome assembly algorithm and its applications to singlecell sequencing. J Comput Biol. 2012; 19(5):455–77.
MacCallum I, Przybylski D, Gnerre S, Burton J, Shlyakhter I, Gnirke A, Malek J, McKernan K, Ranade S, Shea TP, et al.ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads. Genome Biol. 2009; 10:103.
Rang FJ, Kloosterman WP, de Ridder J. From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol. 2018; 19(1):90.
Rhoads A, Au KF. PacBio sequencing and its applications. Genomics Proteome Bioinforma. 2015; 13(5):278–89.
Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate longread assembly via adaptive kmer weighting and repeat separation. Genome Res. 2017; 27(5):722–36.
Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016; 32(14):2103–10.
Chin CS, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, Dunn C, O’Malley R, FigueroaBalderas R, MoralesCruz A, et al.Phased diploid genome assembly with singlemolecule realtime sequencing. Nat Methods. 2016; 13(12):1050.
Kamath GM, Shomorony I, Xia F, Courtade TA, David NT. Hinge: longread assembly achieves optimal repeat resolution. Genome Res. 2017; 27(5):747–56.
Salmela L, Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics. 2014; 30(24):3506–14.
Ruan J, Li H. Fast and accurate longread assembly with wtdbg2. Nat Methods. 2020; 17:155–8.
Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, et al.De novo assembly and analysis of RNAseq data. Nat Methods. 2010; 7:909–12.
Uricaru R, Rizk G, Lacroix V, Quillery E, Plantard O, Chikhi R, Lemaitre C, Peterlongo P. Referencefree detection of isolated SNPs. Nucleic Acids Res. 2015; 43(2):11.
Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, Rizk G. Referencefree compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinformatics. 2015; 16(1):288.
Limasset A, Flot JF, Peterlongo P. Toward perfect reads: selfcorrection of short reads via mapping on de Bruijn graphs. Bioinformatics. 2020; 36(5):1374–81.
Liu B, Guo H, Brudno M, Wang Y. deBGA: read alignment with de Bruijn graphbased seed and extension. Bioinformatics. 2016; 32(21):3224–32.
Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet. 2012; 44:226–32.
Zekic T, Holley G, Stoye J. Pangenome storage and analysis techniques. In: Comparative Genomics. Springer: 2018. p. 29–53.
Fang H, Bergmann EA, Arora K, Vacic V, Zody MC, Iossifov I, O’Rawe JA, Wu Y, Barron LTJ, Rosenbaum J, et al.Indel variant analysis of shortread sequencing data with Scalpel. Nat Protoc. 2016; 11:2529–48.
Bray NL, Pimentel H, Melsted P, Pachter L. Nearoptimal probabilistic RNAseq quantification. Nat Biotechnol. 2016; 34:525–7.
Minkin I, Pham S, Medvedev P. TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics. 2017; 33(24):4024–32.
Chikhi R, Limasset A, Medvedev P. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics. 2016; 32(12):201–8.
Marcus S, Lee H, Schatz MC. SplitMEM: a graphical algorithm for pangenome analysis with suffix skips. Bioinformatics. 2014; 30(24):3476–83.
Baier U, Beller T, Ohlebusch E. Graphical pangenome analysis with compressed suffix trees and the BurrowsWheeler transform. Bioinformatics. 2016; 32(4):497–504.
Minkin I, Patel A, Kolmogorov M, Vyahhi N, Pham S. Sibelia: a scalable and comprehensive synteny block generation tool for closely related microbial genomes. In: Proc. of the 13th Workshop on Algorithms in Bioinformatics (WABI’13): 2013. p. 215–29.
Sheikhizadeh S, Schranz ME, Akdel M, de Ridder D, Smit S. PanTools: representation, storage and exploration of pangenomic data. Bioinformatics. 2016; 32(17):487–93.
Drezen E, Rizk G, Chikhi R, Deltel C, Lemaitre C, Peterlongo P, Lavenier D. GATB: genome assembly & analysis tool box. Bioinformatics. 2014; 30(20):2959–61.
Crusoe MR, Alameldin HF, Awad S, Boucher E, Caldwell A, Cartwright R, Charbonneau A, Constantinides B, Edvenson G, Fay S, et al.The khmer software package: enabling efficient nucleotide sequence analysis. F1000Research. 2015; 4:900.
Almodaresi F, Pandey P, Patro R. Rainbowfish: a succinct colored de Bruijn graph representation. In: Proc. of the 17th Workshop on Algorithms in Bioinformatics (WABI’17). Schloss DagstuhlLeibnizZentrum für Informatik: 2017.
Holt J, McMillan L. Merging of multistring BWTs with applications. Bioinformatics. 2014; 30(24):3524–31.
Holley G, Wittler R, Stoye J. Bloom filter trie–a data structure for pangenome storage. In: Proc. of the 15th Workshop on Algorithms in Bioinformatics (WABI’15), vol. 9289. Springer: 2015. p. 217–30.
Muggli MD, Bowe A, Noyes NR, Morley PS, Belk KE, Raymond R, Gagie T, Puglisi SJ, Boucher C. Succinct colored de Bruijn graphs. Bioinformatics. 2017; 33(20):3181–7.
Almodaresi F, Pandey P, Ferdman M, Johnson R, Patro R. An efficient, scalable and exact representation of highdimensional color information enabled via de Bruijn graph search. In: Proc. of the 23rd International Conference on Research in Computational Molecular Biology (RECOMB’19). Springer: 2019.
Muggli MD, Alipanahi B, Boucher C. Building large updatable colored de Bruijn graphs via merging. bioRxiv. 2019. https://doi.org/10.1101/229641.
Solomon B, Kingsford C. Fast search of thousands of shortread sequencing experiments. Nat Biotechnol. 2016; 34:300–2.
Sun C, Harris RS, Chikhi R, Medvedev P. Allsome sequence bloom trees. J Comput Biol. 2018;25(5):467–79.
Solomon B, Kingsford C. Improved search of large transcriptomic sequencing databases using split sequence Bloom trees. J Comput Biol. 2018;25(7):755–65.
Pandey P, Almodaresi F, Bender MA, Ferdman M, Johnson R, Patro R. Mantis: a fast, small, and exact largescale sequencesearch index. Cell Syst. 2018; 7(2):201–7.
Yu Y, Liu J, Liu X, Zhang Y, Magner E, Lehnert E, Qian C, Liu J. Seqothello: querying RNAseq experiments at scale. Genome Biol. 2018; 19:167.
Bradley P, den Bakker HC, Rocha EP, McVean G, Iqbal Z. Ultrafast search of all deposited bacterial and viral genomic data. Nat Biotechnol. 2019; 37:152–9.
Chikhi R, Holub J, Medvedev P. Data structures to represent sets of klong DNA sequences. arXiv: 1903.12312. 2019.
Marchet C, Boucher C, Puglisi SJ, Medvedev P, Salson M, Chikhi R. Data structures based on kmers for querying large collections of sequencing datasets. bioRxiv. 2019. https://doi.org/10.1101/866756.
Guo H, Fu Y, Gao Y, Li J, Wang Y, Liu B. deGSM: memory scalable construction of large scale de Bruijn Graph. IEEE/ACM Trans Comput Biol Bioinform. 2019. https://doi.org/10.1109/TCBB.2019.2913932.
Burrows M, Wheeler DJ. A blocksorting lossless data compression algorithm. Tech. Rep. 124, Digital SRC Research Report. 1994.
Weiner P. Linear pattern matching algorithms. In: Proc. of the 14th Annual Symposium on Switching and Automata Theory (SWAT’73).). IEEE: 1973.
Wittler R. Alignment and referencefree phylogenomics with colored deBruijn graphs. In: Proc. of the 19th Workshop on Algorithms in Bioinformatics (WABI’19). Springer: 2019.
Luhmann N, Holley G, Achtman M. BlastFrost: fast querying of 100,000s of bacterial genomes in Bifrost graphs. bioRxiv. 2020. https://doi.org/10.1101/2020.01.21.914168.
Zook JM, Catoe D, McDaniel J, Vang L, Spies N, Sidow A, Weng Z, Liu Y, Mason CE, Alexander N, et al.Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016; 3:160025.
Marchet C, Kerbiriou M, Limasset A. Indexing de Bruijn graphs with minimizers. In: Proc. of the 23rd International Conference on Research in Computational Molecular Biology (RECOMB’19): 2019. https://doi.org/10.1101/546309.
Pandey P, Bender MA, Johnson R, Patro R. Squeakr: an exact and approximate kmer counting system. Bioinformatics. 2017; 34(4):568–75. https://doi.org/10.1093/bioinformatics/btx636.
Zhou Z, Alikhan NF, Mohamed K, Achtman M. The user’s guide to comparative genomics with EnteroBase. Three case studies: microclades within Salmonella enterica serovar Agama, ancient and modern populations of Yersinia pestis, and core genomic diversity of all Escherichia. bioRxiv. 2019. https://doi.org/10.1101/613554.
Deorowicz S, Kokot M, Grabowski S, DebudajGrabysz A. KMC 2: fast and resourcefrugal kmer counting. Bioinformatics. 2015; 31(10):1569–76.
Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004; 20(18):3363–9.
Grabowski S, Deorowicz S, Roguski L. Diskbased compression of data from genome sequencing. Bioinformatics. 2015; 31(9):1389–95.
Bloom BH. Space/time tradeoffs in hash coding with allowable errors. Comm ACM. 1970; 13(7):422–6.
Kirsch A, Mitzenmacher M. Less hashing, same performance: building a better Bloom filter. In: Proc. of the European Symposium on Algorithms (ESA’06), vol. 4168: 2006. p. 456–67.
Putze F, Sanders P, Singler J. Cache, hash and spaceefficient bloom filters. ACM J Exp Algorithmic. 2009; 14:9.
Lemire D, Kaser O. Recursive ngram hashing is pairwise independent, at best. Comput Speech Lang. 2010; 24(4):698–710.
Harter R. The minimum on a sliding window algorithm. 2009. http://richardhartersworld.com/cri/2001/slidingmin.html. Accessed 25 Mar 2019.
Melsted P, Pritchard JK. Efficient counting of kmers in DNA sequences using a bloom filter. BMC Bioinformatics. 2011; 12(1):333.
Azar Y, Broder AZ, Karlin AR, Upfal E. Balanced allocations. SIAM J Comput. 1999; 29(1):180–200.
Chambi S, Lemire D, Kaser O, Godin R. Better bitmap performance with Roaring bitmaps. Softw Pract Exp. 2016; 46(5):709–19.
Limasset A, Rizk G, Chikhi R, Peterlongo P. Fast and scalable minimal perfect hashing for massive key sets. arXiv. 2017.
Holley G, Melsted P. Bifrost Github repository. 2020. https://github.com/pmelsted/bifrost.
Holley G, Melsted P. Zenodo repository for Bifrost. https://doi.org/10.5281/zenodo.3973373.
Acknowledgements
The authors would like to thank Nina Luhmann, Birte Kehr, and Thomas Krannich for their helpful feedback during the development of the software and Trausti Sæmundsson for his work on an early draft of the software.
Review history
The review history is available as Additional file 2.
Funding
This work was supported by the Icelandic Research Fund Project grant number 152399053.
Author information
Authors and Affiliations
Contributions
All authors implemented the Bifrost software and designed the algorithm and the experiments. All authors wrote the manuscript. All authors reviewed and approved the final version of the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Supplementary file.
Additional file 2
Review history.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Holley, G., Melsted, P. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol 21, 249 (2020). https://doi.org/10.1186/s13059020021358
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13059020021358