Fig. 3 | Genome Biology

Fig. 3

From: Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2

A (bidirected) edge-centric de Bruijn graph \(G(\mathscr {S}, k)\) for a set \(\mathscr {S} = \{ CTAAGAT, CGATGCA, TAAGAGG \}\) of strings and k-mer size k=3 in a, and its compacted form \(G_{c} (\mathscr {S}, k)\) in b. In the graphs, the vertices are drawn as pentagons: the flat and the cusped ends depict the front and the back sides, respectively. Each edge corresponds to some 4-mer(s), i.e., (k+1)-mer(s), in \(s \in \mathscr {S}\). In a, the vertices are the canonical forms of the k-mers in \(s \in \mathscr {S}\). The canonical string \(\widehat {t}\) associated with each vertex v is labeled inside v, to be spelled in the direction from v’s front to its back. We also use \(\widehat {t}\) to refer to v. The label beneath v is \(\overline {\widehat {t}}\), the reverse complement of \(\widehat {t}\), and is to be spelled in the opposite direction (i.e., back to front). For example, consider the 4-mer CGAT, an edge e in \(G(\mathscr {S}, 3)\). The edge e connects the 3-mers \(x = \text{pre}_{3}(e) = CGA\) and \(y = \text{suf}_{3}(e) = GAT\), the corresponding vertices being \(u = \widehat {x} = CGA\) and \(v = \widehat {y} = ATC\), respectively. Since x is canonical, e exits through u’s back; since y is non-canonical, e enters through v’s back. (CTA,TAA,AAG) is a walk, a path, and also a unitig (edges not listed). (CGA,ATC,ATG) is a walk and a path, but not a unitig, because the internal vertex ATC has multiple incident edges at its back. The unitig (CTA,TAA,AAG) is not maximal, as it can be extended farther through AAG’s back to AGA; the extended unitig is maximal and spells CTAAGA. There are four such maximal unitigs in \(G(\mathscr {S}, 3)\), and contracting each into a single vertex produces \(G_{c} (\mathscr {S}, 3)\), in b. There are two different maximal path covers of \(G(\mathscr {S}, 3)\), spelling {CTAAGATGC, CGA, CCTC} and {CCTCTTAG, CGATGC}.
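
The canonical-form and edge-side rules in the caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the Cuttlefish 2 implementation: the function names (`reverse_complement`, `canonical`, `edge_endpoints`, `vertices`) are hypothetical, the canonical form is assumed to be the lexicographically smaller of a string and its reverse complement, and the two cases the caption does not spell out (a non-canonical prefix exits through the front, a canonical suffix enters through the front) are assumed by symmetry with the CGAT example.

```python
# Illustrative sketch only; all names here are hypothetical helpers.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s: str) -> str:
    """Complement each base, then reverse the string."""
    return s.translate(_COMPLEMENT)[::-1]


def canonical(s: str) -> str:
    """Assumed definition: the lexicographically smaller of s and its reverse complement."""
    return min(s, reverse_complement(s))


def edge_endpoints(edge: str, k: int):
    """For a (k+1)-mer edge, return ((u, exit_side), (v, enter_side))."""
    x, y = edge[:k], edge[-k:]          # prefix and suffix k-mers of the edge
    u, v = canonical(x), canonical(y)   # the vertices the edge connects
    # Caption: canonical x exits through u's back; the other case is assumed by symmetry.
    exit_side = "back" if x == u else "front"
    # Caption: non-canonical y enters through v's back; the other case is assumed by symmetry.
    enter_side = "front" if y == v else "back"
    return (u, exit_side), (v, enter_side)


def vertices(strings, k: int):
    """Canonical forms of all k-mers occurring in the input strings."""
    return {canonical(s[i:i + k]) for s in strings for i in range(len(s) - k + 1)}


S = ["CTAAGAT", "CGATGCA", "TAAGAGG"]
print(edge_endpoints("CGAT", 3))
print(sorted(vertices(S, 3)))
```

For the edge CGAT, the first call returns `(('CGA', 'back'), ('ATC', 'back'))`, matching the example in the caption. Note also that `reverse_complement("CTAAGAGG")` gives CCTCTTAG, the first string in the second path cover, so a walk can be spelled from either end.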
