
Table 1 Time- and memory-performance results for constructing compacted de Bruijn graphs from short-read sets

From: Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2

   

| Dataset | k | Thread-count | ABySS-Bloom-dBG (small-memory) | ABySS-Bloom-dBG (large-memory) | Bifrost | deGSM | BCALM 2 | Cuttlefish 2 (default memory) | Cuttlefish 2 (match second-best memory) | Cuttlefish 2 (unrestricted memory) |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | 27 | 8 | 22 h 18 min (39.3) | 20 h 23 min (71.3) | 11 h 43 min (48.5) | 10 h 36 min (235.8) | 04 h 23 min (6.7) | 01 h 13 min (3.2) | 01 h 10 min (6.2) | 01 h (11.3) |
| Human | 27 | 16 | 11 h 38 min (39.3) | 11 h 02 min (71.3) | 09 h 39 min (48.6) | 07 h 08 min (235.8) | 04 h 58 min (8.9) | 56 min (3.3) | 56 min (7.6) | 51 min (11.3) |
| Human | 55 | 8 | 16 h 32 min (34.0) | 15 h 58 min (66.0) | 05 h 43 min (43.8) | 16 h 50 min (293.2) | 04 h 01 min (7.4) | 02 h 20 min (3.5) | 01 h 08 min (7.1) | 01 h 03 min (11.3) |
| Human | 55 | 16 | 09 h 28 min (34.1) | 08 h 37 min (66.1) | 04 h 16 min (43.9) | 15 h 54 min (293.3) | 04 h 26 min (10.5) | 02 h 02 min (3.7) | 01 h 11 min (9.5) | 51 min (11.3) |
| Human RNA-seq | 27 | 8 | 11 h 47 min (33.7) | 11 h 22 min (65.7) | 06 h 04 min (7.2) | 01 h 35 min (87.1) | 02 h 58 min (3.8) | 30 min (2.9) | – | 18 min (80.1) |
| Human RNA-seq | 27 | 16 | 11 h 38 min (39.3) | 07 h 38 min (65.7) | 07 h 24 min (7.2) | 01 h 37 min (87.2) | 02 h 46 min (3.9) | 20 min (3.0) | – | 12 min (80.1) |
| Gut microbiome | 27 | 16 | 18 h 47 min (42.0) | 20 h 12 min (74.0) | 03 h 54 min (38.1) | 02 h 28 min (157.2) | 02 h 34 min (7.7) | 26 min (3.5) | 23 min (6.7) | 20 min (26.8) |
| Gut microbiome | 55 | 16 | 1 day 17 h 43 min (35.9) | 1 day 08 h 09 min (67.8) | 02 h 44 min (46.7) | 06 h 53 min (293.3) | 03 h 02 min (12.5) | 44 min (4.0) | 25 min (11.3) | 20 min (69.9) |
| Soil | 27 | 16 | 1 day 18 h 35 min (150.4) | 14 h 24 min (275.0) | 15 h 28 min (274.1) | 1 day 14 h 29 min (235.8) | 19 h 39 min (52.0) | 02 h 01 min (19.2) | 02 h 18 min (40.9) | 01 h 35 min (40.9) |
| Soil | 55 | 16 | 07 h 57 min (128.9) | 06 h 36 min (256.8) | 05 h 49 min (157.0) | 1 day 11 h 05 min (293.3) | 08 h 30 min (27.5) | 03 h 02 min (11.1) | 02 h 43 min (23.3) | 01 h 38 min (23.3) |
| White spruce | 27 | 16 | ∗ | X | X | † | 2 days 06 h 12 min (36.8) | 10 h 05 min (14.0) | 07 h 47 min (35.2) | 07 h 13 min (204.2) |
| White spruce | 55 | 16 | ∗ | X | X | † | 2 days 09 h 59 min (31.6) | 10 h 12 min (23.8) | 10 h 08 min (31.1) | 07 h 24 min (279.3) |

  1. Each cell contains the running time in wall-clock format, followed by the maximum memory usage in gigabytes in parentheses. The frequency thresholds f₀ used are as follows: (i) human: 14 (k = 27) and 9 (k = 55); (ii) human RNA-seq, gut microbiome, and soil: 2; and (iii) white spruce: 11 (k = 27) and 7 (k = 55). Some details on executing the different tool implementations are as follows. (1) ABySS-Bloom-dBG has two tunable parameters that significantly affect its performance: a Bloom filter [63] memory budget and the number of hash functions for the filters. We executed it with two configurations: small-memory (with 4 hashes) and large-memory (with 3 hashes). The memory budgets used in these configurations are as follows: (i) human, human RNA-seq, and gut microbiome: 32 GB and 64 GB; (ii) soil: 64 GB and 128 GB; and (iii) white spruce: 400 GB, with no large-memory execution due to hardware limitations. (2) Bifrost does not support arbitrary f₀ and uses a default of f₀ = 2. For a uniform comparison across the tools with f₀ = 2 on the human dataset, see Additional file 1: Table S2. We did not execute Bifrost on the white spruce dataset due to this limitation: while on the human dataset the increases in the vertex-count for Bifrost are approximately 26% (k = 27) and 19% (k = 55), these are 91% and 45%, respectively, on the white spruce dataset. (3) deGSM has a maximum-memory parameter, with an upper limit of 128 GB. We observed that its internal k-mer enumeration steps using Jellyfish [64] use more memory than this limit in all the experiments, and therefore we used 128 GB for deGSM in all its executions. (4) BCALM 2 also has a maximum-memory option, which we set to the best memory usage obtained from the rest of the algorithms. It also has a maximum disk usage option, which we set to the entire usable space (3.4 TB) of the disk used for its working directory, for maximum efficiency. (5) The Cuttlefish 2 implementation also supports tunable memory up to a certain extent, and we executed it with three settings: (i) default memory: using the default minimum memory of ≈9.7 bits/vertex (see the Space complexity section); (ii) match second-best memory: using up to the memory amount found best in executions other than Cuttlefish 2, in strict-memory mode; and (iii) unrestricted memory: using no strict upper limit for memory.
  2. The best performance with respect to each metric in each row is highlighted, where only the default-memory mode is considered for Cuttlefish 2. The ∗'s and the †'s denote that the corresponding executions could not complete due to hardware shortage of memory and disk space, respectively. The X's denote that the corresponding executions were not run, for the reasons noted earlier. Additional file 1: Table S1 also includes the intermediate disk usage incurred by the tools, in addition to time and memory.
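
For readers who want to compare cells numerically, the following is a minimal Python sketch, not part of the original article, that converts a cell such as "1 day 17 h 43 min (35.9)" into total minutes and gigabytes. The names `CELL_RE`, `parse_cell`, and `speedup` are hypothetical helpers introduced here for illustration; the cell format is assumed from the footnote describing wall-clock time followed by maximum memory in parentheses.

```python
import re

# Hypothetical pattern for a table cell: optional days, hours, and minutes,
# followed by the peak memory in gigabytes inside parentheses.
CELL_RE = re.compile(
    r"^(?:(?P<d>\d+)\s*(?:days?|d)\s*)?"
    r"(?:(?P<h>\d+)\s*h\s*)?"
    r"(?:(?P<m>\d+)\s*min\s*)?"
    r"\((?P<gb>\d+(?:\.\d+)?)\)$"
)


def parse_cell(cell):
    """Return (minutes, gigabytes), or None for markers such as '∗', '†', 'X', '–'."""
    match = CELL_RE.match(cell.strip())
    if match is None:
        return None
    parts = [match.group(name) for name in ("d", "h", "m")]
    if all(part is None for part in parts):
        return None  # a bare "(x.y)" carries no running time
    days, hours, minutes = (int(part or 0) for part in parts)
    total_minutes = days * 24 * 60 + hours * 60 + minutes
    return total_minutes, float(match.group("gb"))


def speedup(baseline_cell, candidate_cell):
    """Ratio of baseline to candidate running time, or None if either cell is a marker."""
    baseline = parse_cell(baseline_cell)
    candidate = parse_cell(candidate_cell)
    if baseline is None or candidate is None:
        return None
    return baseline[0] / candidate[0]
```

As a usage example, `speedup("04 h 23 min (6.7)", "01 h 13 min (3.2)")` compares the BCALM 2 and Cuttlefish 2 default-memory cells for the human dataset at k = 27 with 8 threads (263 minutes versus 73 minutes). Cells holding only a marker return `None`, so incomplete or skipped executions are not silently treated as zero.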