Table 2 Quality and performance of compressing pangenomes

From: Matchtigs: minimum plain text representation of k-mer sets

| Pangenome | Algorithm | CL ratio | SC ratio | Time [s] | Memory [GiB] |
|---|---|---|---|---|---|
| 1102× N. gonorrhoeae | B2 | 1.00 | 1.00 | 29.1 | 4.25 |
| | B2+UST | 0.63 | 0.35 | 31.1 (1.07) | 4.25 (1.00) |
| | ProphAsm | 0.62 | 0.33 | 735 (25.3) | 0.202 (0.05) |
| | B2+GREEDY | 0.57 (0.93) | 0.18 (0.54) | 30.2 (1.04) | 4.25 (1.00) |
| | B2+MATCH | 0.57 (0.92) | 0.18 (0.56) | 31.1 (1.07) | 4.25 (1.00) |
| 616× S. pneumoniae | B2 | 1.00 | 1.00 | 26.1 | 3.07 |
| | B2+UST | 0.61 | 0.35 | 31.1 (1.19) | 3.07 (1.00) |
| | ProphAsm | 0.60 | 0.33 | 445 (17.0) | 0.424 (0.14) |
| | B2+GREEDY | 0.53 (0.89) | 0.13 (0.41) | 29.0 (1.11) | 3.07 (1.00) |
| | B2+MATCH | 0.52 (0.88) | 0.14 (0.44) | 41.8 (1.60) | 3.07 (1.00) |
| 3682× E. coli | B2 | 1.00 | 1.00 | 334 | 6.95 |
| | B2+UST | 0.60 | 0.35 | 417 (1.25) | 6.95 (1.00) |
| | ProphAsm | 0.59 | 0.32 | 13,339 (39.9) | 7.05 (1.01) |
| | B2+GREEDY | 0.51 (0.87) | 0.11 (0.33) | 384 (1.15) | 6.95 (1.00) |
| | B2+MATCH | 0.50 (0.85) | 0.12 (0.37) | 861 (2.58) | 7.78 (1.12) |
| \(\sim\)309k× Salmonella | B2 | 1.00 | 1.00 | 82,417 | 12.7 |
| | B2+UST | 0.57 | 0.36 | 82,841 (1.01) | 12.7 (1.00) |
| | B2+GREEDY | 0.46 (0.81) | 0.11 (0.30) | 82,726 (1.00) | 19.1 (1.50) |
| 2505× H. sapiens | CF | 1.00 | 1.00 | 77,582 | 402 |
| | CF+ProphAsm | 0.68 | 0.31 | 82,797 (1.07) | 402 (1.00) |
| | CF+GREEDY | 0.63 (0.93) | 0.16 (0.50) | 83,507 (1.08) | 402 (1.00) |

  1. We chose \(k = 31\) and a minimum abundance of 1. The CL and SC ratios are between the compressed strings and the unitigs; the ratios in parentheses are between our algorithm and the best competitor. B2 means BCALM2 and CF means Cuttlefish 2. For time and memory, we report the total time and maximum memory required to compute the tigs from the respective data set. BCALM2 directly computes unitigs and ProphAsm directly computes heuristic simplitigs. UST, GREEDY and MATCH compute heuristic simplitigs, greedy matchtigs and matchtigs, respectively, from unitigs. The number in parentheses after time and memory indicates the slowdown or increase relative to computing just the unitigs with BCALM2. All algorithms were run with 28 threads, except for UST and ProphAsm, which each support only one thread (the BCALM2 run preceding UST was still executed with 28 threads). The N. gonorrhoeae pangenome contains 8.36 million unique k-mers, the S. pneumoniae pangenome 19.3 million, the E. coli pangenome 341 million, the Salmonella pangenome 657 million and the human pangenome 2.8 billion. Due to the size of the Salmonella pangenome, ProphAsm and MATCH could not be run on it. Also due to size, BCALM2 did not run on the human pangenome, so we used Cuttlefish 2 instead. To still be able to compare against competitors, we ran ProphAsm on the unitigs produced by Cuttlefish 2 (UST requires extra information from BCALM2). To let Cuttlefish 2 run faster, we used the flag --unrestricted-memory; hence, its memory consumption is much higher than that of BCALM2.
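
As a worked illustration of how the parenthesized factors relate to the raw columns, the following minimal sketch recomputes three of them for the E. coli rows from the values printed in the table. The helper name `relative` and the variable names are illustrative assumptions and not part of any of the tools. Because the table reports rounded values, recomputing a factor from them may differ from the printed factor in the last digit for some rows.

```python
def relative(value: float, baseline: float) -> float:
    """Return value / baseline, e.g. a slowdown or a memory-increase factor."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return value / baseline


# Values copied from the E. coli rows of the table (hypothetical variable names).
bcalm2_time_s = 334.0     # B2: time to compute unitigs
match_time_s = 861.0      # B2+MATCH: total time
bcalm2_mem_gib = 6.95     # B2: maximum memory
match_mem_gib = 7.78      # B2+MATCH: maximum memory
prophasm_cl_ratio = 0.59  # best competitor by CL ratio in this group
match_cl_ratio = 0.50     # B2+MATCH CL ratio

# Slowdown and memory increase relative to computing just the unitigs.
slowdown = relative(match_time_s, bcalm2_time_s)
mem_increase = relative(match_mem_gib, bcalm2_mem_gib)

# CL ratio of MATCH relative to the best competitor.
cl_vs_best = relative(match_cl_ratio, prophasm_cl_ratio)

print(f"slowdown: {slowdown:.2f}")
print(f"memory increase: {mem_increase:.2f}")
print(f"CL ratio vs. best competitor: {cl_vs_best:.2f}")
```

For these rows the recomputed factors agree with the printed ones: (2.58) for time, (1.12) for memory and (0.85) for the CL ratio.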