Table 1 Quality and performance of compressing model organisms

From: Matchtigs: minimum plain text representation of k-mer sets

| Genome             | Algorithm | CL ratio    | SC ratio    | Time [s]       | Memory [GiB] |
|--------------------|-----------|-------------|-------------|----------------|--------------|
| C. elegans (reads) | B2        | 1.00        | 1.00        | 2402           | 5.54         |
|                    | B2+UST    | 0.58        | 0.37        | 3424 (1.43)    | 17.6 (3.18)  |
|                    | ProphAsm  | 0.55        | 0.34        | 5433 (2.26)    | 56.5 (10.2)  |
|                    | B2+GREEDY | 0.45 (0.79) | 0.11 (0.28) | 3057 (1.27)    | 41.0 (7.41)  |
| B. mori (reads)    | B2        | 1.00        | 1.00        | 6406           | 9.95         |
|                    | B2+UST    | 0.55        | 0.35        | 9896 (1.54)    | 56.2 (5.64)  |
|                    | ProphAsm  | 0.52        | 0.31        | 27,912 (4.36)  | 157 (15.8)   |
|                    | B2+GREEDY | 0.41 (0.74) | 0.06 (0.18) | 11,793 (1.84)  | 123 (12.4)   |
| H. sapiens (reads) | B2        | 1.00        | 1.00        | 168,938        | 12.4         |
|                    | B2+UST    | 0.67        | 0.46        | 170,427 (1.01) | 29.0 (2.34)  |
|                    | B2+GREEDY | 0.57 (0.84) | 0.22 (0.48) | 209,646 (1.24) | 68.5 (5.52)  |
| C. elegans         | B2        | 1.00        | 1.00        | 52.7           | 0.96         |
|                    | B2+UST    | 0.92        | 0.34        | 58.6 (1.11)    | 0.96 (1.00)  |
|                    | ProphAsm  | 0.92        | 0.30        | 133 (2.52)     | 3.78 (3.94)  |
|                    | B2+GREEDY | 0.90 (0.98) | 0.06 (0.18) | 59.9 (1.14)    | 0.96 (1.00)  |
|                    | B2+MATCH  | 0.90 (0.98) | 0.07 (0.23) | 380 (7.21)     | 1.34 (1.40)  |
| B. mori            | B2        | 1.00        | 1.00        | 244            | 1.92         |
|                    | B2+UST    | 0.78        | 0.34        | 303 (1.24)     | 1.92 (1.00)  |
|                    | ProphAsm  | 0.76        | 0.28        | 716 (2.93)     | 13.8 (7.19)  |
|                    | B2+GREEDY | 0.72 (0.92) | 0.06 (0.19) | 334 (1.37)     | 2.42 (1.26)  |
| H. sapiens         | B2        | 1.00        | 1.00        | 1787           | 6.29         |
|                    | B2+UST    | 0.79        | 0.33        | 2249 (1.26)    | 8.80 (1.40)  |
|                    | ProphAsm  | 0.76        | 0.26        | 6677 (3.74)    | 130 (20.7)   |
|                    | B2+GREEDY | 0.71 (0.91) | 0.03 (0.10) | 4999 (2.80)    | 17.3 (2.75)  |

We chose \(k = 31\) and a minimum abundance of 10 for H. sapiens reads and 1 for all other data sets. The CL and SC ratios compare the compressed strings against the unitigs; the values in parentheses are the ratios between our algorithm and the best competitor. B2 means BCALM2. For time and memory, we report the total time and the maximum memory required to compute the tigs from the respective data set. BCALM2 directly computes unitigs, and ProphAsm directly computes heuristic simplitigs. UST, GREEDY and MATCH compute heuristic simplitigs, greedy matchtigs and matchtigs, respectively, from unitigs. The number in parentheses after each time and memory value is the slowdown or memory increase relative to computing only unitigs with BCALM2. All algorithms were run with 28 threads, except for UST and ProphAsm, which each support only one thread (the BCALM2 run preceding UST was still executed with 28 threads). Matchtigs are too expensive to compute for any data set other than the C. elegans reference, and ProphAsm takes too much time on H. sapiens reads, especially since it does not support a minimum abundance. The genome lengths are 100 Mbp for C. elegans, 482 Mbp for B. mori and 3.21 Gbp for H. sapiens, and the read data sets have a coverage of 64× for C. elegans, 58× for B. mori and 300× for H. sapiens. The unique k-mer counts of the read data sets are 1.35 billion for C. elegans, 3.66 billion for B. mori and 2.78 billion for H. sapiens.
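
The parenthesized time and memory values can be recomputed from the table itself. The following is a minimal sketch, not the authors' evaluation code, that divides each algorithm's time and memory by the BCALM2 baseline for the C. elegans (reads) rows. The names `measurements` and `overhead_factors` are hypothetical and chosen only for this illustration. Because the inputs are the rounded table entries, a recomputed factor may differ from the published one in the last digit.

```python
# Values transcribed from the C. elegans (reads) rows of the table.
# Each entry maps an algorithm to (time in seconds, peak memory in GiB).
measurements = {
    "B2": (2402, 5.54),
    "B2+UST": (3424, 17.6),
    "ProphAsm": (5433, 56.5),
    "B2+GREEDY": (3057, 41.0),
}


def overhead_factors(measurements, baseline="B2"):
    """Return {algorithm: (time factor, memory factor)} relative to the baseline.

    Hypothetical helper: a factor of 1.43 means the run took 1.43 times
    as long (or used 1.43 times the memory) as the baseline alone.
    """
    base_time, base_mem = measurements[baseline]
    return {
        name: (time / base_time, mem / base_mem)
        for name, (time, mem) in measurements.items()
        if name != baseline
    }


factors = overhead_factors(measurements)
for name, (time_factor, mem_factor) in factors.items():
    print(f"{name}: time x{time_factor:.2f}, memory x{mem_factor:.2f}")
```

The same helper applies to any other genome block in the table by swapping in that block's rows and keeping B2 as the baseline.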