
Table 3 Performance characteristics of querying different tigs with SSHash-Lite

From: Matchtigs: minimum plain text representation of k-mer sets

| Genome | Algorithm | Index time [min] | Search time [sec] | Search speedup | Index size [GiB] | Size imprv. |
|---|---|---|---|---|---|---|
| (a) regular SSHash-Lite | | | | | | |
| \(\sim\)309k× Salmonella (0.75) | unitigs | 2.77 | 3027 | 1.00 | 1.04 | 1.00 |
| | UST | 2.42 | 1491 | 2.03 | 0.71 | 1.48 |
| | gMatchtigs | 2.60 | 710 | 4.26 (2.10) | 0.65 | 1.60 (1.08) |
| Human reads (0.75) | unitigs | 18.9 | 558 | 1.00 | 4.60 | 1.00 |
| | UST | 17.1 | 499 | 1.12 | 3.63 | 1.27 |
| | gMatchtigs | 19.2 | 384 | 1.45 (1.30) | 3.47 | 1.33 (1.05) |
| 2505× Human (0.65) | unitigs | 15.1 | 515 | 1.00 | 3.63 | 1.00 |
| | ProphAsm | 14.0 | 421 | 1.22 | 2.86 | 1.27 |
| | gMatchtigs | 14.8 | 363 | 1.42 (1.16) | 2.86 | 1.27 (1.00) |
| (b) canonical SSHash-Lite | | | | | | |
| \(\sim\)309k× Salmonella (0.75) | unitigs | 3.94 | 1576 | 1.00 | 1.13 | 1.00 |
| | UST | 3.30 | 961 | 1.64 | 0.78 | 1.44 |
| | gMatchtigs | 3.71 | 572 | 2.75 (1.68) | 0.74 | 1.52 (1.06) |
| Human reads (0.75) | unitigs | 25.0 | 373 | 1.00 | 5.02 | 1.00 |
| | UST | 23.4 | 324 | 1.15 | 4.05 | 1.24 |
| | gMatchtigs | 26.3 | 266 | 1.40 (1.22) | 3.94 | 1.28 (1.03) |
| 2505× Human (0.65) | unitigs | 21.3 | 340 | 1.00 | 4.26 | 1.00 |
| | ProphAsm | 20.1 | 258 | 1.32 | 3.48 | 1.22 |
| | gMatchtigs | 21.1 | 232 | 1.46 (1.11) | 3.52 | 1.21 (0.99) |

  1. SSHash-Lite is run with \(k=31\) and a k-mer inclusion rate of 0.8. On the Salmonella pangenome, we used a minimizer length of 17 for the regular index and 16 for the canonical index. On the human reads, we used a minimizer length of 20 for the regular index and 19 for the canonical index. On the human pangenome, we used a minimizer length of 19 for the regular index and 20 for the canonical index. The search speedup is with respect to unitigs, and the search speedup in parentheses is with respect to the strings computed by UST. Index time is the end-to-end time required to build the SSHash-Lite index: it includes reading the collections from disk and building the data structure using external memory. Search time is the time required to check which reads have at least 80% of their k-mers in the input SPSS. The number in parentheses under the genome is the k-mer hit rate, i.e. the fraction of k-mers from the query that are part of the queried dataset.
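
As an illustration of the two quantities described in the footnote, the following minimal Python sketch shows how a search speedup can be derived from two search times and how a read can be tested against a k-mer inclusion rate. It is not SSHash-Lite code: the function names `speedup`, `kmers` and `read_is_included` are hypothetical, a plain Python set stands in for the index, and reverse complements (the canonical case) are ignored.

```python
def speedup(baseline_seconds: float, seconds: float) -> float:
    """Speedup of a run relative to a baseline run (baseline / run)."""
    return baseline_seconds / seconds


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping substrings of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def read_is_included(read: str, kmer_set: set[str], k: int,
                     inclusion_rate: float = 0.8) -> bool:
    """True if at least `inclusion_rate` of the read's k-mers are in kmer_set.

    Assumption: a Python set replaces the SSHash-Lite index, and k-mers are
    compared as-is, without taking reverse complements into account.
    """
    read_kmers = kmers(read, k)
    if not read_kmers:
        return False  # read shorter than k: no k-mers to look up
    hits = sum(1 for kmer in read_kmers if kmer in kmer_set)
    return hits >= inclusion_rate * len(read_kmers)


# Search times for regular SSHash-Lite on the Salmonella pangenome (seconds).
unitigs_s, ust_s, gmatchtigs_s = 3027, 1491, 710
vs_unitigs = round(speedup(unitigs_s, gmatchtigs_s), 2)  # 4.26
vs_ust = round(speedup(ust_s, gmatchtigs_s), 2)          # 2.1

# Toy inclusion query with k = 3 (the table uses k = 31).
indexed = set(kmers("ACGTAC", 3))
contained = read_is_included("ACGTACG", indexed, 3)      # True: 5 of 5 k-mers found
not_contained = read_is_included("ACGTTTT", indexed, 3)  # False: 2 of 5 k-mers found
```

The two rounded speedups reproduce the gMatchtigs entries 4.26 and (2.10) in the first block of the table.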