Matchtigs: minimum plain text representation of k-mer sets

Table 3 Performance characteristics of querying different tigs with SSHash-Lite

Genome	Algorithm	Index time [min]	Search time [sec]	Search speedup		Index size [GiB]	Size improvement
(a) regular SSHash-Lite
\(\sim\)309k× Salmonella (0.75)	unitigs	2.77	3027	1.00		1.04	1.00
	UST	2.42	1491	2.03		0.71	1.48
	gMatchtigs	2.60	710	4.26	(2.10)	0.65	1.60	(1.08)
Human reads (0.75)	unitigs	18.9	558	1.00		4.60	1.00
	UST	17.1	499	1.12		3.63	1.27
	gMatchtigs	19.2	384	1.45	(1.30)	3.47	1.33	(1.05)
2505× Human (0.65)	unitigs	15.1	515	1.00		3.63	1.00
	ProphAsm	14.0	421	1.22		2.86	1.27
	gMatchtigs	14.8	363	1.42	(1.16)	2.86	1.27	(1.00)
(b) canonical SSHash-Lite
\(\sim\)309k× Salmonella (0.75)	unitigs	3.94	1576	1.00		1.13	1.00
	UST	3.30	961	1.64		0.78	1.44
	gMatchtigs	3.71	572	2.75	(1.68)	0.74	1.52	(1.06)
Human reads (0.75)	unitigs	25.0	373	1.00		5.02	1.00
	UST	23.4	324	1.15		4.05	1.24
	gMatchtigs	26.3	266	1.40	(1.22)	3.94	1.28	(1.03)
2505× Human (0.65)	unitigs	21.3	340	1.00		4.26	1.00
	ProphAsm	20.1	258	1.32		3.48	1.22
	gMatchtigs	21.1	232	1.46	(1.11)	3.52	1.21	(0.99)

SSHash-Lite is run with \(k=31\) and a k-mer inclusion rate of 0.8. On the Salmonella pangenome, we used a minimizer length of 17 for the regular index and 16 for the canonical index. On the human reads, we used a minimizer length of 20 for the regular index and 19 for the canonical index. On the human pangenome, we used a minimizer length of 19 for the regular index and 20 for the canonical index. Search speedup and size improvement are given relative to unitigs; the values in parentheses are relative to the strings computed by UST (in the human pangenome rows, where ProphAsm is listed instead of UST, they match the ratio to ProphAsm). Index time is the end-to-end time required to build the SSHash-Lite index: it includes reading the collections from disk and building the data structure using external memory. Search time is the time required to check which reads have at least 80% of their k-mers in the input SPSS (spectrum preserving string set). The number in parentheses next to the genome name is the k-mer hitrate, i.e. the fraction of k-mers from the query that are part of the queried dataset.
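
The query criterion and the derived columns described in the caption can be illustrated with a minimal Python sketch. It is not the SSHash-Lite implementation: a plain Python set stands in for the index, reverse complements are ignored (so it loosely corresponds to the regular, not the canonical, setting), and all function names are hypothetical. It only shows how a read is tested against a k-mer inclusion rate of 0.8 and how a speedup is obtained as a ratio of search times.

```python
# Illustrative sketch only: a Python set stands in for the SSHash-Lite index.
# All names below are hypothetical and not part of SSHash-Lite.

def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_set(strings, k):
    """Collect the k-mers of all strings of an SPSS (e.g. unitigs or matchtigs)."""
    return {kmer for s in strings for kmer in kmers(s, k)}


def inclusion_rate(read, kmer_set, k):
    """Fraction of the read's k-mers found in kmer_set (0.0 if the read is shorter than k)."""
    total = len(read) - k + 1
    if total <= 0:
        return 0.0
    hits = sum(1 for kmer in kmers(read, k) if kmer in kmer_set)
    return hits / total


def read_is_included(read, kmer_set, k, threshold=0.8):
    """True if at least `threshold` of the read's k-mers are in kmer_set."""
    return inclusion_rate(read, kmer_set, k) >= threshold


def speedup(baseline_seconds, seconds):
    """Search speedup of a run relative to a baseline run."""
    return baseline_seconds / seconds


# Toy example with k = 3 (the table uses k = 31).
spss = ["ACGTA", "TTG"]               # k-mers: ACG, CGT, GTA, TTG
index = build_kmer_set(spss, 3)
rate_full = inclusion_rate("ACGTA", index, 3)      # all 3 k-mers present
rate_partial = inclusion_rate("ACGTAC", index, 3)  # 3 of 4 present (TAC is missing)

# Search times taken from the Salmonella rows of part (a) of the table.
gmatchtigs_vs_unitigs = round(speedup(3027, 710), 2)
gmatchtigs_vs_ust = round(speedup(1491, 710), 2)
```

With the toy data, the read `ACGTAC` reaches an inclusion rate of 0.75 and is therefore rejected at the 0.8 threshold, while `ACGTA` is accepted. The two ratios reproduce the 4.26 and (2.10) entries of the first gMatchtigs row.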

ISSN: 1474-760X