Fig. 3 | Genome Biology

Fig. 3

From: Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis

Accuracies of Kssd, Mash, and Bindash on a closely related group. The Pearson correlation coefficient r between the ground-truth and the estimated mutation rates is scaled to −log(1 − r) for plotting clarity (y-axis). The decimal above the highest data point at each sketch size is the maximal r value across all three methods and all k settings at that sketch size. The default k-mer lengths k for Kssd, Bindash, and Mash are 16, 21, and 21, respectively; to match k, we also ran Mash and Bindash with k = 16 in addition to their default k settings, but Kssd takes only even k, so we also ran with k = 20 to vary k. Because of their different sketching mechanisms, Mash accepts any integer as its sketch size parameter, whereas Bindash accepts only multiples of 64; sketch size is not a parameter of Kssd and can only be counted from the sketch file. To match sketch sizes as closely as possible, we first sketched the reference using Kssd with dimensionality reduction levels z = {4, 3, 2, 1, 0} and obtained the sketch sizes sk = {84, 1268, 21,077, 337,277, 5,236,120}, respectively; we then took the nearest multiples of 64 of sk (the parenthesized values) and interpolated with their 2-, 4-, and 8-fold sketch sizes to obtain the sketch size parameter sb for Bindash; finally, we merged sk and the interpolated points of sb to obtain the sketch size parameter sm for Mash. Mash results with k = 20 and k = 21 at sketch size 5,236,120 are not shown because those runs failed with an error
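The y-axis scaling and the Bindash sketch-size grid described in the caption can be illustrated with a minimal Python sketch. This is not the authors' code. The function names are hypothetical, the logarithm is assumed to be base 10 (the caption does not state the base), and applying the 2-, 4-, and 8-fold interpolation to every base size, including the largest, is an assumption.

```python
import math

# Kssd sketch sizes reported in the caption for z = 4, 3, 2, 1, 0.
SK = [84, 1268, 21077, 337277, 5236120]


def scale_r(r: float) -> float:
    """Scale a Pearson r to -log10(1 - r); base 10 is an assumption."""
    if not r < 1.0:
        raise ValueError("r must be < 1, otherwise 1 - r is not positive")
    return -math.log10(1.0 - r)


def nearest_multiple_of_64(n: int) -> int:
    """Round n to the nearest multiple of 64, never below 64."""
    return max(64, (n + 32) // 64 * 64)


def bindash_sketch_sizes(sk, folds=(2, 4, 8)):
    """Base sizes (nearest multiples of 64) plus their fold multiples.

    Assumption: the folds are applied to every base size.
    """
    sizes = set()
    for n in sk:
        base = nearest_multiple_of_64(n)
        sizes.add(base)
        sizes.update(base * f for f in folds)
    return sorted(sizes)


if __name__ == "__main__":
    print([nearest_multiple_of_64(n) for n in SK])
    print(bindash_sketch_sizes(SK))
    print(scale_r(0.999))
```

Under these assumptions, the nearest multiples of 64 for sk are 64, 1280, 21,056, 337,280, and 5,236,096, and a correlation of r = 0.999 maps to a plotted value of about 3. Consecutive Kssd sketch sizes differ by roughly 16-fold, which is why the 2-, 4-, and 8-fold points fill the gaps between them.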
