Fig. 4 | Genome Biology

Fig. 4

From: Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis

Comparison of Kssd and Mash screen on containment estimation using the shakya datasets. a–d Each dot represents a reference. The x-axis indicates the minimum resemblance distance (Eq. 4) from a reference to the 64 known constituents, computed by Kssd (a, b) or Mash (c, d). The y-axis indicates the reference-to-mixture containment measurement computed by Kssd (using Aaf distance [13], Eq. 5) or Mash screen (containment score [9]). The containment score (y) of Mash screen measures the similarity between a reference and the mixture; hence, it is negatively correlated with x. The decimals below the data points are the correlation coefficients (r) of each plot, restricted to x < 0.15. The first and second columns show the plots for the simulated and the real shakya datasets, respectively. On the real shakya datasets, the data points that deviate from expectation because of low-abundance constituents are circled in purple, and those that deviate because of two different contaminations are circled in orange and red. e If a remote database adopts Kssd sketching and provides a Kssd sketch of the shakya dataset, users can greatly reduce their storage cost when performing this analysis. f For large-scale containment analysis, where the sample size of sequence mixtures is much larger than the number of references, the asymptotic time consumption of Kssd is much smaller than that of Mash screen
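
As a rough illustration of why a containment-style score behaves as a similarity between a reference and a mixture, the sketch below computes exact k-mer containment and Jaccard resemblance on toy sequences. It is a generic illustration only: it does not reproduce Eq. 4 or Eq. 5 of the article, nor the Kssd or Mash screen implementations, and the sequences, the value of k, and the function names are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Resemblance: shared k-mers divided by the union of both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(ref, mixture):
    """Fraction of the reference's k-mers that also occur in the mixture."""
    return len(ref & mixture) / len(ref) if ref else 0.0


# Toy sequences (hypothetical, not taken from the shakya datasets).
ref_present = "ACGTACGTTGCA"           # reference that is part of the mixture
ref_absent = "GGGGCCCCAAAA"            # reference that is not in the mixture
other_genome = "TTTTAGCTAGCTAGGATCCA"  # another member of the mixture
k = 4                                  # assumed k-mer length for the toy example

# The mixture is modelled as the union of the k-mers of its members.
mixture = kmers(ref_present, k) | kmers(other_genome, k)

c_present = containment(kmers(ref_present, k), mixture)
c_absent = containment(kmers(ref_absent, k), mixture)
j_present = jaccard(kmers(ref_present, k), mixture)

print(f"containment (present reference): {c_present:.2f}")
print(f"containment (absent reference):  {c_absent:.2f}")
print(f"Jaccard (present reference):     {j_present:.2f}")
```

In this toy setting the reference that belongs to the mixture is fully contained, while its Jaccard resemblance to the whole mixture is lower because the mixture carries k-mers from other members. This is the general reason containment rather than resemblance is used for reference-to-mixture comparisons.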
