Fig. 2 | Genome Biology

Fig. 2

From: Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis

Kssd algorithm overview. a k-mer substring space shuffling. In this example, a k-mer substring selection pattern (kssp) p = ‘000010010000’ is predetermined for 12-mer analysis, so the p-selected-substring has length 2 and the k-mer substring space has dimensionality D = \( 4^{2} \) = 16. This 16-dimensional space is shuffled and partitioned into N subspaces of equal size (3rd column, N = 4 here), and the dimensions in each subspace are recoded by the lexically ordered strings of length \( \log_{4}\frac{D}{N} \) (4th column, length = 1 here). One subspace s (3rd and 4th columns, marked in red) is chosen for sequence sketching. b Sequence sketching. The k-mers whose p-selected-substring (green substring in 1st column) belongs to the red subspace s are selected; each p-selected-substring is recoded by its lexically ordered dimension (3rd column), and each selected k-mer is recoded so that the recoded p-selected-substring is appended as a suffix to the remaining substring (4th column). c Kssd distance. Once all sequences are sketched, the Jaccard and containment coefficients are estimated by \( \hat{J} = \frac{\mid S(A)\cap S(B)\mid }{\mid S(A)\cup S(B)\mid } \) and \( \hat{C} = \frac{\mid S(A)\cap S(B)\mid }{\min \left(|S(A)|,|S(B)|\right)} \), respectively
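
The three panels can be illustrated with a minimal Python sketch. It is written from the caption alone and is not the Kssd implementation: the function names (`build_subspace`, `sketch`, `jaccard_containment`), the seeded shuffle, and the choice to assign codes to subspace members in sorted order are all assumptions, and details such as reverse-complement handling are omitted. Sequences are assumed to be uppercase A/C/G/T strings.

```python
import random
from itertools import product

ALPHABET = "ACGT"

def build_subspace(pattern, n_subspaces, chosen, seed=0):
    """Panel a: shuffle the substring space, cut it into N equal subspaces,
    and return a {substring: code} map for the chosen subspace."""
    sub_len = pattern.count("1")                      # length of p-selected-substring
    space = ["".join(t) for t in product(ALPHABET, repeat=sub_len)]  # D = 4**sub_len
    if len(space) % n_subspaces:
        raise ValueError("D must be divisible by N")
    random.Random(seed).shuffle(space)                # assumed: any fixed shuffle
    size = len(space) // n_subspaces                  # D / N dimensions per subspace
    members = sorted(space[chosen * size:(chosen + 1) * size])
    code_len = 0                                      # log4(D / N), by repeated powers
    while 4 ** code_len < size:
        code_len += 1
    codes = ["".join(t) for t in product(ALPHABET, repeat=code_len)]
    # Assumption: the i-th member (sorted) gets the i-th lexically ordered code.
    return dict(zip(members, codes))

def sketch(seq, pattern, recode):
    """Panel b: keep k-mers whose p-selected-substring lies in the subspace,
    recoded as (remaining substring) + (code of the selected substring)."""
    k = len(pattern)
    picked = [i for i, c in enumerate(pattern) if c == "1"]
    rest = [i for i, c in enumerate(pattern) if c == "0"]
    out = set()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        code = recode.get("".join(kmer[i] for i in picked))
        if code is not None:                          # substring is in subspace s
            out.add("".join(kmer[i] for i in rest) + code)
    return out

def jaccard_containment(sa, sb):
    """Panel c: Jaccard and containment estimates from two sketches."""
    inter = len(sa & sb)
    union = len(sa | sb)
    smaller = min(len(sa), len(sb))
    j = inter / union if union else 0.0
    c = inter / smaller if smaller else 0.0
    return j, c

if __name__ == "__main__":
    p = "000010010000"                                # kssp from the caption
    recode = build_subspace(p, n_subspaces=4, chosen=0)
    a = sketch("ACGTACGTTGCAACGTAGCTAGCTAGGATCCA", p, recode)
    b = sketch("ACGTACGTTGCAACGTAGCTAGCTAGGTTCCA", p, recode)
    print(jaccard_containment(a, b))
```

Because each subspace holds D/N of the D dimensions, a sketch built this way keeps roughly 1/N of the distinct k-mers of a sequence, and summing the sketch sizes over all N subspaces recovers the number of distinct k-mers.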
