Fig. 1 | Genome Biology

Fig. 1

From: Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis


The main idea of Kssd. The Kssd idea originates from the naive sketching method of sampling k-mers directly from a sequence to form its sketch, as illustrated in a: the k-mers randomly drawn from sequences A and B are represented by blue dots S(A) and orange dots S(B), respectively, and the shared k-mers S(A) ∩ S(B) are represented by red dots. However, such a sketching method is ill-suited for similarity (or distance) estimation, since the two sketches S(A) and S(B) are probably drawn from unrelated regions of A and B and hence share very few k-mers (with an estimated Jaccard coefficient \( \hat{\boldsymbol{J}} \) close to 0) even when A and B are nearly identical. Despite its naivety, this thought inspired the idea of k-mer space sampling, as illustrated in b: first, a subset of k-mers s, termed the k-mer subspace (shown as red dots here), is drawn randomly from the k-mer space S (namely, the collection of all possible strings of length k over a given alphabet, shown as the green and red dots together here); then, the sketch of any given sequence is built by intersecting s with the k-mer set of this sequence. Since s is an unbiased sample of the k-mer space S, it is independent of any particular k-mer set. After sketching, the two sequences A and B, their intersection A ∩ B, and their union A ∪ B all undergo dimensionality reduction by the same expected fold, \( \frac{\mid \boldsymbol{S}\mid }{\mid \boldsymbol{s}\mid } \). Therefore, both the resemblance and the containment of the two sequences can be measured directly from their sketches, even if the sequences are of very different sizes.
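
The k-mer space sampling in panel b can be illustrated with a minimal Python sketch. It is not the Kssd implementation: the helper names (`kmer_set`, `in_subspace`, `sketch`, `jaccard`, `containment`) are hypothetical, and the subspace s is assumed here to be defined by a hash rule that keeps roughly one in `fold` of all possible k-mers, which stands in for the random draw from S described in the caption. The point it shows is that membership in s depends only on the k-mer itself, so the sketches of A, B, A ∩ B, and A ∪ B are all reduced by the same expected fold.

```python
import hashlib
import random


def kmer_set(seq, k):
    """Return the set of distinct k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def in_subspace(kmer, fold):
    """Assumed membership rule for the k-mer subspace s.

    A k-mer belongs to s when its hash is divisible by `fold`, so s covers
    about 1/fold of the k-mer space S. The decision depends only on the
    k-mer, never on the sequence it came from.
    """
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % fold == 0


def sketch(seq, k, fold):
    """Sketch of seq: its k-mer set intersected with the subspace s."""
    return {km for km in kmer_set(seq, k) if in_subspace(km, fold)}


def jaccard(a, b):
    """Resemblance |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Containment of a in b: |a & b| / |a| (0.0 when a is empty)."""
    return len(a & b) / len(a) if a else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic sequence A, and B as a copy of A with about 1% substitutions.
    a = "".join(rng.choice("ACGT") for _ in range(5000))
    b = "".join(rng.choice("ACGT") if rng.random() < 0.01 else c for c in a)
    k, fold = 12, 16
    print("Jaccard on full k-mer sets:", jaccard(kmer_set(a, k), kmer_set(b, k)))
    print("Jaccard on sketches:       ", jaccard(sketch(a, k, fold), sketch(b, k, fold)))
    print("Containment on sketches:   ", containment(sketch(a, k, fold), sketch(b, k, fold)))
```

Because the same rule filters every sequence, the intersection of two sketches equals the sketch of the intersection of the full k-mer sets, which is what makes the sketch-based Jaccard and containment values estimates of the full-set values. By contrast, the naive scheme in panel a samples each sequence independently, so this identity does not hold.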
