Fig. 1 | Genome Biology

Fig. 1

From: When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data

a Sketching applied to a genomic data stream. The genomic data stream is viewed through a window; the window size may be equivalent to the length of a sequence read, a genome or some other arbitrary length. The sequence within the window is decomposed into its constituent k-mers; each k-mer can be compared with its reverse complement so that only the canonical k-mer is kept. As k-mers are generated, they are sketched and the sketch data structure may be updated. The sketch can be evaluated, allowing feedback to the data stream process.

b Common sketching algorithms applied to a single k-mer from a set, using example parameters.

MinHash KHF: the k-mer is hashed by three functions, giving three values (green, blue, purple). The number of hash functions corresponds to the length of the sketch. Each value is compared with the corresponding position in the sketch; i.e. green is compared with the first value, blue with the second, and purple with the third. The sketch is updated with any new minimum; e.g. the blue value is smaller than the existing value in this position (3 < 66), so it replaces it.

Bloom filter: the k-mer is hashed by two functions, giving two values (red and orange). The output range of the hash functions corresponds to the length of the sketch, here 0–3. The hash values are used to set bits to 1 at the corresponding positions.

CountMin sketch: the k-mer is hashed by two functions, giving two values (red and brown). Each function corresponds to one row of the sketch, here row 0 or 1, and the output range of the functions corresponds to the length of the rows, here 0–2. So the first hash value (red) gives matrix position 0,0 and the second (brown) gives 1,1. The counters held at these positions in the matrix are incremented.

HyperLogLog: the k-mer is hashed by one function, giving a single value (10011). The prefix (brown) corresponds to a register, and the suffix (blue) corresponds to the bit-pattern observable. The suffix is compared with the existing value in register 1, is found to have more leading zeros, and so replaces the existing value in register 1.
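The update rules described in this caption can be illustrated with a minimal Python sketch. It is not the implementation behind the figure: the helper names (canonical, kmers, seeded_hash and the *_update functions) are hypothetical, the salted BLAKE2b hash family is an assumption chosen only because it ships with the standard library, and "canonical" is taken to mean the lexicographically smaller of a k-mer and its reverse complement, which is a common convention rather than something the caption states. The HyperLogLog register here stores the leading-zero count plus one, so an untouched register can remain 0.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement (assumed convention)."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def kmers(window, k):
    """Yield the canonical k-mers of the sequence inside one window."""
    for i in range(len(window) - k + 1):
        yield canonical(window[i:i + k])


def seeded_hash(kmer, seed, modulus):
    """Hypothetical hash family: BLAKE2b salted with the function index."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") % modulus


def minhash_khf_update(sketch, kmer, hash_range=2**32):
    """MinHash KHF: one hash function per sketch position; keep each minimum."""
    for i in range(len(sketch)):
        value = seeded_hash(kmer, i, hash_range)
        if value < sketch[i]:
            sketch[i] = value


def bloom_update(bits, kmer, n_hashes=2):
    """Bloom filter: each hash value selects one bit to set to 1."""
    for i in range(n_hashes):
        bits[seeded_hash(kmer, i, len(bits))] = 1


def bloom_contains(bits, kmer, n_hashes=2):
    """True if every selected bit is set (false positives are possible)."""
    return all(bits[seeded_hash(kmer, i, len(bits))] for i in range(n_hashes))


def countmin_update(table, kmer):
    """CountMin: hash function i picks one counter in row i to increment."""
    for row, counters in enumerate(table):
        counters[seeded_hash(kmer, row, len(counters))] += 1


def countmin_estimate(table, kmer):
    """The smallest selected counter is an upper bound on the true count."""
    return min(counters[seeded_hash(kmer, row, len(counters))]
               for row, counters in enumerate(table))


def hll_update(registers, hash_value, hash_bits):
    """HyperLogLog: prefix picks a register, suffix supplies the leading-zero count.

    len(registers) is assumed to be a power of two.
    """
    prefix_bits = (len(registers) - 1).bit_length()
    suffix_bits = hash_bits - prefix_bits
    register = hash_value >> suffix_bits
    suffix = hash_value & ((1 << suffix_bits) - 1)
    rank = suffix_bits - suffix.bit_length() + 1  # leading zeros + 1
    registers[register] = max(registers[register], rank)
```

To sketch a stream, each canonical k-mer produced by kmers(window, k) would be passed to whichever update function is in use; for HyperLogLog the hash value could be obtained as seeded_hash(kmer, 0, 2**hash_bits). The sketch sizes used in practice are far larger than the toy sizes drawn in the figure.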
