Fig. 6 | Genome Biology

Fig. 6

From: RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches


Differences between fixed-size and variable-size MinHash sketches on k-mer sets with similar sizes (a) and very different sizes (b). Each pair of shaded circles represents two k-mer sets (sets A and B), and the overlapping region represents their intersection. The diagrams below each pair of circles show the hash sets derived from the k-mer sets. With b-bit hash values, the hash space has size \(2^b\), so hash values range from 0 to \(2^b-1\). The points on the horizontal lines denote the hash values from sets A and B. Solid points are the minimum hash values that make up the sketches, while hollow points are hash values not included in the sketches. Solid arrows represent matches between hashes in the two sketches, while dotted lines represent matches that fall outside the sketches. In a, sets A and B have similar sizes, and each sketch consists of the 7 minimum hashes (solid points). The sketches of A and B yield a high resemblance because their hashes are similarly distributed across the hash space of size \(2^b\). In b, set B is three times the size of set A and fully contains A. The larger set B therefore covers the hash space more densely. In (\(b_1\)), both sketches have a fixed size of 7. The match rate (1/7) of minimum hashes in the sketches (solid arrow) is much smaller than the containment similarity (7/7) of the original k-mer sets. In (\(b_2\)), sketch sizes are variable and proportional to the sizes of the respective k-mer sets. Thus, the sketch size (number of solid points) of A is still 7, while the sketch size of B is now 21. The match rate of the minimum hashes of set A is then similar to the containment similarity between A and B.
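The effect described in panel b can be reproduced numerically. The following is a minimal sketch under stated assumptions, not the RabbitTClust implementation: the k-mer length of 21, the BLAKE2b hash, the random synthetic sequences, the base sketch size of 200, and all function names (`kmers`, `hash_kmer`, `sketch`, `match_rate`) are choices made for this illustration only. It builds a set B that is three times the size of A and contains it, then compares the match rate of A's sketch hashes under fixed-size and size-proportional sketches.

```python
import hashlib
import heapq
import random

K = 21  # k-mer length; an assumption for this illustration


def kmers(seq, k=K):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b chosen only for determinism)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(kmer_set, size):
    """Bottom-s MinHash sketch: the `size` smallest hash values of the set."""
    return set(heapq.nsmallest(size, (hash_kmer(x) for x in kmer_set)))


def match_rate(sketch_a, sketch_b):
    """Fraction of A's sketch hashes that also appear in B's sketch."""
    return len(sketch_a & sketch_b) / len(sketch_a)


# Synthetic data: B contains all of A and is about three times its size.
rng = random.Random(0)
seq_a = "".join(rng.choice("ACGT") for _ in range(2020))
seq_extra = "".join(rng.choice("ACGT") for _ in range(4020))
set_a = kmers(seq_a)
set_b = set_a | kmers(seq_extra)

# Containment of A in B on the original k-mer sets.
containment = len(set_a & set_b) / len(set_a)

base = 200  # sketch size used for A in both schemes (assumed)

# Fixed-size sketches: both sets keep the same number of minimum hashes.
fixed_rate = match_rate(sketch(set_a, base), sketch(set_b, base))

# Variable-size sketches: B's sketch grows in proportion to its set size.
size_b = round(base * len(set_b) / len(set_a))
variable_rate = match_rate(sketch(set_a, base), sketch(set_b, size_b))

print(f"containment of A in B:      {containment:.2f}")
print(f"match rate, fixed size:     {fixed_rate:.2f}")
print(f"match rate, variable size:  {variable_rate:.2f}")
```

With fixed-size sketches, only about one third of B's minimum hashes are expected to come from A, so the match rate should fall well below the true containment. With size-proportional sketches, most of A's minimum hashes should also fall within B's larger sketch, bringing the match rate close to the containment, although the exact values depend on the hash function and the random data.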
