Fig. 1 | Genome Biology

Fig. 1

From: Mash Screen: high-throughput sequence containment estimation for genome discovery

High-level diagram of set operations with MinHash. The size of the shaded circles represents k-mer set sizes, while the area of their overlap represents the cardinality of their intersections. Below each pair of circles is a diagram of the MinHash resemblance estimation for a sketch size of 5. Horizontal lines represent the space of possible hash values, of which there are 2^b, where b is the number of bits used for hashing. Here, diamonds are hashes of the k-mers in sets A and B, and black shading indicates the smallest 5 hashes in each. Vertical lines indicate matching hashes and are solid only if both hashes are in the bottom sketch of their respective set. In a, genomes of similar sizes are well-suited for resemblance estimation, since their hashes are similarly distributed across the hash space. Matching hashes are usually both in, or both out, of the bottom 5. However, if the genomes are of very different sizes, as in b, the larger genome will saturate the space more densely. This causes a higher fraction of matching hashes to be contained in only the sketch of the smaller set, underestimating the containment of A in B. Thus, all hashes of B must be considered to accurately estimate the containment of A.
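
The contrast drawn in panels a and b can be tried out with a minimal Python sketch. It is an illustration under stated assumptions, not the Mash implementation: the hash function (64-bit BLAKE2b), the k-mer length of 21, the sketch size of 100, the random test sequences, and all function names are hypothetical choices made here, and canonical (strand-independent) k-mers are ignored for brevity. A small "genome" A is embedded in a much larger B, so the true containment of A in B is 1.

```python
import hashlib
import random


def kmer_hashes(seq, k=21):
    """Return the set of 64-bit hash values of all k-mers in seq."""
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }


def bottom_sketch(hashes, s):
    """Bottom-s sketch: the s smallest hash values of a set."""
    return set(sorted(hashes)[:s])


def resemblance(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from two bottom-s sketches.

    The s smallest hashes of the merged sketches are the bottom-s sketch
    of the union; the estimate is the fraction of them present in both.
    """
    merged = sorted(sketch_a | sketch_b)[:s]
    shared = sum(1 for h in merged if h in sketch_a and h in sketch_b)
    return shared / len(merged)


def containment_from_sketches(sketch_a, sketch_b):
    """Naive containment of A in B using only B's sketch (panel b problem)."""
    return len(sketch_a & sketch_b) / len(sketch_a)


def containment_screen(sketch_a, all_hashes_b):
    """Containment of A in B, checking A's sketch against all hashes of B."""
    return len(sketch_a & all_hashes_b) / len(sketch_a)


# Hypothetical test data: A (2 kbp) sits inside B (50 kbp).
rng = random.Random(0)
small = "".join(rng.choices("ACGT", k=2000))
large = (
    "".join(rng.choices("ACGT", k=24000))
    + small
    + "".join(rng.choices("ACGT", k=24000))
)

s = 100
hashes_a, hashes_b = kmer_hashes(small), kmer_hashes(large)
sketch_a, sketch_b = bottom_sketch(hashes_a, s), bottom_sketch(hashes_b, s)

print("resemblance estimate:      ", resemblance(sketch_a, sketch_b, s))
print("containment (sketches only):", containment_from_sketches(sketch_a, sketch_b))
print("containment (all of B):     ", containment_screen(sketch_a, hashes_b))
```

Because every k-mer of A also occurs in B, checking A's sketch against all hashes of B recovers a containment of 1, whereas comparing the two sketches alone can only ever find a subset of those matches and so yields an estimate that is no larger, and in this size-mismatched setting typically much smaller.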
