From: Mash Screen: high-throughput sequence containment estimation for genome discovery

Correlation of Mash containment scores with pairwise Mash distances. Points represent RefSeq genomes with sizes between 100K and 20M. The x-axis is the Mash distance to the nearest genome in the Shakya synthetic metagenome, which serves as an expected containment score. For the y-axis, raw reads from this metagenome were run through Mash Screen (sketch size 1000) to obtain an actual containment score for each of the genomes. Left, real Illumina reads sequenced from the mock community (SRR606249). The area circled in red highlights RefSeq genomes with higher than expected containment scores caused by contamination of the sequencing run. This contamination was independently confirmed via read mapping, revealing the presence of at least four additional species. Circled in magenta and orange are two clades of F. nucleatum that are consistent with low abundance of the reference strain (magenta) and intra-species contamination (orange), both of which have been previously described for this dataset. Right, simulating reads from only the known constituents corrects the outliers and yields the expected correlation. Not shown: 1645 points in (0.28 ≤ x ≤ 0.52, y = 0) (left plot) and 356 points in (0.32 ≤ x ≤ 0.52, y = 0) (right plot), representing the limits of sensitivity of the default sketch size (1000) used for the y-axis, compared with the higher sensitivity used for the x-axis (sketch size of 100,000). No points lie to the right of the plot areas.
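The sensitivity limits quoted in the caption can be approximated with a short calculation. The sketch below is illustrative only and rests on assumptions not stated in the caption: a k-mer size of 21 (the Mash default), the Mash distance formula D = -(1/k) ln(2j / (1 + j)) for a Jaccard estimate j, and a Mash Screen identity of c^(1/k) for a containment fraction c. The function names are hypothetical. It asks what the largest resolvable distance is when only a single hash in the sketch matches.

```python
import math

K = 21  # assumed k-mer size (Mash default); not stated in the caption


def max_mash_distance(sketch_size: int, k: int = K) -> float:
    """Largest finite Mash distance: exactly one shared hash in the sketch."""
    j = 1.0 / sketch_size  # smallest non-zero Jaccard estimate
    return -math.log(2.0 * j / (1.0 + j)) / k


def max_screen_distance(sketch_size: int, k: int = K) -> float:
    """One minus the smallest non-zero Mash Screen identity, taking
    identity = containment ** (1 / k) with one matching hash."""
    c = 1.0 / sketch_size  # smallest non-zero containment fraction
    return 1.0 - c ** (1.0 / k)


if __name__ == "__main__":
    print(f"x-axis limit (sketch 100,000): {max_mash_distance(100_000):.3f}")
    print(f"y-axis limit (sketch 1000):    {max_screen_distance(1000):.3f}")
```

Under these assumptions the two limits come out close to 0.52 and 0.28, which is consistent with the ranges given for the points not shown: genomes more distant than about 0.28 can fall below the resolution of a 1000-hash sketch and score zero, while the 100,000-hash sketch still resolves distances up to about 0.52.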

