Fig. 4. From: Mash Screen: high-throughput sequence containment estimation for genome discovery

Correlation of Mash Screen containment scores with DIAMOND read mapping identity, each using six-frame translations to compare nucleotide reads to protein reference sequences. Each point represents a gene from the recruited set chosen for the experiment. Position on the x-axis corresponds to global alignment identity estimated from mapping, while position on the y-axis represents the containment score for the same protein as reported by Mash Screen. Coloring represents the Mash Screen p value. The points indicated by arrows have legitimate alignments that were not picked up by DIAMOND with its most sensitive settings: in orange, a short (20 aa) sequence (AMP55843.1), and in magenta, sequences with low-complexity regions (ESV13988.1, EYQ74458.1, WP_002468660.1) that were not aligned despite disabling the low-complexity filter. Points along the x-axis represent the limits of Mash Screen’s sensitivity. For these genes, mismatches were common enough, and evenly distributed enough, to change all k-mers indexed by Mash Screen, which in this case were 9 amino acids long.
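
The sensitivity limit described in the caption can be illustrated with a toy calculation. The sketch below is not Mash Screen's implementation (which works on MinHash sketches of k-mers from six-frame translations); it uses exact k-mer sets on a made-up 45-residue protein string, and the helper names (`kmers`, `containment`, `mutate_every`, `mutate_prefix`, `identity`) are hypothetical. It only shows why the same number of mismatches can leave most 9-mers intact when clustered, yet remove every shared 9-mer when spaced evenly.

```python
K = 9  # k-mer length in amino acids, as stated in the caption


def kmers(seq, k=K):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(reference, query, k=K):
    """Fraction of the reference's distinct k-mers that also occur in the query."""
    ref = kmers(reference, k)
    if not ref:
        return 0.0
    return len(ref & kmers(query, k)) / len(ref)


def mutate_every(seq, step, replacement="X"):
    """Substitute one residue every `step` positions (evenly spaced mismatches)."""
    chars = list(seq)
    for i in range(step - 1, len(chars), step):
        chars[i] = replacement
    return "".join(chars)


def mutate_prefix(seq, count, replacement="X"):
    """Substitute the first `count` residues (clustered mismatches)."""
    return replacement * count + seq[count:]


def identity(a, b):
    """Per-position identity of two equal-length, ungapped sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Made-up protein sequence (assumption, for illustration only; contains no 'X').
ref = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQ"

evenly = mutate_every(ref, step=K)      # one mismatch in every window of 9
clustered = mutate_prefix(ref, count=5)  # same number of mismatches, all at the start

print("identity (even, clustered):", identity(ref, evenly), identity(ref, clustered))
print("containment, evenly spaced:", containment(ref, evenly))
print("containment, clustered:    ", containment(ref, clustered))
```

Both mutated sequences differ from the reference at 5 of 45 positions, so their identity is the same, but with one mismatch in every window of 9 residues no reference 9-mer survives and the containment drops to zero. This corresponds to the points lying along the x-axis.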