Table 3 Potential reduction in classification database size by pre-screening, illustrated using constituents of the Shakya metagenome

| Containment (≥) | Positives | Positives (%) | Database genomes | Database gigabases |
|---|---|---|---|---|
| 0 | 58/58 | 100 | 83,327 | 330 |
| 0.9 | 58/58 | 100 | 1,585 | 5.76 |
| 0.99 | 56/58 | 96 | 227 | 0.878 |
| 0.999 | 53/58 | 91 | 76 | 0.237 |
| 0.9999 | 43/58 | 74 | 58 | 0.182 |
  1. Containment refers to the threshold on the Mash containment score used to filter the genomes in the database, with 0 meaning no filtering and 0.9 meaning that all genomes with a containment below 90% are filtered out. Positives refers to how many of the known constituents had scores at or above that threshold and thus would pass pre-screening (6 of the 64 constituents were excluded because they have since been removed from RefSeq and thus were not in the reference database). Database genomes refers to the total number of genomes in the database with scores at or above the threshold, and Database gigabases refers to the total number of bases in those genomes.
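
The columns of the table can be computed from per-genome containment scores with a simple filter. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Genome` record, the `prescreen_summary` function, and the four toy genomes are hypothetical and serve only to illustrate how each column is derived; the toy values are not taken from the Shakya data.

```python
from typing import NamedTuple


class Genome(NamedTuple):
    """Hypothetical record for one reference genome (illustration only)."""
    name: str
    containment: float    # Mash containment score, between 0 and 1
    bases: int            # genome length in bases
    is_constituent: bool  # True if the genome is a known member of the sample


def prescreen_summary(genomes, threshold):
    """Summarize a database after keeping genomes with containment >= threshold."""
    kept = [g for g in genomes if g.containment >= threshold]
    total_constituents = sum(g.is_constituent for g in genomes)
    positives = sum(g.is_constituent for g in kept)
    pct = 100.0 * positives / total_constituents if total_constituents else 0.0
    return {
        "positives": positives,
        "positives_pct": pct,
        "database_genomes": len(kept),
        "database_gigabases": sum(g.bases for g in kept) / 1e9,
    }


# Toy database with made-up values (assumption, not real data).
toy_database = [
    Genome("A", 0.9995, 4_000_000, True),
    Genome("B", 0.95, 2_000_000, True),
    Genome("C", 0.92, 5_000_000, False),
    Genome("D", 0.10, 3_000_000, False),
]

if __name__ == "__main__":
    for threshold in (0, 0.9, 0.99):
        print(threshold, prescreen_summary(toy_database, threshold))
```

In this toy case, raising the threshold from 0.9 to 0.99 shrinks the database from three genomes to one but also drops one of the two known constituents, which mirrors the trade-off between database size and positives shown in the table.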