
Table 3 Potential reduction in classification database size by pre-screening, illustrated using constituents of the Shakya metagenome

From: Mash Screen: high-throughput sequence containment estimation for genome discovery

| Containment (≥) | Positives | Positives (%) | Database genomes | Database gigabases |
|---|---|---|---|---|
| 0 | 58/58 | 100 | 83,327 | 330 |
| 0.9 | 58/58 | 100 | 1585 | 5.76 |
| 0.99 | 56/58 | 96 | 227 | 0.878 |
| 0.999 | 53/58 | 91 | 76 | 0.237 |
| 0.9999 | 43/58 | 74 | 58 | 0.182 |

  1. Containment is the Mash containment score threshold used to filter genomes in the database: 0 means no filtering, and 0.9 means that all genomes with a containment below 90% are filtered out. Positives is the number of known constituents that scored at or above the threshold and would therefore pass pre-screening (6 of the 64 constituents were excluded because they have since been removed from RefSeq and so were not in the reference database). Database genomes is the total number of genomes in the database scoring at or above the threshold, and Database gigabases is the total number of bases in those genomes.
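The way each table row follows from a single threshold can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the `GenomeScore` record, the `prescreen` and `summarize` helpers, and the toy genomes and scores are all hypothetical, and the containment scores are assumed to have been computed already.

```python
from typing import NamedTuple


class GenomeScore(NamedTuple):
    """Hypothetical record: one database genome and its containment score."""
    genome_id: str
    containment: float  # containment score in [0, 1]
    length_bp: int      # genome length in bases


def prescreen(scores, threshold):
    """Keep genomes whose containment is at or above the threshold."""
    return [g for g in scores if g.containment >= threshold]


def summarize(scores, constituents, threshold):
    """Compute one table row: positives, reduced database size in genomes and gigabases."""
    kept = prescreen(scores, threshold)
    kept_ids = {g.genome_id for g in kept}
    positives = len(kept_ids & constituents)
    return {
        "threshold": threshold,
        "positives": positives,
        "positives_pct": 100 * positives / len(constituents),
        "database_genomes": len(kept),
        "database_gigabases": sum(g.length_bp for g in kept) / 1e9,
    }


# Toy data (invented for illustration; not taken from the Shakya metagenome).
scores = [
    GenomeScore("genome_A", 0.99995, 5_000_000),
    GenomeScore("genome_B", 0.995, 3_000_000),
    GenomeScore("genome_C", 0.95, 4_000_000),
    GenomeScore("genome_D", 0.40, 6_000_000),
]
constituents = {"genome_A", "genome_B"}  # known members of the sample

for threshold in (0, 0.9, 0.99, 0.9999):
    print(summarize(scores, constituents, threshold))
```

Raising the threshold shrinks the database monotonically, but at the strictest setting in this toy example one true constituent (`genome_B`) is lost, which mirrors the trade-off between database size and positives shown in the table.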