
Table 3 Sequencing runs and assemblies searched against the Mash RefSeq database

From: Mash: fast genome and metagenome distance estimation using MinHash

| Organism | Tech | Type | NCBI accession | Size (Mbp) | Time (CPU s) | LCA | Best hit |
|---|---|---|---|---|---|---|---|
| E. coli K12 MG1655 | MiSeq | Assembly | (SPAdes) | 4.6 | 2.45 | Entero. | E. coli K12 MG1655 |
| E. coli K12 MG1655 | PacBio | Assembly | GCA_000801205 | 4.6 | 2.66 | Entero. | E. coli K12 MG1655 |
| E. coli DH1 | ABI 3730 | Reads | (Trace Archive) | 60 | 17.08 | Entero. | E. coli DH1 |
| E. coli K12 MG1655 | 454 | Reads | SRR797242 | 233 | 57.12 | Entero. | E. coli K12 MG1655 |
| E. coli K12 MG1655 | Ion PGM | Reads | SRR515925 | 407 | 72.01 | E. coli | E. coli K12 1655 |
| E. coli K12 MG1655 | MiSeq | Reads | SRR1770413 | 387 | 72.01 | Entero. | E. coli KLY |
| E. coli K12 MT203 | HiSeq | Reads | SRR490124 | 2155 | 369.86 | E. coli | E. coli GCF_000833635 |
| E. coli K12 MG1655 | PacBio | Reads | SRR1284073 | 397 | 77.96 | E. coli | E. coli XH140A GCF_000226585 |
| E. coli K12 MG1655 | MinION | 1D | ERR764952..55 | 248 | 55.52 | Entero. | E. coli O113 H21 |
| E. coli K12 MG1655 | MinION | 2D | ERR764952..55 | 134 | 27.82 | E. coli | E. coli GCF_000953515 |
| B. anthracis Ames | MinION | 1D + 2D | SRR2671867 | 210 | 44.66 | B. anthracis | B. anthracis str. Carbosap |
| B. cereus ATCC 10987 | MinION | 1D + 2D | SRR2671868 | 266 | 76.85 | B. cereus ATCC 10987 | B. cereus ATCC 10987 |
| Zaire ebolavirus | MinION | 1D + 2D | ERR1050070 | 8.7 | 2.06 | Zaire ebolavirus | Zaire ebolavirus Mayinga |

  1. In all cases, Mash search required 21 MB of RAM for genome assemblies and 209 MB of RAM for sequencing runs (due to the additional Bloom filter overhead). Organism: source strain. Tech: sequencing technology (ABI 3730, 454 GS FLX, Illumina MiSeq, Illumina HiSeq, Ion PGM, PacBio RSII, or Oxford Nanopore MinION). Type: assembly, reads, or 1D and 2D nanopore reads. NCBI accession: accession of the dataset or reads. The SPAdes [63] assembly was derived from the MiSeq reads. Size: total dataset size in Mbp. LCA: lowest common ancestor classification, based on the NCBI taxonomy, of the hits within a significance tolerance of the best hit. In several cases, the LCA is at the family level (Enterobacteriaceae, shown as Entero.) due to significant Mash hits to both E. coli and S. sonnei species. This is a known species naming conflict within the NCBI taxonomy, with some genomes sharing ANI >98 % between these species. Best hit: the hit with the smallest significant distance.
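
The LCA column can be illustrated with a minimal sketch. This is not Mash's implementation: the lineages are simplified to three ranks and hard-coded rather than read from the NCBI taxonomy, the distances are hypothetical, and the "significance tolerance" is assumed here to be a simple cutoff on the difference in distance from the best hit. The names `LINEAGES`, `lowest_common_ancestor`, and `classify` are illustrative only.

```python
from typing import List, Optional, Sequence, Tuple

# Simplified lineages (family, genus, species), hard-coded for illustration.
# A real classifier would look these up in the NCBI taxonomy.
LINEAGES = {
    "E. coli": ("Enterobacteriaceae", "Escherichia", "Escherichia coli"),
    "S. sonnei": ("Enterobacteriaceae", "Shigella", "Shigella sonnei"),
}


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> Optional[str]:
    """Return the deepest taxon shared by all lineages, or None if none is shared."""
    lca = None
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # lineages diverge at this rank
        lca = ranks[0]
    return lca


def classify(hits: List[Tuple[str, float]], tolerance: float) -> Optional[str]:
    """Take the LCA of all hits whose distance is within `tolerance` of the best.

    `hits` is a list of (species name, distance) pairs. The tolerance rule
    is an assumption made for this sketch.
    """
    best = min(distance for _, distance in hits)
    kept = [LINEAGES[name] for name, distance in hits if distance - best <= tolerance]
    return lowest_common_ancestor(kept)


# Hypothetical distances: both hits are close, so the LCA falls back to the family.
print(classify([("E. coli", 0.001), ("S. sonnei", 0.002)], tolerance=0.005))
# Hypothetical distances: only E. coli is within tolerance, so the LCA is the species.
print(classify([("E. coli", 0.001), ("S. sonnei", 0.02)], tolerance=0.005))
```

Under these assumptions, two near-equal hits to E. coli and S. sonnei resolve only to the family level, which mirrors the "Entero." entries in the table.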