Mash sketch
To construct a MinHash sketch, Mash first determines the set of constituent k-mers by sliding a window of length k across the sequence. Mash supports arbitrary alphabets (e.g. nucleotide or amino acid) and both assembled and unassembled sequences. Without loss of generality, here we will assume a nucleotide alphabet Σ = {A,C,G,T}. Depending on the alphabet size and choice of k, each k-mer is hashed to either a 32-bit or 64-bit value via a hash function, h. For nucleotide sequence, Mash uses canonical k-mers by default to allow strand-neutral comparisons. In this case, only the lexicographically smaller of the forward and reverse complement representations of a k-mer is hashed. For a given sketch size s, Mash returns the s smallest hashes output by h over all k-mers in the sequence (Fig. 1). Typically referred to as a “bottom-k sketch” for a sketch of size k, we refer to these simply as “bottom sketches” to avoid confusion with the k-mer size k. For a sketch size s and genome size n, a bottom sketch can be efficiently computed in O(n log s) time by maintaining a sorted list of size s and updating the current sketch only when a new hash is smaller than the current sketch maximum. Further, the probability that the i-th hash of the genome will enter the sketch is s/i, so the expected runtime of the algorithm is O(n + s log s log n) [4], which becomes nearly linear when n > > s.
As demonstrated by Fig. 3, a sketch comprising 400 32-bit hash values is sufficient to roughly group microbial genomes by species. With these parameters, the resulting sketch size equals 1.6 kB for each genome. For large genomes, this represents an enormous lossy compression (e.g. compared to the 750 MB needed to store a 3 Gbp genome using 2-bit encoding). However, the probability of a given k-mer K appearing in a random genome X of size n is:
$$ P\left(K\in X\right)=1-{\left(1-{\left|\Sigma \right|}^{-k}\right)}^n $$
(1)
Thus, for k = 16 the probability of observing a given k-mer in a 3 Gbp genome is 0.50 and 25 % of k-mers are expected to be shared between two random 3 Gbp genomes by chance alone. This will skew any k-mer based distance and make distantly related genomes appear more similar than reality. To avoid this phenomenon, it is sufficient to choose a value of k that minimizes the probability of observing a random k-mer. Given a known genome size n and the desired probability q of observing a random k-mer (e.g. 0.01), this can be computed as [41]:
$$ k\mathit{\hbox{'}}=\left\lceil { \log}_{\left|\Sigma \right|}\left(n\left(1-q\right)/q\right)\right\rceil $$
(2)
which yields k = 14 and k = 19 for 5 Mbp and 3 Gbp genomes (q = 0.01), respectively. We have found the parameters k = 21 and s = 1000 give accurate estimates in most cases (including metagenomes), so this is set as the default and still requires just 8 kB per sketch. However, for constructing the RefSeq database, k = 16 was chosen so that each hash could fit in 32 bits, minimizing the database size at the expense of reduced specificity for larger genomes. The small k also improves sensitivity, which helps when comparing noisy data like single-molecule sequencing (Additional file 1: Figures S2 and S3).
Lastly, for sketching raw sequencing reads, Mash provides both a two-stage MinHash and Bloom filter strategy to remove erroneous k-mers. These approaches assume that redundancy in the data (e.g. depth of coverage >5) will result in true k-mers appearing multiple times in the input, while false k-mers will appear only a few times. Given a coverage threshold c, Mash can optionally ignore such low-abundance k-mers with counts less than c. By default, the coverage threshold is set to one and all k-mers are considered for the sketch. Increasing this threshold enables the two-stage MinHash filter strategy, which is based on tracking both the k-mer hashes in the current sketch and a secondary set of candidate hashes. At any time, the current sketch contains the s smallest hashes of all k-mers that have been observed at least c times and the candidate set contains hashes that are smaller than the largest value in the sketch (sketch max), but have been observed less than c times. When processing new k-mers, those with a hash greater than the sketch max are immediately discarded, as usual. However, if a new hash is smaller than the current sketch max, it is checked against the candidate set. If absent, it is added to this set. If present with a count less than c – 1, its counter is incremented. If present with a count of c – 1 or greater, it is removed from the candidate set and added to the sketch. At this point, the sketch max has changed and the candidate set can be pruned to contain only values less than the new sketch maximum. The result of this online method is equivalent to running the MinHash algorithm on only those k-mers that occur c or more times in the input. However, in the worst case, if all k-mers in the input occur less than the coverage threshold c, no hashes would escape the candidate set and memory use would increase with each new k-mer processed.
Alternatively, a Bloom filter can be used to probabilistically exclude single-copy k-mers using a fixed amount of memory. In this approach, a Bloom filter is maintained instead of a candidate list and new hashes are inserted into the sketch only if they are less than sketch max and found in the Bloom filter. If a new hash would have otherwise been inserted in the sketch but was not found in the Bloom filter, it is inserted into the Bloom filter so that subsequent appearances of the hash will pass. This effectively excludes many single-copy k-mers from the sketch, but does not guarantee that all will be filtered. With this approach, filtering k-mers with a copy number greater than one would also be possible using a counting Bloom filter, but this has not been implemented since the exact method typically outperforms the Bloom method in practice, both in terms of accuracy and memory usage.
Mash distance
A MinHash sketch of size s = 1 is equivalent to the subsequent “minimizer” concept of Roberts et al. [42], which has been used in genome assembly [43], k-mer counting [44], and metagenomics [45]. Importantly, the more general MinHash concept permits an approximation of the Jaccard index \( J\left(A,B\right)=\frac{\left|A\cap B\right|}{\left|A\cup B\right|} \) between two k-mer sets A and B. Mash follows Broder’s original formulation and merge-sorts two bottom sketches S(A) and S(B) to estimate the Jaccard index [4]. The merge is terminated after s unique hashes have been processed (or both sketches exhausted), and the Jaccard estimate is computed as \( j=\frac{x}{s^{\prime }} \) for x shared hashes found after processing s’ hashes. Because the sketches are stored in sorted order, this requires only O(s) time and effectively computes:
$$ J\left(A,B\right)=\frac{\left|A\cap B\right|}{\left|A\cup B\right|}\approx \frac{\left|S\left(A\cup B\right)\cap S(A)\cap S(B)\right|}{\left|S\left(A\cup B\right)\right|} $$
(3)
which is an unbiased estimate of the true Jaccard index, as illustrated in Fig. 1. Conveniently, the error bound of the Jaccard estimate \( \varepsilon =O\left(\frac{1}{\sqrt{s}}\right) \) relies only on the sketch size and is independent of genome size [46]. Specific confidence bounds are given below and in Additional file 1: Figure S1. Note, however, that the relative error can grow quite large for very small Jaccard values (i.e. divergent genomes). In these cases, a larger sketch size or smaller k is needed to compensate. For flexibility, Mash can also compare sketches of different size, but such comparisons are constrained by the smaller of the two sketches s < u and only the s smallest values are considered.
The Jaccard index is a useful measure of global sequence similarity because it correlates with ANI, a common measure of global sequence similarity. However, like the MUM index [19], J is sensitive to genome size and simultaneously captures both point mutations and gene content differences. For distance-based applications, the Jaccard index can be converted to the Jaccard distance J
δ
(A, B) = 1 − J(A, B), which is related to the q-gram distance but without occurrence counts [47]. This can be a useful metric for clustering, but is non-linear in terms of the sequence mutation rate. In contrast, the Mash distance D seeks to directly estimate a mutation rate under a simple Poisson process of random site mutation. As noted by Fan et al. [22], given the probability d of a single substitution, the expected number of mutations in a k-mer is λ = kd. Thus, under a Poisson model (assuming unique k-mers and random, independent mutation), the probability that no mutation will occur in a given k-mer is e
−kd, with an expected value equal to the fraction of conserved k-mers w to the total number of k-mers t in the genome, \( \frac{w}{t} \) . Solving \( {e}^{-kd}=\frac{w}{t} \) gives \( d = -\frac{1}{k} \ln \frac{w}{t} \). To account for two genomes of different sizes, Fan et al. [22] set t to the smaller of the two genome’s k-mer counts, thereby measuring containment of the k-mer set. In contrast, Mash sets t to the average genome size n, thereby penalizing for genome size differences and measuring resemblance (e.g. to avoid a distance of zero between a phage and a genome containing that phage). Finally, because the Jaccard estimate j can be framed in terms of the average genome size \( j=\frac{w}{2n-w} \), the fraction of shared k-mers can be framed in terms of the Jaccard index \( \frac{w}{n}=\frac{2j}{1+j} \), yielding the Mash distance:
$$ D = -\frac{1}{k} \ln \frac{2j}{1+j} $$
(4)
Equation 4 carries many assumptions and does not attempt to model more complex evolutionary processes, but closely approximates the divergence of real genomes (Fig. 2). With appropriate choices of s and k, it can be used as a replacement for costly ANI computations. Table 1 and Additional file 1: Figure S2 give error bounds on the Mash distance for various sketch sizes and Additional file 1: Figure S3 illustrates the relationship between the Jaccard index, Mash distance, k-mer size, and genome size.
Mash P value
In the case of distantly related genomes it can be difficult to judge the significance of a given Jaccard index or Mash distance. As illustrated by Eq. 1, for small k and large n there can be a high probability of a random k-mer appearing by chance. How many k-mers then are expected to match between the sketches of two unrelated genomes? This depends on the sketch size and the probability of a random k-mer appearing in the genome, where the expected Jaccard index r between two random genomes X and Y is given by:
$$ r=\frac{P\left(K\in X\right)P\left(K\in Y\right)}{P\left(K\in X\right)+P\left(K\in Y\right)-P\left(K\in X\right)P\left(K\in Y\right)} $$
(5)
From Eq. 1, the probability of a random k-mer depends both on the size of k, which is known, and total number of k-mers in the genome, which can be estimated from the sketch [48]. For the sketch size s, maximum hash value in the sketch v, and hash bits b, the number of distinct k-mers in the genome is estimated as n = 2b
s/v. For the population size m of all distinct k-mers in X and Y and the number of shared k-mers w, where:
$$ m=\left|X\cup Y\right|=\left|X\right|+\left|Y\right|-w $$
(6)
the probability p of observing x or more matches between the sketches of these two genomes can be computed using the hypergeometric cumulative distribution function. For the sketch size s, shared size w, and population size m:
$$ p\left(x;s;w;m\right)=1-{\displaystyle \sum_{i=0}^{x-1}}\frac{\left(\begin{array}{c}\hfill w\hfill \\ {}\hfill i\hfill \end{array}\right)\left(\begin{array}{c}\hfill m-w\hfill \\ {}\hfill s-i\hfill \end{array}\right)}{\left(\begin{array}{c}\hfill m\hfill \\ {}\hfill s\hfill \end{array}\right)} $$
(7)
However, because m is typically very large and the sketch size is relatively much smaller, it is more practical to approximate the hypergeometric distribution with the binomial distribution where the expected value of \( r=\frac{w}{m} \) can be computed using Eq. 5:
$$ p\left(x;s;r\right)=1-{\displaystyle \sum_{i=0}^{x-1}}\left(\begin{array}{c}\hfill s\hfill \\ {}\hfill i\hfill \end{array}\right){r}^i{\left(1-r\right)}^{s-i} $$
(8)
Mash uses Eq. 8 to compute the P value of observing a given Mash distance (or less) under the null hypothesis that both genomes are random collections of k-mers. This equation does not account for compositional characteristics like GC bias, but it is useful in practice for ruling out clearly insignificant results (especially for small values of k and j). Interestingly, past work suggests that a random model of k-mer occurrence is not entirely unreasonable [41]. Note, this P value only describes the significance of a single comparison and multiple testing must be considered when searching against a large database.
RefSeq clustering
By default, Mash uses 32-bit hashes for k-mers where |Σ|k ≤ 232 and 64-bit hashes for |Σ|k ≤ 264. Thus, to minimize the resulting size of the all-RefSeq sketches, k = 16 was chosen along with a sketch size s = 400. While not ideal for large genomes (due to the small k) or highly divergent genomes (due to the small sketch), these parameters are well suited for determining species-level relationships between the microbial genomes that currently constitute the majority of RefSeq. For similar genomes (e.g. ANI >95 %), sketches of a few hundred hashes are sufficient for basic clustering. As ANI drops further, the Jaccard index rapidly becomes very small and larger sketches are required for accurate estimates. Confidence bounds for the Jaccard estimate can be computed using the inverse cumulative distribution function for the hypergeometric or binomial distributions (Additional file 1: Figure S1). For example, with a sketch size of 400, two genomes with a true Jaccard index of 0.1 (x = 40) are very likely to have a Jaccard estimate between 0.075 and 0.125 (P >0.9, binomial density for 30 ≤ x ≤ 50). For k = 16, this corresponds to a Mash distance between 0.12 and 0.09.
RefSeq Complete release 70 was downloaded from NCBI FTP (ftp://ftp.ncbi.nlm.nih.gov). Using FASTA and Genbank records, replicons and contigs were grouped by organism using a combination of two-letter accession prefix, taxonomy ID, BioProject, BioSample, assembly ID, plasmid ID, and organism name fields to ensure distinct genomes were not combined. In rare cases this strategy resulted in over-separation due to database mislabeling. Plasmids and organelles were grouped with their corresponding nuclear genomes when available; otherwise they were kept as separate entries. Sequences assigned to each resulting “organism” group were combined into multi-FASTA files and chunked for easy parallelization. Each chunk was sketched with:
mash sketch -s 400 -k 16 -f -o chunk *.fasta
This required 26.1 CPU h on a heterogeneous cluster of AMD processors. (Note: option -f is not required in Mash v1.1.) The resulting, chunked sketch files were combined with the Mash paste function to create a single “refseq.msh” file containing all sketches. Each chunked sketch file was then compared against the combined sketch file, again in parallel, using:
mash dist -t refseq.msh chunk.msh
This required 6.9 CPU h to create pairwise distance tables for all chunks. The resulting chunk tables were concatenated and formatted to create a PHYLIP formatted distance table.
For the ANI comparison, a subset of 500 Escherichia genomes was selected to present a range of distances yet bound the runtime of the comparatively expensive ANI computation. ANI was computed using the MUMmer v3.23 “dnadiff” program and extracting the 1-to-1 “AvgIdentity” field from the resulting report files [49]. The corresponding Mash distances were taken from the all-vs-all distance table as described above.
For the primate phylogeny, the FASTA files were sketched separately, in parallel, taking an average time of 8.9 min each and a maximum time of 11 min (Intel Xeon E5-4620 2.2 GHz processor and solid-state drive). The sketches were combined with Mash paste and the combined sketch given to dist. These operations took insignificant amounts of time, and table output from dist was given to PHYLIP v3.695 [50] neighbor to produce the phylogeny. Accessions for all genomes used are given in Additional file 1: Table S1. The UCSC tree was downloaded from [51].
RefSeq search
Each dataset listed in Table 3 was compared against the full RefSeq Mash database using the following command for assemblies:
mash dist refseq.msh seq.fasta
and the following command for raw reads:
mash dist -u refseq.msh seq.fasta
which enabled the Bloom filter to remove erroneous, single-copy k-mers. (Note: option -u was replaced by -b in Mash v1.1.) Hits were sorted by distance and all hits within one order of magnitude of the most significant hit (P ≤10–10) were used to compute the lowest common ancestor using an NCBI taxonomy tree. The RefSeq genome with the smallest significant distance, with ties broken by P value, was also reported.
Metagenomic clustering
The Global Ocean Survey (GOS) dataset [35] was downloaded from the iMicrobe FTP site (ftp://ftp.imicrobe.us/projects/26). The full dataset was split into 44 samples corresponding to Table 1 in Rusch et al. [35]. This is the dataset used for benchmarking in the Compareads paper [33] and that analysis was replicated using both Mash and COMMET [34], the successor to Compareads. COMMET v24/07/2014 was run with default parameters (t = 2, m = all, k = 33) as:
python Commet.py read_sets.txt
where “read_sets.txt” points to the gzipped FASTQ files. This required 34 CPU h (2069 CPU min) and 4 GB of RAM. As suggested by COMMET’s author, samples were also truncated to contain the same number of reads to improve runtime (50,980 reads per sample, Nicolas Maillet, personal communication). On this reduced dataset COMMET required 10 CPU h (598 CPU min). The heatmaps were generated in R using the quartile coloring of COMMET [34] (Additional file 1: Supplementary Note 2). Additional file 1: Figure S8 shows the original heatmap generated by COMMET on this dataset. Mash was run as:
mash sketch -u -g 3500 -k 21 -s 10000 -o gos *.fa
This required 0.6 CPU h (37 CPU min) and 19.6 GB of RAM with Bloom filtering or 8 MB without. (Note: options -u and -g were replaced by -b in Mash v1.1.) The resulting combined sketch file totaled just 3.4 MB in size, compared to the 20 GB FASTA input. Mash distances were computed for all pairs of samples as:
mash dist -t gos.msh gos.msh
which required less than 1 CPU s to complete.
All available HMP and MetaHIT samples were downloaded: HMP reads [52], HMP assemblies [53], MetaHIT reads (ENA accession ERA000116), and MetaHIT assemblies [54]. This totaled 764 sequencing runs (9.3 TB) and 755 assemblies (60 GB) for HMP and 124 sequencing runs (1.1 TB) and 124 assemblies (10 GB) for MetaHIT. Mash was run in parallel with the same parameters used for the GOS datasets and the resulting sketches merged with Mash paste. Sketching the 764 HMP sequencing runs required 259.5 CPU h (average 0.34, max 2.01) and the 755 assemblies required 3.7 CPU h (average 0.005). Sketching the 124 MetaHIT sequencing runs required 20 CPU h (average 0.16, max 0.62), and the 124 assemblies required 0.64 CPU h (average 0.005). COMMET was tested on three read sets (SAMN00038294, SAMN00146305, and SAMN00037421), which were smaller than the average HMP sample size and required an average of 655 CPU s per pairwise comparison. Thus, it was estimated to compare all 8882 pairs of HMP and MetaHIT samples would require at least 143,471 CPU h. Mash distances were computed for all pairs of samples as before for GOS. This required 3.3 CPU min for both sequencing runs and assemblies. HMP samples that did not pass HMP QC requirements [36] were removed from Fig. 5b, but Additional file 1: Figure S7 shows all HMP assemblies clustered, with several samples that did not pass HMP quality controls included. These samples are the only ones that fail to group by body site. Thus, Mash can also act as an alternate QC method to identify mis-tracked or low-quality samples.
Mash engineering
Mash builds upon the following open-source software packages: kseq [55] for FASTA parsing, Cap’n Proto for serialized output [56], MurmurHash3 for k-mer hashing [57], GNU Scientific Library [58] (GSL) for P value computation, and the Open Bloom Filter Library [59]. All Mash code is licensed with a 3-clause BSD license. If needed, Mash can also be built using the Boost library [60] to avoid the GSL (GPLv3) license requirements. Due to Cap’n Proto requirements, a C++11 compatible compiler is required to build from source, but precompiled binaries are distributed for convenience.