Skmer: assembly-free and alignment-free sample identification using genome skims

The ability to inexpensively describe taxonomic diversity is critical in this era of rapid climate and biodiversity changes. The recent genome-skimming approach extends current barcoding practices beyond short markers by applying low-pass sequencing and recovering whole organelle genomes computationally. This approach discards the nuclear DNA, which constitutes the vast majority of the data. In contrast, we suggest using all unassembled reads. We introduce an assembly-free and alignment-free tool, Skmer, to compute genomic distances between the query and reference genome skims. Skmer shows excellent accuracy in estimating distances and identifying the closest match in reference datasets. Electronic supplementary material The online version of this article (10.1186/s13059-019-1632-4) contains supplementary material, which is available to authorized users.

k-mers from both genomes, it's easy to see that J = W 2n W and thus, W n = 2J 1+J . We further make a simplifying assumption. We assume all X i r.v.s are independent, an assumption that is true for most pairs of k-mers but ignores the fact that each kmer overlaps with k-1 other k-mers. With this assumption, the maximum likelihood estimate of p is simplŷ p = W n = 2J 1 + J .
By the functional invariance of maximum likelihood, the ML estimate of d is given

k-mer sampling
We now assume that each genome is covered uniformly at random. Thus, k-mers are also sub-sampled and we assume each k-mer is sampled at least once with probability ⌘ 1 in the first genome and ⌘ 2 in the second genome; we derive the relationship between these probabilities and genome coverage below. We estimate ⌘ values separately (also described below) and here consider them as given. For each 1  i  n and j 2 {1, 2}, let Y j,i ⇠ Bern(⌘ j ) be the indicator of whether the k-mer i is sampled at least once in the genome j. Under this scenario, the number of k-mers shared between the two genomes is given by the r.v. W = P n 1 X i Y 1,i Y 2,i . Defining Z = X i Y 1,i Y 2,i , we get W = P n 1 Z i and Z i ⇠ Bern(r) where r = p⌘ 1 ⌘ 2 by the independence of the mutation process and each of the two k-mer sampling processes. Assuming independence between Z i r.v.s (again ignoring the overlap between consecutive k-mers) we get the ML estimater = W n , and thus (for a given ⌘ 1 and ⌘ 2 ) we havê r =p⌘ 1 ⌘ 2 = W n (S1) It is easy to see that U gives the total number of sampled k-mers in both genomes. However, S i is not a Bernoulli and thus, U is not Binomial. Nevertheless, the same assumptions that we used to treat X i and Z i r.v.s as independent also give us independence between S i values; therefore, by the central limit theorem, U n can be approximated by a Gaussian with (note that X i , Y 1,i and Y 2,i are independent). By this Gaussian approximation, the ML estimate of q given ⌘ 1 , ⌘ 2 is given by: Note that J = W U . Equations S1 and S2 give two di↵erent ML estimators of the same parameter p given two di↵erent types of data (W and U ). While the two estimators are not the same, because n is extremely large, both estimators have a very low variance. Exploiting the low variance, we treat the two estimates of p as equal and divide both sides of Equation S1 by Equation S2 to get: Solving forp and replacingd = 1 p Note that we have assumed a known coverage and thus we are not co-estimating ⌘ j 's and d. In practice, we need to first estimate ⌘ 1 and ⌘ 2 , and we do it as we will describe.

Connection of ⌘ to read coverage
A k-mer stretching from position y to y + k on the genome is covered by the reads that start in the interval [y + k `, y]. Assuming that there is no sequencing error, and a uniform spread of of the N reads across the genome of length L. We show that the probability ⌘ that a k-mer is sampled by at least one read is given by Let X be a r.v. denoting the number of reads that cover a specific k-mer. Assuming a uniform spread of N reads across the genome of length L, the probability of x reads covering a k-mer (starting in an interval of length` k) is given by As N is large and N (l k) L is constant, it can be closely approximated by is the k-mer coverage, and is related to the coverage c by = l k l c As the number of reads covering a k-mer follows Poisson distribution, the fraction of k-mers covered by 1 or more reads is

Sequencing error
We model the sequencing error as an i.i.d process that corrupts each position of each read with a fixed probability ✏. To extend our previous results to cover this scenario, we need to see how the intersection r.v. (W ) and the union r.v. (U ) get a↵ected.
We start with the intersection (W ). We change the meaning of ⌘ to denote the probability that a k-mer is covered by at least one error-free read. The probability of a k-mer within a read being error-free is clearly By conditioning on the number of reads covering a k-mer, the probability of not covering a k-mer with an error-free read is given by P rob(no error-free read) = 1 X i=0 P rob(all reads have error|i reads) P rob(i reads) Hence, the probability that a k-mer is covered by at least one error-free read is given by Note that Eqn. S6 reduces to Eqn. S3 when there is no sequencing error, i.e., ⇢ = 1. Similar to the case of no error, given ⌘ 1 and ⌘ 2 , the r.v. W n (where W is the number of shared k-mers) can be used with Equation S1 to estimate r.
We now turn to the union (r.v. U ). For large enough k, and for genomes that are random and repeat-free, with high probability (> 1 2L 4 k ) an error produces a new k-mer that is not observed in either of the input genomes. Ignoring the exceedingly unlikely event that two errors produce the same k-mer or that they produce a kmer present in one of the two genomes, we can assume that the sequencing error generates as many new k-mers as the number of reads being a↵ected by errors.
In the regime that includes errors, U = P n 1 (T 1,i + T 2,i ) W where the r.v.s T 1,i and T 2,i give the total number of k-mers generated from the position i from the first and second genomes, respectively. W.l.o.g, consider T 1,i . By conditioning on the number of reads covering a k-mer we have Given that x reads are covering a k-mer, T 1,i equals the number of erroneous k-mers E, plus 1 if there is any error-free k-mer. As E ⇠ Binom(x, 1 ⇢) and substituting into (S7) x 1 x! e 1 = 1 e 1 ⇢ + 1 (1 ⇢) Letting ⇣ 1 = E [T 1,i ] and using the same central limit argument we used before, U n becomes approximately a Gaussian with expectation ⇣ 1 + ⇣ 2 ⌘ 1 ⌘ 2 p. Similar to Equation S2, given ⇣ 1 , ⇣ 2 , ⌘ 1 , and ⌘ 2 , the Gaussian approximation gives us: Again, assuming that estimates of p in Equation S1 (with the new definition of ⌘) and Equation S10 are the same (due to low variance), we divide the two equations and solve for d to get the estimator: Excluding low-copy k-mers from the Jaccard index calculation If we discard k-mers observed less than m times, then a k-mer will survive if it is covered by m or more error-free reads. Hence, ⌘ becomes the probability of m or more error-free reads covering a k-mer In general, we have shown that the probability distribution of the number of errorfree k-mers is a Poisson with parameter p.

Appendix B: Computing GTR distances
To compute the GTR matrix using the log-det approach, we need a 4 ⇥ 4 matrix F where each element is the fraction of sites where one genome has one letter while the other genome has the other letter. Given this matrix, d = log(det(F )).
As elsewhere, we assume a no-indel scenario so that each k-mer mismatch can be attributed to a single nucleotide substitution. For i, j 2 {a,c,g,t}, let x ij = x ji denote the number of mutations of the form i $ j. Our goal is to estimate x ij for all i, j. However, the paradigm of computing distance by hashing/sketching k-mers treats all mutations alike. Formally, the estimated distance d equals We do the following: 1 Replace G and T with C, and compute distance d a = x ac + x ag + x at .
2 Replace G and T with A, and compute distance Combining, we get A similar procedure can be used to compute all x ij and normalization gives us F .
Note that this procedure reduces the space of possible k-mers of length k to 2 k possibilities instead of 4 k . Therefore, it will likely be required that k is increased for high accuracy when this approach is used.

Appendix C: Supplementary method details and commands
Here we provide the exact procedures and commands that we used to run external softwares throughout our experiments.
Simulating genome-skims using ART To simulate short reads with length`= 100 and (default) error profiles of Illumina HiSeq2000, we ran To simulate reads with constant error rate ✏ = 0.01 (Phred score = 20) at coverage c, we used

Computing k-mer frequencies using JellyFish
To count all k-mers of length k = 31 in a genome-skim, we used and to get the histogram of k-mer counts jellyfish histo COUNT_FILE

Computing Jaccard index and estimating distance using Mash
We first sketch input genome-skims or assemblies with k-mer length k = 31 and sketch size s = 10 7 . For genome-skims (in FASTQ format) when no k-mer filtering is applied, we run mash sketch -r -k 31 -s 10000000 -o SKETCH_FILE FASTQ_FILE To sketch genome-skims while filtering k-mers with less than C copies, we use where we had to provide TIP_INFO_FILE containing estimates of coverage and sequencing error. To estimate coverage, we followed the procedure suggested in AAF user manual. We first used JellyFish to find the k-mer counts M i 's as described before. They suggest when there is a clear peak in the k-mer frequency distribution, estimate k-mer coverage to be the maximum bin. As they do not suggest a specific rule for that, we first find j = argmax i>1 M i , excluding the count of the first bin M 1 , which is always large because of erroneous k-mers due to sequencing error. If j > 2, it means that we can see a peak in k-mers distribution at j, so we use = j. Otherwise, if j = 2, we follow their suggested formula = P iMi P Mi for the case of low coverage or high sequencing error that there is no clear peak in the k-mer frequency distribution. We should also mention that no k-mer filtering used for AAF, as the coverage was heterogeneous over genome-skims. In fact, in AAF the filtering is applied to all genome-skims if used, and so they suggest to not apply filtering when there is any taxon with low coverage (c < 5) within the dataset.

Preprocessing raw reads using fastp
We used the following command to filter low-quality reads and trim the adapter

Contamination removal
To remove bacterial and mitochondrial sequences, we first created a BLAST database from the assemblies of contaminant genomes by running makeblastdb -in CONTAMINANTS_FASTA_FILE -dbtype nucl -out BLAST_DB and then searched the reads against these genomes using Megablast

Appendix D: Supplementary figures and tables
Distances computed based on full assemblies. The only species from the same genus with hamming distance less than 0.01 were the two eagle species (H. albicilla and H. leucocephalus).