The DESMAN (De novo Extraction of Strains from MetAgeNomes) pipeline is a strategy for resolving both strain haplotypes and variations in gene content directly from short-read shotgun metagenome data. Our proposed approach comprises the commonly employed steps of an assembly-based metagenomic binning workflow (co-assembly of data, annotation of the resulting contigs, mapping of short reads to the assembly and identification of genome bins), followed by extraction of strains from the genome bins that match the target organism, using the novel DESMAN algorithm described below.
Assembly and mapping
The first step is to co-assemble all reads from all samples. Chimeric contigs can confound the downstream analyses with DESMAN; therefore, the choice of assembler and assembly parameters is important: we target more accurate contigs rather than longer but potentially chimeric ones, even if these choices result in relatively lower N50 values for the overall assembly. For our analyses, we used idba_ud [31], Ray [32] or MEGAHIT [33]. Assembly statistics are given in Additional file 1: Table S2. Note that the Tara Oceans assembly was not performed by us; the details are given in the original paper [25], although we do describe them briefly below.
Only contigs greater than 1 kbp in length were used for downstream analyses, and those greater than 20 kbp in length were fragmented into pieces smaller than 10 kbp [9]. The result of an assembly is a set of D contigs with lengths in base pairs \(L_{d}\) and sequence compositions \(U_{d}\), with elements \(u_{d,l}\) drawn from the set of nucleotides {A, C, G, T}.
Following co-assembly, we used bwa mem [34] to map the raw reads in each sample individually back onto the assembled contigs. We then used samtools [35] and sequenza-utils [36] or bam-readcount to generate a four-dimensional tensor \(\mathcal {N}\) reporting the observed base frequencies, \(n_{d,l,s,a}\), for each contig d and base position l in each sample s, where d=1,…,D, l=1,…,\(L_{d}\), s=1,…,S and a=1,…,4, which represents an alphabetical ordering of bases 1→A, 2→C, 3→G and 4→T.
Using this tensor, we calculated an additional D×S matrix, giving the mean coverage of each contig in each sample as:
$$x_{d,s} = \frac{n_{d,.,s,.}}{L_{d}}, $$
where we have used the convenient dot notation for summation, i.e. \( n_{d,.,s,.} \equiv \sum _{l = 1}^{L_{d}} \sum _{a = 1}^{4} n_{d,l,s,a} \).
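As a concrete illustration, the coverage calculation above can be written directly in terms of the tensor \(\mathcal {N}\). This is a minimal sketch, assuming the counts are stored as a zero-padded numpy array; the function name is ours and not part of the DESMAN distribution.

```python
import numpy as np

def mean_coverage(counts, lengths):
    """Mean coverage x_{d,s} = n_{d,.,s,.} / L_d.

    counts: (D, L_max, S, 4) base counts, zero-padded past each contig's length.
    lengths: (D,) contig lengths L_d in base pairs.
    """
    totals = counts.sum(axis=(1, 3))      # n_{d,.,s,.}: sum over positions and bases
    return totals / lengths[:, None]      # divide each contig by its length L_d

# Toy example: one contig of length 2, one sample, 10 reads at each position.
counts = np.zeros((1, 2, 1, 4))
counts[0, 0, 0, 0] = 10    # 10 As observed at position 1
counts[0, 1, 0, 2] = 10    # 10 Gs observed at position 2
x = mean_coverage(counts, np.array([2]))   # x[0, 0] is 10.0
```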
Contig clustering and target species identification
DESMAN can be used with any contig-binning method. We recommend a clustering algorithm that takes both sequence composition and differential coverage of contigs into consideration. For the synthetic strain mock and the E. coli O104:H4 outbreak, we used the standard version of the CONCOCT algorithm [9]. For the complex strain mock, clustering was performed in two steps: a standard CONCOCT run, followed by re-clustering guided by SCG frequencies. This strategy has been released in the SpeedUp_Mp branch of the CONCOCT distribution https://github.com/BinPro/CONCOCT. The Tara binning strategy is described below and in the original study [25].
Irrespective of binning method, we assume that one or more of the resulting bins match the target species and that they contain a total of C contigs with indices that are a subset of {1,…,D}. For convenience, we re-index the coverages and base frequency tensor such that \(x_{c,s}\) and \(n_{c,l,s,a}\) give the mean coverage and base frequencies in this subset, respectively.
Identifying core genes in target species
The algorithm assumes a fixed number of strains in the target species. However, in general, not every gene in every contig will be present in all strains. We address this by identifying a subset of the sequences that occur in every strain as a single copy. Here we identify those core genes for E. coli by (1) downloading 62 complete E. coli genomes from the National Center for Biotechnology Information (NCBI) and (2) assigning COGs [37] to the genes in these genomes. COG identification was performed by RPSBLAST for amino acid sequences against the NCBI COG database. This allowed us to identify 982 COGs that are both single copy and had an average of greater than 95% nucleotide identity between the 62 E. coli genomes. We denote these COGs as SCSGs.
We then identified SCSGs in the MAGs that represent our target species, using RPSBLAST, and created a subset of the variant tensor restricted to base positions that fall within SCSG hits. We denote this subset \(n_{h,l,s,a}\), where h is now indexed over the H SCSGs found and l is the position within each SCSG, which have lengths \(L_{h}\), so that \(l = 1,\ldots,L_{h}\). We denote the coverages of these genes as \(x_{h,s}\).
For the E. coli analyses, we have reference genomes available and could identify core genes, but this will not be the case in general for uncultured organisms, or even for those for which only a few isolates have been sequenced. In that case, we use a completely de novo approach, based on 36 SCGs that are conserved across all species [9], although any other single-copy gene collection [38, 39] could serve the same purpose. We validated this strategy on the complex strain mock and then applied it to the Tara Oceans microbiome survey. The actual identification of SCGs and subsetting of variants proceeds as above. The result is a decrease in resolution, owing to the decreased length of sequence on which variants are called, but as we demonstrate, it is still sufficient to resolve strains at low nucleotide divergence.
In real data sets, we have noticed that some core genes will, in some samples, have higher coverages than expected. We suspect that this is due to the recruitment of reads from low-abundance relatives that fail to be assembled. To account for this, we apply an additional filtering step to the core genes. All core genes should have the same coverage profile across samples; therefore, we applied a robust filtering strategy based on the median absolute deviation [40]. We calculated the absolute divergence of each gene coverage from the per-sample median, denoted \(x^{m}_{s}\):
$$\operatorname{div}_{h,s} = \left| x_{h,s} - x^{m}_{s} \right|, $$
and then the median of these divergences, denoted by \(\operatorname {div}^{m}_{s}\). If
$$\operatorname{div}_{h,s} > t \times \operatorname{div}^{m}_{s}, $$
we flag the gene as an outlier in that sample. Typically, we used t=2.5 as the outlier threshold. We only use genes that are not flagged in at least a fraction f of samples; in these analyses, f was set at 80%.
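The filtering rule above can be sketched in a few lines of numpy. This is a minimal illustration of the median-based outlier filter, with invented coverage values; the function name is ours.

```python
import numpy as np

def filter_core_genes(x, t=2.5, f=0.8):
    """Boolean mask of core genes kept by the median-divergence filter.

    x: (H, S) matrix of core-gene coverages across samples.
    t: outlier threshold; f: minimum fraction of unflagged samples.
    """
    x_med = np.median(x, axis=0)            # x^m_s: median coverage per sample
    div = np.abs(x - x_med)                 # div_{h,s}
    div_med = np.median(div, axis=0)        # div^m_s
    outlier = div > t * div_med             # flag gene h in sample s
    frac_ok = (~outlier).mean(axis=1)       # fraction of samples not flagged
    return frac_ok >= f

# Toy example: 4 genes, 3 samples; the last gene recruits extra reads everywhere.
x = np.array([[10., 20., 30.],
              [11., 19., 31.],
              [ 9., 21., 29.],
              [50., 90., 150.]])
keep = filter_core_genes(x)    # the inflated gene is dropped
```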
Variant detection
Our algorithmic strategy begins with a rigorous method for identifying possible variant positions within the SCSGs. The main principle is to use a likelihood ratio test to distinguish between two hypotheses for each position. The null hypothesis \(\mathcal {H}_{0}\) is that the observed bases are generated from a single true base under a multinomial distribution and a position-independent error matrix. We denote this error matrix ε, with elements \(\epsilon_{a,b}\) giving the probability that a base b is observed when the true base is a. The alternative hypothesis \(\mathcal {H}_{1}\), in which \(\mathcal {H}_{0}\) is nested, is that two true bases are present. For this test, we ignore the distribution of variants over samples, working with the total frequency of each base across all samples:
$$t_{h,l,a} = n_{h,l,.,a}. $$
Although generalising our approach to use the per-sample frequencies would be quite straightforward, we chose not to do so for computational reasons and because we achieve sufficient variant detection accuracy with the aggregate frequencies.
If we make the reasonable assumption that \(\epsilon_{a,a} > \epsilon_{a,b}\) for all a and all b≠a, then for a single true base with errors, the maximum likelihood solution for the true base is the consensus at that location, which we denote by the vector \(M_{h}\) for each SCSG, with elements:
$$m_{h,l}^{0} = \arg \max_{a} \left(t_{h,l,a} \right). $$
The likelihood for \(\mathcal {H}_{0}\) at each position is then the multinomial, assuming that bases are independently generated under the error model:
$$\mathcal{H}_{0}\left(t_{h,l,a} \mid \epsilon, r = m_{h,l}^{0} \right) = \prod_{a} \epsilon_{r,a}^{t_{h,l,a}} \frac{T_{h,l}!}{t_{h,l,a}!}, $$
where we use \(r = m_{h,l}^{0}\) to index the maximum likelihood true base and \(T_{h,l}\) is the total number of bases at the focal position, \(T_{h,l} = t_{h,l,.}\). Similarly, for the two-base hypothesis, the maximum likelihood solution for the second base (or variant) is:
$$m_{h,l}^{1} = \arg \max_{a \not\in m_{h,l}^{0}} \left(t_{h,l,a} \right). $$
Then the likelihood for the hypothesis \(\mathcal {H}_{1}\) at each position is
$$ \begin{aligned} &\mathcal{H}_{1}\left(t_{h,l,a} \mid \epsilon, r = m_{h,l}^{0}, s = m_{h,l}^{1}, p_{h,l} = p \right)\\ &\quad= \prod_{a} \left(p \epsilon_{r,a} + (1 - p) \epsilon_{s,a}\right)^{t_{h,l,a}}\frac{T_{h,l}!}{t_{h,l,a}!}, \end{aligned} $$
(1)
where we have introduced a new parameter, p, for the relative frequency of the consensus base. We set an upper bound on this frequency, \(p_{\max}\), such that \(p_{l} = 1 - p_{\max}\) corresponds to the minimum observable variant frequency. For the synthetic mock community, we set \(p_{l} = 0.01\), i.e. 1%. For the other two real data sets, where we wanted to be more conservative, we used \(p_{l} = 0.03\). For each position, we determine p by maximum likelihood, performing a simple one-dimensional optimisation of Eq. 1 with respect to p. Having defined these likelihoods, our ratio test is:
$$ -2\log{\frac{\mathcal{H}_{0}}{\mathcal{H}_{1}}}, $$
(2)
which will be approximately distributed as a chi-squared distribution with one degree of freedom. Hence, we can use this test to determine p values for the hypothesis that a variant is present at a particular position.
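The test can be sketched numerically as follows. This snippet is illustrative only: the error matrix values and the crude grid optimisation of p are our assumptions, and the multinomial coefficients are omitted because they are identical under both hypotheses and cancel in the ratio.

```python
import numpy as np

def lrt_statistic(t, eps):
    """2 log(H1/H0) for aggregate base counts t_{h,l,a} (length-4 array).

    eps: 4x4 error matrix, eps[a, b] = P(observe b | true base a).
    """
    order = np.argsort(t)[::-1]
    r, s = order[0], order[1]                  # consensus and second most common base
    log_h0 = np.sum(t * np.log(eps[r]))        # single true base r
    p_grid = np.linspace(0.5, 0.99, 50)        # crude 1-D optimisation over p
    mix = p_grid[:, None] * eps[r] + (1 - p_grid[:, None]) * eps[s]
    log_h1 = np.max(np.sum(t * np.log(mix), axis=1))
    return 2 * (log_h1 - log_h0)

eps = np.full((4, 4), 0.001)
np.fill_diagonal(eps, 0.997)                   # illustrative 0.3% total error rate
clear_variant = lrt_statistic(np.array([90, 0, 10, 0]), eps)   # far above 3.84
likely_noise = lrt_statistic(np.array([99, 1, 0, 0]), eps)     # below 3.84
```

The 3.84 comparison point is the 5% critical value of a chi-squared distribution with one degree of freedom.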
There still remains the question of how to determine the error matrix, ε. We assume that these errors are position-independent and, to determine them, we adopt an iterative approach resembling expectation maximisation. We start with a rough approximation to ε, categorise positions as variants or not, and then recalculate ε as the observed base transition frequency across all non-variant positions. We then reclassify positions and repeat until ε and the number of variants detected converge. Finally, we apply a Benjamini–Hochberg correction to account for multiple testing, giving a false discovery rate (FDR), or q value, for a variant at each position [41]. The variant positions identified by this procedure should represent sites where we are confident that variation exists in the MAG population at greater than 1% frequency. However, we cannot be certain that this variation necessarily derives from the target species, because of the potential recruitment of reads from other organisms; therefore, we prefer the term single-nucleotide variants (SNVs) for these positions, rather than single-nucleotide polymorphisms (SNPs), which we reserve for variant positions in isolate genomes.
Probabilistic model for variant frequencies
Having identified a subset of positions that are likely variants, the next step of the pipeline is to use the frequencies of those variants across multiple samples to link the variants into haplotypes. We use a fairly low q-value cut-off for variant detection, keeping all positions with FDR \(< 1.0\times 10^{-3}\). This ensures that we limit the positions used in this computationally costly next step to those most likely to be true variants. The cost is that we may miss some low-frequency haplotypes, but these are unlikely to be confidently determined anyway. We index the variant positions on the SCSGs by v and, for convenience, keep the same index across SCSGs, which we order by their COG number, so that v runs from \(1,\ldots, N_{1},\ldots, N_{1} + N_{2},\ldots,\sum _{h} N_{h}\), where \(N_{h}\) is the number of variants on the hth SCSG, and we keep a note of the mapping back to the original position and SCSG, denoted \(v \rightarrow (l_{v}, h_{v})\). We denote the total number of variants by \(V = \sum _{h} N_{h}\) and the tensor of variant frequencies obtained by subsetting \(n_{h_{v},l_{v},s,a} \rightarrow n_{v,s,a}\) on the variant positions as \(\mathcal {N}\).
Model likelihood
The central assumption behind the model is that these variant frequencies can be generated from G underlying haplotypes with relative frequencies in each sample s denoted by \(\pi_{g,s}\), so that \(\pi_{.,s} = 1\). Each haplotype has a defined base at each variant position, denoted \(\tau_{v,g,a}\). To encode the bases, we use four-dimensional vectors with elements ∈{0,1}, where a 1 indicates the base and all other entries are 0. The mapping to bases is irrelevant, but we use the same alphabetical ordering as above; thus \(\tau_{v,g,.} = 1\).
We also assume a position-independent base transition or error matrix, as above, giving the probability \(\epsilon_{a,b}\) of observing a base b given a true base a. Then, assuming independence across variant positions, i.e. explicitly ignoring any read linkage, and, more reasonably, independence between samples, the model likelihood is a product of multinomials:
$$ \begin{aligned} \mathcal{L}\left(\mathcal{N} \mid \pi, \tau,\epsilon \right) &= \prod_{v=1}^{V} \prod_{s = 1}^{S} \prod_{a = 1}^{4} \left(\sum_{b=1}^{4} \sum_{g = 1}^{G} \tau_{v,g,b} \pi_{g,s} \epsilon_{b,a} \right)^{n_{v,s,a}}\\ &\quad\times\frac{n_{v,s,.}!}{n_{v,s,a}!}. \end{aligned} $$
(3)
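The mixture probability inside Eq. 3 can be computed with a single tensor contraction. A minimal sketch follows, ignoring the constant multinomial coefficients, which do not affect inference; the shapes and values are illustrative.

```python
import numpy as np

def log_likelihood(N, tau, pi, eps):
    """Log of Eq. 3 up to the constant multinomial coefficients.

    N: (V, S, 4) variant counts; tau: (V, G, 4) one-hot haplotype bases;
    pi: (G, S) strain frequencies per sample; eps: (4, 4) error matrix.
    """
    # P(observe base a at position v in sample s) = sum_b sum_g tau pi eps
    prob = np.einsum('vgb,gs,ba->vsa', tau, pi, eps)
    return np.sum(N * np.log(prob))

# Toy example: one position, two haplotypes (A and C), one sample at 75%/25%.
tau = np.zeros((1, 2, 4))
tau[0, 0, 0] = 1                     # haplotype 0 carries A
tau[0, 1, 1] = 1                     # haplotype 1 carries C
pi = np.array([[0.75], [0.25]])
eps = np.full((4, 4), 0.001)
np.fill_diagonal(eps, 0.997)
N = np.array([[[75, 25, 0, 0]]])     # counts matching the mixture
ll = log_likelihood(N, tau, pi, eps)
ll_swapped = log_likelihood(N, tau, np.array([[0.25], [0.75]]), eps)
# frequencies matching the observed counts score higher than swapped ones
```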
Model priors
Having defined the likelihood, here we specify some simple conjugate priors for the model parameters. For the frequencies in each sample, we assume symmetric Dirichlet priors with parameter α:
$$P(\pi \mid \alpha) = \prod_{s} \text{Dir} (\pi_{g,s} \mid \alpha). $$
Similarly, for each row of the base transition matrix, we assume independent Dirichlets:
$$P(\epsilon \mid \delta) = \prod_{a} \text{Dir} (\epsilon_{a,b} \mid \delta) $$
with parameter δ. Finally, for the haplotypes themselves (τ), we assume independence across positions and haplotypes, with uniform priors over the four states:
$$P(\tau_{v,g,a}) = \frac{1}{4}. $$
Gibbs sampling strategy
We will adopt a Bayesian approach to inference of the model parameters, generating samples from the joint posterior distribution:
$$ P(\tau, \pi, \epsilon \mid \mathcal{N}) = \frac{P(\tau, \pi, \epsilon,\mathcal{N})}{P(\mathcal{N})}. $$
(4)
We use a Gibbs sampling algorithm to sample from the conditional posterior of each parameter in turn, which will converge on the joint posterior given sufficient iterations [42]. The following three steps define one iteration of the Gibbs sampler:

1.
The conditional posterior distribution for the haplotypes, \(\tau_{v,g,a}\), is
$$P(\tau \mid \epsilon, \pi, \mathcal{N}) \propto P(\mathcal{N} \mid \tau, \pi, \epsilon) P(\tau). $$
Each variant position contributes independently to this term, so we can sample each position independently. The haplotype assignments are discrete states, so their conditional will also be a discrete distribution. We sample τ for each haplotype in turn, from the conditional distribution for that haplotype, with the assignments of the other haplotypes fixed to their current values:
$$ \begin{aligned} P\left(\tau_{v,g,a} \mid \pi, \epsilon, \mathcal{N},\tau_{v,h \neq g,a}\right) \propto \prod_{s} \prod_{a} \left(\sum_{g} \sum_{b} \tau_{v,g,b} \pi_{g,s} \epsilon_{b,a} \right)^{n_{v,s,a}}. \end{aligned} $$
(5)

2.
To sample ε, we introduce an auxiliary variable, \(\nu_{v,s,a,b}\), which gives the number of bases of type a that were generated by a base of type b at location v in sample s. Its distribution, conditional on τ, π, ε and \(\mathcal {N}\), will be multinomial:
$$P\left(\nu_{v,s,a,b} \mid \tau, \pi, \epsilon, \mathcal{N}\right) = \prod_{b = 1}^{4} \left(\frac{\zeta_{v,s,a,b}^{\nu_{v,s,a,b}} }{\nu_{v,s,a,b}!}\right)n_{v,s,a}!, $$
where
$$\zeta_{v,s,a,b} = \frac{\sum_{g} \tau_{v,g,b} \pi_{g,s} \epsilon_{b,a}}{\sum_{a} \sum_{g} \tau_{v,g,b} \pi_{g,s} \epsilon_{b,a}}. $$
Since the multinomial is conjugate to the Dirichlet prior assumed for ε, we can easily sample ε conditional on ν:
$$P(\epsilon_{b,a} \mid \delta, \nu) = \text{Dir} (\nu_{.,.,a,b} + \delta). $$

3.
To sample π, we define a second auxiliary variable, \(\xi_{v,s,a,b,g}\), which gives the number of bases of type a that were generated by a base of type b at each position v from haplotype g in sample s. This variable, conditioned on τ, π, ε and ν, will be distributed as:
$$P(\xi_{v,s,a,b,g} \mid \tau, \pi, \epsilon, \nu) = \prod_{g} \left(\frac{\psi_{v,s,a,b,g}^{\xi_{v,s,a,b,g}}}{\xi_{v,s,a,b,g}!}\right) \nu_{v,s,a,b}! $$
with
$$\psi_{v,s,a,b,g} = \frac{\tau_{v,g,b} \pi_{g,s} \epsilon_{b,a}}{\sum_{g} \tau_{v,g,b} \pi_{g,s} \epsilon_{b,a}}. $$
Similarly, π conditional on ξ is also Dirichlet distributed:
$$P(\pi_{g,s} \mid \xi_{.,s,.,.,g}) = \text{Dir}\left(\xi_{.,s,.,.,g} + \alpha \right). $$
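The conjugate draws in steps 2 and 3 reduce to Dirichlet sampling once the auxiliary counts have been accumulated. A minimal sketch with simulated counts standing in for draws of ν and ξ; the shapes, prior values and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, S, G = 10, 3, 2
delta, alpha = 0.1, 1.0

# Simulated auxiliary counts standing in for draws of nu and xi.
nu = rng.integers(0, 50, size=(V, S, 4, 4))       # nu_{v,s,a,b}
xi = rng.integers(0, 50, size=(V, S, 4, 4, G))    # xi_{v,s,a,b,g}

# eps_{b,.} | nu ~ Dir(nu_{.,.,a,b} + delta): aggregate over positions/samples.
counts = nu.sum(axis=(0, 1))                      # (4, 4), indices (a, b)
eps = np.array([rng.dirichlet(counts[:, b] + delta) for b in range(4)])

# pi_{.,s} | xi ~ Dir(xi_{.,s,.,.,g} + alpha): one Dirichlet draw per sample.
pi = np.stack([rng.dirichlet(xi[:, s].sum(axis=(0, 1, 2)) + alpha)
               for s in range(S)], axis=1)        # (G, S), columns sum to 1
```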
Initialisation of the Gibbs sampler
Gibbs samplers can be sensitive to initial conditions. To ensure rapid convergence on a region of high posterior probability, we consider a simplified version of the problem. We calculate the proportions of each variant at each position in each sample:
$$p_{v,s,a} = \frac{n_{v,s,a}}{n_{v,s,.}}. $$
Then an approximate solution for τ and π will minimise the difference between these observations, and
$$\hat p_{v,s,a} = \sum_{g} \tau_{v,g,a} \pi_{g,s}. $$
If we relax the demand that \(\tau_{v,g,a} \in \{0,1\}\) and instead allow it to be continuous, then solving this problem is an instance of non-negative tensor factorisation (NTF), which is itself a generalisation of the better known non-negative matrix factorisation (NMF) problem [19]. We adapted the standard multiplicative update NTF algorithm that minimises the generalised Kullback–Leibler divergence between p and \(\hat p\):
$$\mathrm{D}_{KL} (p \| \hat p) = \sum_{v} \sum_{s} \sum_{a} p_{v,s,a} \log \left(\frac{p_{v,s,a}}{\hat p_{v,s,a}} \right) + \hat p_{v,s,a} - p_{v,s,a}. $$
This is equivalent to assuming that the observed proportions are a sum of independent Poisson-distributed components from each haplotype, ignoring the issue that the Poisson is a discrete distribution [43]. The standard multiplicative NMF algorithm can be applied to our problem [44] by rearranging the τ tensor as a \(4V \times G\) matrix \(\tau^{\prime}_{w,g} \equiv \tau_{v,g,a}\), where \(w = v + (a-1)V\). By doing so, we have created a matrix from the tensor by stacking the base components of all the haplotypes vertically. Similarly, we rearrange the variant tensor into a \(4V \times S\) matrix with elements \(n^{\prime }_{w,s} \equiv n_{v,s,a}\), where \(w = v + (a-1)V\). The update algorithms become:
$$\begin{array}{@{}rcl@{}} \tau^{\prime}_{w,g} & \leftarrow & \tau^{\prime}_{w,g} \frac{\sum_{s} \pi_{g,s} n^{\prime}_{w,s}/(\tau^{\prime}.\pi)_{w,s}}{\sum_{s} \pi_{g,s}},\\ \pi_{g,s} & \leftarrow & \pi_{g,s} \frac{\sum_{w} \tau^{\prime}_{w,g} n^{\prime}_{w,s}/(\tau^{\prime}.\pi)_{w,s}}{\sum_{w} \tau^{\prime}_{w,g}}. \end{array} $$
Then we add a normalisation step:
$$\begin{array}{@{}rcl@{}} \tau^{\prime}_{w,g} & = & \tau^{\prime}_{w,g}/\sum_{a} \tau^{\prime}_{v + (a-1)V,g},\\ \pi_{g,s} & = & \pi_{g,s}/\sum_{g} \pi_{g,s}. \end{array} $$
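The flattened updates and normalisation can be written compactly with numpy. This is a minimal sketch on random positive data; the dimensions, iteration count and seed are illustrative rather than the values used in the study.

```python
import numpy as np

rng = np.random.default_rng(1)
V, S, G = 20, 6, 2
n = rng.random((4 * V, S)) + 1e-6      # n'_{w,s}: flattened variant frequencies

tau = rng.random((4 * V, G))           # tau'_{w,g}, random positive start
pi = rng.random((G, S))                # pi_{g,s}
for _ in range(200):
    approx = tau @ pi                                       # (tau'.pi)_{w,s}
    tau *= (n / approx) @ pi.T / pi.sum(axis=1)             # tau' update
    approx = tau @ pi
    pi *= tau.T @ (n / approx) / tau.sum(axis=0)[:, None]   # pi update
    # normalisation: base probabilities at each (v, g) sum to 1 over a,
    # and strain frequencies in each sample sum to 1
    tau /= np.tile(tau.reshape(4, V, G).sum(axis=0), (4, 1))
    pi /= pi.sum(axis=0)
```

The `reshape(4, V, G)` exploits the stacking w = v + (a−1)V, so summing over the first axis gives the per-position, per-haplotype base totals used in the normalisation step.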
Having run the NTF until the reduction in \(\mathrm{D}_{KL}\) was smaller than \(10^{-5}\), we discretised the predicted τ values such that the predicted base at each position for each haplotype was the one with the largest \(\tau^{\prime}\). We used these values, together with π, as the starting point for the Gibbs sampler.
Implementation of the Gibbs sampler
In practice, following initialisation with the NTF, we run the Gibbs sampling algorithm twice for a fixed number of iterations. The first run is a burn-in phase to ensure convergence, which can be checked by manual inspection of the time series of parameter values. The second run is the actual sampler, from which T samples are stored as samples from the posterior distribution, \(\theta_{t} = (\tau_{t}, \pi_{t}, \epsilon_{t})\) with t=1,…,T. These can then be summarised by the posterior means, \(\hat {\theta } = \sum _{t} \theta _{t} /T\), and used in subsequent downstream analysis. We also store the sample with the maximum log-posterior, denoted \(\theta^{*} = (\tau^{*}, \pi^{*}, \epsilon^{*})\), if a single most probable sample is required. For many data sets, V will be too large for samples to be generated within a reasonable time. Fortunately, we do not need to use all variant positions to calculate π with sufficient accuracy. We randomly select a subset of the variants, run the sampler to obtain samples \((\pi_{t}, \epsilon_{t})\) and use these to assign haplotypes to all positions, by running the Gibbs sampler updating only τ sequentially using Eq. 5 and iterating through the stored \((\pi_{t}, \epsilon_{t})\).
Determining the number of haplotypes and haplotype validation
Ideally, the Bayes factor or model evidence, the denominator in Eq. 4, would be used to compare models with different numbers of haplotypes. Unfortunately, there is no simple, reliable procedure for accurately determining the Bayes factor from Gibbs sampling output. For this reason, we suggest examining the posterior mean deviance [45]:
$$ D = \frac{\sum_{t} -2 \log\left[\mathcal{L}\left(\mathcal{N} \mid \pi_{t}, \tau_{t},\epsilon_{t} \right)\right]}{T}. $$
As the number of haplotypes increases, the model will fit better and D will decrease. When the rate of decrease is sufficiently small, we conclude that we have determined the major abundant haplotypes or strains present. This method is ambiguous, but it has the virtue of not making the unwarranted assumptions that approximate estimation of the Bayes factor would require. To validate individual haplotypes, we compare replicate runs of the model. Since the model is stochastic, different sets of haplotypes will be generated each time. If we observe the same haplotypes in replicate runs, then we can be confident in their validity. Therefore, calculating the closest matching haplotypes across replicates gives an estimate of our confidence in them. We define the mean SNV uncertainty for a haplotype as the fraction of positions at which it differs from its closest match in a replicate run, averaged over all the other replicates.
For predictions, we used the run with the lowest posterior mean deviance, giving the predicted G. Parameter predictions were taken as the posterior means over the sampled values. For the haplotype sequences, these means were discretised by setting \(\tau_{v,g,m} = 1\) and \(\tau_{v,g,a \neq m} = 0\), where \(m = \arg \max_{a} \tau_{v,g,a}\).
When analysing multiple clusters, an automatic method of inferring the true number of haplotypes is required. To provide this, we developed a heuristic algorithm analogous to the human-guided strategy discussed above. As G increases, the mean posterior deviance must decrease, but when the rate of decrease is sufficiently small, we can conclude that we have determined the major abundant haplotypes present. We, therefore, ran multiple replicates (typically five) of the haplotype resolution algorithm for increasing \(G = 1,\ldots,G_{\max}\), and set a cut-off d (set at 5% for the studies presented here). When the successive reduction in posterior mean deviance averaged over replicates fell below this value, i.e. \(({\mathbb E}[D_{G-1}] - {\mathbb E}[D_{G}])/ {\mathbb E}[D_{G-1}] < d\), we used \(G_{U} = G - 1\) as an upper limit on the possible number of resolved haplotypes. We considered all G between 1 and \(G_{U}\) and, at each value of G, calculated the number of haplotypes that had a mean SNV uncertainty (see above) below 10% and a mean relative abundance above 5%. We chose the optimal G to be the one that returned the most haplotypes satisfying these conditions of reproducibility and abundance.
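The selection rule can be sketched as a small function over per-G summaries. The deviance values and haplotype counts below are invented for illustration, and the function name is ours.

```python
def choose_g(mean_deviance, good_haplotypes, d=0.05):
    """Pick the number of haplotypes G following the heuristic in the text.

    mean_deviance[i]: posterior mean deviance E[D_G] for G = i + 1,
    averaged over replicates.
    good_haplotypes[i]: number of haplotypes at G = i + 1 with mean SNV
    uncertainty < 10% and mean relative abundance > 5%.
    """
    g_upper = len(mean_deviance)                   # default if the cut-off is never hit
    for g in range(2, len(mean_deviance) + 1):
        drop = (mean_deviance[g - 2] - mean_deviance[g - 1]) / mean_deviance[g - 2]
        if drop < d:                               # deviance curve has flattened
            g_upper = g - 1                        # G_U = G - 1
            break
    # among 1..G_U, return the G with the most reproducible, abundant haplotypes
    return max(range(1, g_upper + 1), key=lambda g: good_haplotypes[g - 1])

best_g = choose_g([1000.0, 400.0, 390.0, 388.0], [1, 2, 3, 4])   # returns 2
```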
Resolving the accessory genome
Having resolved the number of strains and their haplotypes on the core genome, we now consider the question of how to determine the accessory genome for each strain. The strategy below could equally well be applied to either contigs or genes called on those contigs. In our experience, contigs are frequently chimeric, and we have achieved better results with genebased approaches. If contig assignments are required, then a simple consensus of the genes on a contig can be used. We will, therefore, describe a genebased analysis keeping in mind that contigs could be used interchangeably.
Genes should already have been assigned on all contigs in the target bin or bins, as described above. Now we consider not just the SCSGs but all genes, which we index f=1,…,F. Just as for the SCSGs, we can identify variant positions on the total gene set using Eq. 2. In fact, we apply a slightly modified version of this strategy here because of the large number of positions to be screened, replacing the one-dimensional optimisation of p with an estimate of the frequency of the consensus base as the ratio of the observed number of consensus bases to the total, \(p = t_{h,l,m_{h,l}^{0}} / T_{h,l}\).
We denote the number of variant positions associated with gene f by \(N_{f}\). In this case, we do need to keep explicit track of which variant maps to which gene, so we consider a four-dimensional variant tensor, denoted \(\mathcal {M}\), with elements \(m_{f,l,s,a}\), where l is indexed from \(1,\ldots,N_{f}\). This is generated by subsetting the original contig variant tensor \(\mathcal {N}\) to the variants associated with each gene. In practice, to speed up the algorithm, we use only a random subset of variants (20 was used here), since all variants contain the information necessary to determine which gene is present in which strain. An additional source of information is the average coverage of each gene across samples. This is the exact analogue of the contig coverage introduced above and we denote it with the same symbol, i.e. \(\mathcal {X}\) with elements \(x_{f,s}\).
Determining the accessory genome corresponds to inferring the copy number of each gene in each strain. We denote this integer \(\eta_{f,g}\), for each of the genes f=1,…,F associated with the species and each strain g=1,…,G. The ideas we present here could be extended to multi-copy genes; however, the current implementation of DESMAN assumes that all genes are present in zero or one copies, \(\eta_{f,g} \in \{0,1\}\). This simplifies the implementation considerably and, in real assemblies, the vast majority of genes are either present or absent in a strain. For example, for the STEC genome, this is true of 98.8% of the genes.
The first step is to determine the likelihood. We assume that this is separable for the variants and the coverages. This is an approximation, as the variant positions will contribute to the mean coverage calculation. Formally, we assume:
$$\mathcal{L}\left(\mathcal{M}, \mathcal{X} \mid \eta, \pi, \tau,\epsilon \right) = \mathcal{L}^{v}\left(\mathcal{M} \mid \eta, \pi, \tau,\epsilon \right).\mathcal{L}^{x}\left(\mathcal{X} \mid \eta, \gamma \right). $$
The first likelihood is, like Eq. 3, a product of multinomials:
$$ \begin{aligned} \mathcal{L}^{v}\left(\mathcal{M} \mid \eta, \pi, \tau,\epsilon \right) &= \prod_{f=1}^{F} \prod_{l=1}^{N_{f}} \prod_{s = 1}^{S} \prod_{a = 1}^{4} \left(\sum_{b=1}^{4} \sum_{g \in G_{f}} \tau_{f,l,g,b} \pi^{\prime f}_{g,s} \epsilon_{b,a} \right)^{m_{f,l,s,a}}\\ &\quad\times\frac{m_{f,l,s,.}!}{m_{f,l,s,a}!}. \end{aligned} $$
(6)
The difference is that the sum over strains g now includes only those for which \(\eta_{f,g} > 0\), i.e. those that actually possess a copy of gene f, a set that we denote \(g \in G_{f}\). The relative frequencies then have to be renormalised accordingly, so that:
$$ \pi^{\prime f}_{g,s} = \frac{\pi_{g,s}}{\sum_{g \in G_{f}} \pi_{g,s}}. $$
The likelihood for the coverages is somewhat simpler. We know the relative proportions of each strain in each sample, \(\pi_{g,s}\). We also know the mean total coverage on the core genes:
$$Y_{s} = n_{.,.,s,.}/\sum_{h=1}^{H} L_{h}. $$
Therefore, we can calculate the coverage associated with each strain:
$$\gamma_{g,s} = \pi_{g,s} Y_{s}. $$
We make the approximation that each copy of a contig from a strain contributes independently to the total mean coverage observed for that contig in a particular sample. If we further assume that this contribution is Poisson distributed with mean \(\gamma_{g,s}\), then, by the superposition property of Poisson distributions, the total contribution will again be Poisson with mean \(\lambda _{f,s} = \sum _{g} \eta _{f,g} \gamma _{g,s}\). Thus,
$$ \mathcal{L}^{x}\left(\mathcal{X} \mid \eta, \gamma \right) = \prod_{f=1}^{F} \prod_{s = 1}^{S} \exp\left(-\lambda_{f,s}\right) \lambda_{f,s}^{x_{f,s}} \frac{1}{\Gamma\left(x_{f,s} + 1\right)}. $$
(7)
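Equation 7 is a product of Poisson terms and can be evaluated directly. A minimal sketch with invented copy numbers and coverages; the function name is ours.

```python
import numpy as np
from math import lgamma

def coverage_log_likelihood(x, eta, gamma):
    """Log of Eq. 7.

    x: (F, S) observed mean gene coverages; eta: (F, G) gene copy numbers;
    gamma: (G, S) per-strain coverages, gamma_{g,s} = pi_{g,s} Y_s.
    """
    lam = eta @ gamma                          # lambda_{f,s} = sum_g eta_{f,g} gamma_{g,s}
    log_gamma = np.vectorize(lgamma)(x + 1.0)  # log Gamma(x_{f,s} + 1)
    return np.sum(-lam + x * np.log(lam) - log_gamma)

eta = np.array([[1.0, 1.0],     # gene 0: present in both strains
                [1.0, 0.0]])    # gene 1: present in strain 0 only
gamma = np.array([[5.0, 2.0],
                  [3.0, 6.0]])
x = eta @ gamma                  # observe coverages exactly at their means
ll_true = coverage_log_likelihood(x, eta, gamma)
ll_wrong = coverage_log_likelihood(x, np.array([[1.0, 0.0], [1.0, 1.0]]), gamma)
# the generating copy numbers score higher than a swapped assignment
```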
Our strategy for sampling the gene assignments \(\eta_{f,g}\) is to keep the relative proportions of each strain in each sample, \(\pi_{g,s}\), and the error matrix, \(\epsilon_{b,a}\), fixed at their posterior mean values \((\hat {\pi },\hat {\epsilon })\). We then use a Gibbs sampler to jointly sample both the \(\eta_{f,g}\) and the haplotypes of those strains, \(\tau_{f,l,g,a}\). In general, we assume a geometric prior for the \(\eta_{f,g}\), so that \(P(\eta _{f,g} = \eta) = \eta _{s}^{\eta }/Z\), where \(\eta_{s}\) is less than 1 to penalise multi-copy genes and Z is a normalisation constant, although here, as mentioned above, we restrict ourselves to binary η. Each gene contributes to the likelihood independently and so can be sampled independently. We can, therefore, loop through the genes, sampling η for each strain conditional on the other strains fixed at their current values:
$$ \begin{aligned} P(\eta_{f,g} = \eta ; \tau_{f,l,g,a} = \tau_{l,a} \mid \hat{\pi},\hat{\epsilon},\tau_{f,h \neq g,a},\eta_{f,h \neq g}) \propto \\ \mathcal{L}^{v}\left(\mathcal{M} \mid \eta, \hat{\pi}, \tau,\hat{\epsilon} \right). \mathcal{L}^{x}\left(\mathcal{X} \mid \eta, \hat{\gamma} \right) P(\eta)P(\tau), \end{aligned} $$
(8)
substituting Eqs. 6 and 7 into this and using uniform priors for τ.
To improve the speed of convergence of this sampler, we developed an approximate strategy to initialise \(\eta_{f,g}\) using just the coverages, \(x_{f,s}\). If we ignore for now that \(\eta_{f,g}\) is discrete, then the maximum likelihood prediction for \(\eta_{f,g}\) from Eq. 7 corresponds to minimising the generalised Kullback–Leibler divergence between the observed coverages, \(x_{f,s}\), and their predictions, \(\hat x_{f,s} = \sum _{g} \eta _{f,g} \gamma _{g,s}\):
$$\mathrm{D}_{KL} \left(x_{f,s} \| \hat x_{f,s}\right) = \sum_{f} \sum_{s} x_{f,s} \log \left(\frac{x_{f,s}}{\hat x_{f,s}} \right) + \hat x_{f,s} - x_{f,s}. $$
This also corresponds to NMF, but with a fixed estimate for \(\gamma_{g,s}\). Therefore, to solve for \(\eta_{f,g}\), we need only one of the multiplicative update rules [44]:
$$ \eta_{f,g} \leftarrow \eta_{f,g} \frac{\sum_{s} \gamma_{g,s} x_{f,s}/(\eta.\gamma)_{f,s}}{\sum_{s} \gamma_{g,s}}, $$
(9)
which gives continuous estimates for \(\eta_{f,g}\); we round these to the nearest integer for discrete copy number predictions.
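Equation 9 can be iterated directly with numpy. A minimal sketch on simulated data in which the true copy numbers are recovered after rounding; the dimensions, seed and iteration count are illustrative, not the values used in the study.

```python
import numpy as np

rng = np.random.default_rng(2)
G, S, F = 2, 8, 5
gamma = rng.random((G, S)) * 10 + 1              # fixed per-strain coverages
eta_true = rng.integers(0, 2, size=(F, G)).astype(float)
eta_true[:, 0] = 1.0                             # avoid genes absent from every strain
x = eta_true @ gamma                             # observed coverages at their means

eta = np.ones((F, G))                            # continuous estimate of eta_{f,g}
for _ in range(500):
    approx = eta @ gamma                         # (eta.gamma)_{f,s}
    eta *= (x / approx) @ gamma.T / gamma.sum(axis=1)   # Eq. 9 multiplicative update
eta_hat = np.rint(eta)                           # round to integer copy numbers
```

Because γ is held fixed, the objective is convex in η, so the multiplicative updates converge to the coverage-explaining copy numbers rather than a local minimum.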
The sampler is initialised using Eq. 9 before applying a burnin and sampling phase using Eq. 8. Typically, we have found that a relatively small number of samples, just 20, is sufficient before the η values converge. We also use only a random subset of the variant positions (again 20) for the η sampling as discussed above. Optionally, we then allow an additional sampling phase to determine the remaining τ, the haplotype sequences, with the η values fixed at their posterior mean values, if required.
Calculating genome divergence
To determine a measure of overall genome divergence that takes into account both which genes are present in a genome and how divergent those genes are in nucleotide sequence, we calculated for each strain both the gene complement, \(\eta_{f,g}\), and the gene haplotypes, \(\tau_{f,l,g,a}\). We converted the haplotypes into gene sequences using the contig references, clustered all the sequences from all strains in a MAG at 5% nucleotide identity using the clustering algorithm of vsearch [46], mapped each strain's gene sequences back onto the cluster centroids and assigned each strain gene sequence to its closest matching cluster. Each strain is then represented as a vector of 5% gene cluster frequencies, \(\upsilon_{g,c}\), where c indexes the gene clusters, of which we assume there are C in total. A measure of genome divergence between two strains g and h is then:
$$ d_{g,h} = 1  \frac{\sum_{c} \min(\upsilon_{g,c}, \upsilon_{h,c})}{\sum_{c} \max(\upsilon_{g,c}, \upsilon_{h,c})}. $$
(10)
Using this measure, the divergence is 0 if two strains contain all the same genes and their sequences are within 5% nucleotide identity of each other. Conversely, the divergence is 1 if they share no gene sequences within 5% identity.
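Eq. 10 is a weighted Jaccard distance on the cluster-frequency vectors and is trivial to compute; a minimal sketch (the helper name is ours, for illustration only):

```python
import numpy as np

def genome_divergence(u_g, u_h):
    """Genome divergence between strains g and h (Eq. 10).

    u_g, u_h: length-C vectors of 5% gene-cluster frequencies
              for the two strains, over the same C clusters.
    """
    u_g = np.asarray(u_g, dtype=float)
    u_h = np.asarray(u_h, dtype=float)
    # Weighted Jaccard: shared cluster mass over total cluster mass
    return 1.0 - np.minimum(u_g, u_h).sum() / np.maximum(u_g, u_h).sum()
```

For example, two strains with identical cluster vectors give a divergence of 0, and strains with no clusters in common give 1.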
Creation of complex strain mock
We simulated a complex community comprising 100 different species and 210 strains. The exact strains used are detailed in Additional file 3. The 100 species were chosen randomly from bacteria and archaea for which multiple complete genomes were available from the NCBI. They span a wide range of taxonomic diversity, deriving from 10 separate phyla, 49 families and 74 genera, although with an inevitable bias towards Proteobacteria. For each species, between one and five separate genomes were used in the simulation, with a species strain frequency distribution of (1:50, 2:20, 3:10, 4:10, 5:10), i.e. 50 species had no strain variation while ten comprised five strains.
We simulated reads from these genomes using the ART sequence simulator [47], run through a set of custom scripts, which are collated in the repository https://github.com/chrisquince/StrainMetaSim. In total, 96 samples were generated, each comprising approximately 6.25 million 2×150 bp paired-end reads with an average insert length of 300 bp and a standard Illumina error profile. This approximates running the 96 samples on one run of a HiSeq 2500 in rapid run mode using dual flow cells (assuming 180 Gbp per run).
We modelled the species abundances across samples using normalised log-normal distributions. We assumed each species, indexed t=1,…,T, to have a mean and standard deviation log abundance of μ_{t} and σ_{t}, respectively, such that its relative frequency n_{t,s} in sample s is generated by:
$$ y_{t,s} \sim \mathcal{N}(\mu_{t}, \sigma_{t}) $$
(11)
and then
$$ n_{t,s} = \frac{\mathrm{e}^{y_{t,s}}}{\sum_{t} \mathrm{e}^{y_{t,s}}}. $$
(12)
The log-normal parameters, μ_{t} and σ_{t}, for each species are themselves generated from a normal (mean = 1.0, standard deviation = 0.25) and gamma distribution (shape = 1.0, scale = 1.0), respectively. Then, within each species, we used a symmetric Dirichlet distribution to model the relative strain frequencies:
$$ \rho_{s} \sim \text{Dir} (\mathbf{a}), $$
(13)
where the vector a has a dimensionality equal to the number of strains in that species. In practice, we used a unit vector for this parameter. The relative frequency for each strain d is then:
$$ \kappa_{d \in t,s} = n_{t,s}\rho_{s,d}. $$
(14)
This gives the probability that a read in a given sample derives from a given strain. The strain coverage is then
$$ z_{d,s} = \frac{R \kappa_{d,s} N_{s}}{L_{d}}, $$
(15)
where N_{s} is the number of reads in sample s, R is the read length and L_{d} is the strain genome length. The program ART was used to generate simulated reads with this level of coverage from each strain genome in each sample. The result was that the number of reads varied slightly, since reads are discrete whilst coverage is a continuous quantity. In total, 599,067,690 paired reads were generated. These reads were then collated into samples to simulate the community.
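The generative model of Eqs. 11–15 can be sketched compactly. The following is an illustrative re-implementation that returns target per-strain coverages rather than reads; function and parameter names are ours, and the actual simulation used ART via the StrainMetaSim scripts:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_coverages(strain_counts, genome_lengths, n_samples, n_reads, read_len=150):
    """Per-strain coverage targets z_{d,s} under the abundance model.

    strain_counts:  number of strains k_t for each of T species
    genome_lengths: flat list of strain genome lengths L_d (bp)
    n_reads:        reads per sample N_s (assumed constant here)
    """
    T = len(strain_counts)
    mu = rng.normal(1.0, 0.25, size=T)               # per-species log-mean
    sigma = rng.gamma(shape=1.0, scale=1.0, size=T)  # per-species log-sd
    coverages = np.zeros((sum(strain_counts), n_samples))
    for s in range(n_samples):
        y = rng.normal(mu, sigma)                    # Eq. 11
        n = np.exp(y) / np.exp(y).sum()              # Eq. 12: species frequencies
        d = 0
        for t, k in enumerate(strain_counts):
            rho = rng.dirichlet(np.ones(k))          # Eq. 13: unit Dirichlet split
            for j in range(k):
                kappa = n[t] * rho[j]                # Eq. 14: strain read probability
                # Eq. 15: coverage = R * kappa * N_s / L_d
                coverages[d, s] = read_len * kappa * n_reads / genome_lengths[d]
                d += 1
    return coverages
```

Note that summing coverage times genome length over all strains recovers the total sequenced bases R·N_{s} in each sample, since the κ values sum to one.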
Assignment of contigs to species and genes to strains
To determine which contig derived from which species, we considered the reads that mapped onto it; each read has a known genome assignment from the sequence simulator. We therefore assigned a contig to the species from which the majority of its reads derived. Of the 74,581 contig fragments greater than 1,000 bp in length, relatively few were chimeric: only 228 (0.3%) had fewer than 90% of their mapped reads deriving from the assigned species. Similarly, for each individual gene called across all contigs by prodigal, we determined the fraction of reads deriving from each strain genome for comparison with the inferred gene assignments from the second step of the DESMAN pipeline.
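The majority-vote assignment, together with the support fraction used to flag chimeras, can be expressed in a few lines; a hypothetical sketch:

```python
from collections import Counter

def assign_contig(read_species):
    """Assign a contig to the species contributing the majority of its
    mapped reads, and return that species' fraction of the reads,
    which flags putative chimeras (e.g. < 90% support).

    read_species: list of known species labels, one per mapped read
    """
    counts = Counter(read_species)
    species, n = counts.most_common(1)[0]  # most frequent species label
    return species, n / len(read_species)
```

For example, a contig with 95 of 100 reads from one species would be assigned to it with a support fraction of 0.95, well above the 90% chimera threshold.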
Tara Oceans MAG collection
The details of the generation of the 957 non-redundant Tara Oceans MAGs are given in the original manuscript [25]. Briefly, the 93 Tara Oceans metagenome samples (30.9 billion filtered reads) from the planktonic size fraction (61 surface samples and 32 samples from the deep chlorophyll maximum layer) were grouped into 12 metagenomic sets based on geographic location. These geographic locations are detailed in Additional file 1: Table S9 and Figure S11. Each set was then independently co-assembled using MEGAHIT [33] and all contigs >2.5 kbp (>5 kbp for the Southern Ocean) were binned using an initial automatic binning with CONCOCT, followed by interactive refinement with the Anvi’o interface as described in [10]. Redundant MAGs, i.e. the same genome appearing from multiple co-assemblies, were identified using a combination of average nucleotide identity (>99%, over at least 75% of the shared genome) and relative abundance correlation (Pearson’s correlation >0.9). CheckM was used to infer the taxonomy of MAGs based on the proximity of 43 single-copy gene markers within a reference genomic tree [48]. For each MAG, genes were called using the program prodigal with the -p meta metagenomics flag [49]. The genes were annotated to KEGG orthologues by amino acid alignments against KEGG FTP Release 20140414 using RAPSearch2. A KEGG pathway module was considered present in a MAG if at least 75% of the orthologues of at least one pathway through that module were found. The genes were also annotated to COGs [37], and the fraction of the 36 single-copy core COGs (SCGs) identified in Alneberg et al. [9] that were found in a single copy was used to determine MAG purity and completeness.
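The module-presence rule can be made concrete as follows; representing a module as a list of alternative paths, each a set of KEGG orthologue (KO) identifiers, is our own simplification for illustration:

```python
def module_present(paths, found_kos, threshold=0.75):
    """Decide whether a KEGG module is present in a MAG.

    paths:     list of alternative paths through the module,
               each a set of KO identifiers (illustrative encoding)
    found_kos: set of KOs annotated in the MAG
    A module is present if >= threshold of the KOs of at least
    one path are found.
    """
    return any(len(path & found_kos) / len(path) >= threshold
               for path in paths)
```

For instance, a module with a four-KO path is called present when three of those four KOs (75%) are annotated in the MAG, even if an alternative path is entirely absent.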