Synthetic data set generation
The in silico synthetic communities were generated by first downloading a list of complete bacterial genomes from the NCBI and selecting species with multiple strains present. Genomes were restricted to those that were full genome projects, possessed at least 35 of the 36 single-copy core genes (SCGs) identified in [3], and had relatively few contigs (<5) in their assemblies. Communities were created by specifying species from this list and the number of strains desired. The strains were then chosen at random from the candidates for each species, with the extra restriction that all strains in a species were at least 0.05% and no more than 5% nucleotide divergent on the SCGs from any other strain in the species. This corresponds to a minimum divergence of approximately 15 nucleotides over the roughly 30 kbp region formed by summing the SCGs. The genomes used are given in Additional file 1: Tables S1 and S2 and as a CSV text file in Additional file 2.
Each species, indexed i, was then given an abundance, y_{i,s}, in each sample, \(s = 1,\dots,S\), drawn from a lognormal distribution with a species-dependent mean and standard deviation, themselves drawn from a normal and a gamma distribution, respectively:
$$\log (y_{i,s}) \sim N(\mu_{i}, \sigma_{i}) $$
where:
$$\mu_{i} \sim N(\mu_{p}, \sigma_{p}) $$
and:
$$\sigma_{i} \sim \text{Gamma}(k_{p}, \theta_{p}). $$
For all the synthetic communities we used μ_{p}=1, σ_{p}=0.125, k_{p}=1, and θ_{p}=1. The species abundances were then normalised to sum to one (\(y^{\prime }_{i,s} = y_{i,s}/ \sum _{i} y_{i,s}\)). The proportion of each strain within a species in each sample was then drawn from a Dirichlet distribution:
$$ \rho_{g,s} \sim \text{Dirichlet}(\alpha) $$
(1)
with α=1 and g=1,…,G indexing the G strains per species. This was the procedure used for all the data sets except Synth_T1_S10, Synth_T10_S10, and Synth_T100_S10, where we used a hierarchical Dirichlet, first generating for each MAG an expected strain profile across all samples:
$$ \mu_{i,g} \sim \text{Dirichlet}(\alpha) $$
(2)
with α=1 but then in each sample the strain proportions were generated as:
$$ \rho_{i,s,g} \sim \text{Dirichlet}(\mu_{i}\theta), $$
(3)
so that the mean strain proportions were μ_{i}, with an adjustable precision θ.
This allowed us to specify a copy number for each genome g in species i in each sample as \(y^{\prime }_{i,s}\rho _{g,s}\). We then generated 150 million paired-end 2×150 bp reads in total across all samples, with Illumina HiSeq error distributions, using the ART read simulator [23]. The code for the synthetic community generation is available from https://github.com/chrisquince/STRONG_Sim.
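As an illustration, the abundance hierarchy above can be sketched in a few lines of Python (a minimal toy version using only the standard library; the function and variable names are ours, not those of STRONG_Sim):

```python
import random

random.seed(1)

def dirichlet(alpha, k):
    """Draw from a symmetric Dirichlet(alpha) via normalised Gamma variates."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(g)
    return [x / total for x in g]

def simulate_copy_numbers(n_species, n_strains, n_samples,
                          mu_p=1.0, sigma_p=0.125, k_p=1.0, theta_p=1.0, alpha=1.0):
    # Species-level parameters: mu_i ~ N(mu_p, sigma_p), sigma_i ~ Gamma(k_p, theta_p)
    mu = [random.gauss(mu_p, sigma_p) for _ in range(n_species)]
    sigma = [random.gammavariate(k_p, theta_p) for _ in range(n_species)]
    samples = []  # samples[s][i][g] = y'_{i,s} * rho_{g,s}
    for _ in range(n_samples):
        # log(y_{i,s}) ~ N(mu_i, sigma_i), then normalise so species abundances sum to one
        y = [random.lognormvariate(mu[i], sigma[i]) for i in range(n_species)]
        total = sum(y)
        y = [v / total for v in y]
        # Strain proportions rho_{g,s} ~ Dirichlet(alpha), drawn independently per sample
        samples.append([[y[i] * p for p in dirichlet(alpha, n_strains)]
                        for i in range(n_species)])
    return samples

sim = simulate_copy_numbers(n_species=5, n_strains=3, n_samples=10)
```

Each entry of the result is the relative copy number of one strain in one sample, which would then be scaled to a target read count before read simulation.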
Synthetic data set evaluation
We can determine which reference genome each contig derived from by considering the simulated reads that map onto it. Since we know which reference each of these reads came from, we can assign a contig to the genome from which the majority of its reads derive. We can then assign each MAG generated by STRONG to the reference species from which the majority of its contigs derive, weighted by contig length.
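The two majority votes can be sketched as follows (a schematic reconstruction; the names and example values are hypothetical):

```python
from collections import Counter

def assign_contig(read_origins):
    """Assign a contig to the reference genome contributing the majority of its mapped reads."""
    return Counter(read_origins).most_common(1)[0][0]

def assign_mag(contig_assignments):
    """Assign a MAG to the reference species contributing the most contig length.

    `contig_assignments` is a list of (assigned_species, contig_length) pairs."""
    weights = Counter()
    for species, length in contig_assignments:
        weights[species] += length
    return weights.most_common(1)[0][0]

# Hypothetical example: reads simulated from two reference genomes
contig_species = assign_contig(["genome_A", "genome_A", "genome_B"])
mag_species = assign_mag([("species_1", 50000), ("species_2", 12000), ("species_1", 8000)])
```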
Anaerobic digester sampling and sequencing
AD sample collection
We obtained ten samples from a farm anaerobic digestion bioreactor across a period of approximately one year. The sampling times, metadata and accession numbers are given in Additional file 1: Table S5. The reactor was fed on a mixture of slurry, whey and crop residues, and operated at 35−40^{∘}C with mechanical stirring. Biomass samples were taken directly from the AD reactor by the facility operators and shipped in ice-cooled containers to the University of Warwick. Upon receipt, they were stored at 4^{∘}C and then divided into several 15 mL aliquots within a few days. DNA was usually extracted from these aliquots immediately, but some were first stored in a −80^{∘}C freezer until subsequent thawing and extraction.
AD short read sequencing
DNA extraction was performed using the Qiagen PowerSoil extraction kit following the manufacturer's protocol. DNA samples were sequenced by Novogene on the NovaSeq platform with 2×150 bp reads at a mean depth of 11.63 Gbp.
AD long read sequencing
Anaerobic digester samples were stored in 1.8 mL cryovials at −80^{∘}C. Samples were defrosted at 4^{∘}C overnight prior to DNA extraction. DNA was extracted from a starting mass of 250 mg of anaerobic digester sludge using the MP Biomedical^{TM} FastDNA^{TM} SPIN Kit for Soil (cat no: 116560200) with a modified version of the manufacturer's protocol. Defrosted samples were homogenised by pipetting and then transferred to an MP bio^{TM} lysing matrix E tube (cat no: 116914050CF). Samples were resuspended in 938 μL of sodium phosphate buffer (cat no: 116560205).
Preliminary cell lysis was undertaken using lysozyme at a final concentration of 200 ng/μL and 20 μL of Molzyme Bug Lysis^{TM} solution. Samples were mixed by inversion and incubated at 37^{∘}C for 30 min in a shaking incubator (<100 rpm). Lysozyme was inactivated by adding 122 μL of MP bio MT buffer and mixing by inversion. Samples were then mechanically lysed in a VelociRuptor V2 bead-beating machine (cat no: SLS1401) at 5 m/s for 20 seconds and placed on ice for five minutes.
Samples were centrifuged at 14,000 g for five minutes to pellet the particulate matter, and the supernatant was transferred to a new 1.5 mL microfuge tube. Proteins were precipitated from the crude lysate by adding 250 μL of PPS^{TM} (cat no: 116560203) and mixing by inversion. Precipitated proteins were pelleted for five minutes at 14,000 g and the supernatant was transferred to 1000 μL of pre-mixed DNA binding matrix solution (cat no: 116540408). Samples were mixed by inversion for two minutes.
DNA binding matrix beads were recovered using the MP bio^{TM} spin filter (cat no: 116560210) and the manufacturer's spin protocol. The binding matrix was washed of impurities by complete resuspension in 500 μL of SEWS-M solution (cat no: 116540405) and centrifuged at 14,000 g for five minutes. The DNA binding matrix was then washed a second time by resuspension in 500 μL of 80% EtOH followed by centrifugation at 14,000 g for five minutes. The flow-through was discarded and the spin filter centrifuged at 14,000 g for two minutes to remove residual EtOH. The binding matrix was left to air dry for 2 minutes, then DNA was eluted using 100 μL of DES elution buffer at 56^{∘}C. Eluate was collected by centrifugation at 14,000 g for 5 minutes and stored at 4^{∘}C prior to library preparation. Eluted DNA concentration was estimated using a Qubit 4^{TM} fluorometer with the dsDNA Broad Range assay kit (cat no: Q32853). 260:280 and 260:230 purity ratios were quantified using a Nanodrop^{TM} 2000.
A 1x SPRI clean-up procedure was undertaken prior to library construction to further reduce contaminant carry-over. Input DNA was standardised to 1.2 μg in 48 μL of H_{2}O using a Qubit 4^{TM} fluorometer and the dsDNA 1x High Sensitivity assay kit (cat no: Q33231). Library preparation was undertaken using the Oxford Nanopore^{Ⓒ} Ligation Sequencing Kit (SQK-LSK109) with minor modifications to the manufacturer's protocol. The FFPE/End repair incubation step was extended to 30 min at 20^{∘}C and 30 min at 65^{∘}C, while DNA was eluted from SPRI beads at 37^{∘}C for 30 min with gentle agitation. The SQK-LSK109 long fragment buffer was used to ensure removal of non-ligated adaptor units and reduce short fragment carry-over into the final sequencing library. The final library DNA concentration was standardised to 250 ng in 12 μL of EB using a Qubit 4^{TM} fluorometer and the dsDNA 1x High Sensitivity assay kit.
Sequencing was undertaken for 72 hours on an Oxford Nanopore^{Ⓒ} R9.4.1 (FLO-MIN106) flow cell with 1489 active pores. DNA was left to tether for 1 hour prior to commencing sequencing. The flow cell and sequencing reaction were controlled by a MinION^{TM} MKII device and the MinKNOW GUI, V. 19.12.5. ATP refuelling was undertaken every 18 hours with 75 μL of flush buffer (FB). Post hoc basecalling was undertaken using Guppy V. 3.5.1 in the high accuracy configuration (HAC) mode.
STRONG pipeline
STRONG processes co-assembly graph regions for multiple metagenomic data sets in order to simultaneously infer the composition of closely related strains for a particular MAG and their core gene sequences. Here, we provide an overview of STRONG. We start from a series of S related metagenomic samples, e.g. samples of the same (or highly similar) microbial community taken at different time points or from different locations.
The Snakemake-based pipeline begins with the recovery of metagenome-assembled genomes (MAGs) [26]. We perform co-assembly of all available data with the metaSPAdes assembler [33], and then bin the contigs obtained by composition and coverage profiles across all available samples with CONCOCT [3] or, alternatively, MetaBAT2 [25]. Each bin is then analyzed for completeness and contamination based on single-copy core genes, and poor-quality bins are discarded. The default criterion is that a MAG requires greater than or equal to 75% of the SCGs in a single copy. While we currently focus on this combination of software tools, in principle we could use any other software or pipeline for MAG recovery; e.g. we could use MEGAHIT as the primary assembler [29]. For each MAG we then extract the full or partial sequences of the core genes, which we further refer to as single-copy core gene (SCG) sequences.
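The default bin-quality criterion can be expressed as a simple check (a sketch; the function name and the example copy-number counts are ours):

```python
def passes_mag_criterion(scg_copy_numbers, total_scgs=36, fraction=0.75):
    """Default STRONG bin-quality check: at least 75% of the single-copy
    core genes must be present in exactly one copy."""
    in_single_copy = sum(1 for count in scg_copy_numbers.values() if count == 1)
    return in_single_copy >= fraction * total_scgs

# Hypothetical bin: 30 SCGs in single copy, 2 duplicated, 4 absent
copies = {f"COG{i:04d}": 1 for i in range(30)}
copies.update({"COG9001": 2, "COG9002": 2})
is_mag = passes_mag_criterion(copies)  # 30 >= 0.75 * 36, so the bin is kept
```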
The final co-assembly graph produced by metaSPAdes cannot be used for strain resolution because, as with other modern assembly pipelines, variants between closely related strains are removed during the graph simplification process. Instead, we output the initial graph for the final k-mer length used in the (potentially) iterative assembly, following processing by a custom executable (spades-gsimplifier, based on the SPAdes codebase) to remove likely erroneous edges using a ‘mild’ coverage threshold and a tip clipping procedure. We refer to the resulting graph as a high-resolution assembly graph or HRAG.
The graph edges are then annotated with their corresponding sequence coverage profiles across all available samples. As is typical in de Bruijn graph analysis, the coverage values are given in terms of k-mer rather than nucleotide coverage. Profile computation is performed by a second tool, unitig-coverage, which aligns reads onto the HRAG. The potential advantage of this approach over estimation based on k-mer multiplicity is that it can correctly handle the results of any bubble removal procedure that we might want to add to the preliminary simplification phase in future.
For each detected SCG sequence (across all MAGs) we next try to identify the subgraph of the HRAG encoding the complete sequences of all variants of the gene across all strains represented by the MAG. The procedure is described in more detail in the next section. During testing we faced two types of problems here: (1) related strains might end up in different MAGs and (2) some subgraphs might consist of fragments corresponding to several different species. We take several steps to mitigate these problems. Firstly, we compare SCG graphs between all bins, not just MAGs. If an SCG graph shares unitigs between bins, then it is flagged as overlapping. If multiple SCG graphs (>10) overlap between MAGs, then we merge those MAGs, combining all their graphs and processing them for strains together. Following merging, any MAG SCG graphs with overlaps remaining are filtered out and not used in the strain resolution.
After MAG merging and COG subgraph filtering we process the remaining MAGs one by one. Before the core ‘decomposition’ procedure is launched on the set of SCG subgraphs corresponding to a particular MAG, they are subjected to a second round of simplification, parameterised by the mean coverage of the MAG, to filter nodes that are likely to be noise, again using the spades-gsimplifier program. This module is described in more detail below. The resulting set of simplified SCG subgraphs of the HRAG for a MAG is then passed to the core graph decomposition procedure, which uses the graph structure constraints, along with the coverage profiles associated with unitig nodes, to simultaneously predict: the number of strains making up the population represented by the MAG; their coverage depths across the samples; and paths corresponding to each strain within every subgraph (each path encodes the sequence of a particular SCG instance).
A fraction of the SCGs in a MAG may derive from other organisms due to the possibility of incorrect binning, i.e. contamination. In fact, the default 75% single-copy criterion allows up to 25% contamination. In addition, the subgraph extraction is not always perfect. We therefore add an extra level of filtering to the BayesPaths algorithm, iteratively running the program for all SCGs but then filtering those with mean error rates, defined by Eq. 19, that exceed a default of 2.5 times the median deviation from the median gene error. Filtering on the median deviation is in general a robust strategy for identifying outliers. As a result of this filtering, the pipeline only infers strain sequences on a subset of the input SCGs. We have found, however, that the number of SCGs for which strain haplotypes are inferred is sufficient for phylogenetics.
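The outlier filter can be sketched as follows (our reconstruction of the median-deviation rule; the gene names and error values are hypothetical, and we assume the test is one-sided, since only unusually high errors indicate contamination or extraction problems):

```python
from statistics import median

def filter_scg_outliers(gene_errors, factor=2.5):
    """Keep SCGs whose mean error rate lies no more than `factor` median
    absolute deviations above the median gene error."""
    med = median(gene_errors.values())
    mad = median(abs(e - med) for e in gene_errors.values())
    return {gene: e for gene, e in gene_errors.items() if e - med <= factor * mad}

# Hypothetical per-gene error rates; COG0012 is an obvious outlier
errors = {"COG0012": 0.30, "COG0016": 0.01, "COG0018": 0.02,
          "COG0048": 0.015, "COG0049": 0.012}
kept = filter_scg_outliers(errors)
```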
Relevant subgraph extraction
The procedure takes as input the predicted (partial) gene sequence, \(\bar T\), and an upper bound on the length of the coding sequence, L, defined as \(3 \alpha \langle \bar U_{n} \rangle \), where \(\langle \bar U_{n} \rangle \) is the average length in amino acids of that SCG in the COG database and α=1.5 by default. Relevant HRAG subgraph extraction then involves the following steps. First, the sequence \(\bar T\) is split into two halves, \( \bar {T'}\) and \(\bar {T''}\), keeping the correct frame (the lengths of both \(\bar {T'}\) and \(\bar {T''}\) are forced to be divisible by 3). \( \bar {T'}\) and \(\bar {T''}\) are then processed independently. Without loss of generality we describe the processing of \( \bar {T'}\):

1. Identify the path \(\mathcal {P}\) corresponding to \( \bar {T'}\) in the HRAG. We denote its length as \(L_{\mathcal {P}}\).

2. Launch a graph search for the stop codons to the right (left) of the rightmost (leftmost) position of \( \bar {T'}\) (\( \bar {T''}\)). The stop codon search is frame aware and is performed by a depth-first search (DFS) on a graph in which each vertex corresponds to a pair of the HRAG position and the partial sequence of the last traversed codon. Due to the properties of the procedure and the fact that it deals with de Bruijn graphs, the actual implementation encodes the frame state as an integer in [0,2] rather than the string of the last partially traversed codon. Vertices of this ‘state graph’ are naturally connected following the HRAG constraints. The search is cut off whenever a vertex with a frame state encoding a stop codon sequence is identified. Several stop codons can be identified within the same HRAG edge sequence in ‘different frames’; moreover, the procedure correctly identifies all stop codons even if the graph contains cycles (although such subgraphs may be ignored in later stages of the pipeline). This codon search procedure was originally implemented for https://github.com/ablab/orfsearch and used in [15].

3. The ‘backward’ search for the stop codons ‘to the left’ is actually implemented as a ‘forward’ search of the complementary sequences from the complementary position in the graph. Note that, as in classic ORF analysis, while the identified positions of the stop codons ‘to the right’ correspond to putative ends of the coding sequences for some of the variants of the analyzed gene, the positions of the stop codons ‘to the left’ only provide a likely boundary for where the coding sequence can start. In particular, left stop codons are likely to fall within the coding sequence of the neighbouring gene (in a different frame). Actual start codons are thus likely to lie somewhere on a path (with sequence length divisible by 3) between one of the ‘left’ stop codons and one of the ‘right’ stop codons. For reasons of simplicity, further analysis of edges on the paths between left (right) stop codons ignores the constraint of divisibility by 3.

4. After the sets of ‘left’ and ‘right’ stop codon positions are identified, along with the shortest distances between them and the \( \bar {T'}\) path, we gather the relevant subgraph given by the union of edges lying on some path of constrained length (see below) between some pair of left and right stop codons. First, for each pair (s,t) of left and right stop codon positions we compute the maximal length of the paths that we want to consider, \(L_{s,t}\), as \(L_{\mathcal {P}}\) plus the minimal distance from s to the start of \(\mathcal{P}\) plus the minimal distance from the end of \(\mathcal{P}\) to t. An edge e=(v,w) is considered relevant if there exists a pair of left and right stop codon positions (s^{′},t^{′}) such that e lies on a path of length not exceeding \(L_{s^{\prime},t^{\prime}}\) between s^{′} and t^{′}, which is equivalent to checking that min_dist(s^{′},v)+length(e)+min_dist(w,t^{′})<\(L_{s^{\prime},t^{\prime}}\). To allow efficient checks of the shortest distances we precompute them by launching Dijkstra's algorithm from all left (right) stop codon positions in the forward (backward) direction. In practice, the Dijkstra runs are initiated from the ends/starts of the corresponding edges and the distances are later corrected.

5. We then exclude from the set of relevant edges those that are too far from any putative (right) stop codon to be part of any COG instance. In particular, we exclude any edge e=(v,w) such that the minimal distance from vertex w to any of the right stop codon positions exceeds L.

6. After the sets of graph edges potentially encoding the gene sequence are gathered for \( \bar {T'}\) and \(\bar {T''}\), the union of the two sets, \(\mathcal {ER}\), is taken and augmented with the edges connecting the ‘inner’ \(\mathcal {ER}\) vertices (vertices which have at least one outgoing and at least one incoming edge in \(\mathcal {ER}\)) to the rest of the graph.
The initial splitting of \( \bar {T}\) into \( \bar {T'}\) and \(\bar {T''}\) is required to detect relevant stop codons which are not reachable from the last position of \( \bar {T}\) in the HRAG (or from which the first position of \( \bar {T}\) in the HRAG cannot be reached). In addition to the resulting component in GFA format, we also store the positions of the putative stop codons and the ids of edges connecting the component to the rest of the graph (further referred to as ‘sinks’ and ‘sources’).
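The relevance check in step 4 can be illustrated on a toy graph (a sketch with hypothetical node names; the distance maps are precomputed by Dijkstra runs from a left and a right stop codon position, as described above):

```python
import heapq

def dijkstra(adj, source):
    """Shortest path lengths from `source` over an adjacency dict
    {u: [(v, edge_length), ...]}, with edge lengths as weights."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def edge_relevant(edge, length, dist_from_left, dist_to_right, max_len):
    """Check min_dist(s', v) + length(e) + min_dist(w, t') < L_{s',t'}
    for an edge e = (v, w) and one left/right stop codon pair."""
    v, w = edge
    dl = dist_from_left.get(v, float("inf"))
    dr = dist_to_right.get(w, float("inf"))
    return dl + length + dr < max_len

# Toy graph: s -> a -> b -> t, plus a long side branch a -> c
adj = {"s": [("a", 10)], "a": [("b", 20), ("c", 500)], "b": [("t", 10)]}
radj = {"a": [("s", 10)], "b": [("a", 20)], "c": [("a", 500)], "t": [("b", 10)]}
dist_from_left = dijkstra(adj, "s")   # forward from a left stop position
dist_to_right = dijkstra(radj, "t")   # backward from a right stop position
on_path = edge_relevant(("a", "b"), 20, dist_from_left, dist_to_right, max_len=100)
off_path = edge_relevant(("a", "c"), 500, dist_from_left, dist_to_right, max_len=100)
```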
Subgraph simplification
While processing SCG subgraphs from a particular MAG we use the available information on the coverage of the MAG in the data set. In particular, we set up the simplification module to remove tips (nodes with either no successors or no predecessors) below a certain length (twice the read length), and edges with coverage below a limit that is the larger of 2.0 and one per cent of the MAG coverage for regular edges, or the larger of 2.5 and two per cent of the MAG coverage for short edges.
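These defaults can be summarised as follows (a sketch; the parameter names are ours, not those of the simplification module):

```python
def simplification_thresholds(mag_coverage, read_length=150):
    """Subgraph-simplification defaults described above: clip tips shorter than
    twice the read length; remove edges with coverage below max(2.0, 1% of MAG
    coverage) for regular edges, or max(2.5, 2% of MAG coverage) for short edges."""
    return {
        "max_tip_length": 2 * read_length,
        "regular_edge_min_cov": max(2.0, 0.01 * mag_coverage),
        "short_edge_min_cov": max(2.5, 0.02 * mag_coverage),
    }

t = simplification_thresholds(mag_coverage=500.0)
```

For low-coverage MAGs the absolute floors (2.0 and 2.5) dominate; for deeply covered MAGs the percentage terms take over.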
While simplifying an SCG subgraph, edges connecting it to the rest of the assembly graph must be handled with care (in particular, they should be excluded from the set of potential tips), because in the BayesPaths algorithm they form the potential sources and sinks of the possible haplotype paths. Moreover, the graph changes during simplification, and such edges might become parts of longer edges. Since we are interested in which dead-ends of the component do, and do not, lead to the rest of the graph, the output contains the up-to-date set of connections of the simplified component to the rest of the graph.
We now briefly describe additional procedures, disabled by default, based on ‘relative coverage’ criteria. Amongst other procedures for erroneous edge removal, SPAdes implements one that considers the ratio of an edge's coverage to the coverage of the edges adjacent to it. We define an edge e as ‘predominated’ by an incident vertex v if there is an edge e_{1} outgoing from v and an edge e_{2} incoming to v whose coverages exceed the coverage of e by at least a factor of α. Short edges (shorter than k+ε) predominated by both of their incident vertices are then removed from the graph. Erroneous graph elements in regions of high genomic coverage often form subgraphs of three or more erroneous edges. SPAdes implements a procedure for the search for (and subsequent removal of) subgraphs bounded by a set of predominated edges. Starting from a particular edge (v,w) predominated by vertex v, the graph is traversed breadth-first from vertex w without taking into account the edge directions. If the vertex currently being considered predominates the edge by which it was entered, the edges incident to it are not added to the traversal. The standard limitation on erroneous edge lengths naturally transforms into a condition on the maximum length of the path between the vertices of the traversed subgraph. A limit on the maximum total length of its edges is additionally introduced.
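The ‘predominated’ test can be sketched as follows (our reconstruction; the value of α and the example coverages are hypothetical):

```python
def is_predominated(edge_cov, incoming_covs, outgoing_covs, alpha=2.0):
    """An edge is 'predominated' by an incident vertex v if some edge into v and
    some edge out of v both exceed its coverage by at least a factor of alpha."""
    return (any(c >= alpha * edge_cov for c in incoming_covs) and
            any(c >= alpha * edge_cov for c in outgoing_covs))

# Hypothetical short edge of coverage 3 at a vertex with strong flanking edges
weak = is_predominated(3.0, incoming_covs=[40.0, 2.0], outgoing_covs=[35.0])
# The same flanking edges do not predominate a well-covered edge
strong = is_predominated(30.0, incoming_covs=[40.0], outgoing_covs=[35.0])
```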
BayesPaths
The model
We define an assembly graph \(\mathcal {G} = (\mathcal {V}, \mathcal {E})\) as a collection of unitig sequence vertices \(\mathcal {V} = 1,\ldots,V\) and directed edges \( \mathcal {E} \subseteq \mathcal {V} \times \mathcal {V}\). Each edge defines an overlap and comprises a pair of vertices and directions, \( \left (u^{d} \rightarrow v^{d}\right) \in \mathcal {E} \), where d∈{+,−} indicates whether the overlap involves the sequence (+) or its reverse complement (−). We define:

- Counts x_{v,s} for each unitig v=1,…,V in sample s=1,…,S
- Paths for strain g=1,…,G defined by \(\eta ^{g}_{u,v} \in \{0,1\}\), indicating whether strain g passes through that edge in the graph
- Flow of strain g through unitig v, \(\phi ^{g+}_{v} = \sum _{u \in A(v)} \eta ^{g}_{u,v}\) and \(\phi ^{g-}_{v} = \sum _{u \in D(v)} \eta ^{g}_{v,u}\), where A(v) is the set of ancestors of v and D(v) the set of descendants in the assembly graph
- Flow conservation, so that \(\phi ^{g+}_{v} = \phi ^{g-}_{v} = \phi ^{g}_{v}\)
- Strain intensities γ_{g,s}, the rate per position at which a read is generated from strain g in sample s
- Unitig lengths L_{v}
- Unitig biases θ_{v}, the fractional increase in reads generated from v given factors such as GC content influencing coverage
- A source node s and a sink node t such that \(\phi ^{g-}_{s} = \phi ^{g+}_{t} = 1\)
We then assume normally distributed counts for each node in each sample, giving a joint density for the observations and latent variables:
$$\begin{array}{*{20}l} P(\mathbf{X},\boldsymbol{\Gamma},\mathbf{H},\boldsymbol{\Theta}) = & \prod_{v=1}^{V} \prod_{s=1}^{S} \mathcal{N}\left(x_{v,s} \,\middle|\, L_{v} \theta_{v} \sum_{h=1}^{G} \phi^{h}_{v} \gamma_{h,s},\, \tau^{-1}\right) \prod_{h=1}^{G} \prod_{s=1}^{S} P(\gamma_{h,s} \mid \lambda_{h}) \\ & \cdot \prod_{h=1}^{G} \left(\prod_{v=1}^{V} \left[ \phi^{h+}_{v} = \phi^{h-}_{v} \right]\right) \left[ \phi^{h-}_{s} = 1\right] \left[\phi^{h+}_{t} = 1\right] P(\tau) \\ & \cdot \prod_{h=1}^{G} P(\lambda_{h} \mid \alpha_{0}, \beta_{0}) \prod_{v=1}^{V} P(\theta_{v} \mid \mu_{0}, \tau_{0}) \end{array} $$
(4)
where [⋅] indicates the Iverson bracket, evaluating to 1 if the condition is true and zero otherwise. We assume an exponential prior for the γ_{g,s} with a rate parameter that is strain dependent, such that:
$$ P(\gamma_{g,s} \mid \lambda_{g}) = \lambda_{g} \exp(-\lambda_{g} \gamma_{g,s}) $$
(5)
We then place gamma hyperpriors on the λ_{g}:
$$ P(\lambda_{g} \mid \alpha_{0}, \beta_{0}) = \frac{\beta_{0}^{\alpha_{0}}}{\Gamma (\alpha_{0})} \lambda_{g}^{\alpha_{0} - 1} \exp(- \beta_{0} \lambda_{g}) $$
(6)
This acts as a form of automatic relevance determination (ARD), forcing strains with low intensity across all samples to zero in every sample [8].
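By conjugacy of the gamma hyperprior with the exponential prior, the corresponding CAVI update for each λ_{g} takes the standard Gamma form (our derivation from Eqs. 5 and 6, not stated explicitly in the text):

$$ q^{*}(\lambda_{g}) = \text{Gamma}\left(\alpha_{0} + S,\; \beta_{0} + \sum_{s=1}^{S} \langle \gamma_{g,s} \rangle \right) $$

so strains whose expected intensities \(\langle \gamma_{g,s} \rangle\) are small in every sample acquire a large expected rate \(\langle \lambda_{g} \rangle\), which in turn shrinks those intensities further towards zero.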
We use a standard Gamma prior for the precision:
$$ P(\tau \mid \alpha_{\tau}, \beta_{\tau}) = \frac{\beta_{\tau}^{\alpha_{\tau}}}{\Gamma(\alpha_{\tau})} \tau^{\alpha_{\tau} - 1} \exp(-\beta_{\tau} \tau) $$
(7)
For the biases θ_{v} we use a truncated normal prior:
$$\begin{array}{*{20}l} P(\theta_{v} \mid \mu_{0}, \tau_{0}) & = & \frac{\sqrt{\frac{\tau_{0}}{2 \pi}} \exp\left(- \frac{\tau_{0}}{2} (\theta_{v} - \mu_{0})^{2}\right) }{1 - \Psi\left(-\mu_{0} \sqrt{\tau_{0}} \right)} \;\;\; \theta_{v} \geq 0 \\ & = & 0 \;\;\; \theta_{v} < 0 \\ \end{array} $$
where Ψ is the standard normal cumulative distribution function. The mean of this prior is set to one, μ_{0}=1, so that our prior belief is that the coverage on any given node is unbiased, with a fairly high precision, τ_{0}=100, to reflect an assumption that the observed coverage should reflect the summation of the strains. Finally, we assume a uniform prior over the possible discrete values of the \(\eta ^{g}_{v,u}\). If the assembly graph is a directed acyclic graph (DAG), then \(\eta ^{g}_{v,u} \in \{0,1\}\). We have found that this is true for most genes and typical k-mer lengths, but we do not need to assume it.
Variational approximation
We use variational inference to obtain an approximate solution to the posterior distribution of this model [7]. Variational inference is an alternative strategy to Markov chain Monte Carlo (MCMC) sampling. Rather than attempting to sample from the posterior distribution, variational inference assumes a tractable approximating distribution for the posterior, and then finds the parameters of that distribution that minimise the Kullback-Leibler divergence between the approximation and the true posterior distribution. Further, in mean-field variational inference the approximation is factorised into a product over a number of components that each approximate the posterior of a parameter in the true distribution. In practice the Kullback-Leibler divergence is not computable because it depends on the evidence, i.e. the joint distribution marginalised over all latent variables. Instead, inference is carried out by maximising the evidence lower bound (ELBO), which is equal to the negative of the Kullback-Leibler divergence plus a constant, that constant being the evidence. In our case, because all the distributions are conjugate, we can employ coordinate ascent variational inference (CAVI) to iteratively maximise the ELBO.
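CAVI itself is easiest to see on a toy conjugate model. The sketch below applies it to a univariate normal with unknown mean and precision (the standard textbook example, not the BayesPaths model; all names are ours):

```python
def cavi_normal(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """CAVI for data x_i ~ N(mu, 1/tau) with priors mu ~ N(mu0, 1/(lam0*tau))
    and tau ~ Gamma(a0, b0), under the factorisation q(mu, tau) = q(mu) q(tau)."""
    n = len(x)
    xbar = sum(x) / n
    a = a0 + (n + 1) / 2.0   # shape of q(tau); fixed across iterations
    e_tau = a0 / b0          # initial guess for <tau>
    for _ in range(iters):
        # Update q(mu) = N(m, 1/lam_n), holding q(tau) fixed at its mean
        m = (lam0 * mu0 + n * xbar) / (lam0 + n)
        lam_n = (lam0 + n) * e_tau
        # Update q(tau) = Gamma(a, b) using <mu> = m and <mu^2> = m^2 + 1/lam_n
        e_mu2 = m * m + 1.0 / lam_n
        b = b0 + 0.5 * (sum(xi * xi - 2.0 * xi * m for xi in x) + n * e_mu2
                        + lam0 * (e_mu2 - 2.0 * m * mu0 + mu0 * mu0))
        e_tau = a / b
    return m, lam_n, a, b

data = [2.1, 1.9, 2.3, 2.0, 1.7, 2.2]
m, lam_n, a, b = cavi_normal(data)
```

Each update holds the other factor fixed at its current expectations; iterating increases the ELBO monotonically, which is the same scheme applied to the much larger model above.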
Our starting point is to assume the following factorisation for the variational approximation:
$$ q(\boldsymbol{\Gamma},\mathbf{H},\boldsymbol{\Theta},\boldsymbol{\lambda},\tau) = \prod_{h=1}^{G} q_{h}\left(\left\{ \eta^{h}_{v,u} \right\}_{u,v \in \mathcal{A}}\right) \prod_{h=1}^{G} \prod_{s=1}^{S} q_{h}(\gamma_{h,s}) \prod_{h=1}^{G} q_{h}(\lambda_{h}) \prod_{v=1}^{V} q_{v}(\theta_{v}) q(\tau) $$
(8)
where \(\mathcal {A}\) is the set of edges in the assembly graph and \(\mathcal {V} = 1,\ldots,V\) the set of unitig sequence vertices. Note that we have assumed a fully factorised approximation except for the \(\eta ^{h}_{v,u}\), the paths for each strain through the graph. There we assume that the path for each strain forms a separate factor allowing strong correlations between the different elements of the path. This is therefore a form of structured variational inference [22].
To obtain the CAVI updates we use the standard procedure of finding the log of the optimal distributions q for each set of factorised variables as the expectation of the log joint distribution Eq. 4 over all the other variables, except the one being optimised. Using an informal notation we will denote these expectations as \(\langle \ln P \rangle _{q_{j}}\) where q_{j} is the variable being optimised.
The mean-field update for each set of \( \{\eta ^{g}_{v,u}\}_{u,v \in A}\) is then derived as:
$$\begin{array}{*{20}l} \ln q_{g}^{*}\left(\left\{ \eta^{g}_{v,u} \right\}_{u,v \in A}\right) & = \left\langle \ln P \right\rangle_{\left\{\eta^{g}_{v,u}\right\}_{u,v \in \mathcal{A}}} \\ & = \ln \left(\prod_{v=1}^{V} \delta_{\phi^{g+}_{v}, \phi^{g-}_{v}} \delta_{\phi^{g-}_{s},1} \delta_{\phi^{g+}_{t},1} \right) \\ & \quad - \left\langle \sum_{v=1}^{V} \sum_{s=1}^{S} \frac{\tau}{2} \left(x_{v,s} - \theta_{v} L_{v}\left[\sum_{h=1}^{G} \phi^{h}_{v} \gamma_{h,s}\right] \right)^{2} \right\rangle_{\left\{\eta^{g}_{v,u}\right\}_{u,v \in \mathcal{A}}} \\ & \quad + \text{Terms independent of } \eta^{g} \end{array} $$
Consider the second term only:
$$ -\frac{\langle \tau \rangle}{2} \left( \sum_{v=1}^{V} \sum_{s=1}^{S} -2 x_{v,s} L_{v} \langle \theta_{v} \rangle \langle \gamma_{g,s} \rangle \phi^{g}_{v} + L_{v}^{2} \left\langle \theta_{v}^{2} \right\rangle \left\langle \left(\sum_{h=1}^{G} \phi^{h}_{v} \gamma_{h,s}\right) \left(\sum_{i=1}^{G} \phi^{i}_{v} \gamma_{i,s}\right) \right\rangle \right) $$
This becomes:
$$\begin{aligned} -\frac{\langle \tau \rangle}{2} \Bigg(\sum_{v=1}^{V} \sum_{s=1}^{S} \Big[ -2 x_{v,s} \langle \theta_{v} \rangle L_{v} \langle \gamma_{g,s} \rangle \phi^{g}_{v} + 2 L_{v}^{2} \left\langle \theta_{v}^{2} \right\rangle \sum_{h \neq g }^{G} \left\langle \phi^{h}_{v} \right\rangle \langle \gamma_{h,s} \rangle \langle \gamma_{g,s} \rangle \phi^{g}_{v} \\ + L_{v}^{2} \left\langle \theta_{v}^{2} \right\rangle \left\langle \gamma_{g,s}^{2} \right\rangle \left(\phi^{g}_{v}\right)^{2} \Big] \Bigg) \end{aligned} $$
This can be reorganised to:
$$ \ln q_{g}^{*}\left(\left\{ \eta^{g}_{v,u} \right\}_{u,v \in A}\right) = \ln \left(\prod_{v=1}^{V} \delta_{\phi^{g+}_{v}, \phi^{g-}_{v}} \delta_{\phi^{g-}_{s},1} \delta_{\phi^{g+}_{t},1} \right) + \sum_{v=1}^{V} c_{1,v} \phi^{g}_{v} + \sum_{v=1}^{V} c_{2,v} \left(\phi^{g}_{v}\right)^{2} $$
(9)
where:
$$\begin{array}{@{}rcl@{}} c_{1,v} & = & -\frac{\langle \tau \rangle}{2} \sum_{s=1}^{S} \left[ -2 x_{v,s} \langle \theta_{v} \rangle L_{v} \langle \gamma_{g,s} \rangle + 2 L_{v}^{2} \left\langle \theta_{v}^{2} \right\rangle \sum_{h \neq g }^{G} \left\langle \phi^{h}_{v} \right\rangle \langle \gamma_{h,s} \rangle \langle \gamma_{g,s} \rangle \right] \\ c_{2,v} & = & -\frac{\langle \tau \rangle}{2} L_{v}^{2} \left\langle \theta_{v}^{2} \right\rangle \sum_{s=1}^{S} \left\langle \gamma_{g,s}^{2} \right\rangle \end{array} $$
It is apparent from Eq. 9 that \(q_{g}^{*}\left (\left \{ \eta ^{g}_{v,u} \right \}_{u,v \in A}\right)\) takes the form of a multivariate discrete distribution over the u,v∈A dimensions. The first term in Eq. 9 enforces the flow constraints and does not separate across nodes; the next two terms are effectively coefficients on the total flow through a unitig and its square. The updates for the other variables below depend on the expected values of the total flow through each of the unitig nodes for strain g, \(\left \langle \phi ^{g}_{v} \right \rangle \), which themselves depend on the \(\eta ^{g}_{v,u}\). These expected values can be efficiently obtained for all v by representing Eq. 9 as a factor graph, with factor nodes corresponding to both the constraints and the flow probabilities through each node and with variables \(\eta ^{g}_{v,u}\). We can then find the marginal probabilities for both the \(\eta ^{g}_{v,u}\) and the \(\phi ^{g}_{v}\) using the junction tree algorithm [48]; from these we can calculate the required expectations.
Next we consider the mean-field update for the γ_{g,s}:
$$\begin{array}{*{20}l} \ln q^{*}(\gamma_{g,s}) & = \left\langle \ln P \right\rangle_{\gamma_{g,s}} \\ & = -\left\langle \sum_{v=1}^{V} \frac{\tau}{2}\left(x_{v,s} - \theta_{v} L_{v}\left[\sum_{h=1}^{G} \phi^{h}_{v} \gamma_{h,s}\right] \right)^{2} \right\rangle_{\gamma_{g,s}} - \langle\lambda_{g}\rangle \gamma_{g,s} \end{array} $$
$$\begin{array}{*{20}l} \ln q^{*}(\gamma_{g,s}) & = \\ & -\frac{\langle \tau \rangle }{2} \Bigg (\sum_{v=1}^{V} -2 x_{v,s} \langle \theta_{v} \rangle L_{v} \left\langle \phi^{g}_{v} \right\rangle \gamma_{g,s} + 2 \left\langle \theta_{v}^{2} \right\rangle L_{v}^{2} \gamma_{g,s} \left\langle \phi_{v}^{g} \right\rangle \sum_{h \neq g} \langle \gamma_{h,s} \rangle \left\langle \phi^{h}_{v} \right\rangle \\ & + \left\langle \theta_{v}^{2} \right\rangle L_{v}^{2} \gamma_{g,s}^{2} \left\langle \left[\phi^{g}_{v}\right]^{2} \right\rangle \Bigg) - \langle\lambda_{g}\rangle \gamma_{g,s} \end{array} $$
With the restriction γ_{g,s}>0 this gives a normal distribution truncated to (0,∞) for γ_{g,s}, with mean and precision:
$$\begin{array}{*{20}l} \mu_{g,s} & = \frac{\sum_{v} x_{v,s} \langle \theta_{v} \rangle L_{v} \left\langle \phi_{v}^{g} \right\rangle - \left\langle \theta_{v}^{2} \right\rangle L_{v}^{2} \left\langle \phi_{v}^{g} \right\rangle \sum_{h \neq g} \langle \gamma_{h,s} \rangle \left\langle \phi^{h}_{v}\right\rangle}{\sum_{v} \left\langle \theta_{v}^{2} \right\rangle L_{v}^{2} \left\langle \left[\phi^{g}_{v}\right]^{2} \right\rangle} - \frac{\langle\lambda_{g}\rangle}{\tau_{g,s} } \end{array} $$
(10)
$$\begin{array}{*{20}l} \tau_{g,s} & = \langle \tau \rangle \sum_{v} \left\langle \theta_{v}^{2} \right\rangle L_{v}^{2} \left\langle \left[\phi^{g}_{v}\right]^{2} \right\rangle \end{array} $$
(11)
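The truncated-normal update of Eqs. 10 and 11 can be sketched as follows; this is a minimal Python illustration in which the expectation arrays are made up, standing in for the current variational factors (and including the \(\left\langle \theta_{v}^{2} \right\rangle\) terms that appear in the expansion above):

```python
# Sketch of the truncated-normal update for gamma_{g,s} (Eqs. 10-11).
# All expectation arrays here are simulated placeholders; in the real CAVI
# they come from the current variational distributions.
import numpy as np

rng = np.random.default_rng(1)
V, S, G = 6, 3, 2
x = rng.poisson(20.0, size=(V, S)).astype(float)   # unitig counts x_{v,s}
L = np.full(V, 100.0)                               # unitig lengths L_v
theta = np.ones(V); theta2 = np.ones(V)             # <theta_v>, <theta_v^2>
phi = rng.random((G, V)); phi2 = phi**2             # <phi_v^g>, <[phi_v^g]^2> (illustrative)
gamma = rng.random((G, S)); tau = 1.0; lam = np.full(G, 0.1)

def update_gamma(g, s):
    """Return (mu_{g,s}, tau_{g,s}) for the truncated normal q(gamma_{g,s})."""
    others = sum(gamma[h, s] * phi[h] for h in range(G) if h != g)
    num = np.sum(x[:, s] * theta * L * phi[g]
                 - phi[g] * others * theta2 * L**2)
    den = np.sum(theta2 * L**2 * phi2[g])
    tau_gs = tau * den                      # Eq. 11 (precision)
    mu_gs = num / den - lam[g] / tau_gs     # Eq. 10 (mean, before truncation)
    return mu_gs, tau_gs
```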
Derivations for the other updates follow similarly giving a Gamma posterior for the τ with parameter updates:
$$\begin{array}{*{20}l} \alpha_{\tau} & = \alpha_{0} + \Omega/2 \end{array} $$
(12)
$$\begin{array}{*{20}l} \beta_{\tau} & = \beta_{0} + \frac{1}{2}\sum_{v,s} \left\langle (x_{v,s} - \lambda_{v,s})^{2}\right\rangle \end{array} $$
(13)
where Ω=VS and we have used λ_{v,s} as shorthand for the predicted count:
$$\lambda_{v,s} = \theta_{v} L_{v} \sum_{g} \gamma_{g,s} \phi_{v}^{g}. $$
Then the τ have the following expectations and log expectations:
$$\begin{array}{*{20}l} \langle \tau \rangle & = \alpha_{\tau}/\beta_{\tau} \end{array} $$
(14)
$$\begin{array}{*{20}l} \langle \log \tau \rangle & = \psi\left(\alpha_{\tau}\right) - \log\left(\beta_{\tau}\right) \end{array} $$
(15)
where ψ is the digamma function. The biases θ_{v} have a truncated normal distribution and their updates can be derived similarly to the above.
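The Gamma update and its expectations (Eqs. 12-15) can be sketched as follows, assuming the standard conjugate form with a half factor on the summed squared residuals; the prior values and the residual matrix are illustrative:

```python
# Sketch of the Gamma update for the global precision tau (Eqs. 12-15),
# with a toy expected squared residual matrix standing in for <(x-lambda)^2>.
import numpy as np
from scipy.special import digamma

alpha0, beta0 = 1e-3, 1e-3                 # broad Gamma prior (assumed values)
sq_resid = np.full((6, 3), 4.0)            # toy <(x_{v,s} - lambda_{v,s})^2>

Omega = sq_resid.size                      # Omega = V * S
alpha_tau = alpha0 + Omega / 2.0           # Eq. 12
beta_tau = beta0 + 0.5 * sq_resid.sum()    # Eq. 13
mean_tau = alpha_tau / beta_tau            # Eq. 14: <tau>
mean_log_tau = digamma(alpha_tau) - np.log(beta_tau)   # Eq. 15: <log tau>
```

Note that, by Jensen's inequality, the expected log precision is always below the log of the expected precision.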
Evidence lower bound (ELBO)
Iterating the CAVI updates defined above generates a variational approximation that is optimal in the sense of maximising the evidence lower bound (ELBO), so called because it bounds the log evidence, log(p(x))≥ELBO(q(z)). It is useful to calculate the ELBO whilst performing CAVI updates to verify convergence, and the ELBO itself is sometimes used as a Bayesian measure of model fit, although as a bound this may be controversial [7]. The ELBO can be calculated from the relationship:
$$ ELBO(q) = \mathbb{E} \left[ \log p(x \mid z) \right] + \mathbb{E} \left[ \log p(z) \right] - \mathbb{E} \left[ \log q(z)\right] $$
(16)
The first term is simply the expected log-likelihood of the data given the latent variables. In our case it is:
$$ \mathbb{E} \left[ \log p(x \mid z) \right] = \frac{1}{2}\Omega \left (\langle \log \tau \rangle - \log (2 \pi)\right) - \frac{\langle \tau \rangle}{2} \sum_{v,s} \left\langle \left(x_{v,s} - L_{v} \theta_{v} \sum_{h=1}^{G} \phi^{h}_{v} \gamma_{h,s}\right)^{2}\right\rangle $$
(17)
where Ω=VS and the expectations are over the optimised distributions q.
The second term is the expectation under q(z) of the log prior distributions. In our case, with standard distributions, this is easy to calculate; for instance, for each of the γ_{g,s}:
$${} \mathbb{E} \left[ \log p(\gamma_{g,s}) \right ] = \frac{1}{2} \log\left(\frac{\tau_{g,s}}{2\pi}\right) - \frac{\tau_{g,s}}{2}\left(\left\langle \gamma_{g,s}^{2} \right\rangle + \mu_{g,s}^{2} - 2 \mu_{g,s} \langle \gamma_{g,s} \rangle \right) - \log \left[\frac{1}{2} \text{erfc} \left(-\mu_{g,s}\sqrt {\frac{\tau_{g,s}}{2}}\right) \right]. $$
Here the μ_{g,s} and τ_{g,s} are given by their current values derived from Eqs. 10 and 11, and the moments of γ_{g,s} are calculated from a truncated normal distribution with those current parameters. The third term is simply the negative entropy of the variational approximations, which is easily calculated for the standard distributions used here.
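The truncated-normal moments required here are available directly from scipy; a minimal sketch with illustrative parameter values:

```python
# The prior expectation in the ELBO needs the first two moments of a normal
# truncated to (0, inf); scipy's truncnorm provides these directly.
# The values of mu_{g,s} and tau_{g,s} below are illustrative.
import numpy as np
from scipy.stats import truncnorm

mu, tau = 0.8, 4.0                   # current variational parameters
sigma = 1.0 / np.sqrt(tau)
a = (0.0 - mu) / sigma               # lower bound 0 in standard units
dist = truncnorm(a, np.inf, loc=mu, scale=sigma)

m1 = dist.mean()                     # <gamma_{g,s}>
m2 = dist.var() + m1**2              # <gamma_{g,s}^2>
```

Truncation at zero shifts the mean upwards and shrinks the variance relative to the untruncated normal.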
Implementation details
One update of the algorithm consists of updating each variable, or set of variables, in turn given the current optimal distributions of the others. In practice we update:

1.
Compute the marginal flows \(\left \{ \eta ^{g}_{v,u} \right \}_{u,v \in A}\) for each strain g=1,…,G in turn using Eq. 9 and the Junction Tree algorithm. This can be performed independently for each single-copy core gene.

2.
Update the truncated normal strain abundances q(γ_{g,s}) for each strain in each sample, s=1,…,S, using Eqs. 10 and 11.

3.
Update the q(τ).

4.
Update the ARD parameter distributions q(λ_{g}), if used.

5.
Update the node biases q(θ_{v}).

6.
Check for redundant or low abundance strains and remove them (see below).
After a fixed maximum number of iterations, or once the ELBO converges, we stop iterating. Variational inference can be sensitive to initial conditions as it only finds local maxima of the ELBO; we therefore use a previously published variational Bayesian version of non-negative matrix factorisation [8] to find an initial approximate solution.
Empirical modelling of node variances
For low-coverage MAGs a precision that is identical for all nodes performs satisfactorily, but since the true distribution of read counts is Poisson, this overestimates the precision for nodes with high counts x_{v,s}. To address this, we developed an empirical procedure in which we first calculate 〈logτ_{v,s}〉 for each node using Eq. 15 as:
$$ \langle \log \tau_{v,s}\rangle = \psi\left(\alpha_{0} + 1/2\right) - \log\left(\beta_{0} X_{v,s} +\frac{1}{2} \left\langle (x_{v,s} - \lambda_{v,s})^{2} \right\rangle \right) $$
(18)
This quantity exhibits high variability, so we smooth it over log(x_{v,s}) using generalised additive models, as implemented in pyGAM [43], to give 〈logτ_{v,s}〉^{∗}. The term β_{0}X_{v,s} gives a prior that is effectively Poisson. We then obtain 〈τ_{v,s}〉 as exp(〈logτ_{v,s}〉^{∗}). This procedure has no theoretical justification but gives good results in practice. This approach of modelling a non-Gaussian distribution as a Gaussian with empirical variances is similar to that used in voom for RNA-Seq [27].
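A minimal sketch of this smoothing step on simulated data; a numpy polynomial fit stands in here for the pyGAM generalised additive model used in the actual pipeline:

```python
# Smooth the noisy per-node <log tau_{v,s}> over log(x_{v,s}) before
# exponentiating. A low-order polynomial fit stands in for the pyGAM
# generalised additive model; the data are simulated with Poisson-like
# variance so that tau is roughly 1/x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(50.0, size=500) + 1.0                # node counts x_{v,s}
log_tau = -np.log(x) + rng.normal(0.0, 0.3, x.size)  # noisy <log tau_{v,s}>

coef = np.polyfit(np.log(x), log_tau, deg=2)         # smooth over log(x)
log_tau_star = np.polyval(coef, np.log(x))           # <log tau_{v,s}>*
tau_star = np.exp(log_tau_star)                      # <tau_{v,s}>
```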
Cross-validation to determine optimum number of strains
The ARD procedure usually converges on the correct number of strains, except for high-coverage MAGs where overfitting may occur and too many strains can be predicted. We therefore additionally implemented a cross-validation procedure, splitting the entire data matrix x_{v,s} into test and train folds (default ten folds), training the model on the train folds and then testing on the held-out data. The correct number of strains was taken to be the one that maximised the log predictive posterior with an empirical variance reflecting the Poisson nature of the data, the exact test statistic being:
$$ \sum_{v,s \in E} \frac{1}{2}\log\left(\tau'_{v,s}\right) - \frac{1}{2}\sum_{v,s \in E} \tau'_{v,s} \left\langle {(x_{v,s} - \lambda_{v,s})}^{2}\right\rangle $$
(19)
where \(\tau'_{v,s} = 1/(0.5+x_{v,s})\) and E indicates data points in the test set; this down-weights high read count nodes, reflecting approximately Poisson noise.
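The test statistic of Eq. 19 can be sketched as follows, on simulated held-out counts and predictions:

```python
# Sketch of the cross-validated log predictive score of Eq. 19, down-weighting
# high-count nodes with the empirical precision tau' = 1/(0.5 + x).
# Held-out counts and model predictions are simulated for illustration.
import numpy as np

rng = np.random.default_rng(2)
x_test = rng.poisson(30.0, size=(50, 5)).astype(float)   # held-out counts
lam_pred = np.full_like(x_test, 30.0)                    # model predictions

def cv_score(x, lam):
    tau_p = 1.0 / (0.5 + x)                              # approx. Poisson noise
    return 0.5 * np.sum(np.log(tau_p)) - 0.5 * np.sum(tau_p * (x - lam) ** 2)

score = cv_score(x_test, lam_pred)
# A model with systematically biased predictions scores lower than the truth.
```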
Nanopore sequence analysis
Sequence preprocessing
To enable a qualitative comparison between haplotypes obtained from the Nanopore reads and the BayesPaths predictions, we developed the following pipeline, applied at the level of individual single-copy core genes (SCGs) from MAGs:

1.
We mapped all reads using minimap2 [30] against the SCG contig consensus ORF sequence and selected those reads with alignments that spanned at least one third of the gene with a nucleotide identity >80%.

2.
We then extracted the aligned portion of each read, reverse complementing if necessary, so that all reads ran in the same direction as the SCG ORF.

3.
We then obtained the variant positions on the consensus from the output of the DESMAN pipeline [39]. These are variant positions prior to haplotype calling representing the total unlinked variation observed in the short reads.

4.
For each Nanopore fragment, we aligned it against the SCG ORF using a Needleman-Wunsch global alignment and generated a condensed read comprising bases only from the short-read variant positions.
This provided us with a reduced representation of each Nanopore read, effectively filtering variation that was not observed in the short reads. These reduced representations were then used to calculate distances, defined as Hamming distances on the variant positions normalised by the number of positions observed, both between reads and between reads and the predicted COG sequences from BayesPaths. From these we generated NMDS plots indicating sequence diversity, and they provided an input to the hybrid Nanopore strain resolution algorithm below.
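The distance calculation can be sketched as a minimal Python function; here '-' marks a variant position a read did not span, and both the gap convention and the toy reads are illustrative:

```python
# Distance between condensed reads: Hamming distance over the short-read
# variant positions, normalised by the positions both sequences actually
# cover ('-' marks a position the read did not span).
def variant_distance(a, b, gap='-'):
    shared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not shared:
        return float('nan')        # no overlapping positions to compare
    return sum(x != y for x, y in shared) / len(shared)

d = variant_distance('ACGT-A', 'ACTTGA')   # 5 positions compared, 1 mismatch
# d == 0.2
```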
EM algorithm for hybrid Nanopore strain resolution
We also developed a simple EM algorithm for direct prediction of paths and their abundances on the short read assembly graph that are consistent with a set of observed long reads. We began by mapping the set of n=1,…,N Nanopore sequences denoted \( \{ \mathcal {S}_{n} \}\) onto the corresponding simplified SCG graph generated by STRONG using GraphAligner [40]. This provided us with N optimal alignment paths as walks in our SCG graph. We denote this graph \(\mathcal {G}\) comprising unitig vertices v and edges e∈{u,v} defining overlaps.
We assume, as is almost always the case, that the graphs contain no unitigs in both forward and reverse configurations, and that there are no cycles, so that each SCG is a directed acyclic graph (DAG) with one copy of each unitig; we then only need to track the direction that each overlap enters and leaves each unitig. The best alignment walks then comprise a sequence of edges, \(e^{n}_{1},\dots, e^{n}_{W_{n}}\), that traverse the graph, where W_{n} is the number of edges in the walk of read n.
Given these observed Nanopore reads we aim to reconstruct G haplotypes comprising paths from a dummy source node s, which has outgoing edges to all the true source nodes in the graph, through the graph to a dummy sink t, which connects all the true sinks. We further assume that each haplotype has relative frequency π_{g}. Each such haplotype path \(\mathcal {P}_{g} = \left \{s,e^{g}_{1},\ldots, e^{g}_{W_{g}}, t \right \}\) will translate into a nucleotide sequence \(\mathcal {T}_{g}\). We assume that these haplotypes generate Nanopore reads with a fixed error rate ε which gives a likelihood:
$$ P(\{ \mathcal{S}_{1},\ldots, \mathcal{S}_{N} \} \mid \pi, \{ \mathcal{T}_{1},\ldots, \mathcal{T}_{G} \}) = \prod_{n=1}^{N} \left(\sum_{g=1}^{G} \pi_{g} \epsilon^{m_{n,g}} (1 - \epsilon)^{M_{n,g}} \right). $$
(20)
where m_{n,g} is the number of base-pair mismatches between \(\mathcal {S}_{n}\) and \(\mathcal {T}_{g}\), counting insertions, deletions, and substitutions equally, and M_{n,g} is the number of matches.
To maximise this likelihood we used an Expectation-Maximisation (EM) algorithm, iterating the following steps until convergence:

1.
E-step: Calculate the responsibility of each haplotype for each sequence as:
$$ z_{n,g} = \frac{ \pi_{g} \epsilon^{m_{n,g}} (1 - \epsilon)^{M_{n,g}}}{\sum_{h} \pi_{h} \epsilon^{m_{n,h}} (1 - \epsilon)^{M_{n,h}}} $$
(21)
Alignments of reads against haplotypes were performed using vsearch [41].

2.
M-step: Update each haplotype by finding the most likely path on the short-read graph given the current expectations. This is done by assigning a weight \(w^{g}_{e}\) to each edge e in the graph as:
$$ w^{g}_{e} = \sum_{n \in e} z_{n,g} L_{e} $$
(22)
where n∈e denotes the set of reads whose optimal alignment contains that edge and L_{e} is the unique length of the unitig the edge connects to, i.e. ignoring the overlap length. We then find, for haplotype g, the maximal weight path through this DAG using a topological sort. The error rates are updated as:
$$ \epsilon = \frac{\sum_{n} \sum_{g} z_{n,g} m_{n,g}}{ \sum_{n} \sum_{g} z_{n,g} L_{n,g}} $$
(23)
where L_{n,g} are the alignment lengths.
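One iteration of the E and M steps above can be sketched on toy inputs; the mismatch counts, graph, and edge weights below are made up, and the maximal weight path search is the standard dynamic program over a topological order:

```python
# One EM iteration for the hybrid strain resolution, on toy data:
# responsibilities (Eq. 21), the error-rate update (Eq. 23), and a
# maximal-weight path search by topological sort (for Eq. 22 weights).
import numpy as np

eps, pi = 0.1, np.array([0.5, 0.5])        # error rate, strain frequencies
m = np.array([[2, 10], [9, 1]])            # mismatches m_{n,g} (N=2, G=2)
M = np.array([[98, 90], [91, 99]])         # matches M_{n,g}

# E-step (Eq. 21): responsibility of haplotype g for read n.
like = pi * eps**m * (1 - eps)**M
z = like / like.sum(axis=1, keepdims=True)

# M-step (Eq. 23): update the error rate from weighted mismatches.
aln_len = m + M                            # alignment lengths L_{n,g}
eps_new = (z * m).sum() / (z * aln_len).sum()

# Maximal-weight path through a DAG via a pass in topological order.
def best_path(nodes, edges, w):
    """nodes: topological order; edges: {u: [v, ...]}; w: {(u, v): weight}."""
    score = {v: float('-inf') for v in nodes}
    score[nodes[0]] = 0.0
    back = {}
    for u in nodes:
        for v in edges.get(u, []):
            if score[u] + w[(u, v)] > score[v]:
                score[v], back[v] = score[u] + w[(u, v)], u
    path, v = [nodes[-1]], nodes[-1]
    while v in back:
        v = back[v]
        path.append(v)
    return path[::-1]

p = best_path(['s', 'a', 'b', 't'],
              {'s': ['a', 'b'], 'a': ['t'], 'b': ['t']},
              {('s', 'a'): 5.0, ('s', 'b'): 1.0, ('a', 't'): 0.0, ('b', 't'): 0.0})
# p == ['s', 'a', 't']
```

Here each read is assigned almost entirely to the haplotype it matches best, so the updated error rate drops towards the mismatch fraction of the best assignments.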
As is often the case with EM algorithms, convergence depends strongly on initial conditions. We therefore initialise using partitioning around medoids clustering with the distances calculated in the "Methods - Nanopore sequence analysis" section. We can estimate the number of haplotypes from the negative log-likelihood as a function of haplotype number.