Synthetic data set generation
The in silico synthetic communities were generated by first downloading a list of complete bacterial genomes from the NCBI and selecting species with multiple strains present. Genomes were restricted to those that were full genome projects, possessed at least 35 of the 36 single-copy core genes (SCGs) identified in [3], and had relatively few contigs (<5) in their assemblies. Communities were created by specifying species from this list and the number of strains desired. The strains were then chosen at random from the candidates for each species, with the extra restriction that all strains in a species were at least 0.05% and no more than 5% nucleotide divergent on the SCGs from any other strain in the species. This corresponds to a minimum divergence of approximately 15 nucleotides over the roughly 30 kbp region formed by summing the SCGs. The genomes used are given in Additional file 1: Tables S1 and S2 and as a CSV text file in Additional file 2.
Each species, indexed i, was then given an abundance, y_{i,s}, in each sample, \(s = 1,\dots,S\), drawn from a lognormal distribution with a species-dependent mean and standard deviation, themselves drawn from a normal and a gamma distribution, respectively:
$$\log (y_{i,s}) \sim N(\mu_{i}, \sigma_{i}) $$
where:
$$\mu_{i} \sim N(\mu_{p}, \sigma_{p}) $$
and:
$$\sigma_{i} \sim \text{Gamma}(k_{p}, \theta_{p}). $$
For all the synthetic communities we used μ_{p}=1, σ_{p}=0.125, k_{p}=1, and θ_{p}=1. The species abundances were then normalised to sum to one (\(y^{\prime }_{i,s} = y_{i,s}/ \sum _{i} y_{i,s}\)). The proportion of each strain within a species in each sample was then drawn from a Dirichlet distribution:
$$ \rho_{g,s} \sim \text{Dirichlet}(\alpha) $$
(1)
with α=1 and g=1,…,G indexing the G strains per species. This was the procedure used for all the data sets except Synth_T1_S10, Synth_T10_S10, and Synth_T100_S10, where we used a hierarchical Dirichlet, first generating for each MAG an expected strain profile across all samples:
$$ \mu_{i,g} \sim \text{Dirichlet}(\alpha) $$
(2)
with α=1 but then in each sample the strain proportions were generated as:
$$ \rho_{i,s,g} \sim \text{Dirichlet}(\mu_{i}\theta), $$
(3)
so that the mean strain proportions were μ_{i}, with an adjustable precision θ.
This allowed us to specify a copy number for each genome g in species i in each sample as \(y^{\prime }_{i,s}\rho _{g,s}\). We then generated 150 million paired-end 2×150 bp reads in total across all samples, with Illumina HiSeq error distributions, using the ART read simulator [23]. The code for the synthetic community generation is available from https://github.com/chrisquince/STRONG_Sim.
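As an illustration, the abundance hierarchy above can be sketched in a few lines of Python (a minimal toy version using only the standard library; the function and variable names are ours, not those of STRONG_Sim):

```python
import random

random.seed(1)

def dirichlet(alpha, k):
    """Draw from a symmetric Dirichlet(alpha) via normalised Gamma variates."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(g)
    return [x / total for x in g]

def simulate_copy_numbers(n_species, n_strains, n_samples,
                          mu_p=1.0, sigma_p=0.125, k_p=1.0, theta_p=1.0, alpha=1.0):
    # Species-level parameters: mu_i ~ N(mu_p, sigma_p), sigma_i ~ Gamma(k_p, theta_p)
    mu = [random.gauss(mu_p, sigma_p) for _ in range(n_species)]
    sigma = [random.gammavariate(k_p, theta_p) for _ in range(n_species)]
    samples = []  # samples[s][i][g] = y'_{i,s} * rho_{g,s}
    for _ in range(n_samples):
        # log(y_{i,s}) ~ N(mu_i, sigma_i), then normalise so species abundances sum to one
        y = [random.lognormvariate(mu[i], sigma[i]) for i in range(n_species)]
        total = sum(y)
        y = [v / total for v in y]
        # Strain proportions rho_{g,s} ~ Dirichlet(alpha), drawn independently per sample
        samples.append([[y[i] * p for p in dirichlet(alpha, n_strains)]
                        for i in range(n_species)])
    return samples

sim = simulate_copy_numbers(n_species=5, n_strains=3, n_samples=10)
```

Each entry of the result is the relative copy number of one strain in one sample, which would then be scaled to a target read count before read simulation.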
Synthetic data set evaluation
We can determine which reference genome each contig derived from by considering the simulated reads that map onto it. Since we know which reference each of these reads came from, we can assign a contig to the genome from which the majority of its reads derive. We can then assign each MAG generated by STRONG to the reference species from which the majority of its contigs derive, weighted by contig length.
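The two majority votes can be sketched as follows (a schematic reconstruction; the names and example values are hypothetical):

```python
from collections import Counter

def assign_contig(read_origins):
    """Assign a contig to the reference genome contributing the majority of its mapped reads."""
    return Counter(read_origins).most_common(1)[0][0]

def assign_mag(contig_assignments):
    """Assign a MAG to the reference species contributing the most contig length.

    `contig_assignments` is a list of (assigned_species, contig_length) pairs."""
    weights = Counter()
    for species, length in contig_assignments:
        weights[species] += length
    return weights.most_common(1)[0][0]

# Hypothetical example: reads simulated from two reference genomes
contig_species = assign_contig(["genome_A", "genome_A", "genome_B"])
mag_species = assign_mag([("species_1", 50000), ("species_2", 12000), ("species_1", 8000)])
```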
Anaerobic digester sampling and sequencing
AD sample collection
We obtained ten samples from a farm anaerobic digestion bioreactor across a period of approximately one year. The sampling times, metadata and accession numbers are given in Additional file 1: Table S5. The reactor was fed on a mixture of slurry, whey and crop residues, and operated at 35−40^{∘}C with mechanical stirring. Biomass samples were taken directly from the AD reactor by the facility operators and shipped in ice-cooled containers to the University of Warwick. Upon receipt, they were stored at 4^{∘}C and then divided into several 15 mL aliquots within a few days. DNA was usually extracted from these aliquots immediately, but some were first stored in a −80^{∘}C freezer until subsequent thawing and extraction.
AD short read sequencing
DNA extraction was performed using the Qiagen PowerSoil extraction kit following the manufacturer's protocol. DNA samples were sequenced by Novogene on the NovaSeq platform with 2×150 bp reads at a mean depth of 11.63 Gbp.
AD long read sequencing
Anaerobic digester samples were stored in 1.8 mL cryovials at −80^{∘}C. Samples were defrosted at 4^{∘}C overnight prior to DNA extraction. DNA was extracted from a starting mass of 250 mg of anaerobic digester sludge using the MP Biomedical^{TM} FastDNA^{TM} SPIN Kit for Soil (cat no: 116560200) with a modified version of the manufacturer's protocol. Defrosted samples were homogenised by pipetting and then transferred to an MP bio^{TM} lysing matrix E tube (cat no: 116914050CF). Samples were resuspended in 938 μL of sodium phosphate buffer (cat no: 116560205).
Preliminary cell lysis was undertaken using lysozyme at a final concentration of 200 ng/μL and 20 μL of Molzyme Bug Lysis^{TM} solution. Samples were mixed by inversion and incubated at 37^{∘}C for 30 min in a shaking incubator (<100 rpm). Lysozyme was inactivated by adding 122 μL of MP bio MT buffer and mixing by inversion. Samples were then mechanically lysed in a VelociRuptor V2 bead-beating machine (cat no: SLS1401) at 5 m/s for 20 seconds and placed on ice for five minutes.
Samples were centrifuged at 14,000 g for five minutes to pellet the particulate matter, and the supernatant was transferred to a new 1.5 mL microfuge tube. Proteins were precipitated from the crude lysate by adding 250 μL of PPS^{TM} (cat no: 116560203) and mixing by inversion. Precipitated proteins were pelleted for five minutes at 14,000 g and the supernatant was transferred to 1000 μL of pre-mixed DNA binding matrix solution (cat no: 116540408). Samples were mixed by inversion for two minutes.
DNA binding matrix beads were recovered using the MP bio^{TM} spin filter (cat no: 116560210) and the manufacturer's spin protocol. The binding matrix was washed of impurities by complete resuspension in 500 μL of SEWS-M solution (cat no: 116540405) and centrifuged at 14,000 g for five minutes. The DNA binding matrix was then washed a second time by resuspension in 500 μL of 80% EtOH followed by centrifugation at 14,000 g for five minutes. The flow-through was discarded and the spin filter centrifuged at 14,000 g for two minutes to remove residual EtOH. The binding matrix was left to air dry for 2 minutes, then DNA was eluted using 100 μL of DES elution buffer at 56^{∘}C. Eluate was collected by centrifugation at 14,000 g for 5 minutes and stored at 4^{∘}C prior to library preparation. Eluted DNA concentration was estimated using a Qubit 4^{TM} fluorometer with the dsDNA Broad Range assay kit (cat no: Q32853). 260:280 and 260:230 purity ratios were quantified using a Nanodrop^{TM} 2000.
A 1x SPRI clean-up procedure was undertaken prior to library construction to further reduce contaminant carry-over. Input DNA was standardised to 1.2 μg in 48 μL of H_{2}O using a Qubit 4^{TM} fluorometer and the dsDNA 1x High Sensitivity assay kit (cat no: Q33231). Library preparation was undertaken using the Oxford Nanopore^{Ⓒ} Ligation Sequencing Kit (SQK-LSK109) with minor modifications to the manufacturer's protocol. The FFPE/End repair incubation step was extended to 30 min at 20^{∘}C and 30 min at 65^{∘}C, while DNA was eluted from SPRI beads at 37^{∘}C for 30 min with gentle agitation. The SQK-LSK109 long fragment buffer was used to ensure removal of non-ligated adaptor units and reduce short fragment carry-over into the final sequencing library. The final library DNA concentration was standardised to 250 ng in 12 μL of EB using a Qubit 4^{TM} fluorometer and the dsDNA 1x High Sensitivity assay kit.
Sequencing was undertaken for 72 hours on an Oxford Nanopore^{Ⓒ} R9.4.1 (FLO-MIN106) flow cell with 1489 active pores. DNA was left to tether for 1 hour prior to commencing sequencing. The flow cell and sequencing reaction were controlled by a MinION^{TM} MKII device and the MinKNOW GUI, V. 19.12.5. ATP refuelling was undertaken every 18 hours with 75 μL of flush buffer (FB). Post hoc basecalling was undertaken using Guppy V. 3.5.1 in the high accuracy configuration (HAC) mode.
STRONG pipeline
STRONG processes co-assembly graph regions for multiple metagenomic data sets in order to simultaneously infer the composition of closely related strains for a particular MAG and their core gene sequences. Here, we provide an overview of STRONG. We start from a series of S related metagenomic samples, e.g. samples of the same (or highly similar) microbial community taken at different time points or from different locations.
The Snakemake-based pipeline begins with the recovery of metagenome-assembled genomes (MAGs) [26]. We perform co-assembly of all available data with the metaSPAdes assembler [33], and then bin the contigs obtained by composition and coverage profiles across all available samples with CONCOCT [3] or, alternatively, MetaBAT2 [25]. Each bin is then analyzed for completeness and contamination based on single-copy core genes, and poor-quality bins are discarded. The default criterion is that a MAG requires greater than or equal to 75% of the SCGs in a single copy. While we currently focus on this combination of software tools, in principle we could use any other software or pipeline for MAG recovery; e.g. we could use MEGAHIT as the primary assembler [29]. For each MAG we then extract the full or partial sequences of the core genes, which we further refer to as single-copy core gene (SCG) sequences.
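The default bin-quality criterion can be expressed as a simple check (a sketch; the function name and the example copy-number counts are ours):

```python
def passes_mag_criterion(scg_copy_numbers, total_scgs=36, fraction=0.75):
    """Default STRONG bin-quality check: at least 75% of the single-copy
    core genes must be present in exactly one copy."""
    in_single_copy = sum(1 for count in scg_copy_numbers.values() if count == 1)
    return in_single_copy >= fraction * total_scgs

# Hypothetical bin: 30 SCGs in single copy, 2 duplicated, 4 absent
copies = {f"COG{i:04d}": 1 for i in range(30)}
copies.update({"COG9001": 2, "COG9002": 2})
is_mag = passes_mag_criterion(copies)  # 30 >= 0.75 * 36, so the bin is kept
```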
The final co-assembly graph produced by metaSPAdes cannot be used for strain resolution because, as with other modern assembly pipelines, variants between closely related strains are removed during the graph simplification process. Instead, we output the initial graph for the final k-mer length used in the (potentially) iterative assembly, following processing by a custom executable (spades-gsimplifier, based on the SPAdes codebase) to remove likely erroneous edges using a ‘mild’ coverage threshold and a tip clipping procedure. We refer to the resulting graph as a high-resolution assembly graph or HRAG.
The graph edges are then annotated with their corresponding sequence coverage profiles across all available samples. As is typical in de Bruijn graph analysis, the coverage values are given in terms of k-mer rather than nucleotide coverage. Profile computation is performed by a second tool, unitig-coverage, which aligns reads onto the HRAG. The potential advantage of this approach over estimation based on k-mer multiplicity is that it can correctly handle the results of any bubble removal procedure that we might want to add to the preliminary simplification phase in future.
For each detected SCG sequence (across all MAGs) we next try to identify the subgraph of the HRAG encoding the complete sequences of all variants of the gene across all strains represented by the MAG. The procedure is described in more detail in the next section. During testing we faced two types of problems here: (1) related strains might end up in different MAGs and (2) some subgraphs might consist of fragments corresponding to several different species. We take several steps to mitigate these problems. Firstly, we compare SCG graphs between all bins, not just MAGs. If an SCG graph shares unitigs between bins, then it is flagged as overlapping. If multiple SCG graphs (>10) overlap between MAGs, then we merge those MAGs, combining all their graphs and processing them for strains together. Following merging, any MAG SCG graphs with overlaps remaining are filtered out and not used in the strain resolution.
After MAG merging and COG subgraph filtering we process the remaining MAGs one by one. Before the core ‘decomposition’ procedure is launched on the set of SCG subgraphs corresponding to a particular MAG, they are subjected to a second round of simplification, parameterised by the mean coverage of the MAG, to filter nodes that are likely to be noise, again using the spades-gsimplifier program. This module is described in more detail below. The resulting set of simplified SCG subgraphs of the HRAG for a MAG is then passed to the core graph decomposition procedure, which uses the graph structure constraints, along with the coverage profiles associated with unitig nodes, to simultaneously predict: the number of strains making up the population represented by the MAG; their coverage depths across the samples; and paths corresponding to each strain within every subgraph (each path encodes the sequence of a particular SCG instance).
A fraction of the SCGs in a MAG may derive from other organisms due to the possibility of incorrect binning, i.e. contamination. In fact, the default 75% single-copy criterion allows up to 25% contamination. In addition, the subgraph extraction is not always perfect. We therefore add an extra level of filtering to the BayesPaths algorithm, iteratively running the program for all SCGs but then filtering those with mean error rates, defined by Eq. 19, that exceed a default of 2.5 times the median deviation from the median gene error. Filtering on the median deviation is in general a robust strategy for identifying outliers. As a result of this filtering, the pipeline only infers strain sequences on a subset of the input SCGs. We have found, however, that the number of SCGs for which strain haplotypes are inferred is sufficient for phylogenetics.
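The outlier filter can be sketched as follows (our reconstruction of the median-deviation rule; the gene names and error values are hypothetical, and we assume the test is one-sided, since only unusually high errors indicate contamination or extraction problems):

```python
from statistics import median

def filter_scg_outliers(gene_errors, factor=2.5):
    """Keep SCGs whose mean error rate lies no more than `factor` median
    absolute deviations above the median gene error."""
    med = median(gene_errors.values())
    mad = median(abs(e - med) for e in gene_errors.values())
    return {gene: e for gene, e in gene_errors.items() if e - med <= factor * mad}

# Hypothetical per-gene error rates; COG0012 is an obvious outlier
errors = {"COG0012": 0.30, "COG0016": 0.01, "COG0018": 0.02,
          "COG0048": 0.015, "COG0049": 0.012}
kept = filter_scg_outliers(errors)
```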
Relevant subgraph extraction
The procedure takes as input the predicted (partial) gene sequence, \(\bar T\), and an upper bound on the length of the coding sequence, L, defined as \(3 \alpha \langle \bar U_{n} \rangle \), where \(\langle \bar U_{n} \rangle \) is the average length in amino acids of that SCG in the COG database and α=1.5 by default. Relevant HRAG subgraph extraction then involves the following steps. First, the sequence \(\bar T\) is split into two halves, \( \bar {T'}\) and \(\bar {T''}\), keeping the correct frame (the lengths of both \(\bar {T'}\) and \(\bar {T''}\) are forced to be divisible by 3). \( \bar {T'}\) and \(\bar {T''}\) are then processed independently. Without loss of generality we describe the processing of \( \bar {T'}\):

1. Identify the path \(\mathcal {P}\) corresponding to \( \bar {T'}\) in the HRAG. We denote its length as \(L_{\mathcal {P}}\).

2. Launch a graph search for the stop codons to the right (left) of the rightmost (leftmost) position of \( \bar {T'}\) (\( \bar {T''}\)). The stop codon search is frame aware and is performed by a depth-first search (DFS) on a graph in which each vertex corresponds to a pair of the HRAG position and the partial sequence of the last traversed codon. Due to the properties of the procedure and the fact that it deals with de Bruijn graphs, the actual implementation encodes the frame state as an integer in [0,2] rather than the string of the last partially traversed codon. Vertices of this ‘state graph’ are naturally connected following the HRAG constraints. The search is cut off whenever a vertex with a frame state encoding a stop codon sequence is identified. Several stop codons can be identified within the same HRAG edge sequence in ‘different frames’; moreover, the procedure correctly identifies all stop codons even if the graph contains cycles (although such subgraphs may be ignored in later stages of the pipeline). This codon search procedure was originally implemented for https://github.com/ablab/orfsearch and used in [15].

3. The ‘backward’ search for the stop codons ‘to the left’ is actually implemented as a ‘forward’ search of the complementary sequences from the complementary position in the graph. Note that, as in classic ORF analysis, while the identified positions of the stop codons ‘to the right’ correspond to putative ends of the coding sequences for some of the variants of the analyzed gene, the positions of the stop codons ‘to the left’ only provide a likely boundary for where the coding sequence can start. In particular, left stop codons are likely to fall within the coding sequence of the neighbouring gene (in a different frame). Actual start codons are thus likely to lie somewhere on a path (with sequence length divisible by 3) between one of the ‘left’ stop codons and one of the ‘right’ stop codons. For reasons of simplicity, further analysis of edges on the paths between left (right) stop codons ignores the constraint of divisibility by 3.

4. After the sets of ‘left’ and ‘right’ stop codon positions are identified, along with the shortest distances between them and the \( \bar {T'}\) path, we gather the relevant subgraph given by the union of edges lying on some path of constrained length (see below) between some pair of left and right stop codons. First, for each pair (s,t) of left and right stop codon positions we compute the maximal length of the paths that we want to consider, \(L_{s,t}\), as \(L_{\mathcal {P}}\) plus the minimal distance from s to the start of \(\mathcal{P}\) plus the minimal distance from the end of \(\mathcal{P}\) to t. An edge e=(v,w) is considered relevant if there exists a pair of left and right stop codon positions (s^{′},t^{′}) such that e lies on a path of length not exceeding \(L_{s^{\prime},t^{\prime}}\) between s^{′} and t^{′}, which is equivalent to checking that min_dist(s^{′},v)+length(e)+min_dist(w,t^{′})<\(L_{s^{\prime},t^{\prime}}\). To allow efficient checks of the shortest distances we precompute them by launching Dijkstra's algorithm from all left (right) stop codon positions in the forward (backward) direction. In practice, the Dijkstra runs are initiated from the ends/starts of the corresponding edges and the distances are later corrected.

5. We then exclude from the set of relevant edges those that are too far from any putative (right) stop codon to be part of any COG instance. In particular, we exclude any edge e=(v,w) such that the minimal distance from vertex w to any of the right stop codon positions exceeds L.

6. After the sets of graph edges potentially encoding the gene sequence are gathered for \( \bar {T'}\) and \(\bar {T''}\), the union of the two sets, \(\mathcal {ER}\), is taken and augmented with the edges connecting the ‘inner’ \(\mathcal {ER}\) vertices (vertices which have at least one outgoing and at least one incoming edge in \(\mathcal {ER}\)) to the rest of the graph.
The initial splitting of \( \bar {T}\) into \( \bar {T'}\) and \(\bar {T''}\) is required to detect relevant stop codons which are not reachable from the last position of \( \bar {T}\) in the HRAG (or from which the first position of \( \bar {T}\) in the HRAG cannot be reached). In addition to the resulting component in GFA format, we also store the positions of the putative stop codons and the ids of edges connecting the component to the rest of the graph (further referred to as ‘sinks’ and ‘sources’).
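The relevance check in step 4 can be illustrated on a toy graph (a sketch with hypothetical node names; the distance maps are precomputed by Dijkstra runs from a left and a right stop codon position, as described above):

```python
import heapq

def dijkstra(adj, source):
    """Shortest path lengths from `source` over an adjacency dict
    {u: [(v, edge_length), ...]}, with edge lengths as weights."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def edge_relevant(edge, length, dist_from_left, dist_to_right, max_len):
    """Check min_dist(s', v) + length(e) + min_dist(w, t') < L_{s',t'}
    for an edge e = (v, w) and one left/right stop codon pair."""
    v, w = edge
    dl = dist_from_left.get(v, float("inf"))
    dr = dist_to_right.get(w, float("inf"))
    return dl + length + dr < max_len

# Toy graph: s -> a -> b -> t, plus a long side branch a -> c
adj = {"s": [("a", 10)], "a": [("b", 20), ("c", 500)], "b": [("t", 10)]}
radj = {"a": [("s", 10)], "b": [("a", 20)], "c": [("a", 500)], "t": [("b", 10)]}
dist_from_left = dijkstra(adj, "s")   # forward from a left stop position
dist_to_right = dijkstra(radj, "t")   # backward from a right stop position
on_path = edge_relevant(("a", "b"), 20, dist_from_left, dist_to_right, max_len=100)
off_path = edge_relevant(("a", "c"), 500, dist_from_left, dist_to_right, max_len=100)
```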
Subgraph simplification
While processing SCG subgraphs from a particular MAG we use the available information on the coverage of the MAG in the data set. In particular, we set up the simplification module to remove tips (nodes with either no successors or no predecessors) below a certain length (twice the read length), and edges with coverage below a limit that is the larger of 2.0 and one per cent of the MAG coverage for regular edges, or the larger of 2.5 and two per cent of the MAG coverage for short edges.
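These defaults can be summarised as follows (a sketch; the parameter names are ours, not those of the simplification module):

```python
def simplification_thresholds(mag_coverage, read_length=150):
    """Subgraph-simplification defaults described above: clip tips shorter than
    twice the read length; remove edges with coverage below max(2.0, 1% of MAG
    coverage) for regular edges, or max(2.5, 2% of MAG coverage) for short edges."""
    return {
        "max_tip_length": 2 * read_length,
        "regular_edge_min_cov": max(2.0, 0.01 * mag_coverage),
        "short_edge_min_cov": max(2.5, 0.02 * mag_coverage),
    }

t = simplification_thresholds(mag_coverage=500.0)
```

For low-coverage MAGs the absolute floors (2.0 and 2.5) dominate; for deeply covered MAGs the percentage terms take over.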
While simplifying an SCG subgraph, edges connecting it to the rest of the assembly graph must be handled with care (in particular, they should be excluded from the set of potential tips), because in the BayesPaths algorithm they form the potential sources and sinks of the possible haplotype paths. Moreover, the graph changes during simplification, and such edges might become parts of longer edges. Since we are interested in which dead-ends of the component do, and do not, lead to the rest of the graph, the output contains the up-to-date set of connections of the simplified component to the rest of the graph.
We now briefly describe additional procedures, disabled by default, based on ‘relative coverage’ criteria. Amongst other procedures for erroneous edge removal, SPAdes implements one that considers the ratio of an edge's coverage to the coverage of the edges adjacent to it. We define an edge e as ‘predominated’ by an incident vertex v if there is an edge e_{1} outgoing from v and an edge e_{2} incoming to v whose coverages exceed the coverage of e by at least a factor of α. Short edges (shorter than k+ε) predominated by both of their incident vertices are then removed from the graph. Erroneous graph elements in regions of high genomic coverage often form subgraphs of three or more erroneous edges. SPAdes implements a procedure for the search for (and subsequent removal of) subgraphs bounded by a set of predominated edges. Starting from a particular edge (v,w) predominated by vertex v, the graph is traversed breadth-first from vertex w without taking into account the edge directions. If the vertex currently being considered predominates the edge by which it was entered, the edges incident to it are not added to the traversal. The standard limitation on erroneous edge lengths naturally transforms into a condition on the maximum length of the path between the vertices of the traversed subgraph. A limit on the maximum total length of its edges is additionally introduced.
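The ‘predominated’ test can be sketched as follows (our reconstruction; the value of α and the example coverages are hypothetical):

```python
def is_predominated(edge_cov, incoming_covs, outgoing_covs, alpha=2.0):
    """An edge is 'predominated' by an incident vertex v if some edge into v and
    some edge out of v both exceed its coverage by at least a factor of alpha."""
    return (any(c >= alpha * edge_cov for c in incoming_covs) and
            any(c >= alpha * edge_cov for c in outgoing_covs))

# Hypothetical short edge of coverage 3 at a vertex with strong flanking edges
weak = is_predominated(3.0, incoming_covs=[40.0, 2.0], outgoing_covs=[35.0])
# The same flanking edges do not predominate a well-covered edge
strong = is_predominated(30.0, incoming_covs=[40.0], outgoing_covs=[35.0])
```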
BayesPaths
The model
We define an assembly graph \(\mathcal {G} = (\mathcal {V}, \mathcal {E})\) as a collection of unitig sequence vertices \(\mathcal {V} = 1,\ldots,V\) and directed edges \( \mathcal {E} \subseteq \mathcal {V} \times \mathcal {V}\). Each edge defines an overlap and comprises a pair of vertices and directions, \( \left (u^{d} \rightarrow v^{d}\right) \in \mathcal {E} \), where d∈{+,−} indicates whether the overlap involves the sequence (+) or its reverse complement (−). We define:

- Counts x_{v,s} for each unitig v=1,…,V in sample s=1,…,S
- Paths for strain g=1,…,G defined by \(\eta ^{g}_{u,v} \in \{0,1\}\), indicating whether strain g passes through that edge in the graph
- Flow of strain g through unitig v, \(\phi ^{g+}_{v} = \sum _{u \in A(v)} \eta ^{g}_{u,v}\) and \(\phi ^{g-}_{v} = \sum _{u \in D(v)} \eta ^{g}_{v,u}\), where A(v) is the set of ancestors of v and D(v) the set of descendants in the assembly graph
- Flow conservation, so that \(\phi ^{g+}_{v} = \phi ^{g-}_{v} = \phi ^{g}_{v}\)
- Strain intensities γ_{g,s}, the rate per position at which a read is generated from strain g in sample s
- Unitig lengths L_{v}
- Unitig biases θ_{v}, the fractional increase in reads generated from v given factors such as GC content influencing coverage
- A source node s and a sink node t such that \(\phi ^{g-}_{s} = \phi ^{g+}_{t} = 1\)
We then assume normally distributed counts for each node in each sample, giving a joint density for the observations and latent variables:
$$\begin{array}{*{20}l} P(\mathbf{X},\boldsymbol{\Gamma},\mathbf{H},\boldsymbol{\Theta}) = & \prod_{v=1}^{V} \prod_{s=1}^{S} \mathcal{N}\left(x_{v,s} \,\middle|\, L_{v} \theta_{v} \sum_{h=1}^{G} \phi^{h}_{v} \gamma_{h,s},\, \tau^{-1}\right) \prod_{h=1}^{G} \prod_{s=1}^{S} P(\gamma_{h,s} \mid \lambda_{h}) \\ & \cdot \prod_{h=1}^{G} \left(\prod_{v=1}^{V} \left[ \phi^{h+}_{v} = \phi^{h-}_{v} \right]\right) \left[ \phi^{h-}_{s} = 1\right] \left[\phi^{h+}_{t} = 1\right] P(\tau) \\ & \cdot \prod_{h=1}^{G} P(\lambda_{h} \mid \alpha_{0}, \beta_{0}) \prod_{v=1}^{V} P(\theta_{v} \mid \mu_{0}, \tau_{0}) \end{array} $$
(4)
where [⋅] indicates the Iverson bracket, evaluating to 1 if the condition is true and zero otherwise. We assume an exponential prior for the γ_{g,s} with a rate parameter that is strain dependent, such that:
$$ P(\gamma_{g,s} \mid \lambda_{g}) = \lambda_{g} \exp(-\lambda_{g} \gamma_{g,s}) $$
(5)
We then place gamma hyperpriors on the λ_{g}:
$$ P(\lambda_{g} \mid \alpha_{0}, \beta_{0}) = \frac{\beta_{0}^{\alpha_{0}}}{\Gamma (\alpha_{0})} \lambda_{g}^{\alpha_{0} - 1} \exp(- \beta_{0} \lambda_{g}) $$
(6)
This acts as a form of automatic relevance determination (ARD), forcing strains with low intensity across all samples to zero in every sample [8].
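By conjugacy of the gamma hyperprior with the exponential prior, the corresponding CAVI update for each λ_{g} takes the standard Gamma form (our derivation from Eqs. 5 and 6, not stated explicitly in the text):

$$ q^{*}(\lambda_{g}) = \text{Gamma}\left(\alpha_{0} + S,\; \beta_{0} + \sum_{s=1}^{S} \langle \gamma_{g,s} \rangle \right) $$

so strains whose expected intensities \(\langle \gamma_{g,s} \rangle\) are small in every sample acquire a large expected rate \(\langle \lambda_{g} \rangle\), which in turn shrinks those intensities further towards zero.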
We use a standard Gamma prior for the precision:
$$ P(\tau \mid \alpha_{\tau}, \beta_{\tau}) = \frac{\beta_{\tau}^{\alpha_{\tau}}}{\Gamma(\alpha_{\tau})} \tau^{\alpha_{\tau} - 1} \exp(-\beta_{\tau} \tau) $$
(7)
For the biases θ_{v} we use a truncated normal prior:
$$\begin{array}{*{20}l} P(\theta_{v} \mid \mu_{0}, \tau_{0}) & = & \frac{\sqrt{\frac{\tau_{0}}{2 \pi}} \exp\left(- \frac{\tau_{0}}{2} (\theta_{v} - \mu_{0})^{2}\right) }{1 - \Psi\left(-\mu_{0} \sqrt{\tau_{0}} \right)} \;\;\; \theta_{v} \geq 0 \\ & = & 0 \;\;\; \theta_{v} < 0 \\ \end{array} $$
where Ψ is the standard normal cumulative distribution function. The mean of this prior is set to one, μ_{0}=1, so that our prior belief is that the coverage on any given node is unbiased, with a fairly high precision, τ_{0}=100, to reflect an assumption that the observed coverage should reflect the summation of the strains. Finally, we assume a uniform prior over the possible discrete values of the \(\eta ^{g}_{v,u}\). If the assembly graph is a directed acyclic graph (DAG), then \(\eta ^{g}_{v,u} \in \{0,1\}\). We have found that this is true for most genes and typical k-mer lengths, but we do not need to assume it.
Variational approximation
We use variational inference to obtain an approximate solution to the posterior distribution of this model [7]. Variational inference is an alternative strategy to Markov chain Monte Carlo (MCMC) sampling. Rather than attempting to sample from the posterior distribution, variational inference assumes a tractable approximating distribution for the posterior, and then finds the parameters of that distribution that minimise the Kullback-Leibler divergence between the approximation and the true posterior distribution. Further, in mean-field variational inference the approximation is factorised into a product over a number of components that each approximate the posterior of a parameter in the true distribution. In practice the Kullback-Leibler divergence is not computable because it depends on the evidence, i.e. the joint distribution marginalised over all latent variables. Instead, inference is carried out by maximising the evidence lower bound (ELBO), which is equal to the negative of the Kullback-Leibler divergence plus a constant, that constant being the evidence. In our case, because all the distributions are conjugate, we can employ coordinate ascent variational inference (CAVI) to iteratively maximise the ELBO.
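CAVI itself is easiest to see on a toy conjugate model. The sketch below applies it to a univariate normal with unknown mean and precision (the standard textbook example, not the BayesPaths model; all names are ours):

```python
def cavi_normal(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """CAVI for data x_i ~ N(mu, 1/tau) with priors mu ~ N(mu0, 1/(lam0*tau))
    and tau ~ Gamma(a0, b0), under the factorisation q(mu, tau) = q(mu) q(tau)."""
    n = len(x)
    xbar = sum(x) / n
    a = a0 + (n + 1) / 2.0   # shape of q(tau); fixed across iterations
    e_tau = a0 / b0          # initial guess for <tau>
    for _ in range(iters):
        # Update q(mu) = N(m, 1/lam_n), holding q(tau) fixed at its mean
        m = (lam0 * mu0 + n * xbar) / (lam0 + n)
        lam_n = (lam0 + n) * e_tau
        # Update q(tau) = Gamma(a, b) using <mu> = m and <mu^2> = m^2 + 1/lam_n
        e_mu2 = m * m + 1.0 / lam_n
        b = b0 + 0.5 * (sum(xi * xi - 2.0 * xi * m for xi in x) + n * e_mu2
                        + lam0 * (e_mu2 - 2.0 * m * mu0 + mu0 * mu0))
        e_tau = a / b
    return m, lam_n, a, b

data = [2.1, 1.9, 2.3, 2.0, 1.7, 2.2]
m, lam_n, a, b = cavi_normal(data)
```

Each update holds the other factor fixed at its current expectations; iterating increases the ELBO monotonically, which is the same scheme applied to the much larger model above.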
Our starting point is to assume the following factorisation for the variational approximation:
$$ q(\boldsymbol{\Gamma},\mathbf{H},\boldsymbol{\Theta},\boldsymbol{\lambda},\tau) = \prod_{h=1}^{G} q_{h}\left(\left\{ \eta^{h}_{v,u} \right\}_{u,v \in \mathcal{A}}\right) \prod_{h=1}^{G} \prod_{s=1}^{S} q_{h}(\gamma_{h,s}) \prod_{h=1}^{G} q_{h}(\lambda_{h}) \prod_{v=1}^{V} q_{v}(\theta_{v}) q(\tau) $$
(8)
where \(\mathcal {A}\) is the set of edges in the assembly graph and \(\mathcal {V} = 1,\ldots,V\) the set of unitig sequence vertices. Note that we have assumed a fully factorised approximation except for the \(\eta ^{h}_{v,u}\), the paths for each strain through the graph. There we assume that the path for each strain forms a separate factor allowing strong correlations between the different elements of the path. This is therefore a form of structured variational inference [22].
To obtain the CAVI updates we use the standard procedure of finding the log of the optimal distributions q for each set of factorised variables as the expectation of the log joint distribution Eq. 4 over all the other variables, except the one being optimised. Using an informal notation we will denote these expectations as \(\langle \ln P \rangle _{q_{j}}\) where q_{j} is the variable being optimised.
The mean-field update for each set of \( \{\eta ^{g}_{v,u}\}_{u,v \in A}\) is then derived as:
$$\begin{array}{*{20}l} \ln q_{g}^{*}\left(\left\{ \eta^{g}_{v,u} \right\}_{u,v \in A}\right) & = \left\langle \ln P \right\rangle_{\left\{\eta^{g}_{v,u}\right\}_{u,v \in \mathcal{A}}} \\ & = \ln \left(\prod_{v=1}^{V} \delta_{\phi^{g+}_{v}, \phi^{g-}_{v}} \delta_{\phi^{g-}_{s},1} \delta_{\phi^{g+}_{t},1} \right) \\ & \quad - \left\langle \sum_{v=1}^{V} \sum_{s=1}^{S} \frac{\tau}{2} \left(x_{v,s} - \theta_{v} L_{v}\left[\sum_{h=1}^{G} \phi^{h}_{v} \gamma_{h,s}\right] \right)^{2} \right\rangle_{\left\{\eta^{g}_{v,u}\right\}_{u,v \in \mathcal{A}}} \\ & \quad + \text{Terms independent of } \eta^{g} \end{array} $$
Consider the second term only:
$$ -\frac{\langle \tau \rangle}{2} \left( \sum_{v=1}^{V} \sum_{s=1}^{S} -2 x_{v,s} L_{v} \langle \theta_{v} \rangle \langle \gamma_{g,s} \rangle \phi^{g}_{v} + L_{v}^{2} \left\langle \theta_{v}^{2} \right\rangle \left\langle \left(\sum_{h=1}^{G} \phi^{h}_{v} \gamma_{h,s}\right) \left(\sum_{i=1}^{G} \phi^{i}_{v} \gamma_{i,s}\right) \right\rangle \right) $$
This becomes:
$$\begin{aligned} -\frac{\langle \tau \rangle}{2} \Bigg(\sum_{v=1}^{V} \sum_{s=1}^{S} \Big[ -2 x_{v,s} \langle \theta_{v} \rangle L_{v} \langle \gamma_{g,s} \rangle \phi^{g}_{v} + 2 L_{v}^{2} \left\langle \theta_{v}^{2} \right\rangle \sum_{h \neq g }^{G} \left\langle \phi^{h}_{v} \right\rangle \langle \gamma_{h,s} \rangle \langle \gamma_{g,s} \rangle \phi^{g}_{v} \\ + L_{v}^{2} \left\langle \theta_{v}^{2} \right\rangle \left\langle \gamma_{g,s}^{2} \right\rangle \left(\phi^{g}_{v}\right)^{2} \Big] \Bigg) \end{aligned} $$
This can be reorganised to:
$$ \ln q_{g}^{*}\left(\left\{ \eta^{g}_{v,u} \right\}_{u,v \in A}\right) = \ln \left(\prod_{v=1}^{V} \delta_{\phi^{g+}_{v}, \phi^{g-}_{v}} \delta_{\phi^{g-}_{s},1} \delta_{\phi^{g+}_{t},1} \right) + \sum_{v=1}^{V} c_{1,v} \phi^{g}_{v} + \sum_{v=1}^{V} c_{2,v} \left(\phi^{g}_{v}\right)^{2} $$
(9)
where:
$$\begin{array}{@{}rcl@{}} c_{1,v} & = & -\frac{\langle \tau \rangle}{2} \sum_{s=1}^{S} \left[ -2 x_{v,s} \langle \theta_{v} \rangle L_{v} \langle \gamma_{g,s} \rangle + 2 L_{v}^{2} \left\langle \theta_{v}^{2} \right\rangle \sum_{h \neq g }^{G} \left\langle \phi^{h}_{v} \right\rangle \langle \gamma_{h,s} \rangle \langle \gamma_{g,s} \rangle \right] \\ c_{2,v} & = & -\frac{\langle \tau \rangle}{2} L_{v}^{2} \left\langle \theta_{v}^{2} \right\rangle \sum_{s=1}^{S} \left\langle \gamma_{g,s}^{2} \right\rangle \end{array} $$
It is apparent from Eq. 9 that \(q_{g}^{*}\left (\left \{ \eta ^{g}_{v,u} \right \}_{u,v \in A}\right)\) takes the form of a multivariate discrete distribution over the u,v∈A dimensions. The first term in Eq. 9 enforces the flow constraints and does not separate across nodes; the next two terms are effectively coefficients on the total flow through a unitig and its square. The updates for the other variables below depend on the expected values of the total flow through each of the unitig nodes for strain g, \(\left \langle \phi ^{g}_{v} \right \rangle \), which themselves depend on the \(\eta ^{g}_{v,u}\). These expected values can be efficiently obtained for all v by representing Eq. 9 as a factor graph, with factor nodes corresponding to both the constraints and the flow probabilities through each node and with variables \(\eta ^{g}_{v,u}\). We can then find the marginal probabilities for both the \(\eta ^{g}_{v,u}\) and the \(\phi ^{g}_{v}\) using the junction tree algorithm [48]; from these we can calculate the required expectations.
Next we consider the mean-field update for the γ_{g,s}:
$$\begin{array}{*{20}l} \ln q^{*}(\gamma_{g,s}) & = \left\langle \ln P \right\rangle_{\gamma_{g,s}} \\ & = -\left\langle \sum_{v=1}^{V} \frac{\tau}{2}\left(x_{v,s} - \theta_{v} L_{v}\left[\sum_{h=1}^{G} \phi^{h}_{v} \gamma_{h,s}\right] \right)^{2} \right\rangle_{\gamma_{g,s}} - \langle\lambda_{g}\rangle \gamma_{g,s} \end{array} $$
$$\begin{array}{*{20}l} \ln q^{*}(\gamma_{g,s}) & = \\ & -\frac{\langle \tau \rangle }{2} \Bigg (\sum_{v=1}^{V} -2 x_{v,s} \langle \theta_{v} \rangle L_{v} \left\langle \phi^{g}_{v} \right\rangle \gamma_{g,s} + 2 \left\langle \theta_{v}^{2} \right\rangle L_{v}^{2} \gamma_{g,s} \left\langle \phi_{v}^{g} \right\rangle \sum_{h \neq g} \langle \gamma_{h,s} \rangle \left\langle \phi^{h}_{v} \right\rangle \\ & + \left\langle \theta_{v}^{2} \right\rangle L_{v}^{2} \gamma_{g,s}^{2} \left\langle \left[\phi^{g}_{v}\right]^{2} \right\rangle \Bigg) - \langle\lambda_{g}\rangle \gamma_{g,s} \end{array} $$
With the restriction γ_{g,s}>0 this gives a normal distribution truncated to (0,∞) for γ_{g,s}, with mean and precision:
$$\begin{array}{*{20}l} \mu_{g,s} & = \frac{\sum_{v} x_{v,s} \langle \theta_{v} \rangle L_{v} \left\langle \phi_{v}^{g} \right\rangle - \left\langle \theta_{v}^{2} \right\rangle L_{v}^{2} \left\langle \phi_{v}^{g} \right\rangle \sum_{h \neq g} \langle \gamma_{h,s} \rangle \left\langle \phi^{h}_{v}\right\rangle}{\sum_{v} \left\langle \theta_{v}^{2} \right\rangle L_{v}^{2} \left\langle \left[\phi^{g}_{v}\right]^{2} \right\rangle} - \frac{\langle\lambda_{g}\rangle}{\tau_{g,s} } \end{array} $$
(10)
$$\begin{array}{*{20}l} \tau_{g,s} & = \langle \tau \rangle \sum_{v} \left\langle \theta_{v}^{2} \right\rangle L_{v}^{2} \left\langle \left[\phi^{g}_{v}\right]^{2} \right\rangle \end{array} $$
(11)
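The truncated-normal update of Eqs. 10 and 11 can be sketched as follows; this is a minimal Python illustration in which the expectation arrays are made up, standing in for the current variational factors (and including the \(\left\langle \theta_{v}^{2} \right\rangle\) terms that appear in the expansion above):

```python
# Sketch of the truncated-normal update for gamma_{g,s} (Eqs. 10-11).
# All expectation arrays here are simulated placeholders; in the real CAVI
# they come from the current variational distributions.
import numpy as np

rng = np.random.default_rng(1)
V, S, G = 6, 3, 2
x = rng.poisson(20.0, size=(V, S)).astype(float)   # unitig counts x_{v,s}
L = np.full(V, 100.0)                               # unitig lengths L_v
theta = np.ones(V); theta2 = np.ones(V)             # <theta_v>, <theta_v^2>
phi = rng.random((G, V)); phi2 = phi**2             # <phi_v^g>, <[phi_v^g]^2> (illustrative)
gamma = rng.random((G, S)); tau = 1.0; lam = np.full(G, 0.1)

def update_gamma(g, s):
    """Return (mu_{g,s}, tau_{g,s}) for the truncated normal q(gamma_{g,s})."""
    others = sum(gamma[h, s] * phi[h] for h in range(G) if h != g)
    num = np.sum(x[:, s] * theta * L * phi[g]
                 - phi[g] * others * theta2 * L**2)
    den = np.sum(theta2 * L**2 * phi2[g])
    tau_gs = tau * den                      # Eq. 11 (precision)
    mu_gs = num / den - lam[g] / tau_gs     # Eq. 10 (mean, before truncation)
    return mu_gs, tau_gs
```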
Derivations for the other updates follow similarly giving a Gamma posterior for the τ with parameter updates:
$$\begin{array}{*{20}l} \alpha_{\tau} & = \alpha_{0} + \Omega/2 \end{array} $$
(12)
$$\begin{array}{*{20}l} \beta_{\tau} & = \beta_{0} + \frac{1}{2}\sum_{v,s} \left\langle (x_{v,s} - \lambda_{v,s})^{2}\right\rangle \end{array} $$
(13)
where Ω=VS and we have used λ_{v,s} as shorthand for the predicted count:
$$\lambda_{v,s} = \theta_{v} L_{v} \sum_{g} \gamma_{g,s} \phi_{v}^{g}. $$
Then the τ have the following expectations and log expectations:
$$\begin{array}{*{20}l} \langle \tau \rangle & = \alpha_{\tau}/\beta_{\tau} \end{array} $$
(14)
$$\begin{array}{*{20}l} \langle \log \tau \rangle & = \psi\left(\alpha_{\tau}\right) - \log\left(\beta_{\tau}\right) \end{array} $$
(15)
where ψ is the digamma function. The biases θ_{v} have a truncated normal distribution and their updates can be derived similarly to the above.
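The Gamma update and its expectations (Eqs. 12-15) can be sketched as follows, assuming the standard conjugate form with a half factor on the summed squared residuals; the prior values and the residual matrix are illustrative:

```python
# Sketch of the Gamma update for the global precision tau (Eqs. 12-15),
# with a toy expected squared residual matrix standing in for <(x-lambda)^2>.
import numpy as np
from scipy.special import digamma

alpha0, beta0 = 1e-3, 1e-3                 # broad Gamma prior (assumed values)
sq_resid = np.full((6, 3), 4.0)            # toy <(x_{v,s} - lambda_{v,s})^2>

Omega = sq_resid.size                      # Omega = V * S
alpha_tau = alpha0 + Omega / 2.0           # Eq. 12
beta_tau = beta0 + 0.5 * sq_resid.sum()    # Eq. 13
mean_tau = alpha_tau / beta_tau            # Eq. 14: <tau>
mean_log_tau = digamma(alpha_tau) - np.log(beta_tau)   # Eq. 15: <log tau>
```

Note that, by Jensen's inequality, the expected log precision is always below the log of the expected precision.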
Evidence lower bound (ELBO)
Iterating the CAVI updates defined above generates a variational approximation that is optimal in the sense of maximising the evidence lower bound (ELBO), so called because it bounds the log evidence, log(p(x))≥ELBO(q(z)). It is useful to calculate the ELBO whilst performing CAVI updates to verify convergence, and the ELBO itself is sometimes used as a Bayesian measure of model fit, although as a bound this may be controversial [7]. The ELBO can be calculated from the relationship:
$$ ELBO(q) = \mathbb{E} \left[ \log p(x \mid z) \right] + \mathbb{E} \left[ \log p(z) \right] - \mathbb{E} \left[ \log q(z)\right] $$
(16)
The first term is simply the expected log-likelihood of the data given the latent variables. In our case it is:
$$ \mathbb{E} \left[ \log p(x \mid z) \right] = \frac{1}{2}\Omega \left (\langle \log \tau \rangle - \log (2 \pi)\right) - \frac{\langle \tau \rangle}{2} \sum_{v,s} \left\langle \left(x_{v,s} - L_{v} \theta_{v} \sum_{h=1}^{G} \phi^{h}_{v} \gamma_{h,s}\right)^{2}\right\rangle $$
(17)
where Ω=VS and the expectations are over the optimised distributions q.
The second term is the expectation under q(z) of the log prior distributions. In our case, with standard distributions, this is easy to calculate; for instance, for each of the γ_{g,s}:
$${} \mathbb{E} \left[ \log p(\gamma_{g,s}) \right ] = \frac{1}{2} \log\left(\frac{\tau_{g,s}}{2\pi}\right) - \frac{\tau_{g,s}}{2}\left(\left\langle \gamma_{g,s}^{2} \right\rangle + \mu_{g,s}^{2} - 2 \mu_{g,s} \langle \gamma_{g,s} \rangle \right) - \log \left[\frac{1}{2} \text{erfc} \left(-\mu_{g,s}\sqrt {\frac{\tau_{g,s}}{2}}\right) \right]. $$
Here the μ_{g,s} and τ_{g,s} are given by their current values derived from Eqs. 10 and 11, and the moments of γ_{g,s} are calculated from a truncated normal distribution with those current parameters. The third term is simply the negative entropy of the variational approximations, which is easily calculated for the standard distributions used here.
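The truncated-normal moments required here are available directly from scipy; a minimal sketch with illustrative parameter values:

```python
# The prior expectation in the ELBO needs the first two moments of a normal
# truncated to (0, inf); scipy's truncnorm provides these directly.
# The values of mu_{g,s} and tau_{g,s} below are illustrative.
import numpy as np
from scipy.stats import truncnorm

mu, tau = 0.8, 4.0                   # current variational parameters
sigma = 1.0 / np.sqrt(tau)
a = (0.0 - mu) / sigma               # lower bound 0 in standard units
dist = truncnorm(a, np.inf, loc=mu, scale=sigma)

m1 = dist.mean()                     # <gamma_{g,s}>
m2 = dist.var() + m1**2              # <gamma_{g,s}^2>
```

Truncation at zero shifts the mean upwards and shrinks the variance relative to the untruncated normal.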
Implementation details
One update of the algorithm consists of updating each variable, or set of variables, in turn given the current optimal distributions of the others. In practice we update:

1.
Compute the marginal flows \(\left \{ \eta ^{g}_{v,u} \right \}_{u,v \in A}\) for each strain g=1,…,G in turn using Eq. 9 and the Junction Tree algorithm. This can be performed independently for each single-copy core gene.

2.
Update the truncated normal strain abundances q(γ_{g,s}) for each strain in each sample, s=1,…,S, using Eqs. 10 and 11.

3.
Update the q(τ).

4.
Update the ARD parameter distributions q(λ_{g}), if used.

5.
Update the node biases q(θ_{v}).

6.
Check for redundant or low abundance strains and remove them (see below).
After a fixed maximum number of iterations, or once the ELBO converges, we stop iterating. Variational inference can be sensitive to initial conditions as it only finds local maxima of the ELBO; we therefore use a previously published variational Bayesian version of non-negative matrix factorisation [8] to find an initial approximate solution.
Empirical modelling of node variances
For low-coverage MAGs a precision that is identical for all nodes performs satisfactorily, but since the true distribution of read counts is Poisson, this overestimates the precision for nodes with high counts x_{v,s}. To address this, we developed an empirical procedure in which we first calculate 〈logτ_{v,s}〉 for each node using Eq. 15 as:
$$ \langle \log \tau_{v,s}\rangle = \psi\left(\alpha_{0} + 1/2\right) - \log\left(\beta_{0} X_{v,s} +\frac{1}{2} \left\langle (x_{v,s} - \lambda_{v,s})^{2} \right\rangle \right) $$
(18)
This quantity exhibits high variability, so we smooth it over log(x_{v,s}) using generalised additive models, as implemented in pyGAM [43], to give 〈logτ_{v,s}〉^{∗}. The term β_{0}X_{v,s} gives a prior that is effectively Poisson. We then obtain 〈τ_{v,s}〉 as exp(〈logτ_{v,s}〉^{∗}). This procedure has no theoretical justification but gives good results in practice. This approach of modelling a non-Gaussian distribution as a Gaussian with empirical variances is similar to that used in voom for RNA-Seq [27].
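A minimal sketch of this smoothing step on simulated data; a numpy polynomial fit stands in here for the pyGAM generalised additive model used in the actual pipeline:

```python
# Smooth the noisy per-node <log tau_{v,s}> over log(x_{v,s}) before
# exponentiating. A low-order polynomial fit stands in for the pyGAM
# generalised additive model; the data are simulated with Poisson-like
# variance so that tau is roughly 1/x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(50.0, size=500) + 1.0                # node counts x_{v,s}
log_tau = -np.log(x) + rng.normal(0.0, 0.3, x.size)  # noisy <log tau_{v,s}>

coef = np.polyfit(np.log(x), log_tau, deg=2)         # smooth over log(x)
log_tau_star = np.polyval(coef, np.log(x))           # <log tau_{v,s}>*
tau_star = np.exp(log_tau_star)                      # <tau_{v,s}>
```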
Cross-validation to determine optimum number of strains
The ARD procedure usually converges on the correct number of strains, except for high-coverage MAGs where overfitting may occur and too many strains can be predicted. We therefore additionally implemented a cross-validation procedure, splitting the entire data matrix x_{v,s} into test and train folds (default ten folds), training the model on the train folds and then testing on the held-out data. The correct number of strains was taken to be the one that maximised the log predictive posterior with an empirical variance reflecting the Poisson nature of the data, the exact test statistic being:
$$ \sum_{v,s \in E} \frac{1}{2}\log\left(\tau'_{v,s}\right) - \frac{1}{2}\sum_{v,s \in E} \tau'_{v,s} \left\langle {(x_{v,s} - \lambda_{v,s})}^{2}\right\rangle $$
(19)
where \(\tau'_{v,s} = 1/(0.5+x_{v,s})\) and E indicates data points in the test set; this down-weights high read count nodes, reflecting approximately Poisson noise.
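The test statistic of Eq. 19 can be sketched as follows, on simulated held-out counts and predictions:

```python
# Sketch of the cross-validated log predictive score of Eq. 19, down-weighting
# high-count nodes with the empirical precision tau' = 1/(0.5 + x).
# Held-out counts and model predictions are simulated for illustration.
import numpy as np

rng = np.random.default_rng(2)
x_test = rng.poisson(30.0, size=(50, 5)).astype(float)   # held-out counts
lam_pred = np.full_like(x_test, 30.0)                    # model predictions

def cv_score(x, lam):
    tau_p = 1.0 / (0.5 + x)                              # approx. Poisson noise
    return 0.5 * np.sum(np.log(tau_p)) - 0.5 * np.sum(tau_p * (x - lam) ** 2)

score = cv_score(x_test, lam_pred)
# A model with systematically biased predictions scores lower than the truth.
```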
Nanopore sequence analysis
Sequence preprocessing
To enable a qualitative comparison between haplotypes obtained from the Nanopore reads and the BayesPaths predictions, we developed the following pipeline, applied at the level of individual single-copy core genes (SCGs) from MAGs:

1.
We mapped all reads using minimap2 [30] against the SCG contig consensus ORF sequence and selected those reads with alignments that spanned at least one third of the gene with a nucleotide identity >80%.

2.
We then extracted the aligned portion of each read, reverse complementing if necessary, so that all reads ran in the same direction as the SCG ORF.

3.
We then obtained the variant positions on the consensus from the output of the DESMAN pipeline [39]. These are variant positions prior to haplotype calling representing the total unlinked variation observed in the short reads.

4.
For each Nanopore fragment, we aligned it against the SCG ORF using a Needleman-Wunsch global alignment and generated a condensed read comprising bases only from the short-read variant positions.
This provided us with a reduced representation of each Nanopore read, effectively filtering variation that was not observed in the short reads. These reduced representations were then used to calculate distances, defined as Hamming distances on the variant positions normalised by the number of positions observed, both between reads and between reads and the predicted COG sequences from BayesPaths. From these we generated NMDS plots indicating sequence diversity, and they provided an input to the hybrid Nanopore strain resolution algorithm below.
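The distance calculation can be sketched as a minimal Python function; here '-' marks a variant position a read did not span, and both the gap convention and the toy reads are illustrative:

```python
# Distance between condensed reads: Hamming distance over the short-read
# variant positions, normalised by the positions both sequences actually
# cover ('-' marks a position the read did not span).
def variant_distance(a, b, gap='-'):
    shared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not shared:
        return float('nan')        # no overlapping positions to compare
    return sum(x != y for x, y in shared) / len(shared)

d = variant_distance('ACGT-A', 'ACTTGA')   # 5 positions compared, 1 mismatch
# d == 0.2
```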
EM algorithm for hybrid Nanopore strain resolution
We also developed a simple EM algorithm for direct prediction of paths and their abundances on the short read assembly graph that are consistent with a set of observed long reads. We began by mapping the set of n=1,…,N Nanopore sequences denoted \( \{ \mathcal {S}_{n} \}\) onto the corresponding simplified SCG graph generated by STRONG using GraphAligner [40]. This provided us with N optimal alignment paths as walks in our SCG graph. We denote this graph \(\mathcal {G}\) comprising unitig vertices v and edges e∈{u,v} defining overlaps.
We assume, as is almost always the case, that the graphs contain no unitigs in both forward and reverse configurations, and that there are no cycles, so that each SCG is a directed acyclic graph (DAG) with one copy of each unitig; we then only need to track the direction that each overlap enters and leaves each unitig. The best alignment walks then comprise a sequence of edges, \(e^{n}_{1},\dots, e^{n}_{W_{n}}\), that traverse the graph, where W_{n} is the number of edges in the walk of read n.
Given these observed Nanopore reads we aim to reconstruct G haplotypes comprising paths from a dummy source node s, which has outgoing edges to all the true source nodes in the graph, through the graph to a dummy sink t, which connects all the true sinks. We further assume that each haplotype has relative frequency π_{g}. Each such haplotype path \(\mathcal {P}_{g} = \left \{s,e^{g}_{1},\ldots, e^{g}_{W_{g}}, t \right \}\) will translate into a nucleotide sequence \(\mathcal {T}_{g}\). We assume that these haplotypes generate Nanopore reads with a fixed error rate ε which gives a likelihood:
$$ P(\{ \mathcal{S}_{1},\ldots, \mathcal{S}_{N} \} \mid \pi, \{ \mathcal{T}_{1},\ldots, \mathcal{T}_{G} \}) = \prod_{n=1}^{N} \left(\sum_{g=1}^{G} \pi_{g} \epsilon^{m_{n,g}} (1 - \epsilon)^{M_{n,g}} \right). $$
(20)
where m_{n,g} is the number of base-pair mismatches between \(\mathcal {S}_{n}\) and \(\mathcal {T}_{g}\), counting insertions, deletions, and substitutions equally, and M_{n,g} is the number of matches.
To maximise this likelihood we used an Expectation-Maximisation (EM) algorithm, iterating the following steps until convergence:

1.
E-step: Calculate the responsibility of each haplotype for each sequence as:
$$ z_{n,g} = \frac{ \pi_{g} \epsilon^{m_{n,g}} (1 - \epsilon)^{M_{n,g}}}{\sum_{h} \pi_{h} \epsilon^{m_{n,h}} (1 - \epsilon)^{M_{n,h}}} $$
(21)
Alignments of reads against haplotypes were performed using vsearch [41].

2.
M-step: Update each haplotype by finding the most likely path on the short-read graph given the current expectations. This is done by assigning a weight \(w^{g}_{e}\) to each edge e in the graph as:
$$ w^{g}_{e} = \sum_{n \in e} z_{n,g} L_{e} $$
(22)
where n∈e denotes the set of reads whose optimal alignment contains that edge and L_{e} is the unique length of the unitig the edge connects to, i.e. ignoring the overlap length. We then find, for haplotype g, the maximal weight path through this DAG using a topological sort. The error rates are updated as:
$$ \epsilon = \frac{\sum_{n} \sum_{g} z_{n,g} m_{n,g}}{ \sum_{n} \sum_{g} z_{n,g} L_{n,g}} $$
(23)
where L_{n,g} are the alignment lengths.
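One iteration of the E and M steps above can be sketched on toy inputs; the mismatch counts, graph, and edge weights below are made up, and the maximal weight path search is the standard dynamic program over a topological order:

```python
# One EM iteration for the hybrid strain resolution, on toy data:
# responsibilities (Eq. 21), the error-rate update (Eq. 23), and a
# maximal-weight path search by topological sort (for Eq. 22 weights).
import numpy as np

eps, pi = 0.1, np.array([0.5, 0.5])        # error rate, strain frequencies
m = np.array([[2, 10], [9, 1]])            # mismatches m_{n,g} (N=2, G=2)
M = np.array([[98, 90], [91, 99]])         # matches M_{n,g}

# E-step (Eq. 21): responsibility of haplotype g for read n.
like = pi * eps**m * (1 - eps)**M
z = like / like.sum(axis=1, keepdims=True)

# M-step (Eq. 23): update the error rate from weighted mismatches.
aln_len = m + M                            # alignment lengths L_{n,g}
eps_new = (z * m).sum() / (z * aln_len).sum()

# Maximal-weight path through a DAG via a pass in topological order.
def best_path(nodes, edges, w):
    """nodes: topological order; edges: {u: [v, ...]}; w: {(u, v): weight}."""
    score = {v: float('-inf') for v in nodes}
    score[nodes[0]] = 0.0
    back = {}
    for u in nodes:
        for v in edges.get(u, []):
            if score[u] + w[(u, v)] > score[v]:
                score[v], back[v] = score[u] + w[(u, v)], u
    path, v = [nodes[-1]], nodes[-1]
    while v in back:
        v = back[v]
        path.append(v)
    return path[::-1]

p = best_path(['s', 'a', 'b', 't'],
              {'s': ['a', 'b'], 'a': ['t'], 'b': ['t']},
              {('s', 'a'): 5.0, ('s', 'b'): 1.0, ('a', 't'): 0.0, ('b', 't'): 0.0})
# p == ['s', 'a', 't']
```

Here each read is assigned almost entirely to the haplotype it matches best, so the updated error rate drops towards the mismatch fraction of the best assignments.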
As is often the case with EM algorithms, convergence depends strongly on initial conditions. We therefore initialise using partitioning around medoids clustering with the distances calculated in the "Methods - Nanopore sequence analysis" section. We can estimate the number of haplotypes from the negative log-likelihood as a function of haplotype number.