Local graph construction
We construct each local graph in the PanRG from an MSA using an iterative partitioning process. The resulting sequence graph contains nested bubbles representing alternative alleles.
Let A be an MSA of length n. For each row of the MSA a = {a_{0}, …, a_{n−1}} ∈ A let a_{i, j} = {a_{i}, …, a_{j−1}} be the subsequence of a in interval [i, j). Let s(a) be the DNA sequence obtained by removing all non-AGCT symbols. We can partition alignment A either vertically by partitioning the interval [0, n) or horizontally by partitioning the set of rows of A. In both cases, the partition induces a number of sub-alignments.
For vertical partitions, we define slice_{A}(i, j) = {a_{i,j} : a ∈ A}. We say that interval [i, j) is a match interval if j − i ≥ m, where m = 7 is the default minimum match length, and there is a single non-trivial sequence in the slice, i.e.,
$$ \left|\left\{s(a):a\in {slice}_A\left(i,j\right)\ \mathrm{and}\ s(a)\ne ""\right\}\right| =1. $$
Otherwise, we call it a non-match interval.
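This match-interval test can be sketched in a few lines. The following is an illustrative Python sketch, not the make_prg implementation; the helper names `s` and `is_match_interval` are ours.

```python
# Illustrative sketch of match-interval detection on a vertical MSA slice.
MIN_MATCH_LEN = 7  # default minimum match length m

def s(row):
    """s(a): the DNA sequence of an MSA row with all non-AGCT symbols removed."""
    return "".join(c for c in row if c in "ACGT")

def is_match_interval(msa, i, j):
    """True if the vertical slice [i, j) is a match interval: long enough,
    and containing a single non-trivial sequence across all rows."""
    if j - i < MIN_MATCH_LEN:
        return False
    sequences = {s(row[i:j]) for row in msa} - {""}  # drop the trivial ""
    return len(sequences) == 1
```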
For horizontal partitions, we use a reference-based approach combined with K-means clustering [62] to divide sequences into increasing numbers of clusters K = 2, 3, … until each cluster meets a “one-reference-like” criterion or until K = 10. More formally, let U be the set of all m-mers (substrings of length m, the minimum match length) in {s(a) : a ∈ A}. For a ∈ A, we transform sequence s(a) into a count vector \( \overline{x_a}=\left\{{x_a}^1,\dots, {x_a}^{\left|U\right|}\right\} \), where x_{a}^{i} is the count of unique m-mer i ∈ U. The K-means algorithm partitions {s(a) : a ∈ A} into K clusters \( \overline{C}=\left\{{C}_1,\dots, {C}_K\right\} \) by minimizing the inertia, defined as
$$ \arg\ {\min}_{\overline{C}}{\sum}_{j=1}^K\sum \limits_{\overline{x_a}\in {C}_j}{\left\Vert \overline{x_a}-{\mu}_j\right\Vert}^2 $$
where \( {\mu}_j=\frac{1}{\left|{C}_j\right|}{\sum}_{\overline{x_a}\in {C}_j}\overline{x_a} \) is the mean of cluster j.
Given a (sub)alignment A, we define the reference of A, ref(A), to be the concatenation of the most frequent nucleotide at each position of A. We say that a K-partition is one-reference-like if, for the corresponding sub-alignments A_{1}, …, A_{K}, the Hamming distance between each sequence and its sub-alignment reference satisfies
$$ \left|s(a)-ref\left({A}_i\right)\right| <d\ast len(A)\kern2em \forall a\in {A}_i $$
where \( \left|s(a)-ref\left({A}_i\right)\right| \) denotes the Hamming distance between s(a) and ref(A_{i}), and d denotes a maximum Hamming distance threshold, set at 0.2 by default. In this case, we accept the partition; otherwise, we look for a (K + 1)-partition.
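To make the acceptance test concrete, here is a small Python sketch of the one-reference-like criterion. It is illustrative only: for simplicity it computes Hamming distances on the aligned rows rather than on the gap-stripped sequences s(a), and the helper names are ours.

```python
from collections import Counter

def ref(msa):
    """ref(A): concatenation of the most frequent symbol in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*msa))

def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def is_one_reference_like(subalignments, d=0.2):
    """True if every row of every sub-alignment is within d * length of its
    sub-alignment's reference (distances taken on the aligned rows)."""
    return all(
        hamming(row, ref(sub)) < d * len(row)
        for sub in subalignments
        for row in sub
    )
```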
The recursive algorithm first partitions an MSA vertically into match and non-match intervals. Match intervals are collapsed down to the single sequence they represent. Independently for each non-match interval, the alignment slice is partitioned horizontally into clusters. The same process is then applied to each induced sub-alignment until a maximum number of recursion levels, r = 5, has been reached. For any remaining alignments, a node is added to the local graph for each unique sequence. See Additional file 1: Supplementary Animation 1 for an example of this algorithm. We name this algorithm Recursive Cluster and Collapse (RCC). When building the local graph from an MSA, we record and serialize the recursion tree of the RCC algorithm, as well as memoizing all the data in each recursion node. We call this algorithm memoized RCC (MRCC). Once the MRCC recursion tree is generated, we can obtain the local graph through a pre-order traversal of the tree (which is equivalent to the call order of the recursive functions in an execution of the RCC algorithm). To update the local graph with new alleles using the MRCC algorithm, we can deserialize the recursion tree from disk, infer in which leaves of the recursion tree the new alleles should be added, add the new alleles in bulk, and then update each modified leaf. This leaf update operation consists of updating just the sub-alignment of the leaf (which is generally a small fraction of the whole MSA) with the new alleles using MAFFT [63] and recomputing the recursion at the leaf node. All of this is implemented in the make_prg repository (see “Code availability”).
(w,k)-minimizers of graphs
We define (w,k)-minimizers of strings as in Li [27]. Let \( \upvarphi :{\Sigma}^k\to \mathfrak{R} \) be a k-mer hash function and let π : Σ^{∗} × {0, 1} → Σ^{∗} be defined such that π(s, 0) = s and \( \pi \left(s,1\right)=\overline{s} \), where \( \overline{s} \) is the reverse complement of s. Consider any integers k ≥ w > 0. For window start position 0 ≤ j ≤ |s| − w − k + 1, let
$$ {T}_j=\left\{\pi \left({s}_{p,p+k},r\right):j\le p<j+w,r\in \left\{0,1\right\}\right\} $$
be the set of forward and reverse-complement k-mers of s in this window. We define a (w,k)-minimizer to be any triple (h, p, r) such that
$$ h=\varphi \left(\pi \left({s}_{p,p+k},r\right)\right)=\min \left\{\varphi (t):t\in {T}_j\right\}. $$
The set W(s) of (w,k)-minimizers for s is the union of minimizers over such windows:
$$ W(s)=\bigcup \limits_{0\le j\le \mid s\mid -w-k+1}\left\{\left(h,p,r\right):h=\min \left\{\upvarphi (t):t\in {T}_j\right\}\right\}. $$
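The string case can be sketched as follows. This is illustrative Python only: the hash φ here is a stand-in 2-bit encoding (pandora's actual hash function differs), and ties within a window are kept, so one window can contribute more than one minimizer.

```python
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMP)[::-1]

def phi(kmer):
    """Stand-in hash: 2-bit encoding with A=0, C=1, G=2, T=3."""
    h = 0
    for c in kmer:
        h = h * 4 + "ACGT".index(c)
    return h

def minimizers(s, w, k):
    """W(s): all (hash, position, strand) triples that minimise some window
    of w consecutive k-mers, considering both strands."""
    W = set()
    for j in range(len(s) - w - k + 2):         # window start positions
        T_j = []
        for p in range(j, j + w):
            kmer = s[p:p + k]
            T_j.append((phi(kmer), p, 0))           # forward strand
            T_j.append((phi(revcomp(kmer)), p, 1))  # reverse complement
        h_min = min(h for h, _, _ in T_j)
        W |= {t for t in T_j if t[0] == h_min}  # keep all ties
    return W
```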
We extend this definition intuitively to an acyclic sequence graph G = (V,E). Define |v| to be the length of the sequence associated with node v ∈ V and let i = (v, a, b), 0 ≤ a ≤ b ≤ |v|, represent the sequence interval [a, b) on v. We define a path in G by
$$ \overline{p}=\left\{\left({i}_1,\dots, {i}_m\right):\left({v}_j,{v}_{j+1}\right)\in E\ and\ {b}_j=\left|{v}_j\right|\ for\ 1\le j<m\right\}. $$
This matches the intuitive definition for a path in a sequence graph except that we allow the path to overlap only part of the sequence associated with the first and last nodes. We will use \( {s}_{\overline{p}} \) to refer to the sequence along the path \( \overline{p} \) in the graph.
Let \( \overline{q} \) be a path of length w + k − 1 in G. The string \( {s}_{\overline{q}} \) contains w consecutive k-mers for which we can find the (w,k)-minimizer(s) as before. We therefore define the (w,k)-minimizer(s) of the graph G to be the union of minimizers over all paths of length w + k − 1 in G:
$$ W(G)=\bigcup \limits_{\overline{q}\in G:\mid \overline{q}\mid =w+k-1}\left\{\left(h,\overline{p},r\right):h=\min \left\{\upvarphi (t):t\in {T}_{\overline{q}}\right\}\right\}. $$
Local graph indexing with (w,k)-minimizers
To find minimizers for a graph, we use a streaming algorithm as described in Additional file 1: Supplementary Algorithm 1. For each minimizer found, it simply finds the next minimizer(s) until the end of the graph has been reached.
Let walk(v, i, w, k) be a function which returns all vectors of w consecutive k-mers in G starting at position i on node v. Suppose we have a vector of k-mers x. Let shift(x) be the function which returns all possible vectors of k-mers which extend x by one k-mer. It does this by considering possible ways to walk one letter in G from the end of the final k-mer of x. For a vector of k-mers of length w, the function minimize(x) returns the minimizing k-mers of x.
We define K to be a k-mer graph with nodes corresponding to minimizers \( \left(h,\overline{p},r\right) \). We add edge (u,v) to K if there exists a path in G for which u and v are both minimizers and v is the first minimizer after u along the path. Let K ← add(s, t) denote the addition of nodes s and t to K and the directed edge (s,t). Let K ← add(s, T) denote the addition of node s and nodes t ∈ T to K, as well as directed edges (s,t) for t ∈ T, and define K ← add(S, t) similarly.
The resulting PanRG index stores a map from each minimizing k-mer hash value to the positions in all local graphs where that (w,k)-minimizer occurred. In addition, we store the induced k-mer graph for each local graph.
Quasi-mapping reads
We infer the presence of PanRG loci in reads by quasi-mapping. For each read, a sketch of (w,k)-minimizers is made, and these are queried in the index. For every (w,k)-minimizer shared between the read and a local graph in the PanRG index, we define a hit to be the coordinates of the minimizer in the read and local graph and whether it was found in the same or reverse orientation. Hits from the same read, local graph, and orientation are grouped into clusters if consecutive read coordinates are within a certain distance. If a cluster is of sufficient size, the locus is deemed to be present and we keep the hits for further analysis; otherwise, they are discarded as noise. The default for this “sufficient size” is at least 10 hits and at least 1/5th the length of the shortest path through the k-mer graph (Nanopore) or the number of k-mers in a read sketch (Illumina). Note that there is no requirement for all these hits to lie on a single path through the local graph. A further filtering step is therefore applied after the sequence at a locus is inferred, to remove false positive loci, as indicated by low mean or median coverage along the inferred sequence by comparison with the global average coverage. This quasi-mapping procedure is described in pseudocode in Additional file 1: Supplementary Algorithm 2.
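The hit-clustering step can be sketched as follows. This is an illustrative Python sketch, not pandora's internals: hits are reduced to (read position, graph position) pairs for one read/local-graph/orientation combination, and `max_gap` and `min_cluster_size` are simplified stand-ins for the thresholds described above.

```python
def cluster_hits(hits, max_gap, min_cluster_size):
    """Group hits by read coordinate: consecutive hits whose read positions
    are within max_gap go into one cluster; small clusters are discarded."""
    clusters, current = [], []
    for hit in sorted(hits):                       # sort by read position
        if current and hit[0] - current[-1][0] > max_gap:
            clusters.append(current)               # gap too large: new cluster
            current = []
        current.append(hit)
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_cluster_size]
```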
Initial sequence approximation as a mosaic of references
For each locus identified as present in the set of reads, quasi-mapping provides (filtered) coverage information for nodes of the directed acyclic k-mer graph. We use these to approximate the sequence as a mosaic of references as follows. We model k-mer coverage with a negative binomial distribution and use the simplifying assumption that k-mers are read independently. Let Θ be the set of possible paths through the k-mer graph, which could correspond to the true genomic sequence from which reads were generated. Let r + s be the number of times the underlying DNA was read by the machine, generating a k-mer coverage of s, and r instances where the k-mer was sequenced with errors. Let 1 − p be the probability that a given k-mer was sequenced correctly. For any path θ ∈ Θ, let {X_{1}, …, X_{M}} be independent and identically distributed random variables with probability distribution \( f\left({x}_i,r,p\right)=\frac{\Gamma \left(r+s\right)}{\Gamma (r)s!}{p}^r{\left(1-p\right)}^s \), representing the k-mer coverages along this path. Since the mean and variance are \( \frac{\left(1-p\right)r}{p} \) and \( \frac{\left(1-p\right)r}{p^2} \), we solve for r and p using the observed k-mer coverage mean and variance across all k-mers in all graphs for the sample. Let D be the k-mer coverage data seen in the read dataset. We maximize the score \( \hat{\theta}={\left\{\arg \max\ l\left(\theta |D\right)\right\}}_{\theta \in \Theta} \), where \( l\left(\theta |D\right)=\frac{1}{M}{\sum}_{i=1}^M\log f\left({s}_i,r,p\right) \) and s_{i} is the observed coverage of the ith k-mer in θ. This score is an approximation to a log likelihood, but averages over (up to) a fixed number of k-mers in order to retain sensitivity over longer paths in our C++ implementation. By construction, the k-mer graph is directed and acyclic, so this maximization problem can be solved with a dynamic programming algorithm (for pseudocode, see Additional file 1: Supplementary Algorithm 3).
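A minimal sketch of the dynamic program follows (Python rather than pandora's C++). It is illustrative only: it maximises the plain sum of per-k-mer log-likelihood terms, whereas pandora's score averages over up to a fixed number of k-mers, and the graph encoding (topologically ordered node list plus successor dict) is ours.

```python
import math

def log_f(s_count, r, p):
    """log of the negative binomial pmf used above:
    f(s; r, p) = Gamma(r + s) / (Gamma(r) s!) * p^r * (1 - p)^s."""
    return (math.lgamma(r + s_count) - math.lgamma(r)
            - math.lgamma(s_count + 1)
            + r * math.log(p) + s_count * math.log(1 - p))

def best_mosaic_path(order, edges, coverage, r, p, source, sink):
    """Maximum-scoring source-to-sink path in a DAG by dynamic programming.
    `order` is a topological ordering of nodes, `edges` maps node -> successors,
    `coverage` gives each k-mer node's observed count."""
    score = {n: float("-inf") for n in order}
    back = {}
    score[source] = log_f(coverage[source], r, p)
    for n in order:                         # relax edges in topological order
        if score[n] == float("-inf"):
            continue
        for m in edges.get(n, []):
            cand = score[n] + log_f(coverage[m], r, p)
            if cand > score[m]:
                score[m], back[m] = cand, n
    path, n = [sink], sink                  # backtrack the optimal path
    while n != source:
        n = back[n]
        path.append(n)
    return score[sink], path[::-1]
```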
For choices of w ≤ k there is a unique sequence along the discovered path through the k-mer graph (except in rare cases within the first or last w − 1 bases). We use this closest mosaic of reference sequences as an initial approximation of the sample sequence.
De novo variant discovery
The first step in our implementation of local de novo variant discovery in genome graphs is finding the candidate regions of the graph that show evidence of dissimilarity from the sample’s reads.
Finding candidate regions
The input required for finding candidate regions is a local graph, n, within the PanRG; the maximum likelihood path of both sequence and k-mers in n, lmp_{n} and kmp_{n}, respectively; and a padding size w for the number of positions surrounding the candidate region to retrieve.
We define a candidate region, r, as an interval within n where coverage on lmp_{n} is less than a given threshold, c, for more than l and less than m consecutive positions. m acts to restrict the size of variants we are able to detect; if set too large, the following steps become much slower due to the combinatorial expansion of possible paths. For a given read, s, that has a mapping to r, we define s_{r} to be the subsequence of s that maps to r, including an extra w positions on either side of the mapping. We define the pileup P_{r} as the set of all such s_{r}.
Enumerating paths through candidate regions
For r ∈ R, where R is the set of all candidate regions, we construct a de Bruijn graph G_{r} from the pileup P_{r} using the GATB library [64]. A_{L} and A_{R} are defined as sets of k-mers to the left and right of r in the local graph. They are anchors to allow reinsertion of new sequences found by de novo discovery into the local graph. If we cannot find an anchor on both sides, then we abandon de novo discovery for r. We use sets of k-mers for A_{L} and A_{R}, rather than a single anchor k-mer, to provide redundancy in the case where sequencing errors cause the absence of some k-mers in G_{r}. Once G_{r} is built, we define the start anchor k-mer, a_{L}, as the first k-mer in A_{L} that is also in G_{r}. Likewise, we define the end anchor k-mer, a_{R}, as the first k-mer in A_{R} that is also in G_{r}.
T_{r} is the spanning tree obtained by performing depth-first search (DFS) on G_{r}, beginning from node a_{L}. We define p_{r} as a path from the root node a_{L} of T_{r} to node a_{R} which fulfils two conditions: (1) p_{r} is shorter than the maximum allowed path length; (2) no more than k nodes along p_{r} have coverage < f ∗ e_{r}, where e_{r} is the expected k-mer coverage for r and f = n_{r} ∗ s, where n_{r} is the number of iterations of path enumeration for r and s is a step size (0.1 by default).
V_{r} is the set of all p_{r}. If |V_{r}| is greater than a predefined threshold, then we have too many candidate paths, and we decide to filter more aggressively: f is incremented by s (effectively requiring more coverage for each p_{r}) and V_{r} is repopulated. If f > 1.0, then de novo discovery is abandoned for r.
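This threshold-raising loop can be sketched as follows. Illustrative Python only: for brevity it requires every node on a path to clear f ∗ e_r (whereas the text allows up to k low-coverage nodes per path), and each path is reduced to its minimum node coverage.

```python
def filter_candidate_paths(paths, min_coverages, e_r, max_paths, step=0.1):
    """Raise the coverage-fraction threshold f in increments of the step size
    until at most max_paths candidate paths survive; None means de novo
    discovery is abandoned for this region (f exceeded 1.0)."""
    f = step
    while f <= 1.0:
        kept = [p for p in paths if min_coverages[p] >= f * e_r]
        if len(kept) <= max_paths:
            return kept
        f += step                     # filter more aggressively
    return None
```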
Pruning the path-space in a candidate region
As we operate on both accurate and error-prone sequencing reads, the number of valid paths in G_{r} can be very large. Primarily, this is due to cycles that can occur in G_{r} and to the exploration of paths that will never reach our required end anchor a_{R}. In order to reduce the path-space within G_{r}, we prune paths based on multiple criteria. Critically, this pruning happens at each step of the graph walk (path-building).
We used a distance-based optimisation based on Rizzi et al. [65]. In addition to T_{r}, obtained by performing DFS on G_{r}, we produce a distance map D_{r} that results from running reversed breadth-first search (BFS) on G_{r}, beginning from node a_{R}. We say reversed BFS as we explore the predecessors of each node, rather than the successors. D_{r} is implemented as a binary search tree where each node in the tree represents a k-mer in G_{r} that is reachable from a_{R} via reversed BFS. Each node additionally has an integer attached to it that describes the distance from that node to a_{R}.
We can use D_{r} to prune the path-space by (1) requiring n ∈ D_{r} for each node n ∈ p_{r} and (2) requiring that a_{R} be reachable from n in, at most, i nodes, where i is defined as the maximum allowed path length minus the number of nodes walked to reach n.
If one of these conditions is not met, we abandon p_{r}. The advantage of this pruning process is that we never explore paths that will not reach our required endpoint within the maximum allowed path length, and, when caught in a cycle, we abandon the path once we have made too many iterations around the cycle.
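The distance map and pruning test can be sketched as follows (illustrative Python; D_r is a plain dict here rather than a binary search tree, and the function names are ours).

```python
from collections import deque

def distance_map(predecessors, a_R):
    """D_r: reversed BFS from the end anchor a_R, exploring predecessors,
    recording each reachable node's distance to a_R."""
    dist = {a_R: 0}
    queue = deque([a_R])
    while queue:
        n = queue.popleft()
        for pred in predecessors.get(n, []):
            if pred not in dist:
                dist[pred] = dist[n] + 1
                queue.append(pred)
    return dist

def keep_extending(node, nodes_walked, dist, max_path_len):
    """Prune a partial path unless `node` can still reach a_R within the
    remaining length budget (max path length minus nodes already walked)."""
    return node in dist and dist[node] <= max_path_len - nodes_walked
```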
Graph-based genotyping and optimal reference construction for multi-genome comparison
We use graph-based genotyping to output a comparison of samples in a VCF. A path through the graph is selected to be the reference sequence, and graph variation is described with respect to this reference. The chromosome field then details the local graph, and the position field gives the position within the chosen reference sequence for possible variant alleles. The reference path for each local graph is chosen to be maximally close to the set of sample mosaic paths. This is achieved by reusing the mosaic path-finding algorithm detailed in Additional file 1: Supplementary Algorithm 3 on a copy of the k-mer graph with coverages incremented along each sample mosaic path, and a modified probability function defined such that the probability of a node is proportional to the number of samples covering it. This results in an optimal path, which is used as the VCF reference for the multi-sample VCF file.
For each sample and site in the VCF file, the mean forward and reverse coverage on k-mers tiling alleles is calculated. A likelihood is then independently calculated for each allele based on a Poisson model. An allele A in a site is called if: (1) A is on the sample mosaic path (i.e., it is on the maximum likelihood path for that sample); and (2) A is the most likely allele to be called based on the previous Poisson model. Every allele not in the sample mosaic path will not satisfy (1) and will thus not be called. In the uncommon event where an allele satisfies (1) but not (2), we have an incompatibility between the global and the local choices, and the site is genotyped as null.
Comparison of variant callers on a diverse set of E. coli
Sample selection
We used a set of 20 diverse E. coli samples for which matched Nanopore and Illumina data and a high-quality assembly were available. These are distributed across 4 major phylogroups of E. coli as shown in Fig. 4. Of these, 16 were isolated from clinical infections and rectal screening swabs in ICU patients in an Australian hospital [66]. One is the reference strain CFT073 that was resequenced and assembled by the REHAB consortium [67]. One is from an ST216 cardiac ward outbreak (identifier: H131800734); the Illumina data was previously obtained [68], and we did the Nanopore sequencing (see below). The two final samples were obtained from Public Health England: one is a Shiga-toxin-encoding E. coli (we used the identifier O63) [69], and the other an enteroaggregative E. coli (we used the identifier ST38) [70]. Coverage data for these samples can be found in Additional file 1: Supplementary Table 2.
PanRG construction
MSAs for gene clusters curated with panX [33] from 350 RefSeq assemblies were downloaded from http://pangenome.de on 3 May 2018. MSAs for intergenic region clusters based on 228 E. coli ST131 genome sequences were previously generated with Piggy [34] for their publication. The PanRG was built using make_prg. Three loci (GC00000027_2, GC00004221 and GC00000895_r1_r1_1) out of 37,428 were excluded because pandora did not complete in reasonable time (~ 24 h) once de novo variants were added.
Nanopore sequencing of sample H131800734
DNA was extracted using a Blood & Cell Culture DNA Midi Kit (Qiagen, Germany) and prepared for Nanopore sequencing using kits EXP-NBD103 and SQK-LSK108. Sequencing was performed on a MinION Mk1 Shield device using a FLO-MIN106 R9.4 SpotON flowcell and MinKNOW version 1.7.3, for 48 h.
Nanopore basecalling
Recent improvements to the accuracy of Nanopore reads have been largely driven by improvements in basecalling algorithms [71]. For comparison, 4 samples were basecalled with the default (methylation-unaware) model and with the methylation-aware, high-accuracy model provided with the proprietary guppy basecaller (version 3.4.5). Additional file 1: Supplementary Figure 5 shows the effect of methylation-aware Nanopore basecalling on the AvgAR/error rate curve for pandora with/without novel variant discovery via local assembly for those 4 samples. With normal basecalling, local de novo assembly increases the error rate substantially, from 0.22 to 0.60%, with a negligible increase in recall, from 89.1 to 90.1%, whereas with methylation-aware basecalling it increases the recall from 89.5 to 90.6% and only slightly increases the error rate, from 0.18 to 0.22%. On this basis, demultiplexing of the subsequent basecalled data was performed using the same methylation-aware version of the guppy software suite with barcode kits EXP-NBD104 and EXP-NBD114 and an option to trim the barcodes from the output.
Phylogenetic tree construction
Chromosomes were aligned using MAFFT [63] v7.467 as implemented in Parsnp [72] v1.5.3. Gubbins v2.4.1 was used to filter for recombination (default settings), and phylogenetic construction was carried out using RAxML [73] v8.2.12 (GTR + GAMMA substitution model, as implemented in Gubbins [74]).
Reference selection for mappingbased callers
A set of references was chosen for testing single-reference variant callers using two standard approaches, as follows. First, a phylogeny was built containing our 20 samples and 243 reference genomes from RefSeq; then, for each of our 20 samples, the nearest RefSeq E. coli reference was found using Mash [26]. Second, for each of the 20 samples, the nearest RefSeq reference in the phylogeny was manually selected; sometimes one RefSeq assembly was the closest to more than one of the 20 samples. At an earlier stage of the project, there had been another sample (making a total of 21) in phylogroup B1; this was discarded when it failed quality filters (data not shown). Despite this, the Mash/manually selected reference genomes were left in the set of mapping references, to evaluate the impact of mapping to a reference in a different phylogroup from all 20 of our samples.
Construction of truth assemblies
16/20 samples were obtained with matched Illumina and Nanopore data and a hybrid assembly. Sample H131800734 was assembled using the hybrid assembler Unicycler [75] with PacBio and Illumina reads followed by polishing with the PacBio reads using Racon [76], and finally with Illumina reads using Pilon [77]. A small 1 kb artefactual contig was removed from the H131800734 assembly due to low quality and coverage.
In all cases, we mapped the Illumina data to the assembly and masked all positions where the pileup of Illumina reads did not support the assembly.
Construction of a comprehensive and filtered truth set of pairwise SNPs
All pairwise comparisons of the 20 truth assemblies were performed with varifier (https://github.com/iqbal-lab-org/varifier), using subcommand make_truth_vcf. In summary, varifier compares two given genomes (referred to as G1 and G2) twice: first using dnadiff [78] and then using minimap2/paftools [28]. The two output sets of pairwise SNPs are then joined and filtered. We create one sequence probe for each allele (a sequence composed of the allele and 50 bases of flank on either side, taken from G1) and then map both to G2 using minimap2. We then evaluate these mappings to verify whether the variant found is indeed correct (TP) or not (FP) as follows. If the mapping quality is zero, the variant is discarded to avoid paralogs/duplicates/repeats that are inherently hard to assess. We then check for mismatches in the allele after mapping and confirm that the called allele is the better match.
Constructing a set of ground truth pan-genome variants
When seeking to construct a truth set of all variants within a set of bacterial genomes, there is no universal coordinate system. We start by taking all pairs of genomes and finding the variants between them, and then need to deduplicate them; e.g., when a variant between genomes 1 and 2 is the same as a variant between genomes 3 and 4, they should be identified as one. We define “the same” in terms of genome, coordinate, and allele. An allele A in a position P_{A} of a chromosome C_{A} in a genome G_{A} is defined as a triple A = (G_{A}, C_{A}, P_{A}). A pairwise variant PwV = {A_{1}, A_{2}} is defined as a pair of alleles that describes a variant between two genomes, and a pan-genome variant PgV = {A_{1}, A_{2}, …, A_{n}} is defined as a set of two or more alleles that describes the same variant between two or more genomes. A pan-genome variant PgV can also be defined as a set of pairwise variants PgV = {PwV_{1}, PwV_{2}, …, PwV_{n}}, as we can infer the set of alleles of PgV from the pairs of alleles in all these pairwise variants. Note that pan-genome variants are thus able to represent rare and core variants. Given a set of pairwise variants, we seek a set of pan-genome variants satisfying the following properties:

1. [Surjection]:
a. Each pairwise variant is in exactly one pan-genome variant;
b. A pan-genome variant contains at least one pairwise variant;
2. [Transitivity]: if two pairwise variants PwV_{1} and PwV_{2} share an allele, then PwV_{1} and PwV_{2} are in the same pan-genome variant PgV.
We model the above problem as a graph problem. We represent each pairwise variant as a node in an undirected graph G. There is an edge between two nodes n_{1} and n_{2} if n_{1} and n_{2} share an allele. Each component (maximal connected subgraph) of G then defines a pan-genome variant, built from the set of pairwise variants in the component, satisfying all the properties previously described. Therefore, the set of components of G defines the set of pan-genome variants P. However, a pan-genome variant in P could (i) have more than one allele stemming from a single genome, due to a duplication/repeat, or (ii) represent a bi-allelic, tri-allelic, or tetra-allelic SNP/indel. For this evaluation, we chose to have a smaller but more reliable set of pan-genome variants, and thus we filtered P by restricting it to the set of pan-genome variants P′ defined by the variants PgV ∈ P such that (i) PgV has at most one allele stemming from each genome and (ii) PgV is a bi-allelic SNP. P′ is the set of 618,305 ground truth filtered pan-genome variants that we extracted by comparing and deduplicating the pairwise variants present in our 20 samples, and that we use to evaluate the recall of all the tools in this paper. Additional file 1: Supplementary Figure 13 shows an example summarizing the described process of building pan-genome variants from a set of pairwise variants.
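The component construction can be sketched with a union-find structure in place of an explicit graph traversal. Illustrative Python only: alleles are any hashable identifiers standing in for the (G, C, P) triples, and pairwise variants are frozensets of two alleles.

```python
from collections import defaultdict

def pangenome_variants(pairwise_variants):
    """Group pairwise variants into pan-genome variants: connected components
    of the graph whose nodes are pairwise variants, with an edge whenever two
    variants share an allele (the transitivity property)."""
    parent = {pv: pv for pv in pairwise_variants}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    by_allele = defaultdict(list)
    for pv in pairwise_variants:
        for allele in pv:
            by_allele[allele].append(pv)
    for sharing in by_allele.values():         # union variants sharing an allele
        for pv in sharing[1:]:
            parent[find(pv)] = find(sharing[0])
    components = defaultdict(set)
    for pv in pairwise_variants:
        components[find(pv)].add(pv)
    return list(components.values())
```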
Subsampling read data and running all tools
All read data was randomly subsampled to 100× coverage using rasusa; the pipeline is available at https://github.com/iqbal-lab-org/subsampler. A snakemake [79] pipeline to run the pandora workflow with and without de novo discovery (see Fig. 2D) is available at https://github.com/iqbal-lab-org/pandora_workflow. A snakemake pipeline to run snippy, SAMtools, nanopolish, and medaka on all pairwise combinations of 20 samples and 24 references is available at https://github.com/iqbal-lab-org/variant_callers_pipeline.
Evaluating VCF files
Calculating precision
Given a variant/VCF call made by any of the evaluated tools, where the input was a set of reads from a sample (or several samples, in the case of pandora) and a reference sequence (or a PanRG, in the case of pandora), we perform the following steps to assess how correct the call is:

1. Construct a probe for the called allele, consisting of the sequence of the allele flanked by 150 bp on both sides, taken from the reference sequence. This reference sequence is one of the 24 chosen references for snippy, SAMtools, nanopolish, and medaka, or the multi-sample inferred VCF reference for pandora;
2. Map the probe to the sample sequence using BWA-MEM [80];
3. Remove multi-mappings by looking at the Mapping Quality (MAPQ) measure [36] of the SAM records. If the probe is mapped uniquely, then its mapping passes the filter. If there are multiple mappings for the probe, we select the mapping m_{1} with the highest MAPQ if the difference between its MAPQ and the second highest MAPQ exceeds 10. If m_{1} does not exist, then there are at least two good enough mappings, and it is ambiguous to choose which one to evaluate. In this case, we prefer to be conservative and filter this call (and all its related mappings) out of the evaluation;
4. We further remove calls mapping to masked regions of the sample sequence, in order not to evaluate calls lying in potentially misassembled regions;
5. Now we evaluate the mapping, giving the call a continuous precision score between 0 and 1. We look only at the alignment of the called allele (i.e., we ignore the alignment of the flanking sequences) and give a score as follows: number of matches / alignment length.
Finally, we compute the precision for the tool by summing the score of all evaluated calls and dividing by the number of evaluated calls. Note that here we evaluate all types of variants, including SNPs and indels.
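The scoring rule above can be sketched as follows (illustrative Python; inputs are the two rows of the called allele's portion of the alignment, with '-' for gaps, and the function names are ours).

```python
def call_score(aligned_allele, aligned_sample):
    """Continuous precision score for one call: matched bases divided by the
    alignment length of the called allele (flank alignment is ignored)."""
    matches = sum(a == b and a != "-"
                  for a, b in zip(aligned_allele, aligned_sample))
    return matches / len(aligned_allele)

def tool_precision(call_scores):
    """Precision of a tool: sum of per-call scores over the number of calls."""
    return sum(call_scores) / len(call_scores)
```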
Calculating recall
We perform the following steps to calculate the recall of a tool:

1. Apply the VCF calls to the associated reference using the VCF consensus builder (https://github.com/leoisl/vcf_consensus_builder), creating a mutated reference with the variants identified by the tool;
2. Build probes for each allele of each pan-genome variant previously computed (see section “Constructing a set of ground truth pangenome variants”);
3. Map all pan-genome variants’ probes to the mutated reference using BWA-MEM;
4. Evaluate each probe mapping, which is classified as a TP only if all bases of the allele were correctly mapped to the mutated reference. In the uncommon case where a probe multi-maps, it is enough for one of the mappings to be classified as TP;
5. Finally, as we now know for each pan-genome variant which of its alleles were found, we calculate both the pan-variant recall and the average allelic recall as per section “Pandora detects rare variation inaccessible to single-reference methods.”
Filters
Given a VCF file with likelihoods for each genotype, the genotype confidence is defined as the log likelihood of the maximum likelihood genotype minus the log likelihood of the next best genotype. Thus a confidence of zero means the two most likely alleles are equally likely, and highquality calls have higher confidences. In the recall/error rate plots of Fig. 6A,B, each point corresponds to the error rate and recall computed as previously described, on a genotype confidence (gtconf) filtered VCF file with a specific threshold for minimum confidence.
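The confidence computation and filter can be sketched as follows (illustrative Python; the call representation is ours).

```python
def genotype_confidence(genotype_log_likelihoods):
    """Genotype confidence: log likelihood of the most likely genotype minus
    that of the next best. Zero means the top two are equally likely."""
    best, second = sorted(genotype_log_likelihoods, reverse=True)[:2]
    return best - second

def gt_conf_filter(calls, threshold):
    """Keep only calls whose genotype confidence reaches the minimum
    threshold; `calls` is a list of (call, log-likelihoods) pairs."""
    return [call for call, lls in calls
            if genotype_confidence(lls) >= threshold]
```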
We also show the same plot with further filters applied in Additional file 1: Supplementary Figure 3. The filters were as follows. For Illumina data: for pandora, a minimum coverage filter of 5×, a strand bias filter of 0.05 (minimum 5% of reads on each strand), and a gaps filter of 0.8 were applied. The gaps filter means at least 20% of the minimizer k-mers on the called allele must have coverage above 10% of the expected depth. As snippy has its own internal filtering, no filters were applied. For SAMtools, a minimum coverage filter of 5× was used. For Nanopore data: for pandora, a minimum coverage filter of 10×, a strand bias filter of 0.05, and a gaps filter of 0.6 were used. For nanopolish, we applied a coverage filter of 10×. We were unable to apply a minimum coverage filter for medaka due to a software bug that prevents annotating the VCF file with coverage information.
Locus presence and distance evaluation
For all loci detected as present in at least one sample by pandora, we mapped the multi-sample inferred reference to all 20 sample assemblies and 24 reference sequences to identify their true locations. To be confident of these locations, we employed a strict mapping using bowtie2 [81], requiring end-to-end alignments. From the mapping of all loci to all samples, we computed a truth locus presence-absence matrix and compared it with pandora’s locus presence-absence matrix, classifying each pandora locus call as true/false positive/negative. Additional file 1: Supplementary Figure 6 shows these classifications split by locus length. Having the location of all loci in all the 20 sample assemblies and the 24 references, we then computed the edit distance between them.