Local graph construction
We construct each local graph in the PanRG from an MSA using an iterative partitioning process. The resulting sequence graph contains nested bubbles representing alternative alleles.
Let A be an MSA of length n. For each row of the MSA a = {a_{0}, …, a_{n−1}} ∈ A let a_{i, j} = {a_{i}, …, a_{j−1}} be the subsequence of a in interval [i, j). Let s(a) be the DNA sequence obtained by removing all non-AGCT symbols. We can partition alignment A either vertically by partitioning the interval [0, n) or horizontally by partitioning the set of rows of A. In both cases, the partition induces a number of sub-alignments.
For vertical partitions, we define slice_{A}(i, j) = {a_{i,j} : a ∈ A}. We say that interval [i, j) is a match interval if j − i ≥ m, where m = 7 is the default minimum match length, and there is a single non-trivial sequence in the slice, i.e.,
$$ \left|\left\{s(a):a\in {slice}_A\left(i,j\right)\ \mathrm{and}\ s(a)\ne ""\right\}\right| =1. $$
Otherwise, we call it a non-match interval.
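This match-interval test can be sketched in a few lines. The following is an illustrative Python sketch, not the make_prg implementation; the helper names `s` and `is_match_interval` are ours.

```python
# Illustrative sketch of match-interval detection on a vertical MSA slice.
MIN_MATCH_LEN = 7  # default minimum match length m

def s(row):
    """s(a): the DNA sequence of an MSA row with all non-AGCT symbols removed."""
    return "".join(c for c in row if c in "ACGT")

def is_match_interval(msa, i, j):
    """True if the vertical slice [i, j) is a match interval: long enough,
    and containing a single non-trivial sequence across all rows."""
    if j - i < MIN_MATCH_LEN:
        return False
    sequences = {s(row[i:j]) for row in msa} - {""}  # drop the trivial ""
    return len(sequences) == 1
```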
For horizontal partitions, we use a reference-based approach combined with K-means clustering [62] to divide sequences into increasing numbers of clusters K = 2, 3, … until each cluster meets a “one-reference-like” criterion or until K = 10. More formally, let U be the set of all m-mers (substrings of length m, the minimum match length) in {s(a) : a ∈ A}. For a ∈ A, we transform sequence s(a) into a count vector \( \overline{x_a}=\left\{{x_a}^1,\dots, {x_a}^{\left|U\right|}\right\} \), where x_{a}^{i} is the count of unique m-mer i ∈ U. The K-means algorithm partitions {s(a) : a ∈ A} into K clusters \( \overline{C}=\left\{{C}_1,\dots, {C}_K\right\} \) by minimizing the inertia, defined as
$$ \arg\ {\min}_{\overline{C}}{\sum}_{j=1}^K\sum \limits_{\overline{x_a}\in {C}_j}{\left\Vert \overline{x_a}-{\mu}_j\right\Vert}^2 $$
where \( {\mu}_j=\frac{1}{\left|{C}_j\right|}{\sum}_{\overline{x_a}\in {C}_j}\overline{x_a} \) is the mean of cluster j.
Given a (sub)alignment A, we define the reference of A, ref(A), to be the concatenation of the most frequent nucleotide at each position of A. We say that a K-partition is one-reference-like if, for the corresponding sub-alignments A_{1}, …, A_{K}, the Hamming distance between each sequence and its sub-alignment reference satisfies
$$ \left|s(a)-ref\left({A}_i\right)\right| <d\ast len(A)\kern2em \forall a\in {A}_i $$
where \( \left|s(a)-ref\left({A}_i\right)\right| \) denotes the Hamming distance between s(a) and ref(A_{i}), and d denotes a maximum Hamming distance threshold, set at 0.2 by default. In this case, we accept the partition; otherwise, we look for a (K + 1)-partition.
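To make the acceptance test concrete, here is a small Python sketch of the one-reference-like criterion. It is illustrative only: for simplicity it computes Hamming distances on the aligned rows rather than on the gap-stripped sequences s(a), and the helper names are ours.

```python
from collections import Counter

def ref(msa):
    """ref(A): concatenation of the most frequent symbol in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*msa))

def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def is_one_reference_like(subalignments, d=0.2):
    """True if every row of every sub-alignment is within d * length of its
    sub-alignment's reference (distances taken on the aligned rows)."""
    return all(
        hamming(row, ref(sub)) < d * len(row)
        for sub in subalignments
        for row in sub
    )
```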
The recursive algorithm first partitions an MSA vertically into match and non-match intervals. Match intervals are collapsed down to the single sequence they represent. Independently for each non-match interval, the alignment slice is partitioned horizontally into clusters. The same process is then applied to each induced sub-alignment until a maximum number of recursion levels, r = 5, has been reached. For any remaining alignments, a node is added to the local graph for each unique sequence. See Additional file 1: Supplementary Animation 1 for an example of this algorithm. We name this algorithm Recursive Cluster and Collapse (RCC). When building the local graph from an MSA, we record and serialize the recursion tree of the RCC algorithm, as well as memoizing all the data in each recursion node. We call this algorithm memoized RCC (MRCC). Once the MRCC recursion tree is generated, we can obtain the local graph through a pre-order traversal of the tree (which is equivalent to the call order of the recursive functions in an execution of the RCC algorithm). To update the local graph with new alleles using the MRCC algorithm, we can deserialize the recursion tree from disk, infer in which leaves of the recursion tree the new alleles should be added, add the new alleles in bulk, and then update each modified leaf. This leaf update operation consists of updating just the sub-alignment of the leaf (which is generally a small fraction of the whole MSA) with the new alleles using MAFFT [63] and recomputing the recursion at the leaf node. All of this is implemented in the make_prg repository (see “Code availability”).
(w,k)-minimizers of graphs
We define (w,k)-minimizers of strings as in Li [27]. Let \( \upvarphi :{\Sigma}^k\to \mathfrak{R} \) be a k-mer hash function and let π : Σ^{∗} × {0, 1} → Σ^{∗} be defined such that π(s, 0) = s and \( \pi \left(s,1\right)=\overline{s} \), where \( \overline{s} \) is the reverse complement of s. Consider any integers k ≥ w > 0. For window start position 0 ≤ j ≤ |s| − w − k + 1, let
$$ {T}_j=\left\{\pi \left({s}_{p,p+k},r\right):j\le p<j+w,r\in \left\{0,1\right\}\right\} $$
be the set of forward and reverse-complement k-mers of s in this window. We define a (w,k)-minimizer to be any triple (h, p, r) such that
$$ h=\varphi \left(\pi \left({s}_{p,p+k},r\right)\right)=\min \left\{\varphi (t):t\in {T}_j\right\}. $$
The set W(s) of (w,k)-minimizers for s is the union of minimizers over such windows:
$$ W(s)=\bigcup \limits_{0\le j\le \mid s\mid -w-k+1}\left\{\left(h,p,r\right):h=\min \left\{\upvarphi (t):t\in {T}_j\right\}\right\}. $$
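The string case can be sketched as follows. This is illustrative Python only: the hash φ here is a stand-in 2-bit encoding (pandora's actual hash function differs), and ties within a window are kept, so one window can contribute more than one minimizer.

```python
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMP)[::-1]

def phi(kmer):
    """Stand-in hash: 2-bit encoding with A=0, C=1, G=2, T=3."""
    h = 0
    for c in kmer:
        h = h * 4 + "ACGT".index(c)
    return h

def minimizers(s, w, k):
    """W(s): all (hash, position, strand) triples that minimise some window
    of w consecutive k-mers, considering both strands."""
    W = set()
    for j in range(len(s) - w - k + 2):         # window start positions
        T_j = []
        for p in range(j, j + w):
            kmer = s[p:p + k]
            T_j.append((phi(kmer), p, 0))           # forward strand
            T_j.append((phi(revcomp(kmer)), p, 1))  # reverse complement
        h_min = min(h for h, _, _ in T_j)
        W |= {t for t in T_j if t[0] == h_min}  # keep all ties
    return W
```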
We extend this definition intuitively to an acyclic sequence graph G = (V,E). Define |v| to be the length of the sequence associated with node v ∈ V and let i = (v, a, b), 0 ≤ a ≤ b ≤ |v|, represent the sequence interval [a, b) on v. We define a path in G by
$$ \overline{p}=\left\{\left({i}_1,\dots, {i}_m\right):\left({v}_j,{v}_{j+1}\right)\in E\ and\ {b}_j=\left|{v}_j\right|\ for\ 1\le j<m\right\}. $$
This matches the intuitive definition for a path in a sequence graph except that we allow the path to overlap only part of the sequence associated with the first and last nodes. We will use \( {s}_{\overline{p}} \) to refer to the sequence along the path \( \overline{p} \) in the graph.
Let \( \overline{q} \) be a path of length w + k − 1 in G. The string \( {s}_{\overline{q}} \) contains w consecutive k-mers for which we can find the (w,k)-minimizer(s) as before. We therefore define the (w,k)-minimizer(s) of the graph G to be the union of minimizers over all paths of length w + k − 1 in G:
$$ W(G)=\bigcup \limits_{\overline{q}\in G:\mid \overline{q}\mid =w+k-1}\left\{\left(h,\overline{p},r\right):h=\min \left\{\upvarphi (t):t\in {T}_{\overline{q}}\right\}\right\}. $$
Local graph indexing with (w,k)-minimizers
To find minimizers for a graph, we use a streaming algorithm as described in Additional file 1: Supplementary Algorithm 1. For each minimizer found, it simply finds the next minimizer(s) until the end of the graph has been reached.
Let walk(v, i, w, k) be a function which returns all vectors of w consecutive k-mers in G starting at position i on node v. Suppose we have a vector of k-mers x. Let shift(x) be the function which returns all possible vectors of k-mers which extend x by one k-mer. It does this by considering possible ways to walk one letter in G from the end of the final k-mer of x. For a vector of k-mers of length w, the function minimize(x) returns the minimizing k-mers of x.
We define K to be a k-mer graph with nodes corresponding to minimizers \( \left(h,\overline{p},r\right) \). We add edge (u,v) to K if there exists a path in G for which u and v are both minimizers and v is the first minimizer after u along the path. Let K ← add(s, t) denote the addition of nodes s and t to K and the directed edge (s,t). Let K ← add(s, T) denote the addition of node s and nodes t ∈ T to K, as well as directed edges (s,t) for t ∈ T, and define K ← add(S, t) similarly.
The resulting PanRG index stores a map from each minimizing k-mer hash value to the positions in all local graphs where that (w,k)-minimizer occurred. In addition, we store the induced k-mer graph for each local graph.
Quasi-mapping reads
We infer the presence of PanRG loci in reads by quasi-mapping. For each read, a sketch of (w,k)-minimizers is made, and these are queried in the index. For every (w,k)-minimizer shared between the read and a local graph in the PanRG index, we define a hit to be the coordinates of the minimizer in the read and local graph and whether it was found in the same or reverse orientation. Hits from the same read, local graph, and orientation are grouped into clusters if consecutive read coordinates are within a certain distance. If a cluster is of sufficient size, the locus is deemed to be present and we keep the hits for further analysis; otherwise, they are discarded as noise. The default for this “sufficient size” is at least 10 hits and at least 1/5th the length of the shortest path through the k-mer graph (Nanopore) or the number of k-mers in a read sketch (Illumina). Note that there is no requirement for all these hits to lie on a single path through the local graph. A further filtering step is therefore applied after the sequence at a locus is inferred, to remove false positive loci, as indicated by low mean or median coverage along the inferred sequence by comparison with the global average coverage. This quasi-mapping procedure is described in pseudocode in Additional file 1: Supplementary Algorithm 2.
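The hit-clustering step can be sketched as follows. This is an illustrative Python sketch, not pandora's internals: hits are reduced to (read position, graph position) pairs for one read/local-graph/orientation combination, and `max_gap` and `min_cluster_size` are simplified stand-ins for the thresholds described above.

```python
def cluster_hits(hits, max_gap, min_cluster_size):
    """Group hits by read coordinate: consecutive hits whose read positions
    are within max_gap go into one cluster; small clusters are discarded."""
    clusters, current = [], []
    for hit in sorted(hits):                       # sort by read position
        if current and hit[0] - current[-1][0] > max_gap:
            clusters.append(current)               # gap too large: new cluster
            current = []
        current.append(hit)
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_cluster_size]
```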
Initial sequence approximation as a mosaic of references
For each locus identified as present in the set of reads, quasi-mapping provides (filtered) coverage information for nodes of the directed acyclic k-mer graph. We use these to approximate the sequence as a mosaic of references as follows. We model k-mer coverage with a negative binomial distribution and use the simplifying assumption that k-mers are read independently. Let Θ be the set of possible paths through the k-mer graph, which could correspond to the true genomic sequence from which reads were generated. Let r + s be the number of times the underlying DNA was read by the machine, generating a k-mer coverage of s, and r instances where the k-mer was sequenced with errors. Let 1 − p be the probability that a given k-mer was sequenced correctly. For any path θ ∈ Θ, let {X_{1}, …, X_{M}} be independent and identically distributed random variables with probability distribution \( f\left({x}_i,r,p\right)=\frac{\Gamma \left(r+s\right)}{\Gamma (r)s!}{p}^r{\left(1-p\right)}^s \), representing the k-mer coverages along this path. Since the mean and variance are \( \frac{\left(1-p\right)r}{p} \) and \( \frac{\left(1-p\right)r}{p^2} \), we solve for r and p using the observed k-mer coverage mean and variance across all k-mers in all graphs for the sample. Let D be the k-mer coverage data seen in the read dataset. We maximize the score \( \hat{\theta}={\left\{\arg \max\ l\left(\theta |D\right)\right\}}_{\theta \in \Theta} \), where \( l\left(\theta |D\right)=\frac{1}{M}{\sum}_{i=1}^M\log f\left({s}_i,r,p\right) \) and s_{i} is the observed coverage of the ith k-mer in θ. This score is an approximation to a log likelihood, but averages over (up to) a fixed number of k-mers in order to retain sensitivity over longer paths in our C++ implementation. By construction, the k-mer graph is directed and acyclic, so this maximization problem can be solved with a dynamic programming algorithm (for pseudocode, see Additional file 1: Supplementary Algorithm 3).
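A minimal sketch of the dynamic program follows (Python rather than pandora's C++). It is illustrative only: it maximises the plain sum of per-k-mer log-likelihood terms, whereas pandora's score averages over up to a fixed number of k-mers, and the graph encoding (topologically ordered node list plus successor dict) is ours.

```python
import math

def log_f(s_count, r, p):
    """log of the negative binomial pmf used above:
    f(s; r, p) = Gamma(r + s) / (Gamma(r) s!) * p^r * (1 - p)^s."""
    return (math.lgamma(r + s_count) - math.lgamma(r)
            - math.lgamma(s_count + 1)
            + r * math.log(p) + s_count * math.log(1 - p))

def best_mosaic_path(order, edges, coverage, r, p, source, sink):
    """Maximum-scoring source-to-sink path in a DAG by dynamic programming.
    `order` is a topological ordering of nodes, `edges` maps node -> successors,
    `coverage` gives each k-mer node's observed count."""
    score = {n: float("-inf") for n in order}
    back = {}
    score[source] = log_f(coverage[source], r, p)
    for n in order:                         # relax edges in topological order
        if score[n] == float("-inf"):
            continue
        for m in edges.get(n, []):
            cand = score[n] + log_f(coverage[m], r, p)
            if cand > score[m]:
                score[m], back[m] = cand, n
    path, n = [sink], sink                  # backtrack the optimal path
    while n != source:
        n = back[n]
        path.append(n)
    return score[sink], path[::-1]
```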
For choices of w ≤ k there is a unique sequence along the discovered path through the k-mer graph (except in rare cases within the first or last w − 1 bases). We use this closest mosaic of reference sequences as an initial approximation of the sample sequence.
De novo variant discovery
The first step in our implementation of local de novo variant discovery in genome graphs is finding the candidate regions of the graph that show evidence of dissimilarity from the sample’s reads.
Finding candidate regions
The input required for finding candidate regions is a local graph, n, within the PanRG; the maximum likelihood path of both sequence and k-mers in n, lmp_{n} and kmp_{n}, respectively; and a padding size w for the number of positions surrounding the candidate region to retrieve.
We define a candidate region, r, as an interval within n where coverage on lmp_{n} is less than a given threshold, c, for more than l and less than m consecutive positions. m acts to restrict the size of variants we are able to detect; if set too large, the following steps become much slower due to the combinatorial expansion of possible paths. For a given read, s, that has a mapping to r, we define s_{r} to be the subsequence of s that maps to r, including an extra w positions on either side of the mapping. We define the pileup P_{r} as the set of all such s_{r}.
Enumerating paths through candidate regions
For r ∈ R, where R is the set of all candidate regions, we construct a de Bruijn graph G_{r} from the pileup P_{r} using the GATB library [64]. A_{L} and A_{R} are defined as sets of k-mers to the left and right of r in the local graph. They are anchors to allow reinsertion of new sequences found by de novo discovery into the local graph. If we cannot find an anchor on both sides, then we abandon de novo discovery for r. We use sets of k-mers for A_{L} and A_{R}, rather than a single anchor k-mer, to provide redundancy in the case where sequencing errors cause the absence of some k-mers in G_{r}. Once G_{r} is built, we define the start anchor k-mer, a_{L}, as the first k-mer in A_{L} that is also in G_{r}. Likewise, we define the end anchor k-mer, a_{R}, as the first k-mer in A_{R} that is also in G_{r}.
T_{r} is the spanning tree obtained by performing depth-first search (DFS) on G_{r}, beginning from node a_{L}. We define p_{r} as a path from the root node a_{L} of T_{r} to node a_{R} which fulfils two conditions: (1) p_{r} is shorter than the maximum allowed path length; (2) no more than k nodes along p_{r} have coverage < f ∗ e_{r}, where e_{r} is the expected k-mer coverage for r and f = n_{r} ∗ s, where n_{r} is the number of iterations of path enumeration for r and s is a step size (0.1 by default).
V_{r} is the set of all p_{r}. If |V_{r}| is greater than a predefined threshold, then we have too many candidate paths, and we decide to filter more aggressively: f is incremented by s (effectively requiring more coverage for each p_{r}) and V_{r} is repopulated. If f > 1.0, then de novo discovery is abandoned for r.
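This threshold-raising loop can be sketched as follows. Illustrative Python only: for brevity it requires every node on a path to clear f ∗ e_r (whereas the text allows up to k low-coverage nodes per path), and each path is reduced to its minimum node coverage.

```python
def filter_candidate_paths(paths, min_coverages, e_r, max_paths, step=0.1):
    """Raise the coverage-fraction threshold f in increments of the step size
    until at most max_paths candidate paths survive; None means de novo
    discovery is abandoned for this region (f exceeded 1.0)."""
    f = step
    while f <= 1.0:
        kept = [p for p in paths if min_coverages[p] >= f * e_r]
        if len(kept) <= max_paths:
            return kept
        f += step                     # filter more aggressively
    return None
```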
Pruning the path-space in a candidate region
As we operate on both accurate and error-prone sequencing reads, the number of valid paths in G_{r} can be very large. Primarily, this is due to cycles that can occur in G_{r} and to the exploration of paths that will never reach our required end anchor a_{R}. In order to reduce the path-space within G_{r}, we prune paths based on multiple criteria. Critically, this pruning happens at each step of the graph walk (path-building).
We used a distance-based optimisation based on Rizzi et al. [65]. In addition to T_{r}, obtained by performing DFS on G_{r}, we produce a distance map D_{r} that results from running reversed breadth-first search (BFS) on G_{r}, beginning from node a_{R}. We say reversed BFS as we explore the predecessors of each node, rather than the successors. D_{r} is implemented as a binary search tree where each node in the tree represents a k-mer in G_{r} that is reachable from a_{R} via reversed BFS. Each node additionally has an integer attached to it that describes the distance from that node to a_{R}.
We can use D_{r} to prune the path-space by (1) requiring n ∈ D_{r} for each node n ∈ p_{r} and (2) requiring that a_{R} be reachable from n in, at most, i nodes, where i is defined as the maximum allowed path length minus the number of nodes walked to reach n.
If one of these conditions is not met, we abandon p_{r}. The advantage of this pruning process is that we never explore paths that will not reach our required endpoint within the maximum allowed path length, and, when caught in a cycle, we abandon the path once we have made too many iterations around the cycle.
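The distance map and pruning test can be sketched as follows (illustrative Python; D_r is a plain dict here rather than a binary search tree, and the function names are ours).

```python
from collections import deque

def distance_map(predecessors, a_R):
    """D_r: reversed BFS from the end anchor a_R, exploring predecessors,
    recording each reachable node's distance to a_R."""
    dist = {a_R: 0}
    queue = deque([a_R])
    while queue:
        n = queue.popleft()
        for pred in predecessors.get(n, []):
            if pred not in dist:
                dist[pred] = dist[n] + 1
                queue.append(pred)
    return dist

def keep_extending(node, nodes_walked, dist, max_path_len):
    """Prune a partial path unless `node` can still reach a_R within the
    remaining length budget (max path length minus nodes already walked)."""
    return node in dist and dist[node] <= max_path_len - nodes_walked
```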
Graph-based genotyping and optimal reference construction for multi-genome comparison
We use graph-based genotyping to output a comparison of samples in a VCF. A path through the graph is selected to be the reference sequence, and graph variation is described with respect to this reference. The chromosome field then details the local graph, and the position field gives the position within the chosen reference sequence for possible variant alleles. The reference path for each local graph is chosen to be maximally close to the set of sample mosaic paths. This is achieved by reusing the mosaic path-finding algorithm detailed in Additional file 1: Supplementary Algorithm 3 on a copy of the k-mer graph with coverages incremented along each sample mosaic path, and a modified probability function defined such that the probability of a node is proportional to the number of samples covering it. This results in an optimal path, which is used as the VCF reference for the multi-sample VCF file.
For each sample and site in the VCF file, the mean forward and reverse coverage on k-mers tiling alleles is calculated. A likelihood is then independently calculated for each allele based on a Poisson model. An allele A in a site is called if: (1) A is on the sample mosaic path (i.e., it is on the maximum likelihood path for that sample); and (2) A is the most likely allele to be called based on the previous Poisson model. Every allele not in the sample mosaic path will not satisfy (1) and will thus not be called. In the uncommon event where an allele satisfies (1) but not (2), we have an incompatibility between the global and the local choices, and the site is genotyped as null.
Comparison of variant callers on a diverse set of E. coli
Sample selection
We used a set of 20 diverse E. coli samples for which matched Nanopore and Illumina data and a high-quality assembly were available. These are distributed across 4 major phylogroups of E. coli as shown in Fig. 4. Of these, 16 were isolated from clinical infections and rectal screening swabs in ICU patients in an Australian hospital [66]. One is the reference strain CFT073 that was resequenced and assembled by the REHAB consortium [67]. One is from an ST216 cardiac ward outbreak (identifier: H131800734); the Illumina data was previously obtained [68], and we did the Nanopore sequencing (see below). The two final samples were obtained from Public Health England: one is a Shiga-toxin-encoding E. coli (we used the identifier O63) [69], and the other an enteroaggregative E. coli (we used the identifier ST38) [70]. Coverage data for these samples can be found in Additional file 1: Supplementary Table 2.
PanRG construction
MSAs for gene clusters curated with panX [33] from 350 RefSeq assemblies were downloaded from http://pangenome.de on 3 May 2018. MSAs for intergenic region clusters based on 228 E. coli ST131 genome sequences were previously generated with Piggy [34] for their publication. The PanRG was built using make_prg. Three loci (GC00000027_2, GC00004221 and GC00000895_r1_r1_1) out of 37,428 were excluded because pandora did not complete in reasonable time (~ 24 h) once de novo variants were added.
Nanopore sequencing of sample H131800734
DNA was extracted using a Blood & Cell Culture DNA Midi Kit (Qiagen, Germany) and prepared for Nanopore sequencing using kits EXP-NBD103 and SQK-LSK108. Sequencing was performed on a MinION Mk1 Shield device using a FLO-MIN106 R9.4 SpotON flowcell and MinKNOW version 1.7.3, for 48 h.
Nanopore basecalling
Recent improvements to the accuracy of Nanopore reads have been largely driven by improvements in basecalling algorithms [71]. For comparison, 4 samples were basecalled with the default (methylation-unaware) model and with the methylation-aware, high-accuracy model provided with the proprietary guppy basecaller (version 3.4.5). Additional file 1: Supplementary Figure 5 shows the effect of methylation-aware Nanopore basecalling on the AvgAR/error rate curve for pandora with/without novel variant discovery via local assembly for those 4 samples. With normal basecalling, local de novo assembly increases the error rate substantially, from 0.22 to 0.60%, with a negligible increase in recall, from 89.1 to 90.1%, whereas with methylation-aware basecalling it increases the recall from 89.5 to 90.6% and only slightly increases the error rate, from 0.18 to 0.22%. On this basis, demultiplexing of the subsequent basecalled data was performed using the same methylation-aware version of the guppy software suite with barcode kits EXP-NBD104 and EXP-NBD114 and an option to trim the barcodes from the output.
Phylogenetic tree construction
Chromosomes were aligned using MAFFT [63] v7.467 as implemented in Parsnp [72] v1.5.3. Gubbins v2.4.1 was used to filter for recombination (default settings), and phylogenetic construction was carried out using RAxML [73] v8.2.12 (GTR + GAMMA substitution model, as implemented in Gubbins [74]).
Reference selection for mappingbased callers
A set of references was chosen for testing single-reference variant callers using two standard approaches, as follows. First, a phylogeny was built containing our 20 samples and 243 reference genomes from RefSeq; then, for each of our 20 samples, the nearest RefSeq E. coli reference was found using Mash [26]. Second, for each of the 20 samples, the nearest RefSeq reference in the phylogeny was manually selected; sometimes one RefSeq assembly was the closest to more than one of the 20 samples. At an earlier stage of the project, there had been another sample (making a total of 21) in phylogroup B1; this was discarded when it failed quality filters (data not shown). Despite this, the Mash/manually selected reference genomes were left in the set of mapping references, to evaluate the impact of mapping to a reference in a different phylogroup from all 20 of our samples.
Construction of truth assemblies
16/20 samples were obtained with matched Illumina and Nanopore data and a hybrid assembly. Sample H131800734 was assembled using the hybrid assembler Unicycler [75] with PacBio and Illumina reads followed by polishing with the PacBio reads using Racon [76], and finally with Illumina reads using Pilon [77]. A small 1 kb artefactual contig was removed from the H131800734 assembly due to low quality and coverage.
In all cases, we mapped the Illumina data to the assembly and masked all positions where the pileup of Illumina reads did not support the assembly.
Construction of a comprehensive and filtered truth set of pairwise SNPs
All pairwise comparisons of the 20 truth assemblies were performed with varifier (https://github.com/iqbal-lab-org/varifier), using subcommand make_truth_vcf. In summary, varifier compares two given genomes (referred to as G1 and G2) twice: first using dnadiff [78] and then using minimap2/paftools [28]. The two output sets of pairwise SNPs are then joined and filtered. We create one sequence probe for each allele (a sequence composed of the allele and 50 bases of flank on either side, taken from G1) and then map both to G2 using minimap2. We then evaluate these mappings to verify whether the variant found is indeed correct (TP) or not (FP) as follows. If the mapping quality is zero, the variant is discarded to avoid paralogs/duplicates/repeats that are inherently hard to assess. We then check for mismatches in the allele after mapping and confirm that the called allele is the better match.
Constructing a set of ground truth pan-genome variants
When seeking to construct a truth set of all variants within a set of bacterial genomes, there is no universal coordinate system. We start by taking all pairs of genomes and finding the variants between them, and then need to deduplicate them; e.g., when a variant between genomes 1 and 2 is the same as a variant between genomes 3 and 4, they should be identified as one. We define “the same” in terms of genome, coordinate, and allele. An allele A in a position P_{A} of a chromosome C_{A} in a genome G_{A} is defined as a triple A = (G_{A}, C_{A}, P_{A}). A pairwise variant PwV = {A_{1}, A_{2}} is defined as a pair of alleles that describes a variant between two genomes, and a pan-genome variant PgV = {A_{1}, A_{2}, …, A_{n}} is defined as a set of two or more alleles that describes the same variant between two or more genomes. A pan-genome variant PgV can also be defined as a set of pairwise variants PgV = {PwV_{1}, PwV_{2}, …, PwV_{n}}, as we can infer the set of alleles of PgV from the pairs of alleles in all these pairwise variants. Note that pan-genome variants are thus able to represent rare and core variants. Given a set of pairwise variants, we seek a set of pan-genome variants satisfying the following properties:

1. [Surjection]:
a. Each pairwise variant is in exactly one pan-genome variant;
b. A pan-genome variant contains at least one pairwise variant;
2. [Transitivity]: if two pairwise variants PwV_{1} and PwV_{2} share an allele, then PwV_{1} and PwV_{2} are in the same pan-genome variant PgV.
We model the above problem as a graph problem. We represent each pairwise variant as a node in an undirected graph G. There is an edge between two nodes n_{1} and n_{2} if n_{1} and n_{2} share an allele. Each component (maximal connected subgraph) of G then defines a pan-genome variant, built from the set of pairwise variants in the component, satisfying all the properties previously described. Therefore, the set of components of G defines the set of pan-genome variants P. However, a pan-genome variant in P could (i) have more than one allele stemming from a single genome, due to a duplication/repeat, or (ii) represent a bi-allelic, tri-allelic, or tetra-allelic SNP/indel. For this evaluation, we chose to have a smaller but more reliable set of pan-genome variants, and thus we filtered P by restricting it to the set of pan-genome variants P′ defined by the variants PgV ∈ P such that (i) PgV has at most one allele stemming from each genome and (ii) PgV is a bi-allelic SNP. P′ is the set of 618,305 ground truth filtered pan-genome variants that we extracted by comparing and deduplicating the pairwise variants present in our 20 samples, and that we use to evaluate the recall of all the tools in this paper. Additional file 1: Supplementary Figure 13 shows an example summarizing the described process of building pan-genome variants from a set of pairwise variants.
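The component construction can be sketched with a union-find structure in place of an explicit graph traversal. Illustrative Python only: alleles are any hashable identifiers standing in for the (G, C, P) triples, and pairwise variants are frozensets of two alleles.

```python
from collections import defaultdict

def pangenome_variants(pairwise_variants):
    """Group pairwise variants into pan-genome variants: connected components
    of the graph whose nodes are pairwise variants, with an edge whenever two
    variants share an allele (the transitivity property)."""
    parent = {pv: pv for pv in pairwise_variants}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    by_allele = defaultdict(list)
    for pv in pairwise_variants:
        for allele in pv:
            by_allele[allele].append(pv)
    for sharing in by_allele.values():         # union variants sharing an allele
        for pv in sharing[1:]:
            parent[find(pv)] = find(sharing[0])
    components = defaultdict(set)
    for pv in pairwise_variants:
        components[find(pv)].add(pv)
    return list(components.values())
```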
Subsampling read data and running all tools
All read data was randomly subsampled to 100× coverage using rasusa; the pipeline is available at https://github.com/iqbal-lab-org/subsampler. A snakemake [79] pipeline to run the pandora workflow with and without de novo discovery (see Fig. 2D) is available at https://github.com/iqbal-lab-org/pandora_workflow. A snakemake pipeline to run snippy, SAMtools, nanopolish, and medaka on all pairwise combinations of 20 samples and 24 references is available at https://github.com/iqbal-lab-org/variant_callers_pipeline.
Evaluating VCF files
Calculating precision
Given a variant/VCF call made by any of the evaluated tools, where the input was a set of reads from a sample (or several samples, in the case of pandora) and a reference sequence (or a PanRG, in the case of pandora), we perform the following steps to assess how correct the call is:

1. Construct a probe for the called allele, consisting of the sequence of the allele flanked by 150 bp on both sides, taken from the reference sequence. This reference sequence is one of the 24 chosen references for snippy, SAMtools, nanopolish, and medaka, or the multi-sample inferred VCF reference for pandora;
2. Map the probe to the sample sequence using BWA-MEM [80];
3. Remove multi-mappings by looking at the Mapping Quality (MAPQ) measure [36] of the SAM records. If the probe is mapped uniquely, then its mapping passes the filter. If there are multiple mappings for the probe, we select the mapping m_{1} with the highest MAPQ if the difference between its MAPQ and the second highest MAPQ exceeds 10. If m_{1} does not exist, then there are at least two good enough mappings, and it is ambiguous to choose which one to evaluate. In this case, we prefer to be conservative and filter this call (and all its related mappings) out of the evaluation;
4. We further remove calls mapping to masked regions of the sample sequence, in order not to evaluate calls lying in potentially misassembled regions;
5. Now we evaluate the mapping, giving the call a continuous precision score between 0 and 1. We look only at the alignment of the called allele (i.e., we ignore the alignment of the flanking sequences) and give a score as follows: number of matches / alignment length.
Finally, we compute the precision for the tool by summing the score of all evaluated calls and dividing by the number of evaluated calls. Note that here we evaluate all types of variants, including SNPs and indels.
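The scoring rule above can be sketched as follows (illustrative Python; inputs are the two rows of the called allele's portion of the alignment, with '-' for gaps, and the function names are ours).

```python
def call_score(aligned_allele, aligned_sample):
    """Continuous precision score for one call: matched bases divided by the
    alignment length of the called allele (flank alignment is ignored)."""
    matches = sum(a == b and a != "-"
                  for a, b in zip(aligned_allele, aligned_sample))
    return matches / len(aligned_allele)

def tool_precision(call_scores):
    """Precision of a tool: sum of per-call scores over the number of calls."""
    return sum(call_scores) / len(call_scores)
```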
Calculating recall
We perform the following steps to calculate the recall of a tool:

1. Apply the VCF calls to the associated reference using the VCF consensus builder (https://github.com/leoisl/vcf_consensus_builder), creating a mutated reference with the variants identified by the tool;
2. Build probes for each allele of each pan-genome variant previously computed (see section “Constructing a set of ground truth pangenome variants”);
3. Map all pan-genome variants’ probes to the mutated reference using BWA-MEM;
4. Evaluate each probe mapping, which is classified as a TP only if all bases of the allele were correctly mapped to the mutated reference. In the uncommon case where a probe multi-maps, it is enough for one of the mappings to be classified as TP;
5. Finally, as we now know for each pan-genome variant which of its alleles were found, we calculate both the pan-variant recall and the average allelic recall as per section “Pandora detects rare variation inaccessible to single-reference methods.”
Filters
Given a VCF file with likelihoods for each genotype, the genotype confidence is defined as the log likelihood of the maximum likelihood genotype minus the log likelihood of the next best genotype. Thus a confidence of zero means the two most likely alleles are equally likely, and highquality calls have higher confidences. In the recall/error rate plots of Fig. 6A,B, each point corresponds to the error rate and recall computed as previously described, on a genotype confidence (gtconf) filtered VCF file with a specific threshold for minimum confidence.
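The confidence computation and filter can be sketched as follows (illustrative Python; the call representation is ours).

```python
def genotype_confidence(genotype_log_likelihoods):
    """Genotype confidence: log likelihood of the most likely genotype minus
    that of the next best. Zero means the top two are equally likely."""
    best, second = sorted(genotype_log_likelihoods, reverse=True)[:2]
    return best - second

def gt_conf_filter(calls, threshold):
    """Keep only calls whose genotype confidence reaches the minimum
    threshold; `calls` is a list of (call, log-likelihoods) pairs."""
    return [call for call, lls in calls
            if genotype_confidence(lls) >= threshold]
```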
We also show the same plot with further filters applied in Additional file 1: Supplementary Figure 3. The filters were as follows. For Illumina data: for pandora, a minimum coverage filter of 5×, a strand bias filter of 0.05 (minimum 5% of reads on each strand), and a gaps filter of 0.8 were applied. The gaps filter means at least 20% of the minimizer k-mers on the called allele must have coverage above 10% of the expected depth. As snippy has its own internal filtering, no filters were applied. For SAMtools, a minimum coverage filter of 5× was used. For Nanopore data: for pandora, a minimum coverage filter of 10×, a strand bias filter of 0.05, and a gaps filter of 0.6 were used. For nanopolish, we applied a coverage filter of 10×. We were unable to apply a minimum coverage filter for medaka due to a software bug that prevents annotating the VCF file with coverage information.
Locus presence and distance evaluation
For all loci detected as present in at least one sample by pandora, we mapped the multi-sample inferred reference to all 20 sample assemblies and 24 reference sequences to identify their true locations. To be confident of these locations, we employed a strict mapping using bowtie2 [81], requiring end-to-end alignments. From the mapping of all loci to all samples, we computed a truth locus presence-absence matrix and compared it with pandora’s locus presence-absence matrix, classifying each pandora locus call as true/false positive/negative. Additional file 1: Supplementary Figure 6 shows these classifications split by locus length. Having the location of all loci in all the 20 sample assemblies and the 24 references, we then computed the edit distance between them.