Graph definitions
Here, we formally define a variant site and the type of graph that gramtools can support. Let G=(V,E) be a directed acyclic graph (DAG) with a unique minimal and unique maximal element, i.e. G has a unique source and unique sink. Each node v has a number of incoming edges deg−(v) and a number of outgoing edges deg+(v). Define a node v to be opening if deg+(v)>1 and closing if deg−(v)>1. Note that a node can be both opening and closing.
Let s be the sink of G. Given any opening node v, let S be the set of nodes that are in every path from v to s, excluding v itself. Then, S is non-empty because s belongs to S. Let a and b be any elements of S. Since a and b each lie on every path from v to s, any single such path contains both. Therefore, using the partial order defined by the edges of G, a and b are comparable, and it follows that S is a totally ordered finite non-empty set. Therefore, S contains a unique minimal element, which we denote c(v). Informally, c(v) can be thought of as the first node that “closes” all paths from v. Similarly, given a closing node u, we define o(u) to be c(u) applied to the transpose of G. Informally, o(u) is the node that “opens” u. See Additional File 1: Figure S2 for an example of how S and closing nodes are identified.
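To make these definitions concrete, the following minimal sketch identifies opening and closing nodes and computes c(v) on a toy adjacency-list DAG (the node names and the path-enumeration approach are illustrative only, not the gramtools implementation):

```python
# Toy DAG with unique source "src" and sink "snk", as an adjacency list.
edges = {"src": ["a", "b"], "a": ["m"], "b": ["m"], "m": ["snk"], "snk": []}

def out_deg(v):
    return len(edges[v])

def in_deg(v):
    return sum(v in succ for succ in edges.values())

opening = [v for v in edges if out_deg(v) > 1]   # deg+(v) > 1
closing = [v for v in edges if in_deg(v) > 1]    # deg-(v) > 1

def all_paths(v, sink="snk"):
    """Enumerate every path from v to the sink (fine for toy graphs)."""
    if v == sink:
        yield [v]
        return
    for w in edges[v]:
        for path in all_paths(w, sink):
            yield [v] + path

def c(v):
    """c(v): the first node closing all paths from opening node v, i.e.
    the minimal element of S, the intersection of all v->sink paths."""
    paths = list(all_paths(v))
    S = set.intersection(*(set(p) for p in paths)) - {v}
    # S is totally ordered, so its minimal element is simply the first
    # member of S encountered along any one path.
    return next(x for x in paths[0] if x in S)

print(opening, closing)  # ['src'] ['m']
print(c("src"))          # m
```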
We use the notion of opening and closing nodes to define a variant site as follows.
Definition.
Let G be a DAG with a unique source and unique sink. A variant site is defined as the subgraph of G induced from {u,c(u)} or from {o(v),v}, where u is any opening node and v is any closing node of G.
We remind the reader that for any DAG G, there exists at least one ordering of all the nodes v0,v1,…,vn such that given any edge (vi,vj) of G, vi appears before vj in the ordering. This is called a topological ordering of G. Using the above definitions, we can now define the type of graph that is supported by gramtools, which we call a “nested directed acyclic graph”.
Definition.
Let G be a DAG with a unique source and unique sink. G is said to be a nested directed acyclic graph (NDAG) if there exists a topological ordering of all nodes v0,v1,…,vn such that adding brackets to this ordered list of nodes according to the following rules results in balanced opening and closing brackets:
1. For each opening node u, add [u after u, and add ]u before c(u);
2. For each closing node v, add [v after o(v) and add ]v before v, unless these brackets were already added by case 1.
Note that each matching pair of brackets in the above definition corresponds to one variant site. See Additional File 1: Figure S3 for an illustration.
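The bracket condition can be checked mechanically with a stack. The sketch below is a simplified illustration (the helper names are hypothetical; brackets that land in the same gap between nodes are ordered closers-first, which glosses over some ordering ambiguities):

```python
def is_nested(topo_order, pairs):
    """Check the bracket-balance condition for one topological ordering.
    topo_order: list of nodes; pairs: one (opener, closer) tuple per
    variant site, derived from the c() and o() maps above."""
    pos = {v: i for i, v in enumerate(topo_order)}
    events = []
    for site, (u, w) in enumerate(pairs):
        events.append((pos[u], 1, site, "["))      # '[' in the gap after u
        events.append((pos[w] - 1, 0, site, "]"))  # ']' in the gap before w
    stack = []
    for _, _, site, kind in sorted(events):
        if kind == "[":
            stack.append(site)
        elif not stack or stack.pop() != site:
            return False                           # mismatched brackets
    return not stack                               # all sites closed

# One site opening after "src" and closing before "m": trivially balanced.
print(is_nested(["src", "a", "m", "snk"], [("src", "m")]))  # True
```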
To be able to index with the vBWT, gramtools applies the following modifications to the graph, producing a new graph in which there is a one-to-one correspondence between opening and closing nodes. Specifically, this means that a node is either opening or closing, and cannot open or close more than one node. Essentially, the method entails adding a new node to the graph for each balanced bracket that was added to the topological ordering of the nodes. Starting from the innermost brackets, for each matching pair of brackets [a and ]a, where node a precedes [a and node b follows ]a in the topological ordering with balanced brackets (so we are considering …,a,[a,…,]a,b,…):
- Add a node called [a with no sequence and an edge (a,[a) to the graph, and move the outgoing edges of a to [a;
- Add a node called ]a with no sequence and an edge (]a,b) to the graph, and move the incoming edges of b to ]a.
See Additional File 1: Figure S4 for an illustration of this process.
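The edge rewiring can be sketched as follows on a dict-based adjacency list (the marker node names are illustrative; the actual gramtools transformation operates on its own graph structures):

```python
def add_site_markers(edges, a, b, site):
    """For a matched bracket pair between nodes a and b, insert two
    sequence-less marker nodes, moving a's outgoing edges to the opener
    and b's incoming edges to the closer."""
    open_s, close_s = f"[{site}", f"]{site}"
    edges[open_s] = edges[a]      # opener inherits a's outgoing edges
    edges[a] = [open_s]           # a now points only to the opener
    edges[close_s] = [b]          # closer points to b
    for v in edges:               # redirect b's incoming edges to the closer
        if v != close_s:
            edges[v] = [close_s if w == b else w for w in edges[v]]
    return edges

g = {"a": ["x", "y"], "x": ["b"], "y": ["b"], "b": []}
print(add_site_markers(g, "a", "b", 5))
# {'a': ['[5'], 'x': [']5'], 'y': [']5'], 'b': [], '[5': ['x', 'y'], ']5': ['b']}
```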
In practice, gramtools does not need to transform an NDAG or verify that an input DAG is an NDAG, as it takes as input constructed graphs that are already indexable NDAGs. This is achieved in one of the two ways described below.
Genome graph construction and make_prg
gramtools can construct genome graphs without nested variation from a reference genome and a VCF of variants. Overlapping records in the VCF file are merged by enumerating all possible combinations up to a specifiable limit. This method creates an NDAG because no variant sites overlap, giving a natural balanced bracket representation of sites. However, this approach rapidly fails in variant-dense regions or for large cohorts of samples due to a prohibitively large number of allele combinations. We solve this problem by allowing for nested variation. To build nested graphs, we apply an algorithm called recursive collapse and cluster (RCC), starting from a multiple-sequence alignment. RCC was first introduced in the context of the bacterial pan-genomics tool pandora [15] and is implemented in make_prg, available at https://github.com/iqbal-lab-org/make_prg.
RCC identifies invariant regions of a given minimum size and collapses them into a single graph node. The remaining regions form variant sites, and each is clustered based on its k-mer content. This procedure is repeated recursively on each cluster, until either a maximum nesting level is reached or the sequences are too small (in which case they are directly enumerated as alternative alleles). In this way, variants appear in subsets of samples with similar sequence backgrounds. The RCC algorithm generates hierarchically nested sites by construction: each cluster of sequences corresponds to one variant site, the clustering process generates distinct clusters, and recursive sequence collapsing occurs fully inside a cluster, so that new clusters are nested. RCC thus produces an output graph that is an NDAG. We provide an illustration of RCC and how it induces the balanced bracket representation of an NDAG in Additional File 1: Figure S5.
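The following is a highly simplified sketch of the RCC recursion, assuming equal-length, gap-free toy sequences and a crude clustering that groups sequences with identical k-mer sets; the production algorithm in make_prg differs in many details:

```python
from collections import defaultdict

def kmers(seq, k=3):
    return frozenset(seq[i:i + k] for i in range(len(seq) - k + 1))

def cluster(seqs, k=3):
    """Group sequences by identical k-mer sets (a crude stand-in for the
    real k-mer-based clustering)."""
    groups = defaultdict(list)
    for s in seqs:
        groups[kmers(s, k)].append(s)
    return list(groups.values())

def segments(msa, min_match):
    """Split columns into runs; an invariant run is collapsible only if
    it is at least min_match long, otherwise it is merged into the
    surrounding variable region."""
    n = len(msa[0])
    inv = [len({s[i] for s in msa}) == 1 for i in range(n)]
    runs, i = [], 0
    while i < n:
        j = i
        while j < n and inv[j] == inv[i]:
            j += 1
        runs.append([inv[i] and (j - i) >= min_match, i, j])
        i = j
    merged = []
    for r in runs:  # fuse adjacent non-collapsible runs
        if merged and not merged[-1][0] and not r[0]:
            merged[-1][2] = r[2]
        else:
            merged.append(r)
    return merged

def rcc(msa, min_match=4, max_nesting=2, level=1):
    """Recursive collapse-and-cluster over equal-length aligned sequences;
    returns collapsed strings and per-site allele lists (nested lists
    when the recursion fires)."""
    out = []
    for collapsible, i, j in segments(msa, min_match):
        if collapsible:
            out.append(msa[0][i:j])          # collapsed invariant node
        else:
            alleles = sorted({s[i:j] for s in msa})
            if level >= max_nesting or len(alleles) <= 2:
                out.append(alleles)          # enumerate alleles directly
            else:                            # recurse within each cluster
                out.append([rcc(c, min_match, max_nesting, level + 1)
                            for c in cluster(alleles)])
    return out

msa = ["ACGTAAATTTT", "ACGTCCCTTTT", "ACGTAACTTTT"]
print(rcc(msa, min_match=4, max_nesting=1))
# ['ACGT', ['AAA', 'AAC', 'CCC'], 'TTTT']
```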
Two command-line parameters affect what graph gets produced. First, max_nesting is the maximum number of collapse and cluster recursions to perform, which gives the maximum number of nesting levels in the graph. Second, min_match_length is the minimum number of consecutive bases that must be invariant across samples for them to be collapsed into a single node. Sequence collapse is what allows paths entering and leaving the collapsed node to cross; a larger value thus reduces recombination between the input haplotypes. This provides a way to control combinatorial path explosions in the graph.
vBWT data structure in gramtools
The vBWT data structure marks variant sites with numeric identifiers so that alleles get sorted and queried together in the suffix array (Fig. 8a). This representation induces branching at each site entry and exit, such that mapping has worst-case exponential run-time. To speed up mapping, we seed reads using an index storing the mapped intervals of all sequences of a given size k. Linear-time exact-match indexes on genome graphs exist (e.g. GCSA [39]) but require a prefix-sorting step that is worst-case exponential.
vBWT’s numeric identifiers are also used for recording mapped read coverage along variant sites (Fig. 8b). Coverage recording handles two types of uncertainty: horizontal, where sequence is repeated across the genome, and vertical, where sequence is repeated in alleles of a site. To handle horizontal uncertainty, we randomly select one read mapping instance, as is typically done in standard aligners [30]. To handle vertical uncertainty, we store allele-level equivalence class counts, which are counts of reads compatible with groups of alleles, an idea introduced in kallisto [16]. This allows allelic uncertainty to be accounted for during genotyping. Per-base coverage is also stored on the graph (Fig. 8) and used during genotyping.
Genotyping model
The genotyping model in gramtools supports haploid and diploid genotyping. It assigns a likelihood to each candidate allele (or pair of alleles for diploid) computed from base-level and allele-level read coverage.
gramtools stores coverage in equivalence classes, following ideas from [16]. Let A be the set of alleles at a variant site. We partition the set of all reads overlapping A into subsets, i.e. equivalence classes, where all reads belonging to one subset map perfectly to the same subset X of A (e.g. reads that map uniquely to allele 1, or reads that map equally well to alleles 1 and 2; see Fig. 8b). For each equivalence class, we store a count \(c_{X} \in \mathbb {N}\) of reads compatible with the alleles in set X, and for each mapped read, we increment the corresponding cX at each site the read overlaps. If a read has multiple (horizontal) mapping instances, we select one at random, and the counts cX are incremented as above. When a read’s count cX is incremented, for each allele a∈X, the count of each base the read mapped to is also incremented. Base-level counts are written c(ai), where ai is the ith base of allele a.
Coverage compatible with allele a of length la is defined as \(c(a) = \frac{1}{l_{a}}\sum_{i=1}^{l_{a}} c(a_{i})\) and incompatible coverage as \(i(a) = \sum_{X \subset A: a \notin X} c_{X}\). In this way, for each candidate allele, we capture both a per-base correct-coverage generation process and an incorrect-coverage generation process on incompatible alleles.
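As a concrete illustration of this bookkeeping, consider a toy site with invented counts (the alleles, class counts and per-base counts below are made up for the example):

```python
# Equivalence classes are keyed by the frozenset of alleles that a read
# maps perfectly to (vertical uncertainty).
A = {"1", "2", "3"}
class_counts = {frozenset({"1"}): 4,        # reads unique to allele 1
                frozenset({"1", "2"}): 2,   # reads shared by alleles 1 and 2
                frozenset({"3"}): 1}        # reads unique to allele 3

# Per-base counts for allele 1 (length 4), incremented alongside c_X.
base_counts = {"1": [6, 6, 5, 6]}

def compatible(a):
    """c(a): mean per-base coverage over allele a."""
    counts = base_counts[a]
    return sum(counts) / len(counts)

def incompatible(a):
    """i(a): reads in classes whose allele set excludes a."""
    return sum(n for X, n in class_counts.items() if a not in X)

print(compatible("1"))    # 5.75
print(incompatible("1"))  # 1  (only the allele-3 class excludes allele 1)
```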
We model the expected per-base read coverage in a site using an estimate of the mean λ and the variance σ2 of true coverage across all variant sites. For each site, true coverage is estimated as the average per-base coverage of the allele with the most coverage. If σ2≤λ, we model observed coverage as coming from a Poisson distribution:
$$p(c(a) = k | \lambda) = e^{-\lambda} \frac{\lambda^{k}}{k!} $$
Otherwise, we use the negative binomial (NB) distribution:
$$p(c(a) = k | p, r) = \binom{k+r-1}{k}(1-p)^{r}p^{k} $$
When using the NB distribution, we need to estimate its standard parameters r and p. r is obtained by rearranging the formula for the expected variance of the NB distribution, giving \(r=\frac{\lambda^{2}}{\sigma^{2} - \lambda}\), and similarly p is obtained from its expected mean, giving \(p=\frac{\lambda}{\lambda + r}\).
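The dispersion-dependent switch between the two distributions can be written directly from these formulas; the sketch below uses the log-gamma function so that non-integer r and k are handled (it mirrors the formulas above rather than gramtools’ actual code):

```python
from math import exp, lgamma, log

def log_pmf(k, lam, var):
    """Log-probability of observing mean per-base coverage k, switching
    between Poisson (var <= lam) and negative binomial (var > lam)."""
    if var <= lam:                   # Poisson: e^-lam * lam^k / k!
        return -lam + k * log(lam) - lgamma(k + 1)
    r = lam * lam / (var - lam)      # r = lambda^2 / (sigma^2 - lambda)
    p = lam / (lam + r)              # p = lambda / (lambda + r)
    # log C(k+r-1, k) + r*log(1-p) + k*log(p), via the gamma function
    return (lgamma(k + r) - lgamma(k + 1) - lgamma(r)
            + r * log(1 - p) + k * log(p))

print(exp(log_pmf(40, 40.0, 30.0)))  # Poisson branch, ~0.063
print(exp(log_pmf(40, 40.0, 80.0)))  # overdispersed: NB branch, ~0.045
```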
We model incorrect coverage i(a) as coming from sequencing errors with rate ε: p(i(a)=k|ε)=ε^k. ε is estimated from the mean base quality score in the first 10,000 processed reads.
We also use per-base coverage to penalise gaps in coverage. Given a function g(a) returning the number of zero-coverage positions in allele a, the probability p(g(a)=k) of seeing k gaps if a is the true allele is p(zero_cov)^k, where p(zero_cov) is obtained by computing p(c(a)) using the above formula with c(a) set to zero. In practice, we use \(\frac{k}{l_{a}}\) as the exponent instead of k, so as not to penalise long alleles.
These three terms combined give the likelihood of allele a:
$$\mathcal{L}(a) = p(c(a)) p(i(a)) p(g(a))$$
The allele a′ that gets called is the maximum-likelihood allele, and we define the genotype confidence of the call as
$$\min_{a\in A:a\neq a'} \frac{\mathcal{L}(a')}{\mathcal{L}(a)}$$
which is the likelihood ratio of the called allele and the next most likely allele.
This holds for haploid genotyping. For higher ploidy, the likelihood function generalises to a set of alleles S as
$$\mathcal{L}(S) = p(i(S)) \prod_{a \in S} p(c(a)) p(g(a)) $$
where \(i(S) = \sum_{X \subset A: X \cap S = \emptyset} c_{X}\).
For diploid genotyping, S={a1,a2} and p(c(a)) is parameterised by \(\frac{\lambda}{2}\), because we expect half the total site coverage on each of the two called alleles.
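Tying the three likelihood terms and the confidence ratio together, here is a haploid toy example (Poisson branch only; the coverage numbers are invented, and gramtools itself computes this over its own coverage structures):

```python
from math import exp, lgamma, log

def log_poisson(k, lam):
    return -lam + k * log(lam) - lgamma(k + 1)

def log_likelihood(cov, incompat, gaps, length, lam, eps):
    """log L(a) = log p(c(a)) + log p(i(a)) + log p(g(a))."""
    log_err = incompat * log(eps)                    # p(i(a)) = eps^i(a)
    log_gap = (gaps / length) * log_poisson(0, lam)  # exponent k / l_a
    return log_poisson(cov, lam) + log_err + log_gap

lam, eps = 40.0, 0.01
alleles = {        # allele: (c(a), i(a), zero-coverage positions, l_a)
    "ref": (38.5, 1, 0, 100),
    "alt": (2.0, 35, 60, 100),
}
ll = {a: log_likelihood(*v, lam, eps) for a, v in alleles.items()}
best = max(ll, key=ll.get)
second = max((a for a in ll if a != best), key=ll.get)
confidence = exp(ll[best] - ll[second])  # ratio of best to next-best
print(best, confidence)                  # ref, with a very large confidence
```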
Nested genotyping
The gramtools genotyping model is applied recursively from child sites to their parent sites. Calls in child sites restrict the set of alleles considered in the parent so that the number of choices is reduced: for each outgoing path from a site, \(\prod_{i=1}^{n} p_{i}\) paths are considered, where pi is the number of distinct alleles called at site i (e.g. in diploids 0, 1 or 2) and n is the number of child sites encountered. Some extra paths are also retained when genotype confidence for a child site is low, in order to propagate uncertainty to parent calls. If there are more than 10,000 possible alleles, only the 10,000 most likely alleles are considered. This does not require enumerating all possible alleles, as the most likely alleles in child sites have already been computed.
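As a small worked example of the path-count reduction (the numbers are hypothetical):

```python
from math import prod

# Distinct alleles called at each of n child sites along one outgoing path
called_per_child = [2, 1, 2]             # e.g. diploid calls at 3 child sites
paths_considered = prod(called_per_child)
print(paths_considered)                  # 4, instead of all allele combinations
```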
An example of the nested genotyping procedure is shown in Fig. 9. To maintain coherence, if child sites on two different branches of a parent site are genotyped, whole branches can get invalidated. For example, at a ploidy of one, if an outgoing branch from a parent site is called, all child sites on the other branches receive null calls.
P. falciparum surface antigen graphs and genotyping validation
Graph construction
We started from VCF files produced by running Cortex, a de novo assembly-based variant caller [5], on read sets of 2,498 samples from the Pf3k project [20] (all reads are publicly available on the ENA, see the Availability of data and materials section). Cortex has a very low false positive call rate and can call the divergent forms of P. falciparum surface antigens [22]. The Cortex independent workflow was run, using the bubble caller with k=31. The VCF files are publicly released on zenodo (see Availability of data and materials) and Cortex is publicly available at https://github.com/iqbal-lab/cortex.
For each surface antigen gene (DBLMSP, DBLMSP2, EBA175 and AMA1), we generated sequences for each sample by applying Cortex variants at the gene coordinates, plus 5000 bp on each side, to the P. falciparum 3D7 reference genome. We generated multiple-sequence alignments of each gene using mafft [40] and passed them as input to our construction tool make_prg. For the simulation experiment, two graphs were built for DBLMSP and DBLMSP2 with maximum nesting levels of 1 and 5. The graphs without nesting have 451 and 413 variant sites for DBLMSP and DBLMSP2, and the graphs with nesting have 558 and 500 variant sites respectively.
Path and read simulation
From each non-nested graph, 10 paths were simulated and threaded through the nested graph using gramtools’ simulate command. This results in jVCF files recording the true genotypes for each path in each graph. Illumina HiSeq 2500 75-bp reads (0.023% per-base error rate) were simulated from each unique path using ART [41] at 40-fold coverage, genotyped using gramtools, and calls were evaluated by comparing the genotyped and truth jVCF files.
The nested graph contains more paths than the non-nested graphs due to allowing greater recombination between variant sites. We therefore simulated paths from the non-nested graph to ensure each path exists in both graphs.
Comparison with reference-based callers
The nested graphs of the four surface antigens DBLMSP, DBLMSP2, EBA175 and AMA1 were combined in positional order with the rest of the reference genome, using a custom script (see the Availability of data and materials section).
For each of the 14 validation samples, we ran SAMtools and Cortex using the P. falciparum 3D7 reference genome, and gramtools using the surface antigen genome graph. For each genotyped sample, gramtools infers a haploid personalised reference genome (PR) as the whole-genome path taking the called allele at each variant site. SAMtools and Cortex were then run once more using the personalised reference instead of 3D7.
When mapping gene sequences with variants applied to the truth assemblies, we measured performance as the edit distance reported by bowtie2 (version 2.4.1) divided by gene length.
M. tuberculosis SNP and large deletion analysis
Hybrid assembly of the 17 evaluated samples
Each sample was initially assembled using Unicycler [42] and Canu [43], followed by Circlator [44] using the corrected reads output by Canu. Unicycler version 0.4.8 was used with the option ‘--mode bold’, the Illumina reads given using the options ‘--short1’ and ‘--short2’, and the PacBio subreads using the ‘--long’ option. Canu version 2.0 was used with the option ‘genomeSize=4.4m’ and the PacBio reads provided with the option ‘-pacbio-raw’. The only exception was sample N1177, which was initially assembled using Flye [45] version 2.8-b1674 with the PacBio subreads input with the option ‘--pacbio-raw’.
The initial assembly for each sample was chosen for further manual polishing based on inspection of mapped reads and comparison with the H37Rv reference genome. The Unicycler assembly was used for samples N0004, N0091, N0155, N0157, N1283, N0072, N1202 and N0153. The Canu assembly was used for samples N0031, N1176, N0052, N0136, N0146, N1216, N1272 and N0054. Redundant and/or contaminant contigs were removed from samples N0072, N1202, N0052, N0136, N0146, N1216 and N1272. Manual fixes were applied to samples N0054 and N0153 by breaking contigs at errors, with the aid of the Artemis Comparison Tool (ACT) [46], and re-merging using Circlator with default settings. Next, Pilon [47] (version 1.23) was run iteratively on each assembly using the Illumina reads as input, mapped with BWA-MEM [48] (version 0.7.17-r1188, default settings), until no more corrections were made, up to a maximum of 10 iterations. Finally, the ‘fixstart’ function of Circlator was used to ensure that each assembly began with the dnaA gene, for consistency with the H37Rv reference genome. The result was that each of the 17 samples was assembled into a single, circularised contig.
Variant discovery
We first obtained variant calls from the read files of the 17 evaluated samples and an additional 1000 samples available in the ENA (see the Availability of data and materials section). We ran Cortex to obtain the calls, using our lab’s wrapper clockwork version 0.8.3, publicly available at https://github.com/iqbal-lab-org/clockwork. clockwork runs Cortex’s independent workflow using the bubble caller with k=31. The VCF files are publicly released on zenodo (see Availability of data and materials).
Cortex identified a total of 73 deletions in the 17 evaluated samples, between 100 and 13,000 bases in length and falling in 45 distinct genomic regions. To validate the calls, we mapped their corresponding long-read assemblies to the M. tuberculosis H37Rv reference genome with minimap2, which validated 68. The remaining 5 were manually confirmed using ACT: for each sample, we mapped the short reads to the reference genome and to the assembly using bowtie2, and mapped the assembly to the reference using nucmer [49]. In ACT, we viewed all three together and validated a deletion when it appeared in the assembly-reference mapping at the expected coordinates and when read pileups confirmed the event. These are shown in Additional File 1: Figures S12-S16.
Having validated all the deletions, we extracted all Cortex calls occurring under the 45 deletion regions in the 1017 samples, giving us a joint set of large deletions and overlapping SNPs and indels.
gramtools genome graph construction
We built one genome graph for each of the 45 regions identified as containing large deletions in our 17 evaluation samples. As for the P. falciparum surface antigen graphs, for each region, we applied Cortex calls to the M. tuberculosis H37Rv reference genome, generated multiple sequence alignments with mafft, passed them as input to our construction tool make_prg and combined them with the rest of the genome.
vg and GraphTyper2 genome graph construction
We set out to build a vg genome graph from the same multiple sequence alignments (MSAs) used by gramtools, to maximise comparability. Using vg version 1.26.0, we built each of the 45 regions from its MSA using vg construct and combined them with the invariant parts of the M. tuberculosis H37Rv reference genome using vg concat.
Indexing this graph, a prerequisite to read mapping and variant calling, used more than 10 terabytes of temporary disk space before we stopped it. We deemed more than 500 gigabytes prohibitive and set that as a limit. We ran vg prune to remove densely clustered variation from the graph; after indexing the pruned graph with default parameters exceeded 1 terabyte of disk, we successfully indexed it with parameters -k10 -X3.
We then attempted to run vg call for each of our 1017 samples against the MSA graph; although mapping to the indexed graph succeeded, vg call failed with a segmentation fault.
We therefore built a graph from a VCF file instead. We ran vg deconstruct -p -e to obtain a VCF file describing the variants identified by vg in the MSA-constructed graph, and manually validated the variation for one sample by comparison with gramtools. However, running vg construct with this VCF also failed with a segmentation fault.
We therefore used vg graph construction and genotyping from a merged VCF of all variants in the 45 regions, which we produced using bcftools. This ran successfully after graph pruning to stay under our disk limit.
This VCF file was also used as input to GraphTyper2 via graphtyper version 2.5.1, running its genotype_sv subcommand. GraphTyper2 only accepts VCF files as input, not MSAs.
Covered positions and number of variants
Altogether, the 45 deletion regions cover 51,701 bp of the reference genome. The variants under them cover 4,105 reference positions in 1,109 sites in the gramtools graph, and 2,386 positions in 1,434 sites in the merged VCF file used by vg and GraphTyper2.
Mapping evaluated regions to truth assemblies
We evaluated a total of 3,060 sequences by mapping them to truth assemblies: 17 samples × 45 regions × 4 tools (gramtools, vg, GraphTyper2 and the reference genome sequence). Using bowtie2, 10.4% of all sequences failed to be fully aligned due to excessive divergence between the called sequence and the truth.
To recover more alignments, we used minimap2, which is designed to align more highly diverged sequences (such as ONT long reads) [29]. For each evaluated sequence, we took the alignment with the greatest number of matches to the assembly and extracted the assembly sequence of the same length starting from the first aligned position (including soft- or hard-clipped bases). We obtained the edit distance between the two sequences from Needleman-Wunsch alignment using edlib [50]. This approach reduced the proportion of unaligned sequences to 1.1%.
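The per-sequence comparison reduces to a global pairwise alignment; a sketch using the edlib Python bindings (the sequences are placeholders):

```python
import edlib  # pip install edlib

called_seq = "ACGTACGTTT"  # placeholder: called sequence for one region
truth_seq = "ACGTACCTTT"   # placeholder: same-length assembly extract

# Global (Needleman-Wunsch) alignment; task="distance" computes the
# edit distance only, without the full alignment path.
result = edlib.align(called_seq, truth_seq, mode="NW", task="distance")
print(result["editDistance"])  # 1
```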
To ensure evaluated alignments are unambiguous, we filtered them by MAPQ ≥ 30, so that the probability that they are non-unique is ≤ 10^-3 as estimated by minimap2. This removed 0.62% of the evaluated sequences. For each tool, 13 of 765 sequences were not mapped or had insufficiently high mapping quality. The number of unmapped and low-MAPQ sequences for each tool is shown in Additional File 1: Figure S18.
We required the VCF records output by each tool to have a FILTER status set to “PASS”. This changed results only marginally, giving the same number of unmapped and low-MAPQ sequences, and decreasing the mean edit distance by 0.11% for GraphTyper2 and 0.08% for gramtools, with no differences for vg.
Evaluating variant calls using varifier
varifier is a tool for measuring the accuracy of variant calls in a VCF using a reference genome and a truth assembly. Given a variant call, varifier determines whether it is correct by aligning the reference genome sequence with the called variant applied (plus some flanking sequence to make the alignment specific) to the truth assembly. To compute call precision, which we define as the fraction of calls made that are correct, this procedure is applied to each variant in the evaluated VCF file. To compute call recall, which we define as the fraction of expected calls that are made, varifier first aligns the reference genome to the truth assembly to derive a set of expected variant calls. Each expected call is then evaluated (by the mapping procedure above) against an ‘induced truth genome’ obtained by applying the variants in the evaluated VCF to the reference genome sequence. If an expected call has been made in the evaluated VCF, it will then be found in the induced truth genome. For both recall and precision, we restricted the evaluation to calls with FILTER set to “.” or “PASS”, in order to ignore low-confidence calls. This led to an improvement in recall and precision of 0% and 1.9% (gramtools), 0.1% and 2.1% (GraphTyper2) and 0.9% and 1.1% (vg), on average across the variant types.
varifier is described in more detail in [51] and available at https://github.com/iqbal-lab-org/varifier.
P. falciparum dimorphic variation analysis
We took a subset of 706 samples from Ghana, Cambodia and Laos out of the 2,498 samples used to build the P. falciparum genome graphs (see the “Graph construction” section). Using gramtools, each sample was genotyped using the graph containing four surface antigens (DBLMSP, DBLMSP2, EBA175, AMA1). We combined each sample’s jVCF output file into a single multi-sample jVCF file using gramtools’ combine_jvcf executable and analysed the final jVCF using custom scripts (see the Availability of data and materials section).