Gramtools enables multiscale variation analysis with genome graphs

Genome graphs allow very general representations of genetic variation; depending on the model and implementation, variation at different length-scales (single nucleotide polymorphisms (SNPs), structural variants) and on different sequence backgrounds can be incorporated with different levels of transparency. We implement a model which handles this multiscale variation and develop a JSON extension of VCF (jVCF) allowing for variant calls on multiple references, both implemented in our software gramtools. We find gramtools outperforms existing methods for genotyping SNPs overlapping large deletions in M. tuberculosis and is able to genotype on multiple alternate backgrounds in P. falciparum, revealing previously hidden recombination. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-021-02474-0).

This is intended to show the impact of nesting on genotyping without representing any specific implementation. a) Non-nested graph showing a site with several long alternate alleles differing by SNPs (shown in red). Top allele is the actual allele in a sequenced genome, whose reads have been mapped to this graph. On the right hand end of each allele is the average depth of coverage on this allele. The bottom allele is a deletion, shown with zero coverage. There is very little difference between the true allele and the others which differ by SNPs but only have marginally lower depth of coverage when averaged across the allele. Confidence of the called allele will be low. b) The same sequences in a nested graph, with coverage shown at nested sites. Precise coverage at SNPs makes it very clear what is going on -all SNP calls will be high confidence and correct.
In Fig. S1, we refer to nested and non-nested DAGs; these are formally defined in the Methods section of the paper.

Graph constraints
In this section, we illustrate the types of graphs supported by gramtools. In Fig. S2, we illustrate how variant sites are mapped on DAGs: from an opening node v (shown in blue), the nodes on all paths from v to the graph's sink node form a set S that is non-empty and totally ordered. The minimal element of S is the closing node c(v) of v (see Methods in the main section for details).
In Fig. S3, we illustrate the balanced bracket property and graphs with and without this property. Graphs without a balanced brackets representation cannot be indexed by gramtools (see Methods in the main section). As shown in Fig. S3, graphs without this property can sometimes be transformed into equivalent graphs (ie representing the same sequence) with this property.
In Fig. S4, we show a graph with a balanced brackets representation that cannot be directly indexed by gramtools. gramtools requires a one-to-one correspondence between the opening and closing nodes of the graph (ie each opening or closing node cannot open or close more than one variant site). In Fig. S4 however, node 1 is the opening node for both Site 1 and Site 2. To make the graph indexable by gramtools, each bracket node in the balanced bracket representation of the graph is added to the graph, yielding the graph at the bottom. This graph has a one-to-one correspondence between opening and closing nodes and can thus be indexed by gramtools.
The Methods part of the main section formally describes how to produce a balanced brackets representation and how to transform a graph with balanced brackets into a gramtools-indexable graph. Given an opening node (node 1, shown in blue), we call S the set of nodes on all paths from v to the sink (shown in red). This set is non-empty and totally ordered and so has a minimal element, which is node 7 in this example. Therefore node 7 is the closing node of node 1, and the subgraph induced from node 1 to 7 forms a variant site.   : Indexable and non-indexable graphs representing equivalent sequences. The graphs in the three panels represent the same set of sequences but have different topology. Each node is labeled with an integer identifier in red, increasing according to a topological ordering of the nodes. The balanced brackets representation of each graph is written below it. The graph in panel a) does not have balanced brackets and cannot be indexed by gramtools, while the graphs in panels b) and c) do and can therefore be indexed.  Fig. S4: gramtools transformation of indexable graphs. Each node is labeled with an integer identifier in red, increasing according to a topological ordering of the nodes. The graphs each have two variant sites, with opening and closing nodes marked between curly brackets. In the top graph, node 1 is the opening node for both site 1 (c(1) = 5 is the closing node of site 1) and site 2 (o(4) = 1 is the opening node of site 2). Thus gramtools cannot index this graph, as a one-to-one correspondence between opening and closing nodes is required. From the balanced bracket representation of this graph, a new gramtools-indexable graph is produced, by adding one node for each bracket to the graph (see Methods in the main section for how). In this new graph, each site has a distinct pair of opening and closing nodes. Each of the three resulting grey blocks represents one variant site and will be treated independently. c) Cluster operation. For each variant site from a/b, sequences are clustered using k-means, shown as one coloured block per cluster. d) Recursive application of RCC to each coloured block (here, only the collapse operation is shown). When the length of the coloured block becomes too short, or no suitable clustering is found, all sequences are simply enumerated as alternate alleles, ending the recursion.
In this paper, gramtools takes as input graphs that are NDAGs by construction. These are built using our tool make prg from multiple-sequence alignments (MSA). In Fig. S5, we illustrate the algorithm implemented in make prg. Notice that in this figure, each coloured block corresponds to one variant site, with blocks occurring inside other blocks. If we represent each block using a pair of matching opening and closing brackets, enumerating the brackets by traversing the MSA from left to right and top to bottom, it is clear that the entire set of brackets is balanced. Further, make prg builds graphs with a one-to-one correspondence between opening and closing nodes of variant sites. Thus, the graphs it constructs are gramtools-indexable NDAGs. Benchmarking gramtools genotyping against single-reference variant callers at surface antigens

Validation of nested genotyping with simulated data
We provide here a breakdown of performance for each gene, which also includes the performance of running SAMtools and Cortex on top of the gramtoolsinferred personalised reference (PR) genome for each sample (which is the path given by the maximum-likelihood allele at each variant site). In AMA1, the gene is easily resolved and performance is the same for all three tools (Fig. S7), while in EBA175, DBLMSP and DBLMSP2, gramtools outperforms the singlereference callers (Fig. S8, 9, 10). Running SAMtools and Cortex against the gramtools-inferred PR unlocks further variation in one gene, DBLMSP2 (Fig.  S10), validating the use of a PR for discovering new variation. In DBLMSP, four samples remain at large distances of (4-10%) to the truth assemblies for all tools (Fig. S9). These sequences could also not be resolved by SAMtools and Cortex, suggesting long-read sequencing would be required to fully access the variation in these four samples.
Comparing gramtools genotyping with taking the best input sequence in the graph One advantage of using graphs for genotyping is that the inferred sample's path through the graph can be a recombinant of the sequences used to build the graph. Using our 14 validation samples, we measured to what extent gramtools can recover the closest possible sequence out of the input sequences used to construct the graph, and how often it found a recombinant that is better than this closest sequence (note that all input sequences exist as paths in the graph). The closest sequence for a gene in a sample was determined by mapping all 2,498 sequences used for building the gene's graph to the sample's truth assembly with bowtie2. The closest sequence was the one with the lowest edit distance (NM tag), provided it had mapping quality (MAPQ) greater than 20 (i.e. the probability the mapping was not unique was < 10 −2 according to bowtie2). We show the results in Fig.S11, plotting the distance of the best panel reference from the truth, and the distance of the gramtools-inferred sequence from the truth (where truth refers to the sequence in the truth assembly of the relevant sample). Sequences below the x=y dotted line indicate recombinants found by gramtools that are closer to the truth assembly than any input to the graph. In AMA1 and EBA175, the gramtools inference was as good or closer to the truth for 13/14 samples. For DBLMSP2, gramtools finds many (8) recombinants in the graph not present in the input sequences. For DBLMSP, in two samples the gramtools inference is much worse than the best input in the graph, indicating genotyping failure. We found this was due to a specific unresolved region where sites had null calls caused by low coverage. . The x-axis is the scaled edit distance (edit distance divided by the length of the gene) to the true sequence as determined from highquality assemblies, and the y-axis shows counts for the 14 evaluated samples. The two histograms in the left-hand column show how close the 3D7 reference genome and the gramtools-inferred personalised reference (PR) are to each sample. The two histograms in the middle column show results for SAMtools when using the 3D7 and PR as reference genome. The histograms in the righthand column are the equivalent plots for Cortex. The dotted lines and adjacent numbers show the mean scaled edit distance. . The x-axis is the scaled edit distance (edit distance divided by the length of the gene) to the true sequence as determined from highquality assemblies, and the y-axis shows counts for the 14 evaluated samples. The two histograms in the left-hand column show how close the 3D7 reference genome and the gramtools-inferred personalised reference (PR) are to each sample. The two histograms in the middle column show results for SAMtools when using the 3D7 and PR as reference genome. The histograms in the righthand column are the equivalent plots for Cortex. The dotted lines and adjacent numbers show the mean scaled edit distance. Fig. S11: gramtools genotyping finds recombinants between input haplotypes in the graph. Each dot shows one gene in one sample, with each gene in its own panel. The x-axis is the edit distance between the closest input sequence (out of all sequences used to construct the gramtools genome graph) and the truth assembly, divided by the length of the gene. The y-axis shows the edit distance between the gramtools-inferred sequence and the truth assembly. The dotted y=x line shows the expectation if gramtools always found in the graph the closest sequence used in graph construction. Points above this line highlight gramtools genotyping errors, as the closest input should have been found. Points below the line show gramtools finding recombinants in the graph that are closer than any input sequence used in graph construction.
14 Unified SNP and large deletion analysis in M. tuberculosis Mapping evaluated regions to truth assemblies

Evaluating variant calls using varifier
In the main section of the paper, we benchmark genotyping accuracy of genome graph tools for all variants combined inside a genomic region. Here, we show the genotyping accuracy of individual variants broken down by type (insertion, deletion, SNP) and size (1-10bp, 11-50bp, 51bp and above). To achieve this, we computed the precision and recall of individual variant calls using our tool varifier (see Methods in the main section). Fig. S19 shows the precision and recall for each variant category. Each tool is genotyping a fixed set of input variants incorporated into a genome graph, without an additional discovery step. Thus recall in Fig. S19 is measured relative to the variants originally present in the input VCF files and validated as true (as any variants absent from the genotyped graphs cannot be expected to be called by the tools). The performance differences between the tools are discussed in the main section of the paper.

Variant calls in genome graphs
The Variant Call Format (VCF) is a tab-delimited data format describing genetic variation occurring with respect to a linear reference genome. Here we define an extension to VCF for genome graphs, which are graph structures representing genetic variants occurring on potentially multiple reference genomes.
The format uses JavaScript Object Notation (JSON), a commonly used data format, and we thus call it JSON VCF or jVCF.

Requirements
jVCF assumes: • Variant sites have been defined on the genome graph • Variant sites can be contained in other variant sites. It is known which sites are contained in which others.
For the latter point, a site contained in another occurs in a given sequence background. Each sequence background must be labeled with a unique positive integer, called its haplogroup. See this toy graph for an example.
Briefly, JSON consists of objects and arrays: • An object is an unordered collection of key/value pairs enclosed in curly brackets ('{' and '}'). • An array is an ordered collection of values enclosed in square brackets ('[' and ']').
In an object, a key is a string. Strings are enclosed in double quotes ('"'). In objects and arrays, values can be: • a string • a number • one of the literals 'true', 'false' or 'null' • an array • an object 2

Example
Here is a toy genome graph: Site 0 In Figure 1, black nodes mark site entry and exit points. Each variant site is labeled by a unique positive integer ID, and each outgoing branch from the start of a variant site is labeled by a unique positive integer (its haplogroup).

AT
Assume the ploidy is 1, and we are genotyping a sample whose genotyped path in the graph is the red line. The jVCF of this genotyped sample would look like this: