Minos: variant adjudication and joint genotyping of cohorts of bacterial genomes

There are many short-read variant-calling tools, with different strengths and weaknesses. We present a tool, Minos, which combines outputs from arbitrary variant callers, increasing recall without loss of precision. We benchmark on 62 samples from three bacterial species and an outbreak of 385 Mycobacterium tuberculosis samples. Minos also enables joint genotyping; we demonstrate on a large (N=13k) M. tuberculosis cohort, building a map of non-synonymous SNPs and indels in a region where all such variants are assumed to cause rifampicin resistance. We quantify the correlation with phenotypic resistance and then replicate in a second cohort (N=10k). Supplementary Information The online version contains supplementary material available at (10.1186/s13059-022-02714-x).


Simulated data
Simulations were performed as a sanity check on performance of all benchmarked tools, and were not intended (and are not claimed) to be good representatives of real-world data. They do however provide a way to potentially reveal limitations of tools. Simulated data sets were generated from the M. tuberculosis genome H37Rv using simutator (https://github.com/ iqbal-lab-org/simutator). A mutated genome was generated for each type of variation, with variants distributed uniformly across the genome: a SNP every 200bp, an insertion of length 1 every kilobase, a deletion of length 1 every kilobase etc. Insertions and deletions were simulated up to a length of 10kbp. Paired Illumina reads were simulated at 50X depth from each mutated reference sequence using ART.
BayesTyper, GraphTyper, and Minos achieved near-perfect precision and recall on all data sets, showing that all tools are effective on clean data (Additional files 4,5: Tables S3,S4). All three tools had perfect precision when calling isolated SNPs, insertions and deletions. They also had perfect recall of SNPs, except for Minos with a recall of 99.68% using its default call filters. However, including all Minos calls results in perfect recall (and still perfect precision). This trend continued in the remainder of the simulation results. We found that the Minos default filters are negatively affected by the very clean simulated reads with idealised read depth distributions. As shown later, the filters are however effective on real data. The recall of SNPs and the full range of indels tested is shown in Supplementary figure 1. Again, all tools have near 100% recall, except filtered Minos calls which loses up to 0.9% recall on small (< 50bp) variants.
In addition to isolated SNPs, insertions, and deletions, four data sets were generated with a complex variants every kilobase, with each complex variant consisting of SNPs and/or indels of varying length in a window of size 10 to 50bp. The results were very similar across all tools (Additional file 5: Table S4).

Conflicting Joint Genotyping Calls
As noted in the main text, Minos resolves overlapping input variants, so that no two sites in its output contain reference positions in common. Furthermore, this means Minos cannot output two separate VCF records with incompatible genotype calls, in contrast to BayesTyper and GraphTyper. A typical example is shown next.
On the Walker 2013 outbreak data set, BayesTyper and GraphTyper reported the following two lines in their output VCF files for each sample (first five columns shown only for brevitythese are enough to describe the variants that are being called): BayesTyper and GraphTyper made inconsistent genotype calls of these variants. For example in sample ERR046906, BayesTyper and GraphTyper genotyped the deletion at 397272 as not present (genotype 0 or 0/0), saying in particular that this sample has a G at position 397275. However, both tools also genotyped the second variant line at 397275 as having a C (genotype 1 or 1/1). These genotype calls cannot both be correct. Minos reported the SNP and deletion alleles with the following single line in its VCF output file: NC_000962.3 397272 .

TCGGCGCC T,TCGCCGCC
and called the second alternative allele TCGCCGCC as correct -this is a SNP from G to C at position 397275. Manual inspection of the reads pileup confirmed that this is the correct call.

Example complex variant
It is common for insertions and deletions to result in variants with more than one correct representation in VCF format. Here we show one such example, found in real M. tuberculosis data, which is successfully evaluated by Varifier, but not by hap.py. Here a sample (top sequence) has an insertion compared to the M. tuberculosis H37Rv reference genome (bottom sequence), plus a G to C SNP at position 4359165.
Alignment 1: The corresponding VCF records are: This region is a tandem repeat, containing copies of the sequences GGGGTTCCCGGGGTGATC and GGGGTTCCCGGCGTGATC (which differ by one G/C SNP at position 12). The insertion and SNP can be placed in multiple positions (or, could be represented by two insertions only, which we have actually seen in real calls). Three more of the many possible alignments are shown below, with their matching VCF records.
In particular, this example highlights how confusion can arise from alternative representations of the same variants. Looking in isolation at these two calls from alignments 1 and 3, one might incorrectly conclude that they are incompatible:

Software versions and command lines
All software was run using Singularity containers, which were made from the definition file singularity.def in the github repository https://github.com/iqbal-lab-org/ minos-paper-benchmarking.
The main results, which used Cortex and SAMtools calls as input to BayesTyper, GraphTyper, and Minos, used a build of git commit dbbc62a9388fd30869406cc63c10e70c08d45a0b.

Simulated data
The simulated genomes were made using Simutator (https://github.com/iqbal-lab-org/ simutator) git commit 0218c8a5b37fd72eb4e5b2df4cba9f6118f96788. The command line was: Variant calling pipeline The pipeline described next is implemented in the minos-paper-benchmarking container as a single script that runs all stages on one sample. First, all the relevant genome index files must be made (for example BWA index). This needs to be run once on each mapping reference FASTA file: clockwork reference_prepare --outdir ref_dir mapping_reference.fasta to make a directory (in this example, called ref dir), which is used by the next command. The whole variant calling and evaluation pipeline was then run on one sample with the command minospb run_one_sample \ --truth_mask_bed truth_mask.bed \ --ref_mask_bed ref_mask.bed \ sample_name out_directory ref_dir \ reads_1.fastq.gz reads_2.fastq.gz The stages run by that script are described next. First, reads were trimmed using Trimmomatic version 0.36, with the options

Genome masks
An existing genome mask was used for the M. tuberculosis H37Rv reference genome, which is routinely used by Public Health England variant calling pipeline COMPASS.
For all other genomes, masks for each mapping and truth genome were generated using agreement between the reads and the genomes as follows. The whole process is wrapped in a single script, run with the command: minospb make_mask genome.fasta reads.1.fastq.gz reads.2.fastq.gz outdir Illumina reads expected to perfectly match the genome (ie reads that made the genome assembly sequence) were trimmed, mapped, and PCR duplicates removed using exactly the same method as described earlier for variant calling. The resulting BAM file is parsed using pysam. A position is included in the mask if the depth of reads matching the reference is less than 5, or if the percent of reads that match the reference is less than 90%. The output is a mask in BED file format, which was then used as input to Varifier (see next section).
Varifier varifier git commit ad3d6db3b851eb46357b0308e7496580e191700f was used, with the command line varifier vcf_eval --ref_mask ref_mask.bed \ --truth_mask truth_mask.bed --filter_pass PASS \ truth_ref.fasta mapping_ref.fasta to_evaluate.vcf outdir To evaluate all calls made by a tool in order to see the effect of the FILTER column in the VCF, the option --filter_pass was omitted.

Joint Genotyping
The per-sample VCF files were generated using the variant calling method described above, making calls from SAMtools and Cortex. For each sample, these were input into Minos to make a single VCF for each sample. This was run using the script described above minospb run one sample. These per-sample VCF files were used as input to the Minos joint genotyping pipeline, which is implemented using Nextflow. The Nextflow script and configuration file are in the nextflow/ directory of the Minos repository, called regnotype.nf and regenotype.config. On the Walker 2013 data set, the pipeline was run with the command: nextflow run \ -w /path/to/working/dir/ -with-singularity /path/to/minospg_image \ -c regenotype.config \ -profile medium \ regnotype.nf --ref_fasta /path/to/tb_ref.h37rv.fa \ --make_distance_matrix \ --manifest manifest.tsv \ --mask_bed_file tb.h37rv.mask.R00000039_repregions.bed \ --outdir output_directory where manifest.tsv is the required manifest file, containing for each sample its name, input VCF file, and BAM file of reads. The reads must be in a mapped, sorted, indexed BAM file for the pipeline. This enables it to split the reference into chunks and map reads to each chunk, which saves run time and memory. On the large CRyPTIC and Mykrobe data sets, the option -profile large was used, which allocates more memory and CPUs to tasks where necessary, and the option --make distance matrix was omitted.        The path marked in orange shows an allele combination that results in exactly the same sequence as the reference sequence. Minos prevents this situation from arising, so that reads cannot have the option of mapping to two different paths that are in fact the same sequence.