Rapid Core-Genome Alignment and Visualization for Thousands of Intraspecific Microbial Genomes

Though many microbial species or clades now have hundreds of sequenced genomes, existing whole-genome alignment methods do not efficiently handle comparisons on this scale. Here we present the Harvest suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial strains. Harvest includes Parsnp, a fast core-genome multi-aligner, and Gingr, a dynamic visual platform. Combined they provide interactive core-genome alignments, variant calls, recombination detection, and phylogenetic trees. Using simulated and real data we demonstrate that our approach exhibits unrivaled speed while maintaining the accuracy of existing methods. The Harvest suite is open-source and freely available from: http://github.com/marbl/harvest.

The current influx of microbial sequencing data necessitates methods for large-scale comparative genomics and shifts the focus towards scalability. Current microbial genome alignment methods focus on all-versus-all progressive alignment [31, 36] to detect subset relationships (i.e. gene gain/loss), but these methods are bounded at various steps by quadratic time complexity. This exponential growth in compute time prohibits comparisons involving thousands of genomes. Chan and Ragan [44] reiterated this point, emphasizing that current phylogenomic methods, such as multiple alignment, will not scale with the increasing number of genomes, and that "alignment-free" or exact alignment methods must be used to analyze such datasets.
However, such approaches do not come without compromising phylogenetic resolution [45].
Core-genome alignment is a subset of whole-genome alignment, focused on identifying the set of orthologous sequence conserved in all aligned genomes. In contrast to the exponential complexity of multiple alignment, core-genome alignment is inherently more scalable because it ignores subset relationships. In addition, the core genome contains essential genes that are often vertically inherited and most likely to have the strongest signal-to-noise ratio for inferring phylogeny. The most reliable variants for building such phylogenies are single-nucleotide polymorphisms (SNPs). Thus, core-genome SNP typing is currently the standard method for reconstructing large phylogenies of closely related microbes [46]. Currently, there are three paradigms for core-genome SNP typing based on read mapping, k-mer analyses, and whole-genome alignment.
Read-based methods have dominated the bioinformatics methods landscape since the invention of high-fidelity, short-read sequencing (50-300bp) [47]. This has made it very affordable to sequence, yet extremely challenging to produce finished genomes [48,49]. Thus, comparative genomics has turned to highly efficient and accurate read mapping algorithms to carry out assembly-free analyses, spawning many mapping tools [50][51][52][53] and variant callers [54][55][56] for detecting SNPs and short Indels. Readbased variant calling typically utilizes a finished reference genome and a sensitive read mapper (BWA [52], Smalt), variant caller (samtools/bcftools [56], GATK [54]), and variant filter (minimum mapping quality, core genomic regions). This method has been shown effective in practice [57] and does not rely on assembly. However, mapping requires the read data, which is not always available and can be orders of magnitude larger than the genomes themselves. In addition, mapping can be sensitive to contaminant, overlook structural variation, misalign low-complexity and repetitive sequence, and introduce systematic bias in phylogenetic reconstruction [58][59][60].
Exact alignment methods, often formulated as k-mer matching, can produce high precision results in a fraction of the time required for gapped alignment methods [61][62][63]. Spectral k-mer approaches have been used to estimate genome similarity [64], and k-mer based methods are commonly used to identify or cluster homologous genomic sequence [65,66]. Recently, k-mers have also been extended to SNP identification. kSNP [67] identifies odd-length k-mers between multiple samples that match at all but the central position. The matched k-mers are then mapped back to a reference genome to locate putative SNPs. Conveniently, this approach is suitable for both assembled genomes and read sets, but sensitivity is sacrificed for the improved efficiency of exact alignment [68].
Genome assembly [69][70][71][72][73][74][75][76][77], followed by whole-genome alignment [38,78,79], is the original method for variant detection between closely related bacterial genomes [80] and has been shown to perform well across multiple sequencing platforms [81]. In addition to SNPs, whole-genome alignment is able to reliably identify insertions and deletions (Indels) and other forms of structural variation. Thus, whole-genome alignment is the gold standard for comprehensive variant identification, but relies on highly accurate and continuous assemblies, which can be expensive to generate.
Lastly, and unlike reference mapping, whole-genome alignment is not easily parallelized or scaled to many genomes.
Specifically for the task of whole-genome SNP typing, the choice of read-or genomebased methods can often depend on data availability. For example, of the 24,000 bacterial genomes currently in NCBI RefSeq [82], only 55% have associated SRA read data and analysis of the remaining 45% requires genome-based methods.
Thankfully, recent advances in both sequencing technology and assembly algorithms are making microbial genomes more complete than ever before. Modern de Bruijn assemblers like SPAdes [83]are able to generate high-quality assemblies from short reads [3], and long read technologies have enabled the automated finishing of microbial genomes for under $1,000 [84]. With the number of publically available genomes currently doubling every 18 months [1], and genome quality improving with the arrival of new technologies, we set out to solve the problem of aligning thousands of closely-related whole genomes.

Rapid Core-Genome Alignment and Visualization
Here we present Parsnp and Gingr for the construction and interactive visualization of massive core-genome alignments. For alignment, Parsnp combines the advantages of both whole-genome alignment and read mapping. Like whole-genome alignment, Parsnp accurately aligns microbial genomes to identify both structural and point variations, but like read mapping, Parsnp scales to thousands of closely related genomes. To achieve this scalability, Parsnp is based on a suffix graph data structure for the rapid identification of maximal unique matches (MUMs), which serve as a common foundation to many pairwise [78,79,85] and multiple genome alignment tools [31][32][33][34][35][36]. Parsnp uses MUMs to both recruit similar genomes and anchor the multiple alignment. As input, Parsnp takes a directory of MultiFASTA files to be aligned; and as output, Parsnp produces a core-genome alignment, variant calls, and a SNP tree. These outputs can then be visually explored using Gingr. The details of Parsnp and Gingr are described below.

MUMi recruitment
Parsnp is designed for intraspecific alignments and requires input genomes to be highly similar (e.g. within the same subspecies group or >=97% average nucleotide identity). For novel genomes or an inaccurate taxonomy, which genomes meet this criterion is not always known. To automatically identify genomes suitable for alignment, Parsnp uses a recruitment strategy based on the MUMi distance [86]. Only genomes within a specified MUMi distance threshold are recruited into the full alignment. Note that because multi-MUMs are necessarily present in all genomes, the choice of a reference genome has no effect on the resulting alignment.

Multi-MUM search
Once built for the reference genome, all additional genomes are streamed through the CSG, enabling rapid, linear-time identification of MUMs shared across all genomes.
A divide-and-conquer algorithm, adapted from M-GCAT [35], recursively searches for smaller matches and iteratively refines the multi-MUMs. Next, locally collinear blocks (LCBs) of multi-MUMs are identified. These LCBs form the basis of the coregenome alignment.

Parallelized LCB alignment
The multi-MUMs within LCBs are used to anchor multiple alignments. Gaps between collinear multi-MUMs are aligned in parallel using MUSCLE [87]. To avoid the unnecessary overhead of reading and writing MultiFASTA alignment files, Parsnp makes direct library calls via a MUSCLE API. The MUSCLE library is packaged with Parsnp, but originally sourced from the Mauve code base [88]. As with Mauve, MUSCLE is used to compute an accurate gapped alignment between the match anchors. Though MUSCLE alignment can be computationally expensive, for highly similar genomes, the gaps between collinear multi-MUMs are typically very short (e.g. a single SNP column in the degenerate case).

SNP filtering and trees
The final Parsnp multiple alignment contains all SNP, Indel, and structural variation within the core genome. However, given their ubiquity in microbial genome analyses, Parsnp performs additional processing of the core-genome SNPs. First, all polymorphic columns in the multiple alignment are flagged to identify: (i) repetitive sequence, (ii) small LCB size, (iii) poor alignment quality, (iv) poor base quality, and (v) possible recombination. Alignment quality is determined by a threshold of the number of SNPs and Indels contained within a given window size. Base quality is optionally determined using FreeBayes [55] to measure read support and mixed alleles. Bases likely to have undergone recent recombination are identified using PhiPack [89]. Only columns passing a set of filters based on these criteria are considered reliable core-genome SNPs. The final set of core-genome SNPs is given to FastTree2 [90] for reconstruction of the whole-genome phylogeny.

Compressed alignment file
For simplicity and storage efficiency, the output of Parsnp includes a single binary file encoding the reference genome, annotations, alignment, variants, and tree. Thousandfold compression of the alignment is achieved by storing only the columns that contain variants. The full multiple alignment can be faithfully reconstructed from this reference-compressed representation on demand. Since Parsnp focuses on aligning only core blocks of relatively similar genomes, the number of variant columns tends to increase at a sub-linear rate as the number of genomes increases, resulting in huge space savings versus alternative multiple alignment formats. Conversion utilities are provided for importing/exporting common formats to/from the binary archive file, including: BED, GenBank, FASTA, MAF, Newick, VCF, and XMFA.

Interactive Visualization
Developed in tandem with Parsnp, the visualization tool Gingr allows for interactive exploration of trees and alignments. In addition to the compressed alignment format, Gingr accepts standard alignment formats and can serve as a general-purpose multiple alignment viewer. Uniquely, Gingr is capable of providing dynamic exploration of alignments comprising thousands of genomes and millions of alignment columns. It is the first tool of its kind capable of dynamically visualizing multiple alignments of this scale. The alignment can be seamlessly zoomed from a display of variant density (at the genome level) to a full representation of the multiple alignment (at the nucleotide level). For exploration of phyletic patterns, the alignment is simultaneously presented along with the core-genome SNP tree, annotations, and dynamic variant highlighting.
The tree can be zoomed by clade, or individual genomes selected to expand via a fisheye zoom. Structural variation across the genome can also be displayed using Sybil coloring [91], where a color gradient represents the location and orientation of each LCB with respect to the reference. This is useful for identifying structurally variant regions of the core.

Evaluation of performance
We evaluated Parsnp on three simulated datasets (derived from Escherichia coli K-12 W3110) and three real datasets (Streptococcus pneumoniae, Peptoclostridium difficile, Mycobacterium tuberculosis). Parsnp is compared below versus two wholegenome alignment methods (Mugsy, Mauve), a k-mer based method (kSNP), and two commonly used mapping pipelines (based on Smalt and BWA). The Smalt pipeline replicates the methods of the landmark Harris et al. paper [92] that has been adopted in many subsequent studies. The BWA pipeline is similar to the Smalt pipeline, but uses BWA for read mapping (Methods).

Simulated Escherichia coli W3110 dataset
To precisely measure the accuracy of multiple tools across varying levels of divergence, we computationally evolved the genome of E. coli K-12 W3110 at three different mutation rates: 0.00001 (low), 0.0001 (medium), and 0.001 (high) SNPs per site, per branch. An average of ten rearrangements were introduced, per genome. Each dataset comprises 32 simulated genomes, forming a perfect binary tree.
Approximately 65X coverage of Illumina MiSeq reads was simulated and assembled for each genome to create draft assemblies. For input, the whole-genome alignment programs were given the draft assemblies, and the mapping pipelines the raw reads.
Supplementary Figure 1 details the computational performance on the simulated datasets. Parsnp was the only method to finish in fewer than 10 minutes on the 32genome dataset, with the other methods requiring between 30 minutes to 10 hours. Table 1 gives the accuracy of each tool on each dataset. The tools were benchmarked using true-positive and false-positive rates compared to a known truth, which captures the full alignment accuracy. Figure 1 plots the performance of all tools averaged across all mutation rates.
The whole-genome alignment methods performed comparably across all three mutation rates (Figure 1, red squares), with Mauve exhibiting the highest sensitivity (97.42%) and Parsnp the highest precision (99.99%). Mugsy demonstrated slightly higher sensitivity than Parsnp but with lower precision. Mugsy's lower precision was traced to a single fumA paralog [93] misalignment that generated a high number of false-positive SNPs. All genome alignment methods were affected by misalignment of repeats and missing or low-quality bases in the assembly. Table 1. Core-genome SNP accuracy for simulated E. coli datasets. Data shown indicates performance metrics of the evaluated methods on the three simulated E. coli datasets (low, medium and high). Method: Tool used. (c) indicates aligner ran on closed genomes rather than draft assemblies. Descr: Paradigm employed by each method, WGA indicates whole-genome alignment, CGA indicates core genome alignment, KMER indicates k-mer based SNP calls, and MAP indicates read mapping. False positive (FP) and false negative (FN) counts for the three mutation rates (low, med, and high). TP: number of SNP calls that agreed with the truth. FP: number of SNP calls that are not in truth set. FN: number of truth SNP calls not detected. True positive rate TPR: TP/(TP+FN). False discovery rate FDR: FP/(TP+FP). A total of 1,299,178 SNPs were introduced into the 32genome dataset, across all three mutational rates. *Mugsy's lower precision was traced to a paralog misalignment that resulted in many false-positive SNPs.

Method
Descr Performance of the individual methods was also measured in terms of branch SNP and length error with respect to the true phylogeny ( Figure 2). These errors closely Parsnp on full genomes can be explained by a lack of an LCB extension method.
Mugsy was the most affected by draft genomes, going from best on closed genomes to demonstrating more false positives (Table 1) and LCB counts (

Comparison to read mapping methods
On average, mapping-based methods were as precise and 0.5-1% more sensitive than alignment of draft genomes (Figure 1, blue triangles). Smalt showed the highest sensitivity, while BWA was the most specific. The precision of the mapping approaches may be overestimated for this dataset due to the absence of non-core sequence that is known to confound mapping [94]. Parsnp was the only genome alignment method to match the precision of mapping, but with a slight reduction in sensitivity. However, when provided with finished genomes, the whole-genome alignment methods excel in both sensitivity and specificity compared to read mapping. Thus, the performance divide between whole-genome alignment and mapping is entirely due to assembly quality and completeness. Using short reads, both the mapping and assembly-based approaches suffer false-negatives due to ambiguous mappings or collapsed repeats, respectively. Exceeding 99% sensitivity for this test set requires either longer reads (for mapping) or complete genomes (for alignment) to accurately identify SNPs in the repetitive regions.

Comparison on 31 Streptococcus pneumoniae genomes
Parsnp was compared to whole-genome alignment methods using the 31-genome S.  (Table 3). In addition, Parsnp ran hundreds of times faster than the other methods, finishing this 31-way alignment in less than 60 seconds.

Peptoclostridium difficile outbreak in the UK
Parsnp and Gingr are particularly suited for outbreak analyses of infectious diseases.
To demonstrate this, we applied Parsnp to a recent P. difficile outbreak dataset [95].
To generate input suitable for Parsnp, we assembled all genomes using iMetAMOS [96]. It is important to note that this was a resequencing project not intended for assembly and represents a worst case for a core-genome alignment approach; reads ranged from 50 to 100bp in length and some genomes were sequenced without paired       all SNP calls intersected with MAQ SNPfilter, which discards any SNP with neighboring Indels ±3bp or surrounded by >3 SNPs within a 10bp window. To replicate this study using whole-genome alignment, we assembled all genomes from the raw reads using iMetAMOS and ran Parsnp on the resulting draft assemblies.

Runtime and storage analysis
To evaluate Parsnp's scalability, we profiled performance across six datasets ranging from 32 genomes to 10,000 genomes. Runtime was observed to increase linearly with additional genomes (Supplementary Figure 2), requiring a few minutes for the 32 genome E. coli dataset, 1.5 hours for the 826 genome P. difficile dataset, and a maximum of ~14 hours to align the 10,000 genome set on a 2.2 GHz, 32-core, 1TB RAM server (Supplementary Table 2). In addition, for the 32-genome simulated E.
coli datasets, Parsnp was 10-100 times faster than all other methods evaluated.
Maximum memory usage was 2GB for the 145Mbp E. coli dataset and 309GB for the 21Gbp S. pneumoniae dataset (Supplementary Table 2). Memory usage can be explicitly limited via a command-line parameter (--max-partition-size) but this results in increased runtime.
In addition to runtime efficiency, Parsnp requires much less storage than the other approaches due to its binary alignment format and the compressive effect of assembly. For the 32-genome E. coli dataset, Parsnp's output totals just 4.5MB, compared to 13GB required to store compressed FASTQ [103] and VCF [104] files and 149MB to store XMFA [38]. Storage reductions are amplified for larger datasets.
For example, the raw read data for the P. difficile dataset requires 1.4TB of storage (0.6TB compressed). Assembling these data reduces the total to 3.3GB by removing the redundancy of the reads. The XMFA alignment of these assemblies is 1.4GB, and reference-compressed binary format occupies just 15MB. This equates to roughly a 100,000X (lossy) compression factor from raw reads to compressed archive, requiring only 0.08 bits per base to store the full core-genome alignment plus other related information, which is competitive with related techniques like CRAM [105]. As outbreak studies continue to expand in scale, whole-genome assembly and alignment presents a sustainable alternative to the current mapping-based strategies.

Discussion
Parsnp is orders of magnitude faster than current methods for whole-genome alignment and SNP typing, but it is not without limitations. Parsnp represents a compromise between whole-genome alignment and read mapping. Compared to whole-genome aligners, Parsnp is less flexible because it is designed to conservatively align the core genome and is less sensitive as a result. Compared to read mapping, Parsnp is less robust and requires high-quality assemblies to maximize sensitivity.
Thus, the right tool depends on the data and task at hand.
Core-genome alignment and phylogeny reconstruction are critical to microbial forensics and modern epidemiology. When finished or high-quality genomes are available, Parsnp is both efficient and accurate for these tasks. In addition, even for fragmented draft assemblies, Parsnp exhibits a favorable compromise between sensitivity and specificity. Surprisingly, Parsnp matched the specificity of the mapping-based approaches on the simulated datasets. However, multiplexed shortread sequencing followed by mapping still remains the most economical approach for sensitive analysis of large strain collections. Thus, Parsnp is recommended for analyzing high-quality assemblies or when raw read data is not available.
Assembled genomes have a number of advantages over read data-primarily compression and convenience. Storing, sharing, and analyzing raw read datasets incurs significant overhead from the redundancy in sequencing (often 100 fold), and this burden nearly resulted in the closure of the NCBI SRA database [106]. Adding additional orders of magnitude to the already exponential growth of sequencing data is not sustainable. Instead, information in the reads not currently stored in common assembly formats (e.g. allelic variants) should be propagated to the assembled representation, forming a compressed, but nearly lossless format. In this way, genomes could be shared in their native, assembled format, saving both space and time of analysis. Here, we have taken a small step in that direction by identifying low quality bases, as computed by FreeBayes [55]. This allows filtering of low quality and mixed alleles and improves the specificity of the assembly-based approaches.
However, more comprehensive, graph-based formats are needed to capture the full population information contained in the raw reads.
Parsnp was also built around the observation that high-quality, finished genome sequences have become more common as sequencing technology and assembly algorithms continue to improve. New technologies, such as PacBio SMRT sequencing [107] are enabling the generation of reference-grade sequences at extremely reduced costs. This presents another opportunity for Parsnp-the construction and maintenance of core genomes and trees for clinically important species. With well defined reference cores, outbreaks could be accurately typed in real-time by mapping sequences directly to the tree using phylogenetically aware methods such as pplacer [108] or PAGAN [109]. Such a phylogenetic approach would be preferable to alternative typing schemes based on loosely defined notions of similarity, such as pulse-field electrophoresis (PFGE) [110] and multi-locus sequence typing (MLST) [111].

Conclusion
Parsnp offers a highly efficient method for aligning the core genome of thousands of closely related species, and Gingr provides a flexible, interactive visualization tool for the exploration of huge trees and alignments. Together, they enable analyses not previously possible with whole-genome aligners. We have demonstrated that Parsnp provides highly specific variant calls, even for highly fragmented draft genomes, and can efficiently reconstruct recent outbreak analyses including hundreds of whole genomes. Future improvements in genome assembly quality and formats will enable comprehensive cataloging of microbial population variation, including both point and structural mutations, using genome alignment methods such as Parsnp. FastTree v2 [90] was used to reconstruct phylogenies using default parameters.

E. coli K-12 W3110 simulated dataset
The complete genome of E. coli K-12 W3110 [116], was downloaded from RefSeq (AC_000091). This genome was used as the ancestral genome and evolution was simulated along a balanced tree for three evolutionary rates using the Seq-Gen package [117] with parameters mHKY -t4.0 -l4646332 -n1 -k1 and providing the corresponding binary tree evolved at three evolutionary rates: 0.00001, 0.0001, and 0.001 SNPs per site, per branch. This corresponds to a minimum percent identity of approximately 99%, 99.9%, and 99.99% between the two most divergent genomes, respectively, reflecting the variation seen in typical outbreak analyses. No small (< 5bp) or large Indels were introduced, but an average of ten 1Kbp rearrangements (inversions and translocations) were added, per genome, using a custom script. Paired reads were simulated to model current MiSeq lengths (2x150bp) and error rates (1%).
Moderate coverage, two million PE reads (64X coverage), was simulated for each of the 32 samples using wgsim (default parameters, no Indels), from samtools package version 0.1.17 [56].
Two of the simulated read sets were independently run through iMetAMOS [96] to automatically determine the best assembler. The consensus pick across both datasets was SPAdes version 3.0 [83], which was subsequently run on the remaining 30 simulated read sets using default parameters. The final contigs and scaffolds files were used as input to the genome alignment methods. For mapping methods, the raw simulated reads were used. For accuracy comparisons, Indels were ignored and called SNPs were required to be unambiguously aligned across all 32 genomes (i.e. not part of a subset relationship; SNPs present but part of a subset relationship were ignored).

S. pneumoniae dataset
A full listing of accession numbers for the 31-genome S. pneumoniae dataset is described in [36]. For scalability testing, Streptococcus pneumoniae TIGR4 (NC_003028.3) was used to create a pseudo-outbreak clade involving 10,000 genomes evolved along a star phylogeny with on average 10 SNPs per genome.

M. tuberculosis dataset
We downloaded and assembled sequencing data from a recently published study of M. tuberculosis [101]. 225 runs corresponding to project ERP001731 were downloaded from NCBI SRA and assembled using the iMetAMOS ensemble of SPAdes, MaSuRCA, and Velvet. The iMetAMOS assembly for each sample can be replicated with the following commands, which will automatically download the data for RUN_ID directly from SRA: The M. tuberculosis dataset included a mix of single and paired-end runs with a sequence length ranging from 51-108bp. The average k-mer size selected for unpaired data was 26, resulting in an average of 660 contigs and an N50 size of 17Kbp. For paired-end data, the average selected k-mer was 35, resulting in an average of 333 contigs and an N50 size of 43Kbp. Assemblies containing more than 2,000 contigs, or 1.5X larger/smaller than the reference genome, were removed. The final dataset was reduced to 171 genomes, limited to labeled strains that could be confidently matched to the strains used in the Comas et al. study for SNP and phylogenetic comparison.
We downloaded and assembled sequencing data from a recently published study of P.
difficile [95]. 825 runs corresponding to project ERP003850 were downloaded from NCBI SRA [88] and assembled within iMetAMOS this time only using SPAdes, which was identified as the best performer on the M. tuberculosis dataset. The iMetAMOS assembly for each sample can be replicated with the following commands, which will download the data for RUN_ID directly from SRA: > initPipeline -d asmPD -W iMetAMOS -m RUN_ID -i 200:800 > runPipeline -d asmPD -a spades -p 16 The P. difficile dataset included paired-end runs with a sequence length ranging from 51-100bp. SPAdes was selected as the assembler and run with k-mer sizes of 21, 33, 55, and 77. The assemblies had an average of 660 contigs and an N50 size of 138Kbp.
Assemblies containing more than 2,000 contigs, or 1.5X larger/smaller than the reference genome, were removed.

Data and software availability
All data, supplementary files, assemblies, packaged software binaries and scripts described in the manuscript are available from [112]. Source code of the described software, including Parsnp and Gingr, is available for download from: http://github.com/marbl/harvest

Competing interests
The authors declare that they have no competing interests.