Strategy
FORGe works in cooperation with a variant-aware read aligner (graph aligner) such as HISAT2 [25]. Consider the alignment process as divided into offline (index building) and online (alignment) stages. FORGe operates in the offline stage. Specifically, FORGe takes a reference genome (FASTA format) and a catalog of variants and their frequencies in the population (Variant Call Format). FORGe can also use phasing information when provided in the VCF. FORGe then uses a mathematical model to score each variant according to its expected positive and negative impacts on alignment accuracy and computational overhead. The model considers factors such as the variant’s frequency in a population, its proximity to other variants, and how its inclusion affects the repetitiveness of the graph genome. Using these scores—together with a parameter for the overall percentage or number of variants to include—FORGe outputs the top-scoring subset of variants, which can then be fed to the index-building component of a graph alignment tool like HISAT2’s hisat2-build program. In the online stage, the aligner uses this FORGe-customized index to align the sequencing reads.
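The offline selection step can be sketched as follows, using a stand-in score equal to allele frequency (FORGe's actual models also weigh variant proximity and reference repetitiveness, as described below); the variant records are hypothetical:

```python
# Sketch of the offline stage: score variants, keep the top fraction,
# and emit the subset that would be passed to hisat2-build.
# Scoring here is a stand-in (allele frequency only); FORGe's real models
# also consider proximity to other variants and k-mer repetitiveness.

def select_top_variants(variants, fraction):
    """variants: list of (id, allele_frequency); fraction in [0, 1]."""
    ranked = sorted(variants, key=lambda v: v[1], reverse=True)
    n_keep = int(round(len(ranked) * fraction))
    return ranked[:n_keep]

catalog = [("rs1", 0.42), ("rs2", 0.03), ("rs3", 0.31), ("rs4", 0.18)]
subset = select_top_variants(catalog, 0.5)   # keep the top 50%
print([v[0] for v in subset])                # the two highest-frequency variants
```

In the real pipeline the retained subset would be written back out as a VCF and handed to the graph aligner's index builder.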
Simulation
We used Mason 0.1.2 [20] to simulate reads (details in Additional file 1: Note S2). Mason simulates sequencing errors and base quality values, and it annotates each read with information about its true point of origin. We disabled Mason’s facility for adding genetic variants, since we simulate from already-individualized references. We classify an alignment as correct if its aligned position in the reference is within 10 nt of the true point of origin. If the aligner reports several alignments for a read, we consider only the primary alignment—of which there is exactly one per aligned read, usually the one with alignment score greater than or equal to all the others—when determining correctness.
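The correctness criterion is simple to state in code; a minimal sketch, with illustrative coordinates rather than Mason's actual annotation format:

```python
# Correctness criterion for simulated reads: the primary alignment's
# position must fall within 10 nt of the read's annotated true origin,
# on the same chromosome.

TOLERANCE = 10  # nt

def is_correct(aligned_chrom, aligned_pos, true_chrom, true_pos):
    return (aligned_chrom == true_chrom
            and abs(aligned_pos - true_pos) <= TOLERANCE)

print(is_correct("chr9", 1005, "chr9", 1000))  # True: off by 5 nt
print(is_correct("chr9", 1011, "chr9", 1000))  # False: off by 11 nt
```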
Alignment
We tested FORGe with two read alignment strategies capable of including variants in the reference: HISAT2 [25] and the Enhanced Reference Genome (ERG) [43]. HISAT2 is a practical graph aligner that we hypothesized would benefit from careful selection of genetic variants to include. The ERG is simple and compatible with linear aligners like Bowtie. We use ERG only with short unpaired reads (25 nt) to test the hypothesis that the seed-finding step of an aligner can benefit from including FORGe-selected variants. Adapting the ERG approach to paired-end alignment is probably not practical (see the “Discussion” section).
In its offline stage, HISAT2 takes a linear reference genome and a VCF file with single-nucleotide variants and indels. HISAT2 uses GCSA indexing [49] to build a graph-genome index. The resulting graph is the generating graph for all combinations of reference (REF) and included alternate (ALT) alleles. HISAT2 also provides software that, starting from a VCF file (or the UCSC “Common SNPs” track, derived from dbSNP [46]), selects a subset of variants to include. It filters in two ways. First, it excludes variants with allele frequency under 10%. Second, where variants are densely packed, it imposes artificial haplotype constraints to avoid the exponential blowup that results from considering all combinations of REF and ALT alleles. We call this the HISAT2 auto method.
We also tested FORGe with our implementation of the ERG [43]. ERG’s offline phase starts with a linear reference genome and a variant file. It builds an augmented reference genome by adding enhanced segments: reference substrings that include ALTs and flanking context. The amount of context depends on a user-specified window size, r, which typically equals the maximum read length. When n variants co-occur in a window, 2^n − 1 enhanced segments are added to cover all combinations of ALT and REF alleles. The original ERG study limited growth by considering only the leftmost k variants per length-r window, with k=5 in practice. We use a variation on this limit: if a window contains more than k variants, we consider (a) the leftmost variant, and (b) the k−1 other variants with highest allele frequency according to the input VCF. Including the leftmost variant guarantees that each variant has its ALT included in at least one of the overlapping enhanced segments. We also set the default limit higher (k=15). While k is configurable, we used the default in all experiments here. After adding enhanced segments to the reference, we indexed it with Bowtie [27]. In the online stage, we used Bowtie to align to the enhanced reference. Details on our ERG implementation are in Additional file 1: Note S3.
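The segment-count arithmetic and our variant of the k-limit can be sketched as follows; the variant tuples and window contents are hypothetical:

```python
# Sketch of ERG-style growth control. For n variants in a window, the
# 2**n - 1 non-all-REF allele combinations each yield one enhanced
# segment. The k-limit follows the rule described in the text: keep the
# leftmost variant plus the k-1 remaining variants of highest frequency.

def limit_variants(window_variants, k):
    """window_variants: list of (pos, allele_frequency), sorted by pos."""
    if len(window_variants) <= k:
        return window_variants
    leftmost, rest = window_variants[0], window_variants[1:]
    rest_top = sorted(rest, key=lambda v: v[1], reverse=True)[:k - 1]
    return sorted([leftmost] + rest_top)

def count_enhanced_segments(n):
    # every combination of REF/ALT alleles except the all-REF one
    return 2 ** n - 1

variants = [(100, 0.05), (110, 0.40), (115, 0.02), (120, 0.30)]
kept = limit_variants(variants, 3)
print(kept)                                # leftmost plus two highest-frequency
print(count_enhanced_segments(len(kept)))  # 7 enhanced segments
```

Note how keeping the leftmost variant, even when it is rare (frequency 0.05 here), preserves the guarantee that every variant appears in at least one enhanced segment.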
In all experiments, we ran HISAT2 with the -k 10, --no-spliced-alignment, and --no-temp-splicesite options. In the ERG experiments, we ran Bowtie with the -v 1 option to allow alignments with up to one mismatch. Note that HISAT2 is able to find alignments with mismatches, insertions, or deletions, whereas Bowtie can only find alignments with mismatches. In all cases, we used Python’s resource module to measure peak resident memory usage and the Linux time utility to measure real (“wall clock”) running times. All tools were run using a single thread.
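Peak resident memory in Python comes from getrusage in the standard resource module; a minimal sketch, assuming a Linux or macOS host (the units of ru_maxrss differ between the two):

```python
import resource
import sys

# Peak resident set size of the current process, via getrusage.
# ru_maxrss is reported in kilobytes on Linux but bytes on macOS,
# hence the platform-dependent scaling.
def peak_rss_mb():
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    scale = 1 if sys.platform == "darwin" else 1024  # bytes vs. kilobytes
    return rss * scale / (1024 * 1024)

print(f"peak RSS: {peak_rss_mb():.1f} MB")
```

To measure an external tool rather than the Python process itself, the same statistic is available for child processes via RUSAGE_CHILDREN after a subprocess call completes.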
Variant models
As detailed in the “Methods” section, FORGe has two main models for ranking and selecting variants to include in the reference. First is Population Coverage (Pop Cov), which scores variants according to allele frequency. Second is Hybrid, which weighs both a variant’s allele frequency and the degree to which its addition would make the reference more repetitive. Additionally, we evaluated versions of these two models enhanced with a blowup avoidance strategy that, at variant adding time, dynamically down-weights candidates that are close to already-added variants. These versions are called Pop Cov+ and Hybrid+. All of these strategies are detailed in the “Methods” section.
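A toy version of the blowup-avoidance idea behind the "+" models, with an illustrative window size and penalty that are not FORGe's actual weighting:

```python
# Sketch of blowup avoidance: select variants greedily by score, and at
# each step down-weight remaining candidates that lie close to the
# variant just added. WINDOW and PENALTY are illustrative values.

WINDOW = 100   # nt; down-weight candidates within this distance
PENALTY = 0.5  # multiplicative score penalty per nearby added variant

def select_with_avoidance(variants, n_select):
    """variants: list of (pos, score); returns sorted positions chosen."""
    remaining = dict(variants)          # pos -> current (dynamic) score
    added = []
    for _ in range(min(n_select, len(remaining))):
        best = max(remaining, key=remaining.get)
        added.append(best)
        del remaining[best]
        for pos in remaining:
            if abs(pos - best) <= WINDOW:
                remaining[pos] *= PENALTY
    return sorted(added)

dense = [(1000, 0.9), (1020, 0.8), (5000, 0.5)]
print(select_with_avoidance(dense, 2))  # the distant variant beats the close neighbor
```

After the variant at position 1000 is chosen, its neighbor at 1020 is penalized (0.8 → 0.4), so the isolated variant at 5000 is selected second despite its lower initial score, which is exactly the clustering-avoidance behavior intended.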
Chromosome 9 simulation
We tested FORGe in a series of simulation experiments. We used human chromosome 9 from the GRCh37 assembly [6]. GRCh37 was chosen to match the coordinates for the official 1000 Genomes Project Phase-3 variants [3]. We simulated sequencing reads from chromosome 9 of NA12878, a female from the CEPH (Utah residents with Northern and Western European ancestry) group studied in the 1000 Genomes Project. Specifically, we generated 10 million unpaired Illumina-like reads from each haplotype of NA12878 for a total of 20 million reads. Each read comes from one of the two haplotypes. We created a VCF file containing all single-nucleotide variants (SNVs) appearing in chromosome 9 in at least one 1000-Genomes individual, excluding NA12878 and family members. The resulting file contained 3.4 million SNVs. Details on how this set of SNVs was obtained are presented in Additional file 1: Note S4. We used the Pop Cov, Hybrid, Pop Cov+, and Hybrid+ models to score the 3.4 M SNVs. The Hybrid and Hybrid+ models used phasing information, whereas the Pop Cov and Pop Cov+ models did not (explained in Methods). We compiled subsets of SNVs consisting of the top-scoring 0%, 2%, 4%, 6%, 8%, 10%, and 15%, and from 20% up to 100% in 10-point increments.
HISAT2
Figure 1 shows alignment rate and accuracy when using HISAT2 to align our simulated 100 nt reads to the genome indexes created with hisat2-build. The leftmost point (or in the case of Fig. 1c, the point labeled 0%) corresponds to a HISAT2 index with no SNVs added, i.e., a linear reference genome. The diamond labeled Major Allele corresponds to a linear reference with all major alleles; i.e., with every SNV set to the allele that was most frequent among CEU individuals in the filtered callset. The diamond labeled HISAT2 auto corresponds to the pruned set obtained by running HISAT2’s scripts. The diamond labeled Personalized shows results when aligning to a personalized NA12878 genome with all non-reference homozygous (HOM) alleles replaced by their ALT versions and all heterozygous (HET) SNVs added as variants, so that neither REF nor ALT are penalized at alignment time. This is not a realistic scenario, but helpful for assessing how close the tested methods come to the personalized-genome ideal. Plotted lines show results obtained when adding progressively larger subsets of SNVs to the graph genome, prioritized by model score.
Figure 1a, b shows alignment rate and fraction of alignments that are correct (henceforth “correctness”) as a function of the number of SNVs included in the genome. For all models except Hybrid+, peak alignment rate and correctness occur in the 8–12% range of SNVs included. All the FORGe models at their peak achieve higher alignment rate and correctness than the major-allele and HISAT2 methods. When greater fractions of variants are included—more than around 12%—alignment rate and correctness generally decrease. Correctness eventually decreases to a level only somewhat higher than that achieved by the linear reference, showing that alignment suffers when too many variants are included. Figure 1d, e is similar to a and b but shows alignment rate and correctness as a function of HISAT2’s memory footprint at alignment time. While FORGe’s models at their peak have a roughly 50% larger memory footprint than the linear references (both major-allele and reference-allele), they use roughly half the memory of the “HISAT2 auto” method.
Figure 1c plots a point or a parametric curve for each indexing strategy and model. The vertical axis is the fraction of reads (not alignments) that aligned correctly, and the horizontal axis is the fraction of reads that aligned incorrectly. Notable points on the curves are labeled with the fraction of SNVs included. Diamonds mark points on the curves with maximal y−x, where y is fraction correct and x is fraction incorrect. This is a combined measure for alignment rate and accuracy, and maximal values are reached in the 8–10% range of SNVs included (except Hybrid+, which peaked at 30%). The best-performing points are superior to (above and to the left of) the linear-genome methods, the “HISAT2 auto” method, and the genome obtained by adding all of the SNVs (labeled 100%). The best-performing graph genomes come much closer to the personalized-genome ideal than the other methods.
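Selecting the diamond points amounts to maximizing y − x over the candidate SNV fractions; a sketch with fabricated counts:

```python
# The y - x summary used to mark diamonds in Fig. 1c: fraction of reads
# aligned correctly minus fraction aligned incorrectly, maximized over
# the candidate SNV fractions. Counts below are made up for illustration.

def y_minus_x(n_correct, n_incorrect, n_reads):
    return (n_correct - n_incorrect) / n_reads

curve = {          # SNV fraction -> (correct, incorrect) out of 1000 reads
    0.00: (900, 40),
    0.08: (930, 35),
    0.30: (910, 50),
}
best = max(curve, key=lambda f: y_minus_x(*curve[f], 1000))
print(best)  # the fraction with maximal y - x
```

Because y and x are both fractions of all reads (aligned or not), unaligned reads lower y without raising x, so the measure rewards alignment rate and accuracy jointly.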
It is notable that the alignment rate curves in Fig. 1a, b, d and e eventually trend downward. Like most read aligners, HISAT2 uses heuristics to limit the effort spent on reads that align repetitively to the reference genome. HISAT2 is unusual in that when a read has too many repetitive alignments, it will abort and leave the read unaligned. Bowtie does not have this heuristic; rather, Bowtie chooses one best-scoring alignment to report even when the read has many repetitive alignments. Because of this, HISAT2’s alignment rate decreases as more variants are included and the genome becomes more repetitive.
A known drawback of graph aligners is that accuracy and overhead can suffer when many variants co-occur in a small window of the genome. To measure the impact this has on FORGe’s models, we also plotted results using blowup avoiding versions of the Pop Cov and Hybrid models (Fig. 1, dotted lines), called Pop Cov+ and Hybrid+. These versions will, when selecting variants to add, deprioritize variants that are near already-added variants. We observed that blowup avoidance had a minimal impact on the shape of the Pop Cov curve; e.g., Fig. 1d, e shows the solid and dotted lines for Pop Cov on top of each other. Notably, blowup avoidance did cause the alignment memory to increase more slowly with respect to the number of added variants for the Pop Cov ranking (Fig. 1f). For the Hybrid model, blowup avoidance did not change the relationship between memory footprint and number of variants added (Fig. 1f) and had an adverse effect on alignment rate and correctness. This is likely because the Hybrid model already takes clustered variants into account in its k-mer counts.
We repeated these experiments for paired-end reads (Additional file 1: Figure S3) and the results closely followed those in Fig. 1. Alignment rate and accuracy both increased when using paired-end reads, since an accurate alignment for one end can “rescue” the other in the presence of ambiguity. Peak accuracy (maximal y−x) was achieved at the same SNV fraction except in the case of the Hybrid ranking, which peaked at 15% rather than at 10%.
We also repeated these experiments for reads simulated from Yoruban (YRI) individual NA19238, also sequenced in the 1000 Genomes Project (Additional file 1: Figure S4). As we did for NA12878, we excluded variant calls for NA19238 and family members before providing variants to the model for scoring. These results also closely followed those in Fig. 1, with accuracy and recall peaking at a somewhat higher percentage of variants included (15% for YRI compared to 8–10% for CEU), likely due to YRI’s greater divergence from the reference. We return to this in the “Discussion” section.
Finally, we repeated the unpaired NA12878 experiment including both SNVs and indels in the FORGe analysis (Additional file 1: Figure S5). Whereas previous experiments modeled and scored 3.4 M SNVs, here we modeled and scored 3.4 M SNVs and 131k indels, composed of 49k insertions ranging in length up to 411 nt and 82k deletions up to 92 nt. Given these variant scores, we selected top-scoring fractions, built indexes, simulated reads from NA12878 (both SNVs and indels included) and performed alignments as before. When assessing correctness of the resulting read alignments, we took coordinate shifts due to indels into account. Overall the results are similar to those in Fig. 1. While there is a slight drop in peak alignment and correctness rate, the rates varied over a wider range of percentages relative to the SNV-only experiment. Maximal y−x occurred at slightly higher variant fractions relative to the SNV-only experiment: 10% for Pop Cov and Pop Cov+, 15% for Hybrid and 30% for Hybrid+.
Enhanced Reference Genome
Figure 2 shows alignment rate and correctness when using Bowtie [27] to align simulated 25 nt reads to enhanced references constructed with the ERG method [43]. We used shorter reads and configured Bowtie to find alignments with up to 1 mismatch (-v 1) to mimic the seed alignment step of seed-and-extend aligners.
Unlike HISAT2, Bowtie always reports an alignment if one is found, regardless of how repetitively the read aligns. Consequently, the alignment rate shown in Fig. 2a and d strictly increases as variants are added to the graph. Apart from that, the results reinforce those from Fig. 1. Peak correctness occurs at a relatively small fraction of SNVs (6–20%). As more variants are added, correctness eventually decreases, though the Hybrid ranking does not suffer this drop until over 70% of SNVs are included. The alignment-time memory footprint of the best-performing FORGe indexes is higher than that of the linear reference; e.g., including the top 6% of Pop Cov+-scored SNVs increases the footprint 29%, from 127.9 MB to 165.0 MB. But it is a fraction of the size of the index when 100% of variants are included (1.87 GB). Blowup avoidance (Fig. 2, dotted lines) had a somewhat minor effect on alignment rate and correctness for Pop Cov, and a clear negative effect for Hybrid. On the other hand, it slowed the rate of index growth for both models at low and intermediate fractions of SNVs (Fig. 2f).
Stratification by variant density, variant rarity, and repetitiveness
Figure 1c showed that when we move from 0 to 8% of variants included in the augmented reference, the number of correct alignments increases by about 0.4 percentage points (as a fraction of reads) and the number of incorrect decreases by about 0.1 points. Though these may seem like small differences, in a study with 1.2 billion reads—approximately the number of unpaired 100 nt reads required to cover the human genome to 40-fold average depth—this would yield about 4.8 M more correctly aligned reads and 1.2 M fewer incorrectly aligned.
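The extrapolated counts follow directly; integer arithmetic as a sanity check:

```python
# Whole-study extrapolation: a 0.4 percentage-point gain in correct
# alignments and a 0.1-point drop in incorrect alignments, applied to
# 1.2 billion reads (~40x coverage of GRCh37 with 100 nt reads).
n_reads = 1_200_000_000
more_correct = n_reads * 4 // 1000     # +0.4 percentage points
fewer_incorrect = n_reads * 1 // 1000  # -0.1 percentage points
print(f"{more_correct / 1e6:.1f} M more correct, "
      f"{fewer_incorrect / 1e6:.1f} M fewer incorrect")
```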
Still, we hypothesized that certain read subsets might be affected more dramatically by the inclusion of variants. To this end, we measured alignment rate and correctness when we varied the number of alternate alleles overlapped by a read (Fig. 3a–c), whether the alternate allele was common or rare (Fig. 3d–f), and what kind of genomic region or repeat the read originated from (Fig. 3g–i). The measurements studied here are the same as those presented in Fig. 1, but filtered as described below.
Figure 3a–c shows alignment rate and correctness stratified by the number of non-reference SNVs overlapped by a read. To obtain these subsets, we first removed reads originating from reference-genome regions deemed repetitive by DangerTrack [13] (score over 250). We did this after finding that these regions had a combination of low SNV density and repetitive content that caused the 0-SNV stratum to behave very differently from the others. Reads containing 1 or more SNVs undergo a rapid increase in alignment rate and correctness from 0 to 10% of SNVs. Beyond 10%, all strata experience a slow decrease in alignment rate and correctness up to 100% of SNVs added. The 0-SNV stratum has decreasing alignment rate and correctness across the whole range, as expected since the addition of variants cannot help (since the reads lack alternate alleles) but can harm alignment by increasing the repetitiveness of the reference. Strata with more SNVs experience a more dramatic rising-and-falling pattern; for the 3-SNV stratum, alignment rate varies from about 80–98%. While curves for the various strata have different shapes, all peak at a relatively low SNV fraction: 20% or lower.
Figure 3d–f shows alignment rate and correctness for reads containing a single rare SNV allele (1000 Genomes frequency <0.5) versus reads containing a single common SNV allele (≥0.5). In both cases, we considered only reads with a single non-reference allele. Rare-SNV reads peak lower and at a higher SNV fraction than common-SNV reads for both alignment rate and correctness (Fig. 3d–f). This is expected, since the Pop Cov model prioritizes common over rare SNVs. In other words, by the time a rare variant is added, many common variants have already been added, making the genome more repetitive.
Figure 3g–i shows alignment rate and correctness for reads stratified by feature of origin. We analyzed reads originating from (a) RepeatMasker-annotated repetitive regions (http://www.repeatmasker.org), (b) RepeatMasker-annotated “Alu” repeats, (c) regions captured by the Nextera exome sequencing protocol, and (d) all reads.
Reads from repetitive regions generally had lower alignment rate and correctness compared to all reads. As before, alignment rate and correctness curves peaked at low SNV fractions: 10% or lower. Reads from more repetitive features were more sensitive to the number of variants included in the reference, as evidenced by the vertical spans of the curves.
In a related experiment, we examined the graph genome’s effect specifically on the hypervariable MHC region. We simulated reads from NA12878 Chromosome 6 and used HISAT2 to align to both a linear and a graph genome augmented with the top-scoring 10% of SNVs. We visualized the read-alignment pileup in the hypervariable MHC region using IGV [51] (Additional file 1: Figure S6). Qualitatively, the pileup for the augmented reference looks superior to the pileup for the linear reference, with more coverage in variant-dense regions and more even overall coverage.
Ethnicity specificity
We also studied how ethnicity-specific augmented references, advocated in other studies [2, 33, 47], can improve alignment. We used FORGe to select variants from two lists: one with variants drawn from and scored with respect to the overall 1000-Genomes phase-3 callset, and another drawn from and scored for just the CEU individuals. In both cases, variants private to NA12878 and family members were excluded and reads were simulated from NA12878.
Figure 4 shows alignment rate and correctness when aligning to CEU-specific and pan-ethnic references. As expected, the CEU-specific reference yielded higher alignment rate and correctness. CEU-specific curves also peaked at lower numbers of SNVs compared to pan-ethnic. However, the differences were only a few hundredths of a percentage point and covered only a small fraction of the remaining distance to the ideal point. Looking at this another way, if we extrapolate the results to a whole-genome DNA sequencing experiment with 40-fold average coverage, around 250,000 alignments would be affected. We return to these small differences in the “Discussion” section.
Whole human genome
Simulated reads
To show our methods generalize to whole genomes, we repeated experiments like those presented in Fig. 1 using the full GRCh37 reference. We gathered 80.0 million SNVs from the Phase-3 callset of the 1000 Genomes Project [3]. We used FORGe’s Pop Cov+ model to score SNVs and compiled subsets consisting of the top-scoring 2%, 4%, 6%, 8%, 10%, and 15%, and from 20% up to 100% in 10-point increments. We built graph-genome indexes for each using HISAT2.
We used the Pop Cov+ model because the others required excessive time and/or memory; specifically, the Pop Cov model (without blowup avoidance) produced a set of variants that HISAT2 was unable to index in a practical time and space budget (Additional file 1: Note S5) and the Hybrid and Hybrid+ models required excessive time for the step that generates the FASTA file for G∗ due to exponential blowup (Additional file 1: Note S6).
Figure 5a, b plots HISAT2 alignment rate and correctness as a function of the SNV fraction. We aligned 20 million 100 nt unpaired reads simulated from NA12878. We omitted NA12878 and family members from variant selection. Results using the ideal personalized index are also shown for comparison. Maximal y−x, where y is the fraction of reads aligned correctly and x is the fraction aligned incorrectly, occurred at 10% of SNVs (Fig. 5c). Interestingly, the maximal point does not approach the personalized-genome ideal point as closely here as it did for the chromosome-9 experiment (Fig. 1). This seems to be due to the added ambiguity that comes when variants in all non-chromosome-9 portions of the genome are added (Additional file 1: Figure S7).
Platinum reads, SNVs
We conducted further experiments using a set of 1.57 billion real 100 nt unpaired sequencing reads from the Platinum Genomes Project [14] (accession: ERR194147). Like the simulated reads, these also come from NA12878. We gathered a set of 80.0 million SNVs from the 1000 Genomes phase-3 callset, omitting variants private to NA12878 and family members. We again used the Pop Cov+ model to select variants.
We cannot assess correctness since the reads were not simulated. Following a prior study [34], we measured the number of reads that aligned uniquely—where HISAT2 reported exactly one alignment—versus the number that aligned perfectly, matching the reference exactly with no differences. The goal was to capture the variant-inclusion trade-off; we hypothesized that adding more variants would remove the alignment-score penalty associated with known genetic variants (increasing the number of perfect matches) without increasing reference ambiguity (which would decrease the number of unique alignments). As shown in Fig. 6a, the points that achieved the peak number of unique plus perfect alignments corresponded to 30% of the SNVs. This fraction is higher than most of our simulated results, perhaps due to the fact that unique-plus-perfect is an imperfect proxy for correct-minus-incorrect (Additional file 1: Figure S8).
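The unique and perfect tallies can be computed from standard SAM tags (NH for the number of reported alignments, NM for edit distance to the reference); a sketch over fabricated records:

```python
# Sketch of the unique-vs-perfect tally from SAM-style records, using the
# standard NH (number of reported alignments) and NM (edit distance) tags.
# The records below are fabricated for illustration.

def tally(records):
    """records: one dict per aligned read with its 'NH' and 'NM' tags."""
    unique = sum(1 for r in records if r["NH"] == 1)   # exactly one alignment
    perfect = sum(1 for r in records if r["NM"] == 0)  # matches reference exactly
    return unique, perfect

reads = [
    {"NH": 1, "NM": 0},  # unique and perfect
    {"NH": 1, "NM": 2},  # unique, two mismatches
    {"NH": 4, "NM": 0},  # perfect but multi-mapping
]
print(tally(reads))  # (2, 2)
```

A read can contribute to one count but not the other, which is why the two axes in Fig. 6a move somewhat independently as variants are added.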
Platinum reads, SNVs, and indels
To highlight the effect of including indels in the reference, we repeated the previous experiment but using both SNVs and indels from the 1000 Genomes phase-3 callset. Specifically, we gathered 83.1 million variants, both SNVs and indels, again omitting variants private to NA12878 and family members. We again used the Pop Cov+ model to select variants. We again plotted the number of reads that aligned uniquely versus the number that aligned perfectly (Fig. 6a). The graph genome built from both SNVs and indels achieved peak unique+perfect at 30% of variants, like the graph built from SNVs alone. However, at every percentage it yielded more unique and perfect alignments.
Reference bias
We measured how reference bias varies with the fraction of variants included. We analyzed the alignments of the ERR194147 reads to the whole human genome with both SNVs and indels included in the reference. Figure 6b shows a series of boxplots summarizing bias at a set of 2.07 million HET SNVs called in NA12878 by the Platinum Genomes Project [14]. The set of 2.07 M HETs was chosen by taking all HETs covered by at least 25 reads in all of our experiments. Each boxplot summarizes the fraction of REF alleles (REF/(REF+ALT)) at each HET site across all 2.07 M HETs. As expected, bias decreased as more variants were included. The decrease plateaued at 10–20% of variants. Beyond 20%, including more variants did further reduce bias, but only slightly. From 20 to 70% of variants the mean decreased by only 0.00011.
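The per-site bias statistic is just the REF-allele fraction among covering reads; a minimal sketch:

```python
# Reference bias at a HET site: fraction of covering reads carrying the
# REF allele. Unbiased alignment at a true heterozygote should hover
# near 0.5; values above 0.5 indicate bias toward the reference allele.

def ref_bias(ref_count, alt_count):
    return ref_count / (ref_count + alt_count)

print(ref_bias(15, 15))  # 0.5: no bias
print(ref_bias(20, 10))  # > 0.5: biased toward REF
```

Each boxplot in Fig. 6b summarizes the distribution of this statistic over the 2.07 M HET sites for one variant-inclusion fraction.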
This is consistent with previous results showing that most of the benefit is achieved at a small fraction of variants.
HLA typing accuracy
Finally, to measure FORGe’s effect on downstream results, we measured how HLA typing recall varies with the fraction of variants included in the reference. We used the same NA12878/ERR194147 alignments evaluated in previous sections, extracted alignments in the MHC region, then provided those alignments to the Kourami [28] HLA typing tool to make HLA calls. We repeated this with indexes for all the same variant-inclusion fractions evaluated previously. More details on the HLA typing methodology are described in Additional file 1: Note S7. Compared with the linear genome, HLA typing recall and accuracy increased substantially when the highest-scoring 10% of SNVs were included in the augmented reference. Recall and average coverage plateaued at larger SNV fractions (Additional file 1: Figure S9). Overall, we see that—as we observed in other results—HLA allele recall benefits from the addition of a carefully chosen fraction of variants, and that a fraction of only 10% is sufficient to achieve peak recall.