Reference flow: reducing reference bias using multiple population genomes
Genome Biology volume 22, Article number: 8 (2021)
Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.
Sequencing data analysis often begins with aligning reads to a reference genome, with the reference represented as a linear string of bases. Linear references such as the primary GRCh38 assembly  work naturally with efficient text indexes and sequence alignment algorithms. But linearity leads to reference bias: a tendency to miss alignments or report incorrect alignments for reads containing non-reference alleles. This can ultimately lead to confounding of scientific results, especially for analyses concerned with hypervariable regions , allele-specific effects [3–6], ancient DNA analysis [7, 8], or epigenenomic signals . These problems can be more or less adverse depending on the individual under study, e.g., African-ancestry genomes contain more ALT alleles, and so can be more severely affected by reference bias .
While graph aligners [11–15] can reduce reference bias, linear aligners still perform better on certain classes of reads  and graph-aligner performance is sensitive to the number of variants considered . Other efforts have focused on elaborating the linear alignment paradigm to address reference bias. Some studies suggest replacing the typical linear reference with a “major-allele” version, with each variant set to its most common allele. This can increase alignment [16–18] and genotyping accuracy . The major-allele reference is largely compatible with the standard reference (though indels can shift coordinates) and imposes little or no additional computational overhead.
We propose a new strategy called “reference flow” that uses a collection of references chosen so as to cover known genetic variants (Fig. 1). We call the method “reference flow” because it selects which reads to align to which genomes based on how well the read aligned previously. In this work, we propose specific reference-flow strategies where the method proceeds in two passes where the first pass aligns reads to the “initial” reference and identifies unaligned reads and reads with ambiguous (low mapping-quality) alignments. The second pass re-aligns these reads to a collection of references that are chosen to span the genetic space. By merging results from both passes, we can achieve higher alignment sensitivity and lower reference bias compared to methods that use a single reference. We implemented methods to align second-pass reads to the set of five genomes corresponding to the “super populations” studied in the 1000 Genomes Project , as well as to the set of 26 genomes corresponding to the more specific 1000 Genomes “populations.” This method (a) can use existing, robust linear aligners like Bowtie 2  or BWA-MEM , (b) requires only a small number of pre-established linear reference genomes, and (c) imposes minimal computational overhead—with no possibility of exponential blowup—relative to linear aligners.
To contextualize the results, a diploid “personalized reference genome” —the genome from which reads are simulated—is used as the ideal reference genome for alignment. By considering the alignments to the diploid personalized reference as a rough upper bound on how well any method can do, we can express results in terms of the degree to which a method closes the gap between the standard linear reference and the personalized reference. When aligning simulated sequence reads, our “RandFlow-LD” method closed 71.82% of the gap in sensitivity on median compared to using GRCh38. Our method also reduced reference bias, reducing by 37% the number of strongly biased sites, and lowering the overall reference to alternate allele (REF-to-ALT) ratio from 1.014 to 1.004. When aligning real whole-genome sequencing reads from NA12878, our method reduced the number of strongly biased heterozygous (HET) sites by 13,332 (34%) and lowered the overall REF-to-ALT ratio from 1.072 to 1.016. It achieves similar gains as the vg graph aligner  in terms of alignment accuracy and reference bias avoidance while using just 14% of the memory and 18% of the CPU time. RandFlow-LD can use a larger set of 26 population-level references (“RandFlow-LD-26”) to achieve lower reference bias than vg, while still running twice as fast.
Standard and major-allele references
We built a global major-allele reference by modifying the GRCh38 primary reference  to contain the most common allele at each bi-allelic SNV or small indel. Common alleles were determined using the 1000 Genomes Project GRCh38 call set . We call this the “global major” reference. We repeated this process but considering only the five subsets of individuals belonging to the five super populations labeled by the 1000 Genomes Project. We call these “superpop major” references. Table 1 summarizes the variants included in each reference. All references were indexed for use with the Bowtie 2 aligner .
Simulations for major-allele reference flow
We studied the efficacy of a strategy we call “MajorFlow,” which starts by aligning all reads to the global major reference. Reads that fail to align or align with low mapping quality are deferred to a second pass where they are realigned to each of the 5 superpop major references. For each read, we report the best alignment to any reference. We performed all alignments using Bowtie 2 and default parameters , though the method is not restricted to a particular aligner or set of parameters (the “Reference flow” section).
We performed simulation experiments to compare MajorFlow to baselines that used Bowtie 2 to align to the GRCh38 primary assembly  or to major-allele references. We used Mason2  to simulate reads from GRCh38 chromosome 21 (the “DNA data simulation” section). Starting from the 1000 Genomes Project GRCh38 call set [20, 24], we randomly selected 100 individuals and built personalized, diploid references for each using phased variant calls (Additional file 1: Table S1). We included single nucleotide variants (SNVs) and short insertions and deletions (indels). We simulated 1M reads from both haplotypes (2M total) of each individual. Since allelic-balance measurements require deeper coverage, we also simulated a larger set of 20M reads for 25 of the individuals (Additional file 1: Table S1). We also assessed the alignment methods using an ideal, diploid personalized reference genome (the “Building and aligning to the personalized reference” section). Results using the personalized reference serve as a rough upper bound on what is achievable with references that lack foreknowledge of donor genotypes [17, 25, 26]. We call this a “rough” upper bound because, while the personalized reference is ideal in that it contains the correct variants, the accuracy of alignment is also affected by tool-specific heuristics. A true upper bound would be hard to obtain, so we settle for the rough upper bound provided by the personalized genome, as in previous work [17, 27].
We measured sensitivity, the fraction of input reads that aligned correctly, as well as the fraction that aligned incorrectly. We called an alignment correct if its leftmost aligned base was within ±10 bases of its simulated point of origin (the “Measuring sensitivity” section). We also measured allelic balance at HET SNVs, where we defined allelic balance as the number of alignments with the REF allele divided by the number with either the REF or ALT allele. We also counted the number of strongly biased sites, i.e., those with allelic balance ≤ 20% or ≥ 80%. Finally, as an aggregate measure of balance, we measured the overall REF-to-ALT ratio totaled over all HET sites (the “Allelic balance measurement” section).
The MajorFlow method (“MajorFlow” in Fig. 2a) exhibited higher sensitivity than single-reference methods that used the standard reference (“GRCh38”) or any of the major-allele references (“Major”). If we consider the increase in sensitivity relative to the sensitivity gap between the GRCh38 reference and the ideal personalized reference, MajorFlow’s median sensitivity improvement closed about 51.34% of the gap. In terms of number of incorrect alignments, MajorFlow closed 46.81% of the benefit of personalization (Additional file 1: Figure S1). MajorFlow also reduced the number of unaligned reads by 290 (56.5%) compared to using GRCh38 (Additional file 1: Figure S2). MajorFlow’s sensitivity was still higher when we enhanced the major-allele strategy by always matching the ethnicity of the major-allele reference to that of the simulated sample (“Matched”). Alignments for reads simulated from the African super population (AFR) had lower sensitivity compared to the others, even when aligned to the AFR superpop-major reference (Additional file 1: Figure S3). This is consistent with AFR’s greater genetic heterogeneity due to the out-of-Africa bottleneck. Consistent with past results, there was only a small difference in mapping sensitivity when using the global-major versus the superpop-major references, even when the simulated donor’s ethnicity was matched with the reference (Fig. 2a).
MajorFlow also reduced reference bias relative to the single linear references using the set of 25 deeper simulations. Overall REF-to-ALT ratio decreased from 1.0145 using the standard reference to 1.0073 using the global major reference, then further to 1.0064 using MajorFlow method (Fig. 2c). The median number of strongly biased HET sites dropped from 70 for GRCh38 to 59 for MajorFlow (Fig. 2c).
Simulations for stochastic reference flow
While MajorFlow outperformed the single-linear-reference strategies, we noticed it was less effective than the graph-based vg aligner at increasing sensitivity or reducing reference bias (Fig. 2). We hypothesized this was because the major-allele references used by MajorFlow were too similar to each other, narrowing the genetic diversity visible to the method. Specifically, the mean edit distance between all pairs of superpop major references was 15,115 bp for chromosome 21, whereas the mean between all pairs of five individuals randomly drawn from the super populations was 47,966 bp (Additional file 1: Figure S4).
We designed two alternative methods that draw on super-population-specific variation while also keeping the second-pass genomes genetically distinct. “RandFlow” generates a random reference haplotype for each super population by performing an independent draw at each variant site, choosing the ALT allele with probability equal to its frequency in the super population. “RandFlow-LD” is similar but additionally maintains some linkage disequilibrium (LD). RandFlow-LD begins by choosing one haplotype from the super population uniformly at random. Then, starting at the first (leftmost) polymorphic site in the haplotype and for the length of a 1000-bp window, it selects alleles matching the chosen haplotype. At the next polymorphic site beyond the 1000-bp window, the method chooses a new super-population haplotype uniformly at random and repeats the process. In this way, variant selection is still weighted by allele frequency (since haplotypes are selected uniformly at random) but a degree of LD is also maintained. Both strategies result in greater genetic distances between the second-pass references compared to MajorFlow, with mean pairwise distances on chromosome 21 being 47,316 for the RandFlow strategy and 46,326 for RandFlow-LD. Further details are in the “Reference flow” section.
Using the chromosome-21 simulation data from the previous section, we observed that RandFlow and RandFlow-LD achieved higher sensitivity, lower numbers of incorrect alignments and lower numbers of unaligned reads compared to MajorFlow. If we consider the increase relative to the sensitivity gap between the GRCh38 reference and the ideal personalized reference, RandFlow’s and RandFlow-LD’s median sensitivity improvement closed about 70.91% and 71.82% of the gap respectively (Fig. 2a). The reduction of incorrect alignment compared to personalization was 66.22% for RandFlow and 67.34% for RandFlow-LD (Additional file 1: Figure S1). The numbers of unaligned reads were further reduced by 387.5 (75.5%) for RandFlow and 385.5 (75.1%) for RandFlow-LD compared to GRCh38 (Additional file 1: Figure S2). While RandFlow slightly underperformed RandFlow-LD in sensitivity and number of incorrect alignments, we note that RandFlow does not require that variants be phased and so can benefit from larger compendia of unphased genotypes available through projects like gnomAD .
Using the set of 25 deeper simulations, RandFlow-LD reduced the median number of strongly biased HET sites to 44, from a median of 70 using the GRCh38 reference. RandFlow-LD also reduced the overall REF-to-ALT ratio to 1.0038, an improvement over GRCh38 (1.0145) and MajorFlow (1.0064) (Additional file 1: Table S2). We repeated the experiments 14 times with different random seeds and showed that the variation due to randomness for RandFlow-LD was small compared to the difference in alignment methods (Additional file 1: Note S2, Additional file 1: Figures S5, S6, S7).
We further compared the reference flow methods to vg . vg aligns to a reference that is shaped as a graph rather than a string. Paths through the graph correspond to different combinations of REF and ALT alleles. Such methods can improve alignment accuracy and reduce reference bias by adding paths—thereby removing alignment-score penalties—for ALT alleles. We built a vg index using chromosome 21 of the GRCh38 primary assembly as the base and including all variants from the 1000-Genomes GRCh38 callset having allele frequency at least 1% and aligned all reads to the graph. There were 192,846 variants passing the threshold, about twice as many ALT alleles as we considered in our RandFlow (93,146) and RandFlow-LD (95,319) strategies. We found that RandFlow and RandFlow-LD had higher sensitivity and fewer incorrectly aligned reads than vg (Fig. 2a, Additional file 1: Figure S1), but that vg yielded a smaller number of strongly biased sites (30, versus 44 for RandFlow-LD) and a slightly more balanced overall REF-to-ALT ratio (1.0026, versus 1.0038 for RandFlow-LD). While neither approach is the clear winner in this comparison, the reference flow methods use substantially less time and memory, as discussed below.
To explore how using more second-pass genomes improves accuracy, we used the same RandFlow-LD method to make a set of 26 population-specific chromosome 21 sequences. These correspond to the 26 separate populations studied in the 1000 Genomes Project, subdividing the 5 super populations and including 168,593 variants in total. The alignment sensitivity of this “RandFlow-LD-26” approach was the best of any we evaluated, closing 84.08% of the gap between the GRCh38 and personalized references. It achieved lower allelic bias compared to RandFlow-LD, with a median of 39 strongly biased sites and an overall REF-to-ALT ratio of 1.0024. Though it used a total of 27 references (including the first-pass major-allele reference), RandFlow-LD-26 used less CPU time and had a smaller memory footprint compared to vg (the “Computational efficiency” section).
Assessing reference bias with real data
We further assessed these methods using a deep whole-genome sequencing dataset from individual NA12878 (SRR622457) generated by the 1000 Genomes Project . The dataset consisted of 1.4 billion Illumina HiSeq 2000 101-bp paired-end reads, though we used only the first end of the pair in these experiments. Since each read’s true point of origin is unknown, we assess only allelic balance and not sensitivity. We assessed allelic balance only at sites where NA12878 is HET according to the 1000 Genomes Project GRCh38 call set, and then stratified the sites according to the Genome-in-a-Bottle v3.3.2 confidence annotation . There were 1,723,317 (83%) HET sites in high-confidence regions, and 344,945 (17%) in low-confidence regions. We also constructed and aligned to an ideal, diploid personalized reference using the phased variant calls for NA12878 from the GRCh38 call set. We assessed only the RandFlow-LD and RandFlow-LD-26 methods since they outperformed other reference-flow methods in the simulation experiments. After a first-pass alignment to the global major-allele reference, there were 250M (17.4%) reads deferred into the second pass.
Consistent with the simulation experiments, we observed that RandFlow-LD and vg both reduced the number of strongly biased sites in all regions, from 44,810 in the case of GRCh38, to 34,429 (23% reduction) for RandFlow-LD and 31,784 (29% reduction) for vg (Fig. 3 and Table 2). Similarly, RandFlow-LD reduced the overall REF-to-ALT ratio from 1.0719 (GRCh38) to 1.0160 and vg reduced it to 1.0123. Further, RandFlow-LD-26 reduced the number of strongly biased sites to 30,317 (32% reduction) and REF-to-ALT ratio to 1.0081, best among the methods using non-personalized references. The variant-aware methods substantially reduced reference bias compared to a method that aligned only to the global major reference (“Major”). They also reduced the number of unaligned reads (Additional file 1: Figure S8). In high-confidence regions, variant-aware methods reduced the number of strongly biased sites by 39–50% compared to GRCh38 and reduced the REF-to-ALT ratios from 1.041 to about 1.01 (Additional file 1: Figure S9). In low-confidence regions, we observed 11 – 18% reduction in number of strongly biased sites, but a greater benefit in REF-to-ALT ratios, from 1.024 to 1.001–1.028 (Additional file 1: Figure S10). RandFlow-LD-26 reduced bias most among variant-aware approaches.
Notably, the number of strongly biased sites was still as high as 23,290 when aligning to an ideal personalized reference (Table 2). In part, this is because the 1000 Genomes Project calls include only a subset of the variation present in the actual NA12878 genome. This is both because some genomic regions were excluded from the call set because of low mappability and because the call set does not include larger-scale structural variants that can have an outside effect on sensitivity and bias. We also noted that the more strongly biased sites were biased toward REF (13,899) more often than toward ALT (9391) when aligning to the personalized reference, supporting the argument that variants missing from the call set are affecting the bias.
To better understand where variant-aware methods reduce bias the most, we studied the relationship between highly biased HET sites and various categories of repeat families (Fig. 4) and classes (Additional file 1: Figure S11) annotated by RepeatMasker . Using alignment to GRCh38, many strongly biased HETs are in L1 (10,288 or 23%) and Alu (11,255 or 25%). RandFlow-LD greatly reduced the number of strongly biased HET sites in L1 (to 5250 reduced by 49%) and Alu (to 6555 reduced by 42%). A similar reduction is observed when using vg, but the greatest reductions are achieved by RandFlow-LD-26. For instance, RandFlow-LD-26 reduces the number of strongly biased sites in L1 from 10,288 to 3,560, a 65% reduction.
We constructed a dataset consisting of 10M single-end reads randomly sampled from the first end of the SRR622457 paired-end dataset. We ran each alignment method and measured the total size of index files on disk, the peak memory usage, and the CPU time (Table 3). We measured peak memory usage using the maximum resident set size reported by the GNU Time utility. We also measured CPU time using GNU Time. We performed the experiments on a computer with a 2.2 Ghz Intel Xeon CPU (E5-2650 v4) and 515GB memory. We configured all read-alignment jobs to use 16 simultaneous threads but otherwise left parameters at their defaults. Though RandFlow-LD and RandFlow-LD-26 were the only reference-flow approaches we benchmarked here, we expect MajorFlow and RandFlow to perform similarly to RandFlow-LD since they execute the same sequence of steps, using the same number of linear reference genomes.
Compared to an alignment run against the GRCh38 primary assembly, RandFlow-LD used about 5.97 times as much disk space to store the reference index files, consistent with the fact that RandFlow-LD uses 1 reference in the first pass and 5 in the second (Table 3). vg used a similar amount similar size of disk space for its indexes (.xg, .gcsa and .gcsa.lcp). vg used 7.31 times as much peak memory usage compared to the linear-based alignment methods, including RandFlow-LD. The baseline approach used less than 9% of the CPU time as vg, while RandFlow-LD used less than 18% of the CPU time as vg. Overall, RandFlow-LD used only about twice as much CPU time as the baseline. Eighty-four percent of RandFlow-LD’s runtime overhead was spent in re-alignment, 13% was spent in liftover, and less than 2% was spent in merging alignments. When extending RandFlow-LD to RandFlow-LD-26, the CPU time increased to 589% of the baseline and the index size increased to 104.9 GB. But its speed was 1.9 times that of vg.
We note that RandFlow-LD and RandFlow-LD-26 have similar peak memory footprint to the baseline because the reference-flow software runs the alignment jobs serially. In other words, only one reference genome index is resident in memory at a time. Because the read aligners themselves are multithreaded, we can do this while using many simultaneous threads.
Comparison of variant-aware alignment approaches
We further compared the reference flow methods with other graph-based methods, including the graph aligner HISAT2  (Additional file 1: Figures S12, S13 and S14). HISAT2 was computationally efficient, using 46.5% of the CPU time compared to Bowtie2 with a whole-genome graph containing variants with allele frequency ≥ 10% in the 1000 Genomes Project. Its index size (6.1G) and memory usage (6.5G) were small compared to vg’s (index size: 25.6G; memory usage: 25.6G) using the same variant set. However, it performed worse than other methods on mapping sensitivity (92.46%, versus 92.80% for vg), median number of strongly biased sites (138, versus 30 for vg), and overall REF-to-ALT ratio (1.0265, versus 1.0026 for vg) when evaluated using simulated reads from chromosome 21.
To understand the effect of including different numbers of variants in the vg graph, we tested a few vg graph sizes: a vg graph with no variants (just the linear GRCh38 reference), a graph with all 1000-Genomes variants having ≥ 10% allele frequency (AF), and a graph with all ≥ 1% AF variants. For a more direct comparison with RandFlow-LD, we also made a vg graph that included the union of the variants used in all RandFlow-LD references (Additional file 1: Note S1, Additional file 1: Table S3). We indexed the graphs and evaluated alignment performance using the same simulation framework as in the “Simulations for major-allele reference flow” and “Simulations for stochastic reference flow” sections. The median mapping sensitivity of the ≥ 10% AF graph outperforms other vg-based methods (≥ 10% AF: 92.805%; ≥ 1% AF: 92.797%), while the ≥ 1% AF graph gave fewer median strongly biased sites (≥ 10% AF: 40; ≥ 1% AF: 30) and lower overall REF-to-ALT ratio (≥ 10% AF: 1.0051; ≥ 1% AF: 1.0026). When comparing RandFlow-LD with the vg graph built using the RandFlow-LD variants (vg-RandFlow-LD column in Additional file 1: Figures S12, S13 and S14), RandFlow-LD is more sensitive (92.82% versus 92.80% for vg-RandFlow-LD), achieves a more balanced REF-to-ALT ratio (1.0038 versus 1.0069 for vg-RandFlow-LD), and yields a smaller number of highly biased sides (44 versus 50 for vg-RandFlow-LD).
We proposed and evaluated a family of “reference-flow” alignment methods. These are based on the idea that reads that fail to align or align poorly to one reference might align well to another with a different complement of alleles. We first showed that a 2-pass method using superpopulation major-allele references (MajorFlow) outperformed both a standard linear reference and individual major-allele references. As a further improvement, we proposed the RandFlow and RandFlow-LD methods that align to “random individuals” from each super population. These methods performed similarly to vg and approached the performance achieved using the ideal, personalized reference. The reference flow methods were much more computationally efficient than vg, running 5.5 times as fast and using 14% of the memory compared to vg when aligning to a graph containing all 1000 Genomes variants of frequency 10% or higher.
Our results complement key points from previous studies. Like the FORGe study , we also showed that alignment to a major-allele reference improves alignment accuracy compared to the standard linear reference. Also like FORGe, we showed that aligning to a super-population-matched major-allele reference did not substantially improve alignment accuracy compared to a global major-allele reference combining all super populations. Our results also reinforce that a linear aligner can be extended to incorporate variants and exhibit similar accuracy to a graph aligner [16, 31].
For compatibility with downstream tools, alignments output by reference-flow methods must have their reference coordinates translated back to the standard linear reference. Notably, this requires only a pairwise alignment from each of the reference-flow references to the standard reference. Thus, approaches such as RandFlow and RandFlow-LD use 5 references in the second pass require 6 pairwise whole-genome alignments: one from the first-pass major-allele reference to the standard reference and 5 from each of the second-pass references. This can be advantageous in the situation where the reference-flow genomes are assemblies with no pre-existing multiple alignment (, VCF file) describing their relationship. Algorithms for calculating genome-scale multiple alignments are resource intensive [32, 33] and yield a more complex structure compared to a pairwise alignment. Reference flow’s use of pairwise alignments also helps to solve an “N+1” problem; adding one additional reference to the second pass requires only that we index the new genome and obtain an additional whole-genome alignment (or otherwise infer such an alignment, from a VCF file) to the standard reference. We demonstrated that we could extend reference flow to 26 1000-Genomes populations, reducing bias still further while still aligning faster than vg. This flexibility could be important in the coming era where long and ultra-long sequencing reads allow us to build many high quality human genome assemblies.
While we explored methods involving a single initial reference and a set of second-pass references based on 1000-Genomes populations or super populations, we can also consider a wider class of possible architectures. For instance, considering that our method consistently performs worst on the AFR super population, we could imagine building a deeper “tree” of AFR-covering references. A read aligning poorly to the second-pass reference representing the AFR super population could, in a third pass, be aligned to an array of references for specific populations within AFR. We can imagine more complex architectures as well, leading to a general notion of a “reference flow graph” where nodes represent references and directed edges indicate which references might be attempted next. Whether a read should be forwarded along an edge would be dictated by a (possibly edge-specific) rule that uses alignment score, mapping quality, whether the alignment overlapped some genomic region, or other factors.
Our approach for selecting population-specific genomes involves randomness, chiefly as a way of “pushing” genomes further apart compared to the major-allele references. An alternative would be to cast this as a problem of optimizing the references’ “coverage” of the overall genotype space. Such an optimization approach might improve coverage (and therefore accuracy) while removing the random element. This might be accomplished using unsupervised, sequence-driven clustering methods [34, 35], using the “founder sequence” framework [36, 37], or using some form of submodular optimization . A more radical idea is to simply index all available individuals, forgoing the need to choose representatives; this is becoming more practical with the advent of new approaches for haplotype-aware path indexing  and efficient indexing for repetitive texts .
Since reference flow is essentially a “wrapper” that can be placed around an existing aligner, Bowtie 2 could be replaced by a different linear aligner such as BWA-MEM or even a graph aligner such as vg. It is even possible for different nodes in the graph to use different alignment tools. Since the wrapper is written using Snakemake , it is easily deployable both in one-sample single-computer scenarios and in scenarios involving many samples or a collection of networked computers.
In the future, it will be important to benchmark reference-flow methods when larger structural variants are included in the references. Structural variants have a disproportionate effect on alignment quality [41, 42]. In principle they are not difficult to include in the reference flow framework, though our lift over procedure is not currently robust enough to handle more complex structural variants like inversions or rearranges.
DNA data simulation
We built diploid consensus genomes for the selected individuals (Additional file 1: Table S1) using bcftools  based on the SNVs and indels specified by the 1000 Genome Project GRCh38 call set . We used Mason2  to simulate paired-end Illumina 100-bp reads, but used only the first end in most experiments. Since variants were already included in the reference genomes we simulated from, we did not use Mason2’s variation-adding feature. We enabled Mason2’s features for generating random sequencing errors and quality values. We simulated reads independently from each haplotype to generate diploid read sets, keeping information about the haplotype, chromosome, and offset of origin for downstream evaluations.
Building and aligning to the personalized reference
We built personalized, diploid reference genomes for each of the 100 randomly selected 1000 Genomes individuals [5, 44] (Additional file 1: Table S1). We used phased variant calls—including SNVs and indels and including sites with more than 2 ALT alleles—from both haplotypes of the selected individual to build FASTA files containing a personalized diploid reference genome. When aligning to the personalized diploid references, we aligned all reads separately to both haplotypes. We aligned to the haplotypes separately so that the mapping qualities could be informative; aligning to both together would have yielded consistently low mapping qualities. We then merged the resulting alignments. For a read that aligned to both haplotypes, we took the alignment with the higher alignment score. We broke ties by taking the alignment with higher mapping quality or, if the tie remained, at random.
For the simulated experiment using chr21, we aligned to each personalized haplotype 5 separate times, providing the aligner with 5 different random seeds. This yielded 10 total alignments from which we selected the best. This helped to improve the upper bound somewhat, since the 5 random seeds gave the aligner 5 times as many chances of finding the best alignment even with the censoring effect of alignment heuristics (Additional file 1: Figure S12).
In simulation experiments, we keep information about each read’s haplotype, chromosome, and offset of origin. We say a read aligns correctly if the alignment’s leftmost mapped base is within ± 10-bp of the leftmost base at the read’s point of origin. Since we use Bowtie 2 with default alignment parameters, no “soft clipping” is possible and it does not affect the definition of correctness. Reads that align outside of the ± 10-bp window are called incorrect. We define sensitivity as the fraction of reads that aligned correctly.
Allelic balance measurement
We measured allelic balance at each bi-allelic HET SNVs reported in the 1000 Genomes Project GRCh38 call set. HET SNVs that were contained within a larger deletion variant were excluded, whether or not the deletion was heterozygous. At each relevant HET, we considered the “pileup” of alleles at the site induced by overlapping read alignments. Let aref and aalt denote the number of REF and ALT alleles overlapping the site:
We say a site is strongly biased when β≤0.2 or β≥0.8. For a collection of sites, we calculate the overall REF-to-ALT ratio as total number of REF alleles divided by the total number of ALT alleles across the sites:
We ignore alleles besides REF and ALT, and we ignore alignments having a gap at the site. The assumption that on average β should equal 0.5 at HET sites is well founded for simulated datasets. Real datasets have biases, which might be due to systematic sequencing errors or fragmentation bias, for example. Biases might also arise from errors in the set of sites we consider to be HET, e.g., if the variant caller that produced the HET calls was itself affected by allelic bias.
Preparation The reference-flow methods require that we first build read-alignment indexes and coordinate-translation indexes for the relevant species and populations. Both can be generated from a reference genome in FASTA format and a collection of population variants in VCF format. The reference-flow software (a) processes the VCF to select variants to include in the population reference genomes, (b) generates both the first-pass and the second-pass references based on the reference genome, and (c) builds Bowtie 2 indexes for all references.
For convenience, we provide pre-built RandFlow-LD genomes and indexes based on the GRCh38 reference and the 1000 Genomes Project GRCh38 call set (see the “Availability of data and materials” section).
First pass In the first pass, we align all reads to an initial reference genome. For the particular reference-flow strategies evaluated here (MajorFlow, RandFlow, RandFlow-LD, and RandFlow-LD-26), we first aligned to the “global major” reference (the “Simulations for major-allele reference flow” section). Reads that fail to align or that align with low mapping quality are “forwarded” to a second pass, whereas reads that align with high mapping quality are “committed” and are ultimately passed through to the final output. We use a mapping-quality threshold because it is readily available—reported by most popular read aligners—and because alignments with low MAPQ are the most likely to benefit from the second alignment pass. After empirical experiments, we selected a MAPQ threshold of 10 (Additional file 1: Figures S15 and S16).
Second pass For reads forwarded to the second pass, we realign to a set of references that include a wider range of genetic variation. In the methods evaluated here other than RandFlow-LD-26, we use five second-pass references, each corresponding to a 1000 Genomes Project superpopulation: AFR (African), AMR (admixed American), EAS (East Asian), EUR (European), and SAS (South Asian). For RandFlow-LD-26, we use 26 second-pass references, each corresponding to a population in the 1000 Genomes Project.
In the case of the MajorFlow method, the second-pass genomes are simply the major-allele references corresponding to each of these superpopulations (the “Simulations for major-allele reference flow” section). In all cases, the second-pass references consist of a single haplotype.
Stochastic references In the RandFlow, RandFlow-LD, and RandFlow-LD-26 strategies, second-pass references are designed to represent “random individuals” from the super populations. For RandFlow, we construct the second-pass references by iterating through each polymorphic site i and performing an independent random draw to choose the ALT allele with probability equal to its allele frequency pi in the super population:
In the case of the RandFlow-LD and RandFlow-LD-26 strategies, for a variant site we select one haplotype in the super population uniformly at random. We then maintain the linkage disequilibrium (LD) relationship by selecting the genotypes from the same haplotype for the next 1000-bp region.
While we used the population and super population labels provided by the 1000 Genomes Project here, the reference-flow framework can work with any granularity of label. Further, neither the MajorFlow nor the RandFlow strategies require that genetic variants be phased. Those approaches could also work with larger, unphased compendia of genetic information such as GnomAD .
Merging and lifting For reads that aligned to more than one reference, we must choose a single “best” alignment to include in the ultimate SAM output. We select by choosing the alignment with the highest alignment score; roughly, this corresponds to the alignment with the fewest mismatches and gaps. If there is a tie for best alignment score, the alignment with higher mapping quality is selected. If there is a tie in both categories, we select at random from among the tied alignments.
For maximum compatibility with downstream tools, the SAM output from our reference-flow methods is with respect to the standard GRCh38 primary assembly. But since the reference genomes in our method—including the major-allele references—can have insertions or deletions with respect to the standard reference, we must translate (“lift over”) these alignments to standard reference coordinates before outputting them. We implemented a simple method to lift over alignments that builds a succinct mapping of coordinates from a genome to the standard reference genome using a VCF file. We use the mapping to adjust the POS and CIGAR fields of a SAM file so as to be compatible with the standard reference. The time and memory used to lift the alignments were included in the benchmarking measurements discussed in the “Computational efficiency” section.
Availability of data and materials
The reference flow software is available at  under the open source MIT license. An archival version of this software is available at . The experiments described in this paper are available at  under the open source MIT license. An archival version of this software is available at . An archival version of the raw data is available at . Pre-built RandFlow-LD genomes and indexes based on the GRCh38 reference and the 1000 Genomes Project GRCh38 call set are available at: https://genome-idx.s3.amazonaws.com/bt/flow/randflow_ld.tar.gz. A similar package that uses the RandFlow-LD-26 references instead of the RandFlow-LD references is available at https://genome-idx.s3.amazonaws.com/bt/flow/randflow_ld_26.tar.gz. An archival version of the RandFlow-LD and RandFlow-LD-26 variants used in the experiments is available at . Software versions used in the experiments are specified in Additional file 1: Table S4.
Church DM, Schneider VA, Steinberg KM, Schatz MC, Quinlan AR, Chin CS, Kitts PA, Aken B, Marth GT, Hoffman MM, Herrero J, Mendoza ML, Durbin R, Flicek P. Extending reference assembly models. Genome Biol. 2015; 16:13.
Brandt DY, Aguiar VR, Bitarello BD, Nunes K, Goudet J, Meyer D. Mapping bias overestimates reference allele frequencies at the HLA genes in the 1000 genomes project phase I data. G3: Gene Genomes Genet. 2015; 5(5):931–41.
Van De Geijn B, McVicker G, Gilad Y, Pritchard JK. WASP: allele-specific software for robust molecular quantitative trait locus discovery. Nat Methods. 2015; 12(11):1061–3.
Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, Pritchard JK. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics. 2009; 25(24):3207–12.
Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, Leng J, Bjornson R, Kong Y, Kitabayashi N, et al. AlleleSeq: analysis of allele-specific expression and binding in a network framework. Mol Syst Biol. 2011; 7(1):522.
Salavati M, Bush SJ, Palma-Vera S, Mcculloch MEB, Hume DA, Clark EL. Elimination of reference mapping bias reveals robust immune related allele-specific expression in cross-bred sheep. Front Genet. 2019; 10:863.
Martiniano R, Garrison E, Jones ER, Manica A, Durbin R. Removing reference bias and improving indel calling in ancient DNA data analysis by mapping to a sequence variation graph. BioRxiv. 2020;:782755. https://doi.org/10.1186/s13059-020-02160-7.
Günther T, Nettelblad C. The presence and impact of reference bias on population genomic studies of prehistoric human populations. PLoS Genet. 2019; 15(7):1008302.
Groza C, Kwan T, Soranzo N, Pastinen T, Bourque G. Personalized and graph genomes reveal missing signal in epigenomic data. Genome Biol. 2020; 21(1):1–22.
Gurdasani D, Carstensen T, Tekola-Ayele F, Pagani L, Tachmazidou I, Hatzikotoulas K, Karthikeyan S, Iles L, Pollard MO, Choudhury A, et al. The African genome variation project shapes medical genetics in Africa. Nature. 2015; 517(7534):327–32.
Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, Jones W, Garg S, Markello C, Lin MF, Paten B, Durbin R. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol. 2018; 36(9):875–9.
Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019; 37(8):907–15.
Rakocevic G, Semenyuk V, Lee W-P, Spencer J, Browning J, Johnson IJ, Arsenijevic V, Nadj J, Ghose K, Suciu MC, et al. Fast and accurate genomic analyses using genome graphs. Nat Genet. 2019; 51(2):354–62.
Rautiainen M, Marschall T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 2020; 21(1):253.
Li H, Feng X, Chu C. The design and construction of reference pangenome graphs. arXiv preprint arXiv:2003.06079. 2020. https://doi.org/10.1186/s13059-020-02168-z.
Grytten I, Rand KD, Nederbragt AJ, Sandve GK. Assessing graph-based read mappers against a baseline approach highlights strengths and weaknesses of current methods. BMC Genomics. 2020; 21:1–9.
Pritt J, Chen N-C, Langmead B. FORGe: prioritizing variants for graph genomes. Genome Biol. 2018; 19(1):220.
Shukla HG, Bawa PS, Srinivasan S. hg19KIndel: ethnicity normalized human reference genome. BMC Genomics. 2019; 20(1):459.
Dewey FE, Chen R, Cordero SP, Ormond KE, Caleshu C, Karczewski KJ, Whirl-Carrillo M, Wheeler MT, Dudley JT, Byrnes JK, et al. Phased whole-genome genetic risk in a family quartet using a major allele reference sequence. PLoS Genet. 2011; 7(9):1002280.
Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR, Auton A, Abecasis GR, Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A, Clark AG, Donnelly P, Eichler EE, Flicek P, Gabriel SB, Gibbs RA, Green ED, Hurles ME, Knoppers BM, Korbel JO, Lander ES, Lee C, Lehrach H, Mardis ER, Marth GT, McVean GA, Nickerson DA, Schmidt JP, Sherry ST, Wang J, Wilson RK, Gibbs RA, Boerwinkle E, Doddapaneni H, Han Y, Korchina V, Kovar C, Lee S, Muzny D, Reid JG, Zhu Y, Wang J, Chang Y, Feng Q, Fang X, Guo X, Jian M, Jiang H, Jin X, Lan T, Li G, Li J, Li Y, Liu S, Liu X, Lu Y, Ma X, Tang M, Wang B, Wang G, Wu H, Wu R, Xu X, Yin Y, Zhang D, Zhang W, Zhao J, Zhao M, Zheng X, Lander ES, Altshuler DM, Gabriel SB, Gupta N, Gharani N, Toji LH, Gerry NP, Resch AM, Flicek P, Barker J, Clarke L, Gil L, Hunt SE, Kelman G, Kulesha E, Leinonen R, McLaren WM, Radhakrishnan R, Roa A, Smirnov D, Smith RE, Streeter I, Thormann A, Toneva I, Vaughan B, Zheng-Bradley X, Bentley DR, Grocock R, Humphray S, James T, Kingsbury Z, Lehrach H, Sudbrak R, Albrecht MW, Amstislavskiy VS, Borodina TA, Lienhard M, Mertes F, Sultan M, Timmermann B, Yaspo ML, Mardis ER, Wilson RK, Fulton L, Fulton R, Sherry ST, Ananiev V, Belaia Z, Beloslyudtsev D, Bouk N, Chen C, Church D, Cohen R, Cook C, Garner J, Hefferon T, Kimelman M, Liu C, Lopez J, Meric P, O’Sullivan C, Ostapchuk Y, Phan L, Ponomarov S, Schneider V, Shekhtman E, Sirotkin K, Slotta D, Zhang H, McVean GA, Durbin RM, Balasubramaniam S, Burton J, Danecek P, Keane TM, Kolb-Kokocinski A, McCarthy S, Stalker J, Quail M, Durbin RM, Balasubramaniam S, Burton J, Danecek P, Keane TM, Kolb-Kokocinski A, McCarthy S, Stalker J, Quail M, Schmidt JP, Davies CJ, Gollub J, Webster T, Wong B, Zhan Y, Auton A, Campbell CL, Kong Y, Marcketta A, Gibbs RA, Yu F, Antunes L, Bainbridge M, Muzny D, Sabo A, Huang Z, Wang J, Coin LJ, Fang L, Guo X, Jin X, Li G, Li Q, Li Y, Li Z, Lin H, Liu B, Luo R, Shao H, Xie Y, Ye C, Yu C, Zhang F, Zheng H, Zhu H, Alkan C, Dal E, Kahveci F, Marth GT, Garrison EP, Kural D, Lee WP, Leong WF, Stromberg M, Ward AN, Wu J, Zhang M, Daly MJ, DePristo MA, Handsaker RE, Altshuler DM, Banks E, Bhatia G, Del Angel G, Gabriel SB, Genovese G, Gupta N, Li H, Kashin S, Lander ES, McCarroll SA, Nemesh JC, Poplin RE, Yoon SC, Lihm J, Makarov V, Clark AG, Gottipati S, Keinan A, Rodriguez-Flores JL, Korbel JO, Rausch T, Fritz MH, Stutz AM, Flicek P, Beal K, Clarke L, Datta A, Herrero J, McLaren WM, Ritchie GR, Smith RE, Zerbino D, Zheng-Bradley X, Sabeti PC, Shlyakhter I, Schaffner SF, Vitti J, Cooper DN, et al. A global reference for human genetic variation. Nature. 2015; 526(7571):68–74.
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012; 9(4):357.
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997. 2013.
Holtgrewe M. Mason: a read simulator for second generation sequencing data. Technical Reports of Institut für Mathematik und Informatik, Freie Universität Berlin TR-B-10-06. 2010.
Lowy-Gallego E, Fairley S, Zheng-Bradley X, Ruffier M, Clarke L, Flicek P, Consortium GP, et al. Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project [version 2; peer review: 2 approved]. Wellcome Open Res. 2019; 4:50. https://doi.org/10.12688/wellcomeopenres.15126.2.
Ballouz S, Dobin A, Gillis JA. Is it time to change the reference genome?Genome Biol. 2019; 20(1):159.
Liu X, MacLeod JN, Liu J. iMapSplice: Alleviating reference bias through personalized RNA-seq alignment. PLoS One. 2018; 13(8):e0201554. https://doi.org/10.1371/journal.pone.0201554.
Crysnanto D, Pausch H. Bovine breed-specific augmented reference graphs facilitate accurate sequence read mapping and unbiased variant discovery. Genome Biol. 2020; 21(1):184.
Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020; 581(7809):434–43.
Zook JM, Catoe D, McDaniel J, Vang L, Spies N, Sidow A, Weng Z, Liu Y, Mason CE, Alexander N, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016; 3(1):1–26.
Mokveld T, Linthorst J, Al-Ars Z, Holstege H, Reinders M. CHOP: haplotype-aware path indexing in population graphs. Genome Biol. 2020; 21(1):1–16.
Garriga E, Di Tommaso P, Magis C, Erb I, Mansouri L, Baltzis A, Laayouni H, Kondrashov F, Floden E, Notredame C. Large multiple sequence alignments with a root-to-leaf regressive method. Nat Biotechnol. 2019; 37(12):1466–70.
Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011; 7(1):539. https://doi.org/10.1038/msb.2011.75.
Karim M, Cochez M, Zappa A, Sahay R, Beyan O, Schuhmann D-R, Decker S, et al. Convolutional embedded networks for population scale clustering and bio-ancestry inferencing. arXiv preprint arXiv:1805.12218. 2018.
Han E, Carbonetto P, Curtis RE, Wang Y, Granka JM, Byrnes J, Noto K, Kermany AR, Myres NM, Barber MJ, et al. Clustering of 770,000 genomes reveals post-colonial population structure of North America. Nat Commun. 2017; 8(1):1–12.
Norri T, Cazaux B, Kosolobov D, Mäkinen V. Linear time minimum segmentation enables scalable founder reconstruction. Algorithm Mol Biol. 2019; 14(1):12.
Mäkinen V, Cazaux B, Equi M, Norri T, Tomescu AI. Linear time construction of indexable founder block graphs. Leibniz International Proceedings in Informatics. LIPIcs. 2020;:172. https://doi.org/10.4230/LIPIcs.WABI.2020.7.
Libbrecht MW, Bilmes JA, Noble WS. Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization. Proteins. 2018; 86(4):454–66.
Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient construction of a complete index for pan-genomics read alignment. J Comput Biol. 2020; 27(4):500–13.
Köster J, Rahmann S. Snakemake – a scalable bioinformatics workflow engine. Bioinformatics. 2012; 28(19):2520–2.
Sherman RM, Forman J, Antonescu V, Puiu D, Daya M, Rafaels N, Boorgula MP, Chavan S, Vergara C, Ortega VE, et al. Assembly of a pan-genome from deep sequencing of 910 humans of African descent. Nature Genet. 2019; 51(1):30–5.
Audano PA, Sulovari A, Graves-Lindsay TA, Cantsilieris S, Sorensen M, Welch AE, Dougherty ML, Nelson BJ, Shah A, Dutcher SK, et al. Characterizing the major structural variant alleles of the human genome. Cell. 2019; 176(3):663–75.
Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011; 27(21):2987–93.
Yuan S, Qin Z. Read-mapping using personalized diploid reference genome for RNA sequencing data reduced bias for detecting allele-specific expression. In: 2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops: 2012. p. 718–24, IEEE.
Chen N-C, Solomon B, Mun T, Iyer S, Langmead B. Reference flow software. 2020. Github https://github.com/langmead-lab/reference_flow.
Chen N-C, Solomon B, Mun T, Iyer S, Langmead B. Reference flow software. 2020. Zenodo https://doi.org/10.5281/zenodo.4287778.
Chen N-C, Solomon B, Mun T, Iyer S, Langmead B. Software for reference flow study experiments. 2020. Github https://github.com/langmead-lab/reference_flow-experiments.
Chen N-C, Solomon B, Mun T, Iyer S, Langmead B. Software for reference flow study experiments. 2020. Zenodo https://doi.org/10.5281/zenodo.4287729.
Chen N-C, Solomon B, Mun T, Iyer S, Langmead B. Raw data for reference flow experiments. 2020. Zenodo http://doi.org/10.5281/zenodo.4287794.
Chen N-C, Solomon B, Mun T, Iyer S, Langmead B. Reference flow VCF for pre-built genomes. 2020. Zenodo http://doi.org/10.5281/zenodo.4289428.
Part of this research project was conducted using computational resources at the Maryland Advanced Research Computing Center (MARCC). Reference-flow indexes are made freely available on Amazon Web Services thanks to the AWS Public Dataset Program.
The review history is available as Additional file 2.
NC, BS, TM, and BL were supported by NIH grant R01GM118568 to BL and by NSF grant IIS-1349906 to BL. NC and BL were also supported by NIH grant R01HG011392 to BL.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Chen, NC., Solomon, B., Mun, T. et al. Reference flow: reducing reference bias using multiple population genomes. Genome Biol 22, 8 (2021). https://doi.org/10.1186/s13059-020-02229-3