Phasing pipeline and strategy
We developed the nPhase pipeline, an alignment-based phasing method and associated algorithm that require three inputs: highly accurate short reads, informative long reads, and a reference sequence. The pipeline processes these raw inputs into data usable by the nPhase algorithm. Unlike other existing methods, our algorithm is designed for ploidy agnostic phasing: it does not require the user to input a ploidy level, and it contains no logic that attempts to estimate the ploidy of the input data. The core idea of the algorithm is that iteratively clustering the most similar long reads, and groups of long reads, together will naturally recreate the original haplotypes.
For the first step of the pipeline, the long and short reads are aligned to the reference, then the aligned short reads are variant called to identify heterozygous positions and generate a high-quality dataset of variable positions (Fig. 1). Each long read is then reduced to its set of heterozygous SNPs according to the previously identified variable positions. We also collect long-read coverage information to allow the level of representation of a haplotype in the data to influence its likelihood of being properly phased (see the “Methods” section).
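For illustration, this reduction step can be sketched as follows (a minimal, hypothetical example, not the nPhase implementation), representing an aligned read as a mapping from reference positions to observed bases:

```python
# Illustrative sketch of reducing an aligned long read to its set of bases at
# known heterozygous positions (hypothetical code, not the nPhase source).

def reduce_read(aligned_bases, het_positions):
    """Keep only the bases a read carries at known heterozygous positions.

    aligned_bases: dict mapping reference position -> base observed in the read
    het_positions: set of heterozygous positions from short-read variant calling
    """
    return {pos: base for pos, base in aligned_bases.items() if pos in het_positions}

# Example: a read covering positions 100-104, of which 101 and 104 are heterozygous
read = {100: "A", 101: "T", 102: "G", 103: "C", 104: "A"}
het = {101, 104, 250}
print(reduce_read(read, het))  # {101: 'T', 104: 'A'}
```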
The reduced long reads and the coverage information are then passed on to the nPhase algorithm, an iterative clustering method. On the first iteration, nPhase clusters together the two most similar long reads, checks that the cluster identities are maintained, i.e., that merging these two long reads does not significantly change the information each contains individually, and finally generates a consensus sequence representative of this group of two reads. The next iteration proceeds in exactly the same way with N − 1 reads. nPhase runs until all remaining clusters are sufficiently different from each other to fail the cluster identity maintenance check. These remaining clusters represent the different haplotypes within the original dataset.
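The overall scheme can be summarized with the sketch below, a simplified, hypothetical rendition of the loop just described, with the similarity, consensus, and identity-check logic passed in as functions (toy versions of each are sketched in the next section):

```python
# A simplified, hypothetical rendition of the iterative scheme (not the nPhase
# source). similarity is assumed to return a number; the helpers are supplied
# by the caller.

def iterative_clustering(reads, similarity, consensus, identity_preserved):
    clusters = [[r] for r in reads]  # start with one cluster per read
    merged = True
    while merged:
        merged = False
        # rank every candidate pair by the similarity of their consensuses
        candidates = sorted(
            ((similarity(consensus(a), consensus(b)), i, j)
             for i, a in enumerate(clusters)
             for j, b in enumerate(clusters) if i < j),
            reverse=True,
        )
        for _, i, j in candidates:
            # merge the most similar pair whose identities are preserved
            if identity_preserved(clusters[i], clusters[j]):
                clusters[i] = clusters[i] + clusters[j]
                del clusters[j]
                merged = True
                break  # re-rank after each merge, as in the description above
    return clusters  # the remaining clusters represent the haplotypes
```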
nPhase, a ploidy agnostic phasing algorithm
As described earlier, nPhase is an iterative clustering algorithm. It is composed of three main ideas: (i) clustering, which ensures that similar reads are clustered together; (ii) cluster identity maintenance, which ensures that only similar clusters are merged into larger ones; and finally (iii) consensus, a way to reduce a cluster to a consensus sequence in order to easily compare it to other clusters (Fig. 2).
Each step of the clustering algorithm starts by calculating the similarity between every overlapping pair of reads (Fig. 2a). By default, the minimal overlap is 10 heterozygous positions. Similarity is defined as S = N(shared variants)/N(shared positions), i.e., the fraction of overlapping heterozygous positions at which the two reads carry the same variant. The pair of reads with the highest similarity is clustered together. If there is a tie, we cluster together the pair of reads with the most variable positions in common. If there is again a tie, we select a pair randomly. By default, the algorithm will not attempt to merge two sequences with less than 1% similarity.
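A toy version of these rules might look as follows (hypothetical code, not the nPhase source; reads are represented as dictionaries mapping heterozygous positions to bases):

```python
import random

MIN_OVERLAP = 10       # default: at least 10 shared heterozygous positions
MIN_SIMILARITY = 0.01  # default: do not merge below 1% similarity

def similarity(read_a, read_b):
    shared_positions = read_a.keys() & read_b.keys()
    if len(shared_positions) < MIN_OVERLAP:
        return None  # not enough overlap to compare
    shared_variants = sum(read_a[p] == read_b[p] for p in shared_positions)
    return shared_variants / len(shared_positions)

def best_pair(reads):
    candidates = []
    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            s = similarity(reads[i], reads[j])
            if s is not None and s >= MIN_SIMILARITY:
                overlap = len(reads[i].keys() & reads[j].keys())
                candidates.append((s, overlap, i, j))
    if not candidates:
        return None  # no mergeable pair remains
    best_s, best_overlap, _, _ = max(candidates)
    # tie-break 1: most shared variable positions; tie-break 2: random choice
    ties = [(i, j) for s, o, i, j in candidates if (s, o) == (best_s, best_overlap)]
    return random.choice(ties)
```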
The pair that was selected now forms a cluster of two reads (Fig. 2b). In order to continue this iterative algorithm, we need a way to calculate the similarity between a read and a cluster of reads, and between two clusters of reads. We do so by computing a consensus sequence for each cluster of reads and using that consensus sequence to calculate the similarity as defined above. For each position, the consensus is defined as the base with the most support from the reads in the cluster. Each read gets a vote equal to the context coverage of the base it supports. If there is a tie, all tied bases are included in the consensus sequence.
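A minimal sketch of this consensus rule follows (hypothetical code); here a simple per-read weight stands in for nPhase's context coverage:

```python
from collections import defaultdict

def cluster_consensus(cluster, weights):
    """cluster: list of reads as {position: base}; weights: one vote weight per read."""
    votes = defaultdict(lambda: defaultdict(float))
    for read, weight in zip(cluster, weights):
        for pos, base in read.items():
            votes[pos][base] += weight  # each read votes with its weight
    consensus = {}
    for pos, base_votes in votes.items():
        top = max(base_votes.values())
        # ties are kept: every base with maximal support enters the consensus
        consensus[pos] = {b for b, v in base_votes.items() if v == top}
    return consensus

reads = [{1: "A", 2: "T"}, {1: "A", 2: "C"}, {1: "G", 2: "C"}]
print(cluster_consensus(reads, weights=[1.0, 1.0, 1.0]))
# {1: {'A'}, 2: {'C'}}
```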
As defined so far, the clustering algorithm would continue to iterate, merging clusters until all available options are exhausted, and output only one cluster per region (Fig. 2c), collapsing distinct haplotypes together. The solution is to restrict which clusters are allowed to be merged in the clustering step. We consider that each cluster has its own "identity", defined by the population of reads that comprise it. If merging two clusters would significantly change the identity of both clusters, the merge is not allowed; the amount of change allowed is limited by the ID parameter. In order to quantify the change to a cluster's identity, we keep track of the "demographics" of each position, i.e., how strongly each base is represented at that position in that cluster. We differentiate positive identity changes from negative ones: (i) if a merge of two clusters results in increased support for their consensus sequence bases, that change is considered positive; (ii) if the merge results in decreased support for a consensus sequence base, that change is considered negative, and we calculate how many votes the base lost, even if it remains the consensus base after the merge. The number of votes lost is divided by the total number of votes in the region that both clusters have in common to obtain the cluster identity change percentage. By default, if this percentage is higher than 5%, we do not allow the two clusters to merge. Once all remaining clusters fail this test, the algorithm stops. The resulting clusters represent the different haplotypes found by nPhase and are output as different sets of reads, heterozygous SNPs, and consensus sequences.
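The identity check leaves some room for interpretation; the sketch below implements one plausible reading, in which "support" is measured as each base's fraction of the votes at its position (hypothetical code, not the nPhase source):

```python
MAX_ID_CHANGE = 0.05  # default: allow at most a 5% identity change per merge

def identity_preserved(votes_a, votes_b):
    """votes_a, votes_b: per-cluster 'demographics' as {position: {base: votes}}.

    One plausible reading of the rule: at each shared position, measure how much
    support each cluster's consensus base loses after the merge, convert that
    loss to vote units, and compare the total loss to the total votes both
    clusters hold in the shared region.
    """
    shared = votes_a.keys() & votes_b.keys()  # region both clusters have in common
    if not shared:
        return False
    lost_votes = 0.0
    total_votes = 0.0
    for pos in shared:
        merged = dict(votes_a[pos])
        for base, v in votes_b[pos].items():
            merged[base] = merged.get(base, 0.0) + v
        merged_total = sum(merged.values())
        total_votes += merged_total
        for votes in (votes_a[pos], votes_b[pos]):
            cons_base = max(votes, key=votes.get)  # this cluster's consensus base
            frac_before = votes[cons_base] / sum(votes.values())
            frac_after = merged[cons_base] / merged_total
            if frac_after < frac_before:  # negative identity change
                lost_votes += (frac_before - frac_after) * merged_total
    return lost_votes / total_votes <= MAX_ID_CHANGE

a = {100: {"A": 9, "G": 1}, 101: {"T": 10}}
b = {100: {"A": 8, "G": 2}, 101: {"T": 9, "C": 1}}
print(identity_preserved(a, b))  # True: merging barely changes either identity
```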
Validation of the nPhase algorithm by combining reads of non-heterozygous individuals
To test and validate the performance of nPhase, we decided to combine sequencing datasets of haploid and homozygous diploid organisms into virtual polyploid datasets. We selected four natural S. cerevisiae isolates as the basis for our virtual genomes: ACA, a haploid strain, and CCN, BMB, and CRL, three homozygous diploid strains (Additional file 1: Table S1). These four strains have different ecological and geographical origins and are sufficiently distinct from each other to allow us to evaluate the performance of nPhase at heterozygosity levels of up to 1% of the genome [24].
We sequenced these strains using an Oxford Nanopore long-read sequencing strategy and obtained Illumina short-read data from our 1011 yeast genomes project [24]. Since these strains do not have any heterozygosity, we could map their short reads to the Saccharomyces cerevisiae reference genome and variant call them to obtain their haplotypes (Fig. 1). We then used these haplotypes as a truth set to assess the performance of nPhase. With this truth set, we tested the influence of dataset characteristics: coverage, ploidy, heterozygosity level, and the inclusion or exclusion of reads that map to distant regions of the genome, hereafter described as split reads. We also investigated the influence of parameters that modulate the behavior of the nPhase algorithm: minimum similarity, minimum overlap, and maximum ID change (for a description of them, see the available parameters in the “Methods” section).
To assess the influence of ploidy, we used the three previously mentioned virtual genome constructions (diploid, triploid, and tetraploid). We also randomly sampled 6250, 12,500, 62,500, and 125,000 heterozygous SNPs from each virtual genome to simulate datasets in which 0.05%, 0.1%, 0.5%, and 1% of the positions in the genome are heterozygous. This equates to three ploidies and four heterozygosity levels, or 12 polyploid genomes to test.
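For illustration, the subsampling can be sketched as follows (hypothetical code; we assume a genome size of roughly 12.5 Mb for S. cerevisiae, so that a 0.05% target corresponds to the 6250 positions above):

```python
import random

def sample_het_snps(het_positions, genome_size, target_het_level, seed=0):
    """Subsample heterozygous positions to reach a target heterozygosity level."""
    n = round(genome_size * target_het_level)
    rng = random.Random(seed)  # seeded for reproducibility
    return sorted(rng.sample(sorted(het_positions), n))

# With a ~12.5 Mb genome, a 0.05% target yields 6250 positions:
positions = range(0, 12_500_000, 10)  # stand-in for real heterozygous positions
print(len(sample_het_snps(positions, 12_500_000, 0.0005)))  # 6250
```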
By running a total of 6000 validation tests exploring the parameter space across varying ploidy, heterozygosity, and coverage levels, we determined the default parameters of nPhase (see the "Methods" section). According to these tests, the parameters that yield optimal accuracy and contiguity are the following: 1% minimum similarity, 10% minimum overlap, and 5% maximum ID change (see the "Identifying optimal parameters" section in the "Methods" section). We then ran nPhase with these default parameters on our previously described datasets of varying ploidy (2n, 3n, and 4n) and heterozygosity levels (0.05%, 0.1%, 0.5%, and 1%), with 20X long reads per haplotype and split read information (Additional file 1: Table S2).
As an example, we used nPhase to phase the virtual tetraploid genome with a heterozygosity level of 0.5% (Fig. 3). Since we know the ground truth, we can assign each haplotig to the strain whose haplotype it most closely represents and calculate our accuracy metrics.
In order to measure accuracy, we distinguish between two forms of error: standard errors, i.e., heterozygous SNPs attributed to the wrong haplotype, and missing errors, i.e., heterozygous SNPs which we know are present but which were not represented in the predictions. The accuracy is the percentage of all SNPs which were correctly attributed to their haplotype. The error rate is the percentage of all predictions which were incorrect. The missing rate is the percentage of all heterozygous SNPs which were never attributed to their haplotype. We use the following formula: accuracy = TP/(TP + FP + FN), where TP (true positive) means the SNP was attributed to the correct haplotype, FP (false positive) means the SNP was attributed to a haplotype it does not belong to, and FN (false negative) means the SNP is not represented in the results.
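A minimal sketch of these metrics follows (illustrative counts, not real data); we assume, consistent with the reported rates summing to approximately 100%, that all three rates share the denominator TP + FP + FN:

```python
# A minimal sketch of the accuracy metrics defined above (hypothetical counts).
def phasing_metrics(tp, fp, fn):
    total = tp + fp + fn
    return {
        "accuracy": tp / total,      # SNPs attributed to the correct haplotype
        "error_rate": fp / total,    # SNPs attributed to the wrong haplotype
        "missing_rate": fn / total,  # SNPs never attributed to their haplotype
    }

print(phasing_metrics(tp=937, fp=40, fn=22))
# ≈ {'accuracy': 0.938, 'error_rate': 0.040, 'missing_rate': 0.022}
```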
The result of this test was an accuracy of 93.7%, an error rate of 4.0%, and a missing rate of 2.2%, with an average of 2.4 haplotigs per haplotype. Seven of the sixteen chromosomes have an L90 of 1, meaning that for all four haplotypes, more than 90% of the heterozygous SNPs were assigned to a single haplotig. Of the nine remaining chromosomes, seven have at least two chromosome-length haplotigs. In all cases, the chromosomes are nearly fully covered by haplotigs representing the four different haplotypes, as confirmed by the low missing haplotype prediction rate (2.2%). As a ploidy agnostic tool, nPhase was given no information about the ploidy of this sample and made no attempt to estimate it. Despite that, nPhase reached a high accuracy (93.7%) and contiguity (2.4 haplotigs per haplotype), demonstrating its ability to reliably phase a tetraploid at this heterozygosity level. The same representation is available for the other datasets of different ploidy and heterozygosity levels (Additional file 2: Fig. S1).
Across the 12 phased genomes of variable ploidy and heterozygosity levels, we noted little variation in contiguity, obtaining between 2.4 and 4.3 haplotigs per haplotype (Fig. 4a). The least contiguous genomes are observed at a heterozygosity level of 0.05%, with around 4 haplotigs per haplotype (Fig. 4a). The triploid genomes decrease to around 3 haplotigs per haplotype for heterozygosity levels greater than 0.1% (Fig. 4a). The tetraploid tests continue the trend of higher ploidies becoming more stable and contiguous as the heterozygosity level increases, dropping to 3.1 haplotigs per haplotype at the 0.1% heterozygosity level and then stabilizing at 2.4 haplotigs per haplotype at the 0.5% and 1% heterozygosity levels (Fig. 4a). This could be explained by the availability of more haplotigs to potentially merge with each other as ploidy increases.
Regarding accuracy, we observed that for heterozygosity levels of 0.5% and above, the accuracy appears stable and high across ploidies, with a minimum of 93.56% for the diploid (2n) at a 0.5% heterozygosity level and a maximum of 96.70% for the triploid (3n) at a 1% heterozygosity level (Fig. 4b). For lower heterozygosity levels (≤ 0.1%), the results are more variable across ploidies (Fig. 4b). Diploid tests retain a high 95.32% accuracy at the 0.1% heterozygosity level but drop to 90.34% at 0.05%. Triploid genomes drop to 90.70% accuracy at the 0.1% heterozygosity level, then to 87.00% at 0.05%. Continuing the trend of higher ploidies performing worse at lower heterozygosity levels, the tetraploid tests yield 81.65% and 78.62% accuracy at the 0.1% and 0.05% heterozygosity levels, respectively.
In addition, we observed that errors are more frequent than missing calls in all tests (Fig. 4c, d). For higher heterozygosity levels (≥ 0.5%), these two forms of error are stable and very low: the error rate ranges from a minimum of 2.53% for the 1% heterozygosity level triploid to a maximum of 4.51% for the 0.5% heterozygosity level diploid, and the missing rate ranges from a minimum of 0.31% for the 1% heterozygosity level diploid to a maximum of 2.21% for the 0.5% heterozygosity level tetraploid. For lower heterozygosity levels (≤ 0.1%), both the error and missing rates increase with ploidy, suggesting the two types of error may be linked. If we set aside the 0.1% heterozygosity level diploid, which has error and missing rates of 3.82% and 0.86%, respectively, the error rates span a wide range, from a minimum of 6.49% for the 0.1% heterozygosity level triploid to a maximum of 12.91% for the 0.05% heterozygosity level tetraploid. Similarly, the missing rates range from a minimum of 2.97% for the 0.05% heterozygosity level diploid to a maximum of 8.46% for the 0.05% heterozygosity level tetraploid, again adding to the trend of lower heterozygosity levels coupled with higher ploidies yielding worse results.
Benchmarking nPhase against other polyploid phasing tools
Some methods currently exist to phase polyploids using long-read data, such as WhatsHap polyphase [20], as well as other methods that were mostly designed for short-read sequencing data but can sometimes use long reads as input [21,22,23]. Because nPhase leverages the linking power of long reads to achieve its high accuracy and contiguity, we did not benchmark it against tools that rely exclusively on short reads for phasing, since these are inherently limited by the size of their reads. We also did not benchmark nPhase against tools that can only phase diploid genomes, as this is not the intended use case for our algorithm. We therefore compared nPhase to the recently released WhatsHap polyphase, to our knowledge the only other polyploid phasing algorithm that handles long reads.
We compared the results of nPhase (default parameters) with those of WhatsHap polyphase on the same samples (Fig. 5). Since WhatsHap polyphase has a parameter named "--block-cut-sensitivity" that determines the tradeoff between accuracy and contiguity, we tested WhatsHap polyphase using all possible values of this parameter (integers from 0 to 5) and compared each result to nPhase's default results. A value of 0 means that WhatsHap polyphase will generate the most contiguous results possible, and 5 means that it will generate the most accurate results possible.
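Such a sweep can be scripted along these lines (a sketch with hypothetical file paths and ploidy; consult the WhatsHap documentation for the exact interface):

```python
import subprocess

# Sweep the contiguity/accuracy tradeoff parameter (integers 0-5).
for sensitivity in range(6):
    subprocess.run(
        [
            "whatshap", "polyphase",
            "variants.vcf", "long_reads.bam",  # hypothetical input files
            "--ploidy", "4",                   # e.g., for a tetraploid sample
            "--reference", "reference.fasta",
            "--block-cut-sensitivity", str(sensitivity),
            "-o", f"phased_sensitivity_{sensitivity}.vcf",
        ],
        check=True,  # stop the sweep if any run fails
    )
```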
The performance of WhatsHap polyphase was originally measured in terms of switch error rate and N50 block length. Here, instead, we use accuracy and average number of haplotigs per haplotype, two metrics that more directly represent the performance of the algorithms and answer two important questions: "How reliable are the results?", i.e., what are the proportions of accurate, erroneous, and missing calls? And "How informative are they?", i.e., by how many haplotigs is each haplotype represented? nPhase and WhatsHap polyphase were both applied to our 20X test datasets of different ploidy and heterozygosity levels. nPhase was run with its default parameters and WhatsHap polyphase with all six possible values of its adjustable sensitivity parameter. We report here the average accuracy, error, and missing rates, as well as the average number of haplotigs obtained for the genome, normalized by the ploidy.
In our tests, nPhase has an average accuracy of 91.2%, slightly outperforming WhatsHap polyphase's most sensitive setting (5), which yields an average accuracy of 90.1%, and its second most sensitive setting (4), which yields an average accuracy of 88.9% (Fig. 5a). Lower sensitivity levels for WhatsHap polyphase quickly lose accuracy, with the next lowest setting (3) yielding only 81.1% accuracy on average, and the least sensitive setting (0) only reaching 59% accuracy.
In addition to its high accuracy, nPhase is highly contiguous, outputting these accurate results in an average of 3.4 haplotigs per chromosome per haplotype (Fig. 5b). The highly accurate WhatsHap polyphase sensitivity levels (5 and 4) output their results in a highly discontiguous 258.7 and 88.9 haplotigs per haplotype, respectively. In order to output results of similar contiguity to nPhase, WhatsHap polyphase must sacrifice accuracy and drop to a sensitivity level of 1 or 0, which output 2.5 and 0.9 haplotigs per chromosome per haplotype, respectively. This tradeoff between accuracy and contiguity does not appear to have a useful middle ground for WhatsHap polyphase; nPhase demonstrates that no such choice is necessary, as it simultaneously achieves both.
Validation of the nPhase algorithm on a real Brettanomyces bruxellensis triploid strain
We further tested nPhase by running it on a real triploid organism. We selected GB54, a triploid strain of the yeast species Brettanomyces bruxellensis with a 0.7% heterozygosity level. GB54 was sequenced by Oxford Nanopore long-read sequencing and Illumina short-read sequencing, then processed through the nPhase pipeline. Since we know that this strain is triploid, a successful phasing should reflect that triploid nature by outputting three haplotypes per region. In our results, we observe that most regions have been phased into two or three haplotypes, with a few small exceptions (Fig. 6).
The regions that output more or fewer than three haplotypes are unexpected and potentially represent phasing failures. For example, the highlighted region in Fig. 6, on chromosome 4, transitions from three haplotypes to only two. By remapping each haplotig's reads back to the reference and inspecting the coverage, we note that regions phased into only two haplotigs have a coverage distribution consistent with the presence of only two haplotypes but three genomic copies (Fig. 7): one haplotig accounts for 2/3 of the coverage and the other for the remaining 1/3. In Fig. 7, we highlighted the previously described region of chromosome 4, showing that the three haplotigs in the first, triploid region have roughly equal coverage, while in the region with only two haplotigs, one has twice the coverage of the other.
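The underlying arithmetic is simple; for illustration (hypothetical depth values):

```python
# Per-haplotig share of the total coverage in a region (hypothetical depths):
# in a triploid region covered by two haplotigs, a haplotype present in two
# copies should absorb ~2/3 of the reads and the other haplotype ~1/3.
def coverage_fractions(mean_depths):
    total = sum(mean_depths.values())
    return {haplotig: depth / total for haplotig, depth in mean_depths.items()}

print(coverage_fractions({"haplotig_A": 40.2, "haplotig_B": 19.8}))
# ≈ {'haplotig_A': 0.67, 'haplotig_B': 0.33} -> consistent with 2 + 1 copies
```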
The haplotigs that represent 2/3 of the reads in the region they cover either represent a single haplotype present in two copies, or two very similar haplotypes that were erroneously clustered into one by nPhase. By looking at the distribution of heterozygous allele frequencies within each haplotig's corresponding cluster of reads, we show that only a few clusters are clearly enriched in allele frequencies around 50% (Fig. 8). Two of these clusters correspond to regions erroneously predicted to contain only one haplotig (Additional file 2: Fig. S2), confirming that the allele frequency within a haplotig cluster can reveal chimeric clusters. The absence of a noticeable enrichment in the allele frequencies of the other clusters is further evidence that the predictions made by nPhase are highly accurate.
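For illustration, a chimeric cluster can be flagged along these lines (hypothetical code; a position whose alleles split near 50/50 within a single cluster suggests two merged haplotypes):

```python
from collections import Counter

def suspicious_positions(cluster, low=0.4, high=0.6):
    """Flag positions whose top allele frequency sits near 50% within a cluster."""
    by_pos = {}
    for read in cluster:  # reads as {position: base}
        for pos, base in read.items():
            by_pos.setdefault(pos, Counter())[base] += 1
    flagged = []
    for pos, counts in by_pos.items():
        top_freq = counts.most_common(1)[0][1] / sum(counts.values())
        if low <= top_freq <= high:  # top allele near 50%: two haplotypes?
            flagged.append(pos)
    return flagged

cluster = [{1: "A", 2: "T"}, {1: "A", 2: "C"}, {1: "A", 2: "T"}, {1: "A", 2: "C"}]
print(suspicious_positions(cluster))  # [2] -> position 2 is split 50/50
```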
Implementing automated cleaning steps
We observed in our initial in silico results that nPhase outputs many shorter haplotigs, which consequently do not contain much phasing information. With our test on the Brettanomyces bruxellensis strain, we identified that haplotig cluster allele frequency can be used as a proxy for phasing quality. We also noted that, by design, nPhase only outputs unique haplotypes, which sometimes means that a region is phased into fewer copies than might naively be expected based on ploidy. Finally, we found that raw nPhase results can sometimes be overly fragmented.
To provide a method that begins addressing these issues, we developed a series of three steps intended to automatically clean nPhase’s raw results without significantly affecting accuracy:
1. Merging as many remaining haplotigs as possible together
2. Filtering out haplotigs that account for less than 1% of all coverage
3. Redistributing the reads of highly covered haplotigs
These steps are further described in the “Automated cleaning” section in the “Methods” section. We checked the effect of these steps on accuracy and number of haplotigs by applying them to the results of running nPhase with default parameters on all our virtual polyploids (10X and 20X coverage). Overall, our raw results had an average of 4.03 haplotigs per haplotype (Fig. 9a) and an average accuracy of 88.6% (Fig. 9b). After cleaning, we observed an average of 2.37 haplotigs per haplotype and an average accuracy of 87.4%. If we only consider our tests at 0.5% heterozygosity or higher, then our raw results had an average of 3.81 haplotigs per haplotype and an accuracy of 91.49%. After cleaning, we had an average of 1.87 haplotigs per haplotype with an accuracy of 91.44%.
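As an illustration of the second step, filtering by coverage share might be sketched as follows (hypothetical code and numbers, not the nPhase implementation):

```python
# A minimal sketch of cleaning step 2 described above (hypothetical code):
# drop haplotigs whose read clusters account for <1% of all phased coverage.
def filter_low_coverage_haplotigs(haplotig_coverage, min_fraction=0.01):
    """haplotig_coverage: {haplotig_name: total bases (or reads) in its cluster}."""
    total = sum(haplotig_coverage.values())
    return {h: c for h, c in haplotig_coverage.items() if c / total >= min_fraction}

cov = {"chr4_h1": 52_000, "chr4_h2": 47_000, "chr4_h3": 55_000, "chr4_noise": 900}
print(filter_low_coverage_haplotigs(cov))  # the 'chr4_noise' haplotig is dropped
```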
We tested this method on our GB54 results and observed a significantly more contiguous phasing (Fig. 10). The cleaning process successfully redistributed the reads in the previously highlighted region of chromosome 4 (Additional file 2: Fig. S3) and merged haplotigs together in a way that makes the results much easier to interpret. The number of haplotigs was reduced from 93 to 33, greatly reducing noise. We expect the accuracy not to have been negatively affected by this step, based on the way the read coverage of cleaned haplotig clusters is distributed (1/3, 1/3, 1/3 or 2/3, 1/3 coverage) and on the allele frequency distributions of the cleaned haplotigs (Additional file 2: Fig. S4).
Running the nPhase algorithm on chromosome 2 of the potato plant species Solanum tuberosum
We also tested nPhase on one chromosome of a larger, more repetitive plant genome. We used the autotetraploid Solanum tuberosum (potato) dataset generated for the WhatsHap polyphase paper [20] and the latest version of the DM1–3 516 R44 assembly as a reference (v6.1) [25]. We chose to limit this section to phasing chromosome 2. At 46 Mb, it is the shortest chromosome in the reference assembly (chromosome 1 is the largest at 88 Mb) and is 30× larger than the longest chromosome in S. cerevisiae (chromosome 4, 1.5 Mb). Based on Illumina data, we observe a 2.4% heterozygosity level for chromosome 2, more than twice as high as our most heterozygous test case (1%).
To reduce computation time, we used a randomly sampled subset of heterozygous variants, effectively phasing at a 0.5% heterozygosity level. The raw results of nPhase yielded 1129 haplotigs, which were reduced to 25 haplotigs by a modified version of the automated cleaning steps; the final gap filling step was disabled in light of the fragmented nature of our raw results (Fig. 11). Of the 25 cleaned haplotigs output, we find 90% of predicted variants in the 9 largest cleaned haplotigs, and 99% of predicted variants in the 12 largest cleaned haplotigs. The remaining 13 cleaned haplotigs account for less than 0.6% of predictions made and could reasonably be filtered out.
We note that the haplotigs obtained skip around the chromosome, which may be due either to structural variation or to long reads being erroneously mapped to the wrong repetitive regions, thereby giving the illusion of widespread structural variation. We also checked the phasing predictions for the 5 longest genes of chromosome 2 (Additional file 2: Fig. S5) and found them coherent with our expectations for an autotetraploid.