Haplotype-aware diplotyping from noisy long reads

Current genotyping approaches for single-nucleotide variations rely on short, accurate reads from second-generation sequencing devices. Presently, third-generation sequencing platforms are rapidly becoming more widespread, yet approaches for leveraging their long but error-prone reads for genotyping are lacking. Here, we introduce a novel statistical framework for the joint inference of haplotypes and genotypes from noisy long reads, which we term diplotyping. Our technique takes full advantage of linkage information provided by long reads. We validate hundreds of thousands of candidate variants that have not yet been included in the high-confidence reference set of the Genome-in-a-Bottle effort.

Electronic supplementary material: The online version of this article (10.1186/s13059-019-1709-0) contains supplementary material, which is available to authorized users.

Reviewer 1

I only suggest a minor revision that may help to clarify some points of the paper.

Minor revision:
- I would like to see a discussion, e.g. a statistical analysis, of how the novel tools behave in detecting heterozygous versus homozygous sites of the genotypes in the GIAB project.
- You say that the methods rely on computing and estimating bipartitions of the long reads. In other words, the approach to infer SNVs validates the values of reads at given positions, i.e. it corrects reads that are first clustered into one of two partitions, each partition corresponding to a haplotype that underlies the inferred genotype. From what I know, there are other methods in the literature that validate read values by correcting reads and bipartitioning them to infer haplotypes; see, for example, [9]. Could you please comment more on the existing literature and approaches for this purpose, i.e. assembling haplotypes?
- Since you mention work [9] as another approach in the same direction, I would like to see in the Methods section a deeper discussion of the existing literature on the topic, mentioning alternative combinatorial approaches to genotype assembly or haplotype assembly that lead to solving the diplotyping problem.
- You say that the approach in [9] does not scale to processing the whole genome, but your discussion of how your approach scales does not specify its memory requirements or how it scales with dataset characteristics such as, for example, coverage or the size of the dataset. A more extensive discussion is required.
- On page 9, the first two paragraphs are not very clear to me. You should give more details on what you mean by saying "paving the way for genotyping at intermediate coverage levels. We emphasize that our method operates at coverage levels that preclude the possibility of performing a de novo genome assembly, which, until now, was the most common use of long read data."
- Page 9: possibilty -> possibility

Reviewer 2

Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included?
Methods are interesting but poorly described.

Are the conclusions adequately supported by the data shown?
Yes.

Are sufficient details provided to allow replication and comparison with related analyses that may have been performed?
Yes.

Does the method perform better than existing methods (as demonstrated by direct comparison with available methods)?
Yes.
Is the method likely to be of broad utility? Is any software component easy to install and use?
As part of the WhatsHap software, yes, it will be used. I am actually a user of the current version of WhatsHap.

Is the paper of broad interest to others in the field, or of outstanding interest to a broad audience of biologists?
Yes: Long read sequencing technologies are becoming popular but there is a lack of statistical methods able to properly process them.

Comments to author:
In this paper, the authors describe and benchmark a new method to perform variant, genotype and haplotype calling from noisy long reads, such as those generated by the PacBio and Oxford Nanopore sequencing technologies. This new method aims at leveraging the ability of long reads to span multiple hets in order to improve genotype calling and produce reliable haplotype calls. There is indeed a real need for methods able to achieve this task on these data, and I appreciate the efforts made by the authors to provide an elegant solution to this problem, i.e. a probabilistic approach able to deal with the uncertainty inherent to this technology. However, I have multiple major concerns about the paper in its current form.

Paper organization
Overall, I find the paper really hard to read and understand, for two main reasons. First, it took me some time to actually understand that the authors do not compare their new approach to WhatsHap (a haplotype assembly method published in 2015, with a bioRxiv paper from 2016). Instead, they actually compare two different implementations of their HMM method, one stand-alone and the other implemented as a module for WhatsHap. The authors should state this much more clearly in their manuscript, for instance by giving a single name to their method, not two! Overall, this will definitely confuse people. Second, the method description really needs to be rewritten. The description in the main text is very brief (10 lines + 1 line for figure 1B!) and does not give any sense of what the method does, how it does it, and what novelty it brings. Please contrast it better with existing approaches. The formal description at the end of the manuscript is useful for people interested in reproducing the model but clearly not accessible to general readers of Genome Biology. This is why the description in the main text should definitely be improved.

Results
You need to explain why MarginPhase gives 96.6% accuracy while WhatsHap goes up to 99.78% for PacBio genotyping. This is just a massive gap. Why do the two implementations you propose give very different results? Did you study the error modes for each? This is very confusing. At least, you should provide a comparison of the GLs obtained by each approach and clearly interpret/describe the discordances between both.
Along this line, the authors should give guidelines for their two implementations. Which one should I use on my data?
One of the major advantages of having long reads resides in their ability to detect and type SVs. This is one of the key motivations for using these technologies. You should show some results about this. Along the same line, there are some short indels in GIAB, if I remember correctly. Why not show any results for short indels?
I would like to see the precision/recall for (i) variant discovery, (ii) genotyping and (iii) phasing as a function of coverage. By this I do not mean down-sampling your data as you did, but instead showing how well the method performs as a function of the number of reads it uses at each position, i.e. plotting Figure 3 as a function of coverage.
I would like to see a plot of genotyping accuracy as a function of GL certainty. It would be useful to see if the GLs you compute are well calibrated.
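The calibration check the reviewer asks for is commonly done by binning calls on the predicted genotype probability and comparing each bin's mean prediction to its empirical accuracy. A hypothetical sketch (an editorial illustration, not the authors' code; `gt_prob` is the model's probability for the called genotype, `correct` flags agreement with the truth set):

```python
import numpy as np

def calibration_bins(gt_prob, correct, n_bins=10):
    """Bin calls by predicted genotype probability and return, per
    non-empty bin, (mean predicted probability, empirical accuracy).
    Well-calibrated calls produce pairs lying near the diagonal."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # map each probability to a bin index; clip so prob == 1.0 stays in range
    idx = np.clip(np.digitize(gt_prob, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((gt_prob[mask].mean(), correct[mask].mean()))
    return rows
```

Plotting these pairs yields exactly the accuracy-vs-certainty curve requested here.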
I do not see the point of showing the performance obtained when using two long read approaches. In practice, nobody will do this except in a few very specific studies. The authors should therefore better motivate why they present this. Something more useful to me would be to assess the performance in terms of phasing/genotyping when you combine deep short Illumina reads with low-coverage very long reads.
When you measure phasing errors, how do you deal with the genotyping errors? Do you identify switch errors only at properly typed genotypes?
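One common convention behind this question, sketched here as a hypothetical illustration (not the authors' actual evaluation code), is to restrict the comparison to correctly genotyped heterozygous sites and count flips in haplotype agreement between consecutive sites:

```python
def switch_errors(true_hap, pred_hap, true_gt, pred_gt):
    """Count switch errors between consecutive heterozygous sites,
    considering only sites whose genotype was called correctly.
    Haplotype vectors give the allele (0/1) on the first haplotype;
    genotype vectors use 1 to mean heterozygous (0/1)."""
    sites = [i for i in range(len(true_gt)) if true_gt[i] == pred_gt[i] == 1]
    agree = [true_hap[i] == pred_hap[i] for i in sites]
    # a switch occurs wherever the agreement flips between adjacent sites
    return sum(1 for a, b in zip(agree, agree[1:]) if a != b)
```

Note that a prediction disagreeing at every site is just a relabeling of the two haplotypes and counts zero switches; only flips in agreement are errors.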
An optional point would be to show or mention how phasing accuracy is clearly independent of minor allele frequency. This shows general readers the advantages of experimental phasing over population-based phasing.

Reviewer #1

Comment 1.1: The paper presents a technique to infer genotypes and haplotypes from long reads using a statistical framework and an algorithmic approach inspired by haplotype assembly. Indeed, the main idea is to consider bipartitions of reads associated with the two inferred haplotypes and then to estimate the probability of a given genotype induced by a given bipartition of long reads. The authors develop two tools based on this technique. The problem is quite relevant, and their technique allows them to validate several candidate variants included in the reference set of the GIAB project. The implementation they propose is the result of extending WhatsHap, a previous tool for haplotype assembly from long reads, with statistical analysis of the bipartitions generated by WhatsHap. The work is quite complex from a combinatorial point of view since it combines several approaches: a graph-theoretical framework for validating the quality of the alignment of long reads, and statistical analysis. I recommend acceptance of the work.
We thank the reviewer for this positive evaluation of our work.
I only suggest a minor revision that may help to clarify some points of the paper.
Comment 1.2: I would like to see a discussion, e.g. a statistical analysis, of how the novel tools behave in detecting heterozygous versus homozygous sites of the genotypes in the GIAB project.

Response:
We thank the reviewer for raising this important point. We included an analysis of this in the supplement. The results confirm our expectation that heterozygous sites are more difficult to detect. For the PacBio data set using WhatsHap, for instance, the recall at homozygous sites is 99.28%, while it is only 93.77% at heterozygous sites (see Supplementary Tables S3 and S4 for full details). Likewise, the precision of detecting homozygous sites is higher (98.27%) than for heterozygous sites (96.78%). In summary, the F1 score is 98.77% for homozygous and 95.25% for heterozygous sites. In Nanopore data using MarginPhase, the difference between our ability to characterize homozygous vs. heterozygous sites is even more drastic, with an F1 score of 91.26% for homozygous and 72.02% for heterozygous variants, which is probably due to the increased noise levels in Nanopore data.

Comment 1.3: You say that the methods rely on computing and estimating bipartitions of the long reads. In other words, the approach to infer SNVs validates the values of reads at given positions, i.e. it corrects reads that are first clustered into one of two partitions, each partition corresponding to a haplotype that underlies the inferred genotype. From what I know, there are other methods in the literature that validate read values by correcting reads and bipartitioning them to infer haplotypes; see, for example, [9]. Could you please comment more on the existing literature and approaches for this purpose, i.e. assembling haplotypes?
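As an aside on the homozygous vs. heterozygous figures quoted in the response above: the F1 score is simply the harmonic mean of precision and recall, so the reported values can be checked directly (a quick editorial sanity check, not part of the manuscript):

```python
def f1(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Figures reported for WhatsHap on the PacBio data set:
hom_f1 = f1(0.9827, 0.9928)  # homozygous sites  -> ~0.9877
het_f1 = f1(0.9678, 0.9377)  # heterozygous sites -> ~0.9525
```

Both values round to the F1 scores stated in the response (98.77% and 95.25%).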
Response: We substantially extended our discussion of related work.
First, we point out the connection to the Minimum Error Correction (MEC) problem right at the beginning of the paper and cite five corresponding reviews. While MEC could theoretically also serve for genotyping, all tools that we are aware of make the "all heterozygous" assumption, i.e. they exclusively work on sites previously determined to be heterozygous.
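For readers unfamiliar with MEC: given reads restricted to SNV positions, the task is to flip a minimum number of entries so that the reads bipartition into two internally consistent haplotype groups. The following toy, exponential-time sketch illustrates the idea (an editorial illustration only, not the implementation used in the paper; `-` marks positions a read does not cover):

```python
from itertools import product

def mec_cost(reads, assignment):
    """Cost of a read bipartition: for each haplotype group and column,
    take the majority allele and count disagreements ('-' = uncovered)."""
    cost = 0
    n_cols = len(reads[0])
    for hap in (0, 1):
        group = [r for r, a in zip(reads, assignment) if a == hap]
        for j in range(n_cols):
            col = [r[j] for r in group if r[j] != '-']
            if col:
                cost += min(col.count('0'), col.count('1'))
    return cost

def solve_mec(reads):
    """Exhaustive MEC: try every bipartition of the reads
    (exponential in the number of reads, illustration only)."""
    best = min(product((0, 1), repeat=len(reads)),
               key=lambda a: mec_cost(reads, a))
    return best, mec_cost(reads, best)

# Four reads over four SNV positions; one sequencing error in read 2.
assignment, cost = solve_mec(["0011", "0010", "1100", "110-"])
```

Note how the bipartition alone says nothing about genotypes at sites where both groups agree, which is exactly why MEC-based tools need heterozygous sites as input.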
Second, we updated the manuscript with a more detailed review of the method by Guo et al. [9]. In their paper, the authors also evaluate their tool on PacBio data from NA12878 and report substantially worse performance (F1=86.6%) compared to what we observe for our method (F1=97.1%). To reproduce the comparison on exactly the same data set used by us, we attempted to run the application provided by Guo et al., for which only pre-compiled Java class files were included in the repository (no source code was present). Unfortunately, we did not succeed in running the tool on our data sets, and there was no documentation available. We reached out to the authors multiple times, and they indicated that they intended to update the repository in response to the problems we reported, but no changes have been made available as of yet (their last response was dated Nov 14). In its present state, the tool does not seem to be usable.
Third, in the meantime, two variant callers based on convolutional neural networks (CNNs) have appeared: a bioRxiv preprint by Luo et al. introduces "Clairvoyante", a long-read variant caller using CNNs, and a paper by Poplin et al. (Nat. Biotech., 2018) introduces "DeepVariant". The evaluations provided by these authors also compare PacBio variant calls on NA12878 to GIAB benchmark data and are hence comparable to our results. While DeepVariant produces considerably worse results, the results reported for Clairvoyante are comparable to ours. We want to stress, however, that it is presently unclear whether these numbers are overly optimistic estimates, since the learning approaches are in fact trained on data from GIAB (for a different individual and/or different chromosomes). So if the GIAB truth sets contain any systematic biases or errors, the learning approaches will adopt the same biases and hence evaluate favorably. While we cannot assess the extent to which this happens, we emphasize that such potential problems are absent from our evaluation. Furthermore, our method additionally reconstructs haplotypes and comes with a statistical model that is readily interpretable.
We now cite the corresponding performance statistics for DeepVariant and Clairvoyante in our paper to put the performance of our method in perspective. Beyond the three tools discussed above, we are not aware of any tools able to generate variant calls from noisy long reads. Hence, we think it is fair to conclude that our approach significantly improves over any published tools and is on a par with an upcoming tool for which the evaluation has not yet undergone peer review. At the same time, we offer a conceptually novel way of statistically modeling sequencing data from diploid samples, which we envision to serve as the basis for many future tools.

Comment 1.4:
Since you mention work [9] as another approach in the same direction, I would like to see in the Methods section a deeper discussion of the existing literature on the topic, mentioning alternative combinatorial approaches to genotype assembly or haplotype assembly that lead to solving the diplotyping problem.
Response: See above.

Comment 1.5: You say that the approach in [9] does not scale to processing the whole genome, but your discussion of how your approach scales does not specify its memory requirements or how it scales with dataset characteristics such as, for example, coverage or the size of the dataset. A more extensive discussion is required.

Response:
We apologize for the omission and have updated the manuscript with statistics on memory usage. Guo et al. only ran their method on chr1, and did not provide metrics for memory usage or CPU utilization / concurrent threads (only runtime). Unfortunately, we were not able to run this tool to perform our own measurements (see above).

Comment 1.6: On page 9, the first two paragraphs are not very clear to me. You should give more details on what you mean by saying "paving the way for genotyping at intermediate coverage levels. We emphasize that our method operates at coverage levels that preclude the possibility of performing a de novo genome assembly, which, until now, was the most common use of long read data."

Response: Thank you. We have provided an analysis in the supplement showing precision and recall as a function of read depth. Furthermore, we have changed the definition of "callable" for our method to include regions which have a minimum depth of 20 instead of 10.
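A depth-stratified evaluation of the kind referred to in this response could be sketched as follows (a hypothetical helper for illustration, not the scripts used for the supplement; each call is represented as a tuple of read depth, truth status, and whether the variant was called):

```python
def precision_recall_by_depth(calls,
                              depth_bins=((0, 10), (10, 20), (20, 40))):
    """calls: iterable of (depth, is_true_variant, was_called) tuples.
    Returns per-depth-bin (precision, recall); None where undefined."""
    out = {}
    for lo, hi in depth_bins:
        sub = [c for c in calls if lo <= c[0] < hi]
        tp = sum(1 for _, t, c in sub if t and c)
        fp = sum(1 for _, t, c in sub if not t and c)
        fn = sum(1 for _, t, c in sub if t and not c)
        prec = tp / (tp + fp) if tp + fp else None
        rec = tp / (tp + fn) if tp + fn else None
        out[(lo, hi)] = (prec, rec)
    return out

# Tiny example: a missed true variant at low depth, a false call at high depth.
calls = [(5, True, True), (5, True, False), (25, True, True), (25, False, True)]
res = precision_recall_by_depth(calls)
```

The same tuples could equally be filtered by a minimum-depth threshold to mimic a "callable" definition such as the depth-20 cutoff mentioned above.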

Reviewer #2
Comment 2.1: In this paper, the authors describe and benchmark a new method to perform variant, genotype and haplotype calling from noisy long reads as those generated by sequencing technologies PacBio and Oxford Nanopore. This new method aims at leveraging the ability of long reads to span multiple hets in order to improve genotype calling and produce reliable haplotype calls. There is indeed a real need for methods able to achieve this task on this data and I appreciate the efforts made by the authors to provide an elegant solution to this problem; i.e. a probabilistic approach able to deal with the uncertainty inherent to this technology.

Response:
We thank the reviewer for her/his general enthusiasm.

Comment 2.2: However, I have multiple major concerns about the paper in its current form.

Response:
We are thankful for the points raised. We believe that, by addressing them, we have substantially improved the presentation of methodology and results in our paper. We provide point-by-point responses below.

Comment 2.3: 1. Paper organization. Overall, I find the paper really hard to read and understand, for two main reasons. First, it took me some time to actually understand that the authors do not compare their new approach to WhatsHap (a haplotype assembly method published in 2015, with a bioRxiv paper from 2016).

Response:
We apologize for creating confusion here. We have updated the introduction to provide more background on haplotype phasing (see response to Comment 1.3) and now also clarify why existing tools for haplotyping (such as WhatsHap) are not suitable for genotyping. In brief, these tools only work on heterozygous sites given as input.
Comment 2.4: Instead, they actually compare two different implementations of their HMM method, one stand-alone and the other implemented as a module for WhatsHap. The authors should state this much more clearly in their manuscript, for instance by giving a single name to their method, not two! Overall, this will definitely confuse people.

Response:
We have updated the text to clarify our decision to provide two implementations. Historically, the two implementations emerged from the two groups (Paten/Marschall) working independently on this problem before we joined forces. We considered merging them into one unified tool, but decided against this, as the two implementations come with different strengths and weaknesses. Each employs different preprocessing steps, which results in each method performing better on one of the sequencing technologies. In addition, we strongly believe that having two independent implementations adds robustness to our results. We now clarify this in the text and provide clear guidance to users by recommending MarginPhase for ONT data and WhatsHap for PacBio data. We have restructured the Results section accordingly and now show these combinations in the main text and the less favorable combinations in the supplement.
Comment 2.5: Second, the method description really needs to be rewritten. The description in the main text is very brief (10 lines + 1 line for figure 1B!) and does not give any sense of what the method does, how it does it, and what novelty it brings. Please contrast it better with existing approaches. The formal description at the end of the manuscript is useful for people interested in reproducing the model but clearly not accessible to general readers of Genome Biology. This is why the description in the main text should definitely be improved.
Response: Thank you for this comment. We addressed it in the following ways: First, we substantially expanded our discussion of related literature, both on related approaches for haplotype phasing as well as on long-read variant calling, in the Background section (see our response to Comment 1.3). Second, we have extended our description of the method in the first section of the Results (Section 2.1, which now is 22 lines long). Third, we have improved the caption of Figure 1b, which was terse indeed. We think that these measures make the novelty of our approach more obvious to readers and provide readers with a high-level understanding of the method, without them having to read through the Method section at the end.

Comment 2.6: 2. Results
You need to explain why MarginPhase gives 96.6% accuracy while WhatsHap goes up to 99.78% for PacBio genotyping. This is just a massive gap. Why do the two implementations you propose give very different results? Did you study the error modes for each? This is very confusing. At least, you should provide a comparison of the GLs obtained by each approach and clearly interpret/describe the discordances between both.
Response: To elucidate this, we have included in the supplement a detailed analysis of the error profiles of both implementations with respect to homozygous and heterozygous variants (Tables S3 and S4), the relationship between read depth and accuracy (Figures S3 and S4), and genotype likelihoods (Figure S5). We now emphasize the differences between the two implementations (mostly in preprocessing of the data) already in Section 2.1, and point the reader to the corresponding sections in the Methods part where the full details are spelled out.

Comment 2.7: Along this line, the authors should give guidelines for their two implementations. Which one should I use on my data?

Response:
We have added a clear recommendation (Section 2.1, Section 4) that WhatsHap should be preferred for PacBio data and MarginPhase should be preferred for ONT data (also see response to Comment 2.4). Furthermore, we have moved the less relevant parts of the analysis to the supplement (i.e., WhatsHap on ONT data and MarginPhase on PacBio data), so as to provide a clear focus on the configurations we recommend to the user.

Comment 2.8: One of the major advantages of having long reads resides in their ability to detect and type SVs. This is one of the key motivations for using these technologies. You should show some results about this. Along the same line, there are some short indels in GIAB, if I remember correctly. Why not show any results for short indels?

Response:
We agree with the reviewer that indels and SVs are important variant classes that can potentially be accessed using long reads. However, from our previous studies on calling SVs from Oxford Nanopore data (Nat. Comms., 2017, doi:10.1038/s41467-017-01343-4) and from PacBio data (Human Genome Structural Variation Consortium, bioRxiv, 2018, doi: 10.1101,), we have to conclude that developing methods for handling SVs properly is a substantial research project in itself and, in our view, outside the scope of the present work. We plan to use the statistical model introduced here as a basis to attack the problem of SV genotyping in the future. Indels, in particular short tandem repeats (STRs), come with their own challenges, and nominating candidate alleles from (indel-error-rich) long reads is particularly difficult (especially using ONT data, where systematic indel errors are quite pronounced). Therefore, due to the error distributions in the input sequencing read types, our present methods are restricted to SNV detection, not including any short insertions or deletions. However, a given variant (consisting of a known REF and ALT allele sequence) can be re-genotyped using WhatsHap, including short indels. To assess this feature, we now additionally provide the results for re-genotyping given indel variants in the supplement.