We describe a probabilistic model for diplotype inference and use it in this paper primarily to find maximum posterior probability genotypes. The approach builds upon the WhatsHap approach [22] but incorporates a full probabilistic allele inference model into the problem. It has similarities to the approach proposed by Kuleshov [38], but here we frame the problem using hidden Markov models (HMMs).
Alignment matrix
Let M be an alignment matrix whose rows represent sequencing reads and whose columns represent genetic sites. Let m be the number of rows, let n be the number of columns, and let \(\mathbf{M}_{i,j}\) be the jth element in the ith row. For each column j, let \(\Sigma_{j} \subset \Sigma\) represent the set of possible alleles such that \(\mathbf{M}_{i,j} \in \Sigma_{j} \cup \{-\}\), the "−" gap symbol representing a site at which the read provides no information. We assume no row or column is composed only of gap symbols, an uninteresting edge case. An example alignment matrix is shown in Fig. 6. Throughout the following, we will be informal and refer to a row i or column j, it being clear from the context whether we are referring to the row or column itself or to its coordinate.
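To make the notation concrete, the following minimal Python sketch represents a toy alignment matrix as a list of rows, with "−" as the gap symbol; the encoding of alleles as '0'/'1' and the helper name `alphabet` are illustrative choices, not part of either implementation described below.

```python
# A toy alignment matrix M: 4 reads (rows) x 3 sites (columns).
# Alleles are encoded as '0' (reference) and '1' (alternative); '-' marks
# sites at which a read provides no information. Purely illustrative.
GAP = '-'

M = [
    ['0', '1', GAP],   # read 0 covers sites 0-1
    ['0', '1', '1'],   # read 1 covers all sites
    [GAP, '0', '0'],   # read 2 covers sites 1-2
    ['1', '0', '0'],   # read 3 covers all sites
]

def alphabet(M, j):
    """Sigma_j, here approximated as the set of alleles observed in column j."""
    return sorted({row[j] for row in M if row[j] != GAP})

print(alphabet(M, 0))  # ['0', '1']
```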
Genotype inference problem overview
A diplotype \(H = (H^{1}, H^{2})\) is a pair of haplotype segments; a haplotype segment \(H^{k} = H_{1}^{k}, H_{2}^{k}, \ldots, H_{n}^{k}\) is a sequence of length n whose elements represent alleles such that \(H_{j}^{k} \in \Sigma _{j}\). Let \(B = (B^{1}, B^{2})\) be a bipartition of the rows of M into two parts (sets): \(B^{1}\), the first part, and \(B^{2}\), the second part. We use bipartitions to represent which of the two haplotypes in a genome the reads came from. By convention, we assume that the first part of B contains the reads arising from \(H^{1}\) and the second part of B contains the reads arising from \(H^{2}\).
The problem we analyze is based upon a probabilistic model that essentially represents the (weighted) minimum error correction (MEC) problem [39, 40], while modeling the evolutionary relationship between the two haplotypes and so imposing a cost on bipartitions that create differences between the inferred haplotypes.
For a bipartition B, and making an i.i.d. assumption between sites in the reads:
$${}P(H | B, \mathbf{M}) = \prod_{j=1}^{n} \sum_{Z_{j} \in \Sigma_{j}} P\left(H_{j}^{1} | B^{1}, Z_{j}\right) P\left(H_{j}^{2} | B^{2}, Z_{j}\right) P(Z_{j}) $$
Here, \(P(Z_{j})\) is the prior probability of the ancestral allele \(Z_{j}\) of the two haplotypes at column j; by default, we can use a simple flat distribution over ancestral alleles (but see below). The posterior probability \(P(H_{j}^{k}|B^{k}, Z_{j}) = \)
$$\frac{ P\left(H_{j}^{k} | Z_{j}\right) \prod_{\left\{ i \in B^{k} : \mathbf{M}_{i,j} \not= -\right\}} P\left(\mathbf{M}_{i,j} | H_{j}^{k}\right)}{\sum_{Y_{j} \in \Sigma_{j}} P(Y_{j} | Z_{j}) \prod_{\left\{ i \in B^{k} : \mathbf{M}_{i,j} \not= -\right\}} P(\mathbf{M}_{i,j} | Y_{j})} $$
for \(k \in \{1,2\}\), where the probability \(P\left (H^{k}_{j} | Z_{j}\right)\) is the probability of the haplotype allele \(H^{k}_{j}\) given the ancestral allele \(Z_{j}\). For this, we can use a continuous time Markov model for allele substitutions, such as that of Jukes and Cantor [41], or some more sophisticated model that factors in the similarities between alleles (see below). Similarly, \(P\left (\mathbf {M}_{i,j} | H_{j}^{k}\right)\) is the probability of observing allele \(\mathbf{M}_{i,j}\) in a read given the haplotype allele \(H_{j}^{k}\).
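As a worked illustration of the two equations above, the sketch below computes \(P(H | B, \mathbf{M})\) for a fixed bipartition under a flat ancestral prior; the symmetric error rate `eps` and substitution rate `mu` are assumed placeholder parameters, not values from the paper.

```python
from math import prod  # Python 3.8+

GAP = '-'

def p_read_given_hap(obs, hap, eps=0.05):
    # P(M_ij | H_j^k): symmetric read error model with miscall probability eps (assumed).
    return 1.0 - eps if obs == hap else eps

def p_hap_given_anc(h, z, mu=0.001):
    # P(H_j^k | Z_j): a Jukes-Cantor-like substitution probability mu (assumed).
    return 1.0 - mu if h == z else mu

def p_hap_allele(M, j, part, h, z, sigma):
    # The displayed posterior P(H_j^k | B^k, Z_j).
    def lik(y):
        return prod(p_read_given_hap(M[i][j], y) for i in part if M[i][j] != GAP)
    denom = sum(p_hap_given_anc(y, z) * lik(y) for y in sigma)
    return p_hap_given_anc(h, z) * lik(h) / denom

def p_diplotype(M, H1, H2, B1, B2):
    # P(H | B, M) = prod_j sum_Z P(H_j^1|B^1,Z_j) P(H_j^2|B^2,Z_j) P(Z_j).
    total = 1.0
    for j in range(len(M[0])):
        sigma = sorted({row[j] for row in M if row[j] != GAP})
        total *= sum(p_hap_allele(M, j, B1, H1[j], z, sigma) *
                     p_hap_allele(M, j, B2, H2[j], z, sigma) / len(sigma)  # flat P(Z_j)
                     for z in sigma)
    return total
```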
The genotype inference problem we consider is finding for each site:
$$\underset{\left(H^{1}_{j}, H^{2}_{j}\right)}{\text{arg\,max}} P\left(H^{1}_{j}, H^{2}_{j} |\mathbf{M}\right) = \underset{\left(H^{1}_{j}, H^{2}_{j}\right)}{\text{arg\,max}} \sum_{B} P\left(H^{1}_{j}, H^{2}_{j} | B,\mathbf{M}\right) $$
i.e., finding the genotype \(\left (H^{1}_{j}, H^{2}_{j}\right)\) with maximum posterior probability for a generative model of the reads embedded in M.
A graphical representation of read partitions
For column j in M, row i is active if the first non-gap symbol in row i occurs at or before column j and the last non-gap symbol in row i occurs at or after column j. Let \(A_{j}\) be the set of active rows of column j. For column j, row i is terminal if its last non-gap symbol occurs at column j or if j=n. Let \(A^{\prime }_{j}\) be the set of active, non-terminal rows of column j.
Let \(B_{j} = \left (B_{j}^{1}, B_{j}^{2}\right)\) be a bipartition of \(A_{j}\) into a first part \(B_{j}^{1}\) and a second part \(B_{j}^{2}\). Let \(\mathbf{B}_{j}\) be the set of all possible such bipartitions of the active rows of j. Similarly, let \(C_{j} = \left (C_{j}^{1}, C_{j}^{2}\right)\) be a bipartition of \(A^{\prime }_{j}\) and \(\mathbf{C}_{j}\) be the set of all possible such bipartitions of the active, non-terminal rows of j.
For two bipartitions \(B = (B^{1}, B^{2})\) and \(C = (C^{1}, C^{2})\), B is compatible with C if \(B^{1} \cap (C^{1} \cup C^{2}) \subseteq C^{1}\) and, similarly, \(B^{2} \cap (C^{1} \cup C^{2}) \subseteq C^{2}\), i.e., the rows shared between the two bipartitions are placed in the same parts. Note that this relation is symmetric and reflexive, although not transitive.
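The definitions of active rows, terminal rows, and compatibility translate directly into code. A sketch follows, using 0-based column indices (the text uses 1-based) and frozensets for the parts:

```python
GAP = '-'

def active_rows(M, j):
    # A_j: rows whose first non-gap symbol is at or before column j and whose
    # last non-gap symbol is at or after column j (0-based here).
    act = set()
    for i, row in enumerate(M):
        cov = [k for k, x in enumerate(row) if x != GAP]
        if cov and cov[0] <= j <= cov[-1]:
            act.add(i)
    return act

def active_nonterminal_rows(M, j):
    # A'_j: active rows that are not terminal at column j; at the last column,
    # every active row is terminal, so A'_n is empty.
    n = len(M[0])
    return {i for i in active_rows(M, j)
            if j < n - 1 and any(x != GAP for x in M[i][j + 1:])}

def compatible(B, C):
    # B=(B1,B2) is compatible with C=(C1,C2) if rows present in both
    # bipartitions are assigned to the same part.
    (B1, B2), (C1, C2) = B, C
    shared = C1 | C2
    return (B1 & shared) <= C1 and (B2 & shared) <= C2
```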
Let \(G = (V_{G}, E_{G})\) be a directed graph. The vertices \(V_{G}\) are the set of bipartitions of both the active rows and the active, non-terminal rows for all columns of M, together with a special start and end vertex, i.e., \(V_{G} = \{ \mathrm {start, end} \} \cup (\bigcup _{j} \mathbf {B}_{j} \cup \mathbf {C}_{j})\). The edges \(E_{G}\) are a subset of the compatibility relationships, such that (1) for all j, there is an edge \((B_{j} \in \mathbf{B}_{j}, C_{j} \in \mathbf{C}_{j})\) if \(B_{j}\) is compatible with \(C_{j}\); (2) for all \(1 \le j < n\), there is an edge \((C_{j} \in \mathbf{C}_{j}, B_{j+1} \in \mathbf{B}_{j+1})\) if \(C_{j}\) is compatible with \(B_{j+1}\); (3) there is an edge from the start vertex to each member of \(\mathbf{B}_{1}\); and (4) there is an edge from each member of \(\mathbf{B}_{n}\) to the end vertex (note that \(\mathbf{C}_{n}\) is empty and so contributes no vertices to G). Figure 7 shows an example graph.
The graph G has a large degree of symmetry and the following properties are easily verified:
- For all j and all \(B_{j} \in \mathbf{B}_{j}\), the indegree and outdegree of \(B_{j}\) are both 1.
- For all j, the indegrees of all members of \(\mathbf{C}_{j}\) are equal.
- Similarly, for all j, the outdegrees of all members of \(\mathbf{C}_{j}\) are equal.
Let the maximum coverage, denoted maxCov, be the maximum cardinality of a set \(A_{j}\) over all j. By definition, maxCov ≤ m. Using the above properties, it is easily verified that (1) the cardinality of G (number of vertices) is bounded in terms of this maximum coverage, being less than or equal to \(2 + (2n-1)2^{\text{maxCov}}\), and (2) the size of G (number of edges) is at most \(2n \cdot 2^{\text{maxCov}}\).
Let a directed path from the start vertex to the end vertex be called a diploid path, \(D = (D_{1} = \mathrm{start}, D_{2}, \ldots, D_{2n+1} = \mathrm{end})\). The graph is naturally organized by the columns of M, so that \(D_{2j} = \left (B_{j}^{1}, B_{j}^{2}\right) \in \mathbf {B}_{j}\) for all 0<j≤n and \(D_{2j+1} = \left (C_{j}^{1}, C_{j}^{2}\right) \in \mathbf {C}_{j}\) for all 0<j<n. Let \(B_{D} = \left (B_{D}^{1}, B_{D}^{2}\right)\) denote a pair of sets, where \(B_{D}^{1}\) is the union of the first parts of the vertices \(D_{2}, \ldots, D_{2n}\) and, similarly, \(B_{D}^{2}\) is the union of the second parts of the vertices \(D_{2}, \ldots, D_{2n}\).
\(B_{D}^{1}\) and \(B_{D}^{2}\) are disjoint, because otherwise there would exist a pair of vertices within D that are incompatible, which is easily verified to be impossible. Further, because D visits a vertex for every column of M, it follows that the sum of the cardinalities of these two sets is m. \(B_{D}\) is therefore a bipartition of the rows of M, which we call a diploid path bipartition.
Lemma 1
The set of diploid path bipartitions is the set of bipartitions of the rows of M and each diploid path defines a unique diploid path bipartition.
Proof
We first prove that each diploid path defines a unique bipartition of the rows of M. For each column j of M, each vertex \(B_{j} \in \mathbf{B}_{j}\) is a different bipartition of the same set of active rows. \(B_{j}\) is by definition compatible with the diploid path bipartition of any diploid path that contains it, and it is incompatible with every other member of \(\mathbf{B}_{j}\). It follows that, for each column j, two diploid paths with the same diploid path bipartition must visit the same vertex in \(\mathbf{B}_{j}\) and, by identical logic, the same vertex in \(\mathbf{C}_{j}\); two such diploid paths are therefore equal.
There are \(2^{m}\) bipartitions of the rows of M. It remains to prove that there are \(2^{m}\) diploid paths. By the structure of the graph, the set of diploid paths can be enumerated backwards by traversing right-to-left from the end vertex by depth-first search, exploring each incoming edge of every encountered vertex. As stated previously, the only vertices with indegree greater than one are, for all j, the members of \(\mathbf{C}_{j}\), and each member of \(\mathbf{C}_{j}\) has the same indegree. For all j, the indegree of each member of \(\mathbf{C}_{j}\) is clearly \(\phantom {\dot {i}\!}2^{|A_{j}| - |A^{\prime}_{j}|}\): two to the power of the number of active, terminal rows at column j. The number of possible paths must therefore be \(\prod _{j=1}^{n} 2^{|A_{j}| - |A^{\prime}_{j}|}\). As each row is active and terminal in exactly one column, we obtain \(m = \sum _{j} \left(|A_{j}| - |A^{\prime}_{j}|\right)\) and therefore:
$$2^{m} = \prod_{j=1}^{n} 2^{|A_{j}| - |A^{\prime}_{j}|} $$
□
A hidden Markov model for genotype and diplotype inference
In order to infer diplotypes, we define a hidden Markov model which is based on G but additionally represents all possible genotypes at each genomic site (i.e., at each \(\mathbf{B}_{j}\) column). To this end, we define the set of states \(\mathbf{B}_{j} \times \Sigma_{j} \times \Sigma_{j}\), which contains a state for each bipartition of the active rows at position j and all possible assignments of alleles in \(\Sigma_{j}\) to the two parts. Additionally, the HMM contains a hidden state for each bipartition in \(\mathbf{C}_{j}\), exactly as defined for G above. Transitions between states are defined by the compatibility relationships of the corresponding bipartitions, as before. This HMM construction is illustrated in Fig. 8.
For all j and all \(C_{j} \in \mathbf{C}_{j}\), each outgoing edge has transition probability \(P(a_{1}, a_{2}) = \sum _{Z_{j}} P(a_{1} | Z_{j}) P(a_{2} | Z_{j})P(Z_{j})\), where \((B_{j}, a_{1}, a_{2}) \in \mathbf{B}_{j} \times \Sigma_{j} \times \Sigma_{j}\) is the state being transitioned to. Similarly, each outgoing edge of the start node has transition probability \(P(a_{1}, a_{2})\). The outdegree of all remaining nodes is 1, so these edges have transition probability 1.
The start node, the end node, and the members of \(\mathbf{C}_{j}\) for all j are silent states and hence do not emit symbols. For all j, the members of \(\mathbf{B}_{j} \times \Sigma_{j} \times \Sigma_{j}\) output the entries in the jth column of M that are different from "−". We assume every matrix entry to be associated with an error probability, which we can compute from \(P(\mathbf {M}_{i,j} | H_{j}^{k})\), defined previously. Based on this, the probability of observing a specific output column of M can be easily calculated.
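A sketch of the transition and emission weights just described follows, with `p_hap_given_anc` carried over from the earlier sketch and a flat ancestral prior; the per-entry error probability `p_err` is an assumed placeholder, not a tuned value from either implementation.

```python
GAP = '-'

def transition_prob(a1, a2, sigma, p_hap_given_anc):
    # P(a1, a2) = sum_Z P(a1|Z) P(a2|Z) P(Z): the weight on every edge entering
    # a state (B_j, a1, a2), here with a flat prior P(Z) = 1/|Sigma_j|.
    return sum(p_hap_given_anc(a1, z) * p_hap_given_anc(a2, z) / len(sigma)
               for z in sigma)

def emission_prob(M, j, B, a1, a2, p_err=0.05):
    # Probability of the non-gap entries of column j given state (B, a1, a2):
    # reads in the first part are emitted from allele a1, reads in the second
    # part from allele a2, each with a per-entry error probability p_err.
    p = 1.0
    for part, allele in ((B[0], a1), (B[1], a2)):
        for i in part:
            if M[i][j] != GAP:
                p *= (1.0 - p_err) if M[i][j] == allele else p_err
    return p
```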
Computing genotype likelihoods
The goal is to compute the genotype likelihoods for the possible genotypes at each variant position using the HMM defined above. Running the forward-backward algorithm yields forward and backward probabilities for all hidden states. Using these, the posterior distribution of a state \((B, a_{1}, a_{2}) \in \mathbf{B}_{j} \times \Sigma_{j} \times \Sigma_{j}\), corresponding to bipartition B and assigned alleles \(a_{1}\) and \(a_{2}\), can be computed as:
$$ \begin{aligned} P((B,a_{1}, a_{2})\vert \mathbf{M})\! =\! \frac{\alpha_{j}(B,a_{1}, a_{2}) \cdot \beta_{j}(B,a_{1}, a_{2})}{\sum\limits_{B^{\prime} \in \mathcal{B}(A_{j})} \sum\limits_{a_{1}^{\prime}, a_{2}^{\prime} \in \Sigma_{j}} \alpha_{j}\left(B^{\prime},a_{1}^{\prime},a_{2}^{\prime}\right)\cdot \beta_{j}\left(B^{\prime},a_{1}^{\prime},a_{2}^{\prime}\right)} \end{aligned} $$
(1)
where \(\alpha_{j}(B, a_{1}, a_{2})\) and \(\beta_{j}(B, a_{1}, a_{2})\) denote the forward and backward probabilities of the state \((B, a_{1}, a_{2})\) and \(\mathcal{B}(A_{j})\) denotes the set of all bipartitions of \(A_{j}\). The above term represents the probability of a bipartition \(B = (B^{1}, B^{2})\) of the reads in \(A_{j}\) together with alleles \(a_{1}\) and \(a_{2}\) assigned to the two parts. To finally compute the likelihood of a given genotype, one marginalizes over all bipartitions of a column and all allele assignments corresponding to that genotype.
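Given forward and backward tables, Eq. (1) and the final marginalization can be sketched as follows; the state keys and the `alpha_j`/`beta_j` dictionaries are an assumed representation.

```python
from collections import defaultdict

def genotype_likelihoods(alpha_j, beta_j):
    # alpha_j, beta_j: dicts mapping a state (B, a1, a2) -> forward/backward
    # probability at column j, with B a hashable bipartition (pair of frozensets).
    denom = sum(a * beta_j[s] for s, a in alpha_j.items())  # denominator of Eq. (1)
    gl = defaultdict(float)
    for (B, a1, a2), a in alpha_j.items():
        post = a * beta_j[(B, a1, a2)] / denom              # Eq. (1)
        gl[tuple(sorted((a1, a2)))] += post  # marginalize over bipartitions and orderings
    return dict(gl)
```

Summing posteriors by unordered allele pair corresponds to summing the states of the same color in Fig. 8, as in Example 1 below.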
Example 1
In order to compute genotype likelihoods for each column of the alignment matrix, posterior state probabilities corresponding to states of the same color in Fig. 8 need to be summed up. For the first column, adding up the red probabilities gives the genotype likelihood of genotype T/T, blue of genotype G/T, and yellow of G/G.
Implementations
We created two independent software implementations of this model, one based upon WhatsHap and one from scratch, which we call MarginPhase. Each uses different optimizations and heuristics that we briefly describe.
WhatsHap implementation
We extended the implementation of WhatsHap ([22], https://bitbucket.org/whatshap/whatshap) to enable haplotype-aware genotyping of bi-allelic variants based on the above model. WhatsHap focuses on re-genotyping variants, i.e., it assumes SNV positions to be given. In order to detect variants, a simple SNV calling pipeline was developed. It is based on samtools mpileup [42], which provides information about the bases supported by each read covering a genomic position. A set of SNV candidates is generated by selecting genomic positions at which the frequency of a non-reference allele exceeds a fixed threshold (0.25 for PacBio data, 0.4 for Nanopore data) and the absolute number of reads supporting the non-reference allele is at least 3. These SNV positions are then genotyped using WhatsHap.
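The candidate selection rule can be summarized in a few lines; the `pileup` data structure below is a hypothetical stand-in for parsed samtools mpileup output, not WhatsHap's actual parser.

```python
def snv_candidates(pileup, min_frac, min_support=3):
    # pileup: dict mapping position -> (reference base, dict base -> read count).
    # min_frac is 0.25 for PacBio and 0.4 for Nanopore, per the text.
    candidates = []
    for pos, (ref, counts) in sorted(pileup.items()):
        depth = sum(counts.values())
        for base, k in counts.items():
            if base != ref and k >= min_support and k / depth > min_frac:
                candidates.append((pos, ref, base))
                break
    return candidates
```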
Allele detection
In order to construct the alignment matrix, a crucial step is to determine whether each read supports the reference or the alternative allele at each of the n given genomic positions. In WhatsHap, this is done by re-aligning sections of the reads [16]. Given an existing read alignment from the provided BAM file, its sequence in a window around the variant is extracted. It is aligned to the corresponding region of the reference sequence and, additionally, to the alternative sequence, which is artificially produced by inserting the alternative allele into the reference. The alignment cost is computed using affine gap costs. Phred scores representing the probabilities of opening and extending a gap and of a mismatch in the alignment can be estimated from the given BAM file. The allele leading to the lower alignment cost is assumed to be supported by the read and is reported in the alignment matrix. If both alleles lead to the same cost, the corresponding matrix entry is "−". The absolute difference between the two alignment scores is assigned as a weight to the corresponding entry in the alignment matrix. It can be interpreted as a phred-scaled probability of the allele being wrong and is used in the computation of output probabilities.
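A simplified sketch of this re-alignment test follows; it uses unit edit costs for brevity where WhatsHap uses affine gap costs with phred-derived parameters, and all function names are illustrative.

```python
GAP = '-'

def edit_cost(a, b):
    # Unit-cost alignment distance (a simplification of the affine-gap scoring).
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def detect_allele(read_window, ref_window, var_offset, ref_allele, alt_allele):
    # Align the read window to the reference window and to the alternative
    # window (alt allele spliced in); report the supported allele and a weight.
    alt_window = (ref_window[:var_offset] + alt_allele +
                  ref_window[var_offset + len(ref_allele):])
    c_ref = edit_cost(read_window, ref_window)
    c_alt = edit_cost(read_window, alt_window)
    if c_ref == c_alt:
        return GAP, 0                      # uninformative: "-" in the matrix
    allele = '0' if c_ref < c_alt else '1'
    return allele, abs(c_ref - c_alt)      # weight: confidence in the call
```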
Read selection
Our algorithm enumerates all bipartitions of the reads covering a variant position and thus has a runtime exponential in the maximum coverage of the data. To ensure that this quantity is bounded, the read selection step previously implemented in the WhatsHap software is run before constructing the HMM and computing genotype likelihoods. Briefly, a heuristic approach described in [43] is applied, which selects phase-informative reads iteratively, taking into account the number of heterozygous variants covered by each read and its quality.
Transitions
Defining separate states for each allele assignment in \(\mathbf{B}_{j}\) enables easy incorporation of prior genotype likelihoods by weighting transitions between states in \(\mathbf{C}_{j-1}\) and \(\mathbf{B}_{j} \times \Sigma_{j} \times \Sigma_{j}\). Since there are two states corresponding to a heterozygous genotype in the bi-allelic case (0|1 and 1|0), the prior probability for the heterozygous genotype is spread equally between these states.
In order to compute such genotype priors, the same likelihood function underlying the approaches described in [44] and [45] was used. For each SNV position, the model computes a likelihood for the SNV to be absent, heterozygous, or homozygous, based on all reads that cover the site. Each read contributes a probability term to the likelihood function, computed according to whether it supports the reference or the alternative allele [44]. Furthermore, the approach accounts for statistical uncertainty arising from read mapping and has a runtime linear in the number of variants to be genotyped [45]. Prior genotype likelihoods are computed before read selection, so that information from all input reads covering a position can be incorporated.
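How the priors enter the transition weights can be sketched as follows, splitting the heterozygous mass between the 0|1 and 1|0 states; the dictionary layout is an assumed representation of the computed genotype likelihoods.

```python
def prior_weight(a1, a2, genotype_prior):
    # genotype_prior: dict mapping an unordered genotype to its prior, e.g.
    # {('0','0'): 0.3, ('0','1'): 0.5, ('1','1'): 0.2} (illustrative numbers).
    # The heterozygous prior is spread equally over the two ordered states.
    p = genotype_prior[tuple(sorted((a1, a2)))]
    return p / 2.0 if a1 != a2 else p
```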
MarginPhase implementation
MarginPhase (https://github.com/benedictpaten/marginPhase) is an experimental, open-source implementation of the described HMM written in C. It differs from the WhatsHap implementation in the method it uses to explore bipartitions and in the method it uses to generate allele support probabilities from the reads.
Read bipartitions
The described HMM scales exponentially with increasing read coverage. For typical 20–60× sequencing coverage (i.e., average number of active rows per column), it is impractical to store all possible bipartitions of the rows of the matrix. MarginPhase therefore implements a simple, greedy pruning and merging heuristic, outlined in recursive pseudocode in Algorithm 1.
The procedure computePrunedHMM takes an alignment matrix and returns a connected subgraph of the HMM for M that can be used for inference, recursively dividing the input alignment matrix into two if the number of rows (termed maxCov) exceeds a threshold t.
The sub-procedure mergeHMMs takes two pruned HMMs for two disjoint alignment matrices with the same number of columns and joins them together in the natural way, such that if at each site i there are \(\left |\mathbf {B}_{i}^{1}\right |\) states in HMM1 and \(\left |\mathbf {B}_{i}^{2}\right |\) states in HMM2, then the resulting HMM has \(\left |\mathbf {B}_{i}^{1}\right | \times \left |\mathbf {B}_{i}^{2}\right |\) states. This is illustrated in Fig. 9. In the experiments reported here, t=8 and v=0.01.
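Algorithm 1 itself is not reproduced here; the sketch below captures its recursive structure under simplifying assumptions: the matrix is split by rows rather than by per-column coverage, states track only bipartitions (not allele assignments), and the posterior used for pruning is stubbed out.

```python
from itertools import product

def build_hmm(rows, n_cols):
    # One layer per column holding every ordered bipartition of the given rows;
    # a stand-in for the full HMM construction.
    rows = list(rows)
    states = []
    for mask in range(2 ** len(rows)):
        p1 = frozenset(r for k, r in enumerate(rows) if mask >> k & 1)
        states.append((p1, frozenset(rows) - p1))
    return [list(states) for _ in range(n_cols)]

def merge_hmms(h1, h2):
    # mergeHMMs: join two pruned HMMs over disjoint row sets column by column,
    # giving |B_i^1| x |B_i^2| states per column, as in Fig. 9.
    return [[(a1 | b1, a2 | b2) for (a1, a2), (b1, b2) in product(c1, c2)]
            for c1, c2 in zip(h1, h2)]

def prune(hmm, v, posterior=lambda state: 1.0):
    # Drop states whose posterior probability (from forward-backward; stubbed
    # here to 1.0) falls below the threshold v.
    return [[s for s in col if posterior(s) >= v] for col in hmm]

def compute_pruned_hmm(rows, n_cols, t=8, v=0.01):
    # computePrunedHMM: recursively halve the row set while it exceeds t,
    # then merge and re-prune the sub-HMMs.
    rows = list(rows)
    if len(rows) > t:
        half = len(rows) // 2
        return prune(merge_hmms(compute_pruned_hmm(rows[:half], n_cols, t, v),
                                compute_pruned_hmm(rows[half:], n_cols, t, v)), v)
    return prune(build_hmm(rows, n_cols), v)
```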
Allele supports
In MarginPhase, the alignment matrix initially has a site for each base in the reference genome. To generate the allele supports, for each read and each reference base it overlaps (according to an initial guide alignment extracted from the SAM/BAM file), we calculate the posterior probability of each possible nucleotide (i.e., { 'A', 'C', 'G', 'T' }) using the implementation of the banded forward-backward pairwise alignment described in [46]. Gaps are ignored and treated as missing data. This approach allows summation over all alignments within the band. Given the supports for each reference base, we then prune the set of reference bases considered to those with greater than (by default) three expected non-reference alleles. This expectation is simply the sum of non-reference allele probabilities over the reads. This reduces the number of considered sites by approximately two orders of magnitude, greatly accelerating the HMM computation.
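The expected non-reference allele filter is a straightforward sum; the nested-list layout of the per-read posteriors below is an assumed representation.

```python
def candidate_sites(posteriors, ref_bases, min_expected_nonref=3.0):
    # posteriors[j] is a list, one entry per read overlapping site j, of dicts
    # {'A': p, 'C': p, 'G': p, 'T': p} from the banded forward-backward aligner.
    # Keep sites whose expected non-reference allele count exceeds the cutoff.
    keep = []
    for j, (ref, site) in enumerate(zip(ref_bases, posteriors)):
        expected_nonref = sum(1.0 - probs.get(ref, 0.0) for probs in site)
        if expected_nonref > min_expected_nonref:
            keep.append(j)
    return keep
```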
Substitution probabilities
We set the read error substitution probabilities, i.e., \(P\left (\mathbf {M}_{i,j} | H_{j}^{k}\right)\), empirically and iteratively. Starting from a flat 1% substitution probability, we generate a maximum-likelihood read bipartition and pair of haplotypes; we then re-estimate the read error probabilities from the differences between the reads and the haplotypes. We then rerun the model and repeat the process to derive the final probabilities. For the haplotype substitution probabilities, i.e., \(P\left (H_{j}^{k} | Z_{j}\right)\), we use substitution probabilities of 0.1% for transversions and 0.4% for transitions, reflecting the facts that transitions are empirically about twice as likely as transversions overall and that there are twice as many possible transversions.
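The re-estimation step of the iterative scheme amounts to counting mismatches between reads and their assigned haplotypes; a sketch, with the full inference pass (bipartitioning and haplotype calling) left out.

```python
GAP = '-'

def reestimate_error(M, H, assignment):
    # One re-estimation step: the empirical mismatch rate between reads and the
    # haplotypes they were assigned to. H = (H1, H2) is the current ML haplotype
    # pair and assignment[i] in {0, 1} gives the part of read i.
    mismatches = total = 0
    for i, row in enumerate(M):
        hap = H[assignment[i]]
        for j, x in enumerate(row):
            if x != GAP:
                total += 1
                mismatches += (x != hap[j])
    return mismatches / total if total else 0.0

# Per the text: start at a flat 1% error rate, infer the ML bipartition and
# haplotypes, call reestimate_error, then rerun the model with the new rate.
```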
Phase blocks
MarginPhase divides the read-partitioning HMM into phase sets based on the number of reads spanning adjacent likely heterozygous sites. The bipartitioning is performed on each of these phase sets individually. MarginPhase's output includes a BAM file which encodes the phasing of each read, including which phase set it is in, which haplotype it belongs to, and which part of its aligned portion falls into each phase set. Reads which span a phase set boundary have information for both phase sets encoded in them.