CellPhy: accurate and fast probabilistic inference of single-cell phylogenies from scDNA-seq data

We introduce CellPhy, a maximum likelihood framework for inferring phylogenetic trees from somatic single-cell single-nucleotide variants. CellPhy leverages a finite-site Markov genotype model with 16 diploid states and considers amplification error and allelic dropout. We implement CellPhy into RAxML-NG, a widely used phylogenetic inference package that provides statistical confidence measurements and scales well on large datasets with hundreds or thousands of cells. Comprehensive simulations suggest that CellPhy is more robust to single-cell genomics errors and outperforms state-of-the-art methods under realistic scenarios, both in accuracy and speed. CellPhy is freely available at https://github.com/amkozlov/cellphy. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-021-02583-w.


Additional file 1
CellPhy: accurate and fast probabilistic inference of single-cell phylogenies from scDNA-seq data Alexey Kozlov, Joao Alves, Alexandros Stamatakis and David Posada

Contents
Table S1. Simulation scenarios.                       n/a n/a n/a n/a n/a n/a n/a n/a 5, 30, 100 5 5 Coverage overdispers ion n/a n/a n/a 5 5 5 Sequencing error n/a n/a n/a 0, 0.01, 0.05 0/0.01 0.01 Amplificatio n error mean* n/a n/a n/a 0, 0.05, 0.10 0/0.05 0.05 Amplificatio n error variance n/a n/a n/a 0.01 0.01 0.01 Allelic imbalance n/a n/a n/a 0.50 0.50 0.50 Haploid coverage reduction n/a n/a n/a 0.   Remarks: Although not single-cell specific, HaplotypeCaller is still widely popular within the single-cell genomics community. B In a given infinitesimal amount of time (Δt), only one mutation is allowed in one of the two homologous chromosomes. These instantaneous rates form the Q matrix (Equation 2 in the main text) and contain 0 values when the corresponding genotypes differ at more than one nucleotide. C Three cells are sampled from a population of cells dividing asexually will have a specific genealogy (in bold). After cell division, cells will give rise to 0, 1, or 2 daughter cells (cells that die after division are not represented). D From the observed genotypes, we try to estimate the history of these three cells. Internal nodes are not observed, but the ancestral genotype at these nodes (in gray) can be estimated. More than one change can occur along any branch of length t (formed by infinite Δt). The transition probability matrix P(t) only considers the nucleotide at the beginning and at the end of a branch (Equation 3 in the main text), and all its entries are positive.

Fig. S2. Preliminary assessment of phylogenetic accuracy for all methods.
Results for all methods were obtained for two scenarios of Simulation 1 ("target-ISM"). Datasets consisted of 40 tumor cells plus one healthy, with 250 SNVs. Accuracy was evaluated under two different levels of genotype error (ERR), allelic dropout (ADO), with the "missing" genotype recoding strategy. Phylogenetic accuracy is defined as 1 -nRF (see Methods). Results based on 20 replicates. Boxplots were generated using the ggplot2 R package (https://ggplot2.tidyverse.org) with default parameters. The lower and upper hinges correspond to the first and third quartiles. The upper whisker extends from the hinge to the largest value no further than 1.5 * IQR (IQR is the interquartile range or distance between the first and third quartiles). The lower whisker extends from the hinge to the smallest value at most 1.5 * IQR of the hinge. Data beyond the end of the whiskers are called "outlying" points and are plotted individually.     All methods use the ML genotypes except CellPhy-GL, which uses the genotype likelihoods. Phylogenetic accuracy is defined as 1 -nRF (see Methods). AMP is the amplification error rate, SEQ is the sequencing error rate, and ADO is the allelic dropout rate. See Fig. S2 for an explanation of the boxplots. Phylogenetic accuracy is defined as 1 -nRF (see Methods). AMP is the amplification error rate, SEQ is the sequencing error rate, and ADO is the allelic dropout rate. See Fig. S2 for an explanation of the boxplots.

Fig. S9. Phylogenetic accuracy in Simulation 5 ("NGS-doublet").
Data simulated under mutational signature S2 and with a 5x sequencing depth. All methods use the ML genotypes except CellPhy-GL, which uses the genotype likelihoods. Phylogenetic accuracy is defined as 1 -nRF (see Methods). DBL is the doublet rate, AMP is the amplification error rate, SEQ is the sequencing error rate, and ADO is the allelic dropout rate. See Fig. S2 for an explanation of the boxplots.             Supplementary Note 1. Genotype error model P (N | M) is the probability of observing the single-cell genotype N after sequencing, given the true genotype M, diploid and biallelic. We consider two types of technical errors resulting in a wrong genotype, allelic dropout (ADO) and amplification/sequencing error (ERR), which occur at rates δ and ε, respectively. Note that we allow for the presence of both ADO and ERR in the same observed genotype.
Allelic dropout occurs during single-cell whole-genome amplification (scWGA) when one of the two alleles is not amplified and cannot be represented in the observed data. ADO implies a single allele, as otherwise the genotype is "missing". Thus, the rate δ is the probability that the amplification of one or the other allele has failed, and therefore that we observe the homozygous genotype defined by the amplified allele. Given the phased genotype a|b, and with "_ " indicating the dropped allele: ADO rate (δ) = P (_|b | a|b) + P (a|_ | a|b) An amplification/sequencing error occurs when the observed allele is not the true allele. Given that the ERR rate tends to be small, we assume a maximum of one ERR per genotype. Specifically:

Phased genotypes error model
Under these assumptions, there are seven scenarios (I-VII) with non-zero probability for the calculation of P (N | M) for phased genotypes, with alleles a-d: If the true genotype is homozygous For example, P (AA | AA). This can happen in two ways: (1) without ADO (i.e., 1 -δ) and without ERR (i.e., 1 -ε), or (2) after ADO in either allele (δ) and without ERR in the non-dropped allele (i.e., 1 -½ ε).
V. P (c|c | a|b) = δ * ½ ⅓ ε = ⅙ δε For example, P (CC | AT). Because we ignore the possibility of two ERR in the same genotype, this can only occur if there is ADO in either allele (i.e., δ), followed by an ERR in the non-dropped allele (i.e., ½ ⅓ ε).

Supplementary Note 2. Approximate model of evolution for unphased diploid genotypes with ten states
Current techniques for producing scDNA-seq data do not reveal the phase of the genotypes (i.e., we do not know which allele is located in the maternal or paternal chromosome). We have also implemented a specific model for unphased genotypes with only ten states that speed up the calculations. However, for unphased states, the probability of changing between a homozygous and a heterozygous genotype is not reversible, as, in principle, the change from genotype aa to genotype ab is twice more probable (either allele can change) than from ab to aa (change can occur only in allele b).
Considering this asymmetry would result in a non-reversible Q matrix, which would yield the calculation of the tree likelihood much more complex and prohibitively slow, as it would require rooted trees. As a compromised, approximate solution, we implemented a GT10 model based on a reversible Q matrix (here called Q 10 ): In this case, we need to estimate five nucleotide exchangeabilities ( = r(A↔C), = r(A↔G), = r(A↔T), = r(C↔G), = r(C↔T); let = r(G↔T) = 1) and nine stationary unphased genotype frequencies (π A/A , π A/C , π A/G , π A/T , π C/C , π C/G , π C/T , π G/G , π G/T ; π T/T = 1 -∑π a/b ). Regarding the error model, the definitions of ADO, ERR and P (N | M) are the same as for the GT16, except for P (a|b | aa) = (1 -δ) * ⅓ ε.
In simulations, this (wrong) reversibility assumption did not affect performance; there was no decrease in accuracy, and the calculations were ~2X faster than for the GT16 model (see Fig. S22).

Supplementary Note 3. Standard phylogenetic likelihood calculations on DNA sequence alignments
Let us consider how we compute the phylogenetic likelihood on a standard multiple DNA sequence alignment (MSA). For further details, we refer the reader to Felsenstein [51,52] and Yang [73]. The calculation of the phylogenetic likelihood for our genotype model (see Section "phylogenetic likelihood" in the main text) is precisely analogous. Given an MSA comprising the sequences under study, a 4x4 instantaneous rate matrix Q, the ability to compute the corresponding 4x4 transition probability matrix P t , and the stationary frequency vector π for the four nucleotides, we calculate the likelihood as follows.
We assume that MSA sites evolve independently of each other. Hence, given the MSA, the overall likelihood of the tree is the product over the per-site likelihoods. Thus, it suffices to consider how to compute the likelihood for a single MSA site. Given a fixed tree topology, with fixed branch lengths and known inner/ancestral states for one MSA site, we simply compute the per-site likelihood as the product over all transition probabilities along the tree branches, times the stationary frequency of the root state. While the ancestral nucleotide states are typically not known, we can still calculate the likelihood of the site as the sum over the per-site likelihoods for all possible assignments of nucleotides (i.e., sum over all possible evolutionary histories that could have generated the data, given the tree) to the ancestral states of the given tree topology.
While this, at first glance, appears to be computationally intense (e.g., for a tree with two ancestral nodes, there already exist 4^2 distinct possible assignments of nucleotides to inner nodes of the tree), the likelihood on such a tree can be efficiently computed via the so-called Felsenstein pruning algorithm. The key idea of this algorithm is to calculate the likelihood bottom-up, that is, from the tips toward the root of the tree, and to store intermediate results, so-called conditional likelihood vectors (CLVs), at each inner node of the tree (Fig. S16). CLVs essentially summarize the signal stemming from the subtree they root. That is, they tell us how likely it is to observe an A, C, G, or T, given (conditional on) the subtree they represent. Thus, every inner state in our calculations for one single MSA site consists of a vector containing four conditional likelihoods for, A, C, G, and T, respectively. At the tips, we initialize this vector to (1.0, 0.0, 0.0, 0.0) if we have an A, to (0.0, 1.0, 0.0, 0.0) if we have a C, etc. as the nucleotide state is known and assuming that we are not uncertain about its state.
However, under a single, uniform sequencing error ε, for instance, we can initialize the CLV for A at the tip of the tree as (1.0 -3ε, ε, ε, ε). This flexibility is used for modeling genotype errors as presented in Section 'single-cell genotype errors' of the main text. The Felsenstein pruning algorithm allows us to compute the likelihood of a given tree, with given branch lengths, and given evolutionary rates specified in Q. Now, to obtain the maximum likelihood (ML) score for such a fixed tree topology, we need to optimize the free parameters of the model, that is, the branch lengths and the rates in Q concerning the likelihood using appropriate numerical optimization routines.
Finally, we also need to find the tree topology with the best ML score, which constitutes an NP-hard optimization problem (i.e., there are no algorithms for finding the ML tree in polynomial runtime as a function of the number of sequences involved) [100]. In layman's terms, this means that we are simply not able to find the globally best ML tree, as there are too many possible tree topologies that would have to be scored. For example, for 50 sequences, there already exist 2.84 x 10 74 distinct alternative tree topologies. Therefore, to find the tree with the best ML score, we use ad hoc heuristic search strategies that search through this enormous tree space in a "clever" way and return a tree with a "good" score. Most ML tree search strategies imply constructing an initial tree and then apply successive changes ("moves") to the tree topology to find a tree with a better ML score. The three fundamental types of tree moves are Nearest Neighbour Interchange (NNI), Subtree Pruning and Re-grafting (SPR), and Tree Bisection and Reconnection (TBR). Indeed, these heuristics do not offer a guarantee that we find the ML tree, although we typically refer to the best tree found as the ML tree. We refer the reader to Stamatakis et al.
Furthermore, suppose our matrix Q is time-reversible. In that case, we can root our tree at any branch or node and will always obtain the identical analytical likelihood score for all possible rooting locations. This is computationally convenient, as, for a given tree, we do not need to determine the optimal root placement to compute its ML score. Hence, the output of phylogenetic inferences under time-reversible models is always an unrooted binary tree, as any rooting will be mathematically meaningless.
Adapting the nucleotide substitution model to a model with more states is, in essence, straight-forward, as precisely the same computational procedure can be used to analyze protein data with 20 states, or our model with ten states here (see corresponding sections on the genotype error and likelihood model in the main text).