The CellPhy model
Model of nucleotide substitution for phased/unphased diploid genotypes
We developed a substitution model for diploid genotypes akin to those typically used for DNA sequences in organismal phylogenetics ([see [51]). Specifically, we built a finitesite, continuoustime, Markov model of genotype evolution considering all possible 16 phased diploid states Γ= {AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT}, in which SNVs are independent and identically distributed. This model, named GT16, is defined by a rate matrix Q that contains the instantaneous transition rates qX↔Y among genotypes X and Y. For computational convenience, we assume a timereversible process, in which case the Q matrix is the product of a symmetric exchangeability matrix R (rX↔Y = rY↔X) and a diagonal matrix of stationary genotype frequencies π_{X}:
$$Q=R\cdotp \mathit{\operatorname{diag}}\left({\pi}_{A\mid \mathrm{A}},{\pi}_{A\mid C},\dots, {\pi}_{T\mid T}\right)$$
(1)
We assume that only one of the two alleles in a genotype can change in an infinitesimal amount of time. Furthermore, because maternal and paternal chromosomes evolve independently in somatic cells, we also assume that the instantaneous transition rate of a given allele a in genotype X to allele b in genotype Y does not depend on the identity of the homologous allele n within the respective genotypes. In other words, we assume r(na↔nb) = r(b↔a). Conveniently, these assumptions significantly reduce the number of free parameters in the Q matrix, as we only need to consider five nucleotide exchangeabilities (α = r(A↔C), β = r(A↔G), γ= r(A↔T), κ = r(C↔G), λ = r(C↔T); let μ = r(G↔T) = 1) and 15 stationaryphased genotype frequencies (π_{AA}, π_{AC}, π_{AG}, π_{AT}, π_{CA}, π_{CC}, π_{CG}, π_{CT}, π_{GA}, π_{GC}, π_{GG}, π_{GT}, π_{TA}, π_{TC}, π_{TG}; π_{TT}= 1  ∑π_{X}):
(2)
The Q matrix gives us the probabilities of changing from one genotype to another in an infinitely small time interval. When two genotypes differ at more than one nucleotide, these rates are 0. To calculate the tree likelihood, we first need to compute the change probabilities from the beginning to the end of each branch. The probabilities of changing from a given genotype to another along a branch of length t (in units of expected number of mutations per SNV site) are given by:
The resulting matrix P(t) is known as the transition probability matrix, and all its entries are (small) positive values. For further details, we refer the reader to Felsenstein [51] and Yang [73].
Notably, the GT16 model can also work with unphased genotype data simply by assigning the same relative likelihood at the tips of the tree to the two possible resolutions of the phase for the observed genotype (see Table S2). This flexibility is relevant because current scDNAseq techniques do not reveal the phase of the genotypes (i.e., we do not know which allele is located in the maternal or paternal chromosome). Therefore, in our simulations (see below), the simulated genotypes will always be unphased.
In addition to the GT16 model, we also implemented a model for unphased genotypes with only ten states, called GT10 (see Supplementary Note 2 for details). The GT10 model is computationally less expensive, approximately twice as fast as GT16. To maintain the reversibility assumption—otherwise, the calculations are much more complex—the GT10 model assumes that the probability of change between homozygous and heterozygous genotypes is equivalent. However, this is incorrect, as the change from homozygote to heterozygote is twice as likely as the change from heterozygote to homozygote. Despite this theoretical flaw, GT10 and GT16 showed very similar tree inference accuracy in our experiments.
Singlecell genotype error model
We assume that the true genotypes are always diploid and biallelic. To incorporate errors in the observed genotypes arising during singlecell wholegenome amplification (scWGA) and sequencing, we consider two free parameters: the allelic dropout (ADO) rate (δ) and the amplification/sequencing error (ERR) rate (ε).
Allele dropout occurs during scWGA when one of the two alleles is not amplified and is absent from the sequencing library. ADO always implies a single allele in our model, as if both alleles drop, no reads would be available, and no genotype could be observed (i.e., it would be “missing”). Therefore, δ is the probability that the amplification of one or the other allele has failed and thus that we observe the homozygous genotype defined by the amplified allele. Given the phased genotype ab, and with “_” indicating the dropped allele, we define the ADO rate as:
$$\delta =\mathrm{P}\left(\_\leftb\lefta\leftb\right.\right.\right.\right)+\mathrm{P}\left(a\left\_\lefta\leftb\right.\right.\right.\right),$$
(4)
where P (YX) is the probability of observing genotype Y after sequencing, given the true genotype X.
Genotype errors other than ADO can result from polymerase errors during scWGA, in the course of sequencing, variant calling, or generally represent an incorrect allele. Importantly, we assume a maximum of one ERR per genotype. Given that ε tends to be small, in the order of 10^{3} to 10^{5} [74], the probability of two ERR in one genotype, ε^{2}, is negligible. Specifically, ε is the probability that allele a will be observed as another allele b ≠ a:
$$\varepsilon =\mathrm{P}\left(b\lefta\right.\right)$$
(5)
Note that we allow for the presence of both ADO and ERR in the same observed genotype. Under these assumptions, if the true genotype is homozygous aa, for phased genotypes P (YX) becomes:
$$P\space(a\vert a \space\vert\space a \vert a)=1\varepsilon +\space\raisebox{1ex}{$1$}\!/ \!\space\raisebox{1ex}{$2$}\space\delta \varepsilon$$
(6)
$$P\space(a\vert b \space\vert\space a \vert a)=\left(1\delta \right)\cdot\space\raisebox{1ex}{$1$}\!/\space \!\raisebox{1ex}{$6$}\space\varepsilon$$
(7)
$$P\space(b\vert b \space\vert\space a \vert a)=\space\raisebox{1ex}{$1$}\!/ \!\space\raisebox{1ex}{$6$} \space \delta \varepsilon$$
(8)
Likewise, if the true genotype is heterozygous ab, P (YX) can be shown to be:
$$P\space(a\vert a \space\vert\space a \vert b)=\space \raisebox{1ex}{$1$}\!/ \!\space\raisebox{1ex}{$2$} \space\delta \space + \space \raisebox{1ex}{$1$}\!/ \!\space\raisebox{1ex}{$6$}\space\varepsilon \space  \space\raisebox{1ex}{$1$}\!/ \!\space\raisebox{1ex}{$3$}\space \delta \varepsilon$$
(9)
$$P\space(c\vert c \space\vert\space a\vert b)=\space \raisebox{1ex}{$1$}\!/ \!\space \raisebox{1ex}{$6$} \space \delta \varepsilon$$
(10)
$$P\space(a\vert c \space\vert\space a\vert b)=(1\delta)\cdot \space\raisebox{1ex}{$1$}\!/ \space\!\raisebox{1ex}{$6$}\space\varepsilon$$
(11)
$$P\space(a\vert b \space\vert\space a\vert b)=(1\delta)\cdot (1\varepsilon )$$
(12)
For the remaining scenarios, given the assumptions of the error model, P (YX) is zero. See Supplementary Material Note 1 for a detailed explanation.
Given the observed genotype Y for the SNV i and cell j, the likelihood of a genotype X being the true genotype at this position becomes:
$${L}_i^j\left(X\leftY\right.\right)=P\left(Y\leftX\right.\right),\forall X\in \Gamma$$
(13)
This follows from Bayes’ theorem assuming equal prior genotype probabilities.
For missing data, we consider all possible genotypes to be equally likely, which is a standard assumption in likelihoodbased phylogenetic inference. In this case:
$${L}_i^j\left(X\ \ missing\right)=P\ \left( missing\ \ X\right)=1,\forall X\in \Gamma$$
(14)
Phylogenetic likelihood
We consider the evolutionary history of cells as an unrooted binary tree, where τ is a tree topology, and t is a vector of branch lengths. We define the phylogenetic likelihood of the cell tree T as the conditional probability of the observed SNV matrix S given the substitution model M with parameters θ and T:
$$L\left(T,M,\theta \right)=P\left(S\ \ T,M,\theta \right)$$
(15)
We assume that genomic sites evolve independently under model M. Therefore, the probability of observing the SNV matrix S is the product over the independent probabilities for all individual SNVs, S_{i}:
$$L(T,M,\theta )={\displaystyle\prod}_{i=1}^s \space P({S}_i \space\vert\space T,M,\theta)$$
(16)
where s is the total number of SNVs. For numerical reasons, that is, to avoid floatingpoint underflow, in practice, we calculate the loglikelihood score:
$$\log L(T,M,\theta)={\sum}_{i=1}^s \log P({S}_i \space \vert \space T,M,\theta)$$
(17)
We can efficiently compute the loglikelihood score of a given tree with the standard method called Felsenstein’s pruning algorithm [52]. Let us place an additional imaginary node, called virtual root, into an arbitrary branch of the unrooted tree T at a random position along that branch. Without loss of generality, if we assume that this virtual root node has an index of 0 and that c is the number of cells, we can index the tip nodes from 1 to c by their respective cell numbers and index the internal nodes from c+1 to 2c–2 (note that an unrooted binary tree with c leaves has c–3 inner nodes, i.e., (2c–2) – (c+1) = c–3). Then, under the GT16 model, perSNV probabilities can be computed as follows:
$$P\space({S}_i \space\vert\space T, GT \textit{16},R,\pi)=\sum\limits_{X\in\Gamma} \space\space{L}_i^0(X)\space{\pi}_x,\space i=1 \dots \mathrm{s}$$
(18)
where X is a genotype, π_{x}is the stationary frequency of the genotype X, and \({L}_i^0\)is the vector of genotype likelihoods at the virtual root, which we compute as follows:
$${L}_i^{\nu}(X)=\sum\limits_{Y\in \Gamma}{P}_{X,Y} \space ({t}_u)\space{L}_i^u(Y) \space\cdot \sum\limits_{Y\in \Gamma} {P}_{X,Y} \space ({t}_w) \space {L}_i^w (Y), \space\space\space\forall\space X\in\Gamma, \space\upsilon \notin [1\dots c]$$
(19)
where Y is another genotype, u and w are both children nodes of v in the direction from the virtual root (Fig. S19) for the respective branch lengths.
We initialize the genotype likelihood vectors at the tip nodes (\({L}_i^1\).. \({L}_i^c\)) depending on the input type (see Section “Input data” below) and the error model. If the input is the genotype matrix S, in the absence of observational error, these tip likelihood vectors are:
$${L}_i^v(X)=\{1\ if\ {S}_i^v=X,\space \space 0\ otherwise\},\space v\in [1\dots c], \space i\in [1\dots s]$$
(20)
while if we include the error model, the tip likelihoods are computed according to the equations in the “singlecell genotype error model” section.
Otherwise, if the input consists of genotype likelihoods G, the tip likelihoods are:
$${L}_i^v(X)=1{0}^{G_i^v(X)}, \space \forall \space X\in \Gamma, \space v\in [1\dots c], \space i\in [1\dots s]$$
(21)
The GT16 model will consider both phasing options equally likely when the input genotype likelihoods are unphased:
$${L}_i^v(ab)={L}_i^v(ba)=1{0}^{G_i^v(a/b)}, \space \forall \space a,b,v\in [1\dots c], \space i\in [1\dots s]$$
(22)
where a and b represent any two alleles.
Finally, if Phredscaled likelihoods for REF/REF, REF/ALT, and ALT/ALT genotypes are provided (PL field in a VCF file), we calculate tip likelihoods as follows:
$${L}_i^{\upsilon }(X)={L}_i^{\upsilon}\left(a\leftb\right.\right)=\left\{\begin{array}{ll}{10}^{0.1\cdot {PL}_i^{\upsilon }(0)}& \mathrm{if}\kern0.5em a= REF\wedge b= REF\\ {}{10}^{0.1\cdot {PL}_i^{\upsilon }(1)}& \mathrm{if}\left(a= REF\wedge b= ALT\right)\vee (a= ALT\wedge b= REF)\\ {}{10}^{0.1\cdot {PL}_i^{\upsilon }(2)}& \mathrm{if}\kern0.5em a= ALT\wedge b= REF\\ {}0& \mathrm{otherwise}\end{array}\right.$$
(23)
The final tree can be rooted using an outgroup (see [51]). For the sake of completeness, we included a simple explanation of standard phylogenetic likelihood calculations on DNA sequence alignments in the Supplementary Material (Figs. S19S20, Supplementary Note 3).
Implementation
Overview
We implemented CellPhy as a pipeline based on a modified version of RAxMLNG [48]. In addition to the core tree search functionality of RAxMLNG, the CellPhy pipeline offers VCF conversion, mutation mapping, and tree visualization using the ape [75] and the ggtree package [76]. Furthermore, CellPhy provides reasonable defaults for most parameters, which allows the user to run a “standard” CellPhy analysis by specifying just the input VCF file (or the genotype matrix). Alternatively, expert users can customize every aspect of the CellPhy analysis to fit their needs, as shown in the tutorial (https://github.com/amkozlov/cellphy/blob/master/doc/CellPhyTutorial.pdf). In the remainder of this section, we will also provide implementation details on each step of the CellPhy pipeline. CellPhy code and documentation are freely available at https://github.com/amkozlov/cellphy.
Input data
CellPhy accepts two input types, a matrix of genotypes in FASTA or PHYLIP format or a standard VCF file (https://samtools.github.io/htsspecs/VCFv4.3.pdf). When the input is a genotype matrix, genotypes are encoded as shown in Table S2. When the input is a VCF, CellPhy can run in two distinct modes. The first mode (“CellPhyML”) requires a VCF with at least the GT field (that stores the genotype calls), in which case CellPhy simply extracts the genotype matrix. The second mode (“CellPhyGL”) requires a VCF with the PL field (which stores the Phredscaled genotype likelihoods) and uses the likelihood of each genotype instead. While commonly used variant callers for singlecell data (e.g., Monovar [77]) generate VCF files with a standard PL field, users should know that the PL definition may differ from its standard meaning in different callers. Indeed, SCCaller [66], for instance, uses the PL field to store the likelihood of heterozygous and alternative homozygous genotypes and the likelihood of sequencing noise and amplification artifacts. On this basis, the PL field in VCF files stemming from SCCaller needs to be converted to the standard PL format before CellPhy can be used (see Table S3 outlining CellPhy’s compatibility with popular variant calling algorithms).
Phylogenetic tree search
CellPhy uses the broadly used and welltested heuristic tree search strategy of RAxML [53] and RAxMLNG [48]. CellPhy’s search algorithm starts by default with 20 different trees, ten obtained using a maximum parsimonybased randomized stepwise addition order routine and ten with a completely random topology. The ML tree search itself alternates between model parameter optimization and tree topology optimization phases. One of the most popular approaches for searching tree topologies is the socalled subtree pruning and regrafting (SPR) [78], which removes subtrees from the tree and subsequently places them into different branches to assess if the likelihood improves. Those SPR moves are applied iteratively until no SPR move further increases the tree’s likelihood. In this case, CellPhy terminates and returns the bestfound ML tree. For further details, please see [79] and references therein.
Model parameter optimization
CellPhy uses the LBFGSB method [80] to optimize genotype substitution rates and equilibrium base frequencies. ADO rate and amplification/sequencing error rate are optimized independently using Brent’s method [81]. After each iteration of Brent’s algorithm, CellPhy recomputes all pergenotype likelihoods according to Eq. 7 using the new values of the ADO rate δ` and the amplification/sequencing error ε’.
Branch support
CellPhy can compute confidence values for individual inner branches of the ML tree using two bootstrap (BS) techniques, the standard BS [55], and the transfer BS [56]. In the standard BS, the first step consists of generating many BS replicates, typically 100 to 1000, from the original dataset by randomly sampling SNV sites with replacement. Then, an ML tree is estimated for each replicate. Finally, the support for each inner branch in the ML tree is computed as the percentage of BS trees that contain that branch, as outlined in the example provided in Fig. S21. The standard BS only considers exact matches (i.e., the branch in the ML tree and the BS trees must match exactly to be counted). In contrast, the transfer BS also considers inexact matches to account for the tree search uncertainty and vastness of the tree search space in phylogenetic analyses with many cells.
Mapping mutations onto the tree
CellPhy can show predicted mutations on the branches of the inferred cell tree. To this end, it performs marginal ancestral state reconstruction [57] to obtain the ML genotype for every SNV at every inner node of the tree. At the tips of the tree, occupied by the observed cell genotypes, depending on the input, CellPhy applies Eq. 4 to compute genotype likelihoods given the observed genotype y and estimated error rates (δ, ε) or directly uses the genotype likelihoods provided in the VCF file. Then, it compares ML genotypes between two nodes connected by a branch, and if they differ, a mutation is predicted on the corresponding branch. The mutation mapping output consists of two files, a branchlabeled tree in the Newick format and a text file with a list of predicted mutations (SNV names or positions) at each branch. We also provide a script (https://github.com/amkozlov/cellphy/blob/master/script/mutationmap.R) that automatically generates a plot with the mutations mapped onto the resulting phylogenetic tree, together with a tutorial that explains its use.
Computational efficiency
RAxMLNG was developed with a particular focus on high performance and scalability to large datasets. Hence, CellPhy capitalizes on numerous computational optimizations implemented therein, including highly efficient and vectorized likelihood calculation code, coarse and finegrained parallelization with multithreading, checkpointing, and fast transfer bootstrap computation [82].
Benchmarking
We used computer simulations to benchmark the accuracy of CellPhy under different scenarios (Table S1) relative to stateoftheart methods for singlecell phylogenies like OncoNEM [25], SPhyR [30], SASC [31], ScisTree [33], infSCITE [26, 27], SiFit [28], and SCIPhI [29]. For OncoNEM, SPhyR, SASC, infSCITE, and SiFit, the input is the matrix of observed reference/nonreference homozygous/heterozygous genotypes. ScisTree uses a matrix of genotype probabilities estimated from the simulated read counts for each site, while SCIPhI relies on a standard mpileup file with the simulated read counts. Also, we included in the comparison the standard phylogenetic method TNT [49]. TNT implements a maximum parsimony (MP) approach and attempts to find the tree/s that require the least number of mutations to explain the data. TNT is very popular in MP organismal phylogenetics, heavily optimized for computational speed and efficiency. It is not designed for singlecell NGS data and therefore assumes that the observed genotypes are errorfree.
Simulation of genealogies and genotypes
For the simulation of singlecell diploid genotypes, we used CellCoal [83]. This program can simulate the evolution of a set of cells sampled from a growing population, introducing singlenucleotide variants on the coalescent genealogy under different models of DNA mutation. Furthermore, it can also introduce the typical errors of singlecell sequencing, specifically ADO, amplification, and sequencing errors, and doublets, either to the observed genotypes or directly into the read counts.
We designed six distinct simulation scenarios (simulations 1–6) representing different types of scDNAseq datasets (Table S1), including variable numbers of cells (40–1000) and sites (1000–50,000). We simulated unphased genotype data in all cases, as current scDNAseq techniques do not reveal the genotype phase. We chose a set of scenarios and parameter values that, in our opinion, are representative of different situations that researchers are likely to encounter. The cell samples were assumed to come from an exponentially growing population (growth rate equal to 1×10^{4}) with a presentday effective population size of 10,000. Across scenarios, we set a constant value of 0.1 for the root branch. Note that the mutations in this branch are shared by all cells. We also defined an outgroup branch length of zero in all cases, so the healthy cell and the most recent common ancestor (MRCA) of the sample (a single healthy cell plus several tumor cells) have identical genotypes. The standard coalescent process results in an ultrametric tree, where all tips have the same distance from the MRCA of the sample. However, we introduced rate variation across cell lineages by multiplying the branch lengths of the resulting coalescent genealogy with scaling factors sampled from a Gamma distribution with a mean of 1.0.
Only in the first simulation scenario, we consider a fixed number of SNVs. In the remaining scenarios, the number of observed mutations resulted from applying a mutation rate of 1×10^{6} [84], plus the different scDNAseq errors. We explored different infinite and finitesite mutation models at the single nucleotide or trinucleotide level. Except for the ISM scenarios (simulations 1 and 3), we introduced a variable mutation rate across sites using a Gamma distribution with a mean of 1.0.
We simulated unphased genotypes, as current scDNAseq techniques do not reveal the phase. We generated the observed genotype matrices in two distinct ways. In the first three scenarios (simulations 1–3), we obtained the observed genotypes by directly adding sequencing/amplification errors (i.e., changing one or both alleles) and ADO to the simulated genotype matrices. In the other three (simulations 4–6), the generation of the observed genotypes was more complex. In this case, we first simulated read counts for each cell based on the true genotypes, considering different overdispersed sequencing depths, as well as amplification and sequencing errors. For simplicity, we assumed that maternal and paternal chromosomes are amplified with the same probability. Also, we consider that the number of reads is half for those genomic positions in which only one allele was amplified. In simulation 5, we introduced socalled doublets, that is, two cells that are erroneously sequenced together and that thus appear as a single cell in the sequencing data. For every combination of parameters, we generated 100 replicates. In total, we generated 19,400 cell samples.
Preliminary simulations
We carried out a set of preliminary simulations under favorable conditions, without errors or relatively low noise levels (ADO = 0.10 and amplification error= 0.01). We simulated 40 cells with 250 SNVs, assuming a diploid ISM model [85]. Under this model, a given site can only accumulate a single mutation along with the genealogy, either in the maternal or paternal chromosome.
Simulation 1: infinitesite model, low number of SNVs (“targetISM”)
We started the full simulations with a simple scenario with 40 sampled cells and 250, 500, or 1000 SNVs, assuming a diploid ISM model [85]. We introduced genotype errors and ADO at different rates (Table S1).
Simulation 2: finitesite model, large number of SNVs (“WGSFSM”)
In this case, we simulated a larger number of SNVs, more typical of wholegenome sequencing (WGS) experiments. The number of sampled cells was 100. The mutation model, in this case, was a nonreversible version of the finitesite General Time Reversible Markov model [50], that we called GTnR, assuming a set of singlenucleotide instantaneous rates extrapolated (essentially, we pooled the same mutation in the same rate independently of the 5′ and 3′ context) from the trinucleotide mutational signature 1 at COSMIC (https://cancer.sanger.ac.uk/cosmic/signatures):
$${Q}_{GTnR}=\left[\begin{array}{llll}0& 0.03& 0.12& 0.04\\ {}0.11& 0& 0.02& 0.68\\ {}6.68& 0.02& 0& 0.11\\ {}0.04& 0.12& 0.13& 0\end{array}\right]$$
(24)
The overall mutation rate was set to 1×10^{6}, which resulted in about 2000 true SNVs (see Table S1). However, since ADO events and genotype errors can introduce false negatives and false positives, the number of observed SNVs ranged between 1531 and 10,000. The mutation rates varied across sites according to a Gamma distribution (+G) with shape parameter and mean equal to 1.0 (i.e., moderate amongsite rate heterogeneity).
Simulation 3: mutational signatures and large number of SNVs (“WGSsig”)
This scenario is similar to the previous one, with 60 cells and assuming a trinucleotide ISM model, with COSMIC signatures 1 and 5. The former is a ubiquitous signature in human cells with a predominance of C>T transitions in the NCG trinucleotide context and is related to the spontaneous deamination of 5methylcytosine [86]. The latter is also a typical agerelated signature with a predominance of T>C substitutions in the ATN trinucleotide context, related to transcriptional strand bias [87].
Simulation 4: genotype likelihoods from NGS read counts (“NGSlike”)
In this scenario, and the next two, we simulated NGS read counts from the simulated genotypes. The number of sampled cells was 40, with 10,000 sites and the same mutation model (GTnR+G) and mutation rates as in Simulation 3. We explored three sequencing depths (5x, 30x, and 100x), three ADO rates (0, 0.05, 0.10), three amplification error rates (0, 0.05, 0.10), and three sequencing error rates (0, 0.01, 0.05). We assumed that amplification and sequencing errors among the four nucleotides were equally likely. From the read counts, CellCoal can also simulate the likelihood for all ten possible unphased genotypes at each SNV site, in this case under a 4template amplification model [83]. The input for CellPhy was either the ten genotype likelihoods at each SNV site (CellPhyGL mode) or the unphased genotype with the maximum likelihood (ML) value (CellPhyML mode). The input for SCIPhi was the read counts, while for the rest of the programs, it was the ML unphased genotype. In the rare case of tied ML genotypes, we chose one of them at random.
Simulation 5: NGS doublets (“NGSdoublet”)
In this case, we intended to explore the effects of doublets in the data. Settings were very similar to those for Simulation 4 but, for simplicity, we fixed the sequencing depth to 5x and explored two amplification error rates (0, 0.05), two sequencing error rates (0, 0.01), and four doublet rates (0, 0.05, 0.10, 0.20).
Simulation 6: NGS for large numbers of cells and SNVs (“NGSlarge”)
Finally, to assess the scalability of the tools, we simulated scenarios with 100, 500, or 1000 cells and with 1000, 10,000, or 50,000 sites. Given the mutation rate, a large number of cells, and, most importantly, the amplification and sequencing error rates, almost all sites were observed as SNVs. Settings were very similar to those specified for Simulation 5 but, for simplicity, we fixed the sequencing depth to 5x and explored only one amplification (0.05) and one sequencing (0.01) error value. We only analyzed the first 20 replicates in this scenario due to the high computational cost and prohibitive running times for several competing tools.
Settings for the phylogenetic analyses
Coding DNA into ternary genotypes
Our simulations produce unphased DNA genotypes with ten possible states. However, except for CellPhy, existing tools work with an alphabet composed of 0 (homozygous for the reference allele), 1 (heterozygous), 2 (homozygous for the alternative allele), and 3 (missing genotype). Therefore, we had to encode the simulated DNA genotypes into ternary genotypes (0–3). For this, we used the true reference allele, considering that, in real life, we usually know which allele is the reference. For the sake of simplicity, we did not introduce germline mutations. Importantly, our simulations do not necessarily produce biallelic SNVs, as in the finitesite model multiple mutations can coincide, and amplification and sequencing errors can also result in new alleles called. CellPhy does not limit the number of alleles at an SNV site, but competing tools handle multiallelic sites differently. We explored three ways of coding sites with more than two alleles into ternary genotypes:

Option keep: transform all heterozygotes to “1” and all homozygotes for the alternative allele/s to “2”. In this case, all simulated sites are held, regardless of the number of observed alleles.

Option remove: eliminate sites from the data with more than two alleles. The final genotype matrix includes only biallelic sites.

Option missing: keep only those genotypes that contain the reference allele and/or the major (most common) alternative allele. All other genotypes (containing minor alternative alleles) are considered as missing data (“3”). Therefore, the final ternary genotype matrix includes the same number of sites as the original DNA genotype matrix.
We considered all three encoding options only in simulations 1 and 2. In the remaining simulations, we used only the “missing” option, as it maximized accuracy in most cases.
TNT settings
We performed the TNT analyses using a binary data matrix in TNT format for all simulated and empirical datasets. We allowed 1000 trees to be retained for each run and performed tree searches by setting mult = replic 100. We stored all equally parsimonious trees and used additional ttags to store branch lengths and bootstrap support values.
OncoNEM settings
We ran OncoNEM following the recommended settings in the OncoNEM vignette. We set the false positive rate as the actual genotype error for each scenario and performed a tree search for 200 iterations. Because of its heavy computational requirements and poor performance, we only run it in Simulation 1.
SASC settings
We generated binary matrices as input files and set k option to 1. Additionally, the false positive rate was set as the actual genotype error, and the false negative rate was set as the actual ADO rate. The command line was: sasc n <CELLS> m <SNVS> k 1 a <ADO> b <ERR> E data.CellID l i data.sasc p 24. Similar to OncoNEM, due to its bad performance, SASC was only run partially in Simulation 1.
SPhyR settings
We produced binary input files using a custom script and ran the kDPFC version with the false positive rate (a) set to the simulated genotype error and the false negative rate (b) set to the simulated ADO rate. The command line was kDPFC data.sphyr a <ERR> b <ADO> k 1 t 24 > data.sphyr.result. Afterwards, we used the visualize program to generate the output tree. The command line was visualize data.sphyr.result c data.snv_labels t data.cell_labels > data.sphyr.dot. Because it continuously crashed for most datasets, SPhyR was only run partially in simulation 1.
infSCITE settings
For simulations 1–3, we ran infSCITE using a ternary data matrix composed of 0, 1, 2, and 3, as described above, and set the false positive rate (fd) to be the actual genotype error for each simulated scenario. We set the falsepositive rate for simulations 4–5 as the sum of the simulated sequencing and amplification error rates. For the empirical analyses, we set the false positive rate to 1e−05. We set the remaining parameters to the default values in all runs and obtained results after running an MCMC chain with 5 million steps, a fair tradeoff between runtime and apparent MCMC convergence (the best tree score barely changed after 1M iterations). We additionally set the transpose option to return a tree where the singlecell samples are the leaf labels. The command line was infSCITE i data.infSCITE n <SNVS> m <CELLS> r 1 l 5000000 fd <ERR> ad 1.46e1 cc 1.299164e05 transpose e 0.2 o data.infSCITE.Tree
SiFit settings
For all simulated and empirical datasets, we ran SiFit for 200,000 iterations. The command line was java jar /SiFit.jar m <CELLS> n <SNVS> r 1 iter 200000 df 1 ipMat data.sf cellNames data.names.
SCIPhI settings
Since SCIPhI requires read counts to perform joint variant calling and phylogenetic reconstruction, we ran it only for simulations 4–6. We set the mean error rate (wildMean) for each scenario as the sum of the true sequencing and amplification error rates. The command line was sciphi o data.Result in sampleNames u 1 ncf 0 md 1 mmw 4 mnp 1 ms 1 mc 1 unc true l 200000 seed $RANDOM data.mpileup ese 0 wildMean <ERR>. To make the results comparable, for the empirical datasets we used samtools to generate mpileups for the positions previously identified by SCCaller. These were in turn used as input for SCIPhI: sciphi o <SET>.SCIPhI in <SET>SampleNames.txt u 0 ncf 0 md 0 mmw 4 mnp 1 unc true l 200000 seed 421 <SET>.SCCallerPositions.mpileup.
ScisTree settings
We extracted allele counts from VCF files using a custom script (available in the CellPhy’s GitHub repository) and obtained genotype probabilities using the scprob program provided with ScisTree. We ran ScisTree v1.2.0.6 with default parameters, which we called as follows: scistree <genotypeprobabilitymatrix>.
CellPhy settings
For simulations 1–6, we performed a heuristic tree search starting from a single parsimonybased tree under the GT16 model. The command line for runs with ML genotypes as input was cellphy.sh RAXML search msa data.phy model GT16+FO+E tree pars{1}. The command line for runs where the input was genotype likelihoods (VCF) was cellphy.sh RAXML search msa data.vcf model GT16+FO tree pars{1}. For all empirical datasets except LS140, to take advantage of the genotype likelihood model, we used an inhouse bash script (sccallerconvert.sh, distributed together with CellPhy) to convert the PL field from our SCCaller VCFs. In short, following SCCaller authors’ suggestion, we used the highest likelihood score of the first two values in the PL field (i.e., sequencing noise, amplification artifact) as the Phredscaled genotype likelihood of the reference homozygous (0/0) genotype, and the remaining values as the likelihood for heterozygous (0/1) and alternative homozygous (1/1) genotypes, respectively. Afterward, we ran CellPhy using the following command line cellphy.sh RAXML all msa data.vcf model GT16+FO bsmetric fbp,tbe bstrees 100 to perform an allinone analysis (ML tree inference and bootstrapping based on 100 bootstrap trees). For LS140, since we only had the genotype matrix available and these data were generated without wholegenome amplification, we ran CellPhy without the singlecell error model using the following command line cellphy.sh RAXML all msa data.vcf model GT16+FO probmsa off bsmetric fbp,tbe bstrees 100.
Evaluation of phylogenetic accuracy
We defined phylogenetic accuracy as one minus the normalized RobinsonFoulds (nRF) distance [88] between the inferred tree and the (true) simulated tree. This normalization divides the (absolute) RF distance by the total number of (internal) branches in both trees. Hence, the nRF distance is a convenient metric from zero to one that reflects the proportion of branches (bipartitions of the data) correctly inferred.
Running time comparisons
We characterized the computational efficiency of CellPhy by comparing running times for all methods on six datasets from simulations 1–6 (sim1ADO:0.50, ERR:0.10, sim2ADO:0.10, ERR:0.05, sim3ADO:0.15, ERR:0.10, Signature1, sim4Number of cells:40, ADO:0.25, Amp error:0.10, Seq error: 0.05; sim6100Number of Cells:100, ADO:0.10, Amp error:0.05; Seq error:0.01; and sim6500Number of Cells:500, ADO:0.10, Amp error:0.05; Seq error:0.01) and two empirical datasets (CRC24 and L86) described below. We measured running times using the Linux/Unix “time” command, as follows: {time ./cellphy.sh RAXML search msa data.vcf model GT16+FO tree pars{1} prefix dataout threads 1 ; } 2> dataCellPhyGL16.time. We ran all analyses on a single core from an Intel Xeon E52680 v3 Haswell Processor 2.5 GHz with 128 Gb of RAM. For the bootstrap benchmark, the CellPhy command line was {time ./cellphy.sh RAXML all msa data.vcf model GT16+FO tree pars{1} bstrees 100 prefix dataoutBoot threads 1 ; } 2> dataCellPhyGL16.Boot.time. As for TNT, additional ttags were used to perform and store 100 bootstrap replicates.
Analysis of empirical data
Inhouse singlecell WGS data from a colorectal cancer patient (CRC24)
We obtained a fresh frozen primary tumor and normal tissues from a single colorectal cancer patient (CRC24). We isolated EpCAM+ cells with a BD FACSAria III cytometer from two tumoral regions (tumor inferior (TI) section and tumor middle section (TM)) and classified them as stem or nonstem according to the stemness markers at the cell surface (stem: EpCAM+/Lgr5+/CD44/CD166; nonstem: EpCAM+/Lgr5/CD44/CD166). We amplified the genomes of 24 cells with Ampli1 (Silicon Biosystems) and built wholegenome sequencing libraries using the KAPA (Kapa Biosystems) library kit. Each library was then sequenced at ~ 6× on an Illumina Novaseq 6000 at the National Center of Genomic Analysis (CNAGCR; https://www.cnag.crg.eu/).
Retrieval of publicly available datasets (L86, E15, and LS140)
We also analyzed three public data sets with 86, 15, and 140 cells (L86, E15, and LS140, respectively). The L86 dataset consists of targeted sequencing data from 86 cells from a metastatic colorectal cancer patient (CRC2 in [21]) that we downloaded from the Sequence Read Archive (SRA) in FASTQ format, together with paired healthytumor bulk cell population samples (accession number: SRP074289). The E15 dataset consists of WGS data from 15 neurons [69] from a healthy donor, downloaded from the SRA in FASTQ format, together with a bulk cell population from heart tissue (accession number: SRP041470). The LS140 dataset consists of 140 single cellderived human hematopoietic stem and progenitor colonies from a healthy individual [15]. For this dataset, we directly downloaded the substitution calls from the Mendeley data archive (https://data.mendeley.com/datasets/yzjw2stk7f/1).
NGS data processing and variant calling
We aligned singlecell and bulk reads to the human reference GRCh37 using the MEM algorithm in the BWA software [89]. The mapped reads were then independently processed for all datasets by filtering reads displaying low mapping quality, performing local realignment around indels, and removing PCR duplicates. For the tumor bulk samples (i.e., CRC24 & L86), we obtained SNV calls using the pairedsample variantcalling approach implemented in the MuTect2 software [90]. For the E15 dataset, we ran HaplotypeCaller from the Genome Analysis Toolkit (GATK) [91] software on the bulk sample from the heart tissue to identify and remove all germline variants.
In parallel, we used the singlecell SCcaller software [66] to retrieve singlecell SNV calls. In short, for each singlecell BAM, we ran SCCaller together with the corresponding healthy bulk DNA as input under default settings. Since different amplification methods were used to generate each dataset, we defined the bias estimation interval (lamb) as the average amplicon size of each amplification method—10,000 for MDAbased protocols (L86, E15) and 2000 for Ampli1 (CRC24). Besides, since the actual genomic targets of the L86 dataset were not available, we ran SCcaller on the entire exome. We applied a series of heavy filters (see below) to remove potential offtarget calls. We additionally estimated copynumber variants (CNVs) for each singlecell dataset. For the scWGS datasets (CRC24 and E15), we obtained CNV calls with the Ginkgo software [92] using variablelength bins of around 500 kb. For the L86 dataset, we determined CNVs using CNVPanelizer, an algorithm specifically designed to infer copy number states from targeted sequencing data.
We filtered our raw singlecell VCFs by excluding short indels, SNVs with a flag other than “true” in the SO format field (i.e., showing weak evidence of being a true somatic mutation), and variable sites with an alternative read count < 3. We also excluded variable sites in which the ML genotype estimate was above 120 (Phredscaled). Such uncertainty in the genotype call was usually associated with sites experiencing an apparent disparity in the proportion of both alleles (i.e., allelic bias). Moreover, as we are primarily interested in analyzing diploid genomic regions, we removed those SNVs located within CN variable regions.
For each dataset, we then merged singlecell VCFs using the bcftools software [93] and applied a “consensus” filter to only retain sites present in at least one cell and the bulk tumor sample, or in two cells. For the E15 dataset, we limited this “consensus” filter to somatic sites observed in at least two cells, as we classified as germline all variants observed in the bulk sample. Finally, we removed positions missing (i.e., not covered by any read) in more than 50% of the cells and SNVs comprising more than one alternative allele. For the L86 dataset, we filtered out offtarget SNVs located outside exonic regions. For the LS140 dataset, we converted the binary genotype matrix into a VCF by transforming 0, 1, and NA values into 0/0, 0/1, and ./., respectively. Afterward, we removed all duplicated (nonbiallelic) positions and indels.