The CellPhy model
Model of nucleotide substitution for phased/unphased diploid genotypes
We developed a substitution model for diploid genotypes akin to those typically used for DNA sequences in organismal phylogenetics ([see [51]). Specifically, we built a finite-site, continuous-time, Markov model of genotype evolution considering all possible 16 phased diploid states Γ= {A|A, A|C, A|G, A|T, C|A, C|C, C|G, C|T, G|A, G|C, G|G, G|T, T|A, T|C, T|G, T|T}, in which SNVs are independent and identically distributed. This model, named GT16, is defined by a rate matrix Q that contains the instantaneous transition rates qX↔Y among genotypes X and Y. For computational convenience, we assume a time-reversible process, in which case the Q matrix is the product of a symmetric exchangeability matrix R (rX↔Y = rY↔X) and a diagonal matrix of stationary genotype frequencies πX:
$$Q=R\cdotp \mathit{\operatorname{diag}}\left({\pi}_{A\mid \mathrm{A}},{\pi}_{A\mid C},\dots, {\pi}_{T\mid T}\right)$$
(1)
We assume that only one of the two alleles in a genotype can change in an infinitesimal amount of time. Furthermore, because maternal and paternal chromosomes evolve independently in somatic cells, we also assume that the instantaneous transition rate of a given allele a in genotype X to allele b in genotype Y does not depend on the identity of the homologous allele n within the respective genotypes. In other words, we assume r(na↔nb) = r(b↔a). Conveniently, these assumptions significantly reduce the number of free parameters in the Q matrix, as we only need to consider five nucleotide exchangeabilities (α = r(A↔C), β = r(A↔G), γ= r(A↔T), κ = r(C↔G), λ = r(C↔T); let μ = r(G↔T) = 1) and 15 stationary-phased genotype frequencies (πA|A, πA|C, πA|G, πA|T, πC|A, πC|C, πC|G, πC|T, πG|A, πG|C, πG|G, πG|T, πT|A, πT|C, πT|G; πT|T= 1 - ∑πX):
(2)
The Q matrix gives us the probabilities of changing from one genotype to another in an infinitely small time interval. When two genotypes differ at more than one nucleotide, these rates are 0. To calculate the tree likelihood, we first need to compute the change probabilities from the beginning to the end of each branch. The probabilities of changing from a given genotype to another along a branch of length t (in units of expected number of mutations per SNV site) are given by:
The resulting matrix P(t) is known as the transition probability matrix, and all its entries are (small) positive values. For further details, we refer the reader to Felsenstein [51] and Yang [73].
Notably, the GT16 model can also work with unphased genotype data simply by assigning the same relative likelihood at the tips of the tree to the two possible resolutions of the phase for the observed genotype (see Table S2). This flexibility is relevant because current scDNA-seq techniques do not reveal the phase of the genotypes (i.e., we do not know which allele is located in the maternal or paternal chromosome). Therefore, in our simulations (see below), the simulated genotypes will always be unphased.
In addition to the GT16 model, we also implemented a model for unphased genotypes with only ten states, called GT10 (see Supplementary Note 2 for details). The GT10 model is computationally less expensive, approximately twice as fast as GT16. To maintain the reversibility assumption—otherwise, the calculations are much more complex—the GT10 model assumes that the probability of change between homozygous and heterozygous genotypes is equivalent. However, this is incorrect, as the change from homozygote to heterozygote is twice as likely as the change from heterozygote to homozygote. Despite this theoretical flaw, GT10 and GT16 showed very similar tree inference accuracy in our experiments.
Single-cell genotype error model
We assume that the true genotypes are always diploid and biallelic. To incorporate errors in the observed genotypes arising during single-cell whole-genome amplification (scWGA) and sequencing, we consider two free parameters: the allelic dropout (ADO) rate (δ) and the amplification/sequencing error (ERR) rate (ε).
Allele dropout occurs during scWGA when one of the two alleles is not amplified and is absent from the sequencing library. ADO always implies a single allele in our model, as if both alleles drop, no reads would be available, and no genotype could be observed (i.e., it would be “missing”). Therefore, δ is the probability that the amplification of one or the other allele has failed and thus that we observe the homozygous genotype defined by the amplified allele. Given the phased genotype a|b, and with “_” indicating the dropped allele, we define the ADO rate as:
$$\delta =\mathrm{P}\left(\_\left|b\left|a\left|b\right.\right.\right.\right)+\mathrm{P}\left(a\left|\_\left|a\left|b\right.\right.\right.\right),$$
(4)
where P (Y|X) is the probability of observing genotype Y after sequencing, given the true genotype X.
Genotype errors other than ADO can result from polymerase errors during scWGA, in the course of sequencing, variant calling, or generally represent an incorrect allele. Importantly, we assume a maximum of one ERR per genotype. Given that ε tends to be small, in the order of 10-3 to 10-5 [74], the probability of two ERR in one genotype, ε2, is negligible. Specifically, ε is the probability that allele a will be observed as another allele b ≠ a:
$$\varepsilon =\mathrm{P}\left(b\left|a\right.\right)$$
(5)
Note that we allow for the presence of both ADO and ERR in the same observed genotype. Under these assumptions, if the true genotype is homozygous a|a, for phased genotypes P (Y|X) becomes:
$$P\space(a\vert a \space\vert\space a \vert a)=1-\varepsilon +\space\raisebox{1ex}{$1$}\!/ \!\space\raisebox{-1ex}{$2$}\space\delta \varepsilon$$
(6)
$$P\space(a\vert b \space\vert\space a \vert a)=\left(1-\delta \right)\cdot\space\raisebox{1ex}{$1$}\!/\space \!\raisebox{-1ex}{$6$}\space\varepsilon$$
(7)
$$P\space(b\vert b \space\vert\space a \vert a)=\space\raisebox{1ex}{$1$}\!/ \!\space\raisebox{-1ex}{$6$} \space \delta \varepsilon$$
(8)
Likewise, if the true genotype is heterozygous a|b, P (Y|X) can be shown to be:
$$P\space(a\vert a \space\vert\space a \vert b)=\space \raisebox{1ex}{$1$}\!/ \!\space\raisebox{-1ex}{$2$} \space\delta \space + \space \raisebox{1ex}{$1$}\!/ \!\space\raisebox{-1ex}{$6$}\space\varepsilon \space - \space\raisebox{1ex}{$1$}\!/ \!\space\raisebox{-1ex}{$3$}\space \delta \varepsilon$$
(9)
$$P\space(c\vert c \space\vert\space a\vert b)=\space \raisebox{1ex}{$1$}\!/ \!\space \raisebox{-1ex}{$6$} \space \delta \varepsilon$$
(10)
$$P\space(a\vert c \space\vert\space a\vert b)=(1-\delta)\cdot \space\raisebox{1ex}{$1$}\!/ \space\!\raisebox{-1ex}{$6$}\space\varepsilon$$
(11)
$$P\space(a\vert b \space\vert\space a\vert b)=(1-\delta)\cdot (1-\varepsilon )$$
(12)
For the remaining scenarios, given the assumptions of the error model, P (Y|X) is zero. See Supplementary Material Note 1 for a detailed explanation.
Given the observed genotype Y for the SNV i and cell j, the likelihood of a genotype X being the true genotype at this position becomes:
$${L}_i^j\left(X\left|Y\right.\right)=P\left(Y\left|X\right.\right),\forall X\in \Gamma$$
(13)
This follows from Bayes’ theorem assuming equal prior genotype probabilities.
For missing data, we consider all possible genotypes to be equally likely, which is a standard assumption in likelihood-based phylogenetic inference. In this case:
$${L}_i^j\left(X\ |\ missing\right)=P\ \left( missing\ |\ X\right)=1,\forall X\in \Gamma$$
(14)
Phylogenetic likelihood
We consider the evolutionary history of cells as an unrooted binary tree, where τ is a tree topology, and t is a vector of branch lengths. We define the phylogenetic likelihood of the cell tree T as the conditional probability of the observed SNV matrix S given the substitution model M with parameters θ and T:
$$L\left(T,M,\theta \right)=P\left(S\ |\ T,M,\theta \right)$$
(15)
We assume that genomic sites evolve independently under model M. Therefore, the probability of observing the SNV matrix S is the product over the independent probabilities for all individual SNVs, Si:
$$L(T,M,\theta )={\displaystyle\prod}_{i=1}^s \space P({S}_i \space\vert\space T,M,\theta)$$
(16)
where s is the total number of SNVs. For numerical reasons, that is, to avoid floating-point underflow, in practice, we calculate the log-likelihood score:
$$\log L(T,M,\theta)={\sum}_{i=1}^s \log P({S}_i \space \vert \space T,M,\theta)$$
(17)
We can efficiently compute the log-likelihood score of a given tree with the standard method called Felsenstein’s pruning algorithm [52]. Let us place an additional imaginary node, called virtual root, into an arbitrary branch of the unrooted tree T at a random position along that branch. Without loss of generality, if we assume that this virtual root node has an index of 0 and that c is the number of cells, we can index the tip nodes from 1 to c by their respective cell numbers and index the internal nodes from c+1 to 2c–2 (note that an unrooted binary tree with c leaves has c–3 inner nodes, i.e., (2c–2) – (c+1) = c–3). Then, under the GT16 model, per-SNV probabilities can be computed as follows:
$$P\space({S}_i \space\vert\space T, GT \textit{16},R,\pi)=\sum\limits_{X\in\Gamma} \space\space{L}_i^0(X)\space{\pi}_x,\space i=1 \dots \mathrm{s}$$
(18)
where X is a genotype, πxis the stationary frequency of the genotype X, and \({L}_i^0\)is the vector of genotype likelihoods at the virtual root, which we compute as follows:
$${L}_i^{\nu}(X)=\sum\limits_{Y\in \Gamma}{P}_{X,Y} \space ({t}_u)\space{L}_i^u(Y) \space\cdot \sum\limits_{Y\in \Gamma} {P}_{X,Y} \space ({t}_w) \space {L}_i^w (Y), \space\space\space\forall\space X\in\Gamma, \space\upsilon \notin [1\dots c]$$
(19)
where Y is another genotype, u and w are both children nodes of v in the direction from the virtual root (Fig. S19) for the respective branch lengths.
We initialize the genotype likelihood vectors at the tip nodes (\({L}_i^1\).. \({L}_i^c\)) depending on the input type (see Section “Input data” below) and the error model. If the input is the genotype matrix S, in the absence of observational error, these tip likelihood vectors are:
$${L}_i^v(X)=\{1\ if\ {S}_i^v=X,\space \space 0\ otherwise\},\space v\in [1\dots c], \space i\in [1\dots s]$$
(20)
while if we include the error model, the tip likelihoods are computed according to the equations in the “single-cell genotype error model” section.
Otherwise, if the input consists of genotype likelihoods G, the tip likelihoods are:
$${L}_i^v(X)=1{0}^{G_i^v(X)}, \space \forall \space X\in \Gamma, \space v\in [1\dots c], \space i\in [1\dots s]$$
(21)
The GT16 model will consider both phasing options equally likely when the input genotype likelihoods are unphased:
$${L}_i^v(a|b)={L}_i^v(b|a)=1{0}^{G_i^v(a/b)}, \space \forall \space a,b,v\in [1\dots c], \space i\in [1\dots s]$$
(22)
where a and b represent any two alleles.
Finally, if Phred-scaled likelihoods for REF/REF, REF/ALT, and ALT/ALT genotypes are provided (PL field in a VCF file), we calculate tip likelihoods as follows:
$${L}_i^{\upsilon }(X)={L}_i^{\upsilon}\left(a\left|b\right.\right)=\left\{\begin{array}{ll}{10}^{-0.1\cdot {PL}_i^{\upsilon }(0)}& \mathrm{if}\kern0.5em a= REF\wedge b= REF\\ {}{10}^{-0.1\cdot {PL}_i^{\upsilon }(1)}& \mathrm{if}\left(a= REF\wedge b= ALT\right)\vee (a= ALT\wedge b= REF)\\ {}{10}^{-0.1\cdot {PL}_i^{\upsilon }(2)}& \mathrm{if}\kern0.5em a= ALT\wedge b= REF\\ {}0& \mathrm{otherwise}\end{array}\right.$$
(23)
The final tree can be rooted using an outgroup (see [51]). For the sake of completeness, we included a simple explanation of standard phylogenetic likelihood calculations on DNA sequence alignments in the Supplementary Material (Figs. S19-S20, Supplementary Note 3).
Implementation
Overview
We implemented CellPhy as a pipeline based on a modified version of RAxML-NG [48]. In addition to the core tree search functionality of RAxML-NG, the CellPhy pipeline offers VCF conversion, mutation mapping, and tree visualization using the ape [75] and the ggtree package [76]. Furthermore, CellPhy provides reasonable defaults for most parameters, which allows the user to run a “standard” CellPhy analysis by specifying just the input VCF file (or the genotype matrix). Alternatively, expert users can customize every aspect of the CellPhy analysis to fit their needs, as shown in the tutorial (https://github.com/amkozlov/cellphy/blob/master/doc/CellPhy-Tutorial.pdf). In the remainder of this section, we will also provide implementation details on each step of the CellPhy pipeline. CellPhy code and documentation are freely available at https://github.com/amkozlov/cellphy.
Input data
CellPhy accepts two input types, a matrix of genotypes in FASTA or PHYLIP format or a standard VCF file (https://samtools.github.io/hts-specs/VCFv4.3.pdf). When the input is a genotype matrix, genotypes are encoded as shown in Table S2. When the input is a VCF, CellPhy can run in two distinct modes. The first mode (“CellPhy-ML”) requires a VCF with at least the GT field (that stores the genotype calls), in which case CellPhy simply extracts the genotype matrix. The second mode (“CellPhy-GL”) requires a VCF with the PL field (which stores the Phred-scaled genotype likelihoods) and uses the likelihood of each genotype instead. While commonly used variant callers for single-cell data (e.g., Monovar [77]) generate VCF files with a standard PL field, users should know that the PL definition may differ from its standard meaning in different callers. Indeed, SC-Caller [66], for instance, uses the PL field to store the likelihood of heterozygous and alternative homozygous genotypes and the likelihood of sequencing noise and amplification artifacts. On this basis, the PL field in VCF files stemming from SC-Caller needs to be converted to the standard PL format before CellPhy can be used (see Table S3 outlining CellPhy’s compatibility with popular variant calling algorithms).
Phylogenetic tree search
CellPhy uses the broadly used and well-tested heuristic tree search strategy of RAxML [53] and RAxML-NG [48]. CellPhy’s search algorithm starts by default with 20 different trees, ten obtained using a maximum parsimony-based randomized stepwise addition order routine and ten with a completely random topology. The ML tree search itself alternates between model parameter optimization and tree topology optimization phases. One of the most popular approaches for searching tree topologies is the so-called subtree pruning and regrafting (SPR) [78], which removes subtrees from the tree and subsequently places them into different branches to assess if the likelihood improves. Those SPR moves are applied iteratively until no SPR move further increases the tree’s likelihood. In this case, CellPhy terminates and returns the best-found ML tree. For further details, please see [79] and references therein.
Model parameter optimization
CellPhy uses the L-BFGS-B method [80] to optimize genotype substitution rates and equilibrium base frequencies. ADO rate and amplification/sequencing error rate are optimized independently using Brent’s method [81]. After each iteration of Brent’s algorithm, CellPhy re-computes all per-genotype likelihoods according to Eq. 7 using the new values of the ADO rate δ` and the amplification/sequencing error ε’.
Branch support
CellPhy can compute confidence values for individual inner branches of the ML tree using two bootstrap (BS) techniques, the standard BS [55], and the transfer BS [56]. In the standard BS, the first step consists of generating many BS replicates, typically 100 to 1000, from the original dataset by randomly sampling SNV sites with replacement. Then, an ML tree is estimated for each replicate. Finally, the support for each inner branch in the ML tree is computed as the percentage of BS trees that contain that branch, as outlined in the example provided in Fig. S21. The standard BS only considers exact matches (i.e., the branch in the ML tree and the BS trees must match exactly to be counted). In contrast, the transfer BS also considers inexact matches to account for the tree search uncertainty and vastness of the tree search space in phylogenetic analyses with many cells.
Mapping mutations onto the tree
CellPhy can show predicted mutations on the branches of the inferred cell tree. To this end, it performs marginal ancestral state reconstruction [57] to obtain the ML genotype for every SNV at every inner node of the tree. At the tips of the tree, occupied by the observed cell genotypes, depending on the input, CellPhy applies Eq. 4 to compute genotype likelihoods given the observed genotype y and estimated error rates (δ, ε) or directly uses the genotype likelihoods provided in the VCF file. Then, it compares ML genotypes between two nodes connected by a branch, and if they differ, a mutation is predicted on the corresponding branch. The mutation mapping output consists of two files, a branch-labeled tree in the Newick format and a text file with a list of predicted mutations (SNV names or positions) at each branch. We also provide a script (https://github.com/amkozlov/cellphy/blob/master/script/mutation-map.R) that automatically generates a plot with the mutations mapped onto the resulting phylogenetic tree, together with a tutorial that explains its use.
Computational efficiency
RAxML-NG was developed with a particular focus on high performance and scalability to large datasets. Hence, CellPhy capitalizes on numerous computational optimizations implemented therein, including highly efficient and vectorized likelihood calculation code, coarse- and fine-grained parallelization with multi-threading, checkpointing, and fast transfer bootstrap computation [82].
Benchmarking
We used computer simulations to benchmark the accuracy of CellPhy under different scenarios (Table S1) relative to state-of-the-art methods for single-cell phylogenies like OncoNEM [25], SPhyR [30], SASC [31], ScisTree [33], infSCITE [26, 27], SiFit [28], and SCIPhI [29]. For OncoNEM, SPhyR, SASC, infSCITE, and SiFit, the input is the matrix of observed reference/non-reference homozygous/heterozygous genotypes. ScisTree uses a matrix of genotype probabilities estimated from the simulated read counts for each site, while SCIPhI relies on a standard mpileup file with the simulated read counts. Also, we included in the comparison the standard phylogenetic method TNT [49]. TNT implements a maximum parsimony (MP) approach and attempts to find the tree/s that require the least number of mutations to explain the data. TNT is very popular in MP organismal phylogenetics, heavily optimized for computational speed and efficiency. It is not designed for single-cell NGS data and therefore assumes that the observed genotypes are error-free.
Simulation of genealogies and genotypes
For the simulation of single-cell diploid genotypes, we used CellCoal [83]. This program can simulate the evolution of a set of cells sampled from a growing population, introducing single-nucleotide variants on the coalescent genealogy under different models of DNA mutation. Furthermore, it can also introduce the typical errors of single-cell sequencing, specifically ADO, amplification, and sequencing errors, and doublets, either to the observed genotypes or directly into the read counts.
We designed six distinct simulation scenarios (simulations 1–6) representing different types of scDNA-seq datasets (Table S1), including variable numbers of cells (40–1000) and sites (1000–50,000). We simulated unphased genotype data in all cases, as current scDNA-seq techniques do not reveal the genotype phase. We chose a set of scenarios and parameter values that, in our opinion, are representative of different situations that researchers are likely to encounter. The cell samples were assumed to come from an exponentially growing population (growth rate equal to 1×10-4) with a present-day effective population size of 10,000. Across scenarios, we set a constant value of 0.1 for the root branch. Note that the mutations in this branch are shared by all cells. We also defined an outgroup branch length of zero in all cases, so the healthy cell and the most recent common ancestor (MRCA) of the sample (a single healthy cell plus several tumor cells) have identical genotypes. The standard coalescent process results in an ultrametric tree, where all tips have the same distance from the MRCA of the sample. However, we introduced rate variation across cell lineages by multiplying the branch lengths of the resulting coalescent genealogy with scaling factors sampled from a Gamma distribution with a mean of 1.0.
Only in the first simulation scenario, we consider a fixed number of SNVs. In the remaining scenarios, the number of observed mutations resulted from applying a mutation rate of 1×10-6 [84], plus the different scDNA-seq errors. We explored different infinite- and finite-site mutation models at the single nucleotide or trinucleotide level. Except for the ISM scenarios (simulations 1 and 3), we introduced a variable mutation rate across sites using a Gamma distribution with a mean of 1.0.
We simulated unphased genotypes, as current scDNA-seq techniques do not reveal the phase. We generated the observed genotype matrices in two distinct ways. In the first three scenarios (simulations 1–3), we obtained the observed genotypes by directly adding sequencing/amplification errors (i.e., changing one or both alleles) and ADO to the simulated genotype matrices. In the other three (simulations 4–6), the generation of the observed genotypes was more complex. In this case, we first simulated read counts for each cell based on the true genotypes, considering different overdispersed sequencing depths, as well as amplification and sequencing errors. For simplicity, we assumed that maternal and paternal chromosomes are amplified with the same probability. Also, we consider that the number of reads is half for those genomic positions in which only one allele was amplified. In simulation 5, we introduced so-called doublets, that is, two cells that are erroneously sequenced together and that thus appear as a single cell in the sequencing data. For every combination of parameters, we generated 100 replicates. In total, we generated 19,400 cell samples.
Preliminary simulations
We carried out a set of preliminary simulations under favorable conditions, without errors or relatively low noise levels (ADO = 0.10 and amplification error= 0.01). We simulated 40 cells with 250 SNVs, assuming a diploid ISM model [85]. Under this model, a given site can only accumulate a single mutation along with the genealogy, either in the maternal or paternal chromosome.
Simulation 1: infinite-site model, low number of SNVs (“target-ISM”)
We started the full simulations with a simple scenario with 40 sampled cells and 250, 500, or 1000 SNVs, assuming a diploid ISM model [85]. We introduced genotype errors and ADO at different rates (Table S1).
Simulation 2: finite-site model, large number of SNVs (“WGS-FSM”)
In this case, we simulated a larger number of SNVs, more typical of whole-genome sequencing (WGS) experiments. The number of sampled cells was 100. The mutation model, in this case, was a non-reversible version of the finite-site General Time Reversible Markov model [50], that we called GTnR, assuming a set of single-nucleotide instantaneous rates extrapolated (essentially, we pooled the same mutation in the same rate independently of the 5′ and 3′ context) from the trinucleotide mutational signature 1 at COSMIC (https://cancer.sanger.ac.uk/cosmic/signatures):
$${Q}_{GTnR}=\left[\begin{array}{llll}0& 0.03& 0.12& 0.04\\ {}0.11& 0& 0.02& 0.68\\ {}6.68& 0.02& 0& 0.11\\ {}0.04& 0.12& 0.13& 0\end{array}\right]$$
(24)
The overall mutation rate was set to 1×10-6, which resulted in about 2000 true SNVs (see Table S1). However, since ADO events and genotype errors can introduce false negatives and false positives, the number of observed SNVs ranged between 1531 and 10,000. The mutation rates varied across sites according to a Gamma distribution (+G) with shape parameter and mean equal to 1.0 (i.e., moderate among-site rate heterogeneity).
Simulation 3: mutational signatures and large number of SNVs (“WGS-sig”)
This scenario is similar to the previous one, with 60 cells and assuming a trinucleotide ISM model, with COSMIC signatures 1 and 5. The former is a ubiquitous signature in human cells with a predominance of C>T transitions in the NCG trinucleotide context and is related to the spontaneous deamination of 5-methylcytosine [86]. The latter is also a typical age-related signature with a predominance of T>C substitutions in the ATN trinucleotide context, related to transcriptional strand bias [87].
Simulation 4: genotype likelihoods from NGS read counts (“NGS-like”)
In this scenario, and the next two, we simulated NGS read counts from the simulated genotypes. The number of sampled cells was 40, with 10,000 sites and the same mutation model (GTnR+G) and mutation rates as in Simulation 3. We explored three sequencing depths (5x, 30x, and 100x), three ADO rates (0, 0.05, 0.10), three amplification error rates (0, 0.05, 0.10), and three sequencing error rates (0, 0.01, 0.05). We assumed that amplification and sequencing errors among the four nucleotides were equally likely. From the read counts, CellCoal can also simulate the likelihood for all ten possible unphased genotypes at each SNV site, in this case under a 4-template amplification model [83]. The input for CellPhy was either the ten genotype likelihoods at each SNV site (CellPhy-GL mode) or the unphased genotype with the maximum likelihood (ML) value (CellPhy-ML mode). The input for SCIPhi was the read counts, while for the rest of the programs, it was the ML unphased genotype. In the rare case of tied ML genotypes, we chose one of them at random.
Simulation 5: NGS doublets (“NGS-doublet”)
In this case, we intended to explore the effects of doublets in the data. Settings were very similar to those for Simulation 4 but, for simplicity, we fixed the sequencing depth to 5x and explored two amplification error rates (0, 0.05), two sequencing error rates (0, 0.01), and four doublet rates (0, 0.05, 0.10, 0.20).
Simulation 6: NGS for large numbers of cells and SNVs (“NGS-large”)
Finally, to assess the scalability of the tools, we simulated scenarios with 100, 500, or 1000 cells and with 1000, 10,000, or 50,000 sites. Given the mutation rate, a large number of cells, and, most importantly, the amplification and sequencing error rates, almost all sites were observed as SNVs. Settings were very similar to those specified for Simulation 5 but, for simplicity, we fixed the sequencing depth to 5x and explored only one amplification (0.05) and one sequencing (0.01) error value. We only analyzed the first 20 replicates in this scenario due to the high computational cost and prohibitive running times for several competing tools.
Settings for the phylogenetic analyses
Coding DNA into ternary genotypes
Our simulations produce unphased DNA genotypes with ten possible states. However, except for CellPhy, existing tools work with an alphabet composed of 0 (homozygous for the reference allele), 1 (heterozygous), 2 (homozygous for the alternative allele), and 3 (missing genotype). Therefore, we had to encode the simulated DNA genotypes into ternary genotypes (0–3). For this, we used the true reference allele, considering that, in real life, we usually know which allele is the reference. For the sake of simplicity, we did not introduce germline mutations. Importantly, our simulations do not necessarily produce bi-allelic SNVs, as in the finite-site model multiple mutations can coincide, and amplification and sequencing errors can also result in new alleles called. CellPhy does not limit the number of alleles at an SNV site, but competing tools handle multi-allelic sites differently. We explored three ways of coding sites with more than two alleles into ternary genotypes:
-
Option keep: transform all heterozygotes to “1” and all homozygotes for the alternative allele/s to “2”. In this case, all simulated sites are held, regardless of the number of observed alleles.
-
Option remove: eliminate sites from the data with more than two alleles. The final genotype matrix includes only bi-allelic sites.
-
Option missing: keep only those genotypes that contain the reference allele and/or the major (most common) alternative allele. All other genotypes (containing minor alternative alleles) are considered as missing data (“3”). Therefore, the final ternary genotype matrix includes the same number of sites as the original DNA genotype matrix.
We considered all three encoding options only in simulations 1 and 2. In the remaining simulations, we used only the “missing” option, as it maximized accuracy in most cases.
TNT settings
We performed the TNT analyses using a binary data matrix in TNT format for all simulated and empirical datasets. We allowed 1000 trees to be retained for each run and performed tree searches by setting mult = replic 100. We stored all equally parsimonious trees and used additional ttags to store branch lengths and bootstrap support values.
OncoNEM settings
We ran OncoNEM following the recommended settings in the OncoNEM vignette. We set the false positive rate as the actual genotype error for each scenario and performed a tree search for 200 iterations. Because of its heavy computational requirements and poor performance, we only run it in Simulation 1.
SASC settings
We generated binary matrices as input files and set -k option to 1. Additionally, the false positive rate was set as the actual genotype error, and the false negative rate was set as the actual ADO rate. The command line was: sasc -n <CELLS> -m <SNVS> -k 1 -a <ADO> -b <ERR> -E data.CellID -l -i data.sasc -p 24. Similar to OncoNEM, due to its bad performance, SASC was only run partially in Simulation 1.
SPhyR settings
We produced binary input files using a custom script and ran the kDPFC version with the false positive rate (-a) set to the simulated genotype error and the false negative rate (-b) set to the simulated ADO rate. The command line was kDPFC data.sphyr -a <ERR> -b <ADO> -k 1 -t 24 > data.sphyr.result. Afterwards, we used the visualize program to generate the output tree. The command line was visualize data.sphyr.result -c data.snv_labels -t data.cell_labels > data.sphyr.dot. Because it continuously crashed for most datasets, SPhyR was only run partially in simulation 1.
infSCITE settings
For simulations 1–3, we ran infSCITE using a ternary data matrix composed of 0, 1, 2, and 3, as described above, and set the false positive rate (-fd) to be the actual genotype error for each simulated scenario. We set the false-positive rate for simulations 4–5 as the sum of the simulated sequencing and amplification error rates. For the empirical analyses, we set the false positive rate to 1e−05. We set the remaining parameters to the default values in all runs and obtained results after running an MCMC chain with 5 million steps, a fair trade-off between runtime and apparent MCMC convergence (the best tree score barely changed after 1M iterations). We additionally set the -transpose option to return a tree where the single-cell samples are the leaf labels. The command line was infSCITE -i data.infSCITE -n <SNVS> -m <CELLS> -r 1 -l 5000000 -fd <ERR> -ad 1.46e-1 -cc 1.299164e-05 -transpose -e 0.2 -o data.infSCITE.Tree
SiFit settings
For all simulated and empirical datasets, we ran SiFit for 200,000 iterations. The command line was java -jar /SiFit.jar -m <CELLS> -n <SNVS> -r 1 -iter 200000 -df 1 -ipMat data.sf -cellNames data.names.
SCIPhI settings
Since SCIPhI requires read counts to perform joint variant calling and phylogenetic reconstruction, we ran it only for simulations 4–6. We set the mean error rate (-wildMean) for each scenario as the sum of the true sequencing and amplification error rates. The command line was sciphi -o data.Result --in sampleNames -u 1 --ncf 0 --md 1 --mmw 4 --mnp 1 --ms 1 --mc 1 --unc true -l 200000 --seed $RANDOM data.mpileup --ese 0 --wildMean <ERR>. To make the results comparable, for the empirical datasets we used samtools to generate mpileups for the positions previously identified by SC-Caller. These were in turn used as input for SCIPhI: sciphi -o <SET>.SCIPhI --in <SET>-SampleNames.txt -u 0 --ncf 0 --md 0 --mmw 4 --mnp 1 --unc true -l 200000 --seed 421 <SET>.SCCaller-Positions.mpileup.
ScisTree settings
We extracted allele counts from VCF files using a custom script (available in the CellPhy’s GitHub repository) and obtained genotype probabilities using the scprob program provided with ScisTree. We ran ScisTree v1.2.0.6 with default parameters, which we called as follows: scistree <genotype-probability-matrix>.
CellPhy settings
For simulations 1–6, we performed a heuristic tree search starting from a single parsimony-based tree under the GT16 model. The command line for runs with ML genotypes as input was cellphy.sh RAXML --search --msa data.phy --model GT16+FO+E --tree pars{1}. The command line for runs where the input was genotype likelihoods (VCF) was cellphy.sh RAXML --search --msa data.vcf --model GT16+FO --tree pars{1}. For all empirical datasets except LS140, to take advantage of the genotype likelihood model, we used an in-house bash script (sc-caller-convert.sh, distributed together with CellPhy) to convert the PL field from our SC-Caller VCFs. In short, following SC-Caller authors’ suggestion, we used the highest likelihood score of the first two values in the PL field (i.e., sequencing noise, amplification artifact) as the Phred-scaled genotype likelihood of the reference homozygous (0/0) genotype, and the remaining values as the likelihood for heterozygous (0/1) and alternative homozygous (1/1) genotypes, respectively. Afterward, we ran CellPhy using the following command line cellphy.sh RAXML --all --msa data.vcf --model GT16+FO --bs-metric fbp,tbe --bs-trees 100 to perform an all-in-one analysis (ML tree inference and bootstrapping based on 100 bootstrap trees). For LS140, since we only had the genotype matrix available and these data were generated without whole-genome amplification, we ran CellPhy without the single-cell error model using the following command line cellphy.sh RAXML --all --msa data.vcf --model GT16+FO --prob-msa off --bs-metric fbp,tbe --bs-trees 100.
Evaluation of phylogenetic accuracy
We defined phylogenetic accuracy as one minus the normalized Robinson-Foulds (nRF) distance [88] between the inferred tree and the (true) simulated tree. This normalization divides the (absolute) RF distance by the total number of (internal) branches in both trees. Hence, the nRF distance is a convenient metric from zero to one that reflects the proportion of branches (bipartitions of the data) correctly inferred.
Running time comparisons
We characterized the computational efficiency of CellPhy by comparing running times for all methods on six datasets from simulations 1–6 (sim1-ADO:0.50, ERR:0.10, sim2-ADO:0.10, ERR:0.05, sim3-ADO:0.15, ERR:0.10, Signature1, sim4-Number of cells:40, ADO:0.25, Amp error:0.10, Seq error: 0.05; sim6-100-Number of Cells:100, ADO:0.10, Amp error:0.05; Seq error:0.01; and sim6-500-Number of Cells:500, ADO:0.10, Amp error:0.05; Seq error:0.01) and two empirical datasets (CRC24 and L86) described below. We measured running times using the Linux/Unix “time” command, as follows: {time ./cellphy.sh RAXML --search --msa data.vcf --model GT16+FO --tree pars{1} --prefix data-out --threads 1 ; } 2> data-CellPhyGL16.time. We ran all analyses on a single core from an Intel Xeon E5-2680 v3 Haswell Processor 2.5 GHz with 128 Gb of RAM. For the bootstrap benchmark, the CellPhy command line was {time ./cellphy.sh RAXML --all --msa data.vcf --model GT16+FO --tree pars{1} --bs-trees 100 --prefix data-out-Boot --threads 1 ; } 2> data-CellPhyGL16.Boot.time. As for TNT, additional ttags were used to perform and store 100 bootstrap replicates.
Analysis of empirical data
In-house single-cell WGS data from a colorectal cancer patient (CRC24)
We obtained a fresh frozen primary tumor and normal tissues from a single colorectal cancer patient (CRC24). We isolated EpCAM+ cells with a BD FACSAria III cytometer from two tumoral regions (tumor inferior (TI) section and tumor middle section (TM)) and classified them as stem or non-stem according to the stemness markers at the cell surface (stem: EpCAM+/Lgr5+/CD44-/CD166-; non-stem: EpCAM+/Lgr5-/CD44-/CD166-). We amplified the genomes of 24 cells with Ampli1 (Silicon Biosystems) and built whole-genome sequencing libraries using the KAPA (Kapa Biosystems) library kit. Each library was then sequenced at ~ 6× on an Illumina Novaseq 6000 at the National Center of Genomic Analysis (CNAG-CR; https://www.cnag.crg.eu/).
Retrieval of publicly available datasets (L86, E15, and LS140)
We also analyzed three public data sets with 86, 15, and 140 cells (L86, E15, and LS140, respectively). The L86 dataset consists of targeted sequencing data from 86 cells from a metastatic colorectal cancer patient (CRC2 in [21]) that we downloaded from the Sequence Read Archive (SRA) in FASTQ format, together with paired healthy-tumor bulk cell population samples (accession number: SRP074289). The E15 dataset consists of WGS data from 15 neurons [69] from a healthy donor, downloaded from the SRA in FASTQ format, together with a bulk cell population from heart tissue (accession number: SRP041470). The LS140 dataset consists of 140 single cell-derived human hematopoietic stem and progenitor colonies from a healthy individual [15]. For this dataset, we directly downloaded the substitution calls from the Mendeley data archive (https://data.mendeley.com/datasets/yzjw2stk7f/1).
NGS data processing and variant calling
We aligned single-cell and bulk reads to the human reference GRCh37 using the MEM algorithm in the BWA software [89]. The mapped reads were then independently processed for all datasets by filtering reads displaying low mapping quality, performing local realignment around indels, and removing PCR duplicates. For the tumor bulk samples (i.e., CRC24 & L86), we obtained SNV calls using the paired-sample variant-calling approach implemented in the MuTect2 software [90]. For the E15 dataset, we ran HaplotypeCaller from the Genome Analysis Toolkit (GATK) [91] software on the bulk sample from the heart tissue to identify and remove all germline variants.
In parallel, we used the single-cell SC-caller software [66] to retrieve single-cell SNV calls. In short, for each single-cell BAM, we ran SC-Caller together with the corresponding healthy bulk DNA as input under default settings. Since different amplification methods were used to generate each dataset, we defined the bias estimation interval (--lamb) as the average amplicon size of each amplification method—10,000 for MDA-based protocols (L86, E15) and 2000 for Ampli1 (CRC24). Besides, since the actual genomic targets of the L86 dataset were not available, we ran SC-caller on the entire exome. We applied a series of heavy filters (see below) to remove potential off-target calls. We additionally estimated copy-number variants (CNVs) for each single-cell dataset. For the sc-WGS datasets (CRC24 and E15), we obtained CNV calls with the Ginkgo software [92] using variable-length bins of around 500 kb. For the L86 dataset, we determined CNVs using CNVPanelizer, an algorithm specifically designed to infer copy number states from targeted sequencing data.
We filtered our raw single-cell VCFs by excluding short indels, SNVs with a flag other than “true” in the SO format field (i.e., showing weak evidence of being a true somatic mutation), and variable sites with an alternative read count < 3. We also excluded variable sites in which the ML genotype estimate was above 120 (Phred-scaled). Such uncertainty in the genotype call was usually associated with sites experiencing an apparent disparity in the proportion of both alleles (i.e., allelic bias). Moreover, as we are primarily interested in analyzing diploid genomic regions, we removed those SNVs located within CN variable regions.
For each dataset, we then merged single-cell VCFs using the bcftools software [93] and applied a “consensus” filter to only retain sites present in at least one cell and the bulk tumor sample, or in two cells. For the E15 dataset, we limited this “consensus” filter to somatic sites observed in at least two cells, as we classified as germline all variants observed in the bulk sample. Finally, we removed positions missing (i.e., not covered by any read) in more than 50% of the cells and SNVs comprising more than one alternative allele. For the L86 dataset, we filtered out off-target SNVs located outside exonic regions. For the LS140 dataset, we converted the binary genotype matrix into a VCF by transforming 0, 1, and NA values into 0/0, 0/1, and ./., respectively. Afterward, we removed all duplicated (non-biallelic) positions and indels.