Overview of SiFit
We start with a brief explanation of how SiFit infers a tumor phylogeny from noisy genotype data obtained from SCS. The input data consist of the following:
-
1.
An n×m genotype matrix, which contains the observed genotypes for m single cells at n different loci. The genotype matrix can be binary or ternary depending on the data.
-
2.
The FP rate (α) and FN rate (β). These error parameters can be learned from the data.
SiFit includes (1) a finite-sites model of tumor evolution and an error model for SCS, based on which the likelihood score of a candidate phylogenetic tree and error rate can be quantified and (2) a heuristic algorithm for exploring the joint space of trees and error rates in search of optimal parameters.
SiFit outputs a phylogenetic tree describing the evolutionary relationship between the single cells and the estimated error rates. The single cells are placed at the leaves of the phylogenetic tree. A more detailed technical description of SiFit can be found in “Methods” section.
Phylogenetic trees and model of tumor evolution
We assume that the observed single cells evolved according to an underlying phylogenetic tree. A phylogeny or phylogenetic tree represents the genealogical relationship among genes, species, populations, etc. [44]. In the context of a tumor, it is a rooted binary tree that represents the genealogical relationship among a set of cells. The sequenced single cells are placed at the leaves of the phylogenetic tree. We also assume that the cells evolve according to a finite-sites model along the branches of the tree.
The n×m true genotype matrix G contains the true genotypes of m single cells at n different loci. If the data contain information only about the presence or absence of a mutation at a locus, the matrix is binary, where the absence or presence of a mutation is represented by a 0 or 1 at the entry G(i,j), respectively. Assuming the cells to be diploid, if the data differentiates between heterozygous and homozygous mutations, the genotype matrix is ternary, where a 0, 1, or 2 at entry G(i,j) denotes a homozygous reference or a heterozygous or homozygous non-reference genotype, respectively. Heterozygous or homozygous non-reference genotypes represent mutations. This ternary representation facilitates the use of a mutation profile from modern variant-calling algorithms (e.g., Monovar [27] and GATK [45]), which report the mutation status of a sample in terms of genotypes.
To accommodate SCS data, we develop a finite-sites model of evolution (\(\mathcal {M}\)) that accounts for the effects of point mutations, deletions, and LOH on genomic sites. The finite-sites model of evolution encompasses a continuous-time Markov chain that assigns a transition probability for one genotype state changing to another along a branch of length t. The value of the transition probabilities depends on the branch length (t) and the parameters (\(\mathcal {M}_{\lambda }\)) of the model of evolution (see “Methods” section for details). By assigning a finite probability for all possible genotype transitions, this finite-sites model of evolution enables us to account for convergent evolution or reversal of genotypes that are excluded by methods that make the infinite-sites assumption (SCITE and OncoNEM). OncoNEM also assumes only binary data and does not differentiate between heterozygous and homozygous mutations. This binarization of data might result in loss of information for a data set with ternary genotypes, since heterozygous and homozygous non-reference genotypes cannot be distinguished when data is binarized. On the other hand, SCITE assumes that the observation of a homozygous non-reference genotype is due to technical errors only. These assumptions follow from using the infinite-sites model and are not made by SiFit.
SCITE also removes the mutations that are present in all cells or in one cell as non-informative in tree reconstruction. SiFit does not remove such mutations as these can be informative in the computation of the likelihood under a finite-sites model of evolution.
Model of single-cell errors
The observed genotype matrix, denoted by D, is an imperfect noisy version of the true genotype matrix G. The FP errors and the FN errors are responsible for adding noise in the observed genotype matrix. Considering binary genotype data, FP errors result in observing a 1 with probability α when the true genotype is 0. Similarly, due to FN errors, with probability β, we will observe a 0, instead of a 1. These relationships between the true and observed genotype matrices are given by
$$ \operatorname{Pr}(D_{i,j}|G_{i,j}) = \left\{ \begin{array}{lll} 1 - \alpha,\,\,\,\, & \text{if }\,\, D_{i,j} = 0, G_{i,j} = 0, \\ \beta,\,\,\,\, & \text{if }\,\, D_{i,j} = 0, G_{i,j} = 1, \\ \alpha,\,\,\,\, & \text{if }\,\, D_{i,j} = 1, G_{i,j} = 0, \\ 1 - \beta,\,\,\,\, & \text{if }\,\, D_{i,j} = 1, G_{i,j} = 1. \end{array}\right. $$
(1)
The error model for ternary data is described in detail in “Methods” section. The observed genotype matrix can also have missing data because of the uneven coverage of SCS. SiFit handles missing data by marginalizing over possible genotypes (see “Methods” section for details).
Tree likelihood
A phylogenetic tree, \(\mathcal {T} = (T, \mathbf {t})\), consists of a tree topology T and a vector of the branch lengths, t. Assuming the technical errors to be independent of each other and that sites evolve independently, the likelihood of a phylogenetic tree (\(\mathcal {T}\)), the error rates (θ=(α,β)), and the parameters of the model of evolution (\(\mathcal {M}_{\lambda }\)) are given by
$$ {}\begin{aligned} \mathcal{L}(\mathcal{T}, \boldsymbol{\theta}, \mathcal{M}_{\lambda}) = \operatorname{Pr}(D|\mathcal{T}, \boldsymbol{\theta}, \mathcal{M}_{\lambda}) = \prod_{i=1}^{n} \operatorname{Pr}(D_{i}|\mathcal{T}, \boldsymbol{\theta}, \mathcal{M}_{\lambda}), \end{aligned} $$
(2)
where D
i
is the observed data at site i. It is a vector with m values corresponding to m single cells. The likelihood calculation for a particular site is described in detail in “Methods” section. The ML estimate is obtained from
$$ (\mathcal{T}, \boldsymbol{\theta}, \mathcal{M}_{\lambda})_{\text{ML}} = \underset{(\mathcal{T}, \boldsymbol{\theta}, \mathcal{M}_{\lambda})} {\text{argmax}}\hspace{1mm} \operatorname{Pr}(D|\mathcal{T}, \boldsymbol{\theta}, \mathcal{M}_{\lambda}). $$
(3)
Heuristic search algorithm
Our model has three main components: the phylogenetic tree (\(\mathcal {T}\)), the error rates of single-cell data (θ), and the parameters of the model of evolution \((\mathcal {M}_{\lambda })\). The tree search space has (2m−3)!/2m−2(m−2)! discrete bifurcating tree topologies for m cells, and each topology has a continuous component for branch lengths. The overall search space also has a continuous component for error rates and model parameters along with the tree space. We designed a heuristic search algorithm to explore the joint search space to infer the ML configuration of phylogeny, error rates and evolution model parameters. In the joint \((\mathcal {T}, \boldsymbol {\theta }, \mathcal {M}_{\lambda })\) space, we consider three types of moves to propose a new configuration. In each type of move, one component is changed. Thus, from a current configuration \((\mathcal {T}, \boldsymbol {\theta }, \mathcal {M}_{\lambda })\), a new configuration of \((\mathcal {T'}, \boldsymbol {\theta }, \mathcal {M}_{\lambda })\), \((\mathcal {T}, \boldsymbol {\theta }', \mathcal {M}_{\lambda })\), or \((\mathcal {T}, \boldsymbol {\theta }, \mathcal {M}_{\lambda '})\) is proposed. The new configuration is heuristically accepted according to a ratio of likelihood. The search procedure terminates when the likelihood does not improve or the maximum number of iterations has been reached.
Performance on simulated data
First, we evaluated the performance of SiFit on extensive simulated data sets. The simulation studies were aimed at analyzing SiFit’s accuracy in phylogeny inference under different experimental conditions. We also assessed SiFit’s ability to estimate the error rates and its robustness against increased error rates. We compared SiFit’s performance to three other methods. To analyze how the tree inference process degrades if the inference algorithm fails to account for the SCS errors, we chose a representative of the classic phylogeny inference methods as used by Eirew et al. [34]. Eirew et al. used MrBayes [35], a Bayesian phylogenetic inference method, which reports a set of trees drawn from the posterior distribution. Even though it was applied on SCS data, this method does not account for the errors in that data. The trees inferred from this method can be directly compared against the true trees. For MrBayes, we compute the average tree reconstruction error by averaging over all inferred trees. We also compared against SCITE [39] and OncoNEM [38], methods that infer tumor trees under the infinite-sites assumption. SCITE was designed to infer a mutation tree, but it can also infer a binary leaf-labeled tree, where the cells are the leaf labels and edges contain mutations. We used SCITE to infer the binary leaf-labeled tree from simulated data sets so that they can be directly compared against the true trees. Since, SCITE is an MCMC-based algorithm, occasionally it might report more than one optimal tree. In such cases, we measure the average accuracy over all the reported trees. OncoNEM infers a clonal tree, which cannot be directly compared against the simulated trees. OncoNEM first infers a cell lineage tree and then converts it to a clonal tree by clustering nodes. The cell lineage tree inferred by OncoNEM is a different representation of the clonal tree. We convert the cell lineage tree inferred from OncoNEM to an equivalent phylogenetic tree (potentially non-binary) by projecting the internal nodes to leaves (for details see “Methods” section), enabling us to compare OncoNEM results against true trees.
We use the tree reconstruction error for the performance metric. This measures the distance of the inferred tree from the true tree. The distance between two binary trees is measured in terms of the Robinson–Foulds (RF) distance [46], which counts the number of non-trivial bipartitions that are present in the inferred or the true tree but not in both trees. We normalize this count using the total number of bipartitions in the two trees. The output of SiFit, SCITE, and the Bayesian phylogenetic inference algorithm (MrBayes) is compared against the true tree in terms of the RF distance. The tree inferred by OncoNEM might be non-binary, so for OncoNEM trees, we separately computed the FP and FN distances between the true tree and the inferred tree. For binary trees with the same leaf set, the FP and FN distances are equal. For a non-binary tree, the FP and FN distances could differ from each other. “Methods” section gives the details of the tree reconstruction error metric for comparing trees.
Accuracy of phylogeny inference
To analyze the accuracy of SiFit’s tree inference, we simulated three sets of single-cell data with varying levels of doublet noise: (1) data sets without any doublet (δ=0), (2) data sets with 5% doublet rate (δ=0.05), and (3) data sets with 10% doublet rate (δ=0.1). For each setting, we simulated random binary phylogenetic trees for a varying number of leaves (single cells). The number of cells, i.e., leaves in the trees, m, was varied as m=50, m=100, and m=200. The number of sites, n, was varied as n=200, n=400, and n=600. For each combination of δ, n, and m, we generated ten data sets that were simulated from ten random trees. At the root of the tree, all sites have a homozygous reference genotype. The sequences are evolved along the branches of the tree starting from the root. In each branch of the tree, we simulate four types of events that can alter the genotype of a site: a new mutation, a deletion, LOH, and a recurrent point mutation (see “Methods” section for details). After evolving, the leaves have genotype sequences with true mutations. m genotype sequences corresponding to m single cells constitute the true genotype matrix. Errors are introduced into the true genotype matrix to simulate single-cell errors. For data sets with doublets, doublets are formed by merging the genotypes of two single cells (see “Methods” section) with probability δ. The FN rate for cell c, β
c
, is sampled from a normal distribution with mean β
mean=0.2 and standard deviation β
sd=β
mean/10. FNs are introduced into the genotype matrix with probability β
c
for cell c. We introduced FPs into the genotype matrix with error rate α=0.01 by converting homozygous reference genotypes to heterozygous genotypes with probability α. It is important to note that here the FP rate, α, is by definition different from the false discovery rate (FDR) reported in single-cell-based studies such as [22, 24, 26]. α here indicates the fraction of non-mutant sites that are reported as mutant in the observed genotype matrix, whereas the FDR reported in the aforementioned studies refers to the number of FP errors per sequenced base pair. For exome-sequencing studies, even a very small FDR (∼10−5) can lead to a large number of FP variants in the observed genotype matrix, making α much higher than the reported FDR. After adding noise, the imperfect genotype matrices were used as input to SiFit for learning the ML tree.
SiFit’s tree inference accuracy was compared against three other methods. The same imperfect genotype matrix was used as input to SiFit and SCITE. For OncoNEM and MrBayes, the genotype matrices were binarized by converting the heterozygous and homozygous non-reference genotypes to 1, i.e., the presence of a mutation. The comparison is shown in Fig. 1, which shows the tree reconstruction error. For each value of n, the mean error metric over ten data sets is plotted along with the standard deviation as the error bar. For data sets without doublets, SiFit substantially outperforms the other three methods for all values of m and n. The performance of each algorithm except for OncoNEM improves as the value of n increases. The behavior of OncoNEM is different. For m=100, its accuracy decreases for n=600 compared to n=400. This might be because OncoNEM was developed for clonal tree inference and the effect of an additional number of sites cannot be observed in the equivalent phylogenetic tree unless they (the additional sites) are different across the clones. For data sets with a higher number of sites (n=600), SiFit was able to find either the true tree topology or a near-perfect tree topology for most of the data sets, demonstrating its ability to infer the correct trees given enough data.
For the data sets with doublets, we measured the tree reconstruction error in two ways: (1) doublets are removed from both the true tree and inferred tree and then the RF distance is calculated and (2) the RF distance is calculated between the true tree and inferred tree without any distinction of doublets. Since, doublets are a hybrid of two cells that belong to two places in the tree, measuring the tree reconstruction error as in (1) ensures that position of all the other cells except the doublets are properly inferred, whereas (2) measures the overall tree reconstruction error. Figure 1 compares the algorithms in terms of tree reconstruction error as described in (1). SiFit outperforms the other three methods for all values of δ, m, and n. The performance of SCITE and MrBayes is substantially affected by the presence of doublets, specifically for the data sets with a smaller number of mutations. In comparison, SiFit’s performance is much more robust in the presence of doublets while recovering the positions of the non-doublets in the tree. Even in terms of the overall tree reconstruction error (measured as described in (2)), SiFit performs better than the other algorithms for all simulation settings corresponding to different values of δ, n, and m (Additional file 1: Figure S1).
Inference with missing data
Due to uneven coverage and amplification bias, current SCS data sets are challenged by missing data points where genotype states are unobserved. To investigate how missing data affect phylogeny reconstruction, we performed additional simulation experiments. For m=100 and n={200,400,600}, we generated data sets using the same error rates as before. For each combination of δ, n, and m, we generated ten data sets, for each of which, two other data sets with missing data of {10%,25%} were generated. To generate the data sets with missing data, the genotype information of sites was removed with probability 0.1, and 0.25 for missing data of {10%,25%}, respectively. SiFit’s results were compared against SCITE and OncoNEM. The results are shown in Fig. 2. For each value of δ, as the missing data rate increases from 0 to 25%, for each of the competing methods, we observe a steady increase in the tree reconstruction error.
For data sets without doublets (δ=0), irrespective of the percentage of missing data, SiFit performs substantially better than SCITE and OncoNEM. SiFit’s likelihood calculation treats each missing data point as contributing a marginal probability of 1, effectively making it equivalent to reducing the number of sites n. For the data sets with doublets, we measured the tree reconstruction error in two ways as described in the previous section. SiFit outperforms both SCITE and OncoNEM irrespective of the way the tree reconstruction error was measured (Fig. 2 and Additional file 1: Figure S2).
Robustness to increasing error rates
ADO is the major source of error in SCS data resulting in FNs [14]. To test the robustness of SiFit to an increase in FN rate β, we simulated data sets with increased FN rates. The number of cells m was set to 100 and the number of sites n was set to 400. The mean FN rate β
mean was varied from 0.2 to 0.4 in steps of 0.1, i.e., β
mean∈{0.2,0.3,0.4}. The FN rate of cell c, β
c
, was sampled from a normal distribution as described in the previous experiment. The FP rate was set to α=0.01. With these settings, for each value of β
mean∈{0.2,0.3,0.4}, ten data sets were simulated for phylogeny reconstruction.
The performance of SiFit was compared against SCITE and OncoNEM. For different settings of FN rates, SiFit consistently performs better than SCITE and OncoNEM by achieving the lowest tree reconstruction error (Fig. 3). For SCITE and SiFit, with the increase in the FN rate, the tree inference error increases. For OncoNEM, the tree reconstruction error first increases and then decreases. The rate of increase in tree reconstruction error for SiFit is also much lower compared to that of SCITE. This indicates SiFit’s higher robustness against amplification errors compared to SCITE. OncoNEM’s tree reconstruction error is higher than those of SCITE and SiFit for all values of the FN rate. For OncoNEM, binarization of the data leads to loss of information and it employs a grid search to learn the parameters before learning the optimal tree. This divisive sequential approach of learning may lead to a suboptimal solution if the initial solution gets stuck in local optima.
Estimation of error rates
In addition to the phylogenetic tree, SiFit also learns the error parameters from the data. To examine SiFit’s capability to estimate the FN rate from the data, we simulated 30 data sets from 30 random binary trees. For these data sets, the number of cells was set to 100, the number of sites was set to 400, and the FP rate was set to α=0.01. The FN rate β was varied from 0.1 to 0.4. These imperfect data matrices were given to SiFit for inference of the tree and FN rate.
SiFit performed very well in estimating FN rate, as shown in Fig. 4. The ML values of β learned from the data were highly correlated (0.9843) to the ones that generated the data. This experiment demonstrates SiFit’s ability to infer error parameters from data.
SCITE and OncoNEM can also learn the FN rate from the data. To compare SiFit’s estimate of the error rate against those of OncoNEM and SCITE, we applied SCITE and OncoNEM to the same data sets for learning FN rates. SCITE’s performance (correlation 0.9622) was better than that of OncoNEM (correlation 0.8766) but SiFit was the best performer. Specifically, for data sets with a higher FN rate (>0.2), SiFit’s estimates were much better than those of SCITE and OncoNEM. This indicates a degree of robustness of SiFit in the presence of higher error rates compared to the other methods.
Run times
To measure the run time of SiFit, we simulated data sets containing different numbers of cells. The number of cells, i.e., leaves in the trees, m, was varied as m=100, m=200, and m=500. The number of sites n was varied as n=200 and n=400. The error rates were chosen as described in the previous experiments. For each combination of m and n, ten data sets were simulated. For each of these data sets, SiFit was run for 200,000 iterations in a node with 24 CPU cores (AMD 2.2 GHz). In each case, the average run time for 200,000 iterations was recorded (Additional file 1: Figure S3). For a fixed number of sites n, with the increase in the number of cells in the tree, SiFit’s run time increases almost linearly. This behavior is observed for both n=200 and n=400. This indicates that SiFit is scalable and will adopt well when future experiments generate sequencing data consisting of thousands of single cells. The theoretical computational complexity of SiFit is described in “Methods” section.
Inference of tumor phylogeny from experimental SCS data
We applied SiFit to two experimental SCS data sets: exome sequencing from a non-hereditary colorectal cancer patient and high-throughput SCS from a metastatic colorectal cancer patient. From these data, we inferred the phylogenetic lineages of the tumor and ordered the chronology of mutations. These studies used different SCS methods and had different samples sizes and error rates. We selected them to show that SiFit is flexible and can be applied broadly to different single-cell mutation data sets.
Phylogenetic lineage of adenomatous polyps and colorectal cancer
SiFit was applied to single-cell exome sequencing data from a non-hereditary colorectal cancer [37] patient. The data set consisted of 61 single cells in total, with 35 cells sampled from colorectal cancer tissue, 13 from an adenomatous polyp tissue, and 13 from normal colorectal tissue. Variant calling resulted in the detection of 77 somatic SNVs from these 61 cells. In total, approximately 9.4% of the values were missing in the data set. The reported genotypes were binary values, representing the presence or absence of a mutation at the SNV sites (Additional file 1: Figure S4a).
To test whether the genotype matrix violates the infinite-sites assumption, we ran the four-gamete test. The four-gamete theorem states that an m×n binary matrix M has an undirected perfect phylogeny if and only if no pair of columns contain all four binary pairs (0,0; 0,1; 1,0; and 1,1), where m represents the number of taxa (leaves of the tree) and n represents genomic sites [47]. The perfect phylogeny model conveys the biological feature that every genomic site mutates at most once in the phylogeny [47] and that mutations are never lost. The existence of a perfect phylogeny shows that the data could fit the infinite-sites model of evolution. A violation of the four-gamete condition may indicate a potential deviation from the infinite-sites assumption. However, it is important to note that for SCS data, there could be more than one potential event leading to violation of the four-gamete test (see Additional file 1: Supplementary Note and Additional file 1: Figures S5 and S6 for more details). The binary mutation matrix from this colorectal patient violated the four-gamete test, with 1847 (out of 2926) pairs of SNV sites that contained all four binary pairs.
The ML tree inferred by SiFit on 77 SNVs is shown in Fig. 5. The tree shows that the normal cells are placed very close to the root. In the original study, some of the adenomatous polyp cells were found to have no somatic mutations and were speculated to have derived from normal colorectal cells. In the tree inferred by SiFit, these cells (ap8–ap13) are accurately placed along with the normal cells. The original study also reported a set of cells from the cancer tissue as normal cells because they did not contain any somatic mutations. The tree inferred from SiFit placed these cells along with the normal cells, representing a completely independent lineage that likely initiated from a different originating cell.
We performed k-medoids clustering using the silhouette score (see “Methods” section for details) on the ML tree-based distance matrix. The cancer cells were clustered into two subpopulations (A and B). The chronological order of the mutations was inferred based on the inference of the mutation status of the internal nodes. We extended the algorithm in [48] for inferring ancestral sequences by accounting for single-cell-specific errors (see “Methods” section for details). This enabled us to find the ML solution for placing the mutations on the branches of the SiFit tree. Altogether, 53 clonal mutations occurred in the trunk of the tree, including mutations in LAMA1 (PI3K-Akt signaling pathway) and ADCY3 (FGFR signaling pathway). These clonal mutations are driver events that likely led to the expansion of subpopulation A. Subpopulation B emerged from subpopulation A by acquiring additional subclonal mutations in EPHA5, CASQ2, and SMARCE1. The SiFit tree also shows the evolution of the adenomatous polyp cells (marked in blue), which evolved from the normal cells by acquiring mutations in OR1B1 (GPCR signaling pathway), DCDC5, and MLLT1. The adenomatous polyp cells evolved independently and further accumulated mutations in CSMD1, FBXO15, and TCP11. The tree inferred by SiFit represented the evolution of both the adenomatous polyp cells and the colorectal cancer cells and identified the order of the mutations that are associated with different signaling pathways and may have played a key role in the development of heterogeneity in this cancer patient.
To compare the results of SiFit with other algorithms, we also applied SCITE and OncoNEM on this data set. To enable a direct comparison, SCITE was used to infer a binary leaf-labeled tree, which is an ML solution with the single cells placed at the leaves of the tree. SCITE reported a single ML tree (T
SCITE) from this data set (Additional file 1: Figure S7). We compared the tree inferred by SiFit (T
SiFit) to the tree inferred by SCITE in terms of the likelihood value. Since the ML tree inferred by SCITE (T
SCITE) does not have branch lengths, we cannot directly compute the likelihood value of T
SCITE using our likelihood function. Instead, we used the likelihood function of SCITE to compare the two trees. SCITE uses an expected mutation matrix defined by the mutation tree topology and sample attachments to compute the likelihood of a tree. After finding the ML placement of the mutations on the SiFit tree (T
SiFit), we obtained the expected mutation matrix E, defined by T
SiFit and the annotated mutations on the branches of T
SiFit and then calculated its likelihood using Eq. 3 of [39]. This likelihood function of SCITE gives an edge to SCITE and is disadvantageous for SiFit because the branch lengths inferred by SiFit are ignored in this likelihood calculation. T
SiFit had a log-likelihood value of −632.5, which was substantially higher than the log-likelihood (−785.92) of T
SCITE. This higher likelihood suggests that the tree inferred by SiFit explains the data better than that of SCITE on this experimental data set.
We used OncoNEM to infer the cell lineage tree (T
OncoNEM) from this data set (Additional file 1: Figure S8). OncoNEM can also estimate the occurrence of mutations on the cell lineage tree based on posterior probability. Since, OncoNEM follows the infinite-sites assumption, if a cell in the lineage tree contains a mutation, all its descendants should have that mutation. Based on this principle and OncoNEM’s estimate of the occurrence of mutations, we can compute an expected mutation matrix that is defined by T
OncoNEM. This enabled us to use the likelihood function of SCITE to compare T
OncoNEM against T
SiFit. The log-likelihood value (−664.79) of T
OncoNEM was better than that of T
SCITE but it was worse than that of T
SiFit. The higher likelihood of the tree inferred by SiFit compared to those of OncoNEM and SCITE suggests that the expected mutation matrix defined by SiFit’s tree inferred under a finite-sites model of evolution explains the data better than those of its contemporaries inferred under the infinite-sites assumption.
Phylogenetic lineage of a metastatic colorectal cancer patient
Next, we applied SiFit to infer the metastatic lineage of a colorectal cancer patient with a matched primary tumor and liver metastasis that was untreated. This data set consisted of highly-multiplexed SCS data [31] from 178 single cells using a 1000 cancer gene panel. Variant calling resulted in the detection of 16 somatic SNVs from these 178 cells [49]. The FP rate was estimated to be 1.52% and the FN rate was estimated to be 7.89%. In total, approximately 6.9% of the values were missing in the data set. The reported genotypes were binary values, representing the presence or absence of a mutation at the SNV sites (Additional file 1: Figure S4b).
Altogether, 104 (out of 120) pairs of SNV sites violated the four-gamete test, indicating the potential violation of the infinite-sites assumption.
The ML tree inferred by SiFit from this data set is shown in Fig. 6. k-medoids clustering using the silhouette score on the ML tree-based distance matrix identified three subpopulations of somatically mutated cells along with the population of cells without mutations. The subpopulation of cells (marked in cyan) without mutations consisted mostly of diploid cells, suggesting they are normal stromal cells. The first somatic subpopulation (marked in green) consisted of mostly diploid cells. The second subpopulation (marked in blue) consisted of mostly primary aneuploid cells and a few diploid cells. The third subpopulation (marked in red) consisted of metastatic cells only. The chronological order of the mutations was inferred based on the ML placement of the mutations on the branches of the tree. Three diploid cells in the first subpopulation first acquired a heterozygous nonsense mutation in APC. This mutation was present in all the descendants (all primary and metastatic tumor cells), suggesting that this was the first mutation that initiated the tumor. Subsequently, mutations were acquired in the KRAS oncogene, the TP53 tumor suppressor gene, and the CCNE1 oncogene, which led to the expansion of the primary tumor mass. These primary tumor cells accumulated seven additional somatic mutations. In the later stages of the phylogeny, the accumulation of mutations in EYS, ZNF521, and TRRAP marked the point of metastatic divergence, after which tumor cells disseminated to the liver. Three more mutations occurred in RBFOX1, GATA1, and MYH9. The phylogeny also indicates potential losses of mutations, including POU2AF1, which was lost in 17 primary tumor cells, and the mutation in TCF7L2, which was lost in four metastatic tumor cells, but these losses did not mark any point of divergence, indicating they might be passenger mutations.
We also applied SCITE and OncoNEM on this data set. SCITE inferred a single binary leaf-labeled tree (T
SCITE, shown in Additional file 1: Figure S9), which is the ML solution with a log-likelihood score of −387.68. To compute the likelihood of the tree (T
SiFit) inferred by SiFit using SCITE’s likelihood function, we computed the expected mutation matrix E defined by T
SiFit using the ML placement of the mutations on its branches. T
SiFit had a higher value of the log-likelihood score (−201.63). OncoNEM was used to infer a cell lineage tree (T
OncoNEM, shown in Additional file 1: Figure S10) from this data set. We also estimated the occurrence of mutations on T
OncoNEM based on the posterior probability values. This enabled us to calculate the likelihood of T
OncoNEM through the computation of the expected mutation defined by T
OncoNEM. T
OncoNEM had a log-likelihood value of −349.95, which is worse than that of T
SiFit. The higher likelihood value of T
SiFit on this data set suggests that the tree inferred by SiFit is superior to those of SCITE and OncoNEM in terms of explaining the data.