Tree inference from single-cell mutation profiles
We first provide a brief description of our approach to tree inference from single-cell mutation profiles. We start with a model for representing single-cell mutation histories and the likelihood-based approach we use to deal with sequencing errors. Then we give an overview of the different variants of the MCMC sampling scheme implemented in SCITE. A more technical description of SCITE is given in the “Methods” section.
Model of tumor evolution and tree representation
We restrict the evolutionary model to point mutations in this work and make the infinite sites assumption, which states that every genome position mutates at most once in the evolutionary history of a tumor. No further constraints are necessary; in particular, we make no assumption of a monoclonal origin of the tumor, which is a core assumption in tree reconstruction from mixed samples.
We represent the mutation status of m single cells at n different loci in a binary n×m mutation matrix E where a 1, respectively a 0, at entry (i,j) denotes the presence, respectively the absence, of mutation i in cell j (Fig. 1c). With the exclusion of convergent evolution due to the infinite sites assumption, this matrix defines a perfect phylogeny of the single cells. This means that there exists a rooted binary tree with the cells as leaves in which every mutation can be placed on one edge such that the mutation status of every leaf equals the set of mutations on its path to the root (Fig. 1b). Mutations present in all cells can be removed from the data as their location in the tree is known. The same is true for mutations observed only in a single cell. These are directly associated with the cell and non-informative for the tree reconstruction. For example, the mutation matrix from Fig. 1c reduces to:
$$ \begin{aligned} &\qquad \qquad \quad s_{1} \quad s_{2} \quad s_{3} \quad s_{4} \quad \ s_{5} \quad s_{6} \quad s_{7}\\ &E= \begin{array}{c} M_{1}\\ M_{2}\\ M_{3} \end{array} \left(\begin{array}{ccccccc} 1 & \quad 1 & \quad 1 & \quad 0 & \quad 0 & \quad 0 & \quad 0 \\ 0 & \quad 0 & \quad 0 & \quad 0 & \quad 1 & \quad 1 & \quad 1 \\ 0 & \quad 0 & \quad 0 & \quad 0 & \quad 1 & \quad 1 & \quad 0 \end{array}\right), \end{aligned} $$
(1)
where we now represent the remaining three mutations as M1, M2, and M3. In general, the binary tree defined by the matrix E will not be unique. In the example in Fig. 1b, the three left-most leaves all have the same mutation status, so their branching order in the tree is arbitrary. Also the correct placement of the fourth leaf is not unique, as it has no mutation other than the ones shared by all samples. It could equally well branch off in the left subtree after the two ubiquitous mutations instead of the right one. A more compact tree representation of E is a mutation tree T, which represents the mutations as nodes and connects the nodes according to their order in the evolutionary history. An empty node is used to indicate the root (Fig. 1d). The mutation tree can be seen as the perfect phylogeny tree, where instead of placing the mutations along the edges we encapsulate them inside internal nodes. This slight change in representation facilitates our inference later. The mutation tree can be augmented with the sequenced cells by attaching them to the node that matches their mutation state (Fig. 1f). The order of mutations shared by the exact same set of cells is unidentifiable in the mutation tree, as is the case for the two top-most mutations in Fig. 1f. Such subsets of mutations are summarized in a single node, here highlighted as a shaded box.
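To make the tree representation concrete, the following sketch (our illustration, not part of SCITE) encodes a mutation tree over n mutation nodes as a parent vector, with node n as the root, and derives the binary matrix E implied by attaching each cell to a node: a cell carries exactly the mutations on the path from its attachment node up to the root. With a toy tree and attachments consistent with the example above, it reproduces the reduced matrix of Eq. 1.

```python
import numpy as np

def expected_matrix(parent, sigma, n):
    """Binary n x m mutation matrix E implied by a mutation tree and attachments.

    parent[i] is the parent of mutation node i (node n is the root);
    sigma[j] is the node that cell j attaches to (n means the unmutated root).
    """
    m = len(sigma)
    E = np.zeros((n, m), dtype=int)
    for j, node in enumerate(sigma):
        while node != n:          # walk from the attachment node up to the root
            E[node, j] = 1        # cell j carries mutation `node`
            node = parent[node]
    return E

# A toy tree and attachments consistent with Eq. 1: M2 is the parent of M3,
# while M1 and M2 branch off the root (node 3).
parent = [3, 3, 1]                # parents of M1, M2, M3 (0-indexed), root = 3
sigma = [0, 0, 0, 3, 2, 2, 1]     # attachment nodes of cells s1..s7
print(expected_matrix(parent, sigma, n=3))   # reproduces the matrix in Eq. 1
```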
Observational errors
In real data, we do not observe a perfect mutation matrix (Fig. 1c) but a noisy version of it (Fig. 1g), which we denote by D in the following. If the true mutation value is 0, we may observe a 1 with probability α (false positive), and if the true mutation value is 1, we may observe a 0 with probability β (false negative) such that
$${} \begin{aligned} P(D_{ij}=0 | E_{ij} = 0) &= 1-\alpha, & P(D_{ij}=0 | E_{ij} = 1) &= \beta,\\ P(D_{ij}=1 | E_{ij} = 0) &=\alpha, & P(D_{ij}=1 | E_{ij} = 1) &= 1-\beta. \end{aligned} $$
(2)
Assuming the observational errors are independent of each other, the likelihood of the data given a mutation tree T, knowledge of the attachment of the samples σ, and the error rates θ=(α,β) is then
$$ P(D| T,\boldsymbol{\sigma},\boldsymbol{\theta}) =\prod_{i=1}^{n}\prod_{j=1}^{m} P(D_{ij} | E_{ij}), $$
(3)
where E is the mutation matrix defined by T and σ.
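As a brief sketch of how Eqs. 2 and 3 combine (our illustration, not SCITE's implementation), the log-likelihood can be evaluated by looking up each observed entry in the 2×2 error table; missing entries, coded here as −1, simply contribute no factor:

```python
import numpy as np

def log_likelihood(D, E, alpha, beta):
    """Log of Eq. 3: P(D | T, sigma, theta), with E the matrix defined by T and sigma.

    D may contain -1 for missing entries, which are skipped.
    """
    # P[d, e] = P(D_ij = d | E_ij = e), the error table of Eq. 2
    P = np.array([[1 - alpha, beta],
                  [alpha, 1 - beta]])
    n, m = D.shape
    logL = 0.0
    for i in range(n):
        for j in range(m):
            if D[i, j] != -1:
                logL += np.log(P[D[i, j], E[i, j]])
    return logL
```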
For the posterior,
$$ P(T,\boldsymbol{\sigma},\boldsymbol{\theta} | D) \propto P(D| T,\boldsymbol{\sigma},\boldsymbol{\theta}) P(T,\boldsymbol{\sigma},\boldsymbol{\theta}), $$
(4)
we can factorize the prior as P(T,σ,θ)=P(σ|T,θ)P(T,θ). Assuming independence of the error rates, we set P(T,σ,θ)=P(σ|T)P(T)P(θ), where the attachment prior P(σ|T) may depend on T. Such a prior might be useful if one suspects that cells are more likely to be sampled from later stages of tumor development and hence from lower down in the tree. Here, though, we use a uniform attachment prior.
MCMC sampling
Our model for learning mutation histories from single-cell mutation profiles consists of three parts: the mutation tree T, the sample attachment vector σ, and the error rates of the sequencing experiment θ. The resulting search space has a continuous component for θ and a discrete component of size (n+1)^(n−1)(n+1)^m for (T,σ), which prohibits an exhaustive search. Instead, with Eqs. 3 and 4 we built SCITE, an MCMC scheme to sample from the joint posterior given the data. From the current state (T,σ,θ), we propose a new state (T′,σ′,θ′) with an ergodic mixture of moves in which we change one component at a time. With properly defined transition probabilities and acceptance ratio, our chain converges to the posterior. In practice, we marginalize out the sample attachments in our model, not only to speed up convergence but also to focus on the mutation tree T as the informative part for understanding the mutation history. Thus,
$$ P(T,\boldsymbol{\theta} | D) = \sum_{\boldsymbol{\sigma}} P(T,\boldsymbol{\sigma},\boldsymbol{\theta} | D). $$
(5)
We then only need to consider moves in the joint (T,θ) space, thereby reducing the search space by a factor of (n+1)^m. It is still possible to augment the tree with the samples in a post-processing step by sampling them conditionally on the tree.
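A sketch of the marginalization in Eq. 5 under the uniform attachment prior used here (function names are ours): given the tree, the attachments of different cells are independent, so the sum over all (n+1)^m attachment vectors factorizes into a per-cell sum over the n+1 possible attachment nodes.

```python
import numpy as np

def log_marginal_likelihood(D, parent, alpha, beta):
    """log P(D | T, theta) with the attachments summed out (Eq. 5),
    assuming a uniform prior over the n + 1 attachment nodes for each cell."""
    n, m = D.shape
    P = np.array([[1 - alpha, beta],      # P(D=0|E=0), P(D=0|E=1)
                  [alpha, 1 - beta]])     # P(D=1|E=0), P(D=1|E=1)

    def mutations_on_path(v):
        """Mutations carried by a cell attached at node v (the root n carries none)."""
        muts = set()
        while v != n:
            muts.add(v)
            v = parent[v]
        return muts

    logL = 0.0
    for j in range(m):                    # cells are independent given the tree
        attach_scores = []
        for v in range(n + 1):            # possible attachment nodes, including the root
            muts = mutations_on_path(v)
            s = 0.0
            for i in range(n):
                if D[i, j] == -1:         # missing entry contributes no factor
                    continue
                s += np.log(P[D[i, j], 1 if i in muts else 0])
            attach_scores.append(s)
        logL += np.logaddexp.reduce(attach_scores) - np.log(n + 1)
    return logL
```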
After convergence, the MCMC chain can be used to sample trees and error rates proportionally to the joint posterior distribution in Eq. 4. In addition, it is possible to obtain a single best fitting combination of mutation tree and error rates via point estimates of the model parameters. One way of doing this is via maximum a posteriori (MAP) estimates:
$$ (T,\boldsymbol{\theta})_{\text{MAP}} = \text{arg} \max_{(T,\boldsymbol{\theta})} P(T,\boldsymbol{\theta} | D). $$
(6)
Another possibility is to use ML estimates. Since the likelihood depends on the full set of model parameters (T,σ,θ), it is more natural to optimize them all jointly rather than marginalizing out the sample attachment:
$$ (T,\boldsymbol{\sigma},\boldsymbol{\theta})_{\text{ML}} = \text{arg} \max_{(T,\boldsymbol{\sigma},\boldsymbol{\theta})} P(D | T,\boldsymbol{\sigma}, \boldsymbol{\theta}). $$
(7)
In the ML framework, SCITE includes a parameter γ that amplifies the likelihood and can speed up discovery of the ML tree.
Finally, SCITE provides an option to skip the learning of error rates when fixed error rates are provided. Since these are often available for sequencing data, they can be used instead to reduce the search space size.
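The actual move set of SCITE is described in the “Methods” section; as a simplified illustration of the general scheme, the sketch below (ours) runs a Metropolis sampler at fixed error rates using only a single prune-and-reattach move, which is symmetric, together with a uniform tree prior.

```python
import numpy as np
rng = np.random.default_rng(0)

def descendants(parent, v, n):
    """All mutation nodes in the subtree rooted at v (including v); node n is the root."""
    desc = set()
    for u in range(n):
        w = u
        while w != n and w != v:
            w = parent[w]
        if w == v:
            desc.add(u)
    return desc

def prune_and_reattach(parent, n):
    """Detach a random mutation node and reattach it to a node outside its own subtree.
    The proposal is symmetric, since the subtree of the moved node is unchanged."""
    v = int(rng.integers(n))
    forbidden = descendants(parent, v, n)
    choices = [u for u in range(n + 1) if u not in forbidden]
    proposal = list(parent)
    proposal[v] = int(rng.choice(choices))
    return proposal

def mcmc_sample(D, alpha, beta, log_score, steps=10000):
    """Metropolis sampling of mutation trees at fixed error rates and a uniform tree prior.
    log_score(D, parent, alpha, beta) should return a (marginal) log-likelihood, e.g. Eq. 5."""
    n = D.shape[0]
    parent = [n] * n                      # start from the star tree
    score = log_score(D, parent, alpha, beta)
    samples = []
    for _ in range(steps):
        prop = prune_and_reattach(parent, n)
        prop_score = log_score(D, prop, alpha, beta)
        if np.log(rng.random()) < prop_score - score:   # symmetric proposal, uniform prior
            parent, score = prop, prop_score
        samples.append((list(parent), score))
    return samples
```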
Reconstructing mutation histories from real tumor data
For a first evaluation of SCITE, we applied it to three real single-cell tumor data sets of different data quality.
JAK2-negative myeloproliferative neoplasm
The first data set is single-cell exome sequencing data from a JAK2-negative myeloproliferative neoplasm (essential thrombocythemia) [30]. It originally consists of 712 SNVs detected in the exomes of 58 tumor cells. In our evaluation, we focus on the 18 mutation sites selected as cancer-related by [30]. The error rates of the sequencing were estimated as α=6.04×10^−6 (false positives) and β=0.4309 (false negatives, allelic dropout). In addition, the reduced set has 45 % missing data points (compared to 58 % in the full data set). The mutation matrix (Additional file 1: Figure S1a) is taken from [34]. It distinguishes three observed states: normal, heterozygous mutation, and homozygous mutation. The homozygous state refers only to the observation, not to a mutation that is actually present in homozygous form, as the latter would contradict the infinite sites model in which each site mutates at most once. Explanations consistent with infinite sites are that we either have a false negative for the normal copy of a heterozygous site or, less likely, a combination of a false positive and an allelic dropout for a site whose true state is homozygous normal. Another explanation for observing a homozygous mutation could be a loss of heterozygosity. We adapted our approach to integrate the third mutation state by using the same error probabilities as [34]. They assume that an allelic dropout is equally likely to cause a heterozygous mutation to be recorded as a normal state or as homozygous. Denoting heterozygous sites by 1 and homozygous sites by 2, this assumption results in the error probabilities:
$$ \begin{aligned} P(D_{ij}=0 | E_{ij} = 0) &= 1-\alpha -\frac{\alpha\beta}{2}, & P(D_{ij}=0 | E_{ij} = 1) &= \frac{\beta}{2}, \\ P(D_{ij}=1 | E_{ij} = 0) &=\alpha, & P(D_{ij}=1 | E_{ij} = 1) &= 1-\beta, \\ P(D_{ij}=2 | E_{ij} = 0) &= \frac{\alpha\beta}{2}, & P(D_{ij}=2 | E_{ij} = 1) &= \frac{\beta}{2}. \end{aligned} $$
(8)
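As a small sketch (ours), the per-entry observation probabilities of Eq. 8 can be collected in a 3×2 lookup table whose columns sum to one:

```python
import numpy as np

def error_table(alpha, beta):
    """P[d, e] = P(D_ij = d | E_ij = e) for the three observed states of Eq. 8."""
    P = np.array([
        [1 - alpha - alpha * beta / 2, beta / 2],   # observe 0 (normal)
        [alpha,                        1 - beta],   # observe 1 (heterozygous)
        [alpha * beta / 2,             beta / 2],   # observe 2 (homozygous)
    ])
    assert np.allclose(P.sum(axis=0), 1.0)          # columns sum to one
    return P

print(error_table(alpha=6.04e-6, beta=0.4309))
```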
Mutation tree reconstruction
We computed the ML tree for the 18 mutation sites with SCITE. When optimizing tree and sample attachment, we obtain a mostly linear mutation tree with a single branching in the lower part of the tree (Additional file 1: Figure S2a) and an ML log score of −378.4.
We observe that quite a few samples are placed at nodes high up in the tree (Additional file 1: Figure S3), though many of these placements are uncertain, as indicated by the multiple co-optimal attachments. Taking into account the uncertainties due to the high error rates and the large number of missing values (45 %), it is not unexpected that many cells fit equally well to several neighboring nodes. The linear nature of the tree matches a sequential monoclonal development. The subclone expansion starting towards the bottom of the tree indicates the co-existence of multiple subclones at the point of sampling. However, from the single time point data, it is not possible to decide whether the more recent subclones are on the verge of replacing the more ancestral clones, or will coexist for longer.
Along with finding the ML tree with attachments, we performed a fully Bayesian sampling of trees and attachments from the posterior. To summarize such a sample, we consider as an example the number of branches the trees possess. The distribution for the data from [30] (Fig. 2a) shows that the trees mostly have a single branching point (with two branches), like the ML tree, and often occur as a simple linear chain with a single branch.
Comparison to trees found with other approaches
The same data have previously been analyzed with two competing methods [33, 34].
Kim and Simon [34] employ the same underlying likelihood with errors as in Eq. 8 but they use the data to learn ancestral relations between each pair of mutation nodes instead of the whole tree at once. They also use the data to learn a parameter representing how quickly the mutation tree branches. This parameter is then used to calculate the prior probability of ancestral relations, which is fed into their pairwise test and subsequent tree reconstruction.
With the data from [30] (on the same 18 selected mutations), [34] estimate that 92 % of the evolutionary time of the phylogenetic tree lies before the first binary split. In their model, this translates into expecting over 80 % of the mutations to occur before any branching in the mutation tree. Despite this very linear tree estimate, their algorithm for turning the pairwise ancestral relations into a mutation tree leads to the very branched tree in Additional file 1: Figure S2c, which has a much lower log-likelihood of −1059.7 than the ML tree found with SCITE (log-likelihood of −378.4). This may be due to Kim and Simon's use of a minimum spanning tree algorithm. The method effectively needs to turn ancestral relations into strict parent–child relations, thereby essentially discounting the deeper history embedded in their pairwise tests.
We cannot compare directly to the tree found by BitPhylogeny [33] since their algorithm aims to find the phylogenetic connection between the samples themselves rather than the mutation tree. Furthermore, the algorithm groups samples into clones according to the data and a stick-breaking prior. For example, using all the mutation data from [30], as well as a bulk normal and bulk cancer sequence, and with a particular stick-breaking tree prior they find one large clone accounting for over half the samples and eight further smaller clones arranged in a tree structure [33]. However, we can view their result as a mutation tree with attachments where the mutations themselves have been censored. This leaves just the sample attachment information as well as the global tree structure between their groupings.
To build a complete mutation tree, we allow each mutation to be placed before any one of the clonal groupings of samples (or completely afterwards). For each mutation, we find its ML position and hence obtain the ML tree (with attachments) that respects the result of [33]. The resulting tree (Additional file 1: Figure S2b) is a mostly linear chain like the ML tree SCITE finds and involves some of the same genes at the branches, although one of our branches is lost. The log-likelihood of −642.3 for this tree is substantially better than that of the tree of [34] but worse than that of the tree SCITE finds (log-likelihood of −378.4). With single-cell sequencing we can, as we do here, simply treat each cell as its own clone and discover the phylogeny directly. BitPhylogeny [33] instead focuses on clustering samples into subclones during tree inference, thereby reducing the resolution of the reconstruction.
Error rate learning
Within our Bayesian MCMC approach, we can also sample error rates from the posterior. Focusing on the false negative error rate β while keeping the false positive rate α fixed, we chose a beta prior on β with mean 0.4309 and a large standard deviation of 0.1. In the MCMC chain, with probability 10 % a new β′ is proposed following a Gaussian random walk with standard deviation equal to one third of the prior's. Running the chain for 10 million steps, discarding the first quarter as burn-in, and plotting the resulting posterior of β, we arrive at Fig. 2b. The posterior mean is 0.455 with standard deviation 0.027, so the data indicate that the measured value of 0.4309 is slightly underestimated but well within tolerances.
More interesting for our purposes is how these error rates affect the tree inference. The MAP β is 0.455, while the MAP tree (with attachments marginalized out) is a simple chain (Additional file 1: Figure S4). The mutation order is similar to that of the ML tree (Additional file 1: Figure S2a) up to the branching point, suggesting a monoclonal tumor development. Keeping the error rate fixed at 0.4309 instead, we find an identical MAP tree, giving us confidence that the inference is robust against minor differences in the error rates.
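A sketch of the β update described above (our illustration; the tree moves are omitted and the beta prior parameters are obtained by moment matching): with probability 10 % the sampler would perform such a step, proposing β′ from a Gaussian random walk and accepting with a Metropolis ratio that includes the beta prior.

```python
import numpy as np
from scipy import stats
rng = np.random.default_rng(1)

# Beta prior with mean 0.4309 and standard deviation 0.1, by moment matching
prior_mean, prior_sd = 0.4309, 0.1
common = prior_mean * (1 - prior_mean) / prior_sd**2 - 1
prior_a, prior_b = prior_mean * common, (1 - prior_mean) * common

def beta_update(beta, log_lik):
    """One Metropolis step for the false negative rate at a fixed tree.
    log_lik(beta) should return log P(D | T, beta), e.g. via Eq. 5."""
    beta_new = beta + rng.normal(0.0, prior_sd / 3.0)    # random walk, sd = prior sd / 3
    if not 0.0 < beta_new < 1.0:
        return beta                                      # zero prior density outside (0, 1)
    log_ratio = (log_lik(beta_new) - log_lik(beta)
                 + stats.beta.logpdf(beta_new, prior_a, prior_b)
                 - stats.beta.logpdf(beta, prior_a, prior_b))
    return beta_new if np.log(rng.random()) < log_ratio else beta
```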
Mutation tree inference for a larger set of mutations
We also considered a larger set of mutations comprising all 78 non-synonymous mutations from the full data set. For this number of mutations, with only 58 sampled cells and high levels of missing data (48 %), the posterior is rather flat, making it more difficult to discover a global rather than a local optimum. Increasing the parameter γ to 2–3 to amplify the likelihood landscape helped in discovering high-scoring trees. We also found that the alternative tree representation (see “Methods”), designed for instances with more mutations than samples, aided in finding the ML tree (Additional file 1: Figure S5). The ML tree is again highly linear, but the order of some of the 18 mutations in particular varies compared to the ML tree inferred for that subset of the data (Additional file 1: Figure S3). With missing data, mutations may fit equally well along several edges, and they were placed in their earliest position, which may explain some of the variation. More generally though, the high levels of missing data allow mutations and samples to move without affecting the likelihood, while the high error rates allow further rearrangements with only a small effect. For example, the mutation in the gene PDE4DIP, whose placement changes most between the two data sets, has 59 % missing data. Also, the order is essentially determined by the smaller number of samples that attach higher up the trees. This smaller number is effectively reduced further by the missing data, limiting the accuracy of any tree reconstruction, as explored later with the simulations.
Clear-cell renal-cell carcinoma
The second data set is from single-cell exome sequencing of a clear-cell renal-cell carcinoma [35]. The mutation statuses of 50 sites in 17 tumor cells are detailed in the supplementary material of [35]. We marked the presence of an SNV when the call was different from the consensus of five normal tissue cells (in line with the totals provided in their supplementary material). As for the data from [30], [35] distinguish between heterozygous and homozygous mutations, so we again use Eq. 8. Of the 50 sites, 35 were unmutated in at least one cell. Only these were selected, since the remaining 15 would simply be placed at the top of the mutation tree. The error rates were estimated by [35] as α=2.67×10^−5 (false positives) and β=0.1643 (false negatives), and the data also have 22 % missing entries (Additional file 1: Figure S1b).
Mutation tree reconstruction
The ML and MAP trees both possess a completely linear accumulation of mutations (Additional file 1: Figures S6 and S7a), which is consistent with a series of monoclonal expansions and the conclusions of [35]. The linearity is confirmed in the full posterior distribution of trees, with a linear chain being dominant (Fig. 2c). In addition, we observe that almost all of the samples are placed towards the end of the tree. Again, a larger value of the parameter γ and the alternative tree representation sped up discovery of ML trees.
Error rate learning
Fixing a beta prior for β with mean 0.1643 and standard deviation 0.06, the posterior distribution of β was obtained by averaging over ten runs of 10 million steps (with a quarter as burn-in) (Fig. 2d). The posterior mean is a little larger at 0.207 with a standard deviation of 0.019, so the stated value is just within the uncertainties. The MAP value of β is a little closer at 0.198, while the MAP tree (Additional file 1: Figure S7b) is essentially identical to that with a fixed value of β=0.1643 (Additional file 1: Figure S7a). The order of some of the higher mutations varies, however, since their exact placement hardly affects the posterior probability.
Estrogen-receptor positive (ER +) breast cancer
The third data set is from single-nucleus exome sequencing of 47 tumor cells from an estrogen-receptor positive (ER+) breast cancer [36]. Only two states are called for each site: the presence or absence of an SNV. Estimated error rates from [36] are 9.72 % for allelic dropout and 1.24×10^−6 for false discovery. In our analysis, we use the 40 mutations present in at least two tumor cells (Additional file 1: Figure S1c).
Mutation tree reconstruction
The MAP tree computed for this data set is shown in Fig. 3. In the Supplement, we additionally show the ML tree (Additional file 1: Figure S8) and a version of the MAP tree with attached samples (Additional file 1: Figure S9a). In both the MAP and the ML trees, we observe a linear accumulation of mutations in the early stages of the tumor, suggesting that the development was through a sequential replacement of subclones with no surviving side branches and only a few cells with ancestral states surviving until present. In the later stages of the tumor, we observe a complex branching into co-existing subclones. This branching is exhibited more generally in the full posterior distribution of trees, as summarized in Fig. 2e.
From the single time point data available for this tumor, it cannot be inferred whether there will be long-term coexistence of subclones or whether we observe a transient state that will eventually lead to a single surviving subclone. For initial cancer treatment, however, the status quo, that is, which mutations co-occur in cells, is already informative for jointly targeting the present subclones and thereby minimizing the risk of further differentiation into therapy-resistant subclones.
Error rate learning
Using a beta prior for β with mean 0.0972 and standard deviation 0.04, we averaged over 20 runs of 10 million steps (with a quarter as burn-in) to obtain the posterior distribution of β (Fig. 2f). The posterior mean is more than double at 0.228 (with a standard deviation of 0.015), which disagrees with the stated value. This result is in contrast to our later simulations on learning the error rate (Fig. 4), which show that the MAP value is close to the true one. A possible explanation for the discrepancy is that allelic dropout comprises only one part of the false negative rate. Other contributing factors could include inaccuracies in calling heterozygous mutations at low coverage.
The MAP value of β is 0.226, with a MAP tree (Additional file 1: Figure S9b) that shares many features with the MAP tree at fixed β=0.0972 (Additional file 1: Figure S9a) but has some rearrangements of the branches lower down and some reordering of the mutations higher up. Learning the error rate also leads to slightly fewer branches in the posterior distribution, as indicated by the black crosses in Fig. 2e.
Systematic evaluation of SCITE on simulated data
With the limited availability of single-cell sequencing data at this point and the lack of ground truth in real data, we performed a more systematic evaluation of SCITE on simulated data sets. Our analysis focuses on the accuracy of tree inference and error rate learning, the effect of data quality, and the practical run times of SCITE.
Accuracy of tree inference
To check the consistency of our approach, we simulated random mutation trees with attachments uniformly, which allows for poly-clonal tree topologies. First, for n=20 and α=10^−5, we generated 100 such trees with up to 100 attachments. For error rates 100β∈{5,15,25}, for each tree we sampled a factor from a lognormal distribution with standard deviation 0.1 and multiplied it by β to obtain β*. Then we added noise to the perfect data with rates (α,β*) and removed 1 % of the data. Taking subsets of the data of size m, we learned the ML and MAP trees using the error rate β, giving a random misspecification of around 10 % compared to β*.
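A sketch of this data-generating step under the stated settings (our code; entry-wise independent flips, uniformly missing entries coded as −1, and the lognormal misspecification of the false negative rate):

```python
import numpy as np
rng = np.random.default_rng(42)

def add_noise(E, alpha, beta, missing_rate):
    """Perturb a perfect mutation matrix E with false positives (alpha),
    false negatives (beta), and uniformly missing entries coded as -1."""
    D = E.astype(int)
    false_pos = (E == 0) & (rng.random(E.shape) < alpha)   # 0 -> 1
    false_neg = (E == 1) & (rng.random(E.shape) < beta)    # 1 -> 0
    D[false_pos] = 1
    D[false_neg] = 0
    D[rng.random(E.shape) < missing_rate] = -1
    return D

# The data are generated with a misspecified rate beta* = beta x lognormal(0, 0.1)
beta = 0.15
beta_star = beta * rng.lognormal(mean=0.0, sigma=0.1)
# D = add_noise(E, alpha=1e-5, beta=beta_star, missing_rate=0.01)   # E from a simulated tree
```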
We quantified the difference between the inferred trees and the true tree by counting how often a node has the wrong parent (Fig. 5 and the top row of Additional file 1: Figure S10). In the ML setting, if no samples are attached to a chain of mutations, then any ordering of those mutations has the same likelihood. Here, in the score we do not penalize this non-identifiability and take the ordering that minimizes the distance to the generating tree. The non-identifiability will, however, tend to decrease as the number of samples m increases. The MAP tree does select an ordering (roughly following the frequencies) and hence has higher distances than the ML tree. In general, MAP inference should be more robust and less prone to overfitting, but can have a higher bias. To compare the ML and MAP inference fairly, we chose a random ordering of the mutations in non-identifiable regions in the ML trees and recomputed the distances to the generating tree. We do observe a marginal improvement in the tree reconstruction with the MAP tree (Additional file 1: Figure S11).
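The distance used here counts, over the mutation nodes, how often the inferred tree assigns a different parent than the generating tree; a minimal sketch on parent-vector representations (our code):

```python
def parent_distance(parent_true, parent_inferred):
    """Number of mutation nodes whose parent differs between the two trees."""
    return sum(p != q for p, q in zip(parent_true, parent_inferred))

# Example: the chain root->M1->M2->M3 versus the chain root->M1->M3->M2 (root = 3)
print(parent_distance([3, 0, 1], [3, 2, 0]))   # 2, i.e. the two bottom mutations swapped
```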
The errors, however, are not a result of the inference method, since SCITE indeed finds the ML tree (Additional file 1: Figure S12). Instead, these errors are inherent to noisy data, where another tree might happen to fit the data better than the generating tree. The discrepancy can only be resolved by reducing errors or increasing the sample size, and Additional file 1: Figure S10 gives an indication of how this occurs. To put the errors on a scale, a value of two corresponds to two adjacent mutations in a chain being swapped. Since samples contain the mutations along their entire history in the mutation tree, we have a greater consensus about the mutation structure higher up the tree than lower down. The exact placement of mutations near the bottom of the tree may be determined by only a couple of samples, so the errors we typically see with larger m are mutations near the bottom of the tree being shifted, or two adjacent mutations being swapped. With this in mind, we obtain very good trees with about 60 samples, depending on the error rate.
We repeated the simulations for n=40 and up to 200 attachments as depicted along the bottom row of Additional file 1: Figure S10 and again find good reconstruction when we have several samples per mutation.
Learning the error rates
Since SCITE can also perform fully Bayesian tree inference, we examined its ability to infer the false negative rate from the data. For 2000 random trees with 60 attachments, we generated data with β ranging from 5 to 25 %, α=10^−5, and 1 % missing data. We further fixed a uniform prior for learning β, so that no information is passed to SCITE apart from the noisy and incomplete mutation matrix.
There is a very high correlation between the generating β and the learned MAP value (Fig. 4). To put this in context, we consider the theoretical distribution if the tree were known. From the random trees and attachments, around 22 % of the entries in the perfect mutation matrix are ones. They are randomly changed with rate β, leading to a binomial distribution and a standard deviation of
$$\sqrt{\frac{100\beta(1-\beta)}{22mn}} $$
when inferring β from the result. One and two standard deviation intervals are included in Fig. 4, showing again that SCITE performs very well as it must also infer the tree structure and handle the missing data.
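As a quick check of the formula above (assuming n=20 mutations as in the earlier simulations and m=60 cells):

```python
from math import sqrt

n, m = 20, 60                       # mutations and cells, as assumed for these simulations
for beta in (0.05, 0.15, 0.25):
    sd = sqrt(100 * beta * (1 - beta) / (22 * m * n))
    print(f"beta = {beta:.2f}: theoretical sd ~ {sd:.3f}")
```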
Similar plots for m=40 and m=80 (Additional file 1: Figure S13) show also a tightening of the β inference as m increases.
The effect of missing data
High rates of missing data points due to unobserved mutation states are typical of present-day single-cell sequencing data. We performed simulation experiments to test how this feature affects the accuracy of mutation tree reconstruction. With an error rate of β=10 % and the same misspecification as before, we generated up to 400 random trees with up to 80 attachments. Keeping α=10^−5, we varied the amount of missing data from 1 to 20 % to see the effect on the tree reconstruction for m∈{40,60,80}. We see a very weak increase in reconstruction errors as the missing data rate increases (top row of Additional file 1: Figure S14). Since SCITE treats the inference probabilistically, missing data effectively reduces the number of samples m, so the behavior in Additional file 1: Figure S14 is in line with changing m slightly in Additional file 1: Figure S10. The behavior also shows that SCITE is robust even against high missing data rates.
Looking back to the even higher missing data rates in the earliest data sets, we simulated up to 60 % missing data with 400 trees and the same settings as before. The reconstruction progressively gets worse with increasing missing data (bottom row of Additional file 1: Figure S14). At around 30–40 % missing data with 80 attachments, we have similar performance as for 40 cells attached with no missing data, and so have effectively halved our sample size. With 60 % missing data, the reconstruction is notably poorer again, although SCITE does find about half the parents correctly for the MAP solution and a large majority with the ML approach. This difference is because the optimal order is chosen for the ML solutions in case of non-identifiability.
Doublet samples
Occasionally, instead of a single cell being isolated for sequencing, a pair of cells is captured. We checked how robust SCITE is to these sorts of perturbations by again simulating data from 400 random trees with 20 nodes and up to 100 attachments. To represent the sequences of doublet samples, we took up to 20 pairs of attached samples and combined them by recording a mutation whenever it was present in either of the original single cells. Errors were added with a rate of β=10 % (misspecified as previously), α=10^−5, and 1 % missing data. We ran SCITE with m∈{40,60,80} total samples, including up to 20 doublets, to see their effect on the tree reconstruction.
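A sketch of how such doublets can be formed from the perfect single-cell profiles (our illustration of the procedure just described): two columns are merged by an entrywise OR and one of them is dropped.

```python
import numpy as np
rng = np.random.default_rng(7)

def make_doublets(E, n_doublets):
    """Merge random pairs of columns of the perfect matrix E by an entrywise OR,
    so a doublet shows a mutation if either constituent cell carries it."""
    E = E.copy()
    m = E.shape[1]
    cols = rng.choice(m, size=2 * n_doublets, replace=False)   # requires 2*n_doublets <= m
    drop = set(cols[n_doublets:].tolist())
    for a, b in zip(cols[:n_doublets], cols[n_doublets:]):
        E[:, a] = np.maximum(E[:, a], E[:, b])                 # merge cell b into cell a
    return E[:, [c for c in range(m) if c not in drop]]
```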
We observe a linear increase in reconstruction errors as the number of doublets increases (Additional file 1: Figure S15) with decreasing gradient as m increases since then the doublets represent a smaller proportion of the total sample. Unlike missing data, which reduces the effective sample size, doublets add confounding mutations, which could disagree with the tree topology. However, since SCITE employs probabilistic inference, and at the level of the mutation tree rather than the sample tree, the consensus of the single-cell samples moderates the negative effects of the doublets. Even at high rates of doublet sampling, like 10 or 20 %, the tree reconstruction, therefore, performs well.
Run times
To uncover the complexity of the stochastic search and MCMC scheme, we simulated data from 400 uniformly sampled trees with up to 100 nodes and 400 attached samples. We set α=10^−5 and β=0.1 (with the same misspecification as before), included 1 % missing data, and set the parameter γ=1 as for the MCMC case. For each tree, we ran SCITE 100 times and recorded how many steps the algorithm took to first hit the highest likelihood tree uncovered by that run, as well as the time of the run. The lengths of the chains were chosen so that nearly all of the runs would share the same highest likelihood. The average number of steps needed to first find the consensus ML tree can then be calculated (for those runs with a lower likelihood, we added the length of the chain and assumed they would find the ML tree in an additional average number of steps). This number can then be multiplied by the average time per step to give a measure of how long SCITE takes to find an ML tree on average, and the procedure repeated for all 400 trees to provide Fig. 6.
On the theoretical side, arguments analogous to those in [37] indicate that the MCMC chain requires O(n^2 ln(n)) steps to converge or to find ML trees. The likelihood landscape may also depend on n and m in non-trivial ways, which can further affect the convergence. With each MCMC step taking O(mn) time to score the tree, we get an overall estimate of O(mn^3 ln(n)) for convergence.
Compared to the numerical results in Fig. 6, the gradients in the log–log plots are 4.5, 4.5, and 4.2 for m∈{n,2n,4n}, respectively. Since m∼n in the simulations, these are a little higher than the power of 4 suggested by the estimate, but roughly in line with it. To check the linear scaling with m, we take the fit lines at n=60, in the middle of the simulated range, and find that doubling m from n to 2n and then to 4n increases the time by factors of 1.9 and 1.95, slightly less than double and in line with linear scaling. With linear scaling in m, and for a reasonable number of mutations, SCITE will therefore be able to handle large numbers of sampled cells efficiently.
Further parameters that influence the practical performance of SCITE are the move probabilities and, for ML tree discovery, the parameter γ. We performed a systematic search for the optimal parameters, which is described in Additional file 1. Our observation is that an optimal choice of move probabilities gives a constant-factor speed-up compared to the default values. Similar results were observed for γ, for which the optimum for finding an ML tree quickly is just below 1, the value required for the MCMC sampling.
Comparison with competing approaches
To assess further the performance of SCITE, we compared it to a simple perfect phylogeny approach, two methods designed for single-cell data, and two recent methods for tree inference from bulk-sequencing data.
Perfect phylogeny
We first compared SCITE against a simple algorithm for solving the perfect phylogeny problem (i.e. testing whether the data define a phylogeny and, if so, constructing one [12]). A mutation matrix admits a perfect phylogeny if a tree can be constructed such that the leaves are the samples and each mutation is placed at exactly one edge, such that for every leaf the mutations on the path leading to it from the root reflect its mutation status. Such a tree exists only if there are no contradictions in the data due to noise or recurring mutations. If it exists, it can be represented as a mutation tree by labeling nodes instead of edges. To test for a perfect phylogeny, we use a version of the data with no missing values. Of our simulated trees and data, only very few are free of contradictions, which limits the tree comparison to a few instances. The perfect phylogenies on average deviate more from the true tree than both the ML and MAP trees, and none is found for instances with more than 45 samples. The differences between the perfect phylogeny and the true tree are due to both the errors introduced and insufficient information to fully reconstruct the tree. Details of the comparison are given in Additional file 1: Table S1.
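One standard way to test the perfect phylogeny condition for binary characters with known ancestral state 0 is a pairwise three-gamete check: two mutations conflict exactly when some cell carries only the first, some cell carries only the second, and some cell carries both. The sketch below illustrates this check (it is not the specific algorithm of [12]):

```python
import numpy as np

def has_perfect_phylogeny(E):
    """Pairwise three-gamete test for a binary n x m mutation matrix
    (mutations in rows, cells in columns, ancestral state 0)."""
    B = E.astype(bool)
    n = B.shape[0]
    for i in range(n):
        for k in range(i + 1, n):
            a, b = B[i], B[k]
            only_a, only_b, both = (a & ~b).any(), (~a & b).any(), (a & b).any()
            if only_a and only_b and both:       # the three gametes conflict
                return False
    return True

E = np.array([[1, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 1],
              [0, 0, 0, 0, 1, 1, 0]])
print(has_perfect_phylogeny(E))   # True: the matrix of Eq. 1 is conflict-free
```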
The approach of Kim and Simon [34]
The method in [34] reconstructs the same type of mutation trees as our approach. However, in their approach, a parameter representing how quickly the mutation tree branches is first learned from the data. This parameter is then used to calculate the prior probability of ancestral relations, which informs a pairwise ordering test and the subsequent tree reconstruction. Instead of learning the parameter from the data, we give their method the exact value from the tree that was actually used to generate the data, since this simplifies running the simulation test. Of course, in practice, this piece of information would not be available, so the results of their algorithm are overoptimistic. Nevertheless, the pairwise approximation performs comparatively poorly (Fig. 5). In particular, there is little improvement as the number of samples increases. Although the pairwise ancestral tests become more accurate, this additional information appears to have little impact on the conversion to a mutation tree.
Comparison with BitPhylogeny
More advanced probabilistic inference is provided by BitPhylogeny [33]. This method, however, reconstructs a hierarchical subclone structure rather than a mutation tree, precluding a direct comparison to SCITE and the approach of [34]. Therefore, we convert the outcome of each method into a complete mutation tree with samples attached. For SCITE, this means finding the ML tree with attachments. For the approach of [34], we place the samples at their best fitting positions on the tree found. For BitPhylogeny, we place the mutations along the branches of its clonal tree in the positions that maximize the likelihood. Since the mutations and samples may be grouped together, as a measure of fit we use the consensus node-based shortest path distance (as defined in [33]) between the (completed) inferred tree and the generating tree. In particular, within each tree, the pairwise shortest distance between any two samples is their number of differing mutations. We then normalize by averaging over the absolute differences between the pairwise distances in the inferred and generating trees, rather than taking the sum.
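A sketch of how we read this measure (our code, assuming the completed trees are given as binary mutation matrices over the same cells): compute the pairwise sample distances within each tree, then average the absolute differences.

```python
import numpy as np
from itertools import combinations

def pairwise_sample_distances(E):
    """Within one tree: the distance between two samples is the number of
    mutations in which their tree-implied profiles differ (E is the binary n x m matrix)."""
    m = E.shape[1]
    D = np.zeros((m, m))
    for p, q in combinations(range(m), 2):
        D[p, q] = D[q, p] = np.sum(E[:, p] != E[:, q])
    return D

def shortest_path_score(E_true, E_inferred):
    """Average absolute difference of the pairwise sample distances between
    the generating tree and a (completed) inferred tree."""
    m = E_true.shape[1]
    diff = np.abs(pairwise_sample_distances(E_true) - pairwise_sample_distances(E_inferred))
    return diff[np.triu_indices(m, k=1)].mean()
```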
For n=20, α=10^−5, and β=0.1 (with the same misspecification as before), we generated 400 such trees with 1 % missing data. For simplicity, and to give BitPhylogeny a slight advantage, we passed it the complete data. The results for m∈{40,60,80} are presented in Fig. 7. The competing methods perform significantly more poorly, with BitPhylogeny [33] performing better than the algorithm of [34], but neither approaches the performance of SCITE.
We can also compare the performance of the different methods in terms of the difference in log-likelihoods between the inferred and generating trees, normalized by dividing by the number of data matrix elements (Additional file 1: Figure S12). This shows behavior similar to Fig. 7, and we observe that SCITE always provides a non-negative difference; that is, SCITE always found either the generating tree or one with a slightly higher likelihood.
Comparison with bulk-sequencing methods
Finally, we compared SCITE to methods designed for deconvolution and tree reconstruction from mixed bulk sequences. We chose PhyloWGS [24] and AncesTree [22] as two recent high-performing methods that allow samples to be treated separately as well as combined. PhyloWGS employs a stick-breaking tree prior (like BitPhylogeny), while AncesTree solves the deconvolution and ancestry problem as a matrix factorization. When passing the simulated single-cell mutations as individual samples to both methods, neither returned anything other than a single grouping of mutations. A possible explanation for this result is that the two methods interpret the binary mutation states as cellular prevalences in mixed samples, which likely causes trouble in the deconvolution step. Better performance was obtained when combining the single cells into a bulk mixture, with both methods returning mutation trees with the mutations possibly grouped together at the nodes. To compare with the other methods, we again placed the samples at their best positions in the inferred trees to obtain the results in Fig. 7. AncesTree performs slightly worse than PhyloWGS, and both are notably worse than BitPhylogeny and SCITE. This is not unexpected, as only the latter two are designed to handle single-cell data. The main conclusion here is that specialized methods are necessary for single-cell data, as approaches for mixed samples are not readily applicable.