Likelihood of a clonal lineage tree
Data
We assume that the variants of the single cells have already been called and filtered so that the data set only contains the somatic variant sites. Let D=(d
kl
) be the matrix of observed genotypes where k∈{1,…,n} is the label of a single cell and l∈{1,…,m} is the index of a mutation site. Let d
kl
∈{0,1,NA} denote the mutation status of cell k at site l, where 0, 1 and NA encode an unmutated, mutated or unknown site, respectively.
Clonal lineage trees
We assume that a clonal lineage tree is a directed not necessarily binary tree \(\mathcal {T}\) whose root is the unmutated normal. Every node of this tree represents a clone c∈{1,…,N} that contains 0, 1 or multiple cells of the data set. Let c(k) denote the clone that contains cell k. In the following, we assume without loss of generality that the root has index 1.
OncoNEM
An OncoNEM has two parts: the clonal lineage tree \(\mathcal {T}\) and the occurrence parameter \(\Theta =\left \{\theta _{l}\right \}_{l=1}^{m}\), where θ
l
takes the value c of the clone where mutation l originated.
The core of our method is a function that defines the probability of the OncoNEM given a data set D and is derived in the following. Using a Bayesian approach, the posterior probability of \(\mathcal {T}\) and Θ given D can be written as
$$ P(\mathcal{T},\Theta|D) = \frac{P(D|\mathcal{T},\Theta) \, P(\Theta|\mathcal{T}) \, P(\mathcal{T})}{P(D)}. $$
((1))
The model prior \(P(\mathcal {T})\) can be used to incorporate prior biological knowledge. We assume it to be uniform over the search space. The normalizing factor P(D) is the same for all models and it is not necessary to compute it when comparing them. Therefore,
$$ P(\mathcal{T},\Theta|D) \propto P(D|\mathcal{T},\Theta) \, P(\Theta|\mathcal{T}). $$
((2))
Likelihood for known Θ
Let us assume that we know for each locus l in which clone the mutation occurred and that no mutations occur in the normal. This is equivalent to restricting the parameter space of θ
l
to {2,…,N} and is justified by stringent variant filtering of the input data.
Given \(\mathcal {T}\) and Θ, we can predict the genotype of every cell: if c is the clone in which a mutation occurred, the mutation is present in c and all descendants of c and absent in all other clones, i.e., given θ
l
=c, the tree determines the predicted genotype δ
kl
.
Finally, to calculate the likelihood of \((\mathcal {T},\Theta)\), we compare the expected genotypes with the observed ones. We model the genotyping procedure as draws of binary random variables ω
kl
from the sample space Ω={0,1} and assume that, given \(\mathcal {T}\) and Θ, the random variables are independent and identically distributed according to the probability distribution
$$ P\left(\omega_{kl} | \delta_{kl}\right) =\left(\begin{array}{ll} P\left(0 | 0\right) & P\left(1 | 0\right)\\ P\left(0 | 1\right) & P\left(1 | 1\right) \end{array} \right) = \left(\begin{array}{cc} 1-\alpha & \alpha \\ \beta & 1-\beta \end{array} \right), $$
((3))
where α and β are global probabilities of false positive and false negative draws, respectively.
We interpret the observed genotypes d
kl
as events from the event space \(\mathcal {P}(\Omega) = \{\emptyset,\{0\},\{1\},\{0,1\}\}\), where a missing value corresponds to the event {0,1}. Then, the probability of the observed genotypes D given \(\mathcal {T}\) and Θ is
$$ P(D|\mathcal{T},\Theta) = \prod\limits_{l=1}^{m} \prod\limits_{k=1}^{n} P(\omega_{kl} \in d_{kl} | \delta_{kl}), $$
((4))
where
$$ P\left(\omega_{kl} \in d_{kl} | \delta_{kl}\right) =\left\{ \begin{array}{ll} 1-\alpha & \text{if}~ d_{kl}=\{0\}~ \text{and}~ \delta_{kl}=0 \\ \alpha & \text{if}~ d_{kl}=\{1\}~ \text{and}~\delta_{kl}=0 \\ \beta & \text{if}~ d_{kl}=\{0\}~ \text{and}~ \delta_{kl}=1 \\ 1-\beta & \text{if}~ d_{kl}=\{1\}~\text{and}~ \delta_{kl}=1 \\ 1 & \text{if}~ d_{kl}=\{0,1\} \end{array} \right. $$
((5))
is the probability of a single observation given the predicted genotype.
Likelihood for unknown Θ
So far we assumed Θ to be known, but this is generally not the case. To derive the likelihood of the entire data matrix, we treat Θ as a nuisance parameter and marginalize over it. Furthermore, we make two assumptions: First, the occurrence of one mutation is independent of the occurrence of all other mutations, i.e.,
$$ P(\Theta|\mathcal{T}) = \prod\limits_{l=1}^{m} P(\theta_{l}|\mathcal{T}), $$
((6))
and second, the prior probability of a mutation occurring in a clone is
$$ P(\theta_{l}=c|\mathcal{T}) =\left\{ \begin{array}{ll} 0 & \text{if}~\textit{c}~\text{is the normal}~ (c=1), \\ \frac{1}{N-1} & \text{otherwise}. \end{array} \right. $$
((7))
Then the marginal likelihood is
$$ \begin{aligned} P(D|\mathcal{T}) =& \int P(D|\mathcal{T},\Theta) P(\Theta|\mathcal{T}) \mathrm{d}\Theta \\ =& \frac{1}{(N-1)^{m}} \prod\limits_{l=1}^{m} \sum\limits_{c=2}^{N} \prod\limits_{k=1}^{n} P\left(\omega_{kl} \in d_{kl} |\mathcal{T},\theta_{l}=c\right) \\ =& \frac{1}{(N-1)^{m}} \prod\limits_{l=1}^{m} \sum\limits_{c=2}^{N} \prod\limits_{k=1}^{n} P\left(\omega_{kl} \in d_{kl} | \delta_{kl}\right). \end{aligned} $$
((8))
Algorithms to infer OncoNEMs
OncoNEM inference is a three-step process of initial search, testing for unobserved clones and clustering.
Step 1. Initial search: building a cell tree
The search space of cell lineage trees with n nodes contains n
n−2 models, making exhaustive enumeration infeasible for trees with more than nine nodes. Therefore, we implemented a heuristic local search (see Algorithm 1), which avoids getting trapped in local optima by returning to neighbors of high-scoring previous solutions.
Step 2. Refinement: testing for unobserved clones
The number of sequenced single cells is usually small compared to the tumor size. Consequently, some clones of the tumor may not be represented in the single-cell sample. This problem is similar to the ‘unknown unknowns’ problem in reconstructing biological pathways [30], where latent variables that cause additional patterns in the observed data set can be inferred. In the OncoNEM setting, unobserved clones with at least two child clones create additional mutation patterns and can, therefore, potentially be inferred. OncoNEM accounts for this possibility by testing if there is a lineage tree with additional, unobserved branch nodes that can better explain the observed data (see Algorithm 2). Unobserved clones that linearly connect observed clones cannot be inferred, but they also do not change the shape of the tree.
Briefly, the algorithm generates trees with n+1 nodes from the previous solution by inserting an unobserved node into its branch points. These trees are used as start trees in a new search that optimizes the position of the unobserved node in the tree. A larger model is accepted if the Bayes factor of the larger versus the smaller model is larger than a threshold ε (see below). If the larger model passes the threshold, these expansion steps are repeated, otherwise the algorithm terminates with the smaller solution.
Step 3. Refinement: clustering cells into clones
The clustering procedure tests if the data can be explained better or equally well by a clonal lineage tree in which multiple cells correspond to the same node (see Algorithm 3). Nodes are clustered iteratively along branches until merging cells into clones decreases the likelihood by more than a factor of 1/ε compared to the best clustering solution found so far. Cells may be clustered into clones because they are genetically very similar or because of the limited information content of the data, which can be due to genotyping errors, missing values or a restricted number of SSNVs in the sequenced regions of the genome.
Choosing the Bayes factor threshold ε
Choosing the parameter ε is a trade-off between declaring clones with little support from the data and overly strict clustering. In this setting, choosing ε>1 means that we prefer the smaller model unless the strength of evidence for the larger model compared to the smaller one exceeds a certain threshold. Jeffreys’s [31] or Kass and Raftery’s [32] scale for the interpretation of the Bayes factor can be used as guidance. We used a value of ε=10, which denotes strong evidence according to Jeffreys’s scale.
Estimating Θ, the occurrence of mutations
Given a lineage tree, we can estimate which clones acquired which mutations during tumor development. To do this, we calculate the posterior probability of a mutation having occurred in clone c. Using a uniform prior for the occurrence parameter θ
l
∈{2,…,N}, we obtain
$$ P(\theta_{l}=c | \mathcal{T},D)=\frac{1}{Z} \prod\limits_{k=1}^{n} P\left(\omega_{kl} \in d_{kl} | \mathcal{T},\theta_{l}=c\right), $$
((9))
with normalizing constant
$$ Z = \sum\limits_{c=2}^{N} \prod\limits_{k=1}^{n} P\left(\omega_{kl} \in d_{kl}|\mathcal{T},\theta_{l}=c\right). $$
((10))
The branch lengths L of the tree can be estimated as the expected number of mutations that separate a clone c from its parent pa(c),
$$ L_{\text{pa}(c),c} = \sum\limits_{l=1}^{m} P(\theta_{l}=c|\mathcal{T},D). $$
((11))
Estimating model parameters α and β
Previous studies have estimated FDRs and ADO rates from the sequencing data [9, 10]. These error rates are, however, not equivalent to the error parameters FPR α and FNR β used by OncoNEM. This is due to three pre-processing steps that are applied to the sequencing data to generate the final genotype matrix.
In the first step, only sites that appear to be mutated are selected. Selecting only sites that report mutations from all sequenced sites enriches for false positives. It also means that the FPR used by OncoNEM is conceptually very different from the FDR reported in these studies. The FPR describes what fraction of truly non-mutant sites is reported as mutant in the observed genotype matrix, whereas the FDR corresponds to the number of false positive variants per sequenced base pair.
Even with a very small FDR, the total number of false positive variants is expected to be large, because the sequenced exome is very large. Therefore, the second pre-processing step is consensus-based variant filtering, which only selects mutations that occur multiple times for the final data set. Li et al. [11] selected the census-filtering threshold so that, under a binomial model, no site is expected to be non-mutant in all cells. However, this step cannot remove recurrent false positives caused by systematic sequencing errors. In addition to changing the FPR, this step also reduces the FNR, as it preferentially removes sites that have an above-average ADO rate.
Thirdly, a binarization step is performed that interprets all homozygous mutant sites as heterozygous normal/mutant. This step reduces the FNR by approximately 50 % and further explains why the FDR is expected to differ from previously estimated ADO rates.
While all of these steps are expected to change the error rates of the final data set, the exact impact on the parameters is difficult to estimate. Therefore, we chose to estimate error rates for our model directly from the data.
We treat the selection of model parameters as part of the learning problem and estimate them using a maximum likelihood approach, similar to Zeller et al. [33]. We create a grid of parameter combinations α and β and optimize \(\mathcal {T}\) given these parameters using the heuristic search algorithm. Then, we choose the parameter combination that yields the highest scoring tree and infer a clonal lineage tree as described above.
This parameter estimation process is computationally expensive compared to the tree inference. However, it can easily be parallelized and the grid of parameter combinations can be coarse as OncoNEM is robust to changes in the model parameters around the optimum (see simulation results). Furthermore, the range of tested parameter combinations can be reduced in the presence of prior knowledge.
Data simulation
For the simulation study, data sets were created in a two-step procedure that consists of (1) generating a tree structure and (2) simulating the corresponding genotypes.
Simulating clonal lineage trees
To simulate a tree with c clones, we select clone one to be the root and the parent of the second clone. Then, the remaining clones are added iteratively by choosing a non-root node that is already part of the tree with uniform probability as parent.
When simulating trees with unobserved clones, we count how many nodes in the simulated tree have at least two children. If this number is greater than or equal to the desired number of unobserved clones c
u
, we randomly choose c
u
of these nodes as unobserved clones, otherwise a new tree is simulated. Next, we assign one cell to every observed clone. For the remaining cells, clones are chosen iteratively with a probability proportional to the current clone size, to generate clones of different sizes.
Simulating genotype observations
For every mutation site, we choose the occurrence parameter θ
l
with uniform probability from all non-root nodes. Given Θ and the tree structure, the full matrix of true genotypes is obtained by setting an entry to 1, if the mutation occurred in a clone that is ancestral to the cell’s clone or if the mutation occurred in the clone containing the cell itself, and 0 otherwise.
Observed genotypes are derived from true genotypes by (1) setting a fraction p
missing of randomly chosen values to NA, (2) setting a fraction α of unmutated, non-missing entries to 1 and (3) setting a fraction β of mutated, non-missing entries to 0. If this yields sites without any observed mutations, we add, for each of these sites, a false positive to a randomly chosen cell. Finally, to avoid a bias in the method testing, we randomize the order of cells in the matrix of observed genotypes.
Comparison measures for method benchmarking
Clustering performance was assessed using the V-measure [34], an entropy-based cluster evaluation measure that assesses both completeness and homogeneity of the clustering solution. The V-measure takes values from 0 to 1, with higher values indicating a better performance.
To assess the similarity between trees, we developed a distance measure called pairwise cell shortest-path distance (see Fig. 7). Given are two trees, \(\mathcal {T}_{1}\) and \(\mathcal {T}_{2}\), built on the same set of cells {1,…,n}, but potentially differing in the number of nodes (clones). Note that the root of a tree can be an empty node. To ensure that every node of the tree is taken into account in the distance measure, we add an extra cell to the root before calculating the distance. Without loss of generality, we denote this additional cell in the root node with index 0. For every pair of cells i and j, we compute the shortest-path d
ij
(·) between the two cells in each tree. If the two cells belong to the same clone, their shortest-path distance is 0, otherwise the shortest-path distance equals the number of edges (regardless of direction) that separate the clones of the two cells. Finally, we sum up the absolute differences between the shortest-path distances of all unordered pairs of cells in the two trees to obtain the overall pairwise cell shortest-path distance:
$$ d(\mathcal{T}_{1},\mathcal{T}_{2}) = \sum\limits_{i=0}^{n-1} \sum\limits_{j=i+1}^{n} | d_{ij}(\mathcal{T}_{1}) - d_{ij}(\mathcal{T}_{2})|. $$
((12))
A proof that this distance is a metric can be found in Additional file 1.
We define the mutation order accuracy of a tree \(\mathcal {T}_{1}\) given the ground truth tree \(\mathcal {T}_{2}\) as the average of
-
the fraction of correctly inferred pairwise mutation orders, i.e., the probability that mutation a is upstream of mutation b in \(\mathcal {T}_{1}\) given that a is upstream of b in \(\mathcal {T}_{2}\), and
-
the fraction of correctly inferred mutually exclusive mutations, i.e., the probability that two mutations a and b lie on separate branches in \(\mathcal {T}_{1}\) given that a and b lie on separate branches in \(\mathcal {T}_{2}\)
for all mutations that belong to different clusters in \(\mathcal {T}_{2}\).
Software and data availability
OncoNEM has been implemented in R [35] and is freely available under a GPL3 license on bitbucket [36]. Additional file 2 is a Knitr file reproducing all figures of the simulation studies. Additional file 3 is a Knitr file reproducing all figures of the case studies. Additional files 4 and 5 are the corresponding PDF files.
The processed single-cell data sets are provided in the OncoNEM R package. The sequencing data from both single-cell studies are deposited in the NCBI Sequence Read Archive [37]. The accession numbers are [SRA:SRA051489] for the bladder cancer study [11] and [SRA:SRA050202] for the essential thrombocythemia study [10].