BitPhylogeny employs a non-parametric Bayesian clustering approach for reconstructing intra-tumor phylogenies from observed sequence data. To integrate the assignment of sequences to clones with the organization of clones into a tree, it uses the TSSB as a prior [16]. In this model, the observed data (i.e., sequences) are associated with all nodes of the tree, rather than only to the leaves as in classical phylogenetic models. The TSSB is a probabilistic mixture model with an infinite number of hierarchically organized components, but only a finite number of components have a non-zero weight. In addition to this prior, the full generative probabilistic model underlying BitPhylogeny is defined by node-wise data distributions, a transition kernel and the root prior.
Let i∈{1,…,M} denote the considered marker sites and n∈{1,…,N} index the observed marker profiles or sequences (methyltypes or genotypes). We denote the marker state at site i of sequence n by x
i,n
∈{0,1}, where x
i,n
=1 indicates methylation or mutation. To assign sequences to clones, each clone has a label. Following the notation in [16], the clone label is defined as ε=(ε
1,…,ε
K
), where \(\epsilon _{k} \in \mathbb {N}\) for all k∈{1,…,K}. Each label is a sequence of natural numbers indicating the location of the clone in the phylogenetic tree. For example, the clone ε=(1,2) is located in the second layer of the tree, and its lineage trace is ∅→1→2, where the numbers are node labels in each level and the root node is labeled by the empty sequence, ∅. The length K=|ε| is the depth of clone ε; for example, |(1,2)|=2. The root clone has depth |∅|:=0. The probability of an observed marker pattern originating from clone ε is denoted by π
ε
. This parameter specifies the proportion of observations explained by clone ε (Figure 1C).
Tree-structured stick-breaking process
The TSSB generates an infinite series of clone proportions π
ε
, which sum to one by interleaving two stick-breaking processes [16],
$$ \begin{aligned} \nu_{\boldsymbol{\epsilon}} &\sim \text{Beta}(1,\lambda^{|\boldsymbol{\epsilon}|} \alpha_{0}), \qquad \psi_{\boldsymbol{\epsilon}} \sim \text{Beta}(1,\gamma), \\ \varphi_{\boldsymbol{\epsilon}\epsilon_{i}} &= \psi_{\boldsymbol{\epsilon}\epsilon_{i}} \prod\limits_{j=1}^{\epsilon_{i}-1}(1-\psi_{\boldsymbol{\epsilon}j}), \qquad \pi_{\emptyset} = \nu_{\emptyset}, \\ \pi_{\boldsymbol{\epsilon}} &= \nu_{\boldsymbol{\epsilon}} \varphi_{\boldsymbol{\epsilon}} \prod\limits_{\{\boldsymbol{\epsilon}^{'}\}}\varphi_{\boldsymbol{\epsilon}^{\prime}} (1 - \nu_{\boldsymbol{\epsilon}^{\prime}}). \end{aligned} $$
((1))
The beta-distributed random variables ν
ε
and ψ
ε
define the clone size π
ε
by distributing mass between each node and its descendants and its siblings, respectively. The index ε
ε
i
denotes the ε
i
-th child of clone ε, and in the second product, the index ε
′ runs over all ancestors of clone ε in the tree. The clone size depends on the depth and a decay constant λ. We refer to the distribution of clone sizes π={π
ε
} and their hierarchical structures generated by (1) as TSSB (λ, α
0, γ). Using this tree partition prior of unbounded depth and width, the infinite mixture model for the observed methylation or mutation data \(\mathbf {x}_{n} = \{x_{i,n}\}_{i=1}^{M} \) is defined as
$$ \begin{aligned} &\boldsymbol{\pi} \sim \text{TSSB} (\lambda, \alpha_{0}, \gamma), \qquad \epsilon_{n} \mid \boldsymbol{\pi} \sim \text{Discrete}(\boldsymbol{\pi}),\\ &\boldsymbol{\theta}_{\epsilon} \sim p(\cdot \mid \boldsymbol{\theta}_{\text{pa}(\epsilon)}), \qquad \mathbf{x}_{n} \sim p(\cdot \mid \epsilon_{n},\, \boldsymbol{\theta}_{\epsilon}), \end{aligned} $$
((2))
where \(\boldsymbol {\theta }_{\epsilon }=\{\theta _{i,\epsilon }\}_{i=1}^{M}\) are the parameters of the local distribution p(x
n
∣θ
ε
) of the data emitted from clone ε. Each clonal parameter is sampled from a transition distribution that depends on the parent clonal parameter θ
pa(ε). We denote the distribution of the root parameter by p(θ
∅
).
Customized methylation model
To model methylation data, we specify the local probability distributions p(x
n
∣θ
ε
) and the transition probabilities p(θ
ε
∣θ
pa(ε)). For each clone ε, the local data distribution is a Bernoulli distribution with the parameter transformed by a sigmoid function. The parameter \(\theta _{i, \epsilon } \in \mathbb {R}\) is used to control the probability of observing a methylation event at locus i. Assuming independence among loci, we set
$$ \begin{aligned} p(\mathbf{x}_{n} \mid \boldsymbol{\theta}_{\epsilon}) = \prod\limits_{i=1}^{M} \sigma(\theta_{i,\epsilon})^{x_{i,n}} (1-\sigma(\theta_{i,\epsilon}))^{1-x_{i,n}}, \end{aligned} $$
((3))
where σ(θ
i,ε
)=1/(1+ exp(−θ
i,ε
)). Here, we assume the CpG sites are independent, which is appropriate for the IRX2 molecular clock (verified in Additional file 1: Figure S1).
For the transition probabilities, we use a mixture of two Laplace distributions to model the parent–child relation,
$$ \begin{aligned} p\left(\boldsymbol{\theta}_{\epsilon} \mid \boldsymbol{\theta}_{\text{pa}(\epsilon)}, \mu, \Lambda, \mathbf{w}\right) &=\prod\limits^{N}_{i = 1} w_{i} \, \text{Laplace}(\mu, \Lambda) \\ &\quad+ (1-w_{i})\,\text{Laplace}(-\mu, \Lambda), \end{aligned} $$
((4))
where μ defines the location of a positive and a negative mode and \(\mathbf {w} = \{w_{i}\}_{i=1}^{M}\). Intuitively, the positive mode generates parameters that give a high probability of observing methylation events, whereas the negative mode has the opposite effect. The hyperparameter Λ models variation within the modes. The weights w
i
and 1−w
i
of the two Laplace densities specify the probabilities of either mode being selected for sampling the child parameter. The Laplace densities have the effect of pushing the sampled parameters close to the modes μ or −μ.
The dependency on the parent parameter is introduced through the weights w
i
as
$$ w_{i} =\left\{ \begin{array}{ll} P(\theta_{i,\epsilon} \geq \eta \mid \theta_{i,\text{pa}(\epsilon)} \geq \eta) & \text{if} \,\,\,\theta_{i,\text{pa}(\epsilon)} \geq \eta \\ P(\theta_{i,\epsilon} \geq \eta \mid \theta_{i,\text{pa}(\epsilon)} < \eta) &\text{if} \,\,\,\theta_{i,\text{pa}(\epsilon)} < \eta \end{array} \right. $$
((5))
where η is a fixed threshold. We set η=1, which results in very conservative methylation calls.
The probabilities in Equation (5) provide a link to evolutionary models used in classical phylogeny. We define a transition probability matrix P
ε
to describe the state transition from parent to child. Let m denote the methylated state defined by θ
i,ε
≥η and u the unmethylated state. Then the matrix P
ε
can be written as
$$ \mathbf{P}_{\epsilon} = \left(\begin{array}{ll} P_{u\rightarrow u} & P_{u\rightarrow m}\\ P_{m\rightarrow u} & P_{m\rightarrow m} \end{array} \right), $$
((6))
and according to Equation (5), we have
$$ w_{i} =\left\{ \begin{array}{ll} P_{m\rightarrow m} &\text{if} \,\,\,\theta_{i,\text{pa}(\epsilon)} \ge \eta\\ P_{u \rightarrow m} &\text{if} \,\,\,\theta_{i,\text{pa}(\epsilon)} < \eta. \end{array} \right. $$
((7))
The transition matrix is obtained from a rate matrix A as the matrix exponential P(t)= exp(A
t). The rate matrix A is parameterized as
$$ \mathbf{A} = \left(\begin{array}{rr} -\beta_{m} &\beta_{m}\\ \beta_{u} & -\beta_{u} \end{array} \right) \rho, $$
((8))
where β
m
and β
u
are the equilibrium frequencies of the methylated and unmethylated states, respectively, and β
m
+β
u
=1. The scaling factor ρ is set to ensure that the average rate of methylation is one, i.e., 2β
u
β
m
ρ=1. For each clone ε≠∅, we denote its branch length by t
ε
. Finally, the root prior is defined as
$$ p(\theta_{0}) = \prod\limits^{N}_{i = 1} \text{Laplace}(-\mu, \Lambda). $$
((9))
This prior favors clones with unmethylated states at all loci as the root.
The complete generative probabilistic model underlying BitPhylogeny, including all parameters and all conditional independencies, is depicted in Figure 2 as a graphical model.
Customized single-nucleotide variant model
Single-cell data sets often have high rates of missing data. This can be the result of allele dropout and low depth at some regions of the genome. In [14], the authors reported that the allele dropout rate of their sequencing technique is independent of the location or base type of the locus. This matches the statistical description of data as missing completely at random. In our likelihood-based approach, the missing data can be handled by simply ignoring the locus in each cell where it is missing.
To specify the transition model for SNV data, we employ the following rate matrix,
$$ \mathbf{A} = \left(\begin{array}{rr} -\beta_{m} &\beta_{m}\\ 0 & 0 \end{array} \right) \rho, $$
((10))
where β
m
is the frequency of a locus being mutated. The scaling factor ρ is set to ensure β
m
(1−β
m
)ρ=1. After matrix exponentiation, the probability of transition from mutation to normal is 0, which reflects the common assumption of mutations being irreversible during tumor evolution.
Inference
For statistical inference, we pursue a Bayesian approach and estimate the full posterior probability distribution of all model parameters, including clone assignment and tree structure. For approximating the joint posterior
$$ p\! \left(\{ \epsilon_{n}\}, \{\boldsymbol{\theta}_{\epsilon}\}, \{t_{\epsilon} \},\, \{\nu_{\epsilon}\},\, \{\varphi_{\epsilon}\},\, \mu,\, \Lambda \left| \{\mathbf{x}_{n}\},\, \beta_{u},\, \beta_{m},\, \lambda,\, \alpha_{0}, \gamma \right.\right), $$
((11))
we fix λ=2, α
0=0.3 and γ=0.1. The equilibrium state frequencies β
u
and β
m
are estimated directly from the population as the average frequencies of unmethylated and methylated states, respectively. To sample from the target distribution (11), we use a Gibbs sampler, which iteratively generates samples from the full conditional distribution of each variable of interest. The sampling procedure follows the one described in [16] with one exception. We integrated a new ‘swap clone’ move as an additional step. In this move, the parameters, assigned data and masses of two random nodes in the tree are swapped, then the structure related parameters {ν
ε
} and {φ
ε
} are resampled. The swap is accepted with a probability defined by a metropolis ratio. If, after the swap node move, the root node has no assigned data points, then more swap node moves involving the root are conducted until the root node has at least one data point assigned. In other words, we consider empty root nodes as invalid.
We used the maximum posterior expected adjusted rand (MPEAR) method from [53] to compute summary labels from MCMC samples. The method first computes the posterior similarity matrix for the labels in each MCMC sample. The posterior similarity matrix is an N×N matrix, in which each entry is the posterior probability of two data points being clustered together. Given the posterior similarity matrix, the posterior expected adjusted rand (PEAR) index can be used to assess the performance of a proposed label configuration. The labels, corresponding to the highest PEAR, are chosen as the summary cluster configuration. We used the MPEAR implementation in the R package mcclust.
At the end of each MCMC run, the reconstructed tree structure is obtained as the following. We check, for each sample, the number of clones that have weights π
ε
>0.01. We call this number the big node number. Then, all the samples can be grouped into different unique big node number categories. For each unique big node number group, we record the tree structure with the highest complete data likelihood (integrating out {ε
n
}):
$$ p \left(\{\mathbf{x}_{n}\},\, \{{\boldsymbol{\theta}\epsilon}\},\{\textit{t}_{\epsilon}\}|\{\textit{v}_{\epsilon}\},\{\varphi_{\epsilon}\},\, \mu,\, \Lambda,\, \beta_{u},\, \beta_{m},\, \lambda,\, \alpha_{0},\, \gamma \right). $$
((12))
Finally, we report the recorded tree from the most frequent unique big node number group.
Baseline methods
We used hierarchical clustering and k-centroids as baseline clustering methods. For both methods, the R function dist with option ’binary’ was called to compute the Jaccard distance matrix of the observed sequences. The Jaccard distance matrix was then used to perform hierarchical clustering (hclust) and k-centroids (pam). To select the number of clusters, both methods were executed with a range of possible cluster numbers from 2 to 20, and the cluster number with the highest silhouette coefficient was selected. We computed silhouette coefficients with the silhouette function from the R package cluster. The coefficient is computed from the mean distances within clusters and the mean distances between clusters. It does not require the true cluster labels. The silhouette coefficient takes values between −1 and 1, with higher values indicating better clustering performance. We then estimated the methyltype sequences (methylation patterns) of each clone. For hierarchical clustering, each methyltype was computed by thresholding the mean of all sequences assigned to the clone. For k-centroids, the methyltypes were defined as the medoids. Given the estimated methyltypes, we computed minimum spanning trees from the Hamming distance matrices. We used the minimum.spanning.tree function from R package igraph. Finally, we defined the clone with the least number of methylated states to be the root clone and directed the tree accordingly. Hierarchical clustering and minimum spanning tree have been used as parts of the SPADE pipeline, which extracts a cellular hierarchy from high-dimensional cytometry data [54]. The baseline methods we used here differ from SPADE by adding a model selection step.
Clustering performance
The clustering performance was assessed by the v-measure [51], which computes the harmonic mean of two conditional entropies, namely homogeneity and completeness. Homogeneity measures how much each cluster contains only members of a single class, while completeness measures whether all members of a given class are assigned to the same cluster. The v-measure takes values in [0, 1], with 0 and 1 indicating the worst and best clustering performance, respectively.
Tree distance
We considered all markers that were present in the ground truth tree and in all three inferred trees. For these shared markers, we computed their pairwise shortest-path distance matrix in each tree. The lower triangular part of the distance matrices of all inferred trees were then compared to the ground truth using the sum of the absolute values of all differences between matrix entries. We named this distance measure the consensus node-based shortest path distance.
Software and data availability
Based on the TSSB implementation [16], BitPhylogeny has been implemented in Python and R and is freely available under a GPL3 license [55]. For 2,000 data points (observed marker patterns), a single MCMC iteration takes on average about 1 s on a standard single-core computer. Additional file 2 contains an R Markdown file for reproducing all the figures in this manuscript. All data, including synthetic, methylation and single-cell data sets, are provided in the BitPhylogeny software package. The sequencing data from the single-cell study are stored in NCBI Sequence Read Archive [56] under the accession number [SRA:SRA050202].