ddClone: joint statistical inference of clonal populations from single cell and bulk tumour sequencing data

Next-generation sequencing (NGS) of bulk tumour tissue can identify constituent cell populations in cancers and measure their abundance. This requires computational deconvolution of allelic counts from somatic mutations, which may be incapable of fully resolving the underlying population structure. Single cell sequencing (SCS) is a more direct method, although its replacement of NGS is impeded by technical noise and sampling limitations. We propose ddClone, which analytically integrates NGS and SCS data, leveraging their complementary attributes through joint statistical inference. We show on real and simulated datasets that ddClone produces more accurate results than can be achieved by either method alone. Electronic supplementary material The online version of this article (doi:10.1186/s13059-017-1169-3) contains supplementary material, which is available to authorized users.

: High-level data simulation workflow. First, the GD model is used to generate cell genotype data in copy number space (∆ CN , i.e., number of reference and variant alleles for each cell genotype at each locus). Second, the cell genotype data is converted into bulk data. This is given as input to all the methods tested in this work to be used to infer the clustering assignment and cellular prevalences of genomic loci. Third, the cell genotype data in CN space is converted into a cell binary genotype matrix and supplied only to our method, ddClone, whereby it is used to construct an informed prior over the partitions of genomic loci (the bold arrow). .., Φ M } are the true prevalences for genotypes 1 to M. We then simulate m cells from a multinomial distribution with parameters Φ observed , i.e., (n 1 , n 2 , ..., n M ) ∼ Mult(Φ observed ) where n i is the number of cells that have genotype i. This process is equivalent to sampling the cells from a Dirichlet-multinomial distribution, that is, (n 1 , n 2 , ..., n M ) ∼ Dirichlet-multinomial(λΦ). The larger the λ is, the closer are the two vectors Φ observed and Φ. In fact as the value of λ grows, the Dirichlet-multinomial distribution progressively better approximates the Multinomial distribution. Mut 9 x x x x Mut 8 Figure S3: A hypothetical phylogenetic tree with genotypes at leaves (top). The green and blue bars on the tree denote mutations that have happened together. A subset of the corresponding mutation co-occurrence patterns (bottom). Note that the bottom matrix shows a transposed version of the genotype matrix. While it always holds that if mutations are gained at the same site on the phylogenetic tree, then they will coappear in the genotype matrix (the top-to-bottom arrow), the opposite is not always true (the bottom-to-top arrow). For instance, if all the mutations were ancestral, and G1 were to lose mutations 5 to 7, and similarly, G4 were to lose mutations 1 to 4, we would still observe the same genotype matrix as in this figure, but the underlying phylogenetic tree would be completely different, comprising a trunk. We are making the simplifying assumption that if mutations co-occur in a genotype matrix, then they have co-occurred in the underlying phylogenetic tree as well. Figure S4: In the graphical model, the shaded nodes are observed and the rest of the nodes are not observed. In the inference step, the unobserved nodes will be inferred via Gibbs sampling. In particular, we are interested in inferring φ i -s, the cellular prevalences for genomic loci and the induced clustering by the ddCRP.
↵ ⇠ Gamma(a ↵ , b ↵ ) Figure S5: The distributional assumptions on the ddClone model. These random variables are described in more detail in table S1 Figure S6: Possible moves taken by the sampler. Left column shows customer connections and right column shows induced table configuration at each step. Consider the following example showing resampling of the second customer. We first remove the outgoing connection of the customer, i.e., c 2 = 6 (top row, the red arrow). When this connection is removed, the second table is split into two tables, with customers one and two sitting at one table and customers five and six sitting at a new table (middle row). Customer three is picked as the new connection for customer two, i.e., c new i = 3, and this causes their respective tables to merge (bottom row, the green arrow).
Figure S10: Rate matrices for CTMC used on τ −p SNV (left) and τ p SNV (right). States represent are pairs representing reference and variant allele copy numbers. In this example, maximum allowed copy number for both reference and variant alleles is 2. States to which transition is possible are annotated green. Note that in both rate matrices, the first row and column (representing transitioning from and to the complete deletion state) are all zero. This means that it is not possible to reach complete deletion in our model.  Figure S11: An instance of Generalized Dollo, Multi state stochastic Dollo and Stochastic Dollo models over a rooted phylogenetic tree for a single genomic locus side by side. We assume that a SNV has happened at the red dot on the tree. Dashed lines represent the GD model's run over the subtree before the SNV has happened. The thick solid lines represent the process after the SNV has happened. The thin solid lines represent a fixed state, i.e., the process can only handle a fixed state before the SNV gain event. The numbers and colours represent the state of the process (CTMC) at that point. GD can model multiple states on branches where SNV does not appear, while MSSD is forced to be in a fixed state in those positions. Hence the space of problems that GD models is a superset of that of SD. Figure S12: A rooted tree topology τ with root node Ω and SNV event at point x (left). G1, G2, G3, and G4 represent genotypes. τ −x is the subtree pruned at x (middle). Subtree rooted at x is denoted by τ x (right). To simulate from the Generalized Dollo model, for a specific genomic locus i, we first pick the SNV position on the tree, then simulate a CTMC on the pruned tree, and simulate another CTMC on the subtree.  1  2  3  4  5  6  7  8  9  10   28  24  29  21  32  23  14  20  12  43  48  3  38  46  45  39  35  34  26  18  16  9  6  4  5  31  42  44  17  30  11  33  22  40  7  47  10  27  41  36  13  2  8  1  37  15  25 Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 0 Figure S13: Transposed binarized simulated cell genotypes ∆ from Generalized Dollo process over a fixed phylogeny. The original cell genotype matrix ∆ CN is in copy number space. We binarize it by setting entries with non zero variant allele copy number to one (coloured red) and setting entries with variant allele copy number of zero to zero (coloured blue). The clonal prevalence of each genotype is in parenthesis. Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 1 Figure S14: Transposed binarized simulated cell genotypes ∆ from Generalized Dollo process over a fixed phylogeny. The original cell genotype matrix ∆ CN is in copy number space. We binarize it by setting entries with non zero variant allele copy number to one (coloured red) and setting entries with variant allele copy number of zero to zero (coloured blue). The clonal prevalence of each genotype is in parenthesis.  Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 2 Figure S15: Transposed binarized simulated cell genotypes ∆ from Generalized Dollo process over a fixed phylogeny. The original cell genotype matrix ∆ CN is in copy number space. We binarize it by setting entries with non zero variant allele copy number to one (coloured red) and setting entries with variant allele copy number of zero to zero (coloured blue). The clonal prevalence of each genotype is in parenthesis.  Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 3 Figure S16: Transposed binarized simulated cell genotypes ∆ from Generalized Dollo process over a fixed phylogeny. The original cell genotype matrix ∆ CN is in copy number space. We binarize it by setting entries with non zero variant allele copy number to one (coloured red) and setting entries with variant allele copy number of zero to zero (coloured blue). The clonal prevalence of each genotype is in parenthesis.  Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 4 Figure S17: Transposed binarized simulated cell genotypes ∆ from Generalized Dollo process over a fixed phylogeny. The original cell genotype matrix ∆ CN is in copy number space. We binarize it by setting entries with non zero variant allele copy number to one (coloured red) and setting entries with variant allele copy number of zero to zero (coloured blue). The clonal prevalence of each genotype is in parenthesis.  Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 5 Figure S18: Transposed binarized simulated cell genotypes ∆ from Generalized Dollo process over a fixed phylogeny. The original cell genotype matrix ∆ CN is in copy number space. We binarize it by setting entries with non zero variant allele copy number to one (coloured red) and setting entries with variant allele copy number of zero to zero (coloured blue). The clonal prevalence of each genotype is in parenthesis.  Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 6 Figure S19: Transposed binarized simulated cell genotypes ∆ from Generalized Dollo process over a fixed phylogeny. The original cell genotype matrix ∆ CN is in copy number space. We binarize it by setting entries with non zero variant allele copy number to one (coloured red) and setting entries with variant allele copy number of zero to zero (coloured blue Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 7 Figure S20: Transposed binarized simulated cell genotypes ∆ from Generalized Dollo process over a fixed phylogeny. The original cell genotype matrix ∆ CN is in copy number space. We binarize it by setting entries with non zero variant allele copy number to one (coloured red) and setting entries with variant allele copy number of zero to zero (coloured blue). The clonal prevalence of each genotype is in parenthesis.  2  15  45  30  12  13  33  9  27  3  25  36  43  31  35  29  42  24  23  20  40  18  34  1  44  32  48  37  26  19  5  14  47  7  11  21  46  22  41 Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 8 Figure S21: Transposed binarized simulated cell genotypes ∆ from Generalized Dollo process over a fixed phylogeny. The original cell genotype matrix ∆ CN is in copy number space. We binarize it by setting entries with non zero variant allele copy number to one (coloured red) and setting entries with variant allele copy number of zero to zero (coloured blue). The clonal prevalence of each genotype is in parenthesis.  Generalized Dollo − 10 Genotypes X 48 Loci − Seed = 9 Figure S22: Transposed binarized simulated cell genotypes ∆ from Generalized Dollo process over a fixed phylogeny. The original cell genotype matrix ∆ CN is in copy number space. We binarize it by setting entries with non zero variant allele copy number to one (coloured red) and setting entries with variant allele copy number of zero to zero (coloured blue). The clonal prevalence of each genotype is in parenthesis.

Cell Genotypes
Genomic Curated Genotypes for sample SA501 Figure S23: Binary cell genotype matrices for sample SA494 over 29 genomic loci (left) and sample SA501 over 38 genomic loci (right). These are manually curated from a single cell genotype sequencing experiment [1]. Briefly, MrBayes was used to infer a consensus phylogenetic tree over the single nuclei. Then they were grouped into clades according to high probability branching splits. Finally, each clade was assigned a consensus cell genotype by taking the mode cell genotype of the clade at each genomic locus.

ITH Ovary
Phi error Methods ddClone PyClone PhyloWGS Clomial OncoNEM SCITE Figure S25: Benchmarking over simulated data with NO doublets. The rest of the parameters are identical to that of Figure 3 in the main manuscript. This result corroborates our previous results, that is, the ddClone model performs better or comparably well in presence of reasonable levels of noise. Panel A shows Vmeasure clustering performance. Panel B shows the average over loci of the absolute differences between the inferred and true cellular prevalences.  Figure S26: To establish the benchmark in the Xenograft dataset, PyClone was run in multi-sample mode once over 4 timepoints in sample SA494 and once over 11 timepoints in sample SA591 (left) Then each method was run over individual timepoints (right). ddClone was provided with the matching single-cell genomic data. For evaluation, each individual run was compared against the corresponding sample from the benchmark.   Figure S31: Effect of adding random point noise and varying decay parameter a on V measure index (a) and mean absolute error of cellular prevalence estimates (b) for the five simulated datasets. Beta-Binomial precision parameter s and hyperparameter α are fixed at 1000 and 1 respectively. We note that V measure index is more sensitive to changes in value of a than the level of point noise. Heat map colours represent values in the vertical axis and are included to aid the eyes.  Figure S32: Effect of removing single genotypes and varying hyperparameter a on V measure index (a) and mean absolute error of cellular prevalence estimates (b) for five simulated datasets. Genotypes are sorted in decreasing order of prevalence from right to left. Genotype 1 is the least prevalent and genotype 9 is the most prevalent. Beta-Binomial precision parameter s and hyperparameter α are fixed at 1000 and 1 respectively. We note that V measure index is more sensitive to changes in value of a than removal of single genotypes. Heat map colours represent values in the vertical axis and are included to aid the eyes.   2 The ddClone model Figure S3 illustrates our assumption about the relation between the genomic loci co-occurrence patterns and the underlying phylogenetic tree.

Simplified generative model
For exposition, we start with a simplified generative model in which we describe the relationship between inputs and the outputs of our method. Assume we have an heterogeneous tumour that contains subpopulations from two distinct haploid genotypes, g 1 and g 2 with clonal prevalences of 30% and 70%, respectively.
For simplicity we set the tumour cellularity in our sample to one (t = 1). Since this implies that the expected fraction of variant allele reads is equal to cellular prevalence at each genomic locus (ξ = φ), we will ignore ξ and directly use φ in this subsection.
The possible cellular prevalences for any genomic locus i in this tumour are 0}. This is because locus i is either not mutated in any of the genotypes (hence φ 1 ), only mutated in g 1 (corresponding to φ 2 ), only mutated in g 2 (therefore φ 3 ), or mutated in both g 1 and g 2 (meaning φ 4 ). These four cases represent our possible clusters.
To simulate the sequencing process for genomic locus i, we first pick its cluster. We use an auxiliary variable z i as follows: z i ∼ Categorical(w) where w = w 1:4 denotes the mixing weights, the proportion of The cellular prevalence for genomic locus i is now φ z i . In the inference procedure, z i s and φ z i s constitute our desired outputs. Next we simulate the number of variant alleles. Since according to our assumptions φ z i also denotes the expected proportion of variant reads in the sequencing experiment, we can relate it to the variant read counts b i via a Binomial likelihood function as follows: where for now, we fix d i , the total number of reads, to some appropriate constant value 1 . In the inference procedure, we observe b i and d i and they are the inputs to our model. Put together, we have: The two step process we described in Equation (S1) defines a mixture distribution as follows: 1 It could vary from about 10× in a whole genomic sequencing to about 10,000× in an ultra deep sequencing experiment [2].

53
Here we have four possible mixture components or clusters. Cluster assignments for each datapoint Before describing our model, ddClone, in section 2, we present an updated generative process for the simplified example that we considered here in subsection 2.2.

Generative process for ddCRP mixture modelling
We now present the high level forward simulation algorithm for mixture modelling in ddCRP for the simplified example we considered in subsection 2.1: From this, derive T (C), the corresponding table assignment.

For
where α is a model parameter, f is a decay function, G 0 is the base distribution for the φ i -s, F i is the likelihood function relating expected number of reads to b i to cellular prevalence φ i as in Equation (S1), and T C (i) is the index of the table at which customer i is sitting.
Formally, in our simplified model, for each genomic locus i ∈ [1, N ], we want to infer φ i and T C (i), given the model observations b i and d i . In section 2 we report a complete set of expected model inputs and outputs.
Here we introduce our model, ddClone. Figure S5 summarizes dependency and distributional assumptions in ddClone's model. Table S1 explains random variables used in this model.

The Parental Copy Number (PCN) prior
Following [3], when copy number variation data in form of major and minor copy numbers is available, we have implemented a number of methods to elicit priors over locus genotypes. We assume that copy number 55 state at each genomic locus is reported as a pair of integers (ζ 1 ,ζ 2 ), where the major copy numberζ 1 , refers to the maximum of the two said integers and the minor copy numberζ 2 refers to the minimum of the pair andζ =ζ 1 +ζ 2 is the total copy number. Here we describe the Parental Copy Number (PCN) strategy that is used in our experiments.
Let P denote the set of locus genotype states that PCN scheme describes. We assign equal weight to the locus genotype states in P and zero weight to any other locus genotype state that is not a member of P, that is a locus genotype state ψ i ∈ P has a weight equal to 1 |P| . The locus genotype states with non-zero weights are P = {ψ 1 , ψ 2 , ψ 3 , ψ 4 } where These locus genotype states adhere to the following conditions: g N = (2, 0) so that the normal locus genotype is diploid with respect to the reference alleles, and ζ(g V ) =ζ and b(g V ) ∈ {1,ζ 1 ,ζ 2 }. The number of variant alleles b(g V ) is at least one, in other words we do not consider genomic loci that are not mutated. When b(g V ) ∈ {ζ 1 ,ζ 2 }, we set g R = g N (as in ψ 1 , ψ 2 ). For b(g V ) = 1, we consider two scenarios: (i) either the point mutation event has happened before the copy number event, in which case we set g R = g N (see ψ 3 ), or (ii) the copy number event preceded the point mutation, where we choose g R such that ζ(g R ) =ζ and b(g R ) = 0 (as in ψ 4 ).

Resampling hyperparameters
α and a are resampled using methods described in [4]. Briefly, we used the following Gibbs move to update the value of hyperparameter α given the customer connection configuration C: , the number of self-connections, and p(α) ∼ Gamma(α 0 , β 0 ) is the Gamma prior over α with shape and rate parameters α 0 and β 0 .
The decay function parameter a is updated using the following Gibbs move: where we assume a uniform prior on a independent of α.
Since the decay function is exponential in our model, we use the Griddy-Gibbs [5] approach to sample approximately from Equation (S4).
We use the method proposed in [6] for resampling s.
Gamma distributed priors are characterized using shape α and rate β parameters. Equation S5 shows the corresponding distribution function: where Γ(α) is the Gamma function.
By default, hyperparameter resampling is enabled in our experiments in this work, unless otherwise specified. We note that to explore the model's sensitivity to the value of hyperparameters, in some of our experiments in section 6.2, we disable hyperparameter resampling. We specify this in the description of those experiments.
This Gibbs sampler potentially displaces more customers at each step, and as such might have better mixing properties compared to the traditional CRP Gibbs sampler [4]. Figure S6 shows such a step in 57 ddCRP.

The modified Jaccard distance
The Jaccard index is symmetric with regard to the FP and FN noises, thus we may use a modified Jaccard distance that takes into account the higher FN rate (with respect to FP rate) and assigns the expected value of the distance given an estimated FN rate. Intuitively, it constitutes a softer prior over co-occurrence patterns in the single cell data. Let's assume that compared to the FN rate, the FP rate is negligible. The issue with the Jaccard distance (JD) is that it ignores the possibility that an observed value of 0 in the genotype matrix could actually be 1 with probability p FN and for each locus observed at genotype x and y it assigns the following values (to be normalized): x y Jaccard dist 0 0 0 0 1 1 1 0 1 1 1 0 We incorporate the uncertainty imposed by p FN by computing the expected value of the Jaccard distance conditioned on the observations. It is straight forward to show that the modified Jaccard distance (MJD) at each locus, for two genotypes, would be: where p is shorthand for p FP .
For two binary vectors X and Y of size m, we can define the Jaccard distance as follows: where |.| is the set size. In our case where X and Y are binary vectors, |.| equals the number of ones in the vector and |X ∩ Y | is the number of indexes for which both X and Y are one. Now using the value-table   of the MJD, we define a probabilistic counterpart for the quantities in Equation S6. Let |.| p be the expected number of ones in its input binary vector, then: |X| p = number of observed ones in X + p × (number of observed zeros in X) (S7) Thus MJD for binary vectors X and Y and FN rate p is defined as: When p = 0, we recover the traditional Jaccard distance. This formulation accounts for loci that are counobserved and less harshly penalizes mis-observed pairs (occurrences where for two loci, one is observed in genotype and the other is not) according to the estimated FN rate.
We assume in the following that FN rate equals the ADO rate. We ran ddClone with this modified Jaccard distance (MJD) over 10 simulated datasets with ADO rate of 0.3. As demonstrated in Figure S7, using the MJD provides a small advantage over JD in terms of clustering but not much difference in terms of cellular prevalence error over the simulated data. Figure S8 shows the application of the MJD on real datasets.
There is a marked improvement for the TNBC Xenograft dataset and modest gains in the HGSOvCa dataset.
Finally to check the effects of misspecifying the ADO rate in the MJD, we simulated 10 datasets with a true ADO rate of 0.3 and ran ddClone multiple times on this dataset, each time using the MJD with a different ADO rate ( Figure S9). This results suggest that MJD is robust to misspecification of the ADO rate.

Simulating genotypes
Here we introduce our simulation scheme. We first use the Generalized Dollo (GD) model to simulate cell genotypes. We then use these cell genotypes to simulate the bulk data. A binarized version of the cell genotypes is used to inform our prior in our method, while the bulk data constitutes the main input to our model and the competing methods. Figure S1 shows the high-level data simulation workflow.
We used a variation of the Stochastic Dollo (SD) model, called Generalized Dollo Model (GD) to sim-ulate synthetic data accounting for both SNVs and CNVs. SD is a stochastic process that models evolution of binary features (in our case, point mutations) along a phylogenetic tree. A feature could only be gained on one point on the tree, and could be lost multiple times on different branches, but when lost, it cannot be regained [7].
A limitation of SD is that it is restricted to binary features. For instance, it can only model presence and absence of a mutation at a certain genomic locus.
Multi State Stochastic Dollo (MSSD) model [7] relaxes this restriction by expanding the present feature MSSD can only model evolution after a SNV has happened. This is not a correct assumption when modelling copy number variation events where we would like to be able to account for copy number changes before a point mutation has happened.
To resolve this problem, in addition to expanding the present feature, we also expand the absence feature and allow transition within these new states. This is the GD model. Once the system gains the feature, that is, it transits into the present features state subspace, it can make transitions within this subspace, but cannot go back to the previous state. Figure S11 illustrates SD, MSSD, and GD side by side on a specific phylogenetic tree for a particular genomic locus.
Since ν has a point mass µ on Ω, there is a non-zero probability that a SNV happens at the root and hence τ −p SN V = ∅. In this case sampleTreeCTMC(∅, Q) returns ∅. If a SNV does not happen at the root, then with probability one there are genotypes in the sample that have no variant allele copy at that genomic locus.
This will give us the copy number of each genotype at each genomic locus. We summarize this into copy-number aware genotype matrix ∆ We design Q 1 and Q 2 such that a complete deletion, i.e., transitioning to state (0, 0) is not possible. Q 1 only allows transition between states that have zero variant copy number. This simulates the behaviour of the system before a SNV happens. We assume once a mutation is lost, it cannot be recovered, and enforce this assumption in Q 2 by not allowing transition from states with zero variant copy number zero to states with non-zero variant copy numbers.
Simulating from the GD model To simulate data for each genomic locus from the GD model on a phylogenetic tree, we sample a point on the tree by selecting a topological location according to Equation (S9), followed by continuous uniform sampling on the branch if the location is not the most recent common ancestor. The sampled point is interpreted as the location at which the SNV occurred for the current locus. We call the subtree rooted at the sampled point, the below-subtree and the remaining part of the tree, the above-subtree. We simulate GD on the above-subtree to determine the copy number of the SNV point along with other point on this subtree.
This accounts for evolution of the genotypes before a SNV has happened. We continue by simulating the GD on the below-subtree to determine the copy number of decedents of the SNV point. This accounts for evolution of the genotypes after the SNV occurs.
To continue we first establish some notation following [8]. We define a phylogeny denoted by τ to be a continuous set of points and G(L, E) to be its topology where L are the leaves and E the edges. Let H i (ν) denote the state of the CTMC for genomic locus i at point ν. For an edge e ∈ E, let |e| denote its length.
Let τ −x be the tree pruned at point x, that is, the tree with subtree rooted at x, removed.
We write τ x for the subtree of τ rooted at node x, and τ −x the subtree pruned at node x, that is, the subtree with points in τ − τ x . Let ρ be a normalized measure that assigns zero weights to absorbing states and equal weights to non-absorbing states. Then the state of the CTMC at the root H i (Ω) is distributed according to a categorical distribution with parameter ρ.
Algorithm S1 shows the pseudo code to simulate from Generalized Dollo Model. As input we provide the SimulateGeneralizedDollo procedure with tree topology τ with M leaves and parameter µ, as well as rate matrices Q above , Q below .

Simulating single cell data
To simulate cells, we first sample observed prevalences Φ = {Φ observed

Simulating bulk data
We use the generated cell genotypes ∆ CN from the GD model to simulate the bulk data. This bulk data serves as the input to the competing methods in this work. Our method additionally takes as input a binarized version of the cell genotype data to inform its prior over partitions of genomic loci.
For each genomic locus i, the simulated bulk data consists in (i) variant and total allele counts (b i , d i ), To generate bulk data at genomic locus i, we first simulate d i , the total number of reads mapping to i from a Poisson distribution with parameter 10,000. We then use the CN of the most prevalent genotype from ∆ CN (here, it would be the M -th genotype) at locus i to set the major and minor CNs for the bulk. That is we set To simulate the variant allele counts b i we have to take into account the aggregate effect of all genotypes at locus i in ∆ CN . This means that the ψ i -s will be slightly different from that introduced in the Methods section of the main text, that is, instead of containing normal, reference, and variant subpopulations ψ i = (g i N , g i R , g i V ), it should contain normal, and all the genotypes from ∆ CN . With a slight abuse of notation, we denote this by ψ * i (∆ CN ) = ψ * i = (g i N , g i 1 , g i 2 , ..., g i M ). We 63 also have to modify the definition of ξ to work with the new ψ * as follows: where Z * is the appropriate updated normalizing constant. Finally, we have where we set s = 1000. Algorithm S2 summarizes the bulk data simulation procedure: Algorithm S2 Simulate Bulk Data 1: procedure SIMULATEBULKDATA(Φ, ∆ CN , s, t) 2: return {d, b,ζ 1 ,ζ 2 , t}

Parameters for synthetic data generation
We used the following setup to generate synthetic data:

Triple-negative breast cancer xenograft data
To test our method over a real dataset, we used a subset of samples from a triple-negative breast cancer xenograft study [1] where breast cancer tissues from 55 patients were transplanted into highly immunodefi-cient mice to generate 30 xenograft lines. Over 3 years, these lines were passaged up to 16 generations.
Whole genome sequencing was performed over a subsample of this cohort to identify candidate genomic positions. It was followed by deep targeted amplicon sequencing of between 100 to 300 SNV positions per sample. 210 cells from five timepoints that span two samples were chosen for single cell genotyping, and about 48 SNV positions were targeted for each timepoint.
The results were post-processed to remove all positions labeled as non-somatic. This was further summarized into constituent cell genotypes. A consensus phylogenetic tree was inferred using MrBayes [9].
Cells were grouped into clades consisting of high probability branching splits. For each clade a consensus cell genotype was derived by taking the most prevalent cell genotype at each genomic locus. Figure   S23 shows the inferred cell genotype matrix ∆ for each sample. In each timepoint, we only kept genomic loci that were shared between the bulk and single cell genotype data. Inferred cell genotypes from the triple-negative breast cancer xenograft single cell genotyping study is shown in Figure S23.

Establishing the benchmark
Since exact clustering configuration and cellular prevalences of the genomic loci in the real dataset is unknown, we used multi-sample PyClone's result over 11 timepoints from sample SA501 and 4 timepoints from sample SA494 as our benchmark ( figure S26). PyClone in multi-sample mode borrows statistical strength across all timepoints to give better estimates of subclonal structure in individual timepoints.
The following timepoints were used for sample SA501: The following timepoints were used for sample SA494: PyClone was run for 100,000 iterations with a burn-in period of 50,000 iterations. The rest of the settings were identical to synthetic simulation experiments as in listing 9. Cellular prevalence estimates are summarized in Figure S24.
We ran our method along with competing methods over timepoints SA501 X1, X2, X4, and SA494 T, X4 for which we had matching single genotype sequencing data. Figure 6 in the main text shows the performance of each method against the benchmark. 6 Sensitivity analysis 6.1 Sensitivity to presence of doublets and sampling distortion noise

Discarded data and potential effects on inference
The the main text, Fig. 3, we compare ddClone with a number of bulk only and single cell only methods under three noise regimes, namely: (i) ideal data with no ADO or doublets; (ii) data with moderate levels of sampling distortion, in presence of 30% doublet cells and an ADO rate of 30%, and finally (iii) data with higher levels of sampling distortion reflective of real data, with the same doublet and ADO rates to ii. In Figure S25 we investigate what happens in absence of doublets. As expected we observed an improvement in the performance of single cell only methods, but the ranking still does not change and ddClone outcompetes all the other methods.

Sensitivity to a and noise
Here we examine the effects of simultaneously varying a and introducing noise. In our first experiment, we added point noise. We simulated five datasets from the GD model with 10 genotypes over 48 genomic loci.
For each dataset, we ran ddClone for 200 iterations 60 times. Each time we fixed the hyperparameters a, α and s to a different starting value and disabled hyperparameter resampling. For each dataset, we introduced point noise with specified probability p to the original genotype matrix, and input the filtered genotype matrix to our model. Results for this experiment are shown in Figure S31. It implies that in presence of noise, the model is more sensitive to higher values of decay function parameter a and as a increases, model performance declines.
We examined two genotype loss scenarios: one where only a single genotype is lost, and one where progressively more genotypes are missed. Results for the first scenario are in Figure S32. Five datasets identical to the point noise experiment were generated. For each dataset, we held out the specified genotype and input the remaining as the genotype matrix to our model (i.e., a matrix with 9 genotypes in our experiments).
In the second scenario, we progressively removed more genotypes. Figure S33 depicts these results.
Except for genotype loss, the rest of experiment setup was identical to the first scenario. This result implies that the model is more sensitive to the value of the decay function parameter a than it is to genotype removal.
6.3 Sensitivity to the initial value of hyperparameter a Figure S34 shows the result of running our model with different starting values for the hyperparameter a. In these experiments we disabled resampling of hyperparameters a, α, and s, and fixed them at their starting value. We simulated 10 datasets from the GD model with 5 genotypes over 48 genomic loci. We ran our model 170 times for each dataset, with different initial values for hyperparameters, each time for 200 iterations. Each box plot shows the respective performance index for runs with an identical initial value of a and different values for s and α, each for 5 datasets.
7 Data requirements for the method to function well

Bulk data
The bulk likelihood portion of the model is designed for analysis of deeply sequenced (coverage > 100Ł) mutations to identify and quantify clonal structure [3].

Single cell data -number of cells
ddClone's performance in terms of both clustering and cellular prevalence accuracy highly depends on the number of captured genotypes (see Figure 7 in the main text). Even in presence of half of the genotypes, ddClone compares favourably against the second best performing bulk method.
The number of captured genotypes depends on the distribution of the genotypes across the tumour. We have modelled this using the concentration parameter λ in the Dirichlet-multinomial distribution that scales the original prevalence parameters (see Figure 4 in the main text). For small values of λ, sampled cells are not a good representative of the underlying tumour structure. For instance, at λ = 1.12, 50 sampled cells on average capture only slightly over 4 out of 10 existing genotypes. In this situation, even sampling 1000 cells will on average capture about 5 out of 10 genotypes. On the other hand, for larger values of λ, the sampling procedure results in cells that more accurately represent the tumour. For example, at λ = 10, a sample of 50 cells, on average capture 8 genotype out of 10 existing genotypes. Sampling 1000 cells in this situation will on average reveal an extra genotype. In summary, our results suggests that ddClone will perform comparably well when at least half of the genotypes are recognized.

Single cell data -depth
The sequencing depth at each loci has to be large enough to enable SNV detection. This is dependent on a number of factors including the variant allele frequency at the target locus, the library preparation and amplification, and mutation calling pipeline. In this study we used the binomial exact test described in [12] that recommends a coverage of at least > 43 at each locus. Reference [13] has a number of recommendations on what coverage to use. Moreover ddClone is designed to work with genotypes, rather than raw single cell data and therefore we anticipate that the coverage requirements of the genotype inference procedure may also be important (see [14] for an up to date review of such algorithms). Moreover, we have shown that ddClone is fairly robust to presence of point noise (e.g., false positive or false negative error in SNVs as in Fig. 6 and Fig.7 in the main text).

Convergence diagnostics
Following [3] to assess convergence of the MCMC chain for TNBC Xenograft samples SA501 and SA494, we ran 3 chains for 10,000 iterations with random seeds and visually inspected Posterior Similarity Matrices (PSM) to ensure similarity. Figure S39 shows the PSM for time point X4 in sample SA494. These experiments imply that the chains have converged. We note that for the simulation studies, when provided with the correct number of clusters, Clomial converged in no data points. therefore we set its c parameter to half of the gold standard one.
We downloaded PhyloWGS with git tag smchet1-31-g57294e3 and used it with the default settings for the following parameters: Since this version of PhyloWGS did not output clonal frequencies, we edited the source code to extract these values. Furthermore, to simplify comparison with other methods, we provided PhyloWGS with an empty copy number file.
Single cell data only methods

SCITE
We downloaded latest version of SCITE from its GitHub repository (hash = 512c8b84ddd2ff5632d1c9f310c38fbc6b109bc6). It outputs a mutation tree. To compute a clustering of mutations, we first collapse the nodes in this mutation tree. Briefly, we traverse the tree in a level order (breadth-first search), and merge every node with degree less than 2 with its parent node. Nodes with degree 3 or more are left unmerged (we assigned them either to a single cluster or to separate clusters, depending on which results in a higher V-measure). This results in a new tree with potentially fewer nodes. Every node in this new tree comprises a new cluster. In this study we considered 5 maximum likelihood trees, calculated clustering and cellular prevalence estimates on all of them, evaluated them, and reported the one with the highest score.
We report the proportion of cells that are descendants of a particular mutation as its cellular prevalence.
SCITE has the option to attach the cells at a node in the mutation tree it estimates. To count the number of cells that harbour a mutation, we recursively consider that mutation and all its descendant mutation nodes.
To run SCITE, we set the option -a to attach single cells to mutations. The option -ad accepts two assigning G OncoN EM (i, j) = 1 if p t mus (i, j) > t i treshold and G OncoN EM (i, j) = 0 otherwise. We define OncoNEM's mutation cellular prevalence as the sum of columns of G OncoN EM .
To run OncoNEM we followed the recommended steps in the OncoNEM's R-package Vignette. We