PhyloWGS: Reconstructing subclonal composition and evolution from wholegenome sequencing of tumors
 Amit G Deshwar^{1},
 Shankar Vembu^{2},
 Christina K Yung^{3},
 Gun Ho Jang^{3},
 Lincoln Stein^{3, 5} and
 Quaid Morris^{1, 2, 4, 5}Email author
DOI: 10.1186/s1305901506028
© Deshwar et al.; licensee BioMed Central. 2015
Received: 27 June 2014
Accepted: 29 January 2015
Published: 13 February 2015
Abstract
Tumors often contain multiple subpopulations of cancerous cells defined by distinct somatic mutations. We describe a new method, PhyloWGS, which can be applied to wholegenome sequencing data from one or more tumor samples to reconstruct complete genotypes of these subpopulations based on variant allele frequencies (VAFs) of point mutations and population frequencies of structural variations. We introduce a principled phylogenic correction for VAFs in loci affected by copy number alterations and we show that this correction greatly improves subclonal reconstruction compared to existing methods. PhyloWGS is free, opensource software, available at https://github.com/morrislab/phylowgs.
Background
Tumors contain multiple, genetically diverse subclonal populations of cells that have evolved from a single progenitor population through successive waves of expansion and selection [13]. Reconstructing their evolutionary histories can help identify characteristic driver mutations associated with cancer development and progression [4,5], and can provide insight into how tumors might respond to treatment [6,7]. In some cases, it is possible to genotype the subpopulations present in a tumor, while reconstructing its history, using the population frequencies of mutations that distinguish these subclonal populations [2,821]. Increasingly, tumors are being characterized using wholegenome sequencing (WGS) of bulk tumor samples [22] and few automated methods exist to perform this reconstruction on the basis of these data reliably.
Subclonal reconstruction algorithms attempt to infer the population structure of heterogeneous tumors based on the measured variant allelic frequency (VAF) of their somatic mutations. Some methods perform this reconstruction based solely on single nucleotide variants or small indels (collectively known as simple somatic mutations or SSMs) [1619,21,23]. Others use changes in read coverage to identify genomic regions with an average ploidy that differs from normal, which they explain using inferred copy number variations (CNVs) that affect some of the cells in the sample [15,20,24,25].
The low read depth of current WGS complicates subclonal reconstruction. Until recently, subclonal populations (i.e., subpopulations) were defined based on accurate estimates of the proportion of cells with each mutation (i.e., their population frequency), which, for individual SSMs, are only available through targeted resequencing where the read depths are orders of magnitude higher than typical WGS depths [17,18,23]. However, preliminary evidence suggests that the much larger number of mutations detected by WGS can compensate for their decreased read depth [26]. In contrast, CNVs affect large, multikilobasesized or megabasesized regions of the genome, which allow the average copy number of these regions to be accurately estimated with WGS. Unfortunately, CNVbased subclonal reconstruction is more difficult than SSMbased reconstruction because of the need to estimate simultaneously population frequency and new copy number for each CNV. Most CNVbased methods only attempt to infer the copy number status of the clonal cancerous population [24,25] that contains the mutations shared by all of the cancerous cells. The few CNVbased methods [15,20] that attempt to resolve more than one cancerous subpopulation are practically limited to a small number (often two) of subpopulations. In contrast, SSMbased methods applied to targeted resequencing data can reliably resolve many more cancerous subpopulations [1618,23]. However, it remains unclear what the limits of WGSbased automated subclonal reconstruction are.
Another open question is how to combine CNVs and SSMs when doing reconstruction. CNVs overlapping SSMs can interfere with SSMbased reconstruction because they complicate the relationship between VAF and population frequency. Although some methods attempt to model the impact of CNVs on the allele frequency of overlapping SSMs [1719,27], these methods have significant restrictions. For example, several of these methods [17,18] make the unrealistic assumption that every cell either contains the structural variation and the mutation or neither. Also, no method places structural variations in a phylogenetic tree, which is important for studying the evolution of cancerous genomes.
We describe PhyloWGS, the first method designed for complete subclonal phylogenic reconstruction of both CNVs and SSMs from WGS of bulk tumor samples. Unlike all previous methods, PhyloWGS appropriately corrects SSM population frequencies in regions overlapping CNVs and is fast enough to perform reconstruction of at least five cancerous subpopulations based on thousands of mutations. We present results on subclonal reconstruction problems that cannot be correctly reconstructed using previous methods. We also probe the relationship between WGS read depth and the number of subpopulations that PhyloWGS can recover. Finally, we demonstrate that even in the absence of reliable CNV estimates, it is still feasible to perform automated subclonal composition reconstruction based on SSM frequency data at typical WGS read depths (30 to 50 ×), even for highly rearranged genomes where less than 2% of the SSMs lie in regions of normal copy number. Opensource, free software implementing PhyloWGS is available under the GNU General Public License v3 [28].
Previous work
In general, the subpopulationdefining mutation sets include more than one mutation. Cancerous cells often have increased mutation rates, and even noncancerous cells accumulate somatic mutations at a rate of 1.1 per cell division [29]. As such, subpopulations are defined not only by the small number of oncogenic ‘driver’ mutations that support rapid expansion but also by a larger number of ‘passenger’ mutations acquired before the driver mutation(s). The selective sweeps that cause subpopulation expansion increase the population frequency of both driver and passenger SSMs, driving them to having indistinguishable population frequencies [30,31]. However, sampling and technical noise in sequencing means that the observed VAFs are distributed around the true value for a subpopulation. Panel (ii) shows an example histogram of measured VAFs for SSMs found in a heterogeneous tumor sample.
Simplesomaticmutationbased approaches
SSMbased subclonal reconstruction algorithms attempt to reconstruct the subpopulation genotypes based on VAF clusters (and their associated mutation sets) identified by fitting statistical mixture models to the VAF data either without phylogenic reconstruction [18,19,21,32], before phylogenic reconstruction [33] or concurrently with it [16,17]. Often, as in Figure 1, the clusters overlap, which introduces uncertainty in the exact number of mutation sets represented in the tumor (as well as in the assignment of SSMs to clusters). Adding more clusters to the model always provides a better data fit, so to prevent overfitting, the cluster number is selected by balancing data fit versus a complexity penalty (e.g. the Bayesian information criteria) or by Bayesian inference in a nonparametric model [17,18,32]. In panel (iii) in Figure 1, the correct number of clusters has been recovered along with appropriate central VAFs.
Assuming that the correct VAF clusters can be recovered, the subclonal lineages corresponding to each mutation set must still be defined. Defining the subclonal lineages is equivalent to defining the tumor phylogeny, and often multiple phylogenies are consistent with the recovered VAF clusters (e.g. panel (iiii) in Figure 1). Complete and correct reconstruction of subpopulation genotypes requires resolving this ambiguity. To do so, reconstruction methods make one of a handful of assumptions about the process of tumor evolution.
A common, and powerful, assumption is the infinite sites assumption (ISA) [17,34,35], which posits that each SSM occurs only once in the evolutionary history of the tumor. The ISA implies that the tumor evolution is consistent with a ‘perfect and persistent phylogeny’ [18]: each subpopulation has all of the SSMs that its ancestors had, each SSM appears in only one subclonal lineage and each subclonal lineage corresponds to a subtree in the phylogeny of tumor subpopulations. Because SSMs are relatively rare (compared to the genome size), the ISA is nearly always valid for all SSMs, so there is little danger of incorrect reconstructions due to violations of the ISA. In many cases, the ISA alone permits the recovery of multiple, complete subpopulation genotypes from a single or small number of tumor samples using either the sum rule [17] (also called the pigeonhole principle [26]) or the crossing rule [17], respectively. Methods that do not use the ISA require, in the case of no measurement noise, at least as many tumor samples as there are subpopulations [16,36]; in actual application when there is noise, even more samples are required.
Unfortunately, the ISA alone is often unable to resolve reconstruction ambiguity fully. As such, some methods [16,33] also make a sparsity assumption to select among ISArespecting phylogenies consistent with the VAF data. This assumption, which we call strong parsimony, posits that due to expansion dynamics, there are a small number of subpopulations still present in the tumor [16,33], and that many of the VAF clusters are vestigial. These methods therefore select the phylogeny (or phylogenies) that maximizes the number of vestigial VAF clusters [16], or equivalently, the number of branchpoints where the parental subpopulation has a zero frequency in the current tumor [16,33]. The strong parsimony assumption does resolve some ambiguity, and leads to the correct reconstruction in Figure 1, but it is risky as its empirical validity is not yet established. For example, under some conditions, a linear (i.e. nonbranching) phylogeny can be mistaken for a branching one; the risk of this occurring increases as the VAF measurement noise or the number of subpopulations in the tumor increases. This background distribution of false positive vestigiality is not yet considered by either of the methods that assume strong parsimony.
By assigning all SSMs within a VAF cluster to the same mutation set, reconstruction methods make another implicit assumption, which we call weak parsimony. This assumption does not hold if two mutation sets have the same population frequency. Note that if the ISA is valid, by the pigeonhole principle, weak parsimony is guaranteed to be valid whenever the population frequency of the mutation set is >50%.
Subclonal reconstruction methods, their properties and assumptions
Property/method  PhyloWGS  PhyloSub [ 17 ]  THetA [ 15 ]  PyClone [ 18 ]  TrAp [ 16 ]  Clomial [ 36 ]  RecBTP [ 33 ] 

Simple somatic mutation  Y  Y  N  Y  Y  Y  Y 
Copy number variation  Y  N  Y  N  N  N  N 
Weak parsimony  Y  Y  N/A  Y  Y  Y  Y 
Strong parsimony  N  N  N/A  N  Y  N  Y 
Infinite sites  Y  Y  N/A  Y  Y  N  Y 
Phylogenetic inference  Y  Y  N  N  Y  N  Y 
Parametric  N  N  Y  N  N  Y  N 
Multiple samples  Y  Y  N  Y  N  Y  N 
Genotype uncertainty  Y  Y  N  N/A  N  N  N 
PhyloWGS, like its predecessor PhyloSub [17], does not make the strong parsimony assumption nor does it report only a single tree. Instead it reports samples from the posterior distribution over phylogenies. Because the clustering of the VAF is performed concurrently with phylogenic reconstruction, PhyloWGS is able to perform accurate reconstruction even when the weak parsimony assumption is violated in a strict subset of the samples available, for example, if the VAF clusters overlap in one sample but not another. Our Markov chain Monte Carlo (MCMC) procedure samples phylogenies from the model posterior that are consistent with the mutation frequencies and does not rule out phylogenies that are equally consistent with the data. From this collection of samples, areas of certainty and uncertainty in the reconstruction can be determined.
Copynumbervariationbased approaches
In the absence of other information, like Ballele frequencies [26], parsimony assumptions are relied upon to resolve reconstruction ambiguities. One strategy only attempts to reconstruct the cancerous, subclonal lineage [24,25] with the highest population frequency (also known as the clonal population). From this reconstruction, the proportion of cells in the tumor sample that are cancerous (i.e. the cellularity), as well as the CNVs that are shared by all cancerous cells in the tumor, can be inferred. However, this approach can fail when there are multiple subclonal populations, especially if they share few CNVs [15,20]. Methods that attempt to detect >1 cancerous subpopulation do so by balancing data fit with a complexity term that penalizes additional subpopulations [15,20]. So far, these methods seem to be practically limited to a small number of cancerous subpopulations (i.e., two), and cannot be applied to tumors with substantial rearrangements.
Combining simple somatic mutations and copy number variations
In loci affected by CNVs, computing the population frequency of an SSM from its VAF requires knowing whether the SSM occurred before, after or independently of the CNV. If the SSM occurred before the CNV, and CNV affects the copy number of the SSM, then computing its VAF also requires knowing the new number of maternal and paternal copies of the locus. Figure 2 illustrates a situation where incorporating CNV information is critical for subclonal reconstruction. Without CNV information, the two VAF peaks would be interpreted as two separate subclonal lineages. With CNVs, it becomes clear that the second peak is caused by the amplification of part of the genome that increases the VAF of all SSMs found in the region.
Some subclonal reconstruction methods simply ignore the impact that CNVs have on the relationship between SSM population and allele frequency [16,21]. Other methods that do account for the effect of copy number changes on SSM frequencies [1719], do so by integrating over all the possible relationships between allele frequency and population frequency without using that the ISA, which was necessary to associate SSMs uniquely to subclonal lineages in the first place, constrains this relationship [26].
For the first time, we describe an automated method, PhyloWGS, which performs subclonal reconstruction using both CNVs and SSMs. By combining information from both CNVs and SSMs, and properly accounting for their interaction, we provide a more comprehensive and accurate description of a subclonal genotype.
Results
In the following, we first provide a brief explanation of how PhyloWGS incorporates both SSMs and CNVs in phylogenic reconstruction by converting CNVs into pseudoSSMs and performing subclonal reconstruction on the SSMs and pseudoSSMs; full details are provided in the Materials and methods section. Then, we show an illustrative example where accounting for the effect of CNVs on SSMs permits the correct subclonal reconstruction of a tumor population whereas using either CNV or SSM data in isolation does not. Then, we describe our efforts to quantify the relationship between read depth and the number of subpopulations that can be accurately recovered by applying PhyloWGS to simulated WGS data with different read depths, number of subpopulations and SSMs. Next, we describe the application of PhyloWGS to three TCGA benchmark datasets. Finally, we describe the application of PhyloWGS to two real datasets: a multiplesample WGS dataset from a patient with chronic lymphocytic leukemia and a single sample from a breast tumor.
Incorporating copy number variations with simple somatic mutations in phylogenic reconstruction
We assume that a CNV algorithm has already been applied to the sequencing data and that this algorithm provides estimates of copy number C and population frequency ϕ _{ i } for each CNV i. We use these estimates in two ways: first, for each CNV, we create an equivalent pseudoSSM with population frequency ϕ _{ i } by adding an SSM to the dataset with total reads d _{ i } and variant reads d _{ i }×ϕ _{ i }/2 rounded to the nearest whole number (i.e., the expected number of variant reads of a heterozygous mutation with population frequency ϕ _{ i }) where d _{ i } is set to a userdefined multiple of the average WGS read depth. If a confidence interval for ϕ _{ i } is available, we can set d _{ i } to have the same confidence interval. Note, we allow multiple CNVs to affect the same locus; each of these CNVs is assigned its own pseudoSSM. Second, we associate all SSMs within the region affected by the CNV to this pseudoSSM. Our model (described in the Materials and methods section) uses this association to compute the transformation from the inferred population frequency of an SSM to its expected VAF.
Here, we briefly describe this transformation when there is only one CNV affecting the SSM locus; the Materials and methods section describes the general version of the transformation used by PhyloWGS that allows multiple CNVs to affect the locus.
Given the population frequency of the CNV, ϕ _{ c }, the copy number of the CNV C broken down into maternal and paternal components C=C ^{ p }+C ^{ m }, and the population frequency of the SSM, ϕ _{ s }, the equations below compute the expected allele frequency of the SSM x _{ ssm }. Here we are using the terms ‘maternal’ and ‘paternal’ simply to distinguish the two copies and not to suggest that we have actually assigned each chromosome to each parent. Furthermore, the description here assumes that the SSM is on the maternal copy; if it is on the paternal copy, replace C ^{ m } with C ^{ p } below.
 1.
The SSM precedes the CNV event, i.e., the CNV occurred in a cell already containing the SSM.
 2.
The SSM occurs after the CNV event, i.e., the SSM occurred in a cell already containing the CNV.
 3.
The SSM and CNV occurred in separate branches of the phylogeny, i.e., the mutations occur in separate cells and no cell contains both the SSM and the CNV.
Case 1: simple somatic mutation → copy number variation
The numerator corresponds to the average number of copies per cell of the SSMmutated locus in the population and the denominator is the average number of copies per cell of the locus (mutated or not) in the population. We note that if there is no copy number change in C ^{ m } then the numerator is simply ϕ _{ s }, and if C ^{ m }=0 then the numerator is ϕ _{ s }−ϕ _{ c }.
Case 2: copy number variation → simple somatic mutation
Case 3: \(\genfrac {}{}{0pt}{}{\nearrow }{\searrow }^{SSM}_{\textit {CNV}}\)
We note that the breakdown of C into C ^{ m } and C ^{ p } and phasing the SSM only affects the expected VAF in Case 1. This is because it is the only case where a CNV event can affect a mutated locus. Although PhyloWGS requires the breakdown of C into C ^{ m } and C ^{ p } under these conditions, we do not require the SSM to be phased, as many cannot be [26], and instead consider both possibilities when computing the likelihood. Some subclonal copy number callers decompose C into C ^{ m } and C ^{ p } [38]; if the caller does not provide this decomposition, then PhyloWGS should be run on loci where C∈{0,1,2}.
An important consequence of these rules is that under some conditions, it is possible to identify unambiguously a branching phylogeny using single sample data. If an SSM can be phased to an amplified locus there are situations where given particular values of x _{ ssm }, ϕ _{ c }, C ^{ p } and C ^{ m } one can distinguish between Case 1 and Case 3. For example, given x _{ ssm }=0.1,ϕ _{ c }=0.4,C ^{ m }=10 and C ^{ p }=1, for Case 3 the inferred ϕ _{ s } is 0.56. However, if Case 1 were true, the resulting inferred ϕ _{ s } would be negative and so Case 1 is not possible. This condition holds whenever x _{ ssm }×(2(1−ϕ _{ c })+C ϕ _{ c })<(C ^{ m }−1)ϕ _{ c }. We were unable to find any other circumstances in which single sample VAFs were more consistent with a branching phylogeny than a chain phylogeny.
Combination of copy number variations with simple somatic mutations is required for accurate subclonal reconstruction
Reference and total read counts for example tumor sequencing data
Mutation id  Reference counts  Total counts 

s0  23  47 
s1  35  69 
s2  26  51 
s3  29  56 
s4  30  40 
s5  42  70 
s6  27  41 
s7  36  56 
s8  59  75 
s9  57  76 
s10  51  68 
s11  37  51 
We also ran PyClone [18] on this dataset. PyClone cannot take as input that a locus has been homozygously deleted, so we ran PyClone either by telling it there were no CNV changes or that there was a deletion of one copy. Without any input CNV alterations, PyClone produced a clustering identical to PhyloSub, while in the single copy deletion state, PyClone placed SSM4 in an additional cluster with no other mutations. As this simple example illustrates, integrating data from both SSMs and CNVs is required for full, and accurate, subclonal reconstruction.
Applying PhyloWGS to simulated data
Subclonal lineage proportions used
Number of populations  ϕ values used (linear) 

3  0.44, 0.11 
4  0.56, 0.25, 0.06 
5  0.64, 0.36, 0.16, 0.04 
6  0.71, 0.44, 0.25, 0.11, 0.03 
Another important measure of the performance of our algorithm is how accurate the mapping from population to SSM is. To evaluate this accuracy in a systematic way that accounts for class imbalance, varying number of SSMs and differing number of clusters, we examine the area under the precision–recall curve (AUPRC) between the known true coclustering matrix and the average coclustering matrix from our samples. A coclustering matrix M is a binary matrix where M _{ ij }=1 if SSM i and SSM j are in the same cluster. The average coclustering matrix is constructed by taking the average of the coclustering matrices of each sample in the Markov chain after burnin and is an estimate of the posterior mean coclustering matrix of our model. The average coclustering matrix better predicts the true coclustering matrix than the coclustering matrix computed from the maximum data likelihood tree. AUPRC was chosen over area under the receiver–operator curve as it is known to be more informative in the presence of class imbalance [40], which changes as the number of populations increases.
Simulations with copy number variation changes
TCGA benchmark
Chronic lymphocytic leukemia
Breast tumor
Discussion
Our work makes two important contributions to the burgeoning field of subclonal reconstruction. First, we provide the first automated method that integrates SSM and CNV data in the reconstruction of tumor phylogenies. This is an important innovation; previous methods either ignore the impact that CNVs have on SSM allele frequencies [16,21], or assume that the CNVs affect all (and only) the cells that contain the SSM [1719]. These assumptions can lead to incorrect inferences about the population frequency of SSMs because how a CNV affects the allele frequency of an SSM depends on its phylogenic relationship with the SSM. Many of the insights about how to integrate SSM and CNV data appear in [26]; our work here extends and formalizes these seminal observations while also providing an automated method for phylogenic reconstruction. A further advantage of combining SSMs and CNVs in the phylogenic reconstruction is that CNVs overlapping the SSM locus can provide further constraints on the tree structure than are provided by SSM frequency alone, and we described one case where it is possible unambiguously to infer branching when an amplification of a SSMcontaining locus does not lead to a large increase in the SSM allele frequency.
Second, we show that given typical WGS read depths, SSMbased methods are able to reconstruct tumor phylogenies accurately, and detect and assign SSMs for at least six subpopulations. Previously, it was not clear to what extent this reconstruction would be possible and no automated reconstructions with more than two cancerous subpopulations based on WGS data had been described. Furthermore, we demonstrate the importance of phylogenic correction of VAFs of SSMs that occur in loci affected by copy number changes when performing subclonal reconstruction. Specifically, we presented results for a breast cancer benchmark where methods that do not use PhyloWGS’s phylogenic correction perform much worse at recovering subpopulations. Finally, we report examples of accurate subclonal reconstruction for cancer populations with highly reordered chromosomes solely on the basis of SSM frequencies in the regions of normal copy number. For these same data, a stateoftheart CNVbased method failed to perform the reconstruction.
The current version of PhyloWGS relies on preprocessing the sequencing data with a CNVbased method for subclonal reconstruction. This is because it assumes that the initial population frequency ϕ _{ i } and copy number data C _{ i } are already available for the CNVs; furthermore, for amplifications, C _{ i }>2, it requires C _{ i } to be separated into the relative number of each of the two copies, i.e., \(\{{C_{i}^{m}}, {C_{i}^{p}}\}\). It does not, however, require the SSMs to be phased; in other words, it does not need to know whether the SSMs occurred on the maternal or paternal copy of the chromosome. New CNVbased methods provide \(\{{C_{i}^{m}}, {C_{i}^{p}}\}\) [38]; our work anticipates further developments in subclonal CNV callers. If used with a method that cannot decompose copy number changes into changes in the maternal and paternal loci, PhyloWGS can be restricted to regions of copy number loss (i.e. C _{ i }<2), where there is only one possible breakdown. Note that this decomposition is only necessary when an SSM precedes a CNV on the same branch of a phylogeny.
We also note that PhyloWGS does not require the CNVbased preprocessing to be able to detect all of the subclonal populations, and we have shown that PhyloWGS can detect additional populations either defined completely by SSMs or that were not detected by CNVbased methods. This is particularly important because recent CNVbased methods are limited to a maximum of two cancerous populations and those that can detect >1 cancerous subpopulation do so by relying on a strong parsimony assumption. If invalid, this assumption can lead to large errors in subclonal reconstruction because it selects branching phylogenies over chain phylogenies that are equally well supported by the data.
Indeed our results suggest an alternative strategy for combining SSMs and CNVs in subclonal reconstruction. Regions unaffected by CNVs can be relatively easily detected using methods such as BICseq [37]. Even in highly rearranged cancer genomes, there are often nonnegligible regions of normal copy number and we have shown that we can achieve reasonably accurate subclonal reconstructions using the limited number of SSMs in regions of normal copy number in the TCGA benchmark.
In the final stages of preparing this manuscript, a new method, cloneHD [27], was published. Like PhyloWGS, this method combines both SSMs and CNVs in subclonal reconstruction and does so using WGS data from single and multiple samples. However, unlike PhyloWGS, cloneHD does not explicitly perform phylogenic reconstruction, so it is unable to account fully for the phylogenic relationship among SSMs and CNVs when analyzing SSM allele frequency. As such, it is not clear to us that it can correctly solve the subclonal reconstruction problem posed in Figure 1. The cloneHD manuscript also does not extend the limits of WGSbased subclonal reconstruction as none of the examples reconstruct more than two cancerous subpopulations from a single sample. Finally, cloneHD appears to rely on the strong parsimony assumption to assess subclonal genotypes, and only reports a single reconstruction, obscuring the uncertainty involved. However, cloneHD does appear to be an interesting and powerful method and we hope that future work can compare the merits and drawbacks of these alternate approaches to subclonal reconstruction.
Conclusions
We have presented a new method, PhyloWGS, which reconstructs tumor phylogenies and characterizes the subclonal populations present in a tumor sample using both SSMs and CNVs. Our method takes as input SSM variant and germline read counts, as well as estimates of population frequencies and copy number for each CNV. PhyloWGS groups the SSMs and CNVs into subpopulations, and estimates the population frequencies and the phylogenic relationship of these subpopulations. PhyloWGS is based on a generative probabilistic model of allele frequencies that incorporates a nonparametric Bayesian prior over trees. The output of PhyloWGS consists of samples from this distribution generated through MCMC and we report the tree that maximizes the likelihood of the data found during the sampling run, if a single point estimate is required. However, unlike our previous PhyloSub method [17], PhyloWGS includes CNVs in its subclonal reconstruction and, in doing so, can correctly account for the effect of CNVs on the VAF of overlapping SSMs. PhyloWGS also runs more than 50 times faster than PhyloSub, making it feasible to apply it to the thousands of SSMs that are found through WGS.
We have applied PhyloWGS to real and simulated data from WGS of tumor samples to demonstrate that subclonal populations can be reliably reconstructed based solely on SSMs from mediumdepth sequencing (30 to 50 ×). We have also used PhyloWGS to solve correctly a simulated subclonal reconstruction problem that neither an SSMbased nor a CNVbased method could solve alone, and to reconstruct the phylogeny and subclonal composition of a highly rearranged sample for which a CNVbased method fails. We also demonstrated that when applied to WGS of timeseries samples from a chronic lymphocytic leukemia patient, PhyloWGS recovers the same tumor phylogeny previously reconstructed by applying PhyloSub and a semimanual method to data from deep targeted resequencing. Finally, we demonstrated stateoftheart performance in subclonal reconstruction on a breast tumor sample, highlighting the advantages of performing phylogenetic CNV correction. Our work thus greatly expands the range of tumor samples for which subclonal reconstruction is possible, enabling widespread use of automated subclonal reconstruction for mediumdepth WGS sequencing experiments.
Materials and methods
PhyloSub model
The posterior distribution of \(\tilde {\phi }_{i}\) is \(p(\tilde {\phi }_{i} \mid a_{i},d_{i},{\mu _{i}^{r}},{\mu _{i}^{v}}) \propto p(a_{i} \mid d_{i},\tilde {\phi }_{i},{\mu _{i}^{r}},{\mu _{i}^{v}}) p(\tilde {\phi }_{i})\).
The Dirichlet process prior DP(α,H) in the observation model of allelic counts (1) is useful for inferring groups of mutations that occur at the same SSM population frequency [12,18]. Furthermore, being a nonparametric prior, it allows us to avoid the problem of selecting the number of groups of mutations a priori. However, it cannot be used to model the clonal evolutionary structure, which takes the form of a rooted tree. To model this, we use the treestructured stickbreaking process prior [46] denoted by TSSB(α,γ,H). The parameters α and γ influence the height and width of the tree, respectively, and are similar to the concentration parameter in the Dirichlet process prior. Let \(\{\phi _{k}\}_{k=1}^{K}\) denote the set of unique frequencies in the set \(\{\tilde {\phi }_{i}\}_{i=1}^{N}\), where K is the number of subclones or nodes in the tree. In other words, multiple elements in the set \(\{\tilde {\phi }_{i}\}_{i=1}^{N}\) will take on the same value from the set \(\{\phi _{k}\}_{k=1}^{K}\) of unique frequencies. The prior/base distribution H of the SSM population frequencies is the uniform distribution Uniform(0,1) for the root node and \(\text {Uniform}(0,\phi _{\text {par}(v)}\sum _{w \in \mathcal {S}(v)} \phi _{w})\) for any other node v in the tree, where par(v) denotes the parent node of v and \(\mathcal {S}(v)\) is the set of siblings of v. This ensures that the clonal evolutionary constraints (discussed below) are satisfied when adding a new node to the tree. Given this model and a set of N observations \(\{(a_{i},d_{i},{\mu _{i}^{r}},{\mu _{i}^{v}})\}_{i=1}^{N}\), the tree structure and the SSM population frequencies \(\{\tilde {\phi }_{i}\}\) are inferred using MCMC sampling (see PhyloSub [17] for further details).
where we have used \(\{\tilde {\eta }_{i}\}\) to denote the auxiliary variables for each SSM. The prior/base distribution of the auxiliary variables is defined such that it is 1 for the root node and Uniform(0,η _{par(v)}) for any other node v in the tree. When a new node w is added to the tree, we sample η _{ w }∼Uniform(0,η _{par(w)}) and update η _{par(w)}←η _{par(w)}−η _{ w }. This ensures that \(\sum _{v} \eta _{v} =1\). This change is crucial as it allows us to design a Markov chain that converges to the stationary distribution of {η _{ v }}. The SSM population frequency for any node v can then be computed via \(\phi _{v}= \eta _{v} + \sum _{w \in \mathcal {D}(v)} \eta _{w} = \eta _{v} + \sum _{w \in \mathcal {C}(v)} \phi _{w}\), where \(\mathcal {D}(v)\) and \(\mathcal {C}(v)\) are the sets of all descendants and children of node v respectively. This construction ensures that the SSM population frequencies of mutations appearing at the parent node are greater than or equal to the sum of the frequencies of all its children. We use the Metropolis–Hastings algorithm [47] to sample from the posterior distribution of the auxiliary variables \(\{\tilde {\eta _{i}}\}\) (2) and derive the SSM population frequencies from these samples by selecting the sampled set of population frequencies with the highest likelihood. We use an asymmetric Dirichlet distribution as the proposal distribution.
Integrating copy number variation data into PhyloSub
The focus of our new method, PhyloWGS, is integrating SSM frequencies with existing CNVbased subclonal reconstructions. As mentioned above, our algorithm takes as input a set of SSMs along with their allele frequencies, expressed for each SSM i, as the number of reads at the locus supporting either the SSM (b _{ i }) or the reference allele (a _{ i }). We also allow our algorithm to take a set of inferred copy number changes, where for each change j, the input provides the new copy number C _{ j } as well as the proportion of the population with the change \(\tilde {\phi }_{j}\). In some cases, we also require the breakdown of C _{ j } into the new number of maternal (\({C_{j}^{m}}\)) and paternal (\({C_{j}^{p}}\)) copies of the locus (see below for details). If this breakdown is not available, we can restrict our attention to CNVs for which C _{ j }<2 because in these cases, there is only one possible breakdown. Also, in the absence of a paternal/maternal breakdown, we should still be able, in theory, to assign SSMs with overlapping CNVs with C _{ j }>2 to specific populations once the phylogeny and subclonal populations have been defined using SSMs and CNVs in regions of C _{ j }≤2.
Below, we describe the rules, based on the ISA, that we use to determine the relationship between the population frequency of an SSM \(\tilde {\phi }_{i}\) and its observed VAF (b _{ i }/d _{ i }). When the SSM does not overlap a region that has a predicted CNV in any cell in the tumor population, then the predicted allele frequency is simply half of the modeled population frequency. We also describe the method by which we transform each CNV j into a pseudoSSM to be included in the phylogeny.
If copy number variations do not overlap with any simple somatic mutation
If a CNV occurs in a region without any SSMs, we generate a pseudoSSM for the CNV j, which is represented in the model as a heterozygous, binary somatic mutation with a read depth that reflects the uncertainty in the provided population frequency \(\tilde {\phi }_{j}\) for the CNV. Specifically, we generate reference and variant read counts, a _{ j } and b _{ j }, respectively, such that the allelic frequency b _{ j }/(a _{ j }+b _{ j }) is approximately equal to \(\tilde {\phi _{j}} / 2\) and the total number of reads a _{ j }+b _{ j } is selected based on the evidence supporting the CNV. Generating this pseudoSSM allows the CNV to be treated like any other SSM in the phylogeny model.
If copy number variations overlap with simple somatic mutations
 1.
The population does not contain the SSM and is not affected by a CNV.
 2.
The population does not contain the SSM but is affected by a CNV.
 3.
The population contains the SSM but is not affected by a CNV.
 4.
The population contains the SSM and is affected by a CNV, and the SSM occurred after the CNV.
 5.
The population contains the SSM and is affected by a CNV, and the CNV occurred after the SSM.
Note that the breakdown of C _{ i } into \({C_{i}^{m}}\) and \({C_{i}^{p}}\) is only required if the CNV occurs after the SSM on the same branch.
Note that in some circumstances, a SSM can be placed on a particular copy of the chromosome by looking for reads that cover the SSM and nearby heterozygous germline mutations. If this is not possible, then the likelihood of a _{ i } is the average of two likelihoods: the likelihood of the SSM occuring on the maternal genome and the likelihood of the SSM occuring on the paternal genome.
Extension to multiple samples
Our model can be easily extended to multiple tumor samples. We make no assumptions regarding the time when the samples were collected, so this extension is equally applicable to multiple samples collected simultaneously (e.g. as in [2]) or over a period of time as in [11]. We allow the treestructured stickbreaking process prior to be shared across all the samples. The main technical difference between the single and the multiple sample models lies in the sampling procedure for SSM population frequencies. In the multiple sample model, we ensure that the clonal evolutionary constraints are satisfied separately for each tumor sample and then make a global Metropolis–Hastings move based on the product of posterior distributions across all the samples (cf. 2).
Markov chain Monte Carlo settings
In all the MCMC experiments, we fix the number of MCMC iterations to 2,500 and use a burnin of 100 samples. We also fix the number of iterations in the Metropolis–Hastings algorithm to 5,000 and set the scaling factor for the Dirichlet proposal distribution to 100 (see PhyloSub paper [17]). We use the CODA R package [48] for MCMC diagnostics to monitor the convergence of the samplers using the completedata log likelihood traces and the corresponding autocorrelation function.
Sequencing error
It is becoming increasingly clear that sequencing error is not uniform across the genome and different trinucleotide sequences result in different sequencing error rates [49]. While the precise nature of these differences is not yet fully known, PhyloWGS allows the user to input a different sequencing error rate for each mutation.
Abbreviations
 AUPRC:

area under the precision–recall curve
 CNV:

copy number variation
 DP:

Dirichlet process
 ISA:

infinite sites assumption
 MCMC:

Markov chain Monte Carlo
 SSM:

simple somatic mutation
 VAF:

variant allelic frequency
 WGS:

wholegenome sequencing
Declarations
Acknowledgements
This work was funded by a National Science and Engineering Research Council operating grant and an Early Researcher Award from the Ontario Research Fund to QM. AGD is supported by a Vanier Canadian Graduate Scholarship from the Natural Sciences and Engineering Research Council. CKY, GHJ and LS are the recipients of support from the Ontario Ministry of Research and Innovation.
Authors’ Affiliations
References
 Nowell PC. The clonal evolution of tumor cell populations. Science. 1976; 194:23–8.View ArticlePubMedGoogle Scholar
 Gerlinger M, Rowan AJ, Horswell S, Larkin J, Endesfelder D, Gronroos E, et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. New Engl J Med. 2012; 366:883–92.View ArticlePubMedGoogle Scholar
 Hughes AEO, Magrini V, Demeter R, Miller CA, Fulton R, Fulton LL, et al. Clonal architecture of secondary acute myeloid leukemia defined by singlecell sequencing. PLoS Genet. 2014; 10:e1004462.View ArticlePubMed CentralPubMedGoogle Scholar
 Hanahan D, Weinberg RA. The hallmarks of cancer. Cell. 2000; 100:57–70.View ArticlePubMedGoogle Scholar
 Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell. 2011; 144:646–74.View ArticlePubMedGoogle Scholar
 Aparicio S, Caldas C. The implications of clonal genome evolution for cancer medicine. New Engl J Med. 2013; 368:842–51.View ArticlePubMedGoogle Scholar
 Bedard PL, Hansen AR, Ratain MJ, Siu LL. Tumour heterogeneity in the clinic. Nature. 2013; 501:355–64.View ArticlePubMedGoogle Scholar
 Mullighan CG, Phillips LA, Su X, Ma J, Miller CB, Shurtleff SA, et al. Genomic analysis of the clonal origins of relapsed acute lymphoblastic leukemia. Science. 2008; 322:1377–80.View ArticlePubMed CentralPubMedGoogle Scholar
 Navin NE, Hicks J. Tracing the tumor lineage. Mol Oncol. 2010; 4:267–83.View ArticlePubMed CentralPubMedGoogle Scholar
 Marusyk A, Polyak K. Tumor heterogeneity: causes and consequences. Biochimica et Biophysica Acta. 2010; 1805:105–17.PubMed CentralPubMedGoogle Scholar
 Schuh A, Becq J, Humphray S, Alexa A, Burns A, Clifford R, et al. Monitoring chronic lymphocytic leukemia progression by whole genome sequencing reveals heterogeneous clonal evolution patterns. Blood. 2012; 120:4191–6.View ArticlePubMedGoogle Scholar
 Shah SP, Roth A, Goya R, Oloumi A, Ha G, Zhao Y, et al. The clonal and mutational evolution spectrum of primary triplenegative breast cancers. Nature. 2012; 486:617–56.Google Scholar
 Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012; 30:413–21.View ArticlePubMedGoogle Scholar
 Landau DA, Carter SL, Stojanov P, McKenna A, Stevenson K, Lawrence MS, et al. Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell. 2013; 152:714–26.View ArticlePubMed CentralPubMedGoogle Scholar
 Oesper L, Mahmoody A, Raphael B. THetA: inferring intratumor heterogeneity from highthroughput DNA sequencing data. Genome Biol. 2013; 14:R80.View ArticlePubMed CentralPubMedGoogle Scholar
 Strino F, Parisi F, Micsinai M, Kluger Y. TrAp: a tree approach for fingerprinting subclonal tumor composition. Nucleic Acids Res. 2013; 41:e165.View ArticlePubMed CentralPubMedGoogle Scholar
 Jiao W, Vembu S, Deshwar AG, Stein L, Morris Q. Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinform. 2014; 15:35.View ArticleGoogle Scholar
 Roth A, Khattra J, Yap D, Wan A, Laks E, Biele J, et al.PyClone: statistical inference of clonal population structure in cancer. Nat Methods. 2014; 11:396–98.View ArticlePubMedGoogle Scholar
 Andor N, Harness JV, Müller S, Mewes HW, Petritsch C. EXPANDS: expanding ploidy and allele frequency on nested subpopulations. Bioinformatics. 2014; 30:50–60.View ArticlePubMed CentralPubMedGoogle Scholar
 Chen M, Gunel M, Zhao H. SomatiCA: identifying, characterizing and quantifying somatic copy number aberrations from cancer genome sequencing data. PloS One. 2013; 8:e78143.View ArticlePubMed CentralPubMedGoogle Scholar
 Larson NB, Fridley BL. PurBayes: estimating tumor cellularity and subclonality in nextgeneration sequencing data. Bioinformatics. 2013; 29:1888–9.View ArticlePubMed CentralPubMedGoogle Scholar
 Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, et al. The cancer genome atlas pancancer analysis project. Nat Genet. 2013; 45:1113–20.View ArticlePubMed CentralPubMedGoogle Scholar
 Ding L, Ley TJ, Larson DE, Miller CA, Koboldt DC, Welch JS, et al. Clonal evolution in relapsed acute myeloid leukaemia revealed by wholegenome sequencing. Nature. 2012; 481:506–10.View ArticlePubMed CentralPubMedGoogle Scholar
 Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012; 30:413–21.View ArticlePubMedGoogle Scholar
 Gusnanto A, Wood HM, Pawitan Y, Rabbitts P, Berri S. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from nextgeneration sequence data. Bioinformatics. 2012; 28:40–7.View ArticlePubMedGoogle Scholar
 NikZainal S, Loo PV, Wedge DC, Alexandrov LB, Greenman CD, Lau KW, et al. The life history of 21 breast cancers. Cell. 2012; 149:994–1007.View ArticlePubMed CentralPubMedGoogle Scholar
 Fischer A, VázquezGarcía I, Illingworth CJ, Mustonen V. Highdefinition reconstruction of clonal composition in cancer. Cell Reports. 2014; 7:1740–52.View ArticlePubMed CentralPubMedGoogle Scholar
 PhyloWGS. https://github.com/morrislab/phylowgs.
 Behjati S, Huch M, van Boxtel R, Karthaus W, Wedge DC, Tamuri AU, et al.Genome sequencing of normal cells reveals developmental lineages and mutational processes. Nature. 2014; 513:422–5.View ArticlePubMed CentralPubMedGoogle Scholar
 Burrell RA, McGranahan N, Bartek J, Swanton C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature. 2013; 501:338–45.View ArticlePubMedGoogle Scholar
 Klein CA. Selection and adaptation during metastatic cancer progression. Nature. 2013; 501:365–72.View ArticlePubMedGoogle Scholar
 Miller CA, White BS, Dees ND, Griffith M, Welch JS, Griffith OL, et al.SciClone: inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution. PLoS Comput Biol. 2014; 10:e1003665.View ArticlePubMed CentralPubMedGoogle Scholar
 Hajirasouliha I, Mahmoody A, Raphael BJ. A combinatorial approach for analyzing intratumor heterogeneity from highthroughput sequencing data. Bioinformatics. 2014; 30:78–86.View ArticleGoogle Scholar
 Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969; 61:893.PubMed CentralPubMedGoogle Scholar
 Hudson RR. Properties of a neutral allele model with intragenic recombination. Theor Popul Biol. 1983; 23:183–201.View ArticlePubMedGoogle Scholar
 Zare H, Wang J, Hu A, Weber K, Smith J, Nickerson D, et al.Inferring clonal composition from multiple sections of a breast cancer. PLoS Comput Biol. 2014; 10:e1003703.View ArticlePubMed CentralPubMedGoogle Scholar
 Xi R, Hadjipanayis AG, Luquette LJ, Kim TM, Lee E, Zhang J, et al.Copy number variation detection in wholegenome sequencing data using the Bayesian information criterion. Proc Nat Acad Sci. 2011; 108:E1128–36.View ArticlePubMed CentralPubMedGoogle Scholar
 Ha G, Roth A, Khattra J, Ho J, Yap D, Prentice LM, et al.TITAN: inference of copy number architectures in clonal cell populations from tumor wholegenome sequence data. Genome Res. 2014; 24:1881–93.View ArticlePubMed CentralPubMedGoogle Scholar
 Miller JW, Harrison MT. A simple example of Dirichlet process mixture inconsistency for the number of components. In: Advances in neural information processing systems. 2013:199–206.
 Davis J, Goadrich M. The relationship between precision–recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning: 2006. p. 233–40.
 Ewing A. TCGA mutation/variation calling benchmark 4. 2013. https://cghub.ucsc.edu/datasets/benchmark_download.html.
 Institute TB. Picard: Java tools for manipulating BAM files. http://picard.sourceforge.net/.
 Li H, Durbin R. Fast and accurate longread alignment with Burrows–Wheeler transform. Bioinformatics. 2010; 26:589–95.View ArticlePubMed CentralPubMedGoogle Scholar
 Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, et al.Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol. 2013; 31:213–9.View ArticlePubMedGoogle Scholar
 Saunders CT, Wong WS, Swamy S, Becq J, Murray LJ, Cheetham RK, et al. Strelka: accurate somatic smallvariant calling from sequenced tumor–normal sample pairs. Bioinformatics. 2012; 28:1811–7.View ArticlePubMedGoogle Scholar
 Adams RP, Ghahramani Z, Jordan MI. Treestructured stick breaking for hierarchical data. In: Advances in neural information processing systems 23: 2010. p. 19–27.
 Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970; 57:97–109.View ArticleGoogle Scholar
 Plummer M, Best N, Cowles K, Vines K. CODA: convergence diagnosis and output analysis for MCMC. R News. 2006; 6:7–11.Google Scholar
 Boutros PC, Ewing AD, Ellrott K, Norman TC, Dang KK, Hu Y, et al.Global optimization of somatic variant identification in cancer genomes with a global community challenge. Nat Genet. 2014; 46:318–9.View ArticlePubMed CentralPubMedGoogle Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.