THetA: inferring intratumor heterogeneity from highthroughput DNA sequencing data
 Layla Oesper^{1}Email author,
 Ahmad Mahmoody^{1} and
 Benjamin J Raphael^{1, 2}Email author
DOI: 10.1186/gb2013147r80
© Oesper et al.; licensee BioMed Central Ltd. 2013
Received: 26 April 2013
Accepted: 29 July 2013
Published: 29 July 2013
Abstract
Tumor samples are typically heterogeneous, containing admixture by normal, noncancerous cells and one or more subpopulations of cancerous cells. Wholegenome sequencing of a tumor sample yields reads from this mixture, but does not directly reveal the cell of origin for each read. We introduce THetA (Tumor Heterogeneity Analysis), an algorithm that infers the most likely collection of genomes and their proportions in a sample, for the case where copy number aberrations distinguish subpopulations. THetA successfully estimates normal admixture and recovers clonal and subclonal copy number aberrations in real and simulated sequencing data. THetA is available at http://compbio.cs.brown.edu/software/
Keywords
Cancer genomics intratumor heterogeneity DNA sequencing tumor evolution algorithmsBackground
Cancer is a disease driven in part by somatic mutations, which accumulate during the lifetime of an individual. The clonal theory of cancer progression [1] states that the cancerous cells in a tumor are descended from a single founder cell and that descendants of this founder cell acquired multiple mutations beneficial for tumor growth through multiple rounds of selection and clonal expansion. A tumor is thus a heterogeneous population of cells, each cell potentially containing a different complement of somatic mutations. These include both clonal mutations from the founder cell or early rounds of clonal expansion and subclonal mutations that occurred after the most recent clonal expansion. Alternatively, subclonal mutations may suggest that the tumor is polyclonal, consisting of subpopulations of cells that are not all descended from a single founder cell [2].
Highthroughput DNA sequencing technologies are now giving an unprecedented view of this intratumor mutational heterogeneity [3]. However, nearly all recent cancer sequencing projects generate DNA sequence from tumor samples consisting of many cells  including both normal (noncancerous) cells and one or more distinct populations of tumor cells. The tumor purity of a sample is the fraction of cells in the sample that are cancerous, and not normal cells. If a sample has a low tumor purity, then the power to detect all types of somatic aberrations in the cancer genomes is reduced. For example, lower tumor purity attenuates copy number ratios or allele frequencies away from the values expected with integral copy numbers. Methods to detect somatic copy number aberrations or loss of heterozygosity (LOH) from SNP array data or array comparative genomic hybridization (aCGH) data must account for this issue [4–9]. In addition, many algorithms for identifying somatic singlenucleotide mutations from DNA sequence reads implicitly or explicitly rely on an estimate of tumor purity. For example, the VarScan 2 program [10] uses an estimate of tumor purity as input to calibrate the expected number of reads that contain a somatic mutation at a locus.
Traditionally, tumor purity was assessed by visual analysis of tumor cells, either manually by a pathologist or via image analysis [11]. Recently, methods such as ASCAT [12] and ABSOLUTE [13] were introduced to estimate tumor purity directly from SNP array data. Both of these methods utilize the presence of copy number aberrations in cancer genomes to estimate both tumor purity and tumor ploidy, which is the number of copies of segments of chromosomes or entire chromosomes. Tumor purity and tumor ploidy are intertwined; for example, a heterozygous deletion of one copy of a chromosome in a 100% pure tumor sample (containing one cancer genome) could also be explained as a homozygous deletion in a 50% pure tumor sample (containing one cancer genome). Thus, it is necessary to estimate tumor purity and ploidy simultaneously, but this is a subtle and difficult problem. ASCAT and ABSOLUTE address this problem by estimating the average ploidy over the entire cancer genome. These estimates of tumor purity and average ploidy are then used in a second step to derive copy number aberrations.
Both ASCAT and ABSOLUTE have been shown to yield accurate estimates of tumor purity, achieving in some cases better estimates than via pathology or other techniques. However, these methods also have important limitations. First, the mathematical models used by ASCAT and ABSOLUTE are optimized for SNP array data, as we detail below. While these methods may be adapted to run on DNA sequencing data (for example, for ABSOLUTE see [14] and for ASCAT see below), the underlying mathematical model used by both methods does not adequately describe the characteristics of sequencing data. Second, both of these methods apply various heuristics in their estimation procedures, such as rounding copy numbers to the closest integer [12] and do not directly infer integer copy numbers for each segment of the genome during the estimation. Finally, both methods do not explicitly identify multiple tumor subpopulations, and instead infer only a single tumor subpopulation. For example, ABSOLUTE [13] classifies copy number aberrations as outliers if they are not clonal, but does not refine these outliers into subpopulations. If a tumor sample consists of multiple tumor subpopulations, then considering only a single tumor population may yield inaccurate estimates of tumor purity, as we show below.
Highthroughput DNA sequencing data is much higher resolution data than SNP arrays, and provides the opportunity to derive highly accurate estimates of both tumor purity and the composition of tumor subpopulations. For example, the number of reads containing a somatic singlenucleotide mutation at a locus provides  in principle  an estimate of the fraction of cells in a tumor sample containing this mutation. However, three interrelated factors complicate this analysis: (1) The number of reads supporting a somatic singlenucleotide mutation has high variance, implying that an estimate of the allele frequency will be highly unreliable at the modest coverages (30× to 40×) employed in nearly all current cancer sequencing projects. (2) Somatic mutations may be present in only a fraction of tumor cells. (3) Somatic copy number aberrations (nearly ubiquitous in solid tumors) alter the number of copies of the locus containing the mutation. While the first issue might be addressed in part by clustering allele frequency estimates across the genome [15–17], the second and third issues complicate such a clustering. Recent methods for analyzing tumor composition from DNA sequencing data either ignore copy number aberrations [17] or use iterative approaches [18] or other approximations [12, 13], and do not formally model the generation of DNA sequencing data from a mixture of integral copy numbers for each genomic segment.
Beyond the estimation of tumor purity and ploidy, it is desirable to identify subclonal aberrations, which can provide information on the age or history of the tumor [19], and can yield further insight into tumors that fail to respond to treatment or metastasize [19–21]. However, even with a pure tumor sample, characterizing subclonal mutations is a challenge. Tolliver et al. [22] infer subclonal copy number aberrations by comparing aberrations across different individuals, thus assuming that the progression of somatic copy number aberrations is conserved across individuals. Gerlinger et al. [23] recently demonstrated the extent of subclonal mutations by sequencing multiple (spatially separated) samples from a tumor, complementing earlier studies of heterogeneity using microarraybased techniques [24]. In another approach, Ding et al. [17] used a targeted ultradeep sequencing (1,000 × coverage) approach to estimate allele frequencies for relapse mutations in acute myeloid leukemia (AML). In another recent study, NikZainal et al. [25] used a SNP array based estimate of tumor purity [12] followed by extensive manual analysis of somatic mutations to identify a clonal (majority) population and a number of subclonal populations in each of several breast cancer genomes. Ultimately, singlecell sequencing techniques promise to provide a comprehensive view of cancer heterogeneity [26–29], but these techniques presently require specialized DNA amplification steps, which can introduce artifacts and also incur higher costs because they sequence many cells. Thus, the problem of the simultaneous estimation of and correction for tumor purity as well as the identification of clonal and subclonal mutations will remain a challenge for the majority of cancer sequencing projects.
In this paper, we introduce Tumor Heterogeneity Analysis (THetA), an algorithm that infers the most likely collection of genomes and their proportions from highthroughput DNA sequencing data, in the case where copy number aberrations distinguish subpopulations. In contrast to existing methods, we formulate and optimize an explicit probabilistic model for the generation of the observed tumor sequencing data from a mixture of a normal genome and one or more cancer genomes, each genome containing integral copy numbers of its segments. Specifically, we derive and solve the maximum likelihood mixture decomposition problem (MLMDP) of finding a collection of genomes  each differing from the normal genome by copy number aberrations  whose mixture best explains the observed sequencing data. Thus, we generalize the problem of estimating tumor purity to the problem of determining the proportions of normal cells and any number of tumor subpopulations in the sample.
Our formulation and solution of the MLMDP leverages the fact that copy number aberrations create a strong signal in DNA sequencing data: even relatively small copy number aberrations cause deviations in the alignments of thousands to millions of reads. Thus, in contrast to singlenucleotide mutations, where there is high variance in the number of reads at each position, many measurements (reads) are perturbed for each copy number aberration. Thus, each copy number aberration provides many data points for deconvolution of the tumor genome mixture. We show how to solve the MLMDP as a collection of convex optimization problems. THetA is the first algorithm  to our knowledge  that automatically identifies subclonal copy number aberrations in wholegenome sequencing data from mixtures of more than two genomes. Moreover, in the case of an admixture between a single (clonal) cancer population and normal cells, THetA runs in polynomial time; it is the first rigorous and efficient algorithm for simultaneously estimating tumor purity and inferring integral copy numbers.
We apply our THetA algorithm to simulated data and to real DNA sequencing data from breast tumors sequenced at approximately 188× and approximately 40× coverage from [25]. We quantify the normal cell admixture in each tumor, outperforming other algorithms for this task. We also demonstrate that allowing only one tumor subpopulation may lead to highly inaccurate tumor purity estimates, and subsequent failure to detect clonal and subclonal copy number aberrations. In the 188× sequenced tumor, we identify both clonal and subclonal tumor cell populations, each containing unique copy number aberrations. Our results recapitulate most of the findings reported in [25] for this sample, but also have some distinct differences, which are supported by the sequencing data. In one of the 40× sequenced tumors, we identified two previously unreported tumor subpopulations, demonstrating the ability to identify intratumor heterogeneity, in particular subclonal aberrations, at the modest sequence coverages that are the current standard in cancer sequencing studies.
Results
Maximum likelihood mixture decomposition problem
First, we will formulate the maximum likelihood mixture decomposition problem of finding the most likely mixture of tumor cell populations from a sequenced tumor sample. We assume that sequenced reads from a tumor sample are aligned to the reference human genome, the first step in cancer genome sequencing analysis [30, 31]. Typically, a matched normal genome is also sequenced to distinguish somatic mutations from germline variants. We focus on copy number aberrations in order to estimate tumor purity and subpopulations. Thus, we assume that a cancer genome differs from the reference genome by gains and losses of segments, or intervals, of the reference genome. These intervals are identified by examining the density, or depth, of reads aligning to each location in the reference [32–34], and/or by clustering discordant paired reads that identify the breakpoints of copy number aberrations or other rearrangements [35–40]. Following this analysis, the reference genome is partitioned into a sequence I = (I_{1}, ..., I_{ m }) of nonoverlapping intervals. We represent a cancer genome by an interval count vector c = (c_{1}, ..., c_{ m }), where c_{ j } ∈ ℕ is the integer number of copies of interval I_{ j } in the cancer genome. From the sequencing of a tumor sample, we observe a read depth vector r = (r_{1}, ..., r_{ m }), where r_{ j } ∈ ℕ is the number of reads with a (unique) alignment within I_{ j }.
Maximum likelihood mixture decomposition problem (MLMDP). Given an interval partition I of a reference genome and an associated read depth vector r derived from a tumor sample $\mathcal{T}$, find the underlying interval count matrix C and genome mixing vector μ that maximize the likelihood P(r  C , μ).
In the Materials and methods section below, we derive the probability P(rC, μ) in the MLMDP. In brief, under the usual assumptions for DNA sequencing, the probability p_{ j } that a read that aligns to an interval I_{ j } is equal to the fraction of the total DNA in the sample originating from interval I_{ j }. Hence, the probability P(rC, μ) of the observed read depth vector r follows a multinomial distribution determined by the interval count matrix C and genome mixing vector μ. We emphasize that the multinomial distribution models the fact that the number of reads aligning to each interval are not independent random variables, but rather are dependent on the number of copies (ploidy) of each interval in the cancer genome(s) (please see Additional file 1, Section A). In contrast to our probabilistic model for DNA sequencing data, other methods for estimating tumor purity and ploidy [12, 13] do not model the data as an observation from an experiment. Rather, they assume that the observed copy number ratio of an interval (or probe) is the ratio of the expected value of the tumor copy number and the expected value of the normal copy number (please see Additional file 1, Section B). Thus, they implicitly assume that the observed data is an average over many experiments.
Solving the maximum likelihood mixture decomposition problem
We now give an overview of our algorithm for solving the instance of the MLMDP where P(rC, μ) is the multinomial probability described above. Further details are in the Materials and methods section.
Restricting the space of interval count matrices
In practice, the interval count matrix C is not allowed to be any integervalued matrix. There are three natural constraints on the interval count matrix: (1) One component of the tumor sample is the normal genome. Thus, we set the first column c_{1} = (2, 2, ..., 2)^{T}, the vector whose entries are all two. (2) The number n of subpopulations is less than the number m of intervals. (3) The copy numbers of the intervals are integers between 0 and k, inclusive, where k ≥ 2. We let ${\mathcal{C}}_{m,n,k}$ denote the set of all matrices satisfying these properties.
A convex optimization algorithm
We wish to find the interval count matrix $\mathsf{\text{C}}\in {\mathcal{C}}_{m,n,k}$ and the genome mixing vector μ that maximize the multinomial likelihood P(rC, μ). However, this optimization problem is not straightforward to solve because it contains both integervalued variables (entries of C) and realvalued variables (entries of μ). We show that a special coordinate transformation allows the MLMDP to be solved as a disjunction of constrained convex optimization problems by enumerating the possible interval count matrices and solving a separate convex optimization problem for each such C (see Materials and methods). Since the number of possible matrices C grows exponentially with m and n, this bruteforce strategy approach will not scale well beyond small values of n subpopulations and m intervals. In a special, but important, case where a sample contains a single clonal tumor population along with a normal admixture (that is, n = 2), we show how to further restrict the space of possible interval count matrices C, and obtain an efficient algorithm (polynomial time in m) for the MLMDP. The runtime for our algorithm depends on the number of intervals m and maximum copy number k in the input. Simulations with m = 39 and k = 3 (described below) run in 1 to 2 minutes on a standard desktop, while increasing to k = 5 increases the runtime to approximately 25 to 40 minutes.
Selecting a solution
Two additional issues to be addressed in deriving a solution are: (1) how to select from multiple optimal solutions and (2) how to choose the number n of tumor subpopulations in the mixture. We note that tumor sequencing data alone does not distinguish between different optimal solutions with the same maximum likelihood. In mathematical terms, this is because only the parameter of the multinomial distribution is identifiable from the observed read depth vector r. Thus, we cannot distinguish between pairs (C, μ) and (C', μ') of interval count matrices and genome mixing vectors that give the same multinomial parameter. Our algorithm THetA has options to return the complete family of optimal solutions, or to limit to solutions with a baseline copy number of the clonal tumor population (see Materials and methods).
Regarding the second issue, note that the likelihood P(rC, μ) increases as the number n of tumor subpopulations in the mixture increases: indeed the observed read depth vector can be fitted 'perfectly' by placing each copy number aberration in its own tumor subpopulation. However, mixtures with larger n also have greater model complexity (that is, more parameters). We use a model selection criterion based on the Bayesian information criterion (BIC) to select a model with a balance between higher likelihood and lower model complexity in order to avoid overfitting.
Evaluation on simulated cancer genomes
Normal admixture: single cancer genome
Using two different sets of simulated data, we compared our THetA algorithm to three other methods for estimating tumor purity and ploidy: ASCAT [12], ABSOLUTE [13] and CNAnorm [18]. ASCAT and ABSOLUTE jointly estimate tumor purity and ploidy, and were originally designed for SNP array data. While both can be adapted to run on DNA sequencing data, they do not formally model this type of data, as noted above. CNAnorm is designed for DNA sequencing data, but rather than allowing tumor purity and tumor ploidy to inform each other, it uses an iterative approach that separately infers purity and copy numbers. In some instances, CNAnorm relies on the user manually entering the most abundant ploidy.
As noted above, there are multiple optimal solutions with the same maximum likelihood. CNAnorm [18] and ASCAT [12] use ad hoc criteria to return only a single purity estimate, and ABSOLUTE [13] uses external cancer karyotypes to select from multiple possible solutions. To compare THetA to these other methods, we must select a single pair (C, μ) from the set returned by THetA as a representative sample reconstruction. For all simulations, we chose the pair (C, μ) that maximizes the total length of all genomic intervals in the tumor genome with copy number 2, the expected copy number of the normal genome for humans. We note that this decision applies only to these simulations  for real sequencing data the set of all equally like solutions is returned by THetA from which a user may select one using additional information about the sample under consideration. For further details about the other algorithms please see Additional file 1, Sections K and L.
For our first set of simulations, we generated a cancer genome consisting of chromosome arm copy number aberrations. The copy number for each nonacrocentric chromosome arm was chosen uniformly at random from the range 0 (that is, homozygous deletion) through k > 2 (amplification), up to a specified maximum copy number k. While real cancer genomes may have copy numbers larger than the maximum value (k = 7) considered in these simulations, such high amplitude amplifications are generally focal events. We emphasize that it is not necessary to use all copy number aberrations to infer the tumor composition; for example, if there are a sufficient number of armlevel copy number aberrations, these may suffice. We then created a random mixture of this cancer genome and a 'matched normal' genome and simulated a read depth vector r for the mixture, adding noise according to the read depth estimation error φ. The parameter φ models errors in the sequencing and analysis process, and we estimated from real sequencing data that φ is in the range from 0.01 to 0.04 (please see Additional file 1, Section J and Figure S3). Since the ASCAT algorithm uses SNP array data, we also simulated SNP array data from our mixture. Further details of the simulations are in Additional file 1, Section I.
Performance of the algorithms on simulated data with one tumor population (n = 2)
% correct C  Copy number error (median)  Purity error (median)  

k  THetA  ASCAT  CNAnorm  ABSOLUTE  THetA  ASCAT  CNAnorm  ABSOLUTE  THetA  ASCAT  CNAnorm  ABSOLUTE 
3  100.0  85.0  40.0  70.0  0.0  0.0  0.103  0.000  0.004  0.040  0.068  0.010 
4  90.0  55.0  8.3  50.0  0.0  0.0  0.163  0.013  0.004  0.037  0.064  0.010 
5  85.0  50.0  6.7  15.0  0.0  0.013  0.185  0.160  0.004  0.062  0.038  0.075 
6  55.0  40.0  0.0  15.0  0.0  0.026  0.291  0.433  0.006  0.063  0.066  0.157 
7  30.0  15.0  0.0  10.0  0.031  0.036  0.445  0.471  0.005  0.069  0.108  0.149 
For this first set of simulations, we found that our THetA algorithm computes both C and μ very accurately over a range of copy numbers k. In particular, THetA outperforms CNAnorm, ABSOLUTE and ASCAT despite the fact that ASCAT uses additional information (allele frequencies) that THetA does not consider. Even with high amplitude copy number aberrations (k = 7) THetA on average estimates tumor purity within 0.5% of the true purity, compared to 6.9%, 10.8% and 14.9% by ASCAT, CNAnorm and ABSOLUTE respectively. Even when THetA does not estimate all copy numbers across the genome correctly, it estimates most copy numbers correctly and estimates the copy number correctly for more segments than the other algorithms (see Additional file 1, Figure S4). Further results comparing THetA to CNAnorm for different read depth estimation errors are in Additional file 1, Figure S5.
Mixture of tumor subpopulations
We next evaluated the performance of THetA on a simulated mixture containing two subpopulations of tumor cells with different copy number aberrations and an admixture with normal cells. Thus, there were three distinct subpopulations in the mixture (n = 3). Our method for constructing the simulated data was the same as for the first set of simulations, as described in the previous section, with a fixed read depth estimation error of φ = 0.02 along with a few minor changes (see Additional file 1, Section I).
Performance of the algorithms on simulated data with two tumor populations (n = 3)
% correct C  Copy number error (median)  Purity error (median)  

m  THetA  THetA  THetA  CNAnorm  ABSOLUTE 
6  35.0  0.118  0.081  0.202  0.458 
8  45.0  0.075  0.052  0.276  0.477 
10  35.0  0.071  0.055  0.177  0.434 
12  45.0  0.059  0.036  0.301  0.547 
While the performance of THetA was less precise in estimating all copy numbers (that is, the entries in C) exactly than for the tumor with a normal admixture (n = 2), THetA maintains a good level of accuracy as the estimates are near the true interval copy numbers. THetA correctly computes on average 94% of the copy numbers across all subpopulations in the mixture when there are m = 12 intervals with varying copy number in the subpopulations. THetA also estimates the tumor purity with good accuracy (within 3.6% of the true purity when m = 12), whereas both CNAnorm and ABSOLUTE gravely misestimate the tumor purity by 30.1% and 54.7%, respectively. One possible explanation for these errors is that both of these methods do not account for multiple subpopulations in the sample and therefore tend to report tumor purity as either the fraction of the sample representing the largest subpopulation, or as an average of the fractions of all tumor subpopulations. Thus, THetA successfully recovers a complex mixture of two tumor subpopulations and a normal cell admixture directly from the observed read depth.
Results from breast cancer sequencing data
We analyzed Illumina pairedend sequencing data from three breast cancer genomes and their matched normal samples from [25]. We downloaded the data from the European Genomephenome Archive (accession number EGAD00001000138). This includes two samples that were sequenced to a depth of approximately 40× coverage and one sample, PD4120a, that was sequenced with approximately 188× coverage. We used the BICSeq segmentation algorithm [32] to partition the 22 autosomes into intervals according to read depth. We formed an interval count vector from all intervals longer than 50 kb, focusing on these longer genomic intervals because their observed read depth will have lower variance (see Additional file 1, Figure S11). Most intervals removed in this step are relatively short for the samples analyzed. For the two 40× coverage genomes changing this cutoff to 10 kb resulted in the same partition as when 50 kb was used. For the 188× genome we only removed nine intervals from consideration when the threshold was 50 kb and seven when the threshold was 10 kb. The results from THetA are identical for the two different sets of intervals. We assume that most of the tumor genome does not undergo copy number aberrations, and thus the mode of the read depth vector is a normal baseline. We set lower and upper bounds on the copy number for each interval from this baseline. For further details, see Additional file 1, Section N.
Breast tumor: 188× sequence coverage
We analyzed the 188× sequenced tumor PD4120a using our THetA algorithm. We consider that the mixture contains a normal admixture with a single tumor subpopulation (n = 2) and a normal admixture with two distinct tumor subpopulations (n = 3). This sample was extensively annotated by [25] and thus acts a positive control for THetA.
Comparison of various algorithms on the 188× coverage breast cancer genome
Sample PD4120a  

Algorithm  % normal admixture  Clonal (% tumor purity)^{a}  Subclonal (%)^{a} 
THetA, n = 2  34.3%  Del: 1p, 4q, 13, 16q, 22q   
(segmentation)  +1: 1q  
(65.7%)  
THetA, n = 2 (chromosome arms)  38.3%  Del: 1p, 4q, 13, 16q, 22q   
+1: 1q  
(61.7%)  
CNAnorm^{b} (chromosome arms)  32.8%  Del: 1p, 4q, 13, 16q, 22q   
+1: 1q  
(67.2%)  
ASCAT^{c}  34%  Del: 1p, 4q, 13, 16q, 22q   
(virtual SNP array)  +1: 1q, 17q, 18, 19, 20  
(66.0%)  
ABSOLUTE^{d} (segmentation)  35%  (65.0%)   
THetA  28.0%  Del: 1p, 4q, 16q, 22q12.2  Del: 13, 22q11.212.1 
n = 3  13.3  +1: 1q  
(72.0%)  (61.9%)  
Del: 8, 11, 12, 14, 15  
(10.1%)  
Del: 2, 7, 4p , 6, 9, 18, 21  
NikZainal et al. (2012)  30%  Del: 4q  Del: 13, t(1;22) 
[25]  +1: 1q  (47.6%)  
(70.0%)  Tetraploid with:  
Del(2): 2, 7  
Del(1): 6, 8, 9, 11, 12,  
14, 15, 18, 21  
(9.8%) 
We investigated further the following three differences between our analysis and [25]: (1) clonal deletion of chromosome 16q, (2) clonal vs. subclonal amplification of chromosome 1q and (3) clonal vs. subclonal deletions in chromosome 22q. We analyzed these differences using two complementary approaches. First, we analyzed the distribution of tumor/normal read depth ratios in 50 kb bins across the genome. This distribution contains distinct peaks corresponding to copy number aberrations occurring in different subpopulations. After correcting the read depth ratios for a normal admixture using a linear scaling (see Additional file 1, Section O), peaks corresponding to clonal aberrations will occur at ratios divisible by 0.5, whereas peaks corresponding to subclonal aberrations will not (Figure 3B). Second, we analyzed a virtual SNP array that we constructed from the read counts and the variant allele frequencies derived from aligned reads at known germline SNPs (see Materials and methods). Copy number aberrations occurring in different subpopulations appear as distinct clusters in a scatter plot of read count vs. variant allele frequencies (Figure 3C).
The first difference we analyzed was our prediction of a clonal deletion of chromosome 16q, which was not reported by [25]. Visual inspection of the virtual SNP array data for chromosome 4 (predicted to be a clonal deletion by both methods) and chromosome 16q shows three distinct clusters  one for regions of normal copy (centered at a variant allele frequency of 0.5) and two clusters (positioned symmetrically around a variant allele frequency of 0.5) with a lower read count that indicate a deletion (Figure 3D). These deletion clusters have virtually identical locations in the scatter plot for chromosomes 4 and 16q  supporting the conclusion that these deletions occur in the same fraction of the tumor sample. Comparing the difference between the observed and expected read depth ratios in these deletions for different aberration fractions (the percentage of the sample containing the aberration) reveals that the optimal aberration fraction for both deletions is very similar  additional evidence that these deletions occur in the same fraction of the tumor sample (see Additional file 1, Section Q and Figure S12). Given the strong evidence for this chromosome 16 deletion, we suspect that its omission from [25] was an oversight rather than a deficiency of the analysis.
The second difference is that we predicted chromosome 1q to be amplified in a subclonal population consisting of 61.9% of the cells in the sample, whereas [25] indicated that this aberration is clonal (occurring in 70% of cells in the sample). Since this was the only large amplification present in the sample, we were not able to compare its variant allele frequencies to a different amplification (as we did with chromosomes 4 and 16 above). Therefore, we examined the read depth data more closely. Visual inspection of read depth ratios after adjusting for our predicted 28% normal admixture (Figure 3B) and the 30% normal admixture predicted by [25] (see Additional file 1, Figure S11) shows that the corrected read depth ratios for chromosome 1q do not match a ratio of 1.5 well (as would be expected if there was a clonal amplification with copy number 3)  an indication that 1q is a subclonal aberration. Comparison of read depth ratios for 1q to other clonal aberrations supports our prediction that 1q is a subclonal deletion (see Additional file 1, Figure S12).
The final difference involves chromosome 22q; we predicted that it contains both clonal and subclonal deletions, while [25] only reported subclonal events. In particular, [25] reported that a deletion of a derivative chromosome from a translocation between chromosomes 1 and 22 is the rearrangement responsible for the subclonal deletion observed on 22q. We found that 1p (see Additional file 1, Figure S12) and the distal portion of 22q (cytogenetic bands 12.213.3) appear to be clonal deletions, while the proximal portion of 22q (cytogenetic bands 11.212.1) is a subclonal deletion. In particular, the readdepth/variantallele plot from the virtual SNP array shows an oblong cluster for chromosome 22 that only partially overlaps with the cluster for chromosome 13, a chromosome predicted by both methods to have undergone a subclonal deletion (Figure 3E). This evidence supports another possible sequence of rearrangements: (1) A nonreciprocal translocation occurred between chromosomes 1 and 22 (supported by the output from the GASV algorithm [48] for clustering of discordant reads as discussed in Additional file 1, Section Q) resulting in the clonal loss of 1p and 22q12.213.3. Following this translocation, two copies of 22q11.212.1 remained. (2) One of these remaining two copies of 22q11.212.1 was deleted in a subclonal population (see Additional file 1, Figure S13).
Breast tumor: 40× sequence coverage
We also analyzed two tumor samples from [25] sequenced at approximately 40× coverage. For sample PD4088a, the model of this mixture preferred by our model selection procedure was a single clonal tumor population with normal admixture fraction 41%. [25] also reported this sample as clonal, although they did not provide an estimate of tumor purity or copy number aberrations. Further details of the analysis of this sample are in Additional file 1, Section S and Figure S16.
Discussion
We introduce Tumor Heterogeneity Analysis (THetA), an algorithm that infers the most likely collection of genomes and their proportions from highthroughput DNA sequencing data, in the case where copy number aberrations distinguish subpopulations. We show that THetA outperforms three other methods, CNAnorm [18], ASCAT [12] and ABSOLUTE [13], for inferring tumor purity and identifying copy number aberrations in the case of a single tumor cell population admixed with normal (noncancerous) cells. Moreover, we demonstrate that THetA successfully estimates tumor purity even at low purity (10%) and with modest sequence coverages (approximately 30×) on both real and simulated data. In contrast to other recent methods [12, 13] that first infer average ploidy across the genome, THetA simultaneously estimates tumor purity and computes the integral copy number of each genomic segment/interval. These advantages result from THetA exploiting the large number of data points (reads) that measure copy number aberrations in highthroughput sequencing data  information that is not available from SNP arrays.
We also demonstrate that THetA successfully deconvolves a tumor sample into a normal population and multiple tumor subpopulations, inferring the proportion of each subpopulation in the mixture, and partitioning copy number aberrations into clonal and subclonal populations. Other existing methods, such as ASCAT [12], ABSOLUTE [13] and CNAnorm [18], do not directly infer multiple subpopulations. Further, we show that these methods can produce highly inaccurate estimates of tumor purity on samples containing multiple subpopulations, and are sometimes unable to identify some copy number aberrations that occur in subpopulations of tumor cells. In addition, THetA reports all possible solutions of interval count matrices C and genome mixing vectors μ with the same maximum likelihood, allowing users to explore different maximum likelihood solutions. Thus, THetA is an attractive alternative to these methods.
We demonstrated the advantages of THetA using three breast cancer genomes sequenced in [25]: one sequenced at approximately 188× coverage and two at approximately 40× coverage. NikZainal et al. [25] showed how a large amount of information about a tumor's evolutional history can be derived by analyzing clonal and subclonal mutations in highcoverage sequencing data. Our THetA algorithm automates some of the manual analysis involved in such reconstructions. For the 188× genome, our results are largely concordant with the extensive analysis and annotation of this sample in [25]. THetA automatically recovered nearly all of the copy number aberrations reported in [25], but with some differences in the classification of aberrations as clonal or subclonal. Allele data not used by THetA provides external evidence that support the THetA results in several cases. On one of the 40× coverage genomes, we identified two previously unreported tumor subpopulations in nearly equal proportions, as well as a 24% normal admixture. These results are supported by statistical comparisons of read depth ratios, and also allowed us to identify copyneutral LOH on chromosome 8q. Thus, we demonstrated that it is possible to identify multiple tumor populations successfully in a single sample by considering a subset of genomic intervals. Further, we did so for an approximately 40× sequenced tumor, demonstrating the ability to identify intratumor heterogeneity at sequence coverages that are the current standard in cancer sequencing studies.
THetA uses only read depth for inferring intratumor heterogeneity, in contrast to other methods [12, 13, 17, 25] that use allele frequencies of heterozygous germline SNPs and somatic mutations. Since copy number aberrations  even those of a modest size  affect a large number of reads, THetA is able to infer multiple tumor subpopulations directly from sequencing data. However, THetA also has some limitations. First, the reliance on copy number aberrations means that THetA is unable to identify tumor subpopulations that do not contain copy number aberrations. As copy number aberrations are ubiquitous in many types of cancers, particularly solid tumors, we expect that THetA will prove useful for analyzing a wide range of different cancer samples. Second, while the mathematical model used by THetA allows for any number of subpopulations, in practice the number of subpopulations that can be correctly inferred depends on having at least one copy number aberration that distinguishes every subpopulation. Finally, THetA's computation time increases with an increasing number of subpopulations.
Our focus in the development of THetA was to address rigorously the difficult problem of analyzing tumor purity and subclonal copy number aberrations from DNA sequencing data. A logical next step is to use the output from THetA to help predict singlenucleotide mutations in tumor samples and/or assess the clonality of somatic mutations, both challenging problems in their own right. Carter et al. [13] and NikZainal et al. [25] show that once tumor purity is correctly estimated, then this value can be used to analyze the clonality/subclonality of somatic mutations. Incorporating the additional signal of variant allele frequencies into the probabilistic model, as well as extending the model to allelespecific copy number changes [52], are important directions for future work. Ultimately, a desirable goal is to integrate into a single probabilistic framework the detection of all types of somatic aberrations (single nucleotide, copy number and rearrangements) with the estimation of tumor purity and the derivation of tumor subpopulations. Finally, further algorithmic improvements in THetA would help in the analysis of more complicated tumor samples that have more intervals (for example, smaller copy number aberrations), higher amplitude copy number aberrations, more subpopulations or more complicated rearrangements; for example, due to breakage/fusion/bridge (B/F/B) cycles [53], chromothripsis [54] or extrachromosomal amplifications [55]. THetA runs in polynomial time for a mixture of two genomes with intervals of equal weight, but the question of the complexity of the MLMDP for n > 2 remains open.
A number of other techniques have recently been used to study intratumor heterogeneity. For example [56] uses expression profiles across different individuals to identify differentially expressed genes with respect to healthy cells at the cancer site of origin. Singlecell sequencing and multiregion sequencing from a primary tumor are alternative strategies that have been successfully employed [23–29]. As these technologies improve they will likely further contribute to our understanding of intratumor heterogeneity. However, sequencing of primary tumor samples as well as matched tumor/metastasis samples will remain a dominant protocol for some time. Thus, algorithms, such as THetA, ABSOLUTE, ASCAT and others, that can derive information about intratumor heterogeneity from DNA sequencing of tumor samples are a useful complement to other technologies and techniques for tumor heterogeneity studies.
Conclusions
Tumors are highly heterogeneous with individual cells in a tumor typically having different complements of somatic mutations. Highly accurate estimates of tumor purity and tumor subpopulation frequencies are necessary for investigating intratumor heterogeneity from single tumor samples. We introduce THetA, an algorithm that infers the most likely collection of genomes and their proportions from highthroughput DNA sequencing data, in the case where copy number aberrations distinguish subpopulations. We show the power of THetA with both simulated and real sequencing data  demonstrating the ability to identify intratumor heterogeneity (in particular subclonal copy number aberrations) at modest sequence coverages (approximately 40×) that are the current standard in cancer sequencing studies.
Materials and methods
Intervals and counts: probability model
In this section we derive the probability P(rC, μ) in the maximum likelihood mixture decomposition problem (MLMDP).
Single genome
to denote a normalized vector. Thus, the observed read depth vector r is the result of $r={\sum}_{j=1}^{m}{r}_{j}$ independent draws from a multinomial distribution with parameter $p=\widehat{c}$; that is, $r~\mathsf{\text{Mult}}\left(\text{r;}\widehat{c}\right)$.
We emphasize that the number of reads aligning to each interval are not independent random variables. As the total number of reads becomes extremely large, the multinomial distributions will converge to independent Poisson distributions in each interval. However, even with the millions to billions of reads produced by highthroughput DNA sequencing, the effects of using a finite number of reads are an issue in cancer genome sequencing. This is because large copy number changes  including the gain and loss of whole chromosomes  are common in cancer genomes. A large deletion or duplication will affect the number of reads observed in other intervals; for example, if we consider the 22 autosomes, a homozygous deletion of chromosome 1 will reduce the effective length of the cancer genome by 8.65%, altering the expected number of reads observed in other intervals (see Additional file 1, Figure S1).
Mixture of genomes
That is $r~\mathsf{\text{Mult}}\left(r;\hat{C\mu}\right)$.
Solving the maximum likelihood mixture decomposition problem
can be written in the form $p=\hat{C\mu}$ (see Additional file 1, Section C). Thus, a solution of the MLMDP is obtained by maximizing the multinomial likelihood over all $p\in {\text{\Delta}}_{m1}$.
Constraints on C
Since all entries of C are positive integers and all μ_{ j } are positive reals, (5) is a mixed integer problem. In general, mixed integer linear programming (MILP) problems are NPhard to solve [57]. In our case, the objective function is a nonlinear function of C and μ, meaning that even sophisticated MILP solvers are unlikely to be much benefit for this problem.
A coordinate transformation
Rather than attempting to solve the optimization problem (5) as a generic MILP, we derive a coordinate transformation that allows us to solve this problem as a constrained optimization problem in ℝ^{ m }. First, note that a pair (C, μ) ∈ Ω defines a probability distribution $\hat{C\mu}$. We define ${P}_{\text{\Omega}}=\left\{\hat{C\mu}\left(C,\mu \right)\in \text{\Omega}\right\}$ to be the space of all such probability distributions for all (C, μ) ∈ Ω. Note that only $\hat{C\mu}$, and not (C, μ), is identifiable from the observed data r. We prove the following theorem in Additional file 1, Section D.
Theorem 1. Suppose p ∈ P_{Ω}, so $\mathsf{\text{p=}}\hat{C\mu}$, for some (C , μ) ∈ Ω. Then there exists μ' ∈ Δ_{ n  1 } such that $p=\widehat{C}{\mu}^{\prime}$ where $\widehat{C}=\left(\hat{{c}_{1}},\dots ,\hat{{c}_{n}}\right)$.
Now suppose the interval count matrix C is fixed, and let $H\left(C\right)=\left\{\widehat{C}\mu \mu \in {\text{\Delta}}_{n1}\right\}$ denote the set of convex combinations of the normalized column vectors in C. Then (5) reduces to the problem of finding argmin _{p ∈ H(C)}${\mathcal{L}}_{r}\left(p\right)$. Since the objective function ${\mathcal{L}}_{r}\left(p\right)$ is separable convex (see Additional file 1, Section F) and the domain H(C) is convex, this problem is easy to solve using standard convex optimization routines.
A more efficient algorithm for the MLMDP
We derive an algorithm to solve the MLMDP (as formulated in (6)) that is polynomial time in m when n = 2. This algorithm relies on the observation that it is necessary to consider only a subset of interval count matrices C whose entries satisfy ordering constraints imposed by the read depth vector r. We say that two vectors a = (a_{1}, ..., a_{ m }) and b = (b_{1}, ..., b_{ m }) ∈ ℝ^{ m } have compatible order if for all 1 ≤ i, j ≤ m, a_{ i } ≤ a_{ j } if and only if b_{ i } ≤ b_{ j }. Note that the vector x = (s, ..., s) ∈ ℝ^{ m } for any s ∈ ℝ has compatible order with all vectors in ℝ^{ m }.
Theorem 2. Suppose ${p}^{*}=\hat{{C}^{*}{\mu}^{*}}=\mathsf{\text{argmi}}{\mathsf{\text{n}}}_{p\in {P}_{{\text{\Omega}}_{m,n,k}}}{\mathcal{L}}_{r}\left(p\right)$. Then we have the following:
1. p * and r have compatible order.
2. If n = 2 and μ_{ 2 }^{ * } > 0, then r and c_{ 2 }^{ * } have compatible order.
Theorem 2 (proof is in Additional file 1, Section E) leads to a more efficient algorithm where we evaluate only matrices ${\text{C=(c}}_{1},{\text{c}}_{2})\in {\mathcal{C}}_{m,2,k}$ where c_{2} has compatible order with r. The number of such matrices is O(m^{ k }) and enumeration of these matrices is O(m^{k+1}) (see Additional file 1, Section E). Note that if n ≥ 3, that is, there is more than one cancer genome in the mixture, then the ordering constraints do provide some restrictions on the entries of C (see Additional file 1, Section E). The reduction is not enough to make the space have polynomial size (in m), but the restrictions are useful in practice.
Intervals of unequal length and mappability
Thus far, we made the simplifying assumptions that all intervals in I are of equal length and that reads are aligned to each interval without any biases from the DNA sequence of the interval. Now we consider the general case where each interval I_{ j } has an associated positive weight w_{ j }. These weights can model both interval lengths as well as different mappability of intervals  that is, the probability of reads aligning uniquely to an interval in the reference genome can depend on the repeat content of the interval [58]. Let w = (w_{1}, ..., w_{ m }) be the interval weight vector. In practice, we use the read depth vector over I for the paired normal sample as w, which allows us to implicitly incorporate information on interval length, mappability and GC content into the model.
We define the linear transformation Φ: ℝ^{ m } → ℝ^{ m } to be $\text{\Phi}\left(v\right)=\hat{Wv}$. Thus, p = Φ(c) and r ~ Mult(r; Φ(c)). As in the unweighted case above, if the entries in c are allowed to be arbitrary positive integers, then for any integer read depth vector r and nonnegative weight vector w we can always find the maximum likelihood solution to the corresponding weighted MLMDP (see Additional file 1, Section G).
Given a read depth vector r and an interval weight vector w, we formulate the analogous maximum likelihood mixture decomposition problem of identifying the underlying interval count matrix C and genome mixing vector μ that maximize the multinomial likelihood Mult(rΦ(C μ)).
Theorem 3 (see Additional file 1, Section G for the proof) relates the optimal (C, μ) in the cases of equal and unequal weighted intervals.
Using this theorem, we find the optimal solution in the weighted interval case by solving the unweighted interval case; for example, using the techniques above. As stated, Theorem 3 applies to the case where (C, μ) ∈ Ω_{ m,n } (that is, the entries of C are unbounded). However, we can still leverage the logic behind this result when we add a restriction that $\mathsf{\text{C}}\in {\mathcal{C}}_{m,2,k}$. While we do not expect that $\mathsf{\text{argmi}}{\mathsf{\text{n}}}_{\left(C,\mu \right)\in {\text{\Omega}}_{m,2,k}}{\mathcal{L}}_{r}\left(\text{\Phi}\left(C\mu \right)\right)$ is equal to $\mathsf{\text{argmi}}{\mathsf{\text{n}}}_{\left(C,\mu \right)\in {\text{\Omega}}_{m,2,k}}{\mathcal{L}}_{{\text{\Phi}}^{1}\left(r\right)}\left(\hat{C\mu}\right)$, we may assume that a solution to $\mathsf{\text{argmi}}{\mathsf{\text{n}}}_{\left(C,\mu \right)\in {\text{\Omega}}_{m,2,k}}{\mathcal{L}}_{r}\left(\text{\Phi}\left(C\mu \right)\right)$ will satisfy the same order constraints as ${\mathcal{L}}_{{\text{\Phi}}^{1}\left(r\right)}\left(\hat{C\mu}\right)$. Namely, we expect that the optimal solution will have compatible order with Φ^{1}(r) (Theorem 2). This is because: (1) the unconstrained optima (when (C, μ) ∈ Ω_{ m,n }) for the two likelihood functions are equal, (2) the objective function ${\mathcal{L}}_{r}\left(p\right)$ is well behaved (separable convex) and (3) the transformation Φ is linear. Thus, the optima in the constrained weighted case cannot deviate too much from the optima in the constrained unweighted case, where the ordering conditions hold. Thus, we need only to consider $\mathsf{\text{C}}\in {\mathcal{C}}_{m,2,k}$ where c_{2} has compatible order with Φ^{1}(r) to find an optimum. We verified this statement empirically over a variety of simulations (see Additional file 1, Section H).
Model selection
We use the Bayesian information criterion (BIC) to make a selection from different sized models (that is, different values of n) and their corresponding sets of maximum likelihood solutions. The standard form of the BIC is 2 log(L) + a log(b) where L is the likelihood of a solution, a is the number of free parameters in the model and b is the number of data points. We add a slight modification to this, which is similar to a modification used by the segmentation algorithm BICSeq [32] that allows use of more stringently penalized solutions with more free parameters using a new parameter γ. The motivation is that the BIC tends to be too liberal when the model space is large [59]  as is the case here. Values of γ above 1 will penalize models that have more distinct tumor populations more strongly. That is, increasing this parameter will more strongly encourage solutions with fewer subpopulations. The default value of γ is 10, and was chosen because we expect to recover a small number of distinct subpopulations from sequencing data  thus making penalization of models with more subpopulations attractive. Additionally, changing γ in either direction (by up to 4) from this default value yields consistent results on the datasets analyzed. Our modified BIC is 2 log(L) + γa log(b), where a = (m + 1)(n  1) and b is the total number of reads in the intervals for both the tumor and normal samples. Since we often run the n = 3 version of the algorithm on a subset of the intervals used in the n = 2 algorithm, we use the following steps to determine which value of n to select. (1) Run the algorithm for n = 2 and n = 3 using the subset of intervals and the lower and upper bounds used for n = 3 and obtain respective likelihood values. (2) Compute the modified BIC for both values of n and choose the one with the lowest value.
Sets of maximum likelihood solutions
If (C, μ), (C', μ') ∈ Ω such that $\hat{C\mu}=\hat{{C}^{\prime}{\mu}^{\prime}}$, then for any observed read depth vector r, the likelihood of observing the r will be identical between these two solutions. That is, ${\mathcal{L}}_{r}\left(\hat{C\mu}\right)={\mathcal{L}}_{r}\left(\hat{{C}^{\prime}{\mu}^{\prime}}\right)$. By default THetA will always output the complete set of maximum likelihood solutions to the MLMDP given the input parameters (for example, the maximum copy number k to consider). However, THetA has several options that allow a user to input additional information, like sample ploidy, which may be known in advance. One option allows a user to supply an expected ploidy for a sample (for example, 4 in the case of a tetraploid genome), and the lower and upper bounds considered for all intervals are rescaled to reflect this expected ploidy. Another option allows a user to set lower and upper bounds on copy numbers directly for all intervals in the genome. In either case, THetA will still output the complete set of maximum likelihood solutions that reflect the options supplied by the user.
Code availability
The THetA software is available for download from our website [60]. For a copy of the software at the time of publication please see Additional file 2, although we recommend that the latest version of THetA be downloaded from [60].
Analysis of breast cancer genomes
Here we provide additional details of the analysis of the breast cancer samples.
Breast tumor: 188× sequence coverage
For the n = 2 analysis of sample PD4120a, we used all genomic intervals derived following BICSeq segmentation (λ = 100) after removal of all intervals less than 50 kb in length. For the n = 3 analysis, we selected a subset of these intervals by choosing: (1) all chromosomes that BICSeq partitioned into a single interval and (2) all intervals >22 Mb that were reported as having an abnormal copy number (≠ 2) in the n = 2 analysis. We used only the longest such interval per chromosome if the number of such intervals was large. We later added all intervals from chromosome 22 to this subset in order to resolve differences between our results and those presented in [25]. Since the results for both subsets were extremely similar, we present here the results for the larger subset (including chromosome 22). Results for the smaller subset are given in Additional file 1, Figure S9.
Breast tumor: 40× sequence coverage
For the n = 2 analysis of sample PD4115a, we used all genomic intervals derived from BICSeq segmentation (λ = 200) after removal of all intervals 50 kb in length. We found that PD4115a contains many apparent copy number aberrations with the segmentation containing 102 intervals (compared to only 69 intervals for sample PD4120a above). In addition, this sample also includes several highly amplified regions, and no chromosome was segmented into a single interval. Thus, we ran THetA for n = 3 on a subset of the longest intervals in the BICSeq partition, and set lower and upper bounds on the copy number for each interval (see Additional file 1, Section N).
Virtual SNP arrays
To compare those of our predictions that differed from those presented in [25], we looked at known germline SNP allele frequencies derived directly from the sequencing data  a virtual SNP array. We emphasize that this data was not used by our THetA algorithm for computing tumor heterogeneity, and therefore this provides independent data for validation. We looked at read coverage and variant allele frequency for the 907,693 SNP positions on the 22 autosomes tested by the Affymetrix 6.0 SNP array (SNP positions and major and minor alleles for hg19 determined using the UCSC genome browser [61]). The read coverage for a SNP position is the number of concordant reads with mapping quality >30 that have an alignment containing either the major or minor allele at the SNP position. The variant allele fraction, or BAF, is the fraction of such reads that contains the minor allele.
List of abbreviations used
 AML:

acute myeloid leukemia
 BIC:

Bayesian information criterion
 kb:

kilobase
 LOH:

loss of heterozygosity
 Mb:

megabase
 MILP:

mixed integer linear programming
 MLMDP:

maximum likelihood mixture decomposition problem
 SNP:

single nucleotide variant
 TCGA:

The Cancer Genome Atlas
 THetA:

Tumor Heterogeneity Analysis.
Declarations
Acknowledgements
We thank Peter Campbell and Adam Butler for their assistance in obtaining the breast cancer data from EGA. We also thank Li Ding and Chris Miller for their assistance is selecting the AML genome from TCGA used in the simulations. The results published here are in whole or part based upon data generated by The Cancer Genome Atlas pilot project established by the National Cancer Institute and the National Human Genome Research Institute. Information about TCGA and the investigators and institutions that constitute the TCGA research network can be found at [62].
This work is supported by a National Science Foundation (NSF) graduate research fellowship DGE0228243 (LO), NSF career award CCF1053753 (BJR) and grant RO1 HG5690 from the National Institutes of Health (BJR). BJR is also supported by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund, an Alfred P Sloan Research Fellowship.
Authors’ Affiliations
References
 Nowell PC: The clonal evolution of tumor cell populations. Science. 1976, 194: 2328. 10.1126/science.959840.PubMedView ArticleGoogle Scholar
 Parsons BL: Many different tumor types have polyclonal tumor origin: evidence and implications. Mutat Res. 2008, 659: 232247. 10.1016/j.mrrev.2008.05.004.PubMedView ArticleGoogle Scholar
 Ding L, Raphael BJ, Chen F, Wendl MC: Advances for studying clonal evolution in cancer. Cancer Lett. 2013,Google Scholar
 Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. J Multivariate Analysis. 2004, 90: 132153. 10.1016/j.jmva.2004.02.008.View ArticleGoogle Scholar
 Assie G, LaFramboise T, Platzer P, Bertherat J, Stratakis CA, Eng C: SNP arrays in heterogeneous tissue: highly accurate collection of both germline and somatic genetic information from unpaired single tumor samples. Am J Hum Genet. 2008, 82: 903915. 10.1016/j.ajhg.2008.01.012.PubMedPubMed CentralView ArticleGoogle Scholar
 Yau C, Mouradov D, Jorissen RN, Colella S, Mirza G, Steers G, Harris A, Ragoussis J, Sieber O, Holmes CC: A statistical approach for detecting genome aberrations in heterogeneous tumor samples from single nucleotide polymorphism genotyping data. Genome Biol. 2010, 11: R92PubMedPubMed CentralGoogle Scholar
 Attiyeh EF, Diskin SJ, Attiyeh MA, Mosse YP, Hou C, Jackson EM, Kim C, Glessner J, Hakonarson H, Biegel JA, Maris JM: Genomic copy number determination in cancer cells from single nucleotide polymorphism microarrays based on quantitative genotyping corrected for aneuploidy. Genome Res. 2009, 276283. 19
 Greenman CD, Bignell G, Butler A, Edkins S, Hinton J, Beare D, Swamy S, Santarius T, Chen L, Widaa S, Futreal PA, Stratton MR: PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatistics. 2010, 11: 164175. 10.1093/biostatistics/kxp045.PubMedPubMed CentralView ArticleGoogle Scholar
 Parisi F, Ariyan S, Narayan D, Bacchiocchi A, Hoyt K, Cheng E, Xu F, Li P, Halaban R, Kluger Y: Detecting copy number status and uncovering subclonal markers in heterogeneous tumor biopsies. BMC Genomics. 2011, 12: 23010.1186/1471216412230.PubMedPubMed CentralView ArticleGoogle Scholar
 Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK: VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012, 22: 568576. 10.1101/gr.129684.111.PubMedPubMed CentralView ArticleGoogle Scholar
 Yuan Y, Failmezger H, Rueda OM, Ali HR, Graf S, Chin SF, Schwarz RF, Curtis C, Dunning MJ, Bardwell H, Johnson N, Doyle S, Turashvili G, Provenzano E, Aparicio S, Caldas C, Markowetz F: Quantitative image analysis of cellular heterogeneity in breast tumors complements genomic profiling. Sci Transl Med. 2012, 4: 157ra14310.1126/scitranslmed.3004330.PubMedGoogle Scholar
 Van Loo P, Nordgard SH, Lingjorde OC, Russnes HG, Rye IH, Sun W, Weigman VJ, Marynen P, Zetterberg A, Naume B, Perou CM, BorrensenDale AL, Kristensen VN: Allelespecific copy number analysis of tumors. Proc Natl Acad Sci USA. 2010, 107: 1691016915. 10.1073/pnas.1009843107.PubMedPubMed CentralView ArticleGoogle Scholar
 Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, Beroukhim R, Pellman D, Levine DA, Lander ES, Meyerson M, Getz G: Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012, 30: 413421. 10.1038/nbt.2203.PubMedPubMed CentralView ArticleGoogle Scholar
 Landau DA, Carter SL, Stojanov P, McKenna A, Stevenson K, Lawrence MS, Sougnez C, Stewart C, Sivachenko A, Wang L, Wan Y, Zhang W, Shukla SA, Vartanov A, Fernandes SM, Saksena G, Cibulskis K, Tesar B, Gabriel S, Hacohen N, Meyerson M, Lander ES, Neuberg D, Brown JR, Getz G, Wu CJ: Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell. 2013, 152: 714726. 10.1016/j.cell.2013.01.019.PubMedPubMed CentralView ArticleGoogle Scholar
 Su X, Zhang L, Zhang J, MericBernstam F, Weinstein JN: PurityEst: estimating purity of human tumor samples using nextgeneration sequencing data. Bioinformatics. 2012, 28: 22652266. 10.1093/bioinformatics/bts365.PubMedPubMed CentralView ArticleGoogle Scholar
 Shah SP, Roth A, Goya R, Oloumi A, Ha G, Zhao Y, Turashvili G, Ding J, Tse K, Haffari G, Bashashati A, Prentice LM, Khattra J, Burleigh A, Yap D, Bernard V, McPherson A, Shumansky K, Crisan A, Giuliany R, HeraviMoussavi A, Rosner J, Lai D, Birol I, Varhol R, Tam A, Dhalla N, Zeng T, Ma K, Chan SK, et al: The clonal and mutational evolution spectrum of primary triplenegative breast cancers. Nature. 2012, 486: 395399.PubMedGoogle Scholar
 Ding L, Ley TJ, Larson DE, Miller CA, Koboldt DC, Welch JS, Ritchey JK, Young MA, Lamprecht T, McLellan MD, McMichael JF, Wallis JW, Lu C, Shen D, Harris CC, Dooling DJ, Fulton RS, Fulton LL, Chen K, Schmidt H, KalickiVeizer J, Magrini VJ, Cook L, McGrath SD, Vickery TL, Wendl MC, Heath S, Watson MA, Link DC, Tomasson MH, et al: Clonal evolution in relapsed acute myeloid leukaemia revealed by wholegenome sequencing. Nature. 2012, 481: 506510. 10.1038/nature10738.PubMedPubMed CentralView ArticleGoogle Scholar
 Gusnanto A, Wood HM, Pawitan Y, Rabbitts P, Berri S: Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from nextgeneration sequence data. Bioinformatics. 2012, 28: 4047. 10.1093/bioinformatics/btr593.PubMedView ArticleGoogle Scholar
 Shibata D: Cancer. Heterogeneity and tumor history. Science. 2012, 336: 304305. 10.1126/science.1222361.PubMedView ArticleGoogle Scholar
 Greaves M, Maley CC: Clonal evolution in cancer. Nature. 2012, 481: 306313. 10.1038/nature10762.PubMedPubMed CentralView ArticleGoogle Scholar
 Mullighan CG, Phillips LA, Su X, Ma J, Miller CB, Shurtleff SA, Downing JR: Genomic analysis of the clonal origins of relapsed acute lymphoblastic leukemia. Science. 2008, 322: 13771380. 10.1126/science.1164266.PubMedPubMed CentralView ArticleGoogle Scholar
 Tolliver D, Tsourakakis C, Subramanian A, Shackney S, Schwartz R: Robust unmixing of tumor states in array comparative genomic hybridization data. Bioinformatics. 2010, 26: i106114. 10.1093/bioinformatics/btq213.PubMedPubMed CentralView ArticleGoogle Scholar
 Gerlinger M, Rowan AJ, Horswell S, Larkin J, Endesfelder D, Gronroos E, Martinez P, Matthews N, Stewart A, Tarpey P, Varela I, Phillimore B, Begum S, McDonald NQ, Butler A, Jones D, Raine K, Latimer C, Santos CR, Nohadani M, Eklund AC, SpencerDene B, Clark G, Pickering L, Stamp G, Gore M, Szallasi Z, Downward J, Futreal PA, Swanton C: Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med. 2012, 366: 883892. 10.1056/NEJMoa1113205.PubMedView ArticleGoogle Scholar
 Navin N, Krasnitz A, Rodgers L, Cook K, Meth J, Kendall J, Riggs M, Eberling Y, Troge J, Grubor V, Levy D, Lundin P, Maner S, Zetterberg A, Hicks J, Wigler M: Inferring tumor progression from genomic heterogeneity. Genome Res. 2010, 20: 6880. 10.1101/gr.099622.109.PubMedPubMed CentralView ArticleGoogle Scholar
 NikZainal S, Van Loo P, Wedge DC, Alexandrov LB, Greenman CD, Lau KW, Raine K, Jones D, Marshall J, Ramakrishna M, Shlien A, Cooke SL, Hinton J, Menzies A, Stebbings LA, Leroy C, Jia M, Rance R, Mudie LJ, Gamble SJ, Stephens PJ, McLaren S, Tarpey PS, Papaemmanuil E, Davies HR, Varela I, McBride DJ, Bignell GR, Leung K, Butler AP, et al: The life history of 21 breast cancers. Cell. 2012, 149: 9941007. 10.1016/j.cell.2012.04.023.PubMedPubMed CentralView ArticleGoogle Scholar
 Navin N, Kendall J, Troge J, Andrews P, Rodgers L, McIndoo J, Cook K, Stepansky A, Levy D, Esposito D, Muthuswamy L, Krasnitz A, McCombie WR, Hicks J, Wigler M: Tumour evolution inferred by singlecell sequencing. Nature. 2011, 472: 9094. 10.1038/nature09807.PubMedPubMed CentralView ArticleGoogle Scholar
 Xu X, Hou Y, Yin X, Bao L, Tang A, Song L, Li F, Tsang S, Wu K, Wu H, He W, Zeng L, Xing M, Wu R, Jiang H, Liu X, Cao D, Guo G, Hu X, Gui Y, Li Z, Xie W, Sun X, Shi M, Cai Z, Wang B, Zhong M, Li J, Lu Z, Gu N, et al: Singlecell exome sequencing reveals singlenucleotide mutation characteristics of a kidney tumor. Cell. 2012, 148: 886895. 10.1016/j.cell.2012.02.025.PubMedView ArticleGoogle Scholar
 Hou Y, Song L, Zhu P, Zhang B, Tao Y, Xu X, Li F, Wu K, Liang J, Shao D, Wu H, Ye X, Ye C, Wu R, Jian M, Chen Y, Xie W, Zhang R, Chen L, Liu X, Yao X, Zheng H, Yu C, Li Q, Gong Z, Mao M, Yang X, Yang L, Li J, Wang W, et al: Singlecell exome sequencing and monoclonal evolution of a JAK2negative myeloproliferative neoplasm. Cell. 2012, 148: 873885. 10.1016/j.cell.2012.02.028.PubMedView ArticleGoogle Scholar
 Baslan T, Kendall J, Rodgers L, Cox H, Riggs M, Stepansky A, Troge J, Ravi K, Esposito D, Lakshmi B, Wigler M, Navin N, Hicks J: Genomewide copy number analysis of single cells. Nat Protoc. 2012, 7: 10241041. 10.1038/nprot.2012.039.PubMedView ArticleGoogle Scholar
 Meyerson M, Gabriel S, Getz G: Advances in understanding cancer genomes through secondgeneration sequencing. Nat Rev Genet. 2010, 11: 685696. 10.1038/nrg2841.PubMedView ArticleGoogle Scholar
 Ding L, Wendl MC, Koboldt DC, Mardis ER: Analysis of nextgeneration genomic data in cancer: accomplishments and challenges. Hum Mol Genet. 2010, 19: R188196. 10.1093/hmg/ddq391.PubMedPubMed CentralView ArticleGoogle Scholar
 Xi R, Hadjipanayis AG, Luquette LJ, Kim TM, Lee E, Zhang J, Johnson MD, Muzny DM, Wheeler DA, Gibbs RA, Kucherlapati R, Park PJ: Copy number variation detection in wholegenome sequencing data using the Bayesian information criterion. Proc Natl Acad Sci USA. 2011, 108: E11281136. 10.1073/pnas.1110574108.PubMedPubMed CentralView ArticleGoogle Scholar
 Chiang DY, Getz G, Jaffe DB, O'Kelly MJ, Zhao X, Carter SL, Russ C, Nusbaum C, Meyerson M, Lander ES: Highresolution mapping of copynumber alterations with massively parallel sequencing. Nat Methods. 2009, 6: 99103. 10.1038/nmeth.1276.PubMedPubMed CentralView ArticleGoogle Scholar
 Miller CA, Hampton O, Coarfa C, Milosavljevic A: ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads. PLoS ONE. 2011, 6: e1632710.1371/journal.pone.0016327.PubMedPubMed CentralView ArticleGoogle Scholar
 Raphael BJ, Volik S, Collins C, Pevzner PA: Reconstructing tumor genome architectures. Bioinformatics. 2003, 19 (Suppl 2): i162171.View ArticleGoogle Scholar
 Greenman CD, Pleasance ED, Newman S, Yang F, Fu B, NikZainal S, Jones D, Lau KW, Carter N, Edwards PA, Futreal PA, Stratton MR, Campbell PJ: Estimation of rearrangement phylogeny for cancer genomes. Genome Res. 2012, 22: 346361. 10.1101/gr.118414.110.PubMedPubMed CentralView ArticleGoogle Scholar
 Oesper L, Ritz A, Aerni SJ, Drebin R, Raphael BJ: Reconstructing cancer genomes from pairedend sequencing data. BMC Bioinformatics. 2012, 13 (Suppl 6): S1010.1186/1471210513S6S10.PubMedPubMed CentralView ArticleGoogle Scholar
 McPherson A, Wu C, Wyatt AW, Shah S, Collins C, Sahinalp SC: nFuse: discovery of complex genomic rearrangements in cancer using highthroughput sequencing. Genome Res. 2012, 22: 22502261. 10.1101/gr.136572.111.PubMedPubMed CentralView ArticleGoogle Scholar
 Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley TJ, Wilson RK, Ding L, Mardis ER: BreakDancer: an algorithm for highresolution mapping of genomic structural variation. Nat Methods. 2009, 6: 677681. 10.1038/nmeth.1363.PubMedPubMed CentralView ArticleGoogle Scholar
 Bashir A, Volik S, Collins C, Bafna V, Raphael BJ: Evaluation of pairedend sequencing strategies for detection of genome rearrangements in cancer. PLoS Comput Biol. 2008, 4: e100005110.1371/journal.pcbi.1000051.PubMedPubMed CentralView ArticleGoogle Scholar
 Ley TJ, Miller C, Ding L, Raphael BJ, Mungall AJ, Robertson A, Hoadley K, Triche TJ, Laird PW, Baty JD, Fulton LL, Fulton R, Heath SE, KalickiVeizer J, Kandoth C, Klco JM, Koboldt DC, Kanchi KL, Kulkarni S, Lamprecht TL, Larson DE, Lin L, Lu C, McLellan MD, McMichael JF, Payton J, Schmidt H, Spencer DH, Tomasson MH, Wallis JW, et al: Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med. 2013, 368: 20592074.View ArticleGoogle Scholar
 Cavalli LR, Cavalieri LM, Ribeiro LA, Cavalli IJ, Silveira R, Rogatto SR: Cytogenetic evaluation of 20 primary breast carcinomas. Hereditas. 1997, 126: 261268.PubMedView ArticleGoogle Scholar
 Nordgard SH, Johansen FE, Alnaes GI, Bucher E, Syvanen AC, Naume B, BorresenDale AL, Kristensen VN: Genomewide analysis identifies 16q deletion associated with survival, molecular subtypes, mRNA expression, and germline haplotypes in breast cancer patients. Genes Chromosomes Cancer. 2008, 47: 680696. 10.1002/gcc.20569.PubMedView ArticleGoogle Scholar
 Vos CB, ter Haar NT, Rosenberg C, Peterse JL, CletonJansen AM, Cornelisse CJ, van de Vijver MJ: Genetic alterations on chromosome 16 and 17 are important features of ductal carcinoma in situ of the breast and are associated with histologic type. Br J Cancer. 1999, 81: 14101418. 10.1038/sj.bjc.6693372.PubMedPubMed CentralView ArticleGoogle Scholar
 Castells A, Gusella JF, Ramesh V, Rustgi AK: A region of deletion on chromosome 22q13 is common to human breast and colorectal cancers. Cancer Res. 2000, 60: 28362839.PubMedGoogle Scholar
 Driouch K, DorionBonnet F, Briffod M, Champeme MH, Longy M, Lidereau R: Loss of heterozygosity on chromosome arm 16q in breast cancer metastases. Genes Chromosomes Cancer. 1997, 19: 185191. 10.1002/(SICI)10982264(199707)19:3<185::AIDGCC8>3.0.CO;2U.PubMedView ArticleGoogle Scholar
 Chen T, Sahin A, Aldaz CM: Deletion map of chromosome 16q in ductal carcinoma in situ of the breast: refining a putative tumor suppressor gene region. Cancer Res. 1996, 56: 56055609.PubMedGoogle Scholar
 Sindi S, Helman E, Bashir A, Raphael BJ: A geometric approach for classification and comparison of structural variants. Bioinformatics. 2009, 25: i222230. 10.1093/bioinformatics/btp208.PubMedPubMed CentralView ArticleGoogle Scholar
 Anbazhagan R, Fujii H, Gabrielson E: Allelic loss of chromosomal arm 8p in breast cancer progression. Am J Pathol. 1998, 152: 815819.PubMedPubMed CentralGoogle Scholar
 Yaremko ML, Recant WM, Westbrook CA: Loss of heterozygosity from the short arm of chromosome 8 is an early event in breast cancers. Genes Chromosomes Cancer. 1995, 13: 186191. 10.1002/gcc.2870130308.PubMedView ArticleGoogle Scholar
 Adams J, Williams SV, Aveyard JS, Knowles MA: Loss of heterozygosity analysis and DNA copy number measurement on 8p in bladder cancer reveals two mechanisms of allelic loss. Cancer Res. 2005, 65: 6675.PubMedGoogle Scholar
 Dewal N, Hu Y, Freedman ML, Laframboise T, Pe'er I: Calling amplified haplotypes in next generation tumor sequence data. Genome Res. 2012, 22: 362374. 10.1101/gr.122564.111.PubMedPubMed CentralView ArticleGoogle Scholar
 Zakov S, Kinsella M, Bafna V: An algorithmic approach for breakagefusionbridge detection in tumor genomes. Proc Natl Acad Sci USA. 2013, 110: 55465551. 10.1073/pnas.1220977110.PubMedPubMed CentralView ArticleGoogle Scholar
 Korbel JO, Campbell PJ: Criteria for inference of chromothripsis in cancer genomes. Cell. 2013, 152: 12261236. 10.1016/j.cell.2013.02.023.PubMedView ArticleGoogle Scholar
 Raphael BJ, Pevzner PA: Reconstructing tumor amplisomes. Bioinformatics. 2004, 20 (Suppl 1): i265273. 10.1093/bioinformatics/bth931.PubMedView ArticleGoogle Scholar
 Quon G, Morris Q: ISOLATE: a computational strategy for identifying the primary origin of cancers using highthroughput sequencing. Bioinformatics. 2009, 25: 28822889. 10.1093/bioinformatics/btp378.PubMedPubMed CentralView ArticleGoogle Scholar
 Hemmeke R, Koppe M, Lee J, Weismantel R: Nonlinear integer programming. 50 Years of Integer Programming 19582008. Edited by: Junger M, Liebling TM, Naddef D, Nemhauser GL, Pulleyblank WR, Reinelt G, Rinaldi G, Wolsey LA, Berlin and Heidelberg: Springer. 2010, 561618.View ArticleGoogle Scholar
 Lee H, Schatz MC: Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score. Bioinformatics. 2012, 28: 20972105. 10.1093/bioinformatics/bts330.PubMedPubMed CentralView ArticleGoogle Scholar
 Chen J, Chen Z: Extended Bayesian information criteria for model selection with large model spaxes. Biometrika. 2008, 95: 759771. 10.1093/biomet/asn034.View ArticleGoogle Scholar
 Raphael research lab software. [http://compbio.cs.brown.edu/software/]
 Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ: The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004, 32: D493496. 10.1093/nar/gkh103.PubMedPubMed CentralView ArticleGoogle Scholar
 The Cancer Genome Atlas. [http://cancergenome.nih.gov]
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.