 Method
 Open access
 Published:
ConDoR: tumor phylogeny inference with a copynumber constrained mutation loss model
Genome Biology volume 24, Article number: 272 (2023)
Abstract
A tumor contains a diverse collection of somatic mutations that reflect its past evolutionary history and that range in scale from single nucleotide variants (SNVs) to largescale copynumber aberrations (CNAs). However, no current singlecell DNA sequencing (scDNAseq) technology produces accurate measurements of both SNVs and CNAs, complicating the inference of tumor phylogenies. We introduce a new evolutionary model, the constrained kDollo model, that uses SNVs as phylogenetic markers but constrains losses of SNVs according to clusters of cells. We derive an algorithm, ConDoR, that infers phylogenies from targeted scDNAseq data using this model. We demonstrate the advantages of ConDoR on simulated and real scDNAseq data.
Background
Cancer is an evolutionary process in which somatic mutations across all genomic scales―ranging from single nucleotide variants (SNVs) to largescale copy number aberrations (CNAs)―accumulate in a population of cells. This process results in a heterogeneous tumor with subpopulations of cells, called clones, with distinct genomes. Reconstruction of the evolutionary history of cancer clones, known as a tumor phylogeny, from genomic sequencing data of the cells in a tumor is crucial for understanding cancer progression and developing effective therapies for treatment [1,2,3,4].
Early cancer sequencing projects performed bulk sequencing of tumor samples and thus measured somatic mutations from a mixture of thousands or millions of cells. Tumor phylogeny inference from this data is complicated since it requires deconvolution of the data, i.e., simultaneous inference of the tumor clones and their proportions in the mixture [5,6,7,8,9,10,11].
Recent developments in singlecell DNA sequencing (scDNAseq) allow parallel sequencing of thousands of individual cells from a tumor [2, 12,13,14,15], alleviating the need for such deconvolution. However, tumor phylogeny inference from this data remains challenging since current scDNAseq technologies are errorprone and produce data with missing information. As such, phylogeny inference using scDNAseq data involves correcting these errors and imputing the missing data under some evolutionary model [2, 16].
Multiple evolutionary models have been used to construct tumor phylogenies from scDNAseq data. Early works [17,18,19,20] used SNVs as evolutionary markers, and relied on the infinitesites model [21] which states that an SNV can be gained only once and never be subsequently lost in the phylogeny. While the same SNV occurring independently more than once is rare [22], loss of SNVs due to copynumber deletions is common in cancer [23]. To account for these losses, other works [24,25,26,27] use some variant of the kDollo model [28], in which a mutation can be gained at most once but may be lost at most k times during the course of the evolution, where k is a userdefined integer. Several methods [29, 30] employ an even more permissive model, the finitesites model [31], which allowed mutations to be gained and lost multiple times.
A major limitation of the aforementioned models and methods is that they do not utilize any information about CNAs, which can often also be derived from scDNAseq data. This limitation is addressed in methods such as SCARLET [32], BiTSC\(^2\) [33], and COMPASS [34] which incorporate copynumber information during phylogeny inference. SCARLET introduced a novel losssupported Dollo model that requires the copynumber profile of each cell and the copynumber phylogeny as input. BiTSC\(^2\) and COMPASS, on the other hand, construct a joint phylogeny with both SNV and CNA events. However, these methods rely heavily on accurate and simultaneous identification of SNVs and CNAs on the same set of cells, which is challenging with the current scDNAseq technologies [35].
Current scDNAseq technologies fall into one of two classes with different capabilities from measuring CNAs and SNVs. First, whole genome scDNAseq technologies yield data with roughly uniform coverage of the whole genome but with low depth at any particular locus, making it suitable for detection of larger CNAs in singlecells but not SNVs [12, 14, 36]. In contrast, targeted scDNAseq technologies sequence specific regions of the genome, typically comprising of cancerrelated genes, with high depth allowing accurate identification of SNVs but not of CNAs [13, 15, 24, 37]. For example, the Mission Bio Tapestri platform [38, 39] performs highcoverage sequencing (\(\sim 50\times\) coverage) of hundreds of amplicons from thousands of cells. While precise identification of CNAs in each cell using such targeted scDNAseq data is challenging, clustering of cells based on their copynumber profiles is a much simpler task. However, no existing evolutionary model utilizes such clustering information.
Here, we introduce a new evolutionary model, the constrained kDollo model, and an algorithm ConDoR (Constrained Dollo Reconstruction) that computes phylogenetic trees from read count data using this model (Fig. 1). In the constrained kDollo model, an SNV occurs only once on the phylogeny but may be lost up to k times, as long as these losses conform to a given copynumber clustering of cells. The key idea underpinning the constrained kDollo model is that, since loss of SNVs predominantly occurs due to CNAs, we allow loss of an SNV only between cells that have distinct copynumber profiles. Importantly, the constrained kDollo model generalizes both the infinitesites and the kDollo models. Additionally, we model the loss of single nucleotide polymorphisms (SNPs), i.e., germline variants present in normal cells, that can be informative during phylogeny inference, while most existing methods only focus on somatic variants (SNVs) [17, 20, 25, 32].
We show that ConDoR outperforms existing tumor phylogeny inference methods on simulated and real targeted scDNAseq data, including Mission Bio Tapestri data from multiple regions of a pancreatic tumor and wholeexome sequencing of a metastatic colorectal cancer [37]. In both cases, ConDoR yields a more plausible phylogeny compared to existing methods, and provides insights into the spatial evolution and metastatic spread of these tumors.
Results
Constrained kDollo model
We propose a new model, the constrained kDollo model, that integrates information about SNVs, SNPs and CNAs on the same set of cells during phylogeny inference. Our model incorporates CNAs via a clustering of cells, where all cells in the same cluster have the same copynumber profile (the copy number of all loci across the genome). In other words, each cluster corresponds to a copynumber clone. We will refer to this clustering as the copynumber clustering.
Suppose we measure m SNVs and SNPs in n cells from a tumor. In the following, we collectively refer to SNVs and SNPs as mutations. We encode the presence or absence of mutations in the cells by an \(n \times m\) binary mutation matrix A where \(a_{i,j} = 1\) if cell i contains mutation j and \(a_{i,j} = 0\) indicates the mutation j is absent in cell i. A phylogenetic tree T for the tumor is a rooted nodelabeled tree which describes the evolutionary history of the tumor. Each internal node v in the tree T represents an ancestral cell and is labeled by a vector \(a_v\in \{0,1\}^m\) indicating the presence/absence of each mutation \(j\in \{1,\ldots ,m\}\) in that cell. The root r(T) of the tree T represents the normal cell. As such \(a_{r(T), j} = 0\) if mutation j is an SNV, i.e., a somatic mutation, and \(a_{r(T), j} = 1\) if mutation j is an SNP, i.e., a germline mutation. Each leaf of T corresponds to one of the n cells in the tumor. Our goal is to reconstruct a phylogenetic tree T for a given mutation matrix A under a given evolutionary model.
An edge (v, w) of a phylogeny T induces the gain of a mutation j if \(a_{v,j} = 0\) and \(a_{w,j} = 1\). On the other hand, a mutation j is said to be lost on edge (v, w) if \(a_{v,j} = 1\) and \(a_{w,j} = 0\). The simplest evolutionary model used in cancer genomics is the infinitesites model which has two constraints [21]. Firstly, a mutation is allowed to be gained at most once in the phylogeny. This constraint stems from the infinitesites assumption which posits that it is very unlikely for the same position in the genome to get mutated multiple times independently. Secondly, once a mutation is gained it cannot be subsequently lost. A phylogeny that satisfies these constraints is known as a perfect phylogeny [40].
While parallel mutations (i.e., the same mutation gained multiple times in the phylogeny) and back mutations (i.e., a mutation reverted back to reference state) are rare in cancer [22], SNVs and SNPs are frequently lost due to copynumber aberrations. As such, more recent phylogeny inference methods [25, 32] apply some variant of the Dollo model [28] for phylogeny inference, which allows loss of SNVs/SNPs. Specifically, under the Dollo model, a mutation is allowed to be gained at most once but can be lost multiple times in the phylogeny. The parameterized version of this model is the kDollo model, in which a mutation can only be lost at most k times in the phylogeny. However, a major limitation of Dollo models is that, although they allow loss of SNVs and SNPs, possibly due to CNAs, they do not incorporate any information about the copynumber states of the cells.
We introduce the constrained kDollo model that supplements the kDollo model with two additional constraints using the copynumber clustering of the cells. First, since reversal of mutations (back mutations) in cancer are rare [22, 23], we assume that SNVs and SNPs can only be lost due to overlapping CNAs. As such, we only allow such losses between cells that belong to distinct copynumber clusters. Second, we assume that each copynumber profile describing the copy number states over the entire genome arises only once in the phylogeny. As such, cells belonging to the same cluster form a connected subtree in the phylogeny. Let p be the number of copynumber clones and \(\sigma\) be the copynumber clustering of the n cells. We formally define the constrained kDollo phylogeny for a mutation matrix A and copynumber clustering \(\sigma\) as follows.
Definition 1
(constrained kDollo phylogeny) A constrained kDollo phylogeny T has the following properties.

1
Each node v in T is labeled by \(a_v\in \{0,1\}^m\) and a copy number clone \(\sigma (v)\).

2
The root r(T) is labeled such that \(a_{r(T),j} = 0\) if mutation j is an SNV and \(a_{r(T), j} = 1\) if j is an SNP.

3
For each mutation j, there is at most one edge (v, w) in T such that \(a_{v,j} = 0\) and \(a_{w,j} = 1\).

4
For each mutation j, there are at most k edges (v, w) in T such that \(a_{v,j} = 1\) and \(a_{w,j} = 0\).

5
For edge (v, w) in T such that \(a_{v,j} = 1\) and \(a_{w,j} = 0\) for some mutation j, we have \(\sigma (v) \ne \sigma (w)\).

6
For any copy number clone \(\ell\), the set of nodes labeled \(\sigma (v) = \ell\) form a connected subtree of T.
We say that a \(n\times m\) binary matrix A is a constrained kDollo phylogeny matrix for copynumber clustering \(\sigma\) if and only if there exists a constrained kDollo phylogeny T for A and \(\sigma\), i.e., T has n leaves and for each leaf there is a unique index \(i\in \{1,\ldots , n\}\) such that the leaf is labeled by the row \(a_i\) of A and \(\sigma (i)\).
The constrained kDollo model generalizes the infinite sites model [21] and the kDollo model [28]. Specifically, when the number p of clusters is 1, the constrained kDollo model is equivalent to the infinite sites model. On the opposite extreme, when the number p of clusters is equal to the number n of cells, i.e., each cell is in a distinct cluster, the constrained kDollo model is equivalent to the kDollo model.
Constrained k–Dollo phylogeny problem for read count data
During a scDNAseq experiment, we do not observe the mutation matrix A directly. Instead, we observe read counts for each mutation in each cell. Specifically, we obtain the variant read count matrix \(Q \in \mathbb {Z}^{n\times m}\), where \(q_{i,j}\) is the number of reads with the variant allele for mutation j in cell i, and the total read count matrix \(R \in \mathbb {Z}^{n\times m}\), where \(r_{i,j}\) is the total number of reads for mutation j in cell i. Considering that the cells and mutations in each cell are sequenced independently, and given the tree, the cells evolve independently, the likelihood of observing the variant read count matrix Q for given total read count matrix R and mutation matrix A can be written as follows.
We model the observed variant read counts \(q_{i,j}\) using a betabinomial, similar to previous work [32, 41, 42]. The “Methods” section provides the details about the read count model.
For given read count matrices Q and R, copynumber clustering \(\sigma\) of the cells and integer k, our goal is to construct a constrained kDollo phylogeny that maximizes the likelihood described in Eq. 1. We refer to this as the Constrained kDollo phylogeny problem for read count data and pose it as follows.
Problem 1
(Constrained kDollo phylogeny problem for read count data (CkDPRC)) Given a variant read count matrix Q, total read count matrix R, copynumber clustering \(\sigma\) and integer k, find mutation matrix A and phylogeny T such that (i) likelihood \(\Pr (Q\mid R, A)\) is maximized and (ii) T is a constrained kDollo phylogeny for A and \(\sigma\).
In the “Methods” section, we describe a combinatorial characterization of constrained kDollo phylogenies that we incorporate in an efficient mixed linear integer program (MILP) to solve the CkDPRC problem. Our resulting method, ConDoR, is implemented in Python 3 using Gurobi [43] (version 9.0.3) to solve the MILP. ConDoR is available at https://github.com/raphaelgroup/ConDoR.
Evaluation on simulated data
We compare ConDoR to SCARLET [32], SPhyR [25], SiFit [29] and SCITE [17] on simulated data. We generated simulated data with \(n\in \{25, 50, 100\}\) cells, \(m\in \{25, 50, 100\}\) mutations, \(p\in \{3,5\}\) copynumber clusters, and maximum number of losses \(k\in \{1,2,3\}\). We used a growing random network [44] to generate a tree T with \(m + p\) edges and assign mutations, copy number states, and cluster assignments to each tree, as described in the “Methods” section. Next, we assign n cells uniformly at random to one of the nodes in the tree. We simulate the sequencing data for each mutation in each cell using a betabinomial read count model (details in the “Methods” section). We simulate 5 instances for each combination of the varying simulation parameters. The precise input parameters used for each method are described in Additional file 1: Section C.
We compare the mutation matrix \(\hat{A} = [\hat{a}_{i,j}]\) and tumor phylogeny \(\hat{T}\) inferred by each method to the ground truth as follows. Following previous studies [25, 32], we evaluate the inferred mutation matrix \(\hat{A}\) against the groundtruth mutation matrix A by computing the normalized mutation matrix error \(\epsilon (A, \hat{A})\) between A and \(\hat{A}\) given by,
We evaluate the accuracy of the inferred tumor phylogeny \(\hat{T}\) compared to the groundtruth tumor phylogeny T by computing the pairwise ancestral relation accuracy \(E(T, \hat{T})\) [25, 32]. Specifically, under the assumption that a mutation can be gained only once in the phylogeny, which is employed by all methods that we compare against in this study except SiFit, we compute the accuracy of inferring the correct relationship between all possible pairs of mutations from the inferred tumor phylogeny (details in the “Methods” section). When the pairwise ancestral relation accuracy \(E(T, \hat{T})\) is 1, the inferred tumor phylogeny \(\hat{T}\) and the groundtruth tumor phylogeny T are identical when restricted to the edges on which the mutations are gained. We exclude SiFit when computing this metric because it uses the finitesites model, which allows mutations to occur multiple times in the phylogeny as a consequence of which pairs of mutations may not have a unique relationship.
ConDoR outperforms all the other methods in terms of both the normalized mutation matrix error (Fig. 2a) and the ancestral relationship accuracy (Fig. 2b) across all simulation parameters. For instance, on the largest simulated instances with \(n = 100\) cells and \(m = 100\) mutations, ConDoR achieves the lowest normalized mutation matrix error (median \(\epsilon (A,\hat{A}) = 0.002\)) and the highest pairwise ancestral relation accuracy (median \(E(T,\hat{T}) = 0.986\)) compared to SCARLET (\(\epsilon (A,\hat{A}) = 0.008\), \(E(T,\hat{T}) = 0.969\)), SCITE (\(\epsilon (A,\hat{A}) = 0.01\), \(E(T,\hat{T}) = 0.975\)), SiFit (\(\epsilon (A,\hat{A}) = 0.05\)), and SPhyR ( \(\epsilon (A,\hat{A}) = 0.02\), \(E(T,\hat{T}) = 0.949\)). The superior performance of ConDoR comes with running times comparable to existing methods, although ConDoR does have a higher runtime on some of the large simulated instances with \(n=100\) cells and \(m=100\) mutations (Additional file 1: Fig. S5).
Interestingly, ConDoR outperforms SCARLET even though SCARLET is given substantially more information about copy number aberrations including both the precise copynumber profile of each cell and the true copynumber tree as input. We believe that this advantage is due to ConDoR solving the underlying optimization problem exactly while SCARLET employs various heuristics that are not guaranteed to yield an optimal solution. Using additional simulations, we also show that ConDoR is robust to noise in the copy number clustering and the read count model parameters (Additional file 1: Section D).
Multiregion pancreatic ductal adenocarcinoma data
We used ConDoR to analyze targeted singlecell DNA sequencing (scDNAseq) data from two regions of a pancreatic ductal adenocarcinoma (PDAC) tumor. Specifically, we sequenced two samples (S1 and S2) from distinct regions of the resected tumor using both conventional bulk whole exome sequencing and Mission Bio Tapestri singlecell sequencing (details in the “Methods” section). The scDNAseq workflow was conducted using a targeted panel consisting of 596 amplicons (median length is 209 bps, Additional file 1: Fig. S6a) interrogating frequently mutated genes in PDAC. We obtained sequencing data from 2153 cells (1167 cells from the first sample and 986 cells from the second sample) with a median coverage of \(67\times\) per amplicon per cell.
We identified 7 mutations of interest―including somatic SNVs in BRCA2, TGFBR2, FGFR1 and germline SNPs in SPTA1, MGMT. These mutations were identified using matched bulk tumor and normal sequencing data and were present in the singlecell data with high confidence (details in “Methods” section). Due to the short length of amplicons and uneven distribution in coverage (Additional file 1: Fig. S6b), accurate copynumber calling using this data is challenging. Instead we clustered cells according copy number profiles derived from normalized read counts using kmeans clustering [45] for number of clusters \(p\in \{2,\ldots ,8\}\). We select the best value for p using the Silhouette score [46] (see “Methods” section for details). This analysis reveals 3 copynumber clusters (Fig. 3a), which we label C0, C1, and C2, that contain 275, 1145, and 733 cells, respectively.
ConDoR produces a more plausible phylogeny of the PDAC tumor compared to existing methods and provides insights into the evolution of tumor. While most PDAC cases are driven by canonical gainoffunction KRAS mutations [47], ConDoR reveals that the tumor analyzed here is driven by a truncal BRCA2 stopgained mutation (p.Y600*), which likely inactivated the BRCA2 protein, a tumor suppressor essential for homologous recombination repair [48]. The ConDoR phylogeny shows branched evolution of the tumor, with the trunk leading to two branches (Fig. 3b), the first characterized by two missense TGFBR2 mutations, TGFBR2_1 (p.A426G) and TGFBR_2 (p.M425I), which likely inactivated cellintrinsic TGF\(\beta\) signaling, and the second characterized by a missense mutation to FGFR1 (p.T50K). Although FGFR1 is involved in MAPKERK signaling [47], the particular point mutation’s significance is yet to be characterized. ConDoR infers loss of two germline SNPs in MGMT, MGMT_1 and MGMT_2 (both contained in gene MGMT and amplicon AMPL257637), on the edge in the second branch distinguishing cells of cluster C2 from cells of cluster C1. This suggests a loss of heterozygosity (LOH) in cluster C2 cells, which is supported by lower normalized total read count of their amplicon (AMPL257637) in the cells from cluster C2 compared to the cells from cluster C1 (Fig. 3c, \(p < 5.8\times 10^{33}\) with a onesided KolmogorovSmirnov test). Lastly, the root of the ConDoR phylogeny is labeled by cluster C1, indicating that it contains the normal cells present in the data.
We compared the ConDoR phylogeny to the phylogenies produced by two other methods on this data: COMPASS [34], which infers a comprehensive phylogeny with both SNV and CNA events, and SPhyR [25], which uses the kDollo model. We could not run SCARLET on this data because it was difficult to obtain reliable integer copy numbers and copy number trees from this targeted sequencing data. While COMPASS takes the read count matrices as input, SPhyR takes an observed mutation matrix (Fig. 3d) obtained by discretizing read counts (details in the “Methods” section). COMPASS hypothesizes 8 loss of heterozygosity events (Additional file 1: Fig. S7a) covering all genes in the study except SPTA1 (BRCA2, TGFBR2, FGFR1, MGMT). SPhyR (with \(k = 1\)) produces a phylogeny that contains loss of all the mutations except FGFR1 (Additional file 1: Fig. S7b), which is highly unlikely. This demonstrates that using permissive models, like the ones used in COMPASS and SPhyR, may lead to overfitting of the data resulting in overestimation of mutations with loss. ConDoR’s constrained kDollo model avoids overfitting by incorporating the copynumber clustering to constrain where loss of mutations can occur in the phylogeny.
The phylogeny constructed by ConDoR also reveals a spatial clonal architecture of the PDAC tumor that agrees with previous histological analysis of the tumor. Specifically, the ConDoR phylogeny shows an enrichment of cells from sample S2 (743 cells from S2 vs. 28 cells from S1) in the second branch of the phylogeny, characterized by the edge with the mutation in FGFR1 (Fig. 3b). Such a spatial separation of the two clonal lineages conforms to histopathological results of this tumor (Fig. 5 in [15]) that showed two populations of tumor cells with distinct morphologies that were well demarcated. Spatial structure in the clonal heterogeneity of tumors has also been observed in previous cancer studies and has several clinical implications such as resistance to therapy and recurrences [49,50,51,52]. In summary, ConDoR leverages copynumber clustering obtained from targeted scDNAseq data to build a more plausible tumor phylogeny compared to existing methods and reveals the spatial structure of the intratumor heterogeneity.
Metastatic colorectal cancer data
We also analyzed a published targeted scDNAseq dataset from a metastatic colorectal cancer patient CRC2 [37]. This dataset consists of 36 SNVs that were identified from a 1000 gene panel in 186 cells: 145 from the primary tumor and 41 from a liver metastasis. The original study built a phylogeny of the 186 cells using SCITE [17] and reported two distinct branches of metastatic cells on this phylogeny. This phylogeny suggests a polyclonal origin of the metastasis, i.e., the metastatic tumor was seeded by two distinct clones that migrated from the primary tumor (Additional file 1: Fig. S8a). To evaluate the accuracy of the SCITE tree, the authors identified two bridge mutations, in the genes FHIT and ATP7B, that were present in the cells of the second metastatic branch (detected in 10/13 and 13/13 cells, respectively) but absent in the cells of the first metastatic branch (detected in 1/15 and 1/15 cells, respectively).
Two subsequent analyses of this data―using the PhISCS [20] and SCARLET [32] algorithms―yield a simpler explanation for the data; namely that the liver metastasis resulted from monoclonal seeding; i.e., the metastatic tumor resulted from a single migration from the primary tumor. However, neither of these studies adequately explain the absence of the bridge mutations in cells of the second metastatic branch in the SCITE tree. PhISCS removed the bridge mutations from analysis in order to obtain a perfect phylogeny that supports monoclonal seeding. SCARLET, using a losssupported Dollo model, found evidence for the loss of the FHIT mutation due to a deletion in some cells (Additional file 1: Fig. S8b) but concluded that the absence of the ATP7B mutation in all the cells from the second metastatic branch in the SCITE tree was due to simultaneous false negatives in all these cells, a highly unlikely scenario.
ConDoR produces a phylogeny that both supports monoclonal seeding of the metastasis and provides a more plausible explanation for the absence of the bridge mutations in some of the metastatic cells compared to previous analyses. The ConDoR phylogeny was produced using the copynumber clustering from [32], which included 4 clusters: 120 diploid cells (D), 33 aneuploid profile of primary tumor cells (P), and two distinct aneuploid profiles of metastatic tumor cells (M1 and M2 with 23 and 10 cells, respectively). The ConDoR tree contains a single branch containing all the metastatic cells, supporting the simpler hypothesis of monoclonal seeding of the liver metastasis, in agreement with the PhISCS and SCARLET analyses (Fig. 4a). Moreover, ConDoR infers the loss of both the bridge mutations, FHIT and ATP7B, leading to a phylogeny with a higher likelihood compared to SCARLET (loglikelihood − 8324.8 for ConDoR and − 8437.4 for SCARLET). This demonstrates that the low resolution of the copynumber aberrations derived from targeted scDNAseq data used by SCARLET may lead to misleading results, and ConDoR avoids these errors by only using the copynumber clusters.
We also find that the ConDoR tree is consistent with copynumber profiles obtained from wholegenome sequencing of 42 additional cells from the same patient. These cells were not used in the phylogenetic analyses. In addition to the bridge mutations, ConDoR infers the loss of SNVs in LRP1B, LINGO2_1, and NR4A3. These three SNVs lie within regions with lower copy numbers in the WGS copynumber profiles from the original study (Fig. 4b). The copynumber profiles from WGS data also reveal that all metastatic cells share copynumber deletions in chromosomes 2, 3p, 4, 7, 9, 16, and 22 relative to the cells in the primary tumor. These shared copy number profiles further corroborates the ConDoR tree (and the PhISCS and SCARLET trees) in which all metastatic cells are in a single clade. In contrast, SCITE tree from the original study suggests that these CNAs occurred independently in the two distinct branches of the phylogeny with metastatic cells which is a less likely explanation. In summary, ConDoR integrates SNVs and copynumber clustering to build a tumor phylogeny that contains loss of SNVs that are supported by orthogonal copynumber data and supports a simpler monoclonal origin of the metastasis compared to the original study.
Conclusions
We introduced a new evolutionary model, the constrained kDollo model, a model for twostate phylogenetic characters, in which a character can be gained at most once and lost at most k times but where the losses are constrained according to a given clustering of the taxa. This model was inspired by the challenge of inferring a phylogenetic tree from targeted singlecell DNA sequencing data, where SNVs and SNPs are measured with high fidelity, but CNAs are poorly described. Specifically, our model relies on a clustering of cells based on their copynumber profiles as input, without requiring identification of precise CNAs in each cell. The constrained kDollo model generalizes both the infinite sites model and the kDollo model.
We developed an algorithm, ConDoR (Constrained Dollo Reconstruction), that infers the most parsimonious constrained kDollo phylogeny using a probabilistic model for the read counts in scDNAseq data. On simulated data, ConDoR outperforms stateoftheart tumor phylogeny inference methods. On a multiregion targeted scDNAseq data of pancreatic ductal adenocarcinoma tumor, ConDoR produced a more plausible phylogeny compared to existing methods, providing insights into the evolution and spatial clonal architecture of the tumor. On targeted scDNAseq data of metastatic colorectal cancer patient, ConDoR found a phylogeny that supports a simpler monoclonal origin of liver metastasis compared to polyclonal seeding proposed by the original study [37].
There are several limitations and directions for future research. First, ConDoR currently takes the read count matrices with the mutations and the copynumber clustering as input to build a constrained kDollo phylogeny. A future direction is to extend ConDoR to simultaneously perform variant calling, and inference of the copynumber clustering and the phylogeny, potentially improving the accuracy of the results. Second, ConDoR and several existing methods [17, 20, 25, 29, 32] disregard the location of SNVs during phylogeny inference. However, since CNAs alter the copynumber of contiguous segments of the genome, the SNV locations can be used to model the likelihood of simultaneous loss of multiple adjacent SNVs. Lastly, while ConDoR only uses scDNAseq data as input, the underlying constrained kDollo model is a general model for evolution of SNVs. We propose that this model can be used for phylogeny inference while integrating information from multiple sequencing technologies, possibly measuring different modalities of the cancer cells [19, 53].
Methods
Characterization of constrained kDollo phylogenies
We derive a characterization of constrained kDollo phylogeny matrices by building on previous work on characterization of kDollo phylogeny matrices [25, 26]. Recall that in the kDollo model, a 0 entry in the mutation matrix A indicates that either the mutation did not occur in the cell or that the mutation occurred but then was subsequently lost in the cell. If we could distinguish these two cases, then we could replace the 0 entries resulting from losses by additional character states \(\{2,\ldots ,k+1\}\) representing the k possible losses of a mutation in the kDollo phylogeny. This idea forms the basis of the following (extended) definition of kcompletion of a mutation matrix A.
Definition 2
(ElKebir 2018 [25]) A matrix \(B\in \{0,\ldots , k+1\}^{n\times m}\) is a k completion of a mutation matrix \(A\in \{0,1\}^{n\times m}\) provided: (1) \(b_{i,j} = 1\) if and only if \(a_{i,j} = 1\); (2) \(b_{i,j}\in \{0,\ldots , k+1\}\setminus \{1\}\) if and only if \(a_{i,j} = 0\); (3) \(b_{i,j} \ge 1\) if j is an SNP.
The following definition defines a subset of all possible kcompletion matrices of a mutation matrix A.
Definition 3
(ElKebir 2018 [25]) A matrix \(B\in \{0,\ldots ,k+1\}^{n\times m}\) is a k Dollo completion of mutation matrix A provided it is a kcompletion of mutation matrix A such that there exists no two columns and three rows in B of the following forms:
where \(i_1,i'_1,j_1,j'_1\in I^{(1)}, i_2,j_2\in I^{(2)}, i^{\prime \prime }_1\in I^{(1)}\setminus \{i_2\}\) and \(j^{\prime \prime }_1\in I^{(1)}\setminus \{j_2\}\), and \(I^{(i)} = \{i,\ldots , k+1\}\).
According to this definition, the number of \(3\times 2\) submatrices that are forbidden to exist in kDollo completion matrices is \((k+1)^4 + 2k^2(k+1)^2 + k^4\). Ciccolella et al.[26] provided an alternate characterization of kDollo completion matrices, which we describe in Additional file 1: Section B.
kDollo completion matrices are useful in characterization of kDollo phylogeny matrices due to the following theorem.
Theorem 1
(ElKebir 2018 [25]) \(A\in \{0,1\}^{n\times m}\) is a kDollo phylogeny matrix if and only if there exists a kDollo completion \(B\in \{0,\ldots ,k+1\}^{n\times m}\) of A.
Constrained kDollo phylogenies are a subset of kDollo phylogenies that satisfy some additional constraints. In particular, a constrained kDollo completion must be consistent with copynumber clustering \(\sigma\), according to the following definition.
Definition 4
(Consistency) A kDollo completion \(B\in \{0,\ldots ,k+1\}^{n\times m}\) of a mutation matrix A is consistent with a copynumber clustering \(\sigma\) with p clusters provided the following conditions are true for every mutation j.

1
There is at most one cluster \(\ell\) such that for two distinct cells \(i, i' \in \sigma ^{1}(\ell )\), \(b_{i,j} = 0\) and \(b_{i',j} = 1\).

2
If there exists cell i such that \(\sigma (i) = \ell\) and \(b_{i,j} = s\) for \(s\in \{2,\ldots , k+1\}\), then \(b_{i',j} = s\) for all \(i'\in \sigma ^{1}(\ell )\).
Using this definition, we have the following characterization of constrained kDollo phylogeny matrices.
Theorem 2
A mutation matrix A is a constrained kDollo phylogeny matrix for copynumber clustering \(\sigma\) if and only if there exists a kDollo completion \(B\in \{0,\ldots , k+1\}^{n\times m}\) of A that is consistent with \(\sigma\).
We provide a proof of Theorem 2 in Additional file 1: Section A and show that given a kDollo completion B of mutation matrix A that is consistent with \(\sigma\), we can find a constrained kDollo phylogeny for A and \(\sigma\) in O(nmk) time. In addition, we also show the following result on the complexity of the CkDPRC problem (Problem 1).
Theorem 3
The CkDPRC problem is NPhard, even for \(k = 0\).
A proof of Theorem 3 is provided in Additional file 1: Section A.
ConDoR algorithm for constrained kDollo model
We formulate a mixed integer linear program (MILP) to solve Problem 1 exactly. Specifically, for given read count matrices Q and R, copynumber clustering \(\sigma\) and integer k, the MILP finds a kDollo completion B that is consistent with \(\sigma\) and that maximizes the likelihood \(\Pr (Q\mid R, A)\), where A is the mutation matrix corresponding to B.
The MILP is based on encoding the combinatorial characterization of constrained kDollo completion matrices described in the previous section. We introduce a binary variables \(a_{i,j}\) for cell i and mutation j to represent the mutation matrix A. Further, we introduce binary variables \(c_{\ell ,j,s}\) for cluster \(\ell\), mutation j and state \(s\in \{2,\ldots ,k+1\}\) to represent the presence of loss state s of mutation j in cluster \(\ell\). These binary variables are used to model the entries of the kcompletion matrix B as follows: \(b_{i,j} = 1\) if \(a_{i,j} = 1\); \(b_{i,j} = s\) if \(c_{\ell ,j,s} = 1\) and \(\sigma (i) = \ell\) for \(s\in \{2,\ldots ,k+1\}\); and \(b_{i,j} = 0\) otherwise.
Since \(b_{i,j}\) can only attain one value, we enforce the following constraints for all cells \(i\in \{1,\ldots , n\}\) and all mutations \(j\in \{1,\ldots , m\}\).
We also define variables \(x_{i,j}\) for cell i and mutation j which indicates if \(b_{i,j} \ge 1\). As such, for all cells \(i\in \{1,\ldots , n\}\) and all mutations \(j\in \{1,\ldots , m\}\) we enforce
Once we have modeled the kcompletion matrix B, we need to enforce constraints for (i) consistency with the copynumber clustering \(\sigma\), (ii) handling germline mutations and (iii) B to be a kDollo completion matrix. We describe the constraints for (i), (ii) and the objective function of the MILP in the following and refer to Additional file 1: Section B for (iii).
Handling germline mutations
Here, we describe the constraints to handle germline mutations. Note that, if mutation j is germline, it must either be present in cell i, i.e., \(a_{i,j} = 1\), or it must have been lost, i.e., \(c_{\sigma (i),j,s} = 1\) for some \(s\in \{2,\ldots ,k+1\}\). As such, if \(G\subseteq \{1,\ldots , m\}\) is the set of germline mutations, we enforce the following constraints for all cells \(i\in \{1,\ldots , n\}\) and germline mutations \(j\in G\),
Consistency constraints
We now describe the constraints to enforce consistency between the kcompletion matrix B and the copynumber clustering \(\sigma\). Note that Condition 2 of Definition 4 is satisfied by the way B is modeled, and we only need to introduce constraints to satisfy Condition 1 of Definition 4. To that end, we introduce two set of continuous auxiliary variables. First, we introduce \(g^{(0)}_{\ell ,j}\in [0,1]\) and enforce constraints so that \(g^{(0)}_{\ell ,j} = 1\) if there exists at least one cell \(i\in \sigma ^{1}(\ell )\) such that \(b_{i,j} = 0\) for cluster \(\ell\) and mutation j, and \(g^{(0)}_{\ell ,j} = 0\) otherwise. Similarly, we introduce \(g^{(1)}_{\ell ,j}\in [0,1]\) and enforce constraints so that \(g^{(1)}_{\ell ,j} = 1\) if there exists at least one cell \(i\in \sigma ^{1}(\ell )\) such that \(b_{i,j} = 1\) for cluster \(\ell\) and mutation j, and \(g^{(1)}_{\ell ,j} = 0\) otherwise. We model these variables using the following constraints for all mutations \(j\in \{1,\ldots , m\}\) and clusters \(\ell \in \{1,\ldots , p\}\),
Next, we introduce continuous variables \(g_{\ell , j}\in [0,1]\) such that \(g_{\ell ,j} = 1\) if and only if mutation j is gained in cluster \(\ell\) and \(g_{\ell ,j} = 0\) otherwise, for cluster \(\ell\) and mutation j. Specifically, \(g_{\ell ,j} = 1\) if there exists two distinct cells \(i, i'\in \sigma ^{1}(\ell )\) such that \(b_{i,j} = 0\) and \(b_{i',j} = 1\). We use \(g^{(0)}_{\ell ,j}\) and \(g^{(1)}_{\ell ,j}\) to model \(g_{\ell ,j}\) for all mutations \(j\in \{1,\ldots , m\}\) and clusters \(\ell \in \{1,\ldots , p\}\) with the constraints,
Finally to enforce that each mutation can be gained in at most one cluster, we have the following constraint for all mutations \(j\in \{1,\ldots , m\}\),
Objective function
Recall that we want to maximize the likelihood function \(P(Q\mid R, A)\) (Eq. 1), where A is the mutation matrix and B is its kDollo completion consistent with copynumber clustering \(\sigma\). After taking \(\log\) on both sides in Eq. 1, we can linearize the loglikelihood to get the following objective function.
This MILP has \(O(nm + pmk)\) binary variables, \(O(m^2k^2 + pm)\) continuous variables, and \(O(nm^2k^2)\) constraints.
Simulation details
In this section, we provide details about the simulations and the input files generated for each method.
Simulation of the phylogeny
We used a growing random network [44] to generate a tree T with \(m + p\) edges. Specifically, starting from the root vertex, T is built by iteratively adding child nodes while choosing the parent uniformly at random from the nodes in the tree in that iteration. The root node r(T) represents the normal cell and is assigned to cluster \(\ell = 0\). The edges are then labeled by either the gain of a mutation \(j\in \{1,\ldots , m\}\) or change to cluster \(\ell \in \{1,\ldots , p\}\). For each edge \((v,w)\in E(T)\) labeled with a change in cluster, we allow loss of the mutations gained along the path from the root r(T) to node v with probability \(\lambda = 0.8\). We generate a copynumber profile for each node in the tree, as described below.
Simulation of copynumber states
The SCARLET [32] algorithm requires the copynumber profile of each cell, as well as the copynumber tree as input. We simulate the copynumber tree as follows.
Each node of the tree is labeled by a copynumber between 0 and 8 for each mutation j. We first initialize the root of the tree with a copynumber profile in which the copynumber for each position is picked uniformly at random between 0 and 8. We then label the remaining nodes as we traverse the tree in a breadthfirst order. If the edge \((\pi (w), w)\) does not contain loss of mutation j, the copynumber for node w is the same as the copynumber of \(\pi (w)\). On the other hand, if the edge \((\pi (w), w)\) induces the loss of mutation j, the copynumber at j for node \(\pi (w)\) is picked uniformly at random between 1 and 8, while the copynumber of node w is picked uniformly at random between 0 and \(\pi (w)1\). This ensures that, (i) the copynumber profile only changes if there is a loss event on the edge and (ii) each loss of mutation is supported by decrement of copynumber. Let \(C\in \{0,\ldots ,8\}^{n\times m}\) be the copynumber matrix such that \(c_{i,j}\) is the copynumber at locus j in cell i. This copynumber matrix is used during simulation of the variant read counts which we describe in the next subsection.
Read count model
The total read count \(r_{i,j}\) for each cell i and mutation j is modeled as Poisson variable with mean coverage \(\text {cov} = 50\), i.e.
We use betabinomial model, similar to previous works [32, 41, 42] for the variant read count \(q_{i,j}\) for each cell i and mutation j. Our model accounts for sequencing errors and allelic imbalance during sequencing as follows. For sequencing error, we set error rate \(\epsilon = 0.001\) which is similar to the error rates of most recent Illumina sequencing platforms [54]. Specifically, we assume that the false positive rate \(\alpha _{fp}\) and the false negative rate \(\alpha _{fn}\) of observing a read with the variant allele is \(\epsilon\). When the mutation j is not present in cell i, i.e., \(a_{i,j} = 0\), the number of copies of the variant allele is 0. When the mutation j is present in cell i, i.e., \(a_{i,j} = 1\), we assume that the number of copies of the variant allele is 1. As such, the value of \(a_{i,j}\) indicates the number of variant allele. Given that the total copies of the locus for mutation j in cell i is \(c_{i,j}\), the true variant allele frequency, which we denote by \(y_{i,j}\), is given by \(y_{i,j} = a_{i,j} / c_{i,j}\). Due to sequencing errors \(\alpha _{fp}\) and \(\epsilon _{fn}\), the probability \(p_{i,j}\) of producing a read containing the variant allele for mutation j in cell i is
The number of variant reads \(q_{i,j}\) is given by
where, we set the dispersion parameter \(s = 15\) in our simulations to simulate allelic imbalance. Finally, we spikein missing entries in the variant read count matrix Q and total read count matrix R by setting \(q_{i,j} = 0\) and \(r_{i,j} = 0\) in \(\lfloor d n m \rfloor\) entries where d is the rate of missing entries in the data.
While ConDoR and SCARLET take the variant and total read counts as input, several methods (such as SPhyR, SCITE and SiFit) require an observed mutation matrix \(A'\) as input. In the following section, we show how we obtained the observed mutation matrix from the simulated read counts.
Obtaining the observed mutation matrix from the read counts
Methods such as SPhyR, SCITE, and SiFit take an observed mutation matrix \({A}'\in \{0,1,1\}^{n\times m}\) as input. This observed mutation matrix \({A}'\) may contain missing entries (represented by \(1\)) and errors (false positives and false negatives). The aforementioned methods estimate the true binary mutation matrix A and build a tumor phylogeny while correcting the errors and imputing the missing entries in the observed mutation matrix \(A'\). We denote the estimated mutation matrix by \(\hat{A}\).
We obtain \({A}'\) from the read count matrices Q and R as follows. We use three filtering parameters to discretize the read count matrices: (i) total read count threshold \(r_t = 10\), (ii) variant read count threshold \(q_t = 5\), and (iii) variant allele frequency threshold \(y_t = 0.1\). We say that mutation j is present in cell i if and only if the total read count \(r_{i,j}\) is greater than or equal to \(r_t\), the variant read count \(q_{i,j}\) is greater than or equal to \(q_t\), and the observed variant allele frequency \({y}'_{i,j} = q_{i,j} / r_{i,j}\) is greater than or equal to \(y_t\). Specifically, we set \({a}'_{i,j} = 1\) if \((r_{i,j} \ge r_t) \wedge (q_{i,j} \ge q_t) \wedge ({y}'_{i,j} \ge y_t)\). For the remaining entries, we set \({a}'_{i,j} = 0\) if \(r_{i,j} \ge 0\), indicating absence of mutation, and \({a}'_{i,j} = 1\), indicating missing entry, otherwise. The median false positive and false negative rates of the observed mutation matrices of the simulated instances are 0.0018 and 0.158, respectively (Additional file 1: Fig. S1). It is possible that using additional procedures to define a higher quality set of mutations could improve the performance of methods such as SPhyR, SCITE and SiFit.
Computation of pairwise ancestral relation accuracy
Here, we describe the computation of pairwise ancestral relation accuracy \(E(T, \hat{T})\) for two tumor phylogenies T and \(\hat{T}\). Under the assumption that a mutation can be gained only once in the phylogeny, any pair \((j, j')\) of mutations can be related in exactly one of the following four ways.

1
Mutation j occurs along the path from the root to source node of edge on which mutation \(j'\) occurs.

2
Mutation \(j'\) occurs along the path from the root to source node of edge on which mutation j occurs.

3
Mutation j and \(j'\) occur on the same edge of the phylogeny.

4
Mutation j and \(j'\) occur on distinct branches of the phylogeny.
We compute the accuracy of inferring the correct relationship between all possible pairs of mutations from the inferred tumor phylogeny.
Generation and preprocessing of the PDAC data
Here, we provide details regarding the generation and preprocessing of targeted sequencing data of pancreatic ductal adenocarcinoma tumor used in this study.
Bulk WES library preparation, sequencing, and variant calling
Genomic DNA was extracted from each tissue using the phenolchloroform extraction protocol [55] or the QIAamp DNA Mini Kits (Qiagen) [56]. WES library preparation and sequencing were performed by the Integrated Genomics Operation at Memorial Sloan Kettering Cancer (MSKCC, NY). Briefly, an Illumina HiSeq 2000, HiSeq 2500, HiSeq 4000, or NovaSeq 6000 platform was used to target sequencing coverages of \(>250\times\) for WES samples.
The raw FASTQ files were processed with the standard pipeline of the Bioinformatics Core at MSKCC. Sequencing reads were analyzed in silico to assess quality, coverage, and aligned to the human reference genome (hg19) using BWA [57]. After read deduplication, base quality recalibration and multiple sequence realignment were completed with the PICARD Suite [58] and GATK v.3.1 [59]; somatic singlenucleotide variants and insertiondeletion mutations were detected using Mutect v.1.1.6 [60] and HaplotypeCaller v.2.4 [61]. This pipeline generated a set of mutations for every single sample. Then, all mutations of all samples of the same sequencing cohort were pooled as a single set. Each sample’s BAM file was used to compute “fillout” values (total depth, reference allele read counts, alternative allele read counts) for each mutation in the pooled list. Mutation with alternate read count less than 2 across all samples were removed to trim down false positives. The purpose was to rescue mutations that were detected with high confidence in one sample but with low confidence in another sample of the same patient/tumor. This generated the final output in mutational annotation format (MAF).
Singlecell DNA sequencing (Tapestri) library preparation and variant calling
Single nuclei were extracted from snap frozen primary patient samples embedded in optimal cutting temperature (OCT) compound using the protocol developed by Zhang et al. [15].
Nuclei were suspended in Mission Bio cell buffer at a maximum concentration of 4000 nuclei/\(\mu\)l, encapsulated in Tapestri microfluidics cartridge, lysed and barcoded. Barcoded samples were then put through targeted PCR amplification with a custom 596amplicon panel covering important PDAC mutational hotspots in our sample cohort (table with all the amplicons is available at https://github.com/raphaelgroup/ConDoR).
The 596amplicon panel was designed based on curation of bulk whole exome/genome sequencing data of PDAC samples collected by the Iacobuzio lab. The goal was to cover as many PDACrelated SNVs within our patient cohort as possible within a 600amplicon limit, which was deemed economically optimal. The genes/SNVs of interest were determined by querying several resources, such as cBioportal [62, 63] and openCRAVAT [64]. Particular interest was paid to genes in the TGF\(\beta\) pathway as relevant mutations are currently being investigated as clinical biomarkers [65]. In addition to the SNVs, we added amplicons to cover as much exon region as possible for genes that are of particular interest for CNV analyses in PDAC: KRAS, TP53, SMAD4, CDKN2A, TGFBR1, TGFBR2, ACVR1B, ACVR2A, BMPR1A, BMPR1B, SMAD2, SMAD3, MYC, GATA6, BAP1, MUS81, and KAT5.
PCR products were removed from individual droplets, purified with Ampure XP beads and used as templates for PCR to incorporate Illumina i5/i7 indices. PCR products were purified again, quantified with an Agilent Bioanlyzer for quality control, and sequenced on an Illumina NovaSeq. The minimum total read depth was determined by same formula as used in [15].
As described in [15], FASTQ files for singlenuclei DNA libraries were processed through Mission Bio’s Tapestri pipeline with default parameters to arrive at the output H5 file, which mainly consists of two matrices: a cellbyperampliconreadcount matrix \(\mathbf {X_1}\), and a cellbySNV matrix \(\mathbf {X_2}\). Briefly, the pipeline has the following steps.

1
Adaptor sequences are trimmed and align the reads to the hg19 genome (UCSC).

2
The reads are assigned to individual cell barcodes, while filtering out the lowquality cell barcodes. For each of these barcodes, the number of forward reads aligned to each amplicon is used to form matrix \(\mathbf {X_1}\).

3
Variant calling is performed to generate a gVCF (genomic variant call format) file from the BAM file for each cell.

4
The cells are jointly genotyped to form a cellbySNV matrix \(\mathbf {X_2}\).
A more detailed documentation of the pipeline is available at: https://support.missionbio.com/hc/enus/categories/360002512933TapestriPipeline. In respect of Mission Bio’s request, the pipeline code is not to be publicized because it contains proprietary information per industry standard. However, the pipeline used in the paper that demonstrated this scDNAseq library preparation technology [66] is publicly available as a Github repository at https://github.com/AbateLab/DAbseq. Although we have not formally tested that it performs identically as the Mission Bio pipeline, we believe it is sufficient to replicate our results.
Variant calling
We detect 40 mutations in the bulk tumor sample with a variant allele frequency (VAF) of at least 0.05. Out of these 40 mutations, 34 mutations were also detected in the matched normal sample indicating that they were germline mutations. From the remaining 6 somatic mutations, we filter out mutations with low prevalence in the scDNAseq data. Specifically, we only include mutations with variant allele frequency more than 0.1, read depth of more than 20 and variant read depth of more than 10 in at least \(5\%\) of the cells. We end up with 4 somatic mutations: chr3:30715617:C/G (TGFBR2_1), chr3:30715619:G/T (TGFBR2_2), chr8:38314915:G/T (FGFR1), and chr13:32907415:T/A (BRCA2).
Most phylogeny inference methods only consider somatic SNVs as input, and filter out all germline SNPs. However, germline SNPs that have undergone loss in a subset of cells are informative during phylogeny inference. We identify germline SNPs with putative loss by including SNPs with variant allele frequency less than 0.1, variant read depth more than 10, and total read depth less than 20 in at least \(15\%\) of the cells. We find 3 such SNPs: chr10:131506283:C/T (MGMT_1), chr10:131506192:C/T (MGMT_2), and chr1:158612236:A/G (SPTA1). In summary, we consider 3 germline SNPs and 4 somatic SNVs in our analysis.
Copynumber clustering
In this section, we describe the method to cluster the PDAC cells based on the total reads in each cell. Let \(\mathcal {A}\) be the set of amplicons, \(\mathcal {G}\) be the set of genes and \(\mathcal {A}(g)\) denote the set of amplicons contained in gene \(g\in \mathcal {G}\). Let \(R^{A}_{i,a} = [r^{A}_{i,a}]\) be the \(n\times \mathcal {A}\) read depth matrix that contains the number of reads in cell i and amplicon a. We start by normalizing the ampliconlevel read depth matrix by the total reads in each cell. Specifically, we form matrix \(\bar{R}^{A}\in \mathbb {Z}^{n\times \mathcal {A}}\) such that,
Next, we assume that all loci in the same gene have the same copynumber. As such, we compute the average normalized read depth as follows
This step helps nullify some noise in the ampliconlevel total read count data. We focus on 30 genes with the highest number of amplicons and perform kmeans clustering on the resulting matrix \(\bar{R}^{G}\).
We use the Silhouette score to determine the number of copynumber clusters. Additional file 1: Fig. S9a shows the the Silhouette score for increasing number of clusters in the kmeans clustering. We choose the clustering with the highest Silhouette score resulting in \(p = 3\) clusters as the copynumber clustering \(\sigma\). Additional file 1: Fig. S9b shows a tSNE [67] where each point is a cell labeled by the cluster index from the copynumber clustering \(\sigma\).
Availability of data and materials
ConDoR software, simulations, and processed real data are publicly available at https://github.com/raphaelgroup/ConDoR [68] under the MIT license. The results reported in the paper are fully reproducible with the data available in the Github repository. The original CRC data is available on NCBI Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) under accession number SRP074289 [69]. The raw FASTQ files of PDAC data analyzed in the present study are not publicly available because they are part of a larger cohort to be published in the near future but are available from the corresponding author on reasonable request. Specifically, this data will be deposited in EGA along with a bigger cohort shortly after submission of our next publication. They will be available under restricted access, as required by the MSKCC Medical Donation Program Data Access Agreement (MSKCC MDP DAA). Readers interested in gaining access through EGA need to contact the data access committee (DAC) of this dataset and start an application. The DAC will try to respond within 2 weeks but may take longer in special conditions. The MSKCC MDP DAA, to be provided by the DAC and include guidelines and restrictions on data usage, must be signed. Once the application is approved, an EGA account will be provided for data access. The time length of access to the data will be determined by the DAC on a casebycase basis. Further information about EGA can be found at https://egaarchive.org and “The European Genomephenome Archive of human data consented for biomedical research” (https://www.nature.com/ng/journal/v47/n7/full/ng.3312.html).
References
Tabassum DP, Polyak K. Tumorigenesis: it takes a village. Nat Rev Cancer. 2015;15(8):473–83.
Schwartz R, Schäffer AA. The evolution of tumour phylogenetics: principles and practice. Nat Rev Genet. 2017;18(4):213–29.
AmiroucheneAngelozzi N, Swanton C, Bardelli A. Tumor evolution as a therapeutic target the impact of tumor evolution in precision medicine. Cancer Discov. 2017;7(8):805–17.
Fittall MW, Van Loo P. Translating insights into tumor evolution to clinical practice: promises and challenges. Genome Med. 2019;11(1):1–14.
Jiao W, Vembu S, Deshwar AG, Stein L, Morris Q. Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinforma. 2014;15(1):1–16.
Popic V, Salari R, Hajirasouliha I, KashefHaghighi D, West RB, Batzoglou S. Fast and scalable inference of multisample cancer lineages. Genome Biol. 2015;16(1):1–17.
Malikic S, McPherson AW, Donmez N, Sahinalp CS. Clonality inference in multiple tumor samples using phylogeny. Bioinformatics. 2015;31(9):1349–56.
ElKebir M, Oesper L, AchesonField H, Raphael BJ. Reconstruction of clonal trees and tumor composition from multisample sequencing data. Bioinformatics. 2015;31(12):i62–70.
Deshwar AG, Vembu S, Yung CK, Jang GH, Stein L, Morris Q. PhyloWGS: reconstructing subclonal composition and evolution from wholegenome sequencing of tumors. Genome Biol. 2015;16(1):1–20.
ElKebir M, Satas G, Oesper L, Raphael BJ. Inferring the mutational history of a tumor using multistate perfect phylogeny mixtures. Cell Syst. 2016;3(1):43–53.
Eaton J, Wang J, Schwartz R. Deconvolution and phylogeny inference of structural variations in tumor genomic samples. Bioinformatics. 2018;34(13):i357–65.
Laks E, McPherson A, Zahn H, Lai D, Steif A, Brimhall J, et al. Clonal decomposition and DNA replication states defined by scaled singlecell genome sequencing. Cell. 2019;179(5):1207–21.
Morita K, Wang F, Jahn K, Hu T, Tanaka T, Sasaki Y, et al. Clonal evolution of acute myeloid leukemia revealed by highthroughput singlecell genomics. Nat Commun. 2020;11(1):1–17.
Minussi DC, Nicholson MD, Ye H, Davis A, Wang K, Baker T, et al. Breast tumours maintain a reservoir of subclonal diversity during expansion. Nature. 2021;592(7853):302–8.
Zhang H, Karnoub ER, Umeda S, Chaligné R, Masilionis I, McIntyre CA, et al. Application of highthroughput singlenucleus DNA sequencing in pancreatic cancer. Nat Commun. 2023;14(1):749.
Zafar H, Navin N, Nakhleh L, Chen K. Computational approaches for inferring tumor evolution from singlecell genomic data. Curr Opin Syst Biol. 2018;7:16–25.
Jahn K, Kuipers J, Beerenwinkel N. Tree inference for singlecell data. Genome Biol. 2016;17(1):1–17.
Ross EM, Markowetz F. OncoNEM: inferring tumor evolution from singlecell sequencing data. Genome Biol. 2016;17(1):1–14.
Malikic S, Jahn K, Kuipers J, Sahinalp SC, Beerenwinkel N. Integrative inference of subclonal tumour evolution from singlecell and bulk sequencing data. Nat Commun. 2019;10(1):1–12.
Malikic S, Mehrabadi FR, Ciccolella S, Rahman MK, Ricketts C, Haghshenas E, et al. PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of singlecell and bulk sequencing data. Genome Res. 2019;29(11):1860–77.
Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969;61(4):893.
Demeulemeester J, Dentro SC, Gerstung M, Van Loo P. Biallelic mutations in cancer genomes reveal local mutational determinants. Nat Genet. 2022;54(2):128–33.
Kuipers J, Jahn K, Raphael BJ, Beerenwinkel N. Singlecell sequencing data reveal widespread recurrence and loss of mutational hits in the life histories of tumors. Genome Res. 2017;27(11):1885–94.
McPherson A, Roth A, Laks E, Masud T, Bashashati A, Zhang AW, et al. Divergent modes of clonal spread and intraperitoneal mixing in highgrade serous ovarian cancer. Nat Genet. 2016;48(7):758–67.
ElKebir M. SPhyR: tumor phylogeny estimation from singlecell sequencing data under loss and error. Bioinformatics. 2018;34(17):i671–9.
Ciccolella S, Soto Gomez M, Patterson MD, Della Vedova G, Hajirasouliha I, Bonizzoni P. gpps: an ILPbased approach for inferring cancer progression with mutation losses from single cell data. BMC Bioinformatics. 2020;21(1):1–16.
Ciccolella S, Ricketts C, Soto Gomez M, Patterson M, Silverbush D, Bonizzoni P, et al. Inferring cancer progression from singlecell sequencing while allowing mutation losses. Bioinformatics. 2021;37(3):326–33.
Farris JS. Phylogenetic analysis under Dollo’s Law. Syst Biol. 1977;26(1):77–88.
Zafar H, Tzen A, Navin N, Chen K, Nakhleh L. SiFit: inferring tumor trees from singlecell sequencing data under finitesites models. Genome Biol. 2017;18(1):1–20.
Zafar H, Navin N, Chen K, Nakhleh L. SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from singlecell genome sequencing data. Genome Res. 2019;29(11):1847–59.
Wagner WH. Problems in the classification of ferns. Recent Adv Bot. 1961;(1):841–4.
Satas G, Zaccaria S, Mon G, Raphael BJ. Scarlet: singlecell tumor phylogeny inference with copynumber constrained mutation losses. Cell Syst. 2020;10(4):323–32.
Chen Z, Gong F, Wan L, Ma L. BiTSC 2: Bayesian inference of tumor clonal tree by joint analysis of singlecell SNV and CNA data. Brief Bioinforma. 2022;23(3):bbac092.
Sollier E, Kuipers J, Takahashi K, Beerenwinkel N, Jahn K. COMPASS: joint copy number and mutation phylogeny reconstruction from amplicon singlecell sequencing data. Nature Communications. 2023;14(1):4921.
Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, et al. Eleven grand challenges in singlecell data science. Genome Biol. 2020;21(1):1–35.
Zaccaria S, Raphael BJ. Characterizing alleleand haplotypespecific copy numbers in single cells with CHISEL. Nat Biotechnol. 2021;39(2):207–14.
Leung ML, Davis A, Gao R, Casasent A, Wang Y, Sei E, et al. Singlecell DNA sequencing reveals a latedissemination model in metastatic colorectal cancer. Genome Res. 2017;27(8):1287–99.
Lan F, Demaree B, Ahmed N, Abate AR. Singlecell genome sequencing at ultrahighthroughput with microfluidic droplet barcoding. Nat Biotechnol. 2017;35(7):640–6.
Pellegrino M, Sciambi A, Treusch S, DurruthyDurruthy R, Gokhale K, Jacob J, et al. Highthroughput singlecell DNA sequencing of acute myeloid leukemia tumors with droplet microfluidics. Genome Res. 2018;28(9):1345–52.
Gusfield D. Efficient algorithms for inferring evolutionary trees. Networks. 1991;21(1):19–28.
Singer J, Kuipers J, Jahn K, Beerenwinkel N. Singlecell mutation identification via phylogenetic inference. Nat Commun. 2018;9(1):1–8.
Weber LL, Sashittal P, ElKebir M. doubletD: detecting doublets in singlecell DNA sequencing data. Bioinformatics. 2021;37(Supplement1):i214–21.
Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual. 2020. http://www.gurobi.com. Accessed 5 Mar 2021.
Krapivsky PL, Redner S. Organization of growing random networks. Phys Rev E. 2001;63(6):066123.
MacQueen J. Classification and analysis of multivariate observations. In: 5th Berkeley Symp. Math. Statist. Probability. University of California Press; 1967. p. 281–297.
Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65.
Hayashi A, Hong J, IacobuzioDonahue CA. The pancreatic cancer genome revisited. Nat Rev Gastroenterol Hepatol. 2021;18(7):469–81.
Greer JB, Whitcomb DC. Role of BRCA1 and BRCA2 mutations in pancreatic cancer. Gut. 2007;56(5):601–5.
Gerlinger M, Rowan AJ, Horswell S, Larkin J, Endesfelder D, Gronroos E, et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl j Med. 2012;366:883–92.
Zhang J, Fujimoto J, Zhang J, Wedge DC, Song X, Zhang J, et al. Intratumor heterogeneity in localized lung adenocarcinomas delineated by multiregion sequencing. Science. 2014;346(6206):256–9.
Hiley C, de Bruin EC, McGranahan N, Swanton C. Deciphering intratumor heterogeneity and temporal acquisition of driver events to refine precision medicine. Genome Biol. 2014;15(8):1–10.
Stanta G, Bonin S. Overview on clinical relevance of intratumor heterogeneity. Front Med. 2018;5:85.
Sashittal P, Zaccaria S, ElKebir M. Parsimonious Clone Tree Integration in cancer. Algorithms Mol Biol. 2022;17(1):1–14.
Stoler N, Nekrutenko A. Sequencing error profiles of Illumina sequencing instruments. NAR Genomics Bioinforma. 2021;3(1):lqab019.
Köchl S, Niederstätter H, Parson W. DNA extraction and quantitation of forensic samples using the phenolchloroform method and realtime PCR. In: Carracedo A, editor. Forensic DNA Typing Protocols. Totowa: Humana Press; 2005. p. 13–29. https://doi.org/10.1385/1592598676:013.
Coyne SR, Craw PD, Norwood DA, Ulrich MP. Comparative analysis of the Schleicher and Schuell IsoCode Stix DNA isolation device and the Qiagen QIAamp DNA mini kit. J Clin Microbiol. 2004;42(10):4859–62.
Li H, Durbin R. Fast and accurate short read alignment with BurrowsWheeler transform. Bioinformatics. 2009;25(14):1754–60.
"Picard toolkit." Broad Institute. Broad Institute, GitHub repository. 2019. Available from: https://broadinstitute.github.io/picard.
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing nextgeneration DNA sequencing data. Genome Res. 2010;20(9):1297–303.
Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol. 2013;31(3):213–9.
Poplin R, RuanoRubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. BioRxiv. 2018;201178.
Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, et al. The cBio Cancer Genomics Portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2012;2(5):401–4. https://doi.org/10.1158/21598290.CD120095.
Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal. 2013;6(269):pl1–pl1. https://doi.org/10.1126/scisignal.2004088.
Pagel KA, Kim R, Moad K, Busby B, Zheng L, Tokheim C, et al. Integrated informatics analysis of cancerrelated variants. JCO Clin Cancer Informa. 2020;4:310–7. https://doi.org/10.1200/CCI.19.00132.
Hayashi A, Hong J, IacobuzioDonahue CA. The pancreatic cancer genome revisited. Nat Rev Gastroenterol Hepatol. 2021;18(7):469–81.
Demaree B, Delley CL, Vasudevan HN, Peretz CAC, Ruff D, Smith CC, et al. Joint profiling of DNA and proteins in single cells to dissect genotypephenotype associations in leukemia. Nat Commun. 2021;12(1):1583.
LJPvd M, Hinton G. Visualizing highdimensional data using tSNE. J Mach Learn Res. 2008;9(2579–2605):9.
Sashittal P, Zhang H, IacobuzioDonahue C, Raphael B. ConDoR: Tumor phylogeny inference with a copynumber constrained mutation loss model. Zenodo. 2023. https://doi.org/10.5281/zenodo.8350264.
Leung ML, Davis A, Gao R, Casasent A, Wang Y, Sei E, et al. Data from Singlecell DNA sequencing reveals a latedissemination model in metastatic colorectal cancer. NCBI SRA. 2017. https://www.ncbi.nlm.nih.gov/sra/?term=SRP074289. Processed data from this study was accessed from https://github.com/raphaelgroup/scarlet. Accessed 5 Mar 2023.
Acknowledgements
The authors would like the thank the reviewers for their helpful comments that led to several improvements in the paper. The authors are also grateful to the Single Cell Research Initiative (SCRI), Integrated Genomics Operations (IGO), and the bioinformatics core at MSKCC for their excellent technical support.
Review history
The review history is available as Additional file 2.
Peer review information
Tim Sands was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Funding
This work was supported by NIH/NCI grants U24CA264027 and U24CA248453 awarded to B.J.R. and by NIH/NCI grants R35CA220508 and U2CCA233284 awarded to C.A.ID. C.A.ID. was also supported by Cycle for Survival for David Rubenstein Center for Pancreatic Cancer Research at Memorial Sloan Kettering Cancer Center.
Author information
Authors and Affiliations
Contributions
P.S. and B.J.R. conceived the project. P.S. characterized the combinatorial structure of the problem, formulated the MILP, implemented ConDoR, and performed the experimental evaluation. H.Z. performed the sequencing experiments and designed the bioinformatics pipeline for the bulk and the singlecell Tapestri data. All authors interpreted the results. C.A.ID. and B.J.R. provided supervision. All authors wrote the manuscript. All authors read and approved the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Use of samples used in this study was approved by IRB review at Memorial Sloan Kettering Cancer Center under protocol #15149.
Consent for publication
All authors have consented to the publication of this work in Genome Biology.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1.
Figures, tables and text describing additional information such as proofs of theorems or additional results.
Additional file 2.
Peer review history.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Sashittal, P., Zhang, H., IacobuzioDonahue, C. et al. ConDoR: tumor phylogeny inference with a copynumber constrained mutation loss model. Genome Biol 24, 272 (2023). https://doi.org/10.1186/s13059023031065
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13059023031065