SiFit: inferring tumor trees from single-cell sequencing data under finite-sites models

Single-cell sequencing enables the inference of tumor phylogenies that provide insights into intra-tumor heterogeneity and evolutionary trajectories. Recently introduced methods perform this task under the infinite-sites assumption. Violations of this assumption, due to chromosomal deletions and loss of heterozygosity, necessitate inference methods that use finite-sites models. We propose a statistical inference method for tumor phylogenies from noisy single-cell sequencing data under a finite-sites model. The performance of our method on synthetic data sets, and on experimental data sets from two colorectal cancer patients in which we trace evolutionary lineages in primary and metastatic tumors, suggests that employing a finite-sites model leads to improved inference of tumor phylogenies.

Electronic supplementary material: The online version of this article (doi:10.1186/s13059-017-1311-2) contains supplementary material, which is available to authorized users.


Supplementary Note
Potential events violating the four-gamete test in single-cell sequencing data
The four-gamete theorem states that an m × n binary matrix, M, has an undirected perfect phylogeny if and only if no pair of columns contains all four binary pairs (0, 0; 0, 1; 1, 0 and 1, 1), where m is the number of taxa (leaves of the tree) and n is the number of genomic sites. A perfect phylogeny is a rooted tree (T) on a leaf set of m taxa in which each of the n genomic sites (characters) labels exactly one edge of T.
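The pairwise condition in the theorem can be checked directly. The following is a minimal sketch, assuming the genotype matrix is given as a list of rows (one per cell) with 0/1 entries; the function names and the small example matrix are hypothetical and are not taken from SiFit.

```python
from itertools import combinations


def violating_pairs(matrix):
    """Return all column pairs (i, j) that contain all four gametes.

    matrix: list of rows (taxa), each a list of 0/1 values (sites).
    """
    if not matrix:
        return []
    n_sites = len(matrix[0])
    bad = []
    for i, j in combinations(range(n_sites), 2):
        # Collect the distinct (site i, site j) value pairs seen across taxa.
        gametes = {(row[i], row[j]) for row in matrix}
        # With binary entries, four distinct pairs means 00, 01, 10 and 11.
        if len(gametes) == 4:
            bad.append((i, j))
    return bad


def passes_four_gamete_test(matrix):
    """True if no pair of columns contains all four gametes."""
    return not violating_pairs(matrix)


# Hypothetical 4-cell x 3-site matrix used only for illustration.
M = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
print(passes_four_gamete_test(M))
print(violating_pairs([[0, 0], [0, 1], [1, 0], [1, 1]]))
```

The check examines every pair of sites, so its cost grows as O(m n^2) for m cells and n sites.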
The genomic sites that are mutated in a particular taxon (c) are the sites that label the branches along the unique path from the root to the leaf (labeled by c) of the perfect phylogeny tree T. Under the perfect phylogeny model, each genomic site is a perfect character, i.e. each site mutates exactly once in its evolutionary history. This assumption is also known as the "infinite-sites assumption". A binary matrix that satisfies the four-gamete condition is consistent with the infinite-sites model of evolution, and any violation of the four-gamete condition suggests a potential deviation from the infinite-sites assumption.
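The relation between the tree and the genotype matrix can be illustrated with a short sketch. It assumes a rooted tree encoded as a child-to-parent dictionary and a dictionary mapping each node to the sites that label the edge entering it; this encoding, the function name and the example tree are hypothetical choices made for illustration.

```python
def genotypes_from_tree(parent, edge_sites, leaves, n_sites):
    """Build the binary genotype matrix implied by a perfect phylogeny.

    parent:     dict mapping each node to its parent (the root maps to None).
    edge_sites: dict mapping a node to the list of site indices that label
                the edge from its parent to that node.
    leaves:     list of leaf names, one row of the matrix per leaf.
    n_sites:    total number of genomic sites (columns).
    """
    matrix = []
    for leaf in leaves:
        row = [0] * n_sites
        node = leaf
        # Walk from the leaf up to the root; every site on the path is mutated.
        while node is not None:
            for site in edge_sites.get(node, []):
                row[site] = 1
            node = parent[node]
        matrix.append(row)
    return matrix


# Hypothetical tree: root -> a (site 0), a -> c1, a -> c2 (site 1), root -> c3 (site 2).
parent = {"root": None, "a": "root", "c1": "a", "c2": "a", "c3": "root"}
edge_sites = {"a": [0], "c2": [1], "c3": [2]}

for name, row in zip(["c1", "c2", "c3"],
                     genotypes_from_tree(parent, edge_sites, ["c1", "c2", "c3"], 3)):
    print(name, row)
```

Because each site labels exactly one edge in this encoding, every column of the resulting matrix is a perfect character.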
In binary genotype data from single-cell sequencing (SCS), several events can lead to violations of the four-gamete test.
• Mutations affecting the same site. Different mutational events in cancer, such as deletion, loss of heterozygosity (LOH) and convergent evolution, can mutate a genomic site more than once. This makes the site 'imperfect', resulting in a violation of the four-gamete condition (see Fig. S5b for an example).
• Cell doublets. 'Cell doublets' are formed when two or more cells are accidentally isolated together instead of a single cell, so that the genotypes of these cells are merged. The merged genotype can also violate the four-gamete condition (see Fig. S5c for an example).
• False positive and false negative errors. In SCS data, false positive and false negative errors can lead to violations of the four-gamete condition (see Fig. S5d for an example): false positive errors occur at random, and false negative errors at heterozygous sites have the same effect as back mutations.
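The three kinds of events listed above can be reproduced on a toy example. The sketch below starts from a hypothetical 5-cell × 4-site matrix that satisfies the four-gamete condition (it is not the matrix of Fig. S5) and perturbs it in three ways. Modeling a doublet as the element-wise OR of two genotypes is an assumption made here for illustration.

```python
from itertools import combinations


def violating_pairs(matrix):
    """Column pairs (i, j) in which all four gametes 00, 01, 10, 11 occur."""
    n_sites = len(matrix[0])
    return [
        (i, j)
        for i, j in combinations(range(n_sites), 2)
        if len({(row[i], row[j]) for row in matrix}) == 4
    ]


# Hypothetical perfect-phylogeny matrix: rows are cells c1..c5, columns are sites.
BASE = [
    [1, 1, 0, 0],  # c1
    [1, 0, 0, 0],  # c2
    [0, 0, 1, 1],  # c3
    [0, 0, 1, 0],  # c4
    [0, 0, 0, 0],  # c5 (carries none of the mutations)
]


def copy_matrix(matrix):
    return [row[:] for row in matrix]


# 1) Recurrent mutation: site 1 (present in c1) arises again in c3.
recurrent = copy_matrix(BASE)
recurrent[2][1] = 1

# 2) Doublet (assumed OR model): c3 is captured together with a cell like c2.
doublet = copy_matrix(BASE)
doublet[2] = [a | b for a, b in zip(BASE[2], BASE[1])]

# 3) False positive: c1 is erroneously called mutated at site 2.
false_positive = copy_matrix(BASE)
false_positive[0][2] = 1

print("base:", violating_pairs(BASE))
print("recurrent mutation:", violating_pairs(recurrent))
print("doublet:", violating_pairs(doublet))
print("false positive:", violating_pairs(false_positive))
```

In each perturbed matrix a single changed cell is enough to introduce at least one violating pair of sites, while the unperturbed matrix has none.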
All of the above factors can occur together, resulting in a large number of pairs of genomic sites that violate the four-gamete test.

Figure S1: Performance comparison on datasets (containing doublets) with varying numbers of cells. SiFit's tree reconstruction accuracy is compared against that of SCITE, OncoNEM and MrBayes. The y-axis denotes the tree reconstruction error, which measures the distance of the inferred tree from the ground truth. All cells, including the doublets, are considered when measuring the tree reconstruction error. The number of cells varies as m = 50, m = 100 and m = 200. The percentage of doublets (δ) varies as 5% and 10%. For each combination of δ and m, the number of sites n is varied as n = 200, n = 400 and n = 600. Each boxplot summarizes results over 10 datasets for each combination of δ, n and m.

Figure S2: Performance comparison on datasets (containing doublets) with missing data. SiFit's tree reconstruction accuracy is compared against that of SCITE and OncoNEM. The y-axis denotes the tree reconstruction error, which measures the distance of the inferred tree from the ground truth. All cells, including the doublets, are considered when measuring the tree reconstruction error. The amount of missing data varies over 0%, 10% and 25%. The percentage of doublets (δ) varies as 5% and 10%. For each combination of δ and percentage of missing data, the number of sites n is varied as n = 200, n = 400 and n = 600. Each boxplot summarizes results over 10 datasets for each combination of δ, n and missing data percentage.

Supplementary Figures
Figure S5: Illustration of potential events violating the four-gamete test. a) A perfect phylogeny with 4 cells as leaves. The green circles are the cells and the colored diamonds are the mutations. The corresponding binary mutation matrix is shown on the right. b) Violation of the four-gamete condition due to mutations (deletion, LOH and recurrent point mutations) affecting the same site. The mutation M3 occurs again in the cell c4 (marked in blue). Columns M1 and M3 (highlighted in red) violate the four-gamete condition. c) Violation due to cell doublets. The cell c3 (marked in purple) is now a doublet, and its genotype is merged with that of a cell having the same genotype as c2. Columns M2 and M5 (highlighted in red) violate the four-gamete condition. d) Violation due to amplification error. The cell c1 (marked in orange) has a false positive (FP) error for mutation M6. Columns M1 and M6 (highlighted in red) violate the four-gamete condition.