In vitro lineage tracing experiment
Plasmid design and cloning
The Cas9mCherry lentivector, pHRUCOESFFVCas9mCherry (to be added to Addgene), was designed for stable, constitutive expression of enzymatically active Cas9, driven by the viral SFFV promoter, insulated with a minimal universal chromatin opening element (minUCOE), and tagged with Cterminal, selfcleaving P2AmCherry. pHRUCOESFFVCas9mCherry is derived from pMH0001 (Addgene Cat#85969, active Cas9) with the BFP tag exchanged with mCherry. The P2AmCherry tag was PCR amplified from pHRSFFVKRABdCas9P2AmCherry (Addgene Cat #60954; forward: GAGCAACGGCAGCAGCGGATCCGGAGCTACTAACTTCAG; reverse: ATATCAAGCTTGCATGCCTGCAGGTCGACTTACTACTTGTACAGCTCGTCCATGC) and inserted using Gibson Assembly (NEB) into SbfI/BamHIdigested pMH0001 (active Cas9). Resulting plasmid was used for lentiviral production as described below.
The Target Site lentivector, PCT48 (to be added to Addgene), was derived from the reverse lentivector PCT5 (to be added to Addgene) containing GFP driven by the EF1a promoter. The sequence of the 10X amplicon with most common polyA location is the following: AATCCAGCTAGCTGTGCAGCNNNNNNNNNNNNNNATTCAACTGCAGTAATGCTACCTCGTACTCACGCTTTCCAAGTGCTTGGCGTCGCATCTCGGTCCTTTGTACGCCGAAAAATGGCCTGACAACTAAGCTACGGCACGCTGCCATGTTGGGTCATAACGATATCTCTGGTTCATCCGTGACCGAACATGTCATGGAGTAGCAGGAGCTATTAATTCGCGGAGGACAATGCGGTTCGTAGTCACTGTCTTCCGCAATCGTCCATCGCTCCTGCAGGTGGCCTAGAGGGCCCGTTTAAACCCGCTGATCAGCCTCGACTGTGCCTTCTAGTTGCCAGCCATCTGTTGTTTGCCCCTCCCCCGTGCCTTCCTTGACCCTGGAAGGTGCCACTCCCACTGTCCTTTCCTAATAAAAAAAAAAAAAAAAAAAAAAA
where N denotes our 14bp random integration barcode. PCT5 was digested with SfiI and EcoRI within the 3 ^{′}UTR of GFP. The Target Site sequence was ordered as a DNA fragment (gBlock, IDT DNA) containing three Cas9 cut sites and a high diversity, 14bp randomer (integration barcode, or intBC). The fragment was PCR amplified with primers containing Gibson assembly arms compatible with SfiI/EcoRIdigested PCT5 (forward: GATGAGCTCTACAAATAATTAATTAAGAATTCGTCACGAATCCAGCTAGCTGT;reverse:GGTTTAAACGGGCCCTCTAGGC CACCTGCAGGAGCGATGG). The amplified Target Site fragment was inserted into the digested PCT5 backbone using Gibson Assembly. The assembled lentivector library was transformed into MegaX competent bacterial cells (Thermo Fisher) and grown in 1L of LB with carbenicillin at 100 μg/mL. Lentivector plasmid was recovered and purified by GigaPrep (Qiagen) and used for highdiversity lentiviral production as described below.
The triplesgRNABFPPuroR lentivector, PCT61 (to be added to Addgene), is derived from pBA392 (to be added to Addgene) as previously described [57, 58] containing three sgRNA cassettes driven by distinct U6 promoters and constitutive BFP and puromycinresistance markers for selection. Importantly, the three PCT61 sgRNAs are complementary to the three cut sites in the PCT48 Target Site. To slow the cutting kinetics of the sgRNAs to best match the timescale involved in the in vitro lineage tracing experiments [10], the sgRNAs contain precise singlebasepair mismatches that decrease their avidity for the cognate cut sites [59]. The triplesgRNA lentivector was cloned using fourway Gibson assembly as described in [58]. Resulting plasmid was used for lentiviral production as described below.
Cell culture, DNA transfections, viral preparation, and cell line engineering
A549 cells (human lung adenocarcinoma line, ATCC CCL185) and HEK293T were maintained in Dulbecco’s modified Eagle medium (DMEM, Gibco) supplemented with 10% FBS (VWR Life Science Seradigm), 2 mM glutamine, 100 units/mL penicillin, and 100 μg/mL streptomycin. Lentivirus was produced by transfecting HEK293T cells with standard packaging vectors and TransITLTI transfection reagent (Mirus) as described in ([57]). Target Site (PCT48) lentiviral preparations were concentrated tenfold using LentiX Concentrator (Takara Bio). Viral preparations were frozen prior to infection. TriplesgRNA lentiviral preparations were titered and diluted to a concentration to yield approximately 50% infection rate. To construct the lineage tracingcompetent cell line, A549 cells were transduced by serial lentiviral infection with the three lineage tracing components: (1) Cas9, (2) Target Site, and (3) triplesgRNAs. First, A549 cells were transduced by Cas9 (mCherry) lentivirus and mCherry+ cells were selected to purity by fluorescenceactivated cell sorting on the BD FACS Aria II. Second, A549Cas9 cells were transduced by concentrated Target Site (GFP) lentivirus and GFP+ cells were selected by FACS; after sorting, Target Site infection and sorting were repeated two more times for a total of three serial lentiviral transfections, sorting for cells with progressively higher GFP signal after each infection. This strategy of serial transfection with concentrated lentivirus yielded cells with high copy numbers of the Target Site, which were confirmed by quantitative PCR. Third, A549 cells with Cas9 and Target Site were transduced by titered triplesgRNA (BFPPuroR) lentivirus and selected as described below.
In vitro lineage tracing experiment, singlecell RNAseq library preparation, and sequencing
One day following triplesgRNA infection, cells were trypsinized to a singlecell suspension and counted using an Accuri cytometer (BD Biosciences). Approximately 25 cells were plated in a single well of a 96well plate. Seven days postinfection, cells were trypsinized and split evenly into two wells of a 96well plate. Cells stably transduced by triplesgRNA lentivirus were selected by adding puromycin at 1.5 μg/mL on days 9 and 11 postinfection; puromycinkilled cells were removed by washing the plate with a fresh medium. After 14 days, cells were trypsinized and split evenly for a second time into four wells of a 6well plate. Finally, after 21 days in total, cells from the four wells were trypsinized to a singlecell suspension and collected.
Cells were washed with PBS with 0.04% w/v bovine serum albumin (BSA, New England Biolabs), filtered through 40 μm FlowMi filter tips filter tips (BelArt), and counted according to the 10x Genomics protocol. Approximately 14,000 cells per sample were loaded (expected yield: approximately 10,000 cells per sample) into the 10x Genomics Chromium Single Cell 3 ^{′} Library and Gel Bead Kit v2, and cDNA was reversetranscribed, amplified, and purified according to the manufacturer’s protocol. Resulting cDNA libraries were quantified by BioAnalyzer, yielding the expected size distribution described in the manufacturer’s protocol.
To prepare the Target Site amplicon sequencing library, resulting amplified cDNA libraries were further amplified with custom, Target Sitespecific primers containing P5/P7 Illumina adapters and sample indices (forward:CAAGCAGAAGACGGCATACGAGATXXXXXXXXGTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGAATCCAGCTAGCTGTGCAGC;reverse:CAAGCAGAAGACGGCATACGAGATXXXXXXXXGTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGCATGGACGAGCTGTACAAGT; “X” denotes sample indices). PCR amplification was performed using Kapa HiFi HotStart ReadyMix, as in [57], according to the following program: melting at 95 ^{∘}C for 3 min, then 14 cycles at 98 ^{∘}C for 15 s and 70 ^{∘}C for 20 s. Approximately 12 fmol of template cDNA were used per reaction; amplification was performed in quadruplicate to avoid PCRinduced library biases, such as jackpotting. PCR products were repooled and purified by SPRI bead selection at 0.9x ratio and quantified by BioAnalyzer.
Target Site amplicon libraries were sequenced on the Illumina NovaSeq S2 platform. Due to the low sequence complexity for the Target Site library, a phiX genomic DNA library was spiked in at approximately 50% for increased sequence diversity. The 10x cell barcode and unique molecular identifier (UMI) sequences were read first (R1: 26 cycles) and the Target Site sequence was read second (R2: 300 cycles); sample identities were read as indices (I1 and I2: 8 cycles, each). Over 550M sequencing clusters passed filter and were processed as described below. All raw and processed data are available through GEO Series accession GSE146712 [60].
Processing pipeline
Read processing
Each target site was sequenced using the Illumina Novaseq platform, producing 300bp longread sequences. The Fastq’s obtained were quantitated using 10x’s cellranger suite, which simultaneously corrects cell barcodes by comparing against a whitelist of 10x’s approved cell barcodes. For each cell, a consensus sequence for each unique molecule identifier (UMI) was produced by collapsing similar sequences, defined by those sequences differing by at most 1 Levenshtein distance. A directed graph is constructed, where sequences with identical UMIs are connected to one another if the sequences themselves differ by at most one Levenshtein distance. Then, UMIs in this network are collapsed onto UMIs that have greater than or equal number of reads. This produces a collection of sequences indexed by the cell barcode and UMI information (i.e., there is a unique sequence associated with each UMI).
Before aligning all sequences to the reference, preliminary quality control is performed. Specifically, in cases where UMIs in a given cell still have not been assigned a consensus sequence, the sequence with the greatest number of reads is chosen. UMIs with fewer than 2 reads are filtered out, and cells with fewer than 10 UMIs are filtered out as well. Finally, a filtered file in Fastq format is returned.
Allele calling
Alignment is performed with Emboss’s Water local alignment algorithm. Optimal parameters were found by performing a grid search of gap open and gap extend parameters on a set of 1000 simulated sequences, comparing a global and local alignment strategy. We found a gap open penalty of 20.0 and a gap extension penalty of 1.0 produced optimal alignments. The “indels” (insertions and deletions resulting from the Cas9induced doublestrand break) at each cut site in the sequences are obtained by parsing the cigar string from the alignments. To resolve possible redundancies in indels resulting from Cas9 cutting, the 5 ^{′} and 3 ^{′} flanking 5nucleotide context is reported for each indel.
UMI error correction
To correct errors in the UMI sequence either introduced during sequencing, PCR preparation, or data processing, we leverage the allele information. UMIs are corrected within groups of identical cell barcodeintegration barcode pairs (i.e., we assume that only UMIs encoding for the same intBC in a given cell can be corrected). We reason that ideally, for a given integration barcodes, a cell will only report one sequence, or allele. Within these “equivalence classes,” UMIs that differ by at most 1 Levenshtein distance (although this number can be userdefined) are corrected towards the UMI with a greater number of reads.
Cellbased filtering
With the UMI corrected and indels calculated, the new “molecule table” is subjected to further quality control. Specifically, UMIs are filtered based on the number of reads (dynamically set to be the 99th percentile of the reads divided by 10), integration barcodes (denoting a particular integration site) can be error corrected based on a minimum hamming distance and identical indels (referred to as alleles), and in the case where multiple alleles are associated with a given integration barcode a single allele is chosen based on the number of UMIs associated with it.
Calling independent clones
Collections of cells that are part of the same clonal population are identified by the set of integration barcodes each cell contains. Because all cells in the same clone are clonal, we reasoned that cells in the same clone should all share the same set of integration barcodes that the progenitor cell contained. Because of both technical artifacts (e.g., sequencing errors, PCR amplification errors) and biological artifacts (e.g., bursty expression, silenced regions) however, rather than looking for sets of nonoverlapping sets, we perform an iterative clustering procedure. We begin by selecting the intBC that is shared amongst the most cells and assign any cell that contains this barcode to a cluster and remove these cells from the pool of unassigned cells. We perform this iteratively until at most k percent (in our case defined as.5% of cells are unassigned, which we assign to a “junk” clone.
Using the set of integration barcodes for each clone, we are able to identify doublets that consist of cells from different clones. Finally, after identifying doublets, to further filter out lowquality integration barcodes, for each clone integration barcodes that are not shared by at least 10% of cells in a given clone are filtered out, producing the final allele table.
Guidelines for final quality control
The thresholds discussed above are heuristic choices determined based on our handson experience with this type of targetsite library processing. However, these thresholds will undoubtedly change depending on the sequencer used, the sequencing depth of the library, and the biological use case. For these reasons, we suggest that it is more effective to ensure that the final quality control numbers indicate that the library was processed sufficiently.
We present distributions for the metrics we find to be the most useful in Additional file 1: Fig S17: the UMIs per cellBC as a measure for how well sampled a cell is in (a), the reads per UMI as a measure for how confident one is of the UMI sequence in (b), UMIs per intBC as a measure for how confident one is of the called allele and intBC in (c), and a comparison of the number of UMIs versus the number of reads in (d), as a way of quickly assessing if there are any outlier UMIs.
Because this library was sequenced quite deeply, we do not expect typical applications to afford this degree of certainty. Instead, we suggest that cells should have at least 10 targetsite UMIs, the reads/UMI distribution should have a mean at around 100–200 reads, and each intBC should have at least 5–10 UMIs associated with it. Cassiopeia’s processing pipeline creates figures for each of these statistics after filtering and close attention should be paid to these figures during the processing of the targetsite sequencing data.
Filtering of clones for reconstruction
We filtered out clones upon two criteria: firstly, we removed clone 1 as we deduced that it had two defective guides; secondly, we removed lineages that reported fewer than 10% unique cells (thus removing clone 7). The remainder of clones were reconstructed.
Estimation of percharacter mutation rates
To estimate mutation rates per clone, we assume that every target site was mutated at the same rate and independently of one another across 15 generations. Assuming some mutation rate, p, per character, we know that the probability of not observing a mutation in d generations is (1−p)^{d} in a given character and that the probability of observing at least 1 mutation in that character is 1−(1−p)^{d}. Then, giving this probability 1−(1−p)^{d}=m can be used as a probability of observing a mutated character in a cell and model the number of times a character appears mutated in a cell as a binomial distribution where the expectation is simply nm where n is the number of characters. Said simply, given this model, one would expect to see nm characters mutated in a cell. In this case, the empirical expectation is the mean number of times a given character appeared mutated in a cell (averaged across all cells), which we denote as K and propose that
$$K = nm = n * (1  (1  p)^{d}) $$
and thus p, the mutation rate, is
$$p = 1  (1  K/n)^{d} $$
Bulk cutting experiment to determine prior probabilities of indel formation
Two and 4 days following triplesgRNA (PCT61) infection, infected cells were selected by adding puromycin at 1.5 μg/mL; puromycinkilled cells were removed by washing the plate with fresh medium. Cells were split every other day, and 500k cells were collected on days 7, 14, and 28. Frozen cell pellets were lysed, and the genomic DNA was extracted and purified by ethanol precipitation. The PCT48 Target Site locus was PCR amplified from genomic DNA samples (forward: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGAATCCAGCTAGCTGTGCAGC; reverse:GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGTCGAGGCTGATCAGCG) and further amplified to incorporate Illumina adapters and sample indices(forward:AATGATACGGCGACCACCGAGATCTACACXXXXXXXXTCGTCGGCAGCGTCAG;reverse:CAAGCAGAAGACGGCATACGAGATXXXXXXXXGTCTCGTGGGCTCGGAG; “X” denotes sample indices). The subsequent amplicon libraries were sequenced on an Illumina MiSeq (paired end, 300 cycles each). Sequencing data was analyzed as described below.
Determining prior probabilities of indel formation
To determine the prior probabilities of edits, we leverage the fact that we have access to a large set of target sites (or intBCs) with a similar sequence (apart from the random barcode at the 5 ^{′} end); namely, a total of 117 intBC across the 11 clones. To compute the prior probability for a given indel, we compute the empirical frequency of observing this mutation out of all unique edits observed. Specifically, we compute the prior probability of a given indel s, q_{s} as the following:
$$q_{s} = \frac{f(s)}{I} $$
where f(s) is the number of intBCs that had s in at least one cell and I is the number of intBCs that are present in the dataset.
As further support for this method, we used the bulk experiment consisting of many separately engineered A549 cells, as described in the previous section. The advantage of the bulk experiment is that we have access to substantially more intBCs (> 10k), thus providing a more robust estimation of q_{s}. We therefore employed the same approach to estimate indel formation rates from the bulk data and find that the resulting rates correlate well with the indel rates estimated from the singlecell lineage tracing experiment (Additional file 1: Fig S20).
Doublet detection
Methods to detect doublets
We hypothesized that doublets could come in two forms and that we could use various components of the intBC data structure to identify them. Namely, doublets could be of cells from the identical clone, here dubbed “intradoublets,” or doublets could be of cells from separate clones, here dubbed “interdoublets.” In the case of “intradoublets,” we can utilize the fact that these cells will have a large overlap in their set of intBCs but will report “conflicting” alleles for each of these intBCs. Thus, to identify these doublets, we calculate the percentage of UMIs that are conflicting in each cell. Explicitly, for each cell, we iterate over all intBCs and sum up the number of UMIs that correspond to an allele that conflicts with the more abundant allele for a given intBC; we then use the percentage of these UMIs to identify doublets. We perform this after all UMI and intBC correction in hopes of calling legitimate conflicts.
To deal with “interdoublets,” we developed a classifier that leverages the fact that cells from different clones should have nonoverlapping intBC sets. While this is the ideal scenario, often times intBCs are shared between clones for one of two reasons: (1) the clustering assignments are noisy or (2) the transfections of intBCs resulted in two cells receiving the same intBC, even though cells are supposed to be progenitors of separate clones. Our strategy is thus as follows: for each cell c_{i}∈C calculate a “membership statistic,” m_{i,k} for each clone l_{k}∈L. The membership statistic is defined as so:
$$m_{i,k} = \frac{\sum_{j \in I_{k}}\delta(i, j)p(j, k)}{\sum_{j \in I_{k}} (p(j, k))} $$
where I_{k} is the set of intBCs for the clone l_{k} and p(j,k) is the prevalence rate of the intBC j in l_{k}. We use δ(i,j) as an indicator function for whether or not we observed the intBC j in the cell c_{i}. Intuitively, this membership statistic is a weighted similarity for how well the cell fits into each clone, where we are weighting by how much we are able to trust the intBC that is observed in the cell. To put all on the same scale, we normalize by total membership per cell, resulting in our final statistic, \(m^{\prime }_{i,k} = \frac {m_{i, k}}{\sum _{k'=0}^{k} m_{i, k'}}\). We then filter out doublets whose m^{′} for their classified clone falls below a certain threshold.
Simulation of doublets
We simulated two datasets to test our methods for identifying doublets and to find the optimal criterion on which to filter out doublets. To test this strategy, we took a single clone from our final Allele Table (the table relating all cells and their UMIs to clones) and formed 200 doublets by combining the UMIs from two cells. We generated 20 of these datasets and noted which cells were artificially introduced doublets.
Contrary to the strategy for simulating doublets from the same clone, we created artificial “inter” doublets from the final Allele Table by combining doublets from two different clones. Similarly, we generated 20 synthetic datasets each with 200 of these artificial doublets.
Identification of decision rule
To identify the optimal decision rule for calling both types of doublets, we tested decision rules ranging from 0 to 1.0 at 0.05 intervals and calculated the precision and recall at each of these rules. Taking these results altogether, we provide an optimal decision rule where the Fmeasure (or the weighted harmonic mean of the precision and recall) of these tests is maximal.
Algorithmic approaches for phylogenetic reconstruction
One way to approach the phylogenetic inference problem is to view each target site as a “character” that can take on many different possible “states” (each state corresponding to an indel pattern induced by a CRISPR/Cas9 edit at the target site). Formally, these observations can be summarized in a “character matrix,” M∈R^{n,m}, which relates the n cells by a set of characters χ={χ_{1},...,χ_{m}} where each character χ_{i} can take on some k_{i} possible states. Here, each sample, or cell, can be described as a concatenation of all of their states over characters in a “character string.” From this character matrix, the goal is to infer a tree (or phylogeny), where leaf nodes represent the observed cells, internal nodes represent ancestral cells, and edges represent a mutation event.
We first propose an adaption of a slow, but accurate, Steiner tree algorithm via integer lineage programming (ILP) to the lineage tracing phylogeny problem. Then, we propose a fast, heuristicbased greedy algorithm which simultaneously draws motivation from classical perfect phylogeny algorithms, and the fact that mutations can only occur unidirectionaly from the unmutated, or s_{0} state. Lastly, we combine these two methods and present a hybrid method, which presents better results than our greedy approach, yet remains feasible to run over tens of thousands of cells.
Adaptation to Steiner tree problem
Steiner trees are a general problem for solving for the minimum weight tree connecting a set of target nodes. For example, if given a graph G=(V,E) over some V vertices and E edges, finding the Steiner tree over all v∈V would amount to solving for the minimum spanning tree (MST) of G. While there exist polynomial time algorithms for the minimum spanning tree, the general Steiner tree problem, where the set of targets T⊆V is designated, is NPhard.
Previously, Steiner trees have been suggested to solve for the maximum parsimony solution to the phylogeny problem. Here, the graph would consist of all possible cells (both observed and unobserved) and each edge would consist of a possible evolutionary event connecting two states (e.g., a mutation). Generally, given a set of lengthl binary “characterstrings” (recall that these are the concatenation of all character states for a given sample), we can solve for the maximum parsimony solution by finding the optimal Steiner tree over the 2^{l} hypercube (i.e., graph). As a result, by converting our multistate characters to binary characters via one hot encoding, theoretically, we should be able to compute the most parsimonious tree which best explains the observed data. However, in practice, this method turns out to be infeasible, as we deal with hypercubes of size O(2^{mn}), where m is the number of characters and n is the number of states. In the following, we will propose a method for estimating the underlying search space, providing us with a feasible solvable instance and a formulation of an integer linear programming (ILP) problem to solve for the optimal Steiner tree.
Approximation of potential graph We first begin by constructing a directed acyclic graph (DAG) G, where nodes represent cells. We then take the source nodes, or nodes with indegree 0, of G, and for each pair of source nodes, consider the latest common ancestor (LCA) they could have had. This LCA has an unmutated state for character χ_{i} if they disagree across two source nodes, and the same state as the two source nodes if they agree in value. If the edit distance between these two cells is below a certain threshold d, we add the LCA to G, along with directed edges to the two source nodes, weighted by the edit distance between the parent and the source. We repeat this process until only one node remains as a source: the root.
One may think that this step explodes with O(n^{2}) complexity at each stage, where n is the number of source nodes in each prior stage, as we consider all pairs of source nodes. However, we note that the number of mutations per latest common ancestor is always less than both children, and therefore, we eventually converge to the root. Therefore, when dealing with several hundred cells, the potential graph is feasible to calculate.
Furthermore, to add scalability to the approximation of the Potential Graph, we allow the user to provide a “maximum neighborhood size” which will be used to dynamically solve for the optimal LCA distance threshold d to use. One may think of this as the maximum memory or time allowed for optimizing a particular problem. Since the size of the Potential Graph can grow quite large in regard to the number of nodes, we iteratively create potential graphs for various threshold d and at each step ensure that the number of nodes in the network does not exceed the maximum neighborhood size provided. If at any point the number of nodes does exceed this maximum size, we return the potential graph inferred for an LCA threshold of d−1.
Formulation of integer linear programming problem Given our initial cells, S, the underlying potential graph drawn from such cells, G, and the final source node, or root, r from G, we are interested in solving for \(\mathcal {T} = SteinerTree(r, S, G)\). We apply an integer linear programming (ILP) formulation of Steiner tree, formulated in terms of network flows, with each demand being met by a flow from source to target. Below, we present the integer linear programming formulation for Steiner tree. We use Gurobi [61], a standard ILP solver package
$${\begin{aligned} & \text{minimize} & \sum_{(u,v) \in E} d_{uv}^{b} \cdot w(u,v) \\ & \text{subject to} & \sum_{(u,v) \in E} d_{uv}  \sum_{(v,w) \in E} d_{vw} &= 0 && \forall v \!\notin\! S \cup \{r\} \\ & & \sum_{(r, w) \in E} d_{rw} &= S && \\ \end{aligned}} $$
$${\begin{aligned} & & \sum_{(u,s) \in E} d_{us} &= 1 && \forall s \in S \\ & & d_{uv}^{b} &\geq \frac{d_{uv}}{S} && \forall (u,v) \in E\\ & & d_{uv} &\in \{0,..,S\} && \forall (u,v) \in E\\ & & d_{uv}^{b} &\in \{0,..,1\} && \forall (u,v) \in E \end{aligned}} $$
Each variable d_{uv} denotes the flow through edge (u,v), if it exists; each variable \(d_{uv}^{b}\) denotes whether (u,v) is ultimately in the chosen solution subgraph. The first constraint enforces flow conservation, and hence that the demands are satisfied, at all nodes and all conditions. The second constraint requires S units of flow come out from the root. The third constraint requires that each target absorb exactly one unit of flow. The fourth constraint ensures that if an edge is used at any condition, it is chosen as part of the solution.
Below, we explicitly define the algorithm in pseudocode.
Stability analysis of the maximum neighborhood size parameter
To evaluate the stability of the userdefined maximum neighborhood size parameter, we assessed the accuracy of the reconstructions for parameters varying from 800 to 14,000. We used trees simulated under default conditions (400 samples, 40 characters, 40 states per character, 11 generations, 2.5% mutation rate per character, and a mean dropout rate of 17%). The accuracy of trees were compared to the tree generated with a parameter of 14,000 using the triplets correct statistic. We used 10 replicates to provide a sense for how stable a given accuracy is.
In addition to providing measures of accuracy, we also provide the optimal LCA threshold d found for a given maximum neighborhood size during the inference of these potential graphs. Using these analysis, we found that a maximum neighborhood size of 10,000 nodes seemed to be an ideal tradeoff between scalability and accuracy (as it is in the regime where accuracy saturates) for our default simulations. This corresponded to a mean LCA threshold, d, of approximately 5.
Heuristicbased greedy method
On perfect phylogeny and singlecell lineage tracing
In the simplest case of phylogenetics, each character is binary (i.e., k_{i}=2,∀i∈m) and can mutate at most once. This case is known as “perfect phylogeny” and there exist algorithms (e.g., a greedy algorithm by Dan Gusfield [30]) for identifying if a perfect phylogeny exists over such cells, and if so find one efficiently in time O(mn), where m is the number of characters and n are the number of cells. However, several limitations exist with methods such as Gusfield’s algorithm. One potential problem in using existing greedy perfect phylogeny algorithms for lineage tracing is that they require the characters to be binary. Indeed, if the characters are allowed to take any arbitrary number of states, the perfect phylogeny problem becomes NPhard. However, while the number of states (CRISPR/Cas9induced indels at a certain target site) in lineage tracing data can be large, these data benefit from an additional restriction that makes it more amenable for analysis with a greedy algorithm. Below, we show that because the founder cell (root of the phylogeny) is unedited (i.e., includes only uncut target sites) and that the mutational process is irreversible, we are able to theoretically reduce the multistate instance (as observed in lineage tracing) to a binary one so that it can be resolved using a greedy algorithm.
A second remaining problem in using these perfect phylogeny approaches is that we cannot necessarily expect every mutation to occur exactly once. In theory, it may happen that the same indel pattern is induced in exactly the same target site on two separate occasions throughout a lineage tracing experiment, especially if a large number of cell cycles takes place. A final complicating factor is that these existing greedy algorithms often assume that all character states are known, whereas lineage tracing data is generated by singlecell sequencing, which often suffers from limited sensitivity and an abundance of “dropout” (stochastic missing data) events.
The greedy algorithm We suggest a simple heuristic for a greedy method to solve the maximum parsimony phylogeny problem, motivated by the classical solution to the perfect phylogeny problem and irreversibility of mutation. Namely, we consider the following method for building the phylogeny: Given a set of cells, build a tree topdown by splitting the cells into two subsets over the most frequent mutation. Repeat this process recursively on both subsets until only one sample remains.
Formally, we choose to split the dataset into two subsets, O_{i,j} and \(\overline {O}_{i,j}\), such that O_{i,j} contains cells carrying mutation s_{j} in χ_{i}, and \(\overline {O}_{i,j}\) contains cells without s_{j} in χ_{i}. We choose i,j based on the following criteria:
$$i,j = \underset{i,j}{\text{argmax}}\, n_{i,j} $$
where n_{i,j} is the number of cells that carry mutation s_{j} in character χ_{i}. We continue this process recursively until only one sample exists in each subset. We note that this method operates over cells with nonbinary states, solving the first of problems addressed earlier.
A major caveat exists with methods such as the greedy method proposed by Gusfield, as well as the one proposed by us thus far: namely, they assume all character states are known (i.e., no dropout). However, in our practice, we often encounter dropout as a consequence of Cas9 cutting or stochastic, technical dropout due to the dropletbased scRNAseq platform. To address this problem in our greedy approach, during the split stage, these cells are not initially assigned to either of the two subsets, O_{i,j} or \(\overline {O}_{i,j}\). Instead, for each individual sample which contains a dropped out value for chosen split character χ_{i}, we calculate the average percentage of mutated states shared with all other cells in O_{i,j} and \(\overline {O}_{i,j}\) respectively, and assign the sample to the subset with greater average value.
Appending the dropout resolution stage with the initial split stage, we present our greedy algorithm below in its entirety.
Overall, this method is very efficient and scales well into tens of thousands of cells. Below, we show via proof that this algorithm can find perfect phylogeny if one exists.
CassiopeiaGreedy algorithm can solve multistate perfect phylogeny Here we show that while not required, Cassiopeia can solve the multistate perfect phylogeny problem optimally. Importantly, however, Cassiopeia’s effectiveness makes no assumption about perfect phylogeny existing in the dataset but rather leverages this concept to provide a heuristic for scaling into larger datasets.
To show how Cassiopeia’s greedy method can solve perfect phylogeny optimally, we begin by introducing a few clarifying definitions prior to the main theorem. We define M as the original n cells by n character kstate matrix (i.e., entries \(\in \{s_{0},\dots,s_{k1}\}\)). We say M has a zero root perfect phylogeny if there exists a tree \(\mathcal {T}\) over its elements and character extensions such that the state of the root is all zeros and every character state is mutated into at most once. In addition, we assume that all nonleaf nodes of \(\mathcal {T}\) have at least two children (i.e., if they only have one child, collapse two nodes into one node). Finally, we offer a definition for character compatibility:
Definition 1
(Character Compatibility). For a pair of binary characters, (χ_{1},χ_{2}), where the sets (O_{1},O_{2}) contain the sets of cells mutated for χ_{1} and χ_{2}, respectively, we say that they are compatible if one of the following is true:

O_{1}⊆O_{2}

O_{2}⊆O_{1}

O_{1}∩O_{2}=∅
This definition extends to multistate characters as well, assuming they can be binarized.
Before proving the main theorem, we first prove the following lemma:
Lemma 1
If M has a perfect phylogeny, then the most frequent character, mutation pair appears on an edge from the root to a direct child node.
Proof
WLOG let χ_{i}:s_{0}→s_{j} denote the maximally occurring character, mutation pair within M. Suppose by contradiction that this mutation does not appear on an edge directly from root to a child, but rather on some edge (u,v) that is part of a subtree whose root r^{∗}, is a direct child of the root. As r^{∗} has at least two children, this implies that the mutation captured from the root to r^{∗} must be shared by strictly more cells than χ_{i}:s_{0}→s_{j}, thereby reaching a contradiction on χ_{i}:s_{0}→s_{j} being the maximally occurring mutation. □
Theorem 1
The greedy algorithm accurately constructs a perfect phylogeny over M if one exists.
Proof
We approach via proof by induction. As a base case, a single is trivially a perfect phylogeny over itself.
Now suppose by induction that for up to n−1 cells, if there exists a perfect phylogeny \(\mathcal {T}\) over such cells, then the greedy algorithm correctly returns the perfect phylogeny. Consider the case of n cells. By the above lemma, we know we can separate these n cells into two subsets based on the most frequent character, mutation pair χ_{i}:s_{0}→s_{j},O_{i,j} and \(\overline {O}_{i,j}\), where O_{i,j} contains cells with mutation s_{j} over χ_{i}, and \(\overline {O}_{i,j}\) = M−O_{i,j}. By induction, the greedy algorithm correctly returns two perfect phylogenies over O_{i,j} and \(\overline {O}_{i,j}\), which we can merge at the root, giving us a perfect phylogeny over n cells. □
Accounting for prior probability of mutations
In most situations, the probability of mutation to each distinct state may not be uniform (i.e., character χ_{1} mutating from the unmutated state s_{0} to state s_{4} may be twice as likely as mutating to state s_{6}). Therefore, we incorporate this information into choosing which character and mutation to split over based on the following criteria:
$$i,j = \underset{i,j}{\text{argmin}}\, p_{i}(s_{0}, s_{j})^{f(n_{i,j})}$$
where p_{i}(s_{0},s_{j}) is the probability that character χ_{i} mutates from the unmutated state s_{0} to s_{j} and f(n_{i,j}) is some transformation of the number of cells that report mutation j in character i that is supposed to reflect the future penalty (number of independent mutations of character i to state j) we will have to include in the tree if we do not pick i,j as our next split. After a comparison of 5 different transformations (Additional file 1: Fig S15), we find that f(n_{i,j})=n_{i,j} gives the best performance, leaving us with the following criteria for splittings:
$$i,j = \underset{i,j}{\text{argmin}}\, p_{i}(s_{0}, s_{j})^{n_{i,j}}$$
A hybrid method for solving singlecell lineage tracing phylogenies
Due to the runtime constraints of the Steiner tree method, it is infeasible for such method to scale to tens of thousand of cells. Therefore, we build a simple hybrid method which takes advantage of the heuristic proposed in the greedy algorithm and the theoretical optimality of the Steiner tree method. Recall that in the greedy method, we continued to choose splits recursively until only one sample was left per subset. In this method, rather than follow the same process, we choose a cutoff for each subset (e.g., 200 cells). Once a subset has reached a size lower than the said cutoff, we feed each individual subset into the Potential Graph Builder and Steiner tree solver, which compute an optimal phylogeny for the subset of cells. After an optimal subtree is found, we merge it back into the greedy tree. Therefore, we build a graph whose initial mutations are chosen from the greedy method and whose latter mutations are chosen more precisely via the Steiner tree approach.
Below, we present a pseudocode algorithm for the hybrid method. We note the slight difference in greedy from before. Namely, greedy additionally accepts a cutoff parameter and, in addition to returning a network built up to that cutoff, returns all subsets that are still needed to be solved.
This approach scales well when each instance of Steiner tree is ran on an individual thread, and thus often takes only a few hours to run on several thousand cells.
Theoretical analysis of parallel evolution
Estimating first and second moments of double mutations
Expected number of double mutations Under the framework of our simulation, we assume that each at each generation, every cell divides, and then each character of each cell undergoes random mutation independently. Let p be the probability that a particular character mutates and q be the probability the character took on a particular mutated state given that it mutated. Let T be the true phylogenetic tree over the samples. According to our model, T must be a full binary tree, and the samples are leaves of T. Let X be the total number of times a particular mutation occurred in the T. Let X_{u,v} be an indicator variable for edge (u,v) such that:
$$X_{u,v} = \left\{ \begin{array}{ll} 1 & \text{if a mutation occurs on edge } (u,v)\\ 0 & \text{otherwise} \end{array}\right. $$
Let h be the height of the T, which is equalled to the number of generations. If v is at depth d in T, then the probability that a mutation occurs at (u,v) is pq(1−p)^{d−1}. Since there are 2^{d} nodes at depth d, we have:
$$ \begin{aligned} E(X) &= \sum_{(u,v)\in T}E(X_{u,v})\\ &= \sum_{d = 1}^{h}2^{d}pq(1p)^{d1} \\ &= \frac{2pq((22p)^{h}1)}{12p} \end{aligned} $$
(1)
Let n=2^{h} be the number of cells in our sample. If p>0.5,E(X)≤2pq/(2p−1), if p=0.5,E(X)=2pqh=O(logn), and if \(p < 0.5, E(X) = O(n^{\frac {1}{\log _{2}{22p}}})\). Moreover, for fixed h, E(X) has a single peak for p∈[0,1], meaning that it increases with p for sufficiently small values of p, and always increases with q. Intuitively, this is because E(X) is small if (1) p is small enough that the character never mutates much throughout the experiment or (2) p is large enough that most mutations occur near the top of the tree, resulting in the extinction of unmutated cells early in the experiment. While E(X) peaks for values of p in between, it is always directly proportional to q because X is simply equalled to q time the number of times the character mutated.
Variance of double mutations We can compute the variance as:
$$\begin{array}{*{20}l} Var(X) &= E(X^{2})  E(X)^{2}\\ &= 2\sum_{(u,v)\neq(u',v')}E(X_{u,v}X_{u',v'}) + E(X)  E(X)^{2} \end{array} $$
To compute \(\phantom {\dot {i}\!}E(X_{u,v}X_{u',v'})\), we note that for a given pair of edges (u,v) and (u^{′},v^{′}), such that LCA(u,u^{′}) is at depth d, u is at depth d+l, and u^{′} is at depth l+k, the probability that a mutation occurred on both edges is p^{2}q^{2}(1−p)^{d+l+k}. Thus, we have:
$${\begin{aligned} \sum_{(u,v)\neq(u',v')}E(X_{u,v}X_{u',v'}) &= \sum_{d=0}^{h1}2^{d}\sum_{k=0}^{hd1}\sum_{l=0}^{hd1}2^{l+k}p^{2}q^{2}(1p)^{d+l+k}\\ &= p^{2}q^{2}\sum_{d=0}^{h1}(22p)^{d}(\sum_{k=0}^{hd1}(22p)^{k})^{2}\\ &= \frac{p^{2}q^{2}}{(2p1)^{2}}\sum_{d=0}^{h1}(2p2)^{d}((2p2)^{(hd)}1)^{2}\\ &\leq \frac{p^{2}q^{2}}{(2p1)^{2}}\sum_{d=0}^{h1}(2p2)^{2hd}\\ &= (2p2)^{h+1}\frac{p^{2}q^{2}}{(2p1)^{2}}\sum_{d=0}^{h1}(2p2)^{d}\\ &\leq \frac{p^{2}q^{2}(2p2)^{2h+1}}{(2p1)^{3}} \end{aligned}} $$
Thus, we can bound the variance as follows:
$$ {\begin{aligned} Var(X) &\leq \frac{2p^{2}q^{2}(2p2)^{2h+1}}{(2p1)^{3}} + \frac{2pq\left(1(22p)^{h}\right)}{2p1} \\&\quad \frac{4p^{2}q^{2}(1(22p)^{h})^{2}}{(2p1)^{2}} \end{aligned}} $$
(2)
This means that in the case that p>0.5:
$$Var(X) \leq \frac{2p^{2}q^{2}}{(2p1)^{3}} + \frac{2pq}{2p1}  \frac{4p^{2}q^{2}}{(2p1)^{2}}$$
In the case that p=0.5:
$$Var(X) = O(h^{3}) = O(\log^{3}(n))$$
In the case that p<0.5:
$$Var(X) = O(n^{\frac{2}{\log_{2}{22p}}})$$
Least squares linear estimate and negative correlation between frequency and the number of double mutations
To justify the greedy, we must show that if a mutation occurs frequently, then it is likely to have occurred less times throughout the experiment. Let Y be the frequency of a particular mutation in the samples. We estimate X given Y using the least squares linear estimate (LLSE) as follows:
$$ L(XY) = E(X) + \frac{CoV(X,Y)}{Var(Y)}(YE(Y)) $$
(3)
Since CoV(X,Y)=E(XY)−E(X)E(Y), we need only to compute E(XY), which we do by expressing X and Y in terms of the same indicators:
$$Y = \frac{1}{2^{h}}\sum_{(u,v) \in T} 2^{\text{depth}(v)}X_{u,v}$$
As a sanity check, it can easily be verified that E(Y)=q(1−(1−p)^{h}) by computing E(Y) using these indicators:
$$\begin{array}{*{20}l} E(Y)&=2^{h}\sum_{d=1}^{h}2^{d}(1p)^{d1}pq*2^{hd}\\ &=pq\sum_{d=1}^{h}(1p)^{d1}\\ &=q(1(1p)^{h})\\ \end{array} $$
Thus, we can compute E(XY) similar to how we computed E(X^{2}) for variance.
$$ \begin{aligned} E(XY) &= 2^{h}E((\sum_{(u,v)\in T}X_{u,v})(\sum_{(u,v)\in T}2^{\text{depth}(v)}X_{u,v}))\\ &=2^{h}\Big(2\sum_{(u,v)\neq (u',v')}2^{\text{depth}(v)}E(X_{u,v}X_{u',v'})+\sum_{(u,v)\in T}2^{\text{depth}(v)}E(X_{u,v}^{2})\Big)\\ &= 2*2^{h}\sum_{d=0}^{h1}2^{d}\sum_{k=0}^{h1}\sum_{l=0}^{h1}2^{l+k}p^{2}q^{2}(1p)^{d+l+k}*2^{hdl1} + E(Y)\\ &= p^{2}q^{2}\sum_{d=0}^{h1}\sum_{k=1}^{hd1}\sum_{l=0}^{hd1}(1p)^{d}(22p)^{k}(1p)^{l} + E(Y)\\ &=\frac{pq^{2}}{12p}\sum_{d=0}^{h1}(22p)^{hd}1)(1(1p)^{hd})(1p)^{d} + E(Y)\\ &=\frac{pq^{2}}{12p}\Big(2(22p)^{h}(12^{h})\frac{(22p)(1p)^{h}((22p)^{h}1)}{12p} \\ & \quad \quad \quad \quad  \frac{1(1p)^{h}}{p}+h(1p)^{h}\Big) + E(Y)\\ \end{aligned} $$
(4)
Assuming that is \(p < 11/\sqrt {2} \approx 0.29\) (based on our estimation of Cas9cutting rates, this seems to be a biologically relevant probability), we have:
$$\begin{array}{*{20}l} {\lim}_{h\rightarrow \infty}CoV(X,Y) &= \Big(2\frac{22p}{12p}\Big)\frac{pq^{2}(2(1p)^{2})^{h}}{12p}\\ &= \infty \end{array} $$
since 2<(2−2p)/(1−2p) when p<0.5. Var(Y) can be computed using the same indicators:
$$ {\begin{aligned} Var(Y) &= 2\sum_{i,j}E(Y_{i}Y_{j}) + \sum_{i}E(Y_{i}^{2})  E(Y)^{2}\\ \sum_{i,j}E(Y_{i}Y_{j}) &= 2^{2h}\sum_{d=0}^{h1}2^{d}(1p)^{d}(\sum_{k=0}^{hd1}2^{k}(1p)^{k}pq*2^{hdk1})^{2}\\ &=\frac{q^{2}}{4}\sum_{d=0}^{h1}(\frac{1p}{2})^{d}(\frac{1(1p)^{hd}}{p})^{2}\\ &=\frac{q^{2}}{4}\sum_{d=0}^{h1}(\frac{1p}{2})^{d}  \frac{2(1p)^{h}}{2^{d}}+ \frac{(1p)^{2h}}{(22p)^{d}}\\ &= \frac{q^{2}}{4}\Big(\frac{2(1(\frac{1p}{2})^{h})}{1+p}4(1p)^{h}(12^{h}) + \\ & \quad \quad \quad \quad \quad \frac{(22p)(1p)^{2h}(1(\frac{1}{22p})^{h})}{12p}\Big) \end{aligned}} $$
(5)
$$\begin{array}{*{20}l} \sum_{i}E(Y_{i}^{2}) &= 2^{2h}\sum_{d=1}^{h}2^{d}(1p)^{d1}pq*2^{2(hd)}\\ &=\frac{pq}{2}\sum_{d=0}^{h1}(\frac{(1p)}{2})^{d}\\ &=\frac{pq(1(\frac{1p}{2})^{h})}{1+p} \end{array} $$
Note that if p<0.5, every term in Var(Y) converges to a constant as h→∞. Thus, if (1−p)^{2}>0.5, then as the depth increases, X and Y become exponentially more negatively correlated. This means that for biologically relevant values of p, the frequency of a mutation in the samples is negatively correlated with the number of times the mutation occurred, thus justifying the rationale of splitting the sample on more frequently occurring mutations.
Simulation for tracking the evolution of a particular mutation
To more efficiently simulate the number of occurrences of a particular mutation, we define {N_{1},N_{2},...N_{h}} as a Markov chain, where N_{t} is the number of unmutated cells at generation t, and N_{1}=1. Let A_{t}∼Bin(2N_{t},p) be the number of cells that mutates at generation t and B_{t}∼Bin(A_{t},a) be the number of mutated cell that took on the particular state in question. The Markov chain evolves as N_{t+1}=2N_{t}−A_{t}. Note that we assume, in this model, that mutation can only occur after cell division. Thus, we have \(X = \sum _{t=1}^{h}B_{t}\) and \(Y = \sum _{t=1}^{h}2^{th}B_{t}\).
Assessing the precision of greedy splits
To assess the precision of greedy splits, we first simulated 100 true phylogenies of 400 cells (without dropout) for all pairs of parameters in num_states={2,10,40} and p_{c}ut={0.025,0.1,0.4}. For each network, we assessed the precision of the greedy split as follows:

1
We used the criteria i,j= arg max_{i,j}n_{i,j} to select the character χ_{i} and state j to split on (as CassiopeiaGreedy would do). This group of cells that have a mutation j in character χ_{i} is called G.

2
For defining the a set of n subsets corresponding to cells that inherited the (character, state) pair (i,j) independently using the true phylogenies, and call this set S=(s_{1},s_{2},...,s_{n}) (this corresponds to there being n parallel evolution events for the (character, state) pair (i,j).

3
We presume that the largest group of cells in S is the “true positive” set (let this be defined as s^{′}= arg max_{s}s_{i}). We then define the precision P as the proportion of true positives in the set G—i.e., \(P = \frac {s'}{G}\).
Statistics for IVLT analysis
Meta purity statistic
To calculate the agreement between clades (i.e., the leaves below a certain internal node of the tree) and some meta value, such as the experimental plate from which a sample came from, we can employ a chisquared test. Specifically, we can compute the following statistic: considering some M clades at an arbitrary depth d, we find the count of meta values associated with each leaf in each clade, resulting in a vector of values m_{i} comprised of these metacounts for each clade i. We can form a contingency table summarizing these results, T, where each internal value is exactly m_{i,j}—the counts of the meta item j in clade i. A chisquared test statistic can be computed from this table.
To compare across different trees solved with different methods, we report the chisquared test statistic as a function of the number of clades, or degrees of freedom of the test.
Mean majority vote statistic
The mean majority vote statistic seeks to quantify how coherent each clade is with respect to its majority vote sample at a give depth. For a given clade with leaves L_{i} where L_{i}=n, where every leaf l_{i,j} corresponds to cell j in clade i has some meta label m_{j}, the majority vote of the clade is \(v = argmax_{m' \in M} \sum _{j \in n} \delta (j, m')\). Here, M is the full set of possible meta values and δ(m_{j},m^{′}) is an indicator function evaluating to 1 iff m_{j}=m^{′}. The membership of this clade is then simply \(\frac {\sum _{j \in n} \delta (m_{j}, v)}{n}\). Then, the mean membership is the mean of these membership statistics for all clades at a certain depth (i.e., if the tree was cut at a depth of d, the clades considered here are all the internal nodes at depth d from the root). By definition, this value ranges from \(\frac {1}{M}\) to 1.0.
As above, to compare across different trees solved with various methods, we report this mean membership statistic as a function of the number of clades.
Triplets correct statistic
To compare the similarity of simulated trees to reconstructed trees, we take an approach which compares the subtrees formed between triplets of the terminal states across the two trees. To do this, we sample ∼10,000 triplets from our simulated tree and compare the relative orderings of each triplet to the reconstructed tree. We say a triplet is “correct” if the orderings of the three terminal states are conserved across both trees. This approach is different from other tree comparison statistics, such as RobinsonFoulds [34], which measures the number of edges that are similar between two trees.
To mitigate the effect of disproportionately sampling triplets relatively close to the root of the tree, we calculate the percentage of triplets correct across each depth within the tree independently (depth measured by the distance from the root to the latest common ancestor (LCA) of the triplet). We then take the average of the percentage triplets correct across all depths. To further reduce the bias towards the few triplets that are sampled at levels of the tree with very few cells (i.e., few possible triplets), we modify this statistic to only take into account depths where there at least 20 cells. We report these statistics without this depth threshold in Additional file 1: Fig S8.
Allelic and phylogenetic distances
For the analysis in Fig. 4, we define two metrics to capture celltocell similarity: a normalized allelic distance and normalized phylogenetic distance. The normalized allelic distance is calculated as follows: for all target sites χ_{m}∈{χ_{1},...,χ_{M}} in a pair of cells c_{i} and c_{j}:

1
if state in χ_{m} is the same in c_{i} and c_{j}, continue

2
else if state in χ_{m} is 0 or missing in either c_{i} or c_{j} increment the allelic distance by 1

3
else increment the allelic distance by 2
Finally, the allelic distance for a pair of cells is normalized by 2∗M, where M is the number of target sites.
The phylogenetic distance is defined as simply the number of mutations separating the two cells c_{i} and c_{j} as implied by the tree (i.e., the number of mutations along the branches for the shortest path separating c_{i} and c_{j}). The normalized phylogenetic distances is this distance, divided by the diameter (defined as the maximum phylogenetic distance between all pairs of cells) of the tree.
Bootstrapping analysis
Bootstrapping was done using a custom function for sampling M target sites (i.e., characters) from an N×M character matrix with replacement and reconstructing trees from these bootstrapped samples. After performing tree inference, we collapsed “singles” using the collapse.singles function in the R package “ape.” For the purposes of our robustness analysis, we sampled B=100 trees from N=10 simulated trees and used the Transfer Bootstrap Expectation (TBE) [62] statistic for assessing branch support for each clade as implemented in Booster (available at https://github.com/evolbioinfo/booster/).
Application of CaminSokal
We applied CaminSokal using the “mix” program in PHYLIP [63] as done for reconstructions for McKenna et al. [5] and Raj et al. [6]. To use “mix,” we first factorized the characters into binary ones (thus ending up with \(\sum _{i} s_{i}\) binary characters total, where s_{i} is the number of states that character i presented). Then, we onehot encoded the states into this binary representation where every position in the binary string represented a unique state at that character. We thus encoded every cell as having a 1 in the position of each binary factorization corresponding to the state observed at that character. If the cell was missing a value for character i, the binary factorization of the character was a series of “?” values (which represent missing values in PHYLIP “mix”) of length s_{i}. Before performing tree inference, we weighted every character based on the frequency of nonzero (and nonmissing values) observed in the character matrix. After PHYLIP “mix” found a series of candidate trees, we applied PHYLIP “consense” to calculate a consensus tree to then use downstream.
Application of neighbor joining
We used Biopython’s neighborjoining procedure to perform all neighbor joining in this manuscript. We begun similarly to the CaminSokal workflow, first factorizing all of the characters into a binary representation. Then, we applied the neighborjoining procedure using the “identity” option as our similarity map.
Application of Cassiopeia
Reconstruction of simulated data
We used CassiopeiaILP with a maximum neighborhood size of 10,000 and time to converge of 12,600s. CassiopeiaHybrid used a greedy cutoff of 200, a maximum neighborhood size of 6000 and 5000 s to converge. CassiopeiaGreedy required no additional hyperparameters. Simulations with priors applied the exact prior probabilities used to generate the simulated trees.
Reconstruction of in vitro clones
For both CassiopeiaHybrid with and without priors, we used a cutoff of 200 cells and each instance of CassiopeiaILP was allowed 12,600 s to converge on a maximum neighborhood size of 10,000. CassiopeiaILP was applied with a maximum neighborhood size of 10,000 and a time to converge of 12,600 s.
Simulation of targetsite sequences for alignment benchmarking
To determine an optimal alignment strategy and parameters for our targetsite sequence processing pipeline, we simulated sequences and performed a grid search using Emboss’s Water algorithm (a local alignment strategy). We simulated 5000 sequences. For each sequence, we begun with the reference sequence and subjected it to multiple rounds of mutagenesis determined by a Poisson distribution with λ=3, and a maximum of 5 cuts. During each “cutting” event, we determined the outcomes as follows:

1
Determine the number of Cas9 proteins localizing to the target site in this iteration, where n_{cas9}∼min(3,Pois(λ=0.4)).

2
Determine the site(s) to be cut by choosing available sites randomly, where the probability of being chosen is \(p = \frac {1}{n_{\text {uncut}}}\) and n_{uncut} is the number of sites uncut on that sequence.

3
If n_{cas9}=1, we determined the type of the indel by drawing from a Bernoulli distribution with a probability of success of 0.75 (in our case, a “success” meant a deletion and a “failure” meant an insertion). We then determined by drawing from a negative binomial distribution as so: s∼min(30,max(1, NB(0.5,0.1))). In the case of an insertion, we added random nucleotides of size s to the cut site, else we removed s nucleotides.

4
In the case of n_{cas9}≥2, we performed a resection event where all nucleotides between the two cut sites selected were removed.

5
After a cut event, we appended the result of the Cas9 interaction to a corresponding CIGAR string
Our Water simulations were exactly 300 bp, possibly extending past the PolyA signal, as would be the case reading off a Novaseq sequencer.
Upon simulating our ground truth dataset, we performed our grid search by constructing alignments with Water with a combination of gap open and gap extension penalties. We varied the gap open penalties between 5 and 50 and gap extension penalties between 0.02 and 2.02.
To score resulting alignments, we compared the resulting CIGAR string to our ground truth CIGAR string for each simulated sequence. To do so, we first split each cigar string into “chunks,” corresponding to the individual deletions or insertions called. For example, for some CIGAR string 40M2I3D10M, the chunks would be 40M,2I,3D, and 10M. Then, beginning with a max score of 1, we first deducted the difference between the number of chunks in the ground truth and the alignment. Then, in the case where the number of chunks were equal between ground truth and alignment, we deducted the percent nucleotides that differed between CIGARs. For example, if the ground truth was 100M and the alignment gave 95M, the penalty would be 0.05.
To find the optimal set of parameters, we selected a parameter pair that not only scored very well, but also located in the parameter space where small perturbations in gap open and gap extension had little effect.
Simulation of lineages for algorithm benchmarking
We simulated lineages using the following parameters:

1
The number of characters to consider, C

2
The number of states per character, S

3
The dropout per characters, d_{c} ∀c∈C

4
The depth of the tree (i.e., the number of binary cell division), D

5
The probability that a site can be mutated, p. This is a general probability of cutting

6
The rate at which to subsample the data at the end of the experiment, M
To simulate the tree, we begin by first generating the probability of each character mutating to a state, here represented as p_{c}(0,s),∀s∈S. In order to do this, we fit a spline function to the inferred prior probabilities from the lineage tracing experiment (refer to the section entitled “Determining prior probabilities of indel formation” for information on how we infer prior probabilities). We then draw S values from this interpolated distribution. We then normalize these mutation rates to sum to p, therefore allowing in general a p probability of mutating a character and 1−p probability of remaining uncut. In the case of the “state distribution” simulations (Additional file 1: Fig S13), we say that p_{c} is distributed as:
$$p_{c} = \theta * Unif(0,1) + (1  \theta) * F'(x) $$
where F^{′}(x) is the interpolated empirical distribution and θ is the mixture component.
Then, we simulate D cell divisions, where each cell division consists of allowing a mutation to take place at each character with probability p. In the case a mutation takes place, we choose a state to mutate to according to their respective probabilities. Importantly, once a character has been mutated in a cell, that character cannot mutate again.
At the end of the experiment, we sample M percent of the cells resulting in 2^{D}∗M cells in the final lineage.
We find that this method for simulating lineages (in particular the method for generating a set of priors on how likely a given state is to form) is able to closely recapitulate observed lineages (Additional file 1: Fig S6).
Metrics for comparing simulations to empirical data We used three metrics of complexity to compare simulated clones to real clones:

Minimum compatibility distance: For every pair of character, we define the minimum compatibility distance as the minimum number of cells to be removed to obtain compatibility (Def. 1).

Number of observable states per cell: The number of nonzero or nonmissing values for each cell, across all characters (i.e., the amount of data that can be used for a reconstruction, per cell).

Number of observable states per character: The number of nonzero or nonmissing values across for each character, across all cells.
Parallel evolution simulations for greedy benchmarking
As shown above, our greedy approach should accurately reconstruct a lineage if a perfect phylogeny exists. In order to better quantify how much our greedy algorithm’s performance is affected by parallel mutations, we decided to simulate “near perfect phylogenies,” whereby we first began by simulating a perfect phylogeny, and afterwards introduced double mutated characters.
Specifically, we begin by simulating perfect phylogenies with 40−k characters. We then fix a depth, d, and sample a node from said depth. We choose two grandchildren randomly from this node (one from each child) and introduce the same mutation on each of the edges from each child to grandchild, thereby violating the perfect phylogeny. We repeat this process k times. This thus creates an analysis, as presented in Additional file 1: Fig S5, whereby accuracy can be evaluated as a function of both depth of parallel evolution, d, and the number of events that occurred, k.
Simulation of “base editor” technologies
We used the simulation framework described above to simulate base editor technologies. To explore the trade off between the number of states and the number of characters, we simulated trees with 40,50,80, and 100 characters while maintaining the product of characters and states equal at 400 (thus, we had trees of 10,8,5, and 4 states per character, respectively). The dropout per character was set to 10%, the mutation rate per character was set to 1.04% (a previously observed mutation rate [46]), and 400 cells were sampled from a tree of depth 10. For each character/state regime, we generated 10 trees for assessing the consistency of results. We use a negative binomial model (∼NB(5,0.5)) as the editing outcome distribution (i.e., state distribution).
Simulation of “phased recorder” technologies
To simulate the phased recorder, we used 5 different experiments varying mutation rates across 50 characters and 10 states per character. In each experiment, we chose a mutation rate for each character from one of 10 regimes, each differing in their relationship to the base mutation rate p_{0}. To systematically implement this, mutation rate for χ_{i} is described as such:
$$m_{i} = p_{0} * (1 + e_{j} * \lfloor\frac{i}{5}\rfloor) $$
where p_{0}=0.025 and e_{j} is an experiment scalar in e={0,0.05,0.1,0.25,0.5}. This means that for characters 1−5,m_{i}=p_{0}; for characters 6−10,m_{i}=e_{j}p_{0}; for characters 11−15,m_{i}=2e_{j}p_{0}; etc. To summarize each experiment, we provide the ratio between the maximum and minimum mutation rates, which is by definition 1+10r_{j} (because we had 50 characters). We compare two models of indel formation rates—the first being a negative binomial model (∼ NB(5,0.5)) and the second being the spline distribution fit from empirical data.
We simulated 10 trees per regime and reconstructed trees with Cassiopeia with and without priors.
Reconstructions of GESTALT datasets
We downloaded data corresponding to the original GESTALT study [5] and the more recent scGESTALT study from https://datadryad.org/resource/doi:10.5061/dryad.478t9 and GSE105010, respectively. We created character matrices for input into Cassiopeia by creating pivot tables relating each cell the observed indel observed at each one of the 10 tandem sites on the GESTALT recorder. We then reconstructed trees from these character matrices using one of five algorithms: CaminSokal (used in the original studies), neighbor joining, Cassiopeia’s greedy method, Cassiopeia’s Steiner tree method, and Cassiopeia’s hybrid method.
For each reconstruction, we record the parsimony of the tree, corresponding to the number of mutations that are inferred along the reconstructed tree. We display these findings in Fig. 6a, where we have Znormalized the parsimonies across the methods for each dataset to enable easier visualization of relative performances.
Visualization of trees
To visualize trees, we use the iTOL interface [64]. Colors in the heatmap denote a unique mutation, gray denotes an uncut site, and white denotes dropout.