Measuring the accuracy of genome-size multiple alignments

A novel computational approach can assess the accuracy of genomic alignments, and reveals suspicious regions in the 17-vertebrate MULTIZ alignment available on the UCSC Genome Browser.


Background
With the rapid sequencing of so many related genomes, comparative genomics has emerged as one of the most important areas of computational biology. The workhorse of comparative genomics is the multiple sequence alignment, particularly whole-genome multiple sequence alignments such as those provided by the UCSC Genome Browser [1]. These alignments are marvelous tools for anyone working in comparative genomics. More and more sophisticated analyses rely implicitly on the correctness of these alignments. For example, it is already standard practice to search for functional genomic elements (more precisely, those constrained by purifying selection) by scanning a whole-genome alignment, looking for regions that are better conserved across the species than expected [2][3][4][5][6][7][8][9][10][11][12].
When such methods find surprisingly well conserved sites across all aligned species, that portion of the alignment is likely to be correct. Conversely, in regions where the sequences are misaligned, these methods may fail to find conserved sites that exist. Even the designers of the alignment algorithms and genome browsers would not claim that the alignments are correct at all sites across entire genomes. How can users decide which portions of the alignment are trustworthy and which portions less so, particularly in noncoding regions?
We present a method to assess a whole-genome multiple sequence alignment, classifying it into well aligned and suspiciously aligned regions. Before carrying out any further analysis that relies on the alignment's correctness, such as those listed above, the user should be aware of possible misalignment in those regions classified as suspicious. In addition, efforts should be made to either realign the regions with suspicious alignments or increase the confidence in their current alignments by other evidence.
Without any well established methodology for estimating the quality of the whole-genome alignments, scientists have either trusted the alignments completely or developed their own filters. Such filters usually involve a conservation threshold (for example, a high phastCons conservation score [8,10]) and thus filter out most of the genomic sequence, limiting the scope of their analysis to only the regions that have high sequence similarity. Our work is the first that assigns alignment 'quality scores' to all portions of the whole-genome alignments. Using these, we have a measure of the level of trust in any region of the alignment, including those that have low sequence similarity. Thus, any comparative analysis can be carried out on much more of the genome than just those regions with high sequence similarity. Also, for the first time, we have a quantitative measure of how well a multiple alignment tool such as MULTIZ (the alignment tool used by the UCSC Genome Browser) performs on a whole-genome scale.
We note in this regard that the UCSC whole-genome multiple alignments are already annotated with phastCons conservation scores [9]. However, these scores serve a different purpose than our alignment quality scores: phastCons measures how well conserved each column of aligned residues is, under the assumption that the alignment is correct, that is, that aligned residues are indeed orthologous. The same statement applies to the measures 'binomial p-value' and 'parsimony pvalue' proposed by Margulies et al. [5] and GERP [3] and Gumby [13] scores. In contrast, the purpose of our alignment quality scores is to assess the likelihood that the alignment is correct. So, for instance, a low phastCons score in an alignment segment that also has a poor alignment quality score does not necessarily mean poor conservation across the species, since the alignment may well be incorrect in this region. The same is true for the other four conservation scores mentioned above.
The statistics of local pairwise alignments [14,15] are well understood and have been extended to local multiple alignments [16]. These statistics are based on a few well conserved parts of the alignment, for example, protein domains, and report the statistical significance of these parts only. This cannot be used for our task on whole-genome alignments, because a whole-genome (or whole-chromosome) alignment will nearly always contain some very strongly conserved regions [2]. Thus, only these regions would be reported as statistically significant, with no information given about the remainder of the alignment. Unfortunately, there has been no methodology that measures the accuracy of every region of a multiple sequence alignment separately. (Yu and Smith [17] proposed a method based on hidden Markov models for doing this in pairwise alignments.) Solving this problem is critical in whole-genome alignments, because the search space for good alignments is huge due to genome rearrangements. Thus, there is a high probability that some regions will be misaligned even if most of the alignment is correct.
In previous work [16] we presented a tool, StatSigMA, to assess whether a multiple sequence alignment is contaminated with one or more unrelated sequences. In this work, we extend those ideas to the tool StatSigMA-w, which performs this assessment for every region of a multiple sequence alignment, and show its application to whole-chromosome alignments.
There are different ways that alignments can be incorrect, in the sense of not pairing all and only orthologous residues. There can be orthologies that are not aligned (false negatives of the alignment algorithm) and nonorthologies that are aligned (false positives of the alignment algorithm). Stat-SigMA-w is designed to measure the latter. These false positives can be further classified as 'small' errors (for example, one residue placed at the wrong end of a gap) or 'large' errors (for example, a region of 100 residues of one sequence aligned to regions of other sequences to which it is unrelated). Because StatSigMA is designed to detect contamination of a multiple sequence alignment by unrelated sequences, the focus of this work is on the latter, 'large' errors.
The UCSC Genome Browser provides whole-genome alignments for vertebrates, insects, and yeast. Of these, we evaluate the quality of the 17-vertebrate MULTIZ [18]  In the Results section we focus on the alignment of human chromosome 1 as a proof of concept. We identify 9.7% (21 Mbp) of this alignment as suspiciously aligned. We present independent evidence to support our claim that the suspicious regions may be misaligned. Overall, our work confirms that MULTIZ is doing an impressive job of aligning sequences. However, there are instances when it fails and these can be identified using StatSigMA-w.

StatSigMA
The method used in this work extends our earlier tool Stat-SigMA [16], which assesses whether a multiple sequence alignment is contaminated with one or more unrelated sequences. There are several interesting problems in extending StatSigMA to assess the accuracy of all portions of a whole-genome alignment. We summarize the methodology of StatSigMA in this section and then describe its extension to StatSigMA-w in the subsequent sections.
Given a multiple alignment and a phylogeny relating its sequences, StatSigMA computes a p-value for each of several null hypothesis cases. There is one null hypothesis case for each branch k of the phylogeny. In the null hypothesis case k, branch k exhibits 'unrelated behavior', that is, the subalignments corresponding to the two subtrees separated by the removal of k are independent rather than homologous. The assumption made is that, after rejecting all these cases, we can infer that the multiple alignment shows all the sequences are related. The algorithm followed by StatSigMA is outlined below. These steps are described in detail elsewhere (A.P. and M.T., Assessing the discordance of multiple sequence alignments, unpublished work).

Algorithm followed by StatSigMA
Input: multiple sequence alignment and phylogeny.
1. For every branch k of the phylogeny do the following steps: (a) Create the log likelihood scoring function corresponding to unrelated behavior on branch k.
(b) Using the scoring function, estimate Karlin-Altschul parameters K k , λ k , and H k [14].
(c) Using the scoring function, identify the maximal scoring segments of the input multiple alignment [19].
(d) Using K k , λ k , and H k , identify the set of maximal scoring segments resulting in the least p-value p k of the score [20].
2. Output max k p k as the 'discordance' of the multiple sequence alignment.
Analogous to a p-value, the discordance is between 0 and 1, and the lower its value, the better the alignment (in the sense of not being contaminated with unrelated sequences).
The log likelihood scores in the first step take into account the phylogeny. These scores are computed as follows. Suppose we have S sequences in a multiple alignment related by a phylogenetic tree T (having branch lengths). Let γ 1 , γ 2 , ..., γ S be the residues observed at a particular column of the multiple alignment. Suppose we want to test the hypothesis that there is unrelated behavior on branch k. Suppose the removal of branch k separates T into subtrees t 1 (having residues β 1 , β 2 , ..., β i at the leaves) and t 2 (having residues β i+1 , β i+2 , ..., β S at the leaves). Let M be the evolutionary model. Then, analogous to the Karlin-Altschul log likelihood score [14], the score for observing this column of the multiple alignment is as follows: We precompute this score for all tuples at the leaves of the tree. The various probability terms in Equation 1 are computed using the dynamic programming algorithm of Felsenstein [21].

Appropriate phylogeny
Choosing the appropriate phylogeny is important for the accurate computation of p-values. In the case of pairwise alignments, without any information about the appropriate scoring matrix, BLAST [22] computes the p-value using multiple scoring matrices, and then corrects for multiple hypotheses. Using similar ideas, we use multiple trees to cover the variations in the rate of evolution across the genome. We start with the phylogeny estimated from four-way degenerate sites (that is, third codon positions) in the ENCODE [23] regions (Adam Siepel, personal communication), whose branch lengths correspond to neutral evolution. (This phylogeny is available on our worldwide web site [24].) We use three versions of this tree, one with the original branch lengths, one with branch lengths multiplied by 100 (to account for regions evolving faster than third codon positions), and one with branch lengths multiplied by 0.01 (to account for regions under purifying selection, for example, coding regions). Separate p-values are computed using each of the three trees, and the final score at a site is the minimum of the three, corrected for multiple hypotheses.

Appropriate segmenting
The MULTIZ whole-genome alignment is built with human as the reference species. This alignment consists of chains of aligned blocks. The aligned blocks themselves may contain only a subset of the species. For example, for some human sequence, the aligned block may contain sequence from chimp and not from fugu. Human, being the reference species, is present in every block. As in StatSigMA [16], the null hypothesis case k refers to branch k of the phylogeny demonstrating unrelated behavior, and thus splitting the tree into two subtrees, t 1 and t 2 , whose corresponding subalignments are independent. If a MULTIZ block does not contain any sequence from t 1 or does not contain any sequence from t 2 , then that block is not considered for branch k.
Using StatSigMA's score function and the algorithm to find maximal scoring segments by Ruzzo and Tompa [19], we can identify all maximal, positively scoring segments in the whole-genome alignment. Each such segment can be a high scoring slice of a MULTIZ block or can overlap multiple MUL-TIZ blocks. Thus, it is possible that neighboring columns of a maximal scoring segment may contain different sets of species. For the rest of the paper, the word 'segments' refers to these maximal positively scoring segments. The score sc k,S of segment S is simply the sum, over the columns of S, of the log likelihood score of each column given by Equation 1. Here we make the assumption that the scoring function precomputed for branch k is accurate even when only a subset of species is aligned, so that we can add scores of columns that contain different sets of species.

Appropriate context
The context for a local alignment is defined as the genomic sequences that were considered to create that alignment. For example, we may take genomic sequences of length 1 Mbp each from human and fugu, which may result in a local alignment of length 100 bp. Thus, the context for this 100 bp alignment is one million residues each for human and fugu.
Choosing the appropriate context is critical for the correct evaluation of p-values. A statistically significant alignment in a context of length L may be insignificant had the context length been 10L. (See Materials and methods for the exact dependence of the p-value on L.) In theory, as a multiple alignment tool proceeds to create an alignment, it can identify the context that it is using to output the alignment. However, it is unclear how to infer the original context from the completed alignment.
To solve this problem, we estimate the p-value of the score sc k,S of segment S using all possible contexts W and then maximize over W; see Materials and methods for details of the pvalue formulas. The reason for maximization over W is that, if even one context has a p-value greater than the threshold at which the null hypothesis is rejected, then the null hypothesis should not be rejected.
In order to evaluate the p-value in all possible contexts efficiently, we exploit strongly conserved genomic anchors. Throughout the genome, we expect to see very strongly conserved long segments (for example, certain coding exons). In fact, these segments are so highly conserved that their alignments to the orthologous regions in other species are statistically significant even if we compute the p-value of the score of this single segment in the context of the entire genome, that is, without any support from synteny. Once we have identified these anchors, for segments between anchors A and B, we need to consider only contexts that lie between A and B. This is because the p-values of the scores of the neighboring anchors are an upper bound on the p-value with respect to any context that contains them (see Materials and methods).
Once we have computed the p-value of a segment's score, we can also treat that segment as an anchor for the later stages (for the purposes of upper bounding the p-value of nearby segments whose p-values are still to be computed). This gives rise to a divide-and-conquer algorithm that simultaneously defines appropriate contexts and estimates p-values for all regions of any multiple alignment (see Materials and methods for details of the algorithm).

Identifying suspicious regions of the alignment
The p-value of a single site for a branch k is the p-value of the score of the segment that contains that site for the null hypothesis case k. After computing the p-value for every branch of the phylogeny and for every segment, we identify the branch (or branches, in case of a tie) having the greatest p-value for each site: this maximum p-value is the 'discordance' of that site. This method outputs the discordance of every aligned site, and the worst branch(es) for that site. Finally, at each site, the least discordance among the three trees (multiplied by 3 for Bonferroni correction) is the final discordance value. We take 0.1 as the threshold for identifying sites with high discordance, as discussed in the next section.
The scoring function of Equation 1 is based on an evolutionary model, including evolution of gaps as single residue insertions or deletions, as described in earlier work [16] (and A.P. and M.T., Assessing the discordance of multiple sequence alignments, unpublished work). Because such models reflect an incomplete understanding of evolution, they may cause the method described thus far to underestimate the statistical significance of certain regions, for example, those containing long insertions or deletions, or short low scoring alignments flanked by high scoring alignments. The latter problem is exacerbated by the fact that the extreme value distributon used to compute the p-values does not hold for very small contexts. Also, we are interested in studying longer regions that are misaligned rather than misalignments of a few residues. With these motivations, we consider only those regions that: are at least 50 bp in length; and have discordance at least 0.1 at all sites; with the same branch being the worst one, that is, this branch corresponds to the maximum p-value at each site. (This can result in overlapping regions corresponding to different branches, in the case of a tie for greatest p-value.) To eliminate regions that would be labeled insignificant due to long insertions and deletions, we filter out regions that contain more than 50% gap columns for either of the two subsets of species separated by the worst branch. The regions remaining are the ones that StatSigMA-w reports as having suspicious alignment. For the remainder of the paper, the term 'suspicious regions' refers to these remaining high discordance regions, and we denote the set of suspicious regions by ℜ. The suspicious regions that are attributable to any branch on the path from zebrafish (for example) to human are denoted by ℜ zebrafish ; these are regions where zebrafish (and possibly other species) appear misaligned to human.
The decisions incorporated in the definition of ℜ are purposely conservative. Any region labeled 'suspicious' has at least 50 consecutive alignment columns for which a single branch exhibits unrelated behavior. There are no long insertions or deletions with respect to this branch, so the label of 'suspicious' is not attributable solely to the manner in which the evolution of gaps is modeled.

Suspicious regions in chromosome 1
In this section, we present the results obtained by analyzing the MULTIZ 17-vertebrate alignment of chromosome 1 of the human genome. Chromosome 1 is 247 Mbp long, nearly 8% of the human genome. We took the MULTIZ alignment of this chromosome with the other 16 vertebrates, and computed pvalues for all branches of the three phylogenies and for all segments.
For four representative branches, Figure 1 plots the fraction of sites in maximal scoring segments of length at least 50 bp against the p-value of that segment's score. As can be seen, the graphs are bimodal, that is, most columns either have pvalue greater than 0.1 or less than 10 -4 . The graphs show that a threshold of 0.1 can be used to differentiate low p-value segment scores from high p-value segment scores, and this classification would hardly change if the threshold were instead 10 -2 or 10 -4 . Table 1 provides a breakdown of the residues of human chromosome 1 in ℜ with respect to the other species. For example, 28.5% of the human residues aligned to zebrafish have a suspicious zebrafish alignment (that is, belong to ℜ zebrafish ). Considering the human alignments containing mouse, approximately 3.3% have a suspicious alignment with mouse. In general, the nonprimates each have 1-5 Mbp in suspicious regions, and the percentage of residues in suspicious regions increases (approximately) with increasing phylogenetic distance from humans. This suggests increasing difficulty of correct alignment as this distance increases. In total, ℜ contains 185,452 regions, varying in size from 50 bp to 3,121 bp, with mean 115 bp. The total number of residues contained in ℜ is more than 21 Mbp and accounts for 9.7% of the aligned human chromosome 1.
The last column of Table 1 provides the number and percentage of residues in ℜ that have a phastCons score greater than 0.5. The UCSC phastCons [9] track measures the level of conservation of any site in an alignment, and a score greater than 0.5 means that the site is more likely to be in the conserved state than the nonconserved state. Considering, for example, the human residues aligned to zebrafish, it can be seen that 19% of the residues in ℜ zebrafish have high phastCons scores. In general, the high percentages for the nonmammals in the last column are likely to correspond to regions where the mammals are well conserved but the given nonmammal may be misaligned. We did a separate test of phastCons on synthetic data, using the same parameter values that were used to compute the conservation score for the 17-vertebrate MUL-TIZ alignment. If all but one sequence are identical, and the remaining sequence is less than 50% identical to the others, phastCons can still report a very high conservation score at every site, including those where the last sequence is not iden-tical. This is additional evidence that phastCons is not intended to identify misalignments. Figure 2 provides an example of a MULTIZ alignment block in ℜ zebrafish . Lower case letters in this multiple alignment indicate disagreement with the human sequence. StatSigMA-w's reported discordance is 1 at every site in this block, the branch incident on zebrafish being responsible for this high value. In contrast, the phastCons conservation score is 1 at every human site in this block, despite the lack of conservation in the zebrafish sequence. (Recall that a StatSigMA-w discordance of 1 suggests misalignment, whereas a phastCons score of 1 indicates strong conservation.) This exemplifies the discussion in the previous paragraph that a high phastCons score does not imply correct alignment. In this region, the high phastCons score is due to very strong conservation not only among the mammals, but including chicken as well. For zebrafish, a small percentage (16.9%) are aligned to human exons. This is in contrast to the fact that nearly 35% of the chromosome 1 alignments containing zebrafish lie in human exons. Only 2.4% of ℜ mouse regions are aligned to human exons, in contrast to the fact that nearly 6% of the chromosome 1 alignments containing mouse lie in human exons. This is consistent with the intuition that misalignments will be rarer in exons than in noncoding regions. The overlap of suspicious regions with coding sequences does point out, however, that misalignments can be misleading, especially if the alignment is being used to infer functional annotation in one species from another.

Independent evidence of misalignment
As described in the previous paragraph, 16.9% of suspicious regions in ℜ zebrafish overlap annotated human exons. Of these, 770 regions in ℜ zebrafish overlap the coding exons of 562 human genes. This suggests that the alignment is incorrectly aligning a human coding sequence and a sequence from zebrafish. To test this hypothesis, we took each human coding sequence aligned to a region in ℜ zebrafish , translated it using its annotated reading frame to yield its amino acid sequence, and used this as a BLASTX query against the aligned zebrafish nucleotide sequence. BLASTX translates the zebrafish nucleotide sequence in all six reading frames, aligns each of them to the human amino acid sequence, and reports the best E-value among all six pairwise alignments. Figure 4a plots the distribution of these E-values for ℜ zebrafish . To contrast this to the distribution for well aligned sequences, we took those parts of the alignment that include zebrafish, have discordance at most 10 -10 , and intersect some human coding exon. The distribution of BLASTX E-values for these parts is also plotted in Figure 4a. Finally, the distribution when the The data underlying Figure 4 can be used to obtain an estimate of the false positive rate of StatSigMA-w's prediction of suspicious regions. If we make the simplistic assumption Aligned portions of the 247 Mbp of human chromosome 1 with respect to each of the other 16 vertebrates. Column 2 shows the total number of residues of human chromosome 1 that are aligned to the other species, together with its percentage of 247 Mbp, the size of human chromosome 1. Column 3 shows the total number of residues that are in suspicious regions, together with its percentage of the corresponding number in column 2. The number of suspicious regions is shown in column 4. Column 5 shows the total number of residues that are in suspicious regions and have a phastCons score greater than 0.5, together with its percentage of the corresponding number in column 3.
Sample suspicious zebrafish alignment  (based on where the low discordance and suspicious curves cross in Figure 4a) that a BLASTX E-value at most 10 -4 indicates a correct alignment, then of the 770 predicted zebrafish suspicious regions aligned to a human exon, only 5.3% are in fact correctly aligned (compared to 93% of the low discordance regions).
In a more global sense, though, the MULTIZ coding alignments are better than Figure 4 suggests. When we repeat the BLASTX experiment on the entire annotated human exon rather than just the exon fragment aligned to the ℜ zebrafish region, the BLASTX E-value graph looks much more like the low discordance graph of Figure 4a. This suggests that MUL-TIZ usually aligns a homologous zebrafish exon, but misaligns the fragments of these exons that are in ℜ zebrafish .
To analyze these exonic regions using a different criterion, we searched the aligned zebrafish sequence in all three forward reading frames to compute the percentage of frameshifted codons (because of many insertions and deletions in the alignment). We used the reading frame that led to the least percentage of frameshifted codons. Figure 5 plots the distribution of this percentage for the regions in ℜ zebrafish . To contrast this to the distribution for well aligned sequences, we took those parts of the alignment that include zebrafish, have discordance at most 10 -10 , and intersect some human coding exon. This distribution is also plotted in Figure 5. Like Figure  4, Figure 5 clearly shows that the two distributions are very different, and the suspicious regions have much more dubious alignments than the low discordance regions.
Genomic locations of suspicious alignments for mouse and zebrafish

Results on simulated data
In order to measure further the sensitivity and specificity of StatSigMA-w's predictions of suspicious regions, we also performed experiments on simulated data. Such experiments have the advantage that we know exactly what is correctly aligned and what is misaligned. The simulated data were provided by Mathieu Blanchette (personal communication). They consist of a set of 25 correct multiple sequence alignments, each of length approximately 100 Kb. Each alignment results from the concatenation of two data sets obtained by simulation of evolution according to the procedure described by Blanchette et al. [18] and summarized as follows. Each simulation begins with a random ancestral sequence and lets it evolve along the branches of a given phylogeny until it produces sequences at the leaves. The phylogeny used is that for the nine mammals human, chimp, baboon, mouse, rat, dog, cat, cow, and pig. By keeping track of the mutations performed, it is straightforward to produce the true alignment of orthologous residues and gaps. These are the alignments from which our experiments began.
From each of these alignments we discarded all the gaps and aligned the resulting sequences using TBA. (TBA is a generalization of MULTIZ in which there is no special reference sequence. Both algorithms are described in the same paper [18].) By comparing this TBA alignment to the true alignment, we determined exactly where TBA misaligned residues, in the sense that such a residue in human is aligned to a nonorthologous residue in another species. Analogous to the definition of a suspicious region, define a 'misalignment region' as any contiguous set of 50 or more columns in the TBA alignment in which a single species' residues are misaligned to human in every column. Regions with more than 50% gap characters in the misaligned species were ignored.
We next submitted the TBA alignment to StatSigMA-w, which computed p-values and suspicious regions as described above. Our goal was to assess the specificity (fraction of suspicious regions actually misaligned) and sensitivity (fraction of misalignment regions reported as suspicious) for these StatSigMA-w predictions. The results are given in the four histograms of Figure 6. These histograms show averages over the 25 simulated data sets. Because TBA rarely misaligned chimp or baboon with human, histograms are given only for the other six species. Figure 6a,b presents specificity results, reporting information about the distribution of suspicious regions. For each suspicious region reported by StatSigMA-w, we measured the percentage of columns of that region that are actually misaligned. This percentage is shown on the horizontal axis.
In Figure 6a, the misalignment is required to be in the same species as a species responsible for the suspicious region; that is, if the suspicious region is attributable to branch k, then k must lie on the path from the misaligned species to human. In Figure 6b, the misalignment could be in any species. The reason for our interest in the latter is that it is sometimes the case that a misalignment in pig (say) may cause StatSigMA-w to Percentage of frameshifted codons Figure 5 Percentage of frameshifted codons. Distribution of ℜ zebrafish and low discordance (≤10 -10 ) zebrafish alignment regions intersecting some annotated human coding exons, plotted against the percentage of frameshifted zebrafish codons (in the region aligned to the human exon). report that cow appears suspiciously aligned, particularly if the cow sequence is sufficiently diverged from human. In any case, drawing attention to any misalignment by labeling it as suspicious should be helpful to the alignment user. For each misalignment region, we measured the percentage of columns of that region for which StatSigMA-w reports a pvalue at least 0.1. In Figure 6c, the high p-value must be for a branch separating the misaligned species from human. In Figure 6d, the high p-value can be for any branch.
The results shown in Figure 6 are quite encouraging. A very large fraction of reported suspicious regions have at least 90% of their columns misaligned and, conversely, for a very large fraction of misalignment regions StatSigMA-w reports at least 90% of their columns to have high p-values.

Conclusion
In this work, we analyzed the 17-vertebrate MULTIZ alignment of human chromosome 1. On the whole, MULTIZ is doing a good job of aligning orthologous sequences. However, 9.7% of the alignment is identified as suspiciously aligned by our method. We present independent evidence that many of these suspicious regions represent misalignments. This evidence is for coding sequences, which should be the simplest to align, and thus can be treated as a lower bound on the percentage of misalignments. Extrapolating to the entire genome, 3.2 Gbp × 9.7% = 310 Mbp of the alignment may be suspicious and should be reexamined.
It is possible that some of the regions that we report as suspiciously aligned may indeed be correctly aligned. The sequence similarity may be too low to establish confidence in the alignment, but it may still be correct. In such cases we need additional evidence to trust the alignment and use it as the basis for comparative sequence analyses.
Specificity and sensitivity for simulated data Figure 6 Specificity and sensitivity for simulated data. Results averaged over 25 simulated data sets, each of size approximately 100 Kb, that are aligned by TBA. (a) Specificity with respect to the same species: for each suspicious region reported by StatSigMA-w, the histogram shows the percentage of its columns for which the species reported as suspicious is actually misaligned by TBA. (b) Specificity with respect to any species: for each suspicious region reported by StatSigMA-w, the histogram shows the percentage of its columns for which any species is actually misaligned by TBA. (c) Sensitivity with respect to the same species: for each misalignment region, the histogram shows the percentage of its columns for which StatSigMA-w reports a p-value at least 0.1 for a branch attributable to the misaligned species. (d) Sensitivity with respect to any species: for each misalignment region, the histogram shows the percentage of its columns for which StatSigMA-w reports a p-value at least 0.1 for any branch. Step 8. Repeat Steps 4 to 7, until p-values have been computed for all positively scoring segments. Assign p-value 1 to any negatively scoring segments.
A point to note here is that, if a segment has a p-value p for a context W, then extending W to reach the nearest segment boundaries can only increase the p-value. Thus, for identifying the context with the greatest p-value in Step 5, we need only consider those contexts that lie on segment boundaries.