- Open Access
Split-alignment of genomes finds orthologies more accurately
Genome Biologyvolume 16, Article number: 106 (2015)
We present a new pair-wise genome alignment method, based on a simple concept of finding an optimal set of local alignments. It gains accuracy by not masking repeats, and by using a statistical model to quantify the (un)ambiguity of each alignment part. Compared to previous animal genome alignments, it aligns thousands of locations differently and with much higher similarity, strongly suggesting that the previous alignments are non-orthologous. The previous methods suffer from an overly-strong assumption of long un-rearranged blocks. The new alignments should help find interesting and unusual features, such as fast-evolving elements and micro-rearrangements, which are confounded by alignment errors.
Aim of genome alignment
If we compare two genome sequences, such as those of human and chimp, to see how they differ, then intuitively we wish to align the “equivalent” regions of the genomes. More precisely, we wish to align orthologs, which are descended from the same sequence in the last common ancestor of the genomes. The white boxes in Fig. 1a illustrate orthologs.
We can recognize orthologs by sequence similarity, but we need to distinguish them from two other types of similar sequence. The first is paralogs, which are descended from a common ancestral sequence by intra-genome duplication before the speciation event. The black and white boxes in Fig. 1a are paralogous to each other. The second is independently-evolved simple sequences such as atatatatatat. Simple sequences are typically suppressed by identifying and masking them, though not all identification  and masking  procedures work equally well.
Genome comparison would be simpler if the equivalencies were always one-to-one, but unfortunately orthology is not always one-to-one. If orthologs are duplicated after the speciation event, it can be many-to-many. In Fig. 1a, the black box in the left genome is orthologous to both black boxes in the right genome.
There is a large body of ongoing research on discriminating orthologous from paralogous proteins [3–6]. A simple approach, which ignores many-to-many orthology, is to find reciprocal best matches between two proteomes. A better approach in theory (not necessarily in practice ) is to infer phylogenetic trees of the proteins, and thence infer speciation and duplication events. These methods are not easily adapted to whole genomes, because we must consider rearrangements causing different genomic segments to have different evolutionary relationships, and the segment boundaries are not known in advance.
There is a widespread desire to refine the concept of orthology, perhaps in order to avoid many-to-many equivalencies, and so people speak of “main ortholog”, “positional ortholog”, “syntenic regions”, etc . These terms tend to be ill-defined. For example, “positional orthology” refers to orthologs that are in equivalent positions in two genomes: this is problematic, because the only way to define equivalent positions is by orthology. The intuition seems to be that more-extensive orthology defines equivalent positions, whereas smaller orthologous fragments do not. It is unclear how extensive the orthology has to be, or whether there is really a coherent concept here.
This has been made more precise under the term “toporthology”  (or “topoorthology” ), which is based on symmetry of duplications. For example, retrotransposition is an asymmetric duplication (Fig. 1b), because we can distinguish the original (white box) from the copy (black box). The original is the toportholog. It is important to realize that duplications can also be symmetric (Fig. 1c), so that neither duplicate is less “original” than the other: thus toporthology is not always one-to-one.
It was suggested that symmetric duplications are those where deletion of either duplicate would restore the genome to its original state . However, there are cases where deletion of neither duplicate would restore the original genome (Fig. 1d). In this example, we might be tempted to say that the duplicate with longer orthologous flanking sequence is the “main ortholog”, but that simply highlights the fuzziness of the concept.
Synteny, order and orientation
The original meaning of “syntenic” is “on the same chromosome” . Thus “conserved synteny” means conservation of being on the same chromosome. Comparison of Drosophila melanogaster and Drosophila pseudoobscura genomes shows striking synteny conservation: although these genomes are highly shuffled relative to each other, the shuffling is mostly within and not between chromosomes (Fig. 2).
Another pattern is conserved order and orientation. This happens when an ancestral genomic segment has been partially rearranged by inversions, deletions, insertions, etc, but parts of it remain in their ancestral order and orientation. This can be seen in human and orangutan chromosomes 17 (Fig. 2). Most genome alignment methods use conserved order and orientation to help construct their alignments .
The classic approach to alignment is to define a scoring scheme, with substitution and gap scores (e.g. Table 1), and then seek alignments with maximal total score. This is equivalent to using a statistical model of related sequences, with substitution and gap probabilities, and seeking alignments with maximal likelihood under the model [11, 12].
It is said that “all models are wrong, but some are useful”, and this is no exception. This model lacks many features of related sequences: substitutions are more frequent at CG dinucleotides, indels are more common in tandem repeats, some regions (e.g. protein-coding) are more conserved than others, structural RNA genes conserve complementarity rather than primary sequence, etc. There have been proposals to model some of these features (e.g. [13–15]), but they have a cost in run time and nuisance parameters. In this study we shall just use the classic alignment model, though our new methods could be combined with more complex models. Classic alignment has been very widely used, and often works well enough to give useful results. It can successfully align orthologs whose primary sequence is not constrained, provided their common ancestry is recent enough that they have not diverged too far.
Maximal-score alignment has an under-appreciated flaw: it can spuriously align dissimilar and unrelated sequences, if they are flanked by similar sequences . Although the dissimilar sequences will have negative alignment score, if both flanks have positive scores of greater magnitude then the score is maximized by aligning the whole thing. The underlying problem is that this approach seeks optimal individual alignments, but we really want an optimal set of alignments.
Maximal-score alignments can be found by the Smith-Waterman-Gotoh algorithm [17, 18], but this is slow for large genomes and so fast heuristics are used instead. A typical heuristic is seed-and-extend, which often has three steps: 1) find “seeds”, i.e. short matches that can be found quickly; 2) for each seed check whether there is a gapless alignment with score ≥ some threshold d; 3) if so check whether there is a gapped alignment with score ≥e.
Step 3 is often done with a “gapped x-drop algorithm” [19, 20]. This means that we try extending an alignment in all possible ways, with any pattern of insertions and deletions, but stop if the score drops more than x below the maximum seen so far. It can be argued that x should be just less than e: lower values of x can hide alignment flanks with positive score, but higher values cause trouble by merging alignments with score s≥e across drops with score ≤−s .
Repeats (interspersed repeats and simple sequences) are typically masked before alignment. Specifically, they are marked using lowercase letters, seeds are forbidden from overlapping them, but the final alignments are allowed to extend into them. A major reason for masking is to make the computation tolerable: without it, e.g. each of the 1 million human Alu repeats would hit each of the 1 million chimp Alus, producing 1012 alignments.
Summary of this study
This study presents a new genome alignment method, with several interesting features:
It is based on finding an optimal set of alignments, instead of optimal individual alignments.
It aligns without masking, which turns out to be important for orthology search.
It uses a statistical model to estimate the reliability (unambiguity) of each alignment part, enabling the user to disregard less-reliable parts.
In a major departure, it does not consider conserved order and orientation. Although considering this is sensible, the ways that other aligners do so are problematic.
Compared to previous aligners, this method aligns thousands of loci differently and with much higher similarity, strongly suggesting that the previous alignments are not orthologous.
Idea of the new method
The idea is to seek a set of one-to-one alignments between two genomes that maximizes:
Here, f is an “alignment existence cost”, which is necessary to avoid trivial solutions with lots of length-1 alignments. It is similar to Mauve’s breakpoint penalty .
The one-to-one requirement means that each basepair in either genome must match at most one basepair in the other genome. This is crude but tractable, and the hope is it will mostly find one-to-one orthologs. It is akin to the reciprocal best match approach to protein orthology.
This simple scoring system is a natural way to find a set of items. One property is that no alignment can contain any segment with score <−f, because in that case the score could be increased by splitting the alignment into two parts either side of the segment. So it solves the aforementioned problem of arbitrarily bad segments in individual alignments. The constant f reflects uniform probabilities, in a statistical model, of starting and ending a new item (see the Appendix).
Note this is not equivalent to finding non-overlapping alignments with score >f, with a classic aligner like BLAST or WU-BLAST [23, 24]. Our approach optimizes the set rather than individual alignments: for instance, if two alignments overlap, our approach optimizes the breakpoint for jumping between them.
Unfortunately, there does not seem to be an efficient algorithm to find such an optimal set of alignments. The nearest thing is the “repeated matches” algorithm, which finds an optimal set of many-to-one alignments . This is asymmetric: it aligns each basepair in the “query” genome to at most one basepair in the “reference” genome, but not necessarily vice-versa. It is about as fast as Smith-Waterman-Gotoh. In practice, the new method uses these steps:
Find local alignments between the two genomes, by seed-and-extend (many-to-many).
Apply the repeated matches algorithm, constrained to the candidate alignments found in step 1. We refer to this constrained version of the repeated matches algorithm as “split-alignment”.
Split-alignment guarantees to find a set of many-to-one alignments that maximizes the sum of (alignment score−f), where each alignment in the set is part (or all) of a candidate alignment. In other words, given a set of alignments that overlap in the query, it finds an optimal set of nonoverlapping alignment parts. One aspect of this is finding optimal breakpoints for jumping between overlapping alignments. The output may include multiple parts of one candidate alignment.
Perform split-alignment a second time, after swapping the roles of query and reference. This produces one-to-one alignments.
Step 1 uses LAST (though other aligners could be used), and for brevity let us refer to the whole new method as LAST . We shall refer to the output of step 2 as “1-split” alignments, and the output of step 3 as “2-split” alignments.
Many of the following results use the 1-split alignments, because they are easier to evaluate: if we find many-to-one alignments between genomes Q (query) and R (reference), we can assess whether each alignment could be improved by aligning the same segment of Q to a different region of R. They are also more comparable to the UCSC genome alignments, which are many-to-one [26, 27].
By using a probabilistic version of split-alignment (a kind of Forward-Backward algorithm , see the Appendix), we can estimate the probability that each pair of bases is wrongly aligned. This is high if that region of genome Q aligns almost equally well to other regions of genome R. The following results omit alignments from each set that lack at least one position with error probability ≤ 0.00001.
Results with pre-masking
The new method was used to align the human and chimp genomes, with standard repeat-masking at first. To facilitate comparison with the UCSC alignments, the same scoring scheme was used (human-chimp.v2, Table 1). This produced 371977 1-split alignments (with human as query), of which 15084 are “different” from UCSC, meaning no pair of aligned bases in common. For 6845 of these different alignments, the alignment’s human segment is 100 % covered by (i.e. contained in) one UCSC alignment: so we can compare the alignment scores for this (exact same) human segment. LAST’s score is higher in 95 % of cases (Fig. 3). For human versus dog, LAST’s score is higher in 90 % of cases.
It is encouraging that LAST usually gets higher scores, but the 5–10 % of lower scores are clear failures in its aim of finding an optimal set of alignments. Inspection of several cases revealed that these failures are caused by masking. If the true ortholog of a sequence is masked, but a paralog is not, then LAST may incorrectly align the paralog. Fundamentally, masking is dangerous for orthology search in a way that it is not for homology search. In homology search it can only cause false-negatives, but in orthology search it can also cause false-positives.
Alignment without masking
We would thus like to align without masking, but we still wish to avoid aligning independently-evolved simple sequences (Fig. 1a). This was achieved by post-masking: alignments that mostly overlap simple sequence were discarded at the end.
The problem is that alignment without masking takes much longer and produces overwhelming output (Table 2, row “mask” versus row “unmask”). It is feasible because we use LAST, whose seeds adapt (in length and rareness) to repeats . So the number of seed hits merely doubles (because about half the query was previously masked). The main problem is that the number of gapless alignments increases 100-fold. This is because a greater proportion of the seed hits lie in high-scoring alignments (repeats).
To mitigate this problem, a “gapless alignment culling” step was added. This step discards any gapless alignment whose query segment lies in those of two or more other alignments with greater score-per-length.a This aims to get the strongest matches to each region of the query (like adaptive seeds), not all matches. Ultimately we just want one strongest match, but the second-strongest helps us to calculate model probabilities. A similar culling procedure is present in BLAST .
Results with post-masking
Post-masking (of simple sequences only, not interspersed repeats) was tested on five pairs of genomes (Fig. 4). In each case, the majority of aligned bases are identical to the UCSC alignments (Fig. 4, top row), but a nontrivial proportion are different (Table 3). For example, in the human-mouse comparison, >12 % of aligned bases lie in alignments that are completely different from UCSC (no pair of aligned bases in common).
As above, we can compare scores for “different” LAST alignments whose human segment is covered by one UCSC alignment (Fig. 4, row 2). For the ape comparisons, LAST’s score is almost always higher, so post-masking does indeed improve the results. Moreover, the LAST scores are higher by a margin of at least 795: this comes from the error probability threshold of 0.00001, because a score difference of 795 is equivalent to a 105-fold difference in model probability.
The human-dog and human-mouse results are not quite as good: the LAST scores are lower in about 2 % of cases. This is at least partly because these genomes are more diverged, so LAST’s seeds miss some orthologs.
It may be more intuitive to compare the LAST and UCSC alignments by %-identity (Fig. 4, row 3). The LAST alignments almost always have higher %-identity, often by a considerable margin, e.g. 10 % or 20 %. %-identity can be misleading, because it treats e.g. one length-10 gap the same as 10 substitutions. It is better to weight different types of change by their evolutionary likelihoods, which is done in the alignment scores (Fig. 4, row 2).
In summary, there are thousands of loci that LAST aligns completely differently from UCSC, with significantly higher score and %-identity. Some overlap protein-coding exons (Table 4). It is plausible that in many of these cases the LAST alignments are orthologous and the UCSC alignments are not. In some cases, the UCSC alignments lack similarity and homology. An example is shown in Fig. 5, where UCSC aligns an inversion in un-inverted orientation. In other cases, the UCSC alignments are homologous, but less similar than the LAST alignments. However, UCSC favours chains of colinear alignments, and we may wonder whether we would rather have (say) a 91 %-identity colinear human-chimp alignment or a 98 %-identity non-colinear alignment (Table 4). When the difference in similarity is this large, it is more plausible that the UCSC alignment is paralogous. Since paralogs often come from tandem duplication, they can lie in chains. Several factors may cause lower-similarity chained alignments.
Large and complex duplications: these create ambiguity about how to construct chains.
Rearrangement (e.g. inversion) of the ortholog but not the paralog.
Genome misassembly: most of these assemblies are unfinished drafts. Misassembly is especially likely in regions with complex duplications, repeats, and rearrangements.
Gene conversion: this can convert an ortholog to a paralog.
Contaminating human sequence in e.g. the chimp assembly.
Accelerated evolution: this can decrease the similarity of an ortholog.
Wrong x-drop alignments
LAST’s ape comparisons still have a tiny fraction of alignments with lower score than UCSC. These are mostly caused by a pathology of the gapped x-drop heuristic (Fig. 6). If an alignment has a region with score <−x (e.g. a large gap), the left and right flanks of that region will usually be found as separate alignments. Unfortunately, it is sometimes possible to find an alternative, wrong alignment of the whole region without a score drop >x, but with lower score overall. If orthologs are wrongly aligned in this way, the alignment score may be lower than that of paralogs, causing LAST to prefer a paralogous alignment.
This problem can be fixed by either increasing x so that the correct alignment is found, or decreasing x so that the incorrect alignment is not found and the correct alignment is found in two parts. Unfortunately, different values of x fix different cases, and there is no reasonable value that fixes all cases.
New alignments of repeats
The LAST alignments include many cases where the human segment is completely unaligned by UCSC (Table 3). These alignments tend to be covered by repeat elements, such as LINEs and SINEs (Fig. 7, left panel). Many repeats can be aligned unambiguously because they are older than the common ancestor of the genomes, so they have unique orthologs with higher similarity than the other copies. In addition, there are many cases where repeat elements have been inserted within other repeats, creating unique mosaics. Alignment without masking reveals many such potentially interesting orthologies.
Nevertheless, orthology search is likely harder for repeats than non-repeats. The human-chimp 1-split alignments mostly have around 98 % identity, but many of the repeat alignments have lower %-identity (Fig. 7, panel 2). The likely explanation is that many of these repeat alignments are paralogous.
This problem mostly vanishes in the final 2-split alignments (Fig. 7, panel 3). Now the repeat alignments also have around 98 % identity, although they have slightly higher variance: they more often have both higher and lower %-identity. This higher variance is not surprising as repeat alignments tend to be shorter (Fig. 7, panel 4), simply because longer alignments are less likely to be 100 % repetitive.
Surprisingly, the human-dog 1-split repetitive alignments do not have reduced %-identity (Fig. 7, panel 5). A possible explanation is that the human-chimp paralogous alignments are mostly due to poor genome assembly: orthologous human-chimp repeats are often very young, with low divergence, and thus hard to assemble.
Badness of HoxD55
Alignment of the D. melanogaster and D. pseudoobscura genomes worked less well: the LAST scores were lower than the UCSC scores in 13 % of cases (Fig. 4). This was the only comparison to use the HoxD55 scheme (Table 1). Inspection of several cases revealed that the LAST failures are due to the x-drop problem described above, which evidently occurs much more often with HoxD55. This scoring scheme has a high tolerance for aligning unrelated sequences , which presumably exacerbates the x-drop error.
Accordingly, the alignment worked much better with HoxD70 (Fig. 8, left column). Now, the %-identity is almost always higher for LAST than UCSC, apart from just two clearly-wrong LAST alignments, caused by x-drop error.
Comparison to other aligners
Many genome alignment methods have been proposed, though most have in common an approach of looking for chains of colinear alignments. In addition to UCSC, let us consider VISTA  and Mauve  as representative examples.
In the VISTA human-chimp alignments, the vast majority of aligned bases are identical to LAST (Fig. 8). Nevertheless, there are many LAST alignments that have no aligned bases in common with VISTA: for some of these, the human segment is covered by one VISTA alignment, in which case we can compare the %-identities for that human segment. The VISTA %-identity is almost always lower, often much lower (Fig. 8). In fact, VISTA has many more very low %-identity alignments (e.g. <60 %) than UCSC (Fig. 4, lower-left panel). Inspection of several cases revealed errors similar to that in Fig. 5. The likely reason is that VISTA uses colinearity more aggressively than UCSC, by globally aligning genome regions defined by chains.
As another comparison, the two Drosophila genomes were aligned using progressiveMauve version 2.3.1 with default parameters. The result is conservative, with fewer aligned bases than LAST, and few cases where Mauve aligns the same region of melanogaster to a different region of pseudoobscura (Fig. 8). In these few cases, Mauve’s %-identity is usually much lower, apart from the same two LAST errors mentioned above. Although Mauve also uses aggressive global alignment, it subsequently detects and removes alignments of unrelated sequences, to avoid errors like that in Fig. 5 [22, 31].
Good alignment depends on using reasonable score/model parameters (Table 1), and we can check whether they match the substitution and gap frequencies in the alignments. This is only a rough check, because the alignments are not perfect: in particular, the gap existence counts may be underestimates due to “gap attraction” and “gap annihilation” [13, 32].
The main observation is that the gap costs for human-chimp.v2 are unduly large: a better fit would be obtained with a gap existence cost of 500 and a gap extension cost of 30. So we re-did the ape alignments with these costs, then re-counted substitutions and gaps.
The next observation is that gap lengths do not fit any model with a simple gap extension probability, because the frequencies of longer gaps decrease more slowly (Fig. 9). A pragmatic solution is to fit the gap extension probability to short gaps.
The substitution and gap frequencies, and corresponding scores, are shown in Appendix C. They do not differ greatly from the alignment models. However, these frequencies are not uniform across the genome, and averaged parameters may not be ideal. For example, a (match score):(mismatch cost) ratio of 1:1 is appropriate for ∼75 % identity, 1:2 for ∼95 % identity, and 1:3 for ∼99 % identity . To investigate, we deleted gap columns then measured substitution rates in 200bp windows of the alignments (Fig. 9, row 2). The substitution rates are not uniform, but they do not vary arbitrarily: for instance, human-mouse alignments hardly ever have ≥90 % identity. In summary, the used parameters are not wildly unreasonable.
Alignathon simulation test
Finally, we tested our method on the “Alignathon” simulated genomes . Simulation has the advantage that the true alignments are known, but the disadvantage that the simulation’s realism is unknown. For instance, the simulation presumably lacks rearrangements like that in Fig. 1d.
There are two simulations: one of four ape-like genomes, and one of five mammal-like genomes. The “truths” are multiple (not pair-wise) alignments, and, in our understanding, they align all homologs (including paralogs, but excluding mobile element insertions) that have duplicated since the most recent common ancestor of the genomes. This unfortunately does not match our approach of finding one-to-one orthologs. In any case, for each simulation we made pair-wise alignments with LAST, then joined them into multiple alignments with mafTransitiveClosure , which joins pair-wise into multiple alignments in a naïve way.
For the ape simulation, LAST achieved a precision of 0.998 and a recall of 0.978. All other aligners had lower precision (Table S13 in ). For the mammal simulation, LAST achieved precision=0.827 and recall=0.612. Several aligners have higher precision, however all but one of those have much lower recall (Tables S15–16 in ). The exception is Cactus, with precision=0.885 and recall=0.734.
To understand why Cactus has higher precision, let us focus on pair-wise alignments between simHuman and simMouse. LAST aligns 124 million pairs of bases, of which 25 million are wrong, and 464 thousand lie in completely-wrong alignments (no pair of aligned bases in common with the truth). Cactus aligns 131 million pairs of bases, of which 18 million are wrong, and 6 million lie in completely-wrong alignments. So LAST is much better at avoiding completely-wrong alignments, whereas Cactus excels at accuracy of partly-right alignments. The latter is not surprising, because Cactus is a true multiple aligner: it takes pair-wise alignments from an external source (potentially LAST), and combines and refines them by integrating the information from all the sequences .
The new genome alignment method is conceptually extremely simple, it just seeks an optimal set of one-to-one alignments. Despite decades of extensive research on alignment, alignment sets have been surprisingly neglected, although they are often what is really wanted, e.g. for multi-domain proteins.
The new method is obviously crude, because it ignores phylogeny and many-to-many orthology. It will fail in cases of reciprocal gene loss, where one copy of a paralog is absent in one genome and the other copy is missing from the other genome. Such hidden paralogy is a major problem in understanding evolution .
Nevertheless, the new method seems to fix thousands of non-orthologous parts in previous genome alignments. The previous errors were caused by an over-aggressive assumption of conserved order and orientation. For example, in many cases in Table 4, UCSC finds the same alignment as LAST in its initial (many-to-many) “chains” but omits it from its final (many-to-one) “net” alignments, because it prefers weaker alignments in stronger chains. There is a widespread paradigm of trying to align long colinear blocks (often using “chains” or “anchors”), which risks producing non-orthologous or even non-homologous alignments. The ideal approach is probably to use a weaker preference for conserved order and orientation, e.g. via prior probabilities in a statistical model.
The use of a probabilistic model is a key advantage, since it quantifies the ambiguity of each aligned base. Similar probabilistic methods have been applied before to individual alignments [11, 13, 14, 21], but apparently not to alignment sets.
We found that pre-masking is dangerous for orthology search, which is probably not widely recognized since it is not dangerous for typical BLAST homology searches. Unfortunately, genome alignment without masking is much more compute-intensive, even with adaptive seeds and gapless alignment culling. Probably, better heuristics could be developed to tackle this.
We also found that the gapped x-drop heuristic can sometimes produce bad alignments (Fig. 6). This is important because x-drop is widely used (e.g. BLAST), the bad alignments are not immediately obvious (probably they are usually overlooked), and this problem does not seem to have been described before. Unfortunately, it is unclear how to fix it, save by applying the repeated matches algorithm directly to the genomes (which seems feasible on a large supercomputer).
Split-alignment has applications beyond whole genome comparison. It can be used to map DNA or RNA reads to a genome. “Mapping” is orthology search (since paralogs are not wanted), and reads are genome fragments (possibly rearranged), so it is all the same thing. Since different reads may redundantly cover the same query bases, we would seek many-to-one alignments, i.e. stop at the 1-split stage. Our method incorporates fastq quality data into the model and scoring . The statistical model, which quantifies the (un)ambiguity of each alignment part, is a major benefit for finding reliable rearrangement breakpoints.
The new method aligns the majority of genomic bases identically to previous methods, as expected. Nevertheless, around 100 million human bases, which overlap a number of protein-coding regions, are in completely different alignments. The new alignments should be especially beneficial when searching for interesting and unusual features in genome evolution, because these are particularly confounded by alignment errors. One example is accelerated evolution, which is mimicked by paralogy. Another is micro-rearrangements, which are systematically missed in standard genome alignments based on colinearity [38, 39]. Indeed the new alignments suggest many interesting rearrangements (e.g. Fig. 5), but unfortunately it is not straightforward to tell true rearrangements from assembly errors. The new alignments are available at: . The software is available at , and also in the last-align package for Debian and Ubuntu .
Materials and methods
The input is a set of local alignments between one query sequence and one genome. (If there is more than one query, the algorithm is applied to each independently.) An example is shown in Fig. 10. First, the alignments are oriented to use the forward strand of the query. A i j is defined to be the score at query letter j in alignment i, for match, mismatch, or insertion of this letter. D i j is defined to be the score between query letters j−1 and j in alignment i, for deletions. The optimal split-alignment score is calculated by dynamic programming, using these recurrence relations:
The recurrence is initialized like this:
where beg(i) is the coordinate of the first query letter in alignment i, and beg= min(beg(i)). The optimal split-alignment score is W end, where end= max(end(i)), and end(i) is one-past the last query letter in alignment i. This only calculates the score, but an optimal split-alignment can then be found by a standard traceback procedure .
These assemblies were used: panTro4, ponAbe2, canFam3, mm10, dp4, dm3 (without chrUextra), and hg19 (without alternate haplotypes and with the chrY pseudo-autosomal regions replaced by ‘n’s).
The UCSC genome alignments were taken from the axtNet subdirectories of these directories: hg19/vsPanTro4, hg19/vsPonAbe2, hg19/vsCanFam3, hg19/vsMm10, dm3/vsDp4.
The VISTA alignment was taken from: .
Lowercase-masked genomes were obtained from the UCSC database. Tandem repeats found by tantan version 13 were additionally masked, in order to prevent non-homologous alignments more reliably .
The genomes were lowercase-masked by tantan only, then aligned case-insensitively, and at the very end each alignment was rescored with gentle masking of lowercase letters : if it lacked any segment with score ≥e it was discarded.
The sensitive transition seed set MAM8 was used by default . For the closely-related apes, the spaced seed 1111110 was used instead. For human-dog and human-mouse with post-masking, since the number of indexed bases roughly doubles without masking, MAM4 was used to avoid a too-large index.
LAST’s seed rareness limit m was empirically set to 50 for the ape alignments and 100 for the others. The score threshold e was set to values with borderline statistical significance, using ALP : 5000 for the flies with HoxD55, 4000 for the flies with HoxD70, 4500 for mammals with HoxD70, and 3000 for the apes. The alignment existence cost f and the maximum gapped score drop x were both set to e−1.
To illustrate, the Drosophila HoxD70 alignments can be constructed with LAST v535 as follows. First, run tantan on both genomes, with default settings. Then, make the 1-split alignments like this: lastdb¡-uMAM8¡x¡dp4.fa lastal¡-pHOXD70¡-e4000¡-C2¡-m100¡x¡dm3.fa¡| last-split¡-m1¡>¡1.maf Next, make the 2-split alignments like this: maf-swap¡1.maf¡|¡last-split¡-m1¡>¡2.maf Finally, run last-postmask on 1.maf and 2.maf.
Alignathon ape test
These query-reference pairs were aligned: simChimp-simHuman, simGorilla-simHuman, simOrang-simHuman. The alignment procedure was the same as for the real ape genomes (lastdb option -m1111110, and lastal options -phuman-chimp.v2 -a500 -b30 -e3000 -C2 -m50). Alignments with error probability ≤ 0.00001 were retained, and joined by mafTransitiveClosure.
Alignathon mammal test
These query-reference pairs were aligned: simDog-simHuman, simMouse-simHuman, simRat-simMouse, simCow-simDog. Since the simulated genomes are smaller than the real ones, we used MAM8 instead of MAM4 (lastdb option -uMAM8, and lastal options -pHOXD70 -e4500 -C2 -m100). Alignments with error probability ≤ 0.00001 were retained, and joined by mafTransitiveClosure.
The data set supporting the results of this article is available in the Zenodo repository .
a Score-per-length is computed for whole alignments, not overlapping parts.
Appendix A: Statistical models
The aim here is to explain and motivate the statistical model of alignments, and the f parameter (alignment existence cost). It is instructive to first consider models of segments, such as hydrophobic segments in protein sequences. Segments are a simpler (1-dimensional) analog of alignments.
A simple model is for segments to have independent letters with frequencies π x , while background (non-segment) regions have letter frequencies θ x . Given a sequence, we can then seek maximal-likelihood segments.
Figure 11a shows a precise model of this kind, with transition probabilities ω and γ, in a standard circle-and-arrow notation . Suppose we have a sequence Q of length n. Let us calculate the likelihood of the path through the model whereby Q c+1…Q d is a foreground segment:
This can be simplified by factoring out a constant μ, defined as:
Because μ does not depend on the path, we can find a most-probable path by maximizing:
Next, because maximizing a value is equivalent to maximizing its logarithm, we can maximize:
We can now define a scoring scheme, where each letter-type x receives a score:
Here, t is an arbitrary scale factor. Maximal-likelihood segments are runs of letters with maximal total score. Scores are related to model probabilities like this:
A.2 Segment sets
Figure 11a clearly models one segment, and we can instead model multiple segments using Fig. 11b, with transition probabilities ω, γ, and α. It can be shown that a maximal-likelihood segment set is one that maximizes:
Here, the segment score is the sum of the letter scores S(x), and f is:
Thus, a segment existence cost f arises naturally from model probabilities of starting and ending a segment.
Figure 11c shows one possible model of local alignments. It can be shown that a maximal-likelihood alignment is one with maximal score according to this scheme:
A.4 Alignment sets
Unfortunately, it is unclear how to make a simple model like those in Fig. 11 for a set of local alignments. So let us proceed by brute force. In all three previous models, it was the case that prob∝ exp(score/t). We can define the probability of any alignment set A as follows:
The score parameters and t are the same as in the single-alignment model, so the only new parameter is f.
Appendix B: Probability calculation
The probabilistic version of the split-alignment algorithm is described here. These exponentiated scores are used:
The Forward algorithm is:
The Backward algorithm is:
These algorithms enable us to calculate the model probability of each column in each alignment. The probability for a column in alignment i with query letter j is:
where z=G end=C beg. The probability for a column in alignment i between query letters j−1 and j is:
Each column’s error probability is one minus its model probability.
The practical implementation of this Forward-Backward algorithm uses scaling to avoid numerical instability .
Frith MC. A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res. 2011; 39:23.
Frith MC. Gentle masking of low-complexity sequences improves homology search. PLoS ONE. 2011; 6:28819.
Kuzniar A, van Ham RC, Pongor S, Leunissen JA. The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 2008; 24:539–51.
Altenhoff AM, Dessimoz C. Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput Biol. 2009; 5:1000262.
Altenhoff AM, Dessimoz C. Inferring orthology and paralogy. Methods Mol Biol. 2012; 855:259–79.
Sonnhammer E, Gabaldon T, Wilter Sousa da Silva A, Martin M, Robinson-Rechavi M, Boeckmann B, Thomas P, Dessimoz C. Big Data and Other Challenges in the Quest for Orthologs. Bioinformatics. 2014; 30(21):2993–8.
Dewey CN. Positional orthology: putting genomic evolutionary relationships into context. Brief Bioinformatics. 2011; 12:401–12.
Dewey CN, Pachter L. Evolution at the nucleotide level: the problem of multiple whole-genome alignment. Hum Mol Genet. 2006; 15 Spec No 1:51–6.
Passarge E, Horsthemke B, Farber RA. Incorrect use of the term synteny. Nat Genet. 1999; 23:387.
Dewey CN. Whole-genome alignment. Methods Mol Biol. 2012; 855:237–57.
Durbin R, Eddy S, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press; 1998.
Yu YK, Altschul SF. The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics. 2005; 21:902–11.
Lunter G, Rocco A, Mimouni N, Heger A, Caldeira A, Hein J. Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Res. 2008; 18:298–309.
Hudek AK, Brown DG. FEAST: sensitive local alignment with multiple rates of evolution. IEEE/ACM Trans Comput Biol Bioinform. 2011; 8:698–709.
Nánási M, Vinar T, Brejová B. Probabilistic approaches to alignment with tandem repeats. Algorithms Mol Biol. 2014; 9:3.
Zhang Z, Berman P, Wiehe T, Miller W. Post-processing long pairwise alignments. Bioinformatics. 1999; 15:1012–1019.
Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981; 147:195–7.
Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol. 1982; 162:705–8.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25:3389–402.
Zhang Z, Berman P, Miller W. Alignments without low-scoring regions. J Comput Biol. 1998; 5:197–210.
Frith MC, Hamada M, Horton P. Parameters for accurate genome alignment. BMC Bioinformatics. 2010; 11:80.
Darling AE, Mau B, Perna NT. progressive Mauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS ONE. 2010; 5:11147.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990; 215:403–10.
Lopez R, Silventoinen V, Robinson S, Kibria A, Gish W. WU-Blast2 server at the European Bioinformatics Institute. Nucleic Acids Res. 2003; 31:3795–798.
Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011; 21:487–93.
Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, et al. Human-mouse alignments with BLASTZ. Genome Res. 2003; 13:103–7.
Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003; 100:11484–11489.
Berman P, Zhang Z, Wolf YI, Koonin EV, Miller W. Winnowing sequences from a database search. J Comput Biol. 2000; 7:293–302.
Frith MC, Park Y, Sheetlin SL, Spouge JL. The whole alignment and nothing but the alignment: the problem of spurious alignment flanks. Nucleic Acids Res. 2008; 36:5863–871.
Dubchak I, Poliakov A, Kislyuk A, Brudno M. Multiple whole-genome alignments without a reference organism. Genome Res. 2009; 19:682–9.
Treangen TJ, Darling AE, Achaz G, Ragan MA, Messeguer X, Rocha EP. A novel heuristic for local multiple alignment of interspersed DNA repeats. IEEE/ACM Trans Comput Biol Bioinform. 2009; 6:180–9.
Lunter G. Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics. 2007; 23:289–96.
States DJ, Gish W, Altschul SF. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods. 1991; 3:66–70.
Earl D, Nguyen N, Hickey G, Harris RS, Fitzgerald S, Beal K, et al. Alignathon: a competitive assessment of whole-genome alignment methods. Genome Res. 2014; 24:2077–089.
Paten B, Earl D, Nguyen N, Diekhans M, Zerbino D, Haussler D. Cactus: Algorithms for genome multiple sequence alignment. Genome Res. 2011; 21:1512–1528.
Kuraku S. Palaeophylogenomics of the vertebrate ancestor–impact of hidden paralogy on hagfish and lamprey gene phylogeny. Integr Comp Biol. 2010; 50:124–9.
Frith MC, Wan R, Horton P. Incorporating sequence quality data into alignment improves DNA read mapping. Nucleic Acids Res. 2010; 38:100.
Chaisson MJ, Raphael BJ, Pevzner PA. Microinversions in mammalian evolution. Proc Natl Acad Sci U S A. 2006; 103:19824–19829.
Hou M, Yao P, Antonou A, Johns MA. Pico-inplace-inversions between human and chimpanzee. Bioinformatics. 2011; 27:3266–275.
Genome alignments from “Split-alignment of genomes finds orthologies more accurately”. http://last.cbrc.jp/genome/.
LAST: genome-scale sequence comparison. http://last.cbrc.jp/.
Möller S, Krabbenhöft HN, Tille A, Paleino D, Williams A, Wolstencroft K, et al. Community-driven computational biology with Debian Linux. BMC Bioinformatics. 2010; 11:5.
Human Feb 2009- Chimp Feb 2011 pairwise alignments. http://pipeline.lbl.gov/data/hg19_panTro4.
Frith MC, Noé L. Improved search heuristics find 20,000 new alignments between human and mouse genomes. Nucleic Acids Res. 2014; 42:59.
Sheetlin S, Park Y, Spouge JL. The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment. Nucleic Acids Res. 2005; 33:4987–994.
Genome alignments from “Split-alignment of genomes finds orthologies more accurately”. https://zenodo.org/record/17436.
Hastings PJ, Lupski JR, Rosenberg SM, Ira G. Mechanisms of change in gene copy number. Nat Rev Genet. 2009; 10:551–64.
Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002; 7:115–26.
We thank Hugues Richard and Anish Shrestha for discussions while developing the method. This work was supported by KAKENHI Grant Number 26700030 to MCF, and International Research Fellowship of the Japan Society for the Promotion of Science PE11014 to HR.
The authors declare that they have no competing interests.
RK participated in developing the software. MCF designed the method, performed the tests, and wrote the manuscript. Both authors read and approved the final manuscript.