Split-alignment of genomes finds orthologies more accurately
© Frith and Kawaguchi. 2015
Received: 6 April 2015
Accepted: 8 May 2015
Published: 21 May 2015
We present a new pair-wise genome alignment method, based on a simple concept of finding an optimal set of local alignments. It gains accuracy by not masking repeats, and by using a statistical model to quantify the (un)ambiguity of each alignment part. Compared to previous animal genome alignments, it aligns thousands of locations differently and with much higher similarity, strongly suggesting that the previous alignments are non-orthologous. The previous methods suffer from an overly-strong assumption of long un-rearranged blocks. The new alignments should help find interesting and unusual features, such as fast-evolving elements and micro-rearrangements, which are confounded by alignment errors.
Aim of genome alignment
We can recognize orthologs by sequence similarity, but we need to distinguish them from two other types of similar sequence. The first is paralogs, which are descended from a common ancestral sequence by intra-genome duplication before the speciation event. The black and white boxes in Fig. 1a are paralogous to each other. The second is independently-evolved simple sequences such as atatatatatat. Simple sequences are typically suppressed by identifying and masking them, though not all identification  and masking  procedures work equally well.
Genome comparison would be simpler if the equivalencies were always one-to-one, but unfortunately orthology is not always one-to-one. If orthologs are duplicated after the speciation event, it can be many-to-many. In Fig. 1a, the black box in the left genome is orthologous to both black boxes in the right genome.
There is a large body of ongoing research on discriminating orthologous from paralogous proteins [3–6]. A simple approach, which ignores many-to-many orthology, is to find reciprocal best matches between two proteomes. A better approach in theory (not necessarily in practice ) is to infer phylogenetic trees of the proteins, and thence infer speciation and duplication events. These methods are not easily adapted to whole genomes, because we must consider rearrangements causing different genomic segments to have different evolutionary relationships, and the segment boundaries are not known in advance.
There is a widespread desire to refine the concept of orthology, perhaps in order to avoid many-to-many equivalencies, and so people speak of “main ortholog”, “positional ortholog”, “syntenic regions”, etc . These terms tend to be ill-defined. For example, “positional orthology” refers to orthologs that are in equivalent positions in two genomes: this is problematic, because the only way to define equivalent positions is by orthology. The intuition seems to be that more-extensive orthology defines equivalent positions, whereas smaller orthologous fragments do not. It is unclear how extensive the orthology has to be, or whether there is really a coherent concept here.
This has been made more precise under the term “toporthology”  (or “topoorthology” ), which is based on symmetry of duplications. For example, retrotransposition is an asymmetric duplication (Fig. 1b), because we can distinguish the original (white box) from the copy (black box). The original is the toportholog. It is important to realize that duplications can also be symmetric (Fig. 1c), so that neither duplicate is less “original” than the other: thus toporthology is not always one-to-one.
It was suggested that symmetric duplications are those where deletion of either duplicate would restore the genome to its original state . However, there are cases where deletion of neither duplicate would restore the original genome (Fig. 1d). In this example, we might be tempted to say that the duplicate with longer orthologous flanking sequence is the “main ortholog”, but that simply highlights the fuzziness of the concept.
Synteny, order and orientation
Another pattern is conserved order and orientation. This happens when an ancestral genomic segment has been partially rearranged by inversions, deletions, insertions, etc, but parts of it remain in their ancestral order and orientation. This can be seen in human and orangutan chromosomes 17 (Fig. 2). Most genome alignment methods use conserved order and orientation to help construct their alignments .
Alignment scoring schemes used in this study, and their underlying probabilities
a c g t
a c g t
a c g t
a 90 -330 -236 -356
a 91 -114 -31 -123
a 91 -90 -25 -100
c -330 100 -318 -236
c -114 100 -125 -31
c -90 100 -100 -25
g -236 -318 100 -330
g -31 -125 100 -114
g -25 -100 100 -90
t -356 -236 -330 90
t -123 -31 -114 91
t -100 -25 -90 91
gap existence cost: 600
gap existence cost: 400
gap existence cost: 400
gap extension cost: 150
gap extension cost: 30
gap extension cost: 30
a c g t
a c g t
a c g t
gap existence probability:.000021
gap existence probability:.043
gap existence probability:.091
gap extension probability:.11
gap extension probability:.73
gap extension probability:.76
It is said that “all models are wrong, but some are useful”, and this is no exception. This model lacks many features of related sequences: substitutions are more frequent at CG dinucleotides, indels are more common in tandem repeats, some regions (e.g. protein-coding) are more conserved than others, structural RNA genes conserve complementarity rather than primary sequence, etc. There have been proposals to model some of these features (e.g. [13–15]), but they have a cost in run time and nuisance parameters. In this study we shall just use the classic alignment model, though our new methods could be combined with more complex models. Classic alignment has been very widely used, and often works well enough to give useful results. It can successfully align orthologs whose primary sequence is not constrained, provided their common ancestry is recent enough that they have not diverged too far.
Maximal-score alignment has an under-appreciated flaw: it can spuriously align dissimilar and unrelated sequences, if they are flanked by similar sequences . Although the dissimilar sequences will have negative alignment score, if both flanks have positive scores of greater magnitude then the score is maximized by aligning the whole thing. The underlying problem is that this approach seeks optimal individual alignments, but we really want an optimal set of alignments.
Maximal-score alignments can be found by the Smith-Waterman-Gotoh algorithm [17, 18], but this is slow for large genomes and so fast heuristics are used instead. A typical heuristic is seed-and-extend, which often has three steps: 1) find “seeds”, i.e. short matches that can be found quickly; 2) for each seed check whether there is a gapless alignment with score ≥ some threshold d; 3) if so check whether there is a gapped alignment with score ≥e.
Step 3 is often done with a “gapped x-drop algorithm” [19, 20]. This means that we try extending an alignment in all possible ways, with any pattern of insertions and deletions, but stop if the score drops more than x below the maximum seen so far. It can be argued that x should be just less than e: lower values of x can hide alignment flanks with positive score, but higher values cause trouble by merging alignments with score s≥e across drops with score ≤−s .
Repeats (interspersed repeats and simple sequences) are typically masked before alignment. Specifically, they are marked using lowercase letters, seeds are forbidden from overlapping them, but the final alignments are allowed to extend into them. A major reason for masking is to make the computation tolerable: without it, e.g. each of the 1 million human Alu repeats would hit each of the 1 million chimp Alus, producing 1012 alignments.
Summary of this study
It is based on finding an optimal set of alignments, instead of optimal individual alignments.
It aligns without masking, which turns out to be important for orthology search.
It uses a statistical model to estimate the reliability (unambiguity) of each alignment part, enabling the user to disregard less-reliable parts.
In a major departure, it does not consider conserved order and orientation. Although considering this is sensible, the ways that other aligners do so are problematic.
Compared to previous aligners, this method aligns thousands of loci differently and with much higher similarity, strongly suggesting that the previous alignments are not orthologous.
Idea of the new method
Here, f is an “alignment existence cost”, which is necessary to avoid trivial solutions with lots of length-1 alignments. It is similar to Mauve’s breakpoint penalty .
The one-to-one requirement means that each basepair in either genome must match at most one basepair in the other genome. This is crude but tractable, and the hope is it will mostly find one-to-one orthologs. It is akin to the reciprocal best match approach to protein orthology.
This simple scoring system is a natural way to find a set of items. One property is that no alignment can contain any segment with score <−f, because in that case the score could be increased by splitting the alignment into two parts either side of the segment. So it solves the aforementioned problem of arbitrarily bad segments in individual alignments. The constant f reflects uniform probabilities, in a statistical model, of starting and ending a new item (see the Appendix).
Note this is not equivalent to finding non-overlapping alignments with score >f, with a classic aligner like BLAST or WU-BLAST [23, 24]. Our approach optimizes the set rather than individual alignments: for instance, if two alignments overlap, our approach optimizes the breakpoint for jumping between them.
Find local alignments between the two genomes, by seed-and-extend (many-to-many).
Apply the repeated matches algorithm, constrained to the candidate alignments found in step 1. We refer to this constrained version of the repeated matches algorithm as “split-alignment”.
Split-alignment guarantees to find a set of many-to-one alignments that maximizes the sum of (alignment score−f), where each alignment in the set is part (or all) of a candidate alignment. In other words, given a set of alignments that overlap in the query, it finds an optimal set of nonoverlapping alignment parts. One aspect of this is finding optimal breakpoints for jumping between overlapping alignments. The output may include multiple parts of one candidate alignment.
Perform split-alignment a second time, after swapping the roles of query and reference. This produces one-to-one alignments.
Step 1 uses LAST (though other aligners could be used), and for brevity let us refer to the whole new method as LAST . We shall refer to the output of step 2 as “1-split” alignments, and the output of step 3 as “2-split” alignments.
Many of the following results use the 1-split alignments, because they are easier to evaluate: if we find many-to-one alignments between genomes Q (query) and R (reference), we can assess whether each alignment could be improved by aligning the same segment of Q to a different region of R. They are also more comparable to the UCSC genome alignments, which are many-to-one [26, 27].
By using a probabilistic version of split-alignment (a kind of Forward-Backward algorithm , see the Appendix), we can estimate the probability that each pair of bases is wrongly aligned. This is high if that region of genome Q aligns almost equally well to other regions of genome R. The following results omit alignments from each set that lack at least one position with error probability ≤ 0.00001.
Results with pre-masking
It is encouraging that LAST usually gets higher scores, but the 5–10 % of lower scores are clear failures in its aim of finding an optimal set of alignments. Inspection of several cases revealed that these failures are caused by masking. If the true ortholog of a sequence is masked, but a paralog is not, then LAST may incorrectly align the paralog. Fundamentally, masking is dangerous for orthology search in a way that it is not for homology search. In homology search it can only cause false-negatives, but in orthology search it can also cause false-positives.
Alignment without masking
We would thus like to align without masking, but we still wish to avoid aligning independently-evolved simple sequences (Fig. 1a). This was achieved by post-masking: alignments that mostly overlap simple sequence were discarded at the end.
Statistics for aligning human chromosome 22 to the chimp genome (step 1 only: no split-alignment)
To mitigate this problem, a “gapless alignment culling” step was added. This step discards any gapless alignment whose query segment lies in those of two or more other alignments with greater score-per-length.a This aims to get the strongest matches to each region of the query (like adaptive seeds), not all matches. Ultimately we just want one strongest match, but the second-strongest helps us to calculate model probabilities. A similar culling procedure is present in BLAST .
Results with post-masking
Quantities of LAST (1-split, post-masked) alignments, and differences from UCSC alignments
(bases a )
(bases a )
(bases a )
(bases a )
As above, we can compare scores for “different” LAST alignments whose human segment is covered by one UCSC alignment (Fig. 4, row 2). For the ape comparisons, LAST’s score is almost always higher, so post-masking does indeed improve the results. Moreover, the LAST scores are higher by a margin of at least 795: this comes from the error probability threshold of 0.00001, because a score difference of 795 is equivalent to a 105-fold difference in model probability.
The human-dog and human-mouse results are not quite as good: the LAST scores are lower in about 2 % of cases. This is at least partly because these genomes are more diverged, so LAST’s seeds miss some orthologs.
It may be more intuitive to compare the LAST and UCSC alignments by %-identity (Fig. 4, row 3). The LAST alignments almost always have higher %-identity, often by a considerable margin, e.g. 10 % or 20 %. %-identity can be misleading, because it treats e.g. one length-10 gap the same as 10 substitutions. It is better to weight different types of change by their evolutionary likelihoods, which is done in the alignment scores (Fig. 4, row 2).
Large and complex duplications: these create ambiguity about how to construct chains.Table 4
Examples of better human-chimp alignments found by LAST than UCSC (mm=mismatches)
Rearrangement (e.g. inversion) of the ortholog but not the paralog.
Genome misassembly: most of these assemblies are unfinished drafts. Misassembly is especially likely in regions with complex duplications, repeats, and rearrangements.
Gene conversion: this can convert an ortholog to a paralog.
Contaminating human sequence in e.g. the chimp assembly.
Accelerated evolution: this can decrease the similarity of an ortholog.
Wrong x-drop alignments
This problem can be fixed by either increasing x so that the correct alignment is found, or decreasing x so that the incorrect alignment is not found and the correct alignment is found in two parts. Unfortunately, different values of x fix different cases, and there is no reasonable value that fixes all cases.
New alignments of repeats
Nevertheless, orthology search is likely harder for repeats than non-repeats. The human-chimp 1-split alignments mostly have around 98 % identity, but many of the repeat alignments have lower %-identity (Fig. 7, panel 2). The likely explanation is that many of these repeat alignments are paralogous.
This problem mostly vanishes in the final 2-split alignments (Fig. 7, panel 3). Now the repeat alignments also have around 98 % identity, although they have slightly higher variance: they more often have both higher and lower %-identity. This higher variance is not surprising as repeat alignments tend to be shorter (Fig. 7, panel 4), simply because longer alignments are less likely to be 100 % repetitive.
Surprisingly, the human-dog 1-split repetitive alignments do not have reduced %-identity (Fig. 7, panel 5). A possible explanation is that the human-chimp paralogous alignments are mostly due to poor genome assembly: orthologous human-chimp repeats are often very young, with low divergence, and thus hard to assemble.
Badness of HoxD55
Alignment of the D. melanogaster and D. pseudoobscura genomes worked less well: the LAST scores were lower than the UCSC scores in 13 % of cases (Fig. 4). This was the only comparison to use the HoxD55 scheme (Table 1). Inspection of several cases revealed that the LAST failures are due to the x-drop problem described above, which evidently occurs much more often with HoxD55. This scoring scheme has a high tolerance for aligning unrelated sequences , which presumably exacerbates the x-drop error.
Comparison to other aligners
Many genome alignment methods have been proposed, though most have in common an approach of looking for chains of colinear alignments. In addition to UCSC, let us consider VISTA  and Mauve  as representative examples.
In the VISTA human-chimp alignments, the vast majority of aligned bases are identical to LAST (Fig. 8). Nevertheless, there are many LAST alignments that have no aligned bases in common with VISTA: for some of these, the human segment is covered by one VISTA alignment, in which case we can compare the %-identities for that human segment. The VISTA %-identity is almost always lower, often much lower (Fig. 8). In fact, VISTA has many more very low %-identity alignments (e.g. <60 %) than UCSC (Fig. 4, lower-left panel). Inspection of several cases revealed errors similar to that in Fig. 5. The likely reason is that VISTA uses colinearity more aggressively than UCSC, by globally aligning genome regions defined by chains.
As another comparison, the two Drosophila genomes were aligned using progressiveMauve version 2.3.1 with default parameters. The result is conservative, with fewer aligned bases than LAST, and few cases where Mauve aligns the same region of melanogaster to a different region of pseudoobscura (Fig. 8). In these few cases, Mauve’s %-identity is usually much lower, apart from the same two LAST errors mentioned above. Although Mauve also uses aggressive global alignment, it subsequently detects and removes alignments of unrelated sequences, to avoid errors like that in Fig. 5 [22, 31].
Good alignment depends on using reasonable score/model parameters (Table 1), and we can check whether they match the substitution and gap frequencies in the alignments. This is only a rough check, because the alignments are not perfect: in particular, the gap existence counts may be underestimates due to “gap attraction” and “gap annihilation” [13, 32].
The main observation is that the gap costs for human-chimp.v2 are unduly large: a better fit would be obtained with a gap existence cost of 500 and a gap extension cost of 30. So we re-did the ape alignments with these costs, then re-counted substitutions and gaps.
The substitution and gap frequencies, and corresponding scores, are shown in Appendix C. They do not differ greatly from the alignment models. However, these frequencies are not uniform across the genome, and averaged parameters may not be ideal. For example, a (match score):(mismatch cost) ratio of 1:1 is appropriate for ∼75 % identity, 1:2 for ∼95 % identity, and 1:3 for ∼99 % identity . To investigate, we deleted gap columns then measured substitution rates in 200bp windows of the alignments (Fig. 9, row 2). The substitution rates are not uniform, but they do not vary arbitrarily: for instance, human-mouse alignments hardly ever have ≥90 % identity. In summary, the used parameters are not wildly unreasonable.
Alignathon simulation test
Finally, we tested our method on the “Alignathon” simulated genomes . Simulation has the advantage that the true alignments are known, but the disadvantage that the simulation’s realism is unknown. For instance, the simulation presumably lacks rearrangements like that in Fig. 1d.
There are two simulations: one of four ape-like genomes, and one of five mammal-like genomes. The “truths” are multiple (not pair-wise) alignments, and, in our understanding, they align all homologs (including paralogs, but excluding mobile element insertions) that have duplicated since the most recent common ancestor of the genomes. This unfortunately does not match our approach of finding one-to-one orthologs. In any case, for each simulation we made pair-wise alignments with LAST, then joined them into multiple alignments with mafTransitiveClosure , which joins pair-wise into multiple alignments in a naïve way.
For the ape simulation, LAST achieved a precision of 0.998 and a recall of 0.978. All other aligners had lower precision (Table S13 in ). For the mammal simulation, LAST achieved precision=0.827 and recall=0.612. Several aligners have higher precision, however all but one of those have much lower recall (Tables S15–16 in ). The exception is Cactus, with precision=0.885 and recall=0.734.
To understand why Cactus has higher precision, let us focus on pair-wise alignments between simHuman and simMouse. LAST aligns 124 million pairs of bases, of which 25 million are wrong, and 464 thousand lie in completely-wrong alignments (no pair of aligned bases in common with the truth). Cactus aligns 131 million pairs of bases, of which 18 million are wrong, and 6 million lie in completely-wrong alignments. So LAST is much better at avoiding completely-wrong alignments, whereas Cactus excels at accuracy of partly-right alignments. The latter is not surprising, because Cactus is a true multiple aligner: it takes pair-wise alignments from an external source (potentially LAST), and combines and refines them by integrating the information from all the sequences .
The new genome alignment method is conceptually extremely simple, it just seeks an optimal set of one-to-one alignments. Despite decades of extensive research on alignment, alignment sets have been surprisingly neglected, although they are often what is really wanted, e.g. for multi-domain proteins.
The new method is obviously crude, because it ignores phylogeny and many-to-many orthology. It will fail in cases of reciprocal gene loss, where one copy of a paralog is absent in one genome and the other copy is missing from the other genome. Such hidden paralogy is a major problem in understanding evolution .
Nevertheless, the new method seems to fix thousands of non-orthologous parts in previous genome alignments. The previous errors were caused by an over-aggressive assumption of conserved order and orientation. For example, in many cases in Table 4, UCSC finds the same alignment as LAST in its initial (many-to-many) “chains” but omits it from its final (many-to-one) “net” alignments, because it prefers weaker alignments in stronger chains. There is a widespread paradigm of trying to align long colinear blocks (often using “chains” or “anchors”), which risks producing non-orthologous or even non-homologous alignments. The ideal approach is probably to use a weaker preference for conserved order and orientation, e.g. via prior probabilities in a statistical model.
The use of a probabilistic model is a key advantage, since it quantifies the ambiguity of each aligned base. Similar probabilistic methods have been applied before to individual alignments [11, 13, 14, 21], but apparently not to alignment sets.
We found that pre-masking is dangerous for orthology search, which is probably not widely recognized since it is not dangerous for typical BLAST homology searches. Unfortunately, genome alignment without masking is much more compute-intensive, even with adaptive seeds and gapless alignment culling. Probably, better heuristics could be developed to tackle this.
We also found that the gapped x-drop heuristic can sometimes produce bad alignments (Fig. 6). This is important because x-drop is widely used (e.g. BLAST), the bad alignments are not immediately obvious (probably they are usually overlooked), and this problem does not seem to have been described before. Unfortunately, it is unclear how to fix it, save by applying the repeated matches algorithm directly to the genomes (which seems feasible on a large supercomputer).
Split-alignment has applications beyond whole genome comparison. It can be used to map DNA or RNA reads to a genome. “Mapping” is orthology search (since paralogs are not wanted), and reads are genome fragments (possibly rearranged), so it is all the same thing. Since different reads may redundantly cover the same query bases, we would seek many-to-one alignments, i.e. stop at the 1-split stage. Our method incorporates fastq quality data into the model and scoring . The statistical model, which quantifies the (un)ambiguity of each alignment part, is a major benefit for finding reliable rearrangement breakpoints.
The new method aligns the majority of genomic bases identically to previous methods, as expected. Nevertheless, around 100 million human bases, which overlap a number of protein-coding regions, are in completely different alignments. The new alignments should be especially beneficial when searching for interesting and unusual features in genome evolution, because these are particularly confounded by alignment errors. One example is accelerated evolution, which is mimicked by paralogy. Another is micro-rearrangements, which are systematically missed in standard genome alignments based on colinearity [38, 39]. Indeed the new alignments suggest many interesting rearrangements (e.g. Fig. 5), but unfortunately it is not straightforward to tell true rearrangements from assembly errors. The new alignments are available at: . The software is available at , and also in the last-align package for Debian and Ubuntu .
Materials and methods
where beg(i) is the coordinate of the first query letter in alignment i, and beg= min(beg(i)). The optimal split-alignment score is W end, where end= max(end(i)), and end(i) is one-past the last query letter in alignment i. This only calculates the score, but an optimal split-alignment can then be found by a standard traceback procedure .
These assemblies were used: panTro4, ponAbe2, canFam3, mm10, dp4, dm3 (without chrUextra), and hg19 (without alternate haplotypes and with the chrY pseudo-autosomal regions replaced by ‘n’s).
The UCSC genome alignments were taken from the axtNet subdirectories of these directories: hg19/vsPanTro4, hg19/vsPonAbe2, hg19/vsCanFam3, hg19/vsMm10, dm3/vsDp4.
The VISTA alignment was taken from: .
Lowercase-masked genomes were obtained from the UCSC database. Tandem repeats found by tantan version 13 were additionally masked, in order to prevent non-homologous alignments more reliably .
The genomes were lowercase-masked by tantan only, then aligned case-insensitively, and at the very end each alignment was rescored with gentle masking of lowercase letters : if it lacked any segment with score ≥e it was discarded.
The sensitive transition seed set MAM8 was used by default . For the closely-related apes, the spaced seed 1111110 was used instead. For human-dog and human-mouse with post-masking, since the number of indexed bases roughly doubles without masking, MAM4 was used to avoid a too-large index.
LAST’s seed rareness limit m was empirically set to 50 for the ape alignments and 100 for the others. The score threshold e was set to values with borderline statistical significance, using ALP : 5000 for the flies with HoxD55, 4000 for the flies with HoxD70, 4500 for mammals with HoxD70, and 3000 for the apes. The alignment existence cost f and the maximum gapped score drop x were both set to e−1.
To illustrate, the Drosophila HoxD70 alignments can be constructed with LAST v535 as follows. First, run tantan on both genomes, with default settings. Then, make the 1-split alignments like this: lastdb¡-uMAM8¡x¡dp4.fa lastal¡-pHOXD70¡-e4000¡-C2¡-m100¡x¡dm3.fa¡| last-split¡-m1¡>¡1.maf Next, make the 2-split alignments like this: maf-swap¡1.maf¡|¡last-split¡-m1¡>¡2.maf Finally, run last-postmask on 1.maf and 2.maf.
Alignathon ape test
These query-reference pairs were aligned: simChimp-simHuman, simGorilla-simHuman, simOrang-simHuman. The alignment procedure was the same as for the real ape genomes (lastdb option -m1111110, and lastal options -phuman-chimp.v2 -a500 -b30 -e3000 -C2 -m50). Alignments with error probability ≤ 0.00001 were retained, and joined by mafTransitiveClosure.
Alignathon mammal test
These query-reference pairs were aligned: simDog-simHuman, simMouse-simHuman, simRat-simMouse, simCow-simDog. Since the simulated genomes are smaller than the real ones, we used MAM8 instead of MAM4 (lastdb option -uMAM8, and lastal options -pHOXD70 -e4500 -C2 -m100). Alignments with error probability ≤ 0.00001 were retained, and joined by mafTransitiveClosure.
The data set supporting the results of this article is available in the Zenodo repository .
a Score-per-length is computed for whole alignments, not overlapping parts.
Appendix A: Statistical models
The aim here is to explain and motivate the statistical model of alignments, and the f parameter (alignment existence cost). It is instructive to first consider models of segments, such as hydrophobic segments in protein sequences. Segments are a simpler (1-dimensional) analog of alignments.
A simple model is for segments to have independent letters with frequencies π x , while background (non-segment) regions have letter frequencies θ x . Given a sequence, we can then seek maximal-likelihood segments.
A.2 Segment sets
Thus, a segment existence cost f arises naturally from model probabilities of starting and ending a segment.
A.4 Alignment sets
The score parameters and t are the same as in the single-alignment model, so the only new parameter is f.
Appendix B: Probability calculation
Each column’s error probability is one minus its model probability.
The practical implementation of this Forward-Backward algorithm uses scaling to avoid numerical instability .
Appendix C: Substitution/gap counts
Substitution and gap probabilities and scores inferred from genome alignments
a c g t
a c g t
a 77 -300 -212 -337
c -300 100 -275 -212
g -212 -275 100 -300
t -337 -212 -300 77
gap existence probability: 0.00077
gap existence cost: 495
gap extension probability: 0.65
gap extension cost: 27
a c g t
a c g t
a 77 -248 -154 -287
c -248 100 -222 -154
g -154 -222 100 -248
t -287 -154 -248 77
gap existence probability: 0.0018
gap existence cost: 448
gap extension probability: 0.65
gap extension cost: 28
a c g t
a c g t
a 77 -126 -38 -154
c -126 100 -121 -38
g -38 -121 100 -126
t -154 -38 -126 77
gap existence probability: 0.012
gap existence cost: 429
gap extension probability: 0.73
gap extension cost: 25
a c g t
a c g t
a 79 -114 -27 -136
c -114 100 -115 -27
g -27 -115 100 -114
t -136 -27 -114 79
gap existence probability: 0.015
gap existence cost: 451
gap extension probability: 0.73
gap extension cost: 27
a c g t
a c g t
a 92 -123 -59 -139
c -123 100 -117 -59
g -59 -117 100 -123
t -139 -59 -123 92
gap existence probability: 0.016
gap existence cost: 445
gap extension probability: 0.73
gap extension cost: 27
We thank Hugues Richard and Anish Shrestha for discussions while developing the method. This work was supported by KAKENHI Grant Number 26700030 to MCF, and International Research Fellowship of the Japan Society for the Promotion of Science PE11014 to HR.
- Frith MC. A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res. 2011; 39:23.View ArticleGoogle Scholar
- Frith MC. Gentle masking of low-complexity sequences improves homology search. PLoS ONE. 2011; 6:28819.View ArticleGoogle Scholar
- Kuzniar A, van Ham RC, Pongor S, Leunissen JA. The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 2008; 24:539–51.PubMedView ArticleGoogle Scholar
- Altenhoff AM, Dessimoz C. Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput Biol. 2009; 5:1000262.View ArticleGoogle Scholar
- Altenhoff AM, Dessimoz C. Inferring orthology and paralogy. Methods Mol Biol. 2012; 855:259–79.PubMedGoogle Scholar
- Sonnhammer E, Gabaldon T, Wilter Sousa da Silva A, Martin M, Robinson-Rechavi M, Boeckmann B, Thomas P, Dessimoz C. Big Data and Other Challenges in the Quest for Orthologs. Bioinformatics. 2014; 30(21):2993–8.PubMed CentralPubMedView ArticleGoogle Scholar
- Dewey CN. Positional orthology: putting genomic evolutionary relationships into context. Brief Bioinformatics. 2011; 12:401–12.PubMed CentralPubMedView ArticleGoogle Scholar
- Dewey CN, Pachter L. Evolution at the nucleotide level: the problem of multiple whole-genome alignment. Hum Mol Genet. 2006; 15 Spec No 1:51–6.View ArticleGoogle Scholar
- Passarge E, Horsthemke B, Farber RA. Incorrect use of the term synteny. Nat Genet. 1999; 23:387.PubMedView ArticleGoogle Scholar
- Dewey CN. Whole-genome alignment. Methods Mol Biol. 2012; 855:237–57.PubMedGoogle Scholar
- Durbin R, Eddy S, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press; 1998.View ArticleGoogle Scholar
- Yu YK, Altschul SF. The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics. 2005; 21:902–11.PubMedView ArticleGoogle Scholar
- Lunter G, Rocco A, Mimouni N, Heger A, Caldeira A, Hein J. Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Res. 2008; 18:298–309.PubMed CentralPubMedView ArticleGoogle Scholar
- Hudek AK, Brown DG. FEAST: sensitive local alignment with multiple rates of evolution. IEEE/ACM Trans Comput Biol Bioinform. 2011; 8:698–709.PubMedView ArticleGoogle Scholar
- Nánási M, Vinar T, Brejová B. Probabilistic approaches to alignment with tandem repeats. Algorithms Mol Biol. 2014; 9:3.PubMed CentralPubMedView ArticleGoogle Scholar
- Zhang Z, Berman P, Wiehe T, Miller W. Post-processing long pairwise alignments. Bioinformatics. 1999; 15:1012–1019.PubMedView ArticleGoogle Scholar
- Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981; 147:195–7.PubMedView ArticleGoogle Scholar
- Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol. 1982; 162:705–8.PubMedView ArticleGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25:3389–402.PubMed CentralPubMedView ArticleGoogle Scholar
- Zhang Z, Berman P, Miller W. Alignments without low-scoring regions. J Comput Biol. 1998; 5:197–210.PubMedView ArticleGoogle Scholar
- Frith MC, Hamada M, Horton P. Parameters for accurate genome alignment. BMC Bioinformatics. 2010; 11:80.PubMed CentralPubMedView ArticleGoogle Scholar
- Darling AE, Mau B, Perna NT. progressive Mauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS ONE. 2010; 5:11147.View ArticleGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990; 215:403–10.PubMedView ArticleGoogle Scholar
- Lopez R, Silventoinen V, Robinson S, Kibria A, Gish W. WU-Blast2 server at the European Bioinformatics Institute. Nucleic Acids Res. 2003; 31:3795–798.PubMed CentralPubMedView ArticleGoogle Scholar
- Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011; 21:487–93.PubMed CentralPubMedView ArticleGoogle Scholar
- Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, et al. Human-mouse alignments with BLASTZ. Genome Res. 2003; 13:103–7.PubMed CentralPubMedView ArticleGoogle Scholar
- Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003; 100:11484–11489.PubMed CentralPubMedView ArticleGoogle Scholar
- Berman P, Zhang Z, Wolf YI, Koonin EV, Miller W. Winnowing sequences from a database search. J Comput Biol. 2000; 7:293–302.PubMedView ArticleGoogle Scholar
- Frith MC, Park Y, Sheetlin SL, Spouge JL. The whole alignment and nothing but the alignment: the problem of spurious alignment flanks. Nucleic Acids Res. 2008; 36:5863–871.PubMed CentralPubMedView ArticleGoogle Scholar
- Dubchak I, Poliakov A, Kislyuk A, Brudno M. Multiple whole-genome alignments without a reference organism. Genome Res. 2009; 19:682–9.PubMed CentralPubMedView ArticleGoogle Scholar
- Treangen TJ, Darling AE, Achaz G, Ragan MA, Messeguer X, Rocha EP. A novel heuristic for local multiple alignment of interspersed DNA repeats. IEEE/ACM Trans Comput Biol Bioinform. 2009; 6:180–9.PubMedView ArticleGoogle Scholar
- Lunter G. Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics. 2007; 23:289–96.View ArticleGoogle Scholar
- States DJ, Gish W, Altschul SF. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods. 1991; 3:66–70.View ArticleGoogle Scholar
- Earl D, Nguyen N, Hickey G, Harris RS, Fitzgerald S, Beal K, et al. Alignathon: a competitive assessment of whole-genome alignment methods. Genome Res. 2014; 24:2077–089.PubMed CentralPubMedView ArticleGoogle Scholar
- Paten B, Earl D, Nguyen N, Diekhans M, Zerbino D, Haussler D. Cactus: Algorithms for genome multiple sequence alignment. Genome Res. 2011; 21:1512–1528.PubMed CentralPubMedView ArticleGoogle Scholar
- Kuraku S. Palaeophylogenomics of the vertebrate ancestor–impact of hidden paralogy on hagfish and lamprey gene phylogeny. Integr Comp Biol. 2010; 50:124–9.PubMedView ArticleGoogle Scholar
- Frith MC, Wan R, Horton P. Incorporating sequence quality data into alignment improves DNA read mapping. Nucleic Acids Res. 2010; 38:100.View ArticleGoogle Scholar
- Chaisson MJ, Raphael BJ, Pevzner PA. Microinversions in mammalian evolution. Proc Natl Acad Sci U S A. 2006; 103:19824–19829.PubMed CentralPubMedView ArticleGoogle Scholar
- Hou M, Yao P, Antonou A, Johns MA. Pico-inplace-inversions between human and chimpanzee. Bioinformatics. 2011; 27:3266–275.PubMed CentralPubMedView ArticleGoogle Scholar
- Genome alignments from “Split-alignment of genomes finds orthologies more accurately”. http://last.cbrc.jp/genome/.
- LAST: genome-scale sequence comparison. http://last.cbrc.jp/.
- Möller S, Krabbenhöft HN, Tille A, Paleino D, Williams A, Wolstencroft K, et al. Community-driven computational biology with Debian Linux. BMC Bioinformatics. 2010; 11:5.View ArticleGoogle Scholar
- Human Feb 2009- Chimp Feb 2011 pairwise alignments. http://pipeline.lbl.gov/data/hg19_panTro4.
- Frith MC, Noé L. Improved search heuristics find 20,000 new alignments between human and mouse genomes. Nucleic Acids Res. 2014; 42:59.View ArticleGoogle Scholar
- Sheetlin S, Park Y, Spouge JL. The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment. Nucleic Acids Res. 2005; 33:4987–994.PubMed CentralPubMedView ArticleGoogle Scholar
- Genome alignments from “Split-alignment of genomes finds orthologies more accurately”. https://zenodo.org/record/17436.
- Hastings PJ, Lupski JR, Rosenberg SM, Ira G. Mechanisms of change in gene copy number. Nat Rev Genet. 2009; 10:551–64.PubMed CentralPubMedView ArticleGoogle Scholar
- Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002; 7:115–26.Google Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License(http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.