Phylogenetic assessment of alignments reveals neglected tree signal in gaps
© Dessimoz and Gil; licensee BioMed Central Ltd. 2010
Received: 21 August 2009
Accepted: 6 April 2010
Published: 6 April 2010
The alignment of biological sequences is of chief importance to most evolutionary and comparative genomics studies, yet the two main approaches used to assess alignment accuracy have flaws: reference alignments are derived from the biased sample of proteins with known structure, and simulated data lack realism.
Here, we introduce tree-based tests of alignment accuracy, which not only use large and representative samples of real biological data, but also enable the evaluation of the effect of gap placement on phylogenetic inference. We show that (i) the current belief that consistency-based alignments outperform scoring matrix-based alignments is misguided; (ii) gaps carry substantial phylogenetic signal, but are poorly exploited by most alignment and tree building programs; (iii) even so, excluding gaps and variable regions is detrimental; (iv) disagreement among alignment programs says little about the accuracy of resulting trees.
This study provides the broad community relying on sequence alignment with important practical recommendations, sets superior standards for assessing alignment accuracy, and paves the way for the development of phylogenetic inference methods of significantly higher resolution.
The study of biological sequences almost inevitably begins with the process of alignment. The goal of this process is usually to match homologous characters, that is, characters that have a common ancestry. In turn, these sets of homologs, the columns of the alignment, can be used for a variety of applications, such as identifying residues with analogous structural or functional roles, or inferring the phylogenetic tree of the underlying sequences. The accuracy of multiple sequence alignment programs has been the object of numerous comparative studies [2–4], which evaluate alignments either by using trusted reference alignments obtained from structural data, or by using simulation. Unfortunately, both approaches have flaws. Trusted benchmark alignments such as Balibase, Prefab, Homstrad, or Sabmark [5–8] are all derived from protein structure information, exploiting the tendency of structure to evolve more slowly than sequence.
However, proteins with resolved structure remain a small and highly biased sample of all proteins [10, 11]. In addition, homology inferred from structural information is inherently restricted to conserved regions, thereby providing little guidance for correct gap placement. The other approach to validating alignments is simulation [12–18]. Yet, results obtained from simulated data strongly depend on the choice of model used to generate the data, and most biological processes are difficult to model realistically. For instance, current insertion-deletion models are known to be insufficient. Even if a good model can be formulated, it will never fully capture the complexity of real biological data. Consequently, the results observed on simulated data differ significantly from those measured on empirical data.
Results and discussion
There is, therefore, a need for alternative evaluation procedures that do not rely on structural information and that are applicable to a large and representative sample of real biological data. In this work, we propose two such tests. We then show how they offer answers to three of the most important open questions regarding sequence alignment for phylogenetic inference: (i) Which alignment approach leads to the most accurate trees? (ii) Are gap regions informative for phylogenetic inference or should they be ignored? (iii) What is the impact of alignment uncertainty on tree inference?
Phylogeny-based tests of alignment accuracy
Assessment of alignment methods
To address the question of alignment accuracy, we used the tests to evaluate 13 MSA software packages, which can be classified into roughly three alignment scoring strategies: scoring matrix-based (Mafft FFT-NS-2, Muscle, Clustal W2, DiAlign/-T/-TX, Kalign [6, 27–33]); consistency-based (Mafft L-INS-i, T-Coffee, Mummals, ProbCons, ProbAlign [27, 28, 34–37]); and tree-aware gap-placing (Prank). We tested the alignment software both on amino-acid and on nucleotide data, with the exception of Mummals and ProbCons, which only run on amino-acid data. For the species-tree discordance test, we sampled sets of 6 orthologs as inferred by OMA among 57 eukaryotic, 11 fungal, and 418 bacterial genomes, under the constraint that the branching order of the species represented in each set be well-accepted (Additional file 1, Figure S1). For the minimum duplication test, we retrieved groups of up to 60 homologs from 18 metazoan and 18 fungal genomes. Trees were reconstructed by maximum likelihood (ML) from both amino-acid and nucleotide alignments. In addition, to compare the two types of alignments under the same evolutionary model, ML trees were also reconstructed from back-translated amino-acid alignments, using the actual codons from the corresponding nucleotide sequences. In total, the tests required computing over 100,000 alignments of up to 60 sequences, at a cost of over 20,000 CPU hours.
To limit the risk of systematic biases or unrecognized factors, these observations were confirmed by two kinds of controls. First, we considered the effect of the tree building method used in the test procedure. We ran the tests under a different model of evolution and using least squares distance trees instead of ML. The results were highly consistent (Additional file 1, Figures S8 and S9; the relative accuracy of the two methods correlates with r = 0.90, P < 10⁻¹⁰, t-test). Second, we tested the dependence of the results on characteristics of the input data. We re-evaluated the tests on partitioned data and estimated the correlation between the relative accuracy on each partition and that on the full dataset. The data were segmented according to sequence length (Additional file 1, Figure S10, r = 0.62, P < 10⁻¹⁰), sequence divergence (Additional file 1, Figure S11, r = 0.67, P < 10⁻¹⁰) and number of sequences (Additional file 1, Figure S12, r = 0.89, P < 10⁻¹⁰). Furthermore, we contrasted the results of different pairs of lineages (Additional file 1, Figure S6, 0.68 < r ≤ 0.94, all P < 10⁻³). In all cases, our conclusions above stand.
Guide trees make or break progressive alignments
Since sequence insertion and deletion events are generally assumed to take place along a tree, most aligners rely on guide trees to construct and score alignments. Some of them - in our case Mafft, Muscle, Clustal W2, T-Coffee and Prank - allow specification of the guide tree by the user. To investigate their sensitivity to tree specification, we ran the species-tree discordance test on two extreme cases: we provided either a random guide tree, or the reference species tree as guide (Additional file 1, Figure S13). Unsurprisingly, the input trees hardly affected methods refining their guide trees iteratively (Muscle) or relying strongly on consistency (T-Coffee), a mostly tree-independent objective function. In contrast, strictly progressive methods (Mafft-FFT, Clustal W2, Prank) were highly sensitive to the provided guide tree. With such methods, guide tree specification is a double-edged sword: prior knowledge of the underlying sequence phylogeny, depending on its accuracy, can either improve the resulting alignments, or worsen them. Consequently, if the tree is known with high confidence, we recommend using it in conjunction with Prank or Mafft. If not, one might wonder which program infers the best guide trees, and whether feeding them to the other aligners could improve results overall. Our results suggest that on average, the best guide trees are inferred by Prank on amino-acid data, and Mafft on nucleotide data (Additional file 1, Figure S14). The difference is however not sufficiently large that the other alignment methods consistently profit from these improved guide trees (Additional file 1, Figure S15).
Gaps carry substantial unexploited tree signal
Excluding gaps and variable regions harms
Alignment variability poorly predicts tree accuracy
We have seen that different alignment programs can give rise to trees of varying accuracy. But in the broader context of tree inference, sequence alignment is not the only source of tree uncertainty. By 'uncertainty', we mean the expected sum of systematic and random error, that is, the expected inaccuracy. For instance, the amount of input data (that is, sequence lengths), the divergence between sequences, the model of evolution, or the tree searching algorithm all affect the accuracy of reconstructed trees, and one's confidence therein. This raises the question of the relative contribution of alignment uncertainty to tree uncertainty. Wong et al. recently quantified the observation that different alignment programs often lead to different tree topologies. They found a correlation (Spearman rank correlation r_s = 0.53) between alignment variability (average distance between alignments from different methods) and tree variability (average topological distance among trees estimated from different alignment methods). But constrained by the lack of a measure of total tree error, their analysis only focused on the random component of tree uncertainty. We exploited the tree accuracy measure from the species-tree discordance test to estimate the correlation between alignment variability and tree accuracy. Interestingly, accounting for both random and systematic errors suggests a weaker connection between alignment and tree quality: the negative correlation between alignment variability and tree accuracy was low for amino-acid and back-translated data (Additional file 1, Figure S20, -r_s < 0.16, P < 0.01, t-test). Thus, alignment variability says little about overall tree uncertainty for amino-acid alignments. To put the results into perspective, we also estimated the correlation between bootstrap tree support and tree accuracy.
Surprisingly, even though bootstrap assumes correct alignments, it was a consistently better predictor of tree accuracy than alignment variability (Additional file 1, Figure S20, r_s,Bootstrap > -r_s,AlignmentVar, P < 0.006, see methods). For nucleotide alignments, shown above to be often worse than amino-acid alignments, we found a higher correlation between alignment variability and tree accuracy than for the amino-acid counterparts. Still, alignment variability was never a better predictor of tree accuracy than tree support (Additional file 1, Figure S20). Since tree support is usually computed anyway, this casts doubt on the usefulness of trying more than one alignment method for the purpose of phylogenetic inference. Rather, we recommend that practitioners stick with an accurate alignment method, as identified by tests such as the ones presented here.
In summary, the use of trees rather than protein structure to assess alignments is advantageous in that it more closely fits a common application of alignments, it is not restricted to the relatively small and biased sample of proteins with known structure, and it also allows the evaluation of gap regions. Indeed, our results show that consistency-based alignment methods, which score best in structural benchmarks, do not yield significantly better trees than their scoring matrix-based counterparts. Our tests also demonstrate that gaps often carry a strong phylogenetic signal, which at present is not well exploited, either by most alignment methods, or by standard tree building methods; but even with such methods, excluding gaps and variable regions worsens the resulting trees. Finally, the low correlation we observed between alignment variability and tree accuracy suggests that there is little to gain from the common practice of trying more than one alignment program on a given dataset. This latter result, as well as the analysis of the impact of guide tree specification, relies exclusively on the species-tree discordance test, because both require knowledge of a reference topology. As such, the conclusions are based on six-taxa trees only. How well they generalize to larger trees is yet to be investigated. Besides, further interesting questions remain: how do alignment methods perform on data not represented in this study, such as promoter regions or other non-coding sequences? How can we best extend our current models of sequence evolution to take into account the phylogenetic signal of gap patterns? How do the methods investigated here compare with the statistical approach of joint alignment and tree inference? The methodology introduced here gives us the means to investigate these issues.
Beyond alignments, the ability to measure tree accuracy under realistic conditions allows assessment of further important aspects of phylogeny inference, such as evolutionary models, tree building algorithms, or tree confidence measures.
Materials and methods
Sets of orthologous protein sequences
The Species Tree Discordance Test was performed on three sets of species: eukaryotes, fungi, and bacteria (detailed list in Supplementary Information Sect. 1.1). For all three sources of data, we retrieved sets of orthologs as inferred by OMA (Release of September 2008). Although cases of misclassification cannot be excluded, it has been shown in a previous study that the false-positive rate of OMA's predictions is low compared with other similar projects. More importantly, though the presence of non-orthologs reduces the power of our test, it does not bias the results toward a particular alignment program. Sequences were sampled according to reference trees with a comb topology (Additional file 1, Figure S1). This topology ensures that all sequences in a sample are orthologous to each other. In each trial, a starting sequence from a random species in the innermost leaf was randomly chosen. Then, for each remaining leaf, a random orthologous sequence was sampled.
Sets of homologous protein sequences
We performed the Minimum Duplication test on two sets of organisms: metazoa and fungi (detailed list in Supplementary Information Sect. 1.2). Sets of homologs were constructed by taking the transitive closure of pairs of sequences with high alignment scores (E-value below 10⁻¹⁰). The sets were restricted to a maximal size of 60 sequences by removing sequences randomly from sets of excessive cardinality.
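The transitive closure described above amounts to finding connected components in a graph whose edges are the significant pairs. The following is a minimal sketch of that idea, not the authors' implementation: the function name `homolog_sets`, the `seed` parameter, and the assumption that `pairs` has already been filtered by E-value are all illustrative.

```python
import random
from collections import defaultdict


def homolog_sets(pairs, max_size=60, seed=0):
    """Group sequence ids by transitive closure of significant pairs.

    `pairs` is assumed to contain only pairs that already passed the
    E-value cutoff. Sets larger than `max_size` are randomly downsampled.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for seq in list(parent):
        groups[find(seq)].append(seq)

    rng = random.Random(seed)  # hypothetical fixed seed, for reproducibility
    result = []
    for members in groups.values():
        if len(members) > max_size:
            members = rng.sample(members, max_size)
        result.append(sorted(members))
    return sorted(result)
```

For example, the pairs (a, b), (b, c) and (d, e) yield the two sets {a, b, c} and {d, e}, even though a and c were never compared directly.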
Definition: absolute minimum number of duplications
For any set of homologous genes, consider the partition of the sequences according to their genome of origin: each resulting part consists of same-species paralogs. Let m be the maximum cardinality of these parts. For m paralogs to be observed in the same genome, at least m-1 duplications had to take place. We refer to m-1 as the absolute minimum number of duplications for the set of homologs.
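As a small illustration of this definition, the sketch below computes m-1 from a mapping of gene identifiers to species. The function name and the dictionary input format are assumptions made for the example.

```python
from collections import Counter


def absolute_min_duplications(species_of):
    """Return m - 1, where m is the largest number of genes from one genome.

    `species_of` maps each gene id in the homolog set to its species.
    An empty set of homologs raises ValueError.
    """
    counts = Counter(species_of.values())
    return max(counts.values()) - 1
```

With three human genes and one mouse gene in a set, m = 3 and the absolute minimum number of duplications is 2.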
Species-tree discordance test
The species-tree discordance test evaluates a sequence alignment program in terms of the average accuracy of the trees reconstructed from its alignments. The test requires a large number of sequence sets whose phylogeny is known. Given that orthologous genes (by definition) follow the species tree, we sampled orthologs provided by OMA from species with known and undisputed branching order (Additional file 1, Figure S1). Agreement between obtained and reference topologies was quantified by the proportion of wrong splits.
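One way to compute a proportion of wrong splits is sketched below. It assumes trees are given as nested tuples of leaf names and interprets "wrong" as an inferred non-trivial bipartition that is absent from the reference tree; both the representation and this interpretation are assumptions for illustration.

```python
def splits(tree):
    """Return the non-trivial bipartitions of a tree given as nested tuples."""
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - clade
        # Keep only splits with at least two leaves on each side.
        if len(clade) >= 2 and len(other) >= 2:
            found.add(frozenset([clade, other]))
        return clade

    visit(tree)
    return found


def proportion_wrong_splits(inferred, reference):
    """Fraction of inferred splits that do not occur in the reference."""
    inferred_splits = splits(inferred)
    if not inferred_splits:
        return 0.0
    wrong = inferred_splits - splits(reference)
    return len(wrong) / len(inferred_splits)
```

A six-taxon comb tree has three non-trivial splits, so swapping two neighboring taxa near the tip of the comb gives a proportion of 1/3.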
Minimum duplication test
In a gene tree, the split of two same-species paralogs is necessarily a duplication event. By a parsimony argument, the tree with the fewest duplication splits represents the most likely evolutionary history. The minimum duplication test evaluates a sequence alignment program in terms of the average minimum number of gene duplication events implied by the trees reconstructed from its alignments of homologous sequences. Given a rooted tree, a lower bound on the number of duplications can be obtained by counting nodes that have subtrees with overlapping sets of species. Since the placement of the root of the tree is usually unknown, we considered all possible rootings and retained the minimum number of duplications. This measure was normalized by subtracting the absolute minimum number of duplications from it (see above). An example computation can be found in Additional file 1, Figure S2.
Gene trees were reconstructed by maximum likelihood using PhyML v. 2.4.4 from the sequences aligned with the different programs under JTT+I+Γ for amino-acids and HKY+I+Γ for nucleotides. To investigate the accuracy of gap placement, the two tests were also performed using Wagner parsimony on the presence/absence patterns of gaps (for a given alignment, each column containing at least one gap was considered a character and the presence/absence of a gap its state). To avoid over-counting, neighboring columns with identical gap-patterns were combined into single characters.
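The counting procedure can be sketched as follows, under the assumption that the unrooted gene tree is given as a list of edges and that leaves are the keys of a gene-to-species mapping. The function names and input format are illustrative, not taken from the original software.

```python
from collections import defaultdict


def subtree_summary(node, parent, adjacency, species_of):
    """Return (species set, duplication count) for the subtree below `node`."""
    if node in species_of:  # leaf: a single gene
        return {species_of[node]}, 0
    seen = set()
    duplications = 0
    overlapping = False
    for neighbor in adjacency[node]:
        if neighbor == parent:
            continue
        child_species, child_dups = subtree_summary(
            neighbor, node, adjacency, species_of)
        duplications += child_dups
        if seen & child_species:
            overlapping = True  # child subtrees share a species
        seen |= child_species
    return seen, duplications + int(overlapping)


def min_duplications(edges, species_of):
    """Minimum number of duplication nodes over all rootings (one per edge)."""
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    totals = []
    for u, v in edges:
        species_u, dups_u = subtree_summary(u, v, adjacency, species_of)
        species_v, dups_v = subtree_summary(v, u, adjacency, species_of)
        root_is_duplication = int(bool(species_u & species_v))
        totals.append(dups_u + dups_v + root_is_duplication)
    return min(totals)
```

For two human and two mouse genes, a tree grouping each human gene with a mouse gene implies one duplication (the absolute minimum, so a normalized score of 0), whereas a tree grouping the two human genes together implies two (a normalized score of 1).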
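The gap coding described above can be illustrated with a short sketch. It assumes the alignment is a list of equal-length strings with "-" as the gap symbol, and it treats a gap-free column as interrupting a run of identical patterns; the function name is hypothetical.

```python
def gap_characters(alignment, gap="-"):
    """Code gap presence/absence as binary characters.

    Each column containing at least one gap becomes a character
    (1 = gap, 0 = residue); adjacent columns with identical patterns
    are merged into a single character.
    """
    characters = []
    previous = None
    for column in zip(*alignment):
        pattern = tuple(int(symbol == gap) for symbol in column)
        if not any(pattern):
            previous = None  # gap-free column: not a character
            continue
        if pattern != previous:
            characters.append(pattern)
        previous = pattern
    return characters
```

The resulting binary matrix is what a parsimony program would take as input; the parsimony search itself is not reproduced here.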
Alternative tree building methods
As a control, we recomputed the trees using a least-squares distance approach instead of maximum likelihood: we reconstructed variance-weighted least-squares distance trees using the MinSquareTree function in Darwin. The pairwise input distances were computed by maximum likelihood using the GCB matrices for amino-acid data. For nucleotide data we used an unpublished, empirical nucleotide substitution matrix estimated from mammalian orthologs in OMA. Likewise, as an alternative (and control), we recomputed the Gap Parsimony Trees without combining repeated columns. Furthermore, for a subset of the tests we repeated the computation of the ML trees using the software RAxML v. 7.0.4.
Filtering of gaps and variable regions
We define a gap column as a column of the multiple sequence alignment in which at least one sequence has a gap character. To filter both gaps and variable regions, we used Gblocks version 0.91b with default settings. In addition and as control, we also relaxed the settings according to Talavera et al. At times, any of the three filtering variants (no gap, Gblocks default, Gblocks relaxed) could yield alignments with no column left, that is, of null length. Such samples were excluded.
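The "no gap" variant follows directly from the definition of a gap column and can be sketched as below; the Gblocks variants are not reproduced. The function name and the convention of returning `None` for an alignment of null length are assumptions for illustration.

```python
def drop_gap_columns(alignment, gap="-"):
    """Remove every column in which at least one sequence has a gap.

    Returns the filtered alignment, or None if no column is left,
    in which case the sample would be excluded.
    """
    kept = [column for column in zip(*alignment) if gap not in column]
    if not kept:
        return None
    return ["".join(row) for row in zip(*kept)]
```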
Measures to relate alignment uncertainty to tree inference
The measures used in the section Alignment Variability Poorly Predicts Tree Accuracy and Additional file 1, Figure S18 are defined as follows: Tree accuracy was measured by one minus the normalized Robinson-Foulds distance between the inferred and the accepted topology. Tree support was measured by the proportion of bootstrap replicates agreeing with the inferred topology. Tree variability was measured by the average Robinson-Foulds distance among trees estimated from different alignment methods. Alignment variability was measured by the average distance between alignments from different alignment methods. This measure has been shown to strongly correlate (Spearman's rank correlation r_s = 0.92, P < 0.0001) with Bayesian-inferred alignment variability.
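Two of these measures can be sketched in a few lines, assuming each tree has already been reduced to its set of non-trivial splits (any hashable representation of a split will do). Normalizing the Robinson-Foulds distance by the total number of splits in both trees is an assumption made here; the function names are illustrative.

```python
from itertools import combinations


def rf_distance(splits_a, splits_b):
    """Robinson-Foulds distance: number of splits found in only one tree."""
    return len(splits_a ^ splits_b)


def tree_accuracy(inferred, reference):
    """One minus the RF distance normalized by its maximum possible value."""
    total = len(inferred) + len(reference)
    if total == 0:
        return 1.0
    return 1.0 - rf_distance(inferred, reference) / total


def tree_variability(split_sets):
    """Average pairwise RF distance among at least two trees."""
    pairs = list(combinations(split_sets, 2))
    return sum(rf_distance(a, b) for a, b in pairs) / len(pairs)
```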
Comparing two correlation coefficients
The test statistic is approximately standard normally distributed, where z(·) denotes the Fisher Z-transform.
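The formula for the statistic is not reproduced in this text. A common form of such a test, assumed here for illustration, compares two correlation coefficients r1 and r2 estimated from independent samples of sizes n1 and n2 through the statistic (z(r1) - z(r2)) / sqrt(1/(n1-3) + 1/(n2-3)). The sketch below implements that textbook version; it may differ from the exact statistic used in the study, in particular if the two correlations are computed on the same samples.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Test whether r1 exceeds r2 using the Fisher Z-transform.

    Assumes independent samples of sizes n1 and n2 (both > 3).
    Returns the statistic and the one-sided upper-tail P-value
    under a standard normal null distribution.
    """
    z1 = math.atanh(r1)  # Fisher Z-transform
    z2 = math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    statistic = (z1 - z2) / standard_error
    p_value = 0.5 * math.erfc(statistic / math.sqrt(2.0))
    return statistic, p_value
```

Equal correlations give a statistic of zero and a one-sided P-value of 0.5, as expected under the null hypothesis.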
We thank Olivier Gascuel for early ideas leading to the design of the minimum duplication test, and Adrian Altenhoff, Maria Anisimova, Gina Cannarozzi, Gaston Gonnet, Heather Murray, Adrian Schneider, Jörg Stelling, Hervé Vanderschuren, as well as two anonymous reviewers for helpful remarks on the manuscript.
- Kemena C, Notredame C: Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics. 2009, 25: 2455-2465. 10.1093/bioinformatics/btp452.
- Blackshields G, Wallace IM, Larkin M, Higgins DG: Analysis and comparison of benchmarks for multiple sequence alignment. In Silico Biol. 2006, 6: 321-339.
- Edgar RC, Batzoglou S: Multiple sequence alignment. Curr Opin Struct Biol. 2006, 16: 368-373. 10.1016/j.sbi.2006.04.004.
- Notredame C: Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol. 2007, 3: e123. 10.1371/journal.pcbi.0030123.
- Thompson J, Koehl P, Ripp R, Poch O: BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins. 2005, 61: 127-136. 10.1002/prot.20527.
- Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004, 5: 113. 10.1186/1471-2105-5-113.
- Stebbings LA, Mizuguchi K: HOMSTRAD: recent developments of the homologous protein structure alignment database. Nucleic Acids Res. 2004, 32: D203-7. 10.1093/nar/gkh027.
- Van Walle I, Lasters I, Wyns L: SABmark - a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics. 2005, 21: 1267-1268. 10.1093/bioinformatics/bth493.
- Chothia C, Lesk A: The relation between the divergence of sequence and structure in proteins. EMBO J. 1986, 5: 823-826.
- Peng K, Obradovic Z, Vucetic S: Exploring bias in the Protein Data Bank using contrast classifiers. Pac Symp Biocomput. 2004, 435-446.
- Xie L, Bourne P: Functional coverage of the human genome by existing structures, structural genomics targets, and homology models. PLoS Comput Biol. 2005, 1: e31. 10.1371/journal.pcbi.0010031.
- Rosenberg MS: Evolutionary distance estimation and fidelity of pair wise sequence alignment. BMC Bioinformatics. 2005, 6: 102. 10.1186/1471-2105-6-102.
- Hall BG: Comparison of the accuracies of several phylogenetic methods using protein and DNA sequences. Mol Biol Evol. 2005, 22: 792-802. 10.1093/molbev/msi066.
- Ogden TH, Rosenberg MS: Multiple sequence alignment accuracy and phylogenetic inference. Syst Biol. 2006, 55: 314-328. 10.1080/10635150500541730.
- Nuin PAS, Wang Z, Tillier ERM: The accuracy of several multiple sequence alignment programs for proteins. BMC Bioinformatics. 2006, 7: 471. 10.1186/1471-2105-7-471.
- Kumar S, Filipski A: Multiple sequence alignment: in pursuit of homologous DNA positions. Genome Res. 2007, 17: 127-135. 10.1101/gr.5232407.
- Landan G, Graur D: Characterization of pairwise and multiple sequence alignment errors. Gene. 2009, 441: 141-147. 10.1016/j.gene.2008.05.016.
- Wang LS, Leebens-Mack J, Wall PK, Beckmann K, dePamphilis CW, Warnow T: The impact of multiple protein sequence alignment on phylogenetic estimation. IEEE/ACM Trans Comput Biol Bioinform. 2009.
- Strope CL, Abel K, Scott SD, Moriyama EN: Biological sequence simulation for testing complex evolutionary hypotheses: indel-Seq-Gen version 2.0. Mol Biol Evol. 2009, 26: 2581-93. 10.1093/molbev/msp174.
- Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool. 1970, 19: 99-113. 10.2307/2412448.
- Schneider A, Gonnet G, Cannarozzi G: SynPAM-a distance measure based on synonymous codon substitutions. IEEE/ACM Trans Comput Biol Bioinform. 2007, 4: 553-60. 10.1109/TCBB.2007.1071.
- Altenhoff AM, Dessimoz C: Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput Biol. 2009, 5: e1000262. 10.1371/journal.pcbi.1000262.
- Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE: Fitting the gene lineage into its species lineage: a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool. 1979, 28: 132-168. 10.2307/2412519.
- Slowinski JB, Page RD: How should species phylogenies be inferred from sequence data?. Syst Biol. 1999, 48: 814-25. 10.1080/106351599260030.
- Scannell DR, Byrne KP, Gordon JL, Wong S, Wolfe KH: Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature. 2006, 440: 341-5. 10.1038/nature04562.
- van der Heijden RTJM, Snel B, van Noort V, Huynen MA: Orthology prediction at scalable resolution by phylogenetic tree analysis. BMC Bioinformatics. 2007, 8: 83. 10.1186/1471-2105-8-83.
- Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005, 33: 511-518. 10.1093/nar/gki198.
- Katoh K, Toh H: Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform. 2008, 9: 286-298. 10.1093/bib/bbn013.
- Larkin MA, Blackshields G, Brown NP, Chenna R, Mcgettigan PA, Mcwilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23: 2947-2948. 10.1093/bioinformatics/btm404.
- Morgenstern B: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics. 1999, 15: 211-218. 10.1093/bioinformatics/15.3.211.
- Subramanian A, Menkhoff JW, Kaufmann M, Morgenstern B: DIALIGN-T: An improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics. 2005, 6: 66. 10.1186/1471-2105-6-66.
- Subramanian A, Kaufmann M, Morgenstern B: DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms Mol Biol. 2008, 3: 6. 10.1186/1748-7188-3-6.
- Lassmann T, Sonnhammer ELL: Kalign-an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics. 2005, 6: 298. 10.1186/1471-2105-6-298.
- Notredame C, Higgins D, Heringa J: T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000, 302: 205-217. 10.1006/jmbi.2000.4042.
- Pei J, Grishin NV: MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res. 2006, 34: 4364-4374. 10.1093/nar/gkl514.
- Do C, Mahabhashyam M, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 2005, 15: 330-340. 10.1101/gr.2821705.
- Roshan U, Livesay DR: Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics. 2006, 22: 2715-2721. 10.1093/bioinformatics/btl472.
- Löytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci USA. 2005, 102: 10557-10562. 10.1073/pnas.0409137102.
- Roth AC, Gonnet GH, Dessimoz C: The algorithm of OMA for large-scale orthology inference. BMC Bioinformatics. 2008, 9: 518. 10.1186/1471-2105-9-518.
- Dwivedi B, Gadagkar SR: Phylogenetic inference under varying proportions of indel-induced alignment gaps. BMC Evol Biol. 2009, 9: 211. 10.1186/1471-2148-9-211.
- Löytynoja A, Goldman N: Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science. 2008, 320: 1632-1635. 10.1126/science.1158395.
- Talavera G, Castresana J: Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol. 2007, 56: 564-577. 10.1080/10635150701472164.
- Aagesen L: The information content of an ambiguously alignable region, a case study of the trnL intron from the Rhamnaceae. Org Divers Evol. 2004, 4: 35-49. 10.1016/j.ode.2003.11.003.
- Simmons MP, Richardson D, Reddy ASN: Incorporation of gap characters and lineage-specific regions into phylogenetic analyses of gene families from divergent clades: an example from the kinesin superfamily across eukaryotes. Cladistics. 2008, 24: 372-384. 10.1111/j.1096-0031.2007.00183.x.
- Liu K, Raghavan S, Nelesen S, Linder CR, Warnow T: Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 2009, 324: 1561-4. 10.1126/science.1171243.
- Wong KM, Suchard MA, Huelsenbeck JP: Alignment uncertainty and genomic analysis. Science. 2008, 319: 473-476. 10.1126/science.1151532.
- Lassmann T, Sonnhammer ELL: Automatic assessment of alignment quality. Nucleic Acids Res. 2005, 33: 7120-8. 10.1093/nar/gki1020.
- Dessimoz C, Cannarozzi G, Gil M, Margadant D, Roth A, Schneider A, Gonnet G: OMA, A comprehensive, automated project for the identification of orthologs from complete genome data: Introduction and first achievements. RECOMB 2005 Workshop on Comparative Genomics, Volume LNBI 3678 of Lecture Notes in Bioinformatics. Edited by: McLysaght A, Huson DH. 2005, Berlin: Springer, 61-72.
- Robinson DF, Foulds LR: Comparison of phylogenetic trees. Math Biosci. 1981, 53: 131-147. 10.1016/0025-5564(81)90043-2.
- Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003, 52: 696-704. 10.1080/10635150390235520.
- Gonnet GH, Hallett MT, Korostensky C, Bernardin L: Darwin v. 2.0: An interpreted computer language for the biosciences. Bioinformatics. 2000, 16: 101-103. 10.1093/bioinformatics/16.2.101.
- Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science. 1992, 256: 1443-1445. 10.1126/science.1604319.
- Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006, 22: 2688-2690. 10.1093/bioinformatics/btl446.
- Schwartz AS, Pachter L: Multiple alignment by sequence annealing. Bioinformatics. 2007, 23: e24-e29. 10.1093/bioinformatics/btl311.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited