Large-scale assignment of orthology: back to phylogenetics?
© BioMed Central Ltd 2008
Published: 30 October 2008
Reliable orthology prediction is central to comparative genomics. Although orthology is defined by phylogenetic criteria, most automated prediction methods are based on pairwise sequence comparisons. Recently, automated phylogeny-based orthology prediction has emerged as a feasible alternative for genome-wide studies.
Homologous sequences - that is, those derived from a common ancestral sequence - can be further divided into two different classes according to the mode in which they diverged from their last common ancestor. The divergence of two homologous sequences by a speciation event gives rise to orthologous sequences, whereas a duplication event will define a paralogous relationship between the duplicates. Although such straightforward definitions could suggest that distinguishing paralogs and orthologs is simple, it is definitely not. For example, it is not unusual for multiple lineage-specific gene loss or duplication events, as well as other evolutionary processes, to result in intricate scenarios that are difficult to interpret. Far from being a simple curiosity, the establishment of correct orthology and paralogy relationships is crucial in many biological studies. For instance, phylogenetic analyses that aim to infer correct evolutionary relationships between several species should be based on orthologous sets of sequences. Moreover, as orthologs are, relative to paralogs, more likely to share a common function, the correct determination of orthology has deep implications for the transfer of functional information across organisms. Finally, the establishment of equivalences among genes in different genomes is a prerequisite for comparative analyses of genome-wide data to detect evolutionarily conserved traits [4, 5].
Originally defined on an evolutionary basis, orthology relationships are best established through phylogenetic analysis. This usually involves the reconstruction of a phylogenetic tree describing the evolutionary relationships among the sequences and species involved, so that speciation and duplication events can then be mapped on the nodes of the tree. This is the classical procedure for establishing orthology relationships. However, the availability of completely sequenced genomes has created the need to detect orthology at a genomic scale, a task for which the mostly manual phylogeny-based approach is not suited. Automated approaches were soon developed that inferred orthology relationships from pairwise sequence comparisons. Although these methods perform reasonably well, they have many drawbacks that can lead to annotation errors or misinterpretation of data [6, 7]. To avoid such pitfalls, and in an attempt to approximate the classical approach for detecting orthology, several automatic methods have been proposed that delineate orthology relationships from phylogenetic trees. Despite the greater accuracy of such methods compared with pairwise approaches, the large demands of time and computing power needed to generate reliable trees have limited their use to datasets of moderate size. Recently, however, the combination of automated large-scale phylogenetic reconstruction with newer algorithms is paving the way for the use of phylogeny-based methods for orthology detection at genomic scales [8, 9]. This progress is likely to have a deep impact on future comparative studies.
Homology is defined as the relationship that exists between two biological entities - for example, two sequences or two anatomic characters - that are derived from a common ancestor. In 1970, Walter Fitch coined the concepts of orthology and paralogy to distinguish two types of homology relationships between biological sequences. Orthologous sequences are those that derive by a speciation event from their common ancestor, whereas the origin of paralogous sequences can be traced back to a gene-duplication event. Despite this clear definition, orthology and paralogy are often misinterpreted by biologists. This is partly because what may seem simple when comparing pairs of closely related species easily gets complicated when wider groups of distantly related species are involved. It is sometimes wrongly claimed, for example, that only two sequences from the same species can be regarded as paralogs, or that two sequences from different species are orthologous to each other only if they perform the same biological function. I will briefly summarize here the main misunderstandings that can arise when dealing with properties of orthologous sequences (see  for a more thorough discussion), which are key to understanding why some of the methods discussed later would be more appropriate than others.
The first clarification is that orthology is a purely evolutionary concept, certainly related to, but not based on, the functionality of the sequences involved. All homologous proteins have a common ancestry and thus are expected to have similar three-dimensional structures and to perform related functions. But changes in functionality within a homologous family of proteins caused by sequence variation or context-dependency are not rare. This is especially true in the case of paralogs, because processes of neo- or subfunctionalization may favor the retention of duplicate genes. Orthologous sequences derived by speciation are, therefore, less prone to functional shifts but are definitely not free from them.
Yet another complication in defining orthology relationships among proteins is that they often comprise distinct domains that may have followed different evolutionary histories. Such evolutionary chimeras can be created by fusion and recombination events between different genes and may lead to situations in which, for example, a single member of a given protein family has recently acquired a new domain through recombination with another family. In such cases the different domains should, in principle, be treated as independent evolutionary units and orthology relationships be delineated accordingly. Thus, in multidomain families, orthology relationships should be first established among core domains and then extended, where possible, to adjacent regions.
To avoid these pitfalls and extend the procedure to multiple genome comparisons, Tatusov and colleagues introduced the concept of clusters of orthologous groups (COGs) (Figure 2b). COGs are derived from the search for 'triangular' BBH relationships across a minimum of three species, and their subsequent combination into larger groups. This strategy has been followed by many groups and is the operational definition of orthology used by many databases such as EGO and STRING.
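The triangle-based grouping can be illustrated with a minimal Python sketch. It is not the COG implementation: the input format (a `best_hit` dictionary and a `species_of` lookup), the function names and the toy gene names are all assumptions made for illustration, and triangles are merged when they share a side, that is, two genes. Best hits are assumed to be recorded only against other species.

```python
from itertools import combinations

def bbh_edges(best_hit, species_of):
    """Return reciprocal best hits as a set of two-gene frozensets.

    best_hit maps (query_gene, target_species) -> best-scoring gene in that
    species (hypothetical input format, hits against other species only).
    """
    edges = set()
    for (gene, _target), hit in best_hit.items():
        # The hit must point back to the query when searched the other way.
        if best_hit.get((hit, species_of[gene])) == gene:
            edges.add(frozenset((gene, hit)))
    return edges

def triangles(edges):
    """Find gene triplets in which all three pairs are reciprocal best hits."""
    genes = sorted(set().union(*edges))
    return [frozenset(t) for t in combinations(genes, 3)
            if all(frozenset(p) in edges for p in combinations(t, 2))]

def merge_by_shared_side(tris):
    """Merge triangles that share a side (two genes) into larger groups."""
    groups = [[t] for t in tris]  # each group is a list of triangles
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(groups)), 2):
            if any(len(a & b) == 2 for a in groups[i] for b in groups[j]):
                groups[i] += groups.pop(j)
                merged = True
                break
    return [frozenset().union(*g) for g in groups]

# Toy data: four species A-D; a1 and d1 are not reciprocal best hits.
species_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "d1": "D"}
best_hit = {
    ("a1", "B"): "b1", ("b1", "A"): "a1",
    ("a1", "C"): "c1", ("c1", "A"): "a1",
    ("b1", "C"): "c1", ("c1", "B"): "b1",
    ("b1", "D"): "d1", ("d1", "B"): "b1",
    ("c1", "D"): "d1", ("d1", "C"): "c1",
    ("a1", "D"): "d1", ("d1", "A"): "a2",  # one-directional only
}
groups = merge_by_shared_side(triangles(bbh_edges(best_hit, species_of)))
```

In this toy case the triangles a1-b1-c1 and b1-c1-d1 share the side b1-c1 and end up in a single group, even though a1 and d1 are not reciprocal best hits of each other.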
Other extensions of the BBH approach include recent implementations such as Inparanoid (Figure 2c) or OrthoMCL, which achieve higher sensitivity through sequence-clustering techniques that consider a range of BLAST scores beyond the absolute best hits. For instance, Inparanoid predicts paralogs resulting from lineage-specific duplications, which it calls 'in-paralogs', by including intraspecific BLAST hits that are reciprocally better than between-species BLAST hits. So, to a certain level, Inparanoid is able to include one-to-many and many-to-many relationships. Its limitation is that it is designed for comparing pairs of genomes only. OrthoMCL expands the procedure to comparisons of multiple genomes. It first uses a similar strategy to Inparanoid to define orthologous relationships between each pair of genomes. The comparisons of all possible pairs of genomes are represented as a graph in which the nodes represent genes and the edges represent orthology relationships. A Markov clustering algorithm (MCL) is then applied. In brief, OrthoMCL simulates random walks on the graph of orthology predictions to determine the transition probabilities among the nodes, that is, the probabilities that two nodes are connected in a random walk. The graph is partitioned into different orthologous groups on the basis of these probabilities.
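The random-walk idea behind Markov clustering can be sketched in a few lines of plain Python. This is a simplified illustration of the general MCL scheme (alternating an 'expansion' step, which squares the transition matrix, with an 'inflation' step, which raises its entries to a power and renormalizes), not the OrthoMCL code. The function names, the default inflation value, the fixed iteration count and the edge-weight input format are assumptions chosen for the example.

```python
def normalize(m):
    """Rescale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    col = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col[j] for j in range(n)] for i in range(n)]

def mcl(weights, n, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster nodes 0..n-1 given undirected edge weights {(i, j): score}."""
    # Symmetric weight matrix with self-loops, turned into transition
    # probabilities of a random walk on the graph.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (i, j), w in weights.items():
        m[i][j] = m[j][i] = w
    m = normalize(m)
    for _ in range(iterations):
        # Expansion: two-step random walks (matrix squaring).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strengthen likely transitions, weaken unlikely ones.
        m = normalize([[x ** inflation for x in row] for row in m])
    # Each non-empty row lists the nodes attracted to one cluster.
    clusters = {frozenset(j for j in range(n) if row[j] > eps) for row in m}
    return sorted((sorted(c) for c in clusters if c), key=lambda c: c[0])

# Toy graph: two groups of three genes with no links between the groups.
edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0}
clusters = mcl(edges, 6)
```

With real data the edges would carry similarity-derived weights and groups would be linked by weaker edges; inflation tends to remove such weak cross-links, and a higher inflation value generally yields smaller groups.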
Yet another type of method, which cannot be strictly considered pairwise-based but does not specifically build phylogenetic trees to define orthology, aims to refine previously made COGs. Generally, these methods organize clusters of orthologous genes into a hierarchical structure by using some evolutionary information. For instance, COCO-CL subdivides a given orthologous group on the basis of the correlation coefficient between its sequences, as inferred from a multiple sequence alignment. In contrast, OrthoDB uses the information regarding the species to which a given sequence belongs to organize an orthologous group in a hierarchy that is guided by the species tree.
In the classical procedure for determining orthology relationships, a phylogenetic tree is constructed from an alignment of homologous sequences and subsequently compared to a species tree. This comparison allows the geneticist to infer the events of gene loss and duplication that have occurred during the evolution of the sequence family considered. The first strategy for inferring such relationships automatically was proposed by Goodman and colleagues, who developed an algorithm for fitting a given gene tree to its corresponding species tree and inferring the minimum set of duplications needed to explain the data. This problem came to be known as 'tree reconciliation' (Figure 2d), and several other algorithms have been implemented that solve it efficiently [22–24]. These tree-based algorithms for orthology detection are very intuitive, as they simply implement automatically what an expert would do manually and, provided that correct species and gene trees are given, the algorithm will infer the correct orthology relationships. A number of databases have been developed that use such algorithms to derive orthology relationships from automatically reconstructed trees [25–27].
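One common way to formulate tree reconciliation is to map every gene-tree node to the smallest species-tree clade that contains all the species found below it, and to call the node a duplication when it maps to the same clade as one of its children. The Python sketch below follows that formulation under simplifying assumptions: both trees are rooted and given as nested tuples, gene-tree leaves are hypothetical `(species, gene)` pairs, and gene losses are not counted. It is not the implementation of any of the programs cited above.

```python
def collect_clades(tree, clades):
    """Append the species set of every species-tree node; return this node's set."""
    if isinstance(tree, str):                       # leaf: a species name
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(collect_clades(c, clades) for c in tree))
    clades.append(leaves)
    return leaves

def lca(species, clades):
    """Smallest species-tree clade that contains all the given species."""
    return min((c for c in clades if species <= c), key=len)

def reconcile(gene_node, clades, duplications):
    """Map a gene-tree node onto the species tree and record duplications."""
    if isinstance(gene_node[0], str):               # leaf: (species, gene)
        return lca(frozenset([gene_node[0]]), clades)
    child_maps = [reconcile(c, clades, duplications) for c in gene_node]
    here = lca(frozenset().union(*child_maps), clades)
    if here in child_maps:                          # same clade as a child
        duplications.append(here)
    return here

# Toy example: the root of the gene tree separates copies A and B.
species_tree = (("human", "mouse"), "chicken")
gene_tree = ((("human", "HsA"), ("mouse", "MmA")),
             (("human", "HsB"), ("chicken", "GgB")))
clades, duplications = [], []
collect_clades(species_tree, clades)
reconcile(gene_tree, clades, duplications)
```

Here the root is the only node flagged as a duplication, placed before the divergence of the three species; the scenario implies that copy B was lost in mouse and copy A in chicken.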
The main limitation of the tree-reconciliation method is that for many scenarios the species tree is not known with confidence. Moreover, it has been shown that another assumption of the tree-reconciliation problem, the correctness of the gene tree, is frequently violated. In such cases, erroneous gene trees will inevitably lead to incorrect orthology and paralogy assignments and the inference of many extraneous duplications and gene losses. As a result, these methods are very sensitive to slight variations in the topology or the rooting of the gene tree and, when applied at a large scale, they perform similarly to, or even worse than, standard pairwise methods and need manual curation. Even if the gene tree is correctly reconstructed, it may not conform to the species tree in cases where horizontal gene transfer events have occurred. Such gene trees are hard to reconcile with the species tree and are often confused by apparent events of massive gene loss.
One possible solution to cope with the existing ambiguity in gene and species trees is to account for this uncertainty during the process of tree reconciliation. Some approaches consider the uncertainty of the different nodes of the gene tree as inferred from their bootstrap, or equivalent, values, and weight the gene loss and duplication events accordingly [31, 32]. Another approach that tackles the uncertainty of both the gene and the species tree was recently proposed by the group of David Liberles. This algorithm, called 'soft parsimony', modifies uncertain or poorly supported branches by minimizing the number of gene duplication and loss events implied by the tree. It starts by generating all possible rooted trees that can be derived from a given gene tree. Then the edges that have a support value under a given threshold are collapsed. Each tree is subsequently reconciled with the species tree, which can include multifurcations at unresolved nodes, and the number of duplications is computed. If more than one tree minimizes the necessary duplications, these are compared in terms of the number of gene losses implied. Finally, the collapsed nodes are reconstituted.
Soft parsimony is able to resolve the most obvious errors arising from tree reconciliation, which normally imply a multitude of gene losses and duplications. It also allows the use of species trees with unresolved nodes, which usually better represent what we really know about relationships within most phylogenetic groups. Nevertheless, these algorithms still need a certain level of resolution in the species trees and have a number of underlying assumptions that should be taken into account. For instance, the scenario with the minimal number of losses and gene duplications is not necessarily the real one, as losses and duplications can be rampant in some cases. Furthermore, the number of iterations and tree-reconciliation steps that these methods involve may limit their use in large-scale datasets.
Yet another way out of the problem of ambiguity in species and gene trees is to consider the gene tree topology in a very relaxed way and minimize the need to know the true evolutionary relationships of species. This approach is followed in recent algorithms that are based on the level of overlap between the species encountered within a tree. Basically, these algorithms examine the level of overlap in the species connected to two related nodes to decide whether their parental node represents a duplication or speciation event (Figure 2e). They assume that a node represents a duplication event if it is ancestral to two tree-partitions that contain sets of species that overlap to some degree. Conversely, if the two partitions contain sets of species that are mutually exclusive, the node is considered to represent a speciation event. The only evolutionary information that such algorithms require is that needed to root the tree so that a polarity (ancestors to descendants) between the internal nodes is defined.
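The species-overlap rule described above is simple enough to sketch directly. The following Python example assumes a rooted, binary gene tree given as nested tuples whose leaves are hypothetical `(species, gene)` pairs, and it uses the strictest form of the rule, in which any shared species marks a duplication; the function names and data layout are illustrative rather than taken from a published implementation.

```python
from itertools import product

def is_leaf(node):
    """A leaf is a (species, gene) pair of strings."""
    return isinstance(node[0], str)

def species_overlap(node, pairs):
    """Return the leaves under node and append (gene_a, gene_b, relation) to pairs."""
    if is_leaf(node):
        return [node]
    left = species_overlap(node[0], pairs)
    right = species_overlap(node[1], pairs)
    # Duplication if the two partitions share at least one species.
    shared = {sp for sp, _ in left} & {sp for sp, _ in right}
    relation = "paralogs" if shared else "orthologs"
    # Every pair of genes split by this node inherits the node's event type.
    for (_, a), (_, b) in product(left, right):
        pairs.append((a, b, relation))
    return left + right

# Toy rooted tree: a duplication at the root, then a human/mouse speciation
# in each of the two resulting subfamilies.
tree = ((("human", "HsA"), ("mouse", "MmA")),
        (("human", "HsB"), ("mouse", "MmB")))
pairs = []
species_overlap(tree, pairs)
relation = {frozenset((a, b)): r for a, b, r in pairs}
```

In this example `HsA` and `MmA` come out as orthologs, whereas `HsA` and `MmB` are paralogs despite belonging to different species, because they are separated by the duplication at the root. No species tree is consulted at any point; only the rooting of the gene tree matters.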
One such algorithm has been used in the prediction of all orthology and paralogy relationships for all human genes and their homologs in 38 other eukaryotic species. The reason for using this type of algorithm was its speed and the high degree of topological diversity observed in the human phylome, something that would have resulted in many wrong assignments if a reconciliation algorithm had been used. This orthology-prediction methodology is now implemented in all phylomes deposited at PhylomeDB. Van der Heijden and colleagues implemented a species-overlap algorithm in a program called LOFT (Levels of Orthology From Trees). Besides predicting orthology relationships between genes in a phylogenetic tree, LOFT assigns a hierarchy to the orthology relationships. Similar to the Enzyme Classification (EC) numbers, each gene of a family is given a code that indicates its level within the orthology hierarchy. In this way orthologous groups can be defined at different levels and the orthology and paralogy relationships can be readily inferred from the code.
In conclusion, the prediction of orthology, rather than just homology, relationships among genes in sequenced genomes is a necessary task that often needs to be performed in an automated way. Most automatic strategies to derive such orthology relationships still use rough approximations that are far from the original definition of orthology. Nowadays, however, the increasing speed at which computer programs can generate phylogenetic trees, together with the availability of new algorithms, makes it possible to predict orthology by mapping the speciation and duplication events on a tree, thus following the formal definition of orthology. It is likely that this strategy will soon become the most commonly used in genome-wide searches for orthology. The expected increase in the accuracy of the predicted relationships will result in a higher reliability of transfer of information across species. Recent analyses show that phylogeny-based methods are less prone to error than similarity-based approaches. The same analyses show, however, that there is still room for improvement and that future algorithms will need to take into account the inherent topological variability that is expected in any genome-wide phylogenetic analysis.
This work was partly funded by grants from the Spanish Ministries of Health (FIS06-213) and Science and Innovation (GEN2006-27784-E/PAT) to TG.