- Open Access
Constant relative rate of protein evolution and detection of functional diversification among bacterial, archaeal and eukaryotic proteins
Genome Biologyvolume 2, Article number: research0053.1 (2001)
Detection of changes in a protein's evolutionary rate may reveal cases of change in that protein's function. We developed and implemented a simple relative rates test in an attempt to assess the rate constancy of protein evolution and to detect cases of functional diversification between orthologous proteins. The test was performed on clusters of orthologous protein sequences from complete bacterial genomes (Chlamydia trachomatis, C. muridarum and Chlamydophila pneumoniae), complete archaeal genomes (Pyrococcus horikoshii, P. abyssi and P. furiosus) and partially sequenced mammalian genomes (human, mouse and rat).
Amino-acid sequence evolution rates are significantly correlated on different branches of phylogenetic trees representing the great majority of analyzed orthologous protein sets from all three domains of life. However, approximately 1% of the proteins from each group of species deviates from this pattern and instead shows variation that is consistent with an acceleration of the rate of amino-acid substitution, which may be due to functional diversification. Most of the putative functionally diversified proteins from all three species groups are predicted to function at the periphery of the cells and mediate their interaction with the environment.
Relative rates of protein evolution are remarkably constant for the three species groups analyzed here. Deviations from this rate constancy are probably due to changes in selective constraints associated with diversification between orthologs. Functional diversification between orthologs is thought to be a relatively rare event. However, the resolution afforded by the test designed specifically for genomic-scale datasets allowed us to identify numerous cases of possible functional diversification between orthologous proteins.
In an early attempt to relate theoretical concepts and empirical results bearing on the molecular basis of evolution, Kimura and Ohta listed five principles of molecular evolution [1,2]. The first of these deals with the concept of the molecular clock and predicts that, as long as a protein's function remains unaltered, the protein's rate of evolution will be approximately constant within different phylogenetic lineages. From this principle, it follows that detection of changes in a protein's rate of evolution may reveal functional changes . This rate-constancy prediction is tested here by studying the relative rates of protein evolution in a simple phylogenetic context. The test relies on the simultaneous analysis of numerous protein sequences that represent complete protein sets from sequenced genomes or partial protein sets from genomes of which a significant fraction has been sequenced. Consistent with the rate-constancy prediction, given a phylogenetic tree topology, the ratio of evolutionary distances along different branches of the tree should be approximately constant for all proteins from an analyzed set of species.
Clusters of orthologous proteins from bacterial, archaeal and eukaryotic species sets of size three were analyzed with respect to this prediction. Specifically, for each protein phylogeny, the length of the longest branch was compared to the sum of the two shorter branches (Figure 1). Such a genome-scale test allowed for a robust assessment of the rate constancy of protein evolution, as well as the detection of multiple cases of accelerated evolutionary change and a quantitative assessment of the fraction of genes that undergo such acceleration.
Results and discussion
Genomic-scale relative rates test
Three genomes of intracellular pathogenic bacteria of the family Chlamydiaceae (Chlamydophila pneumoniae, Chlamydia trachomatis and Chlamydia muridarum) and three genomes of archaeal hyperthermophiles of the genus Pyrococcus (Pyrococcus abyssi, P. horikoshii and P. furiosus) were chosen for analysis. Orthologs were identified using the rationale and procedure that underlie the construction of clusters of orthologous groups of proteins (COGs) [4,5]. This approach has been designed to maximize the likelihood of identification of orthologs when complete genomes are compared. A modified approach involving two additional steps was used to identify probable orthologs among the encoded proteins from three partially completed mammalian genomes (Homo sapiens (human), Mus musculus (mouse) and Rattus norvegicus (rat).
These procedures resulted in a total of 703 Chlamydiaceae, 1,305 Pyrococcus and 2,112 mammalian orthologous protein sets. Each orthologous protein-sequence set was aligned and the evolutionary distances between sequences within the sets were computed. These distances were used in phylogenetic reconstructions for all orthologous protein sets. For each species group, branch lengths were averaged to produce the trees shown (Figure 2). Each of the trees contains two relatively short branches of virtually identical length and one long branch, forming two main phylogenetic partitions. The rate-constancy prediction holds that the evolutionary rates along these phylogenetic branches should be correlated. In other words, the length of branch A (bA) should be correlated with the length of branches B and C (bB + bC) (Figures 1,2) resulting, in an ideal case, in a constant bA/(bB+bC) ratio.
This prediction was tested by evaluating the branch lengths estimated from alignments of orthologous protein sets of the three species groups. Consistent with the prediction, there is a highly significant positive correlation between bA and bB + bC for all three species groups (Figure 3). Thus, the vast majority of proteins in each analyzed group of species seem to evolve in accord with the rate-constancy prediction, suggesting that the functional constraints among orthologous proteins remain largely unchanged. However, in each group of species there are outliers (Figure 3) that do not conform to the expected pattern of variation. For each of the three groups of orthologous protein sets, upper and lower limits of the distance from the linear trend line that correspond to an expectation value of 0.05 were determined. All points above and below these lines (Figure 3) represent statistically significant deviants from the linear trend, given the standard error and the number of orthologous protein sets analyzed for each group of species. Points above the upper limit line show more variation between the phylogenetic partitions (that is, longer bA) than expected by chance, and points below the lower limit line show more variation within one partition (that is, longer bB and/or bC) than expected. The majority of the deviants fall above the upper limit. These points represent proteins with an accelerated rate of amino-acid substitution between phylogenetic partitions. The less frequent outlier points below the lower line represent proteins that have undergone accelerated evolution in one or both of the two more closely related species in the corresponding group. All of the orthologous protein sets that showed significant deviations from the linear trend were re-analyzed using BLAST similarity searches followed by phylogenetic analysis. This was done to verify that these sets contain orthologous proteins. For orthologous proteins, it is expected that they will form monophyletic clades without any intervening paralogs or proteins from distantly related species. For all Chlamydiaceae and mammalian data, the phylogenetic analysis revealed that these sets do in fact consist of orthologous proteins. However, the Pyrococcus analysis yielded five cases where orthology could not be ascertained. These ambiguous Pyrococcus sets were removed and the linear relationship between bA and bB +bC was re-assessed. The total number of orthologous protein sets that show evidence of accelerated evolution suggestive of functional diversification is between 0.76% and 1% of the total number of such sets analyzed for each species (Chlamydiaceae = 7/703, Pyrococcus = 11/1305, human-mouse-rat= 16/2112).
A number of cases of positive selection at the molecular level have been supported by comparison of the rates of synonymous (Ks) and non-synonymous (Ka) substitution . Ka > Ks is considered unequivocal evidence of positive selection . With this in mind, comparisons of Ks and Ka were carried out for the orthologous protein sets that showed a pattern of variation consistent with functional diversification in the whole-genome test (Figure 3). For the bacterial and archaeal species groups, the vast majority of these comparisons were inconclusive because the evolutionary distance between phylogenetic partitions resulted in a saturation of Ks. The more closely related human-mouse-rat group did not show saturation of Ks. Comparisons of Ks and Ka for functionally diversified mammalian protein sets did not reveal any cases with Ka > Ks. This is probably due to the fact that, if directional selection led to the acceleration of the amino-acid substitution rate in these proteins, it acted in an episodic manner [6,7] and on a minority of residues against a constant background of purifying selection [8,9]. This is consistent with previous work suggesting that adaptive changes are difficult to find using the strict Ka > Kscriterion and that even documented cases of positive selection can be missed using this technique alone [6,10,11,12,13]. The average Ka is, however, significantly higher for the putative functionally diversified proteins than for the total dataset (Table 1, Figure 4). The mammalian proteins with an increased rate of amino-acid substitution between the human and rodent lineages also show a level of Ks significantly higher than that of the total dataset (Table 1). This can be explained by the fact that, consistent with previous reports [10,14,15,16,17], we observe a positive correlation between Ks and Ka (r = 0.556, P < 10-10). This correlation could reflect a mechanistic bias in mutation [10,18,19] or synonymous sites may be subject to some degree of selection [20,21,22,23] (or both). However, the Ka/Ks ratio is also significantly higher for the sets of putative functionally diversified proteins (Table 1), indicating a relative acceleration of amino-acid substitution, which could be due to a relaxation of functional constraints and/or directional selection.
Functional prediction and phyletic distribution
There are some notable similarities in the (predicted) functional identities and the phyletic distributions of the putative functionally diversified proteins (Table 2). The sets from all three species groups are enriched for proteins that appear to function at the periphery of the cells and mediate their interaction with the environment. Specifically, 5 of the 7 cases in Chlamydiaceae, 7 of the 11 Pyrococcus cases and 13 of the 16 mammalian cases are (predicted) membrane or secreted proteins (Table 2). Some of the remaining proteins are completely uncharacterized and could associate with membranes by other mechanisms that do not involve recognizable structural features such as signal peptides or transmembrane segments. The mammalian set of functionally diversified proteins, for which much more experimental information is available than for the bacterial and archaeal sets, is particularly enriched for secreted proteins, including several proteins involved in immune surveillance and defense (Table 2). An example of a protein that is implicated in a different class of functions, but is still related to defense mechanisms is the predicted restriction endonuclease from Pyrococcus. Generally, the functions of proteins for which accelerated evolution was detected and functional diversification inferred are compatible with the previously reported cases of accelerated evolution which is particularly prominent, for example, in the immune system .
The phyletic distribution of the functionally diversified proteins is typically narrow, particularly in Chlamydiaceae (six of the seven proteins have no homologs outside this bacterial family). Although the selected Pyrococcus proteins showed more evolutionary conservation, the majority had orthologs only in other archaeal genomes, if not only in Pyrococcus. Similarly, 13 of the 16 selected mammalian proteins were either mammal-specific or vertebrate-specific (Table 2). The only two clear exceptions to this pattern of poorly conserved proteins primarily involved in defense and extracellular signaling were the mammalian enzymes hydroxysteroid sulfotransferase and hydroxy acid oxidase. Accelerated evolution in proteins with a narrow phyletic distribution may represent species-specific adaptations in how each organism responds to and processes unique environmental cues.
The present results show that the mode of evolution of the vast majority of the analyzed proteins from all three domains of life - bacteria, archaea and eukaryotes is compatible with the rate-constancy prediction of molecular evolution. A recent analysis of the genome-wide distribution of the evolutionary rates of prokaryotic genes suggested that the shape of this distribution does not significantly change during evolution, although the rates themselves may change . This is consistent with the present finding of a strong correlation between evolutionary rates in different phylogenetic partitions. However, with remarkable consistency, for all three species groups analyzed here, approximately 1% of orthologous genes seem to have experienced accelerated evolution, suggesting functional and potentially adaptive diversification that is conspicuous enough to be detected at a statistically significant level.
It is also a formal possibility that some of the genes identified as functionally diversified may be losing their function. As this would entail a change in function, it can be considered as a special type of functional diversification. However, two points argue against this possibility. First, the Ks/Ka values for mammalian genes are all < 1 (including the putative diversified genes) and thus do not show evidence of loss of function or of pseudogenes. Second, amino-acid sequence alignments of the diversified proteins show that these proteins still retain the identity of critical domains, such as signal peptides and transmembrane domains, as well as critical residues in the active sites of enzymes (Table 2 and data not shown).
Accelerated evolution and functional diversification between paralogs is thought to be common in evolution and is posited as a major mechanism for the generation of new genes and new biochemical functions . Functional diversification between orthologs is considered to be much less common, however. In fact, directional selection leading to functional diversification between orthologs is the least common form of natural selection observed at the molecular level . In the present study, the implementation of a simple test designed specifically to exploit the availability of numerous complete and partially complete genome sequences facilitated the detection of multiple cases of probable functional diversification between orthologs. Application of this method to the growing set of complete genomes is expected to yield similar results and may reveal numerous additional cases of functionally diversified orthologs.
Materials and methods
Three closely related Chlamydiaceae genomes and three closely related Pyrococcus genomes were chosen for analysis. These genomes were chosen on the basis of three criteria. First, there were three or more closely related complete genome sequences available; second, the genomes were closely enough related to avoid saturation of amino-acid changes and to ensure unambiguous identification of orthologs; and third, the genomes were distantly enough related to provide meaningful information on amino-acid variation. Predicted proteomes for three Chlamydiaceae genomes (Chlamydophila pneumoniae - CWL029, Chlamydia muridarum and Chlamydia trachomatis) and two Pyrococcus genomes (P. abyssi and P. horikoshii) were obtained from the National Center for Biotechnology Information (NCBI) public server . The predicted proteome for the P. furiosus genome was obtained from the sequencer's site . The method for identifying probable orthologs is based on the genome-specific best hits (BeTs) rationale that underlies the construction of the COGs database [4,5]. For each group of genomes, all-against-all BLAST  comparisons were done using the predicted protein sequences. For each query sequence, the BeT (highest score or lowest expectation value) to each of the other proteomes was recorded. Triangles of mutually consistent symmetrical BeTs were detected to form COGs. Only COGs with one representative (ortholog) from each genome (COGs of size 3) were considered for further analysis. This sequence-similarity-based approach is generally not dependent on constant rates of evolution and thus is not misled by rate variation. However, it is a formal possibility that misidentification of orthologs could result from cases of extreme rate variations that would make the orthologs less similar to each other than to some paralogs. This possibility was examined using phylogenetic analysis, as described below.
Messenger RNA sequences from R. norvegicus, Mus musculus and Homo sapiens were obtained using NCBI's batch Entrez server . The encoded amino-acid sequences were clustered at 99% similarity threshold using BLASTCLUST  to remove redundant sequences. The BeT procedure was used to identify perfectly symmetrical COGs of size 3 for this species group. Because these mammalian genomes are not complete, a formal possibility exists that the COGs procedure will isolate non-orthologous protein sets. For this reason, two additional steps were added to the ortholog identification procedure. Subsequent to the construction of human-mouse-rat COGs and phylogenetic trees, the 'triplet test'  was used to further ensure that the COGs of size 3 contained only orthologs. The triplet test verifies that the mouse and rat members of the COGs are approximately equidistant from the human member. In addition to the triplet test, the distributions of Ks between human and rodents and Ks between mouse and rat was used to eliminate potentially non-orthologous sequence sets. In both cases, the distribution of Ks was approximately normal, with the exception of long and sparsely populated tails. Sequence sets that fell in the tails of the Ks distribution (Ks > 0.35 for mouse-rat and Ks > 1.0 for human-rodent) were removed before analysis to eliminate potentially non-orthologous protein sets. This manual selection process does not appear to bias the results in any way (for example, by eliminating a certain functional class of genes). Omitting this step resulted in the inclusion of a number of distantly related and poorly aligned sequence sets that may not actually include orthologs. Thus, the relative rates test as implemented for the mammalian dataset is a conservative one. Furthermore, the most rapidly evolving genes will not be included in the analysis as many of them will not have detectable orthologs even in closely related genomes.
Orthologous protein sequence sets were aligned using CLUSTALW  with default options. Alignments that resulted in any branch lengths (based on the corrected gamma distance, see below) longer than 2 in the case of the Chlamydiaceae and Pyrococcus species groups or longer than 1 in the case of the human-mouse-rat species group were visually inspected and adjusted manually. Manual adjustment included removal of poorly aligning and/or low-sequence-complexity regions identified by BLAST. These adjustments did not result in any qualitative differences in the results of subsequent analyses.
For each orthologous protein set of size 3, evolutionary distances between the sequences were calculated using 3BRANCH program (Y.I.W., unpublished data, available from the author on request) that uses the distance correction based on the gamma distribution of site rate variation [32,33] with an α parameter of 1.0. Phylogenetic branch lengths were calculated from distances by solving the following set of equations (Figure 1):
distance from species A to species B (dAB) = branch length A (bA) + bB,
dAC = bA + bC,
dBC = bB + bC.
For example, solving for bA yields bA = (dAB + dAC - dBC)/2.
For each orthologous gene set, pairwise estimates of the synonymous (Ks) and non-synonymous (Ka) substitution rates were obtained from nucleotide-coding sequence alignments using the Pamilo-Bianchi-Li method [34,35] implemented in the PBLtest program (I.B.R., unpublished data, available from the author on request). Pairwise estimates were averaged to obtain the data presented in Table 1.
Domain architectures of the functionally diversified proteins were assessed by using the SMART server [36,37] and the NCBI's CD-Search server [28,38]. Domain architectures together with the results of PSI-BLAST  searches that were run for three iterations, Entrez protein annotation  and protein annotations in the COG database [5,40], were used to assign (probable) functions to the selected proteins. The phyletic distribution of these proteins was determined from the PSI-BLAST search results using the TAX_COLLECTOR program of the SEALS package . Additional case-by-case inspection of these results was conducted and neighbor-joining phylogenetic trees [42,43] were constructed to verify the orthologous relationships between the selected proteins. This was done by ensuring that putative orthologous proteins did not show a greater similarity to any paralogs or proteins from more distant species than they did to the proteins included in the analyzed COGs. As a result of this inspection, five Pyrococcus COGs were removed from the original set of selected proteins.
For each species set, the standard error of the predicted y-value for each x-value in the regression (Figure 3) was determined. The number of standard errors away from the linear trend line that corresponds to an expectation value of 0.05 was calculated using the inverse of the standard normal cumulative distribution for a probability equal to 1 - (0.05/number of values in plot/2).
Kimura M, Ohta T: On some principles governing molecular evolution. Proc Natl Acad Sci USA. 1974, 71: 2848-2852.
Kimura M: The Neutral Theory of Molecular Evolution, New York: Cambridge University Press;. 1983
Hughes AL: Adaptive Evolution of Genes and Genomes, Oxford: Oxford University Press;. 1999
Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science. 1997, 278: 631-637. 10.1126/science.278.5338.631.
Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000, 28: 33-36. 10.1093/nar/28.1.33.
Messier W, Stewart CB: Episodic adaptive evolution of primate lysozymes. Nature. 1997, 385: 151-154. 10.1038/385151a0.
Zhang J, Rosenberg HF, Nei M: Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc Natl Acad Sci USA. 1998, 95: 3708-3713. 10.1073/pnas.95.7.3708.
Goodman M, Moore GW, Matsuda G: Darwinian evolution in the genealogy of haemoglobin. Nature. 1975, 253: 603-608.
Gonzalez DS, Jordan IK: The alpha-mannosidases: phylogeny and adaptive diversification. Mol Biol Evol. 2000, 17: 292-300.
Wolfe KH, Sharp PM: Mammalian gene evolution: nucleotide sequence divergence between mouse and rat. J Mol Evol. 1993, 37: 441-456.
Endo T, Ikeo K, Gojobori T: Large-scale search for genes on which positive selection may operate. Mol Biol Evol. 1996, 13: 685-690.
Sharp PM: In search of molecular Darwinism. Nature. 1997, 385: 111-112. 10.1038/385111a0.
Yang Z, Bielawski JP: Statistical methods for detecting molecular adaptation. Trends Ecol Evol. 2000, 15: 496-503. 10.1016/S0169-5347(00)01994-7.
Graur D: Amino acid composition and the evolutionary rates of protein-coding genes. J Mol Evol. 1985, 22: 53-62.
Mouchiroud D, Gautier C, Bernardi G: Frequencies of synonymous substitutions in mammals are gene-specific and correlated with frequencies of nonsynonymous substitutions. J Mol Evol. 1995, 40: 107-113.
Ohta T, Ina Y: Variation in synonymous substitution rates among mammalian genes and the correlation between synonymous and nonsynonymous divergences. J Mol Evol. 1995, 41: 717-720.
Makalowski W, Boguski MS: Evolutionary parameters of the transcribed mammalian genome: an analysis of 2,820 orthologous rodent and human sequences. Proc Natl Acad Sci USA. 1998, 95: 9407-9412. 10.1073/pnas.95.16.9407.
Averof M, Rokas A, Wolfe KH, Sharp PM: Evidence for a high frequency of simultaneous double-nucleotide substitutions. Science. 2000, 287: 1283-1286. 10.1126/science.287.5456.1283.
Williams EJ, Hurst LD: The proteins of linked genes evolve at similar rates. Nature. 2000, 407: 900-903. 10.1038/35038066.
Miyata T, Hayashida H: Extraordinarily high evolutionary rate of pseudogenes: evidence for the presence of selective pressure against changes between synonymous codons. Proc Natl Acad Sci USA. 1981, 78: 5739-5743.
Lipman DJ, Wilbur WJ: Interaction of silent and replacement changes in eukaryotic coding sequences. J Mol Evol. 1984, 21: 161-167.
Ticher A, Graur D: Nucleic acid composition, codon usage, and the rate of synonymous substitution in protein-coding genes. J Mol Evol. 1989, 28: 286-298.
Akashi H: Synonymous codon usage in Drosophila melanogaster : natural selection and translational accuracy. Genetics. 1994, 136: 927-935.
Grishin NV, Wolf YI, Koonin EV: From complete genomes to measures of substitution rate variability within and between proteins. Genome Res. 2000, 10: 991-1000. 10.1101/gr.10.7.991.
Li WH: Molecular Evolution, Sunderland, MA: Sinauer Associates;. 1997
National Center for Biotechnology Information Bacterial Genomes. [ftp://ncbi.nlm.nih.gov/genomes/Bacteria/]
Utah Genome Center: P. furiosus ORF sequences. [http://www.genome.utah.edu/pfu599.pep]
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
Batch Entrez. [http://www.ncbi.nlm.nih.gov:80/Entrez/batch.html]
NCBI BLAST download. [ftp://ncbi.nlm.nih.gov/blast/executables/]
Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680.
Ota T, Nei M: Estimation of the number of amino acid substitutions per site when the substitution rate varies among sites. J Mol Evol. 1994, 38: 469-482.
Grishin NV: Estimation of the number of amino-acid substitutions per site when the substitution rate varies among sites. J Mol Evol. 1995, 41: 675-679.
Li WH: Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol. 1993, 36: 96-99.
Pamilo P, Bianchi NO: Evolution of the Zfx and Zfy genes: rates and interdependence between the genes. Mol Biol Evol. 1993, 10: 271-281.
Schultz J, Copley RR, Doerks T, Ponting CP, Bork P: SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res. 2000, 28: 231-234. 10.1093/nar/28.1.231.
Entrez protein. [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein]
COGs: Phylogenetic classification of proteins encoded in complete genomes database. [http://www.ncbi.nlm.nih.gov/COG/]
Walker DR, Koonin EV: SEALS: a system for easy analysis of lots of sequences. Proc Int Conf Intell Syst Mol Biol. 1997, 5: 333-339.
Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4: 406-425.
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 1997, 25: 4876-4882. 10.1093/nar/25.24.4876.
We thank Alexey Kondrashov for many helpful discussions and constructive criticisms, Charles DeLisi, David Landsman, Detlef Leipe, Wojciech Makalowski and Itai Yanai for critical reading of the manuscript and constructive comments and L. Aravind for advice on protein function prediction. The release of the unpublished P. furiosus genome sequence by the Utah Genome Center at the University of Utah is acknowledged and appreciated.