Rosetta Stone proteins: "chance and necessity"?
Genome Biology volume 3, Article number: interactions1001.1 (2002)
A response to Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions by AJ Enright and CA Ouzounis. Genome Biology 2001, 2:research0034.1-0034.7
The field of predicting protein-protein interactions has been active for over two decades, but in silico methods have become prominent in the last three years with the expansion of analyses at a genomic scale [1,2]. The rationale of one of the computational methods is as simple as it is elegant: two polypeptides A and B in one organism are likely to interact if their homologs are expressed as a single polypeptide AB in another. The latter polypeptide (AB) is called a Rosetta Stone protein, as it contains information about both A and B. Marcotte et al. [1] have proposed that fusion to form a single polypeptide reduces the entropy of dissociation of A and B. The result is a huge increase in the local concentration of A with respect to B. A recent paper in Genome Biology describes the effort of Enright and Ouzounis [3], who carried out a vast analysis of genes of the Rosetta Stone type over 24 genomes, including eukaryotic genomes. They uncovered many new 'composite' or 'Rosetta' proteins, many of them contributed by eukaryotes. Here, I provide simple arguments to suggest that including eukaryotic sequences in the analyses may increase the robustness of predictions made using the Rosetta Stone approach.
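The rationale described above can be illustrated with a toy sketch. It assumes that sequence-similarity hits are already available as (query, subject, start, end) records; the `Hit` type, the `rosetta_candidates` function, the `max_overlap` threshold and the example data are all hypothetical names and values chosen for illustration, and this is not the procedure used in any of the cited studies. A real analysis would also need to check, for example, that A and B are not homologous to each other.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Hit(NamedTuple):
    """One similarity hit (hypothetical record layout)."""
    query: str    # protein from the query genome
    subject: str  # protein from the reference genome
    start: int    # first matched residue on the subject (1-based, inclusive)
    end: int      # last matched residue on the subject (inclusive)


def overlap(a: Hit, b: Hit) -> int:
    """Number of subject residues covered by both hits."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def rosetta_candidates(hits, max_overlap=10):
    """Return (composite, A, B) triples where two different query proteins
    match essentially non-overlapping regions of the same subject protein."""
    by_subject = defaultdict(list)
    for h in hits:
        by_subject[h.subject].append(h)

    found = set()
    for subject, group in by_subject.items():
        for a, b in combinations(group, 2):
            if a.query != b.query and overlap(a, b) <= max_overlap:
                found.add((subject, *sorted((a.query, b.query))))
    return sorted(found)


# Made-up example data: AB looks like a fusion of A and B, whereas C and D
# hit the same region of CD and so are not reported as a candidate pair.
hits = [
    Hit("A", "AB", 1, 250),
    Hit("B", "AB", 260, 500),
    Hit("C", "CD", 1, 300),
    Hit("D", "CD", 150, 400),
]
print(rosetta_candidates(hits))
```

The overlap threshold is the main design choice here: it separates two proteins that align side by side on a composite protein from two proteins that merely share one domain with it.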
In prokaryotes, transcription and translation are coupled, and functionally related genes are clustered. Dandekar et al. [4] compared nine prokaryotic genomes and noted that the architecture of operons (clusters of co-transcribed, functionally related genes) is poorly conserved; what is an operon in one organism may, in another, be a regulon (several co-regulated operons or sub-operons). They observed that a number of gene pairs were nevertheless highly conserved, and took this as evidence for direct physical interactions between the corresponding gene products, rather than as a reflection of co-regulation resulting from functional coupling (see also [5]). Given that a biochemical function in many cases depends on the action of a multimeric complex, a correlation between co-regulated and interacting proteins is to be expected (corresponding to a proportion of the positive hits in the approach of Dandekar et al.).
The rationale of the Rosetta-type search seems to be far more robust, but the existence of a Rosetta protein is not always proof of protein-protein interactions. The operon can be considered as a selfish cassette of DNA that can confer a selective advantage under certain conditions. The operon is therefore a gene cluster that has been assembled by deleting 'uninteresting' intervening sequences and can be spread by horizontal gene transfer to many recipient genomes [6]. From a minimalist perspective, it is conceivable that deletion of an intervening sequence between two adjacent genes may lead to an in-frame fusion of the open reading frames (ORFs). If folding of the fused proteins is not altered, this is a way to co-regulate gene expression as efficiently as an operon does for separate ORFs. Thus, fusion events may reflect an alternative strategy of co-regulation and not direct physical interactions. This might explain, at least in part, one surprising result of Enright and Ouzounis [3]: when they tried to validate their predictions using the results of a yeast two-hybrid experiment on a genomic scale, only one case was validated. This may also reflect, as the authors note, the extremely high number of false-positive hits of the two-hybrid method. Consider also that Mycoplasma genitalium (with a 580 kb genome containing 479 ORFs), which has a genome smaller than that of Mycoplasma pneumoniae (816 kb, 677 ORFs), nevertheless contains 15 Rosetta proteins whose M. pneumoniae homologs are encoded by split genes. The reverse comparison shows that M. pneumoniae has only four Rosetta proteins when the reference genome is that of M. genitalium. Although this does not preclude the possibility of physical interactions between the putative partners, it can be used as a circumstantial argument to suggest that reductive evolution may push towards gene fusion for the sake of economy.
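The Mycoplasma comparison above can be put on a per-ORF basis with a few lines of arithmetic. The figures are those quoted in the text; normalizing by ORF count, and the names `genomes` and `rosetta_fraction`, are illustrative assumptions rather than part of the original analysis.

```python
# Genome size (kb), ORF count and number of Rosetta proteins, as quoted in the text.
genomes = {
    "M. genitalium": {"size_kb": 580, "orfs": 479, "rosetta": 15},
    "M. pneumoniae": {"size_kb": 816, "orfs": 677, "rosetta": 4},
}


def rosetta_fraction(genome):
    """Fraction of ORFs that are Rosetta proteins relative to the other genome."""
    return genome["rosetta"] / genome["orfs"]


for name, g in genomes.items():
    print(f"{name}: {100 * rosetta_fraction(g):.2f}% of ORFs, "
          f"{g['orfs'] / g['size_kb']:.2f} ORFs per kb")

# How much more frequent fused genes are in the smaller genome, per ORF.
enrichment = (rosetta_fraction(genomes["M. genitalium"])
              / rosetta_fraction(genomes["M. pneumoniae"]))
print(f"per-ORF enrichment in M. genitalium: {enrichment:.1f}-fold")
```

On these numbers the two genomes have nearly the same ORF density, so the roughly five-fold difference in Rosetta proteins per ORF is not simply a by-product of gene packing.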
Although in prokaryotes fusion events may in some instances reflect co-regulatory strategies, the introduction of eukaryotic genomes into searches for Rosetta proteins may help improve the robustness of predictions, especially when the hits involve organisms from different kingdoms of life (eukaryotes versus bacteria and archaea). Instead of using 'energetic' arguments (changes in entropy, ΔS, or in the Gibbs free energy, ΔG), we can use a simple mass-action approach to justify this. Consider, for example, a dimerization reaction characterized by the equilibrium constant $K$, such that

$$K = \frac{AB_{eq}}{A_{eq}\,B_{eq}}$$

where $A_{eq}$, $B_{eq}$ and $AB_{eq}$ are the concentrations at equilibrium. If $A_o$ and $B_o$ are the initial concentrations of A and B, so that $A_{eq} = A_o - AB_{eq}$ and $B_{eq} = B_o - AB_{eq}$, the constant can be expressed as

$$K = \frac{AB_{eq}}{(A_o - AB_{eq})(B_o - AB_{eq})}$$

This formula is an ordinary quadratic equation in $AB_{eq}$, and we can find its value analytically. Imagine now that we have a cell that swells (expanding from the volume of a prokaryotic cell to that of a eukaryote) without changing the input amount of each subunit. A modest increase of the volume, say 1,000 times, translates into a dramatic decrease in the amounts of AB. For instance, for $K = 10^{8}\ \mathrm{M}^{-1}$ and $A_o = B_o = 1$ μM, a 1,000-fold dilution leads to over 10,000 times less AB at equilibrium (see Figure 1a). Even worse, if the starting concentration of A and B is 1 nM, the same dilution leads to over 800,000 times less AB at the end. Increasing $K$ may help to overcome this problem. If now $K = 10^{12}\ \mathrm{M}^{-1}$ for $A_o = B_o = 1$ μM, we will have only 1,000 times less AB (which is a proportional change with respect to the dilution factor). However, if $A_o = B_o = 1$ nM, 1,000 times less AB is obtained only for $K > 10^{15}\ \mathrm{M}^{-1}$.
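The dilution figures quoted above can be checked numerically with a minimal sketch. It assumes simple 1:1 dimerization at equilibrium, with the mass-action expression rearranged to K·x² − (K(Ao + Bo) + 1)·x + K·Ao·Bo = 0, where x is the equilibrium concentration of AB. The function names `dimer_eq` and `fold_loss` are hypothetical helpers introduced here, not taken from the article.

```python
import math


def dimer_eq(K, A0, B0):
    """Equilibrium concentration of AB (M) for A + B <-> AB.

    K is the association constant (M^-1); A0 and B0 are the input
    concentrations (M). Solves K*x**2 - (K*(A0 + B0) + 1)*x + K*A0*B0 = 0
    for the physically meaningful (smaller) root.
    """
    b = K * (A0 + B0) + 1.0
    disc = b * b - 4.0 * K * K * A0 * B0
    # Smaller root written as 2c / (b + sqrt(disc)) to avoid the loss of
    # precision that (b - sqrt(disc)) / (2a) suffers when K*A0 is small.
    return 2.0 * K * A0 * B0 / (b + math.sqrt(disc))


def fold_loss(K, c0, dilution=1000.0):
    """Factor by which the AB concentration drops when A0 = B0 = c0 is diluted."""
    before = dimer_eq(K, c0, c0)
    after = dimer_eq(K, c0 / dilution, c0 / dilution)
    return before / after


# (K in M^-1, starting concentration in M) for the cases discussed in the text.
cases = [(1e8, 1e-6), (1e8, 1e-9), (1e12, 1e-6), (1e15, 1e-9)]
for K, c0 in cases:
    print(f"K = {K:.0e} M^-1, Ao = Bo = {c0:.0e} M: "
          f"{fold_loss(K, c0):,.0f}-fold less AB after 1,000-fold dilution")
```

Under these assumptions the first two cases come out at roughly 10,800-fold and 839,000-fold, in line with the 'over 10,000' and 'over 800,000' quoted in the text, and the last two at roughly 1,030-fold, close to the dilution factor itself.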
It is clear that increasing the affinity (K) between the partners enhances dimer formation even at low input doses of monomers (Figure 1b). The increase in K required can be enormous, however, and even in the case of an irreversible (ultra-tight) dimerization, it will always be limited by the translational diffusion of the partners. If the cell requires a certain amount of a product at a certain moment, this may be unattainable, from a kinetic point of view, with low monomer concentrations. This is especially critical in eukaryotes, where co-regulated genes are not physically linked and where transcription and translation take place in different compartments. If similar levels of AB activity were needed by both types of cells (that is, a small or prokaryotic cell and a large or eukaryotic cell), three main non-exclusive alternatives are possible: first, a proportional increase in the input molar concentrations in the large cell; second, the introduction of compartments in the large cell; and third, fusion of the interacting partners. The first strategy is not parsimonious because, in our hypothetical example, a 1,000-fold increase in the initial concentrations of all interacting partners must be guaranteed (involving a 1,000-fold increase of 'biologically useless' free monomers, that is, $A_o - AB_{eq}$ and $B_o - AB_{eq}$). This might also imply drastically slowing down the turnover of the proteins so as to allow their accumulation. The second strategy is more advantageous. Note, however, that in most cases independent polypeptides must diffuse across the cytosol (site of synthesis) to reach their compartments. Since the compartments and/or organelles are big targets and able to sequester the proteins, they may relieve the diffusion problem, whose consequence is only kinetic.
If the cell needs high concentrations of the AB complex in a short period of time, however, increasing the input concentration of monomers will be required (in spite of the 'relief' provided by the organelles); if time is not a problem, by contrast, the cell can 'wait' until the organellar concentrations of monomers and/or complexes are the correct ones. The strategy of gene/protein fusion is the most parsimonious and kinetically advantageous. It dramatically helps to diminish the amount of transcribed and translated products required to attain the desired levels of functional activity. This is clearly beneficial in terms of time and energy consumption, and can also be applied to proteins sorted to specialized compartments (no diffusion of monomers is required to enable them to meet inside the compartments or within membranes).
In the case of enzymes, we can also evaluate the advantages of gene or protein fusion from the perspective of the chemical reactions they catalyze. In prokaryotes, after translation, whether the gene products interact physically or not, they are all produced in relative proximity. In eukaryotes, protein fusion to yield polyproteins able to catalyze successive steps of a metabolic pathway provides a great advantage compared to producing independent polypeptides. Note that the partition of the cellular volume into organelles also enhances metabolic efficiency and that, perhaps, the existence of organelles is linked to improving metabolic processes rather than to relieving the protein diffusion problem evoked above. This is reminiscent of the notion of metabolic channeling, used to describe the restricted flow of substrates and products in multienzyme systems (substrates and/or products are passed from one active center to another). It has been argued that free diffusion is sufficiently rapid to obviate the need for channeling [7]. Again, this is easily applicable to prokaryotes. In eukaryotes, however, large cytosolic volumes may result in a greater need for channeling.
It is safe to assume that most Rosetta proteins conserved across eukaryotes and prokaryotes are responsible for the 'core' metabolism (intermediary and basic information transfer, as defined in [8]). Consistent with this, the comparison of Drosophila with other organisms [9] shows that, in almost all cases where functional annotation is known, the Rosetta components are involved in the core metabolism. Exploitation of these results might aid understanding of how simple organisms work. Even this would be a big achievement, which would in turn help us to understand more complex eukaryotic systems. From the perspective outlined here, eukaryotic Rosetta proteins are likely to reflect protein-protein interactions better (producing fewer false positives) than those found outside Eukarya (many of which are also relevant). On the other hand, Rosetta proteins specific to eukaryotes might reflect the modular nature and the combinatorial design of many eukaryotic components (complex transcription factors, signal transduction molecules and molecular adaptors). It is conceivable that nature could have evolved a huge combinatorial panoply of enzymes with a limited set of interacting generic domains (cofactor-binding and catalytic domains), but in fact selection has favored a 'copy-and-paste' strategy, allowing the multiplication of domains that appear today as fusion products [10]. The results of this copy-and-paste process can be grouped into two main classes: a primordial scenario leading from several domains constituting several peptides to several domains combined into one polypeptide, and a sophisticated scenario leading from several genes encoding several polypeptides to one gene for one polyprotein. The evolutionary path followed from one state to the other is, however, largely unknown.
1. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D: Detecting protein function and protein-protein interactions from genome sequences. Science. 1999, 285: 751-753. 10.1126/science.285.5428.751.
2. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction maps for complete genomes based on gene fusion events. Nature. 1999, 402: 86-90. 10.1038/47056.
3. Enright AJ, Ouzounis CA: Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions. Genome Biol. 2001, 2: research0034.1-0034.7. 10.1186/gb-2001-2-9-research0034.
4. Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci. 1998, 23: 324-328. 10.1016/S0968-0004(98)01274-2.
5. Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA. 1999, 96: 2896-2901. 10.1073/pnas.96.6.2896.
6. Lawrence J: Selfish operons: the evolutionary impact of gene clustering in prokaryotes and eukaryotes. Curr Opin Genet Dev. 1999, 9: 642-648. 10.1016/S0959-437X(99)00025-8.
7. Welch GR, Easterby JS: Metabolic channeling versus free diffusion: transition-time analysis. Trends Biochem Sci. 1994, 19: 193-197. 10.1016/0968-0004(94)90019-1.
8. Chervitz SA, Aravind L, Sherlock G, Ball CA, Koonin EV, Dwight SS, Harris MA, Dolinski K, Mohr S, Smith T, et al: Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science. 1998, 282: 2022-2028. 10.1126/science.282.5396.2022.
9. All Fuse. [http://maine.ebi.ac.uk:8000/services/allfuse/]
10. Teichmann SA, Rison SC, Thornton JM, Riley M, Gough J, Chothia C: The evolution and structural anatomy of the small molecule metabolic pathways in Escherichia coli. J Mol Biol. 2001, 311: 693-708. 10.1006/jmbi.2001.4912.
I thank Sandrine Caburet for helpful discussions.