Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network
© Brun et al. 2003
Received: 25 June 2003
Accepted: 14 November 2003
Published: 15 December 2003
We here describe PRODISTIN, a new computational method allowing the functional clustering of proteins on the basis of protein-protein interaction data. This method, assessed biologically and statistically, enabled us to classify 11% of the Saccharomyces cerevisiae proteome into several groups, the majority of which contained proteins involved in the same biological process(es), and to predict a cellular function for many otherwise uncharacterized proteins.
Complete genome sequencing makes available a large number of coding protein sequences for which we have little or no functional information. In fact, the function of 30-35% of encoded proteins per completely sequenced genome remains unknown. To decipher the functions of these proteins and, more broadly, to propose functional relationships among proteins, new computational methods relying upon genome organization have been developed. The Rosetta Stone method proposes that two proteins in a given proteome are functionally linked when they exist as a single fused polypeptide in another proteome [2, 3]. The chromosomal proximity method suggests that genes repeatedly found as neighbors on chromosomes in different organisms may encode functionally related proteins [4–6]. Finally, the phylogenetic co-inheritance of proteins in several different proteomes may indicate their functional link. Although these methods and combinations thereof successfully predict the function of certain proteins, they suffer from several limitations: they are more informative when applied to completely sequenced genomes; they are generally more appropriate for prokaryotic genome organization; and the principles underlying some of them are only valid for a small number of proteins.
Molecular interactions are essential actors for all biological processes. Large-scale studies of protein-protein interactions have been carried out in several organisms to establish interaction maps and to decipher protein function [9–16]. These large intricate networks now need to be analyzed in detail to extract information related to protein function and to relationships linking cellular processes. Various methods of biological network analysis have been proposed so far. They may, for instance, allow identification of functional modules after network clustering, or the assignment of function to proteins of unknown function on the basis of the functional annotation of their neighbors. Another way to analyze the interaction network is to compare proteins functionally at the cellular level. This approach would represent a useful complement to sequence-comparison methods, which address function at the molecular level. With this in mind, we propose a new bioinformatics method allowing a functional classification of the proteins according to the identity of their interacting partners.
The method, named PRODISTIN for protein distance based on interactions, was applied to the yeast interactome and statistically evaluated for robustness using several independent criteria. The analysis of the results demonstrated that proteins are grouped according to their cellular rather than molecular function; that proteins involved in the same molecular complex(es), pathway(s) or cellular process(es) are clustered; and that a sound prediction of cellular function for the uncharacterized proteins is possible. The biological relevance of the obtained predictions is discussed with respect to recent experimental results.
Principle of the PRODISTIN method and classification of the yeast proteome
PRODISTIN clustering depends neither on sequence similarity nor on biochemical function
To understand the biological foundation of PRODISTIN clustering, we examined different possibilities that could explain protein segregation in the tree. First, we tested whether sequence similarity correlates with our clustering results, given the abundance of proteins involved in related functions that exhibit similarity in their sequences. Pairwise alignments between the sequences of the 602 yeast proteins classified by PRODISTIN were computed using a global and a local alignment algorithm. Given that the obtained distances (expressed as the percentage of similarity for global and the score for local alignments, respectively) do not fit with tree distances, the tree model is not appropriate to represent these huge alignments. We thus directly compared the distance values obtained with PRODISTIN, the global and the local alignments (as described above), by identifying for each distance matrix the nonredundant pairs of proteins (x, y) for which y is the closest neighbour of x or vice versa.
Among the 611 closest pairs of proteins identified with PRODISTIN, the 546 obtained with the global and the 527 obtained with the local alignment, 112 are shared between both alignments (21.2%), 32 between PRODISTIN and the global alignment (5.8%) and 38 between PRODISTIN and the local alignment (7.2%). This result strongly suggests that sequence alignments do not cluster the same proteins that PRODISTIN does, leading to the conclusion that PRODISTIN clustering is only moderately dependent on sequence similarity.
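The closest-pair comparison described above can be illustrated with a minimal Python sketch. It assumes that each method yields a matrix in which smaller values mean closer proteins (a similarity score would first have to be converted), and that tied closest neighbours are all kept; both points are assumptions of this sketch rather than details stated in the text. The function name `closest_pairs` and the toy matrices are hypothetical.

```python
def closest_pairs(dist):
    """Return the nonredundant pairs {x, y} in which y is a closest neighbour of x.

    `dist` maps each item to a dict of distances to every other item.
    Smaller values mean closer items; ties are all kept (assumption).
    """
    pairs = set()
    for x, row in dist.items():
        others = {y: d for y, d in row.items() if y != x}
        best = min(others.values())
        for y, d in others.items():
            if d == best:
                # frozenset makes (x, y) and (y, x) the same pair
                pairs.add(frozenset((x, y)))
    return pairs


# Two toy distance matrices over the same four hypothetical proteins
dist_a = {
    "p": {"q": 0.1, "r": 0.9, "s": 0.8},
    "q": {"p": 0.1, "r": 0.7, "s": 0.9},
    "r": {"p": 0.9, "q": 0.7, "s": 0.2},
    "s": {"p": 0.8, "q": 0.9, "r": 0.2},
}
dist_b = {
    "p": {"q": 0.5, "r": 0.1, "s": 0.9},
    "q": {"p": 0.5, "r": 0.6, "s": 0.3},
    "r": {"p": 0.1, "q": 0.6, "s": 0.2},
    "s": {"p": 0.9, "q": 0.3, "r": 0.2},
}

pairs_a = closest_pairs(dist_a)
pairs_b = closest_pairs(dist_b)
shared = pairs_a & pairs_b  # closest pairs found by both methods
```

Counting the shared pairs and dividing by the size of one of the two sets gives overlap percentages of the kind reported above.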
As sequence similarity is not a key determinant of PRODISTIN clustering, we then investigated the capacity of PRODISTIN to cluster proteins with identical or related functions. To do so, we separately analyzed PRODISTIN classes using two types of protein functional annotations described in the Yeast Proteome Database (YPD): the 'functional category' corresponding to the biochemical function(s) and the 'cellular role' describing the cellular function(s) (see [19, 24] for discussions about the notion of function). Both types of function are known for 420 proteins in the tree. For comparison, PRODISTIN classes were separately constructed as defined above according to either the cellular or the biochemical function of proteins, using the 420/602 proteins annotated for both types of function (Figure 2a). Among the total of 369 proteins belonging to PRODISTIN classes, 212 (57%) are clustered according to both types of function, and 157 (43%) according to only one type of function. Strikingly, 69% of the latter (108/157) are clustered according to the cellular function whereas the remaining 31% (49/157) are grouped according to the biochemical function. Therefore, the PRODISTIN method clusters proteins more efficiently by their cellular function than by their biochemical function. This result is further validated by the following observations. First, when the subcellular localization of the classified proteins is investigated, proteins belonging to the same subcellular compartment are found clustered in the tree, as would be expected from clustering based on cellular function (data not shown). Second, when the biochemical function of proteins is considered, proteins with functions such as 'protein kinase' or 'hydrolase' are found broadly scattered in the tree.
Given that proteins with such biochemical functions are likely to be involved in a large number of different cellular processes, their scattering throughout the tree is to be expected from clustering on the basis of the cellular function. Third, sequence-similarity classification of proteins differs from PRODISTIN protein clustering, as described above. Consequently, from now on, we will only consider PRODISTIN classes based on the cellular function of proteins.
Classification of the S. cerevisiae proteome: integrated analysis of cellular processes and their cross-talk
Using the 509 yeast proteins of the tree annotated in YPD for 'cellular role', 64 different PRODISTIN classes were constructed, containing 3 to 36 members each. They contain two-thirds (408/602) of the tree proteins and cover 29 different 'cellular roles' out of 44 possible (Figure 2b; see also Additional data file 1). Whereas some 'cellular roles' are associated with only one class in the tree (such as 'meiosis', which is class 27 (Figure 2b, see also Additional data file 1)), several classes have the same cellular role. This generally corresponds to different aspects of a given cellular process: for instance, the six classes accounting for 'vesicular transport' (Figure 2b) are specifically devoted to autophagy (class 45), structural proteins related to actin (class 55), endoplasmic reticulum to Golgi transport (classes 56, 57), endocytosis (class 58) and exocytosis (class 59), respectively (see Additional data file 1).
Table 1. Cross-talk between cellular processes after PRODISTIN classification

Superimposed cellular processes (PRODISTIN classes composed of doubly annotated proteins)
Cell stress/other metabolism
Cell structure/protein folding
Lipid fatty acid metabolism/protein translocation
PolII transcription/protein degradation
RNA processing and modification/RNA splicing

Partially overlapping cellular processes (PRODISTIN classes composed of at least three proteins annotated for a cellular role, three proteins annotated for another one, with some doubly annotated)
Cell polarity/cell structure
Cell polarity/mating response
Cell structure/protein complex assembly
Chromosome and chromatin structure/mitosis
Mating response/differentiation
Protein degradation/vesicular transport

Nested cellular processes: nested PRODISTIN classes
Aging ⊂ Signal transduction: 0 ⊂ 54
Cell cycle control ⊂ Amino acid metabolism: 3 ⊂ 1
Cytokinesis ⊂ Cell polarity: 20 ⊂ 8, 21 ⊂ 8
Mating response ⊂ Cell polarity: 25 ⊂ 8, 26 ⊂ 8
Cell polarity/Mating response ⊂ Signal transduction: 9 ⊂ 54
Cell stress ⊂ Protein degradation/Vesicular transport: 11 ⊂ 45
Cell stress ⊂ Signal transduction: 12 ⊂ 54
Cell structure/Protein complex assembly ⊂ Mitosis: 13 ⊂ 28
Chromatin/Chromosome structure ⊂ PolII transcription: 16 ⊂ 35
Mating response/Differentiation ⊂ Signal transduction: 24 ⊂ 54
PolIII transcription ⊂ PolII transcription: 42 ⊂ 39
RNA processing and modification ⊂ Nucleus-cytoplasm transport: 51 ⊂ 31
RNA splicing ⊂ RNA processing/modification: 53 ⊂ 52
Vesicular transport ⊂ Cell polarity/cell structure: 55 ⊂ 7
Vesicular transport ⊂ Cell polarity: 59 ⊂ 8
Unknown ⊂ Cell structure/protein folding: 60 ⊂ 14
Unknown ⊂ Vesicular transport: 62 ⊂ 56
Finally, a third case is encountered, in which small classes are nested within larger classes (Table 1), representing another example of cross-talk between cellular processes. The example given is for class 1 'amino acid metabolism' (Figure 3c; see also Additional data file 1). The metabolism of amino acids is related to cell-cycle control (class 3, Figure 3c) through the ubiquitin-dependent proteolysis pathway mediated by the ubiquitin protein ligase complex SCF (Skp1-Cdc53-F-box protein). This complex contains two core proteins - Skp1 and Cdc53 - and an F-box motif-containing protein required for the specific targeting of certain proteins to the degradation pathway. Consequently, a 'cell cycle control' class containing Skp1, Cdc53 and the F-box protein Cdc4, which targets Sic1 to degradation at the G1-S transition of the cell cycle, is nested within an 'amino acid metabolism' class enclosing the F-box protein Met30, which targets the transcription activator Met4 towards degradation during methionine biosynthesis. It is interesting to note that these classes encompass the uncharacterized F-box-containing protein Flm1 which, on the basis of its position in the classification tree (Figure 3c), is a candidate to target Csm3, a protein needed for chromosome segregation at meiosis, towards the ubiquitin-dependent proteolysis pathway.
The detailed analysis of the classes shows that the PRODISTIN method clusters proteins belonging to the same molecular complex, pathway or cellular process, and underlines cross-talk between functions. Therefore, the method enables the extraction of complex functional information from interaction networks by considerably reducing their complexity.
Functional predictions and their biological relevance
Table 2. Functional predictions and comparisons with predictions obtained by other means
Predicted function (this study)
Prediction after 
Prediction after 
Prediction after 
Amino acid metabolism, cell cycle control (0)
Mitochondrion organization and biogenesis
Cell cycle control (0)
Cell polarity (1)
Cell polarity and structure, actin cytoskeleton organization and biogenesis
Cell polarity, cell structure, vesicular transport
Cell polarity, mating response (1)
Cell wall organization and biogenesis
Cell stress, other metabolism
Cell structure, protein folding (1)
Protein-vacuolar targeting, cell cycle arrest in response to pheromone
Cell structure, protein folding (0)
Cell cycle arrest in response to pheromone
Cell structure, protein folding (0)
Cell cycle arrest in response to pheromone
DNA synthesis (1)
Spindle pole duplication
Mating response, differentiation, signal transduction
Nucleus-cytoplasm transport (0)
Regulation of mitosis
Nucleus-cytoplasm transport (0)
DNA-dependent DNA replication
PolII transcription (1)
Protein synthesis turnover, protein deneddylation
PolII transcription (1)
Protein synthesis turnover, protein deneddylation
PolII transcription (1)
Transcription from polII promoter, DNA repair
Protein degradation (0)
Intermediate and energy metabolism, transcription, DNA maintenance, chromatin structure, phospholipid metabolism, vacuole inheritance
RNA processing and modification
RNA processing and modification
RNA processing and modification (1)
RNA metabolism, mRNA nucleus export
RNA processing and modification (1)
Deadenylation-dependent decapping, NOT mRNA catabolism, nonsense mediated
RNA processing and modification (1)
Vesicular transport (1)
Vesicular transport (0)
Chromatin silencing at ribosomal DNA, nicotinamide metabolism
For two proteins (5%), no cellular function has ever been proposed by any other method. For 27 proteins (64%), our prediction is in accordance with or related to previously proposed ones, or the experimental results. For 13 proteins (31%), our predictions disagree (Table 2; see also Additional data file 2). When only the 19 experimentally determined functions are considered, PRODISTIN predictions are in accordance with 11/19 (58%) of them. Noticeably, when the functional predictions obtained by the global optimization method (GOM) for the same proteins are considered, only 4/13 (31%) predictions are in accordance with the experimentally determined functions. Taken together, these observations strengthen the relevance of the PRODISTIN predictions for the uncharacterized proteins.
Interestingly enough, the PRODISTIN method also reveals the existence of clusters containing only proteins of unknown function. In one case, a cellular function can now be proposed for the entire cluster: as class 62 (annotated 'unknown') is nested into class 56 (annotated 'vesicular transport'), all its members can therefore be associated with 'vesicular transport' and a posteriori recent experimental results strengthen our predictions (Table 2) [31, 32].
Finally, the putative involvement of proteins of already known function in new cellular processes is also encountered. Class 52 (Figure 3e) contains proteins involved in RNA processing, including the members of the two LSM complexes which play a part in mRNA decapping (Lsm1-7) and pre-mRNA splicing (Lsm2-8). Given that two small subunit ribosomal proteins Rps28A and B have been found to interact with Lsm2, Lsm4, and Lsm8 in the two-hybrid screen from Uetz et al., these authors suggested either a possible involvement of Lsm proteins in translation/ribosomal biogenesis or an unforeseen role of the ribosomal proteins in RNA splicing. As both proteins share all their interactors with Dcp1 (mRNA-decapping enzyme), PRODISTIN rather suggests a novel implication of Rps28A and B in mRNA decay.
Altogether, these results lend further support to the ability of the PRODISTIN method to directly derive a cellular function for proteins from the information contained within the interaction network, without using any additional sequence or structure information.
Statistical evaluations of PRODISTIN clusters
To evaluate the quality of PRODISTIN classifications and predictions on a more statistical basis, four different types of control experiments have been performed in order to assess the influence of various parameters.
First, given that annotations taken from databases may contain inconsistencies, our classification for the yeast proteome (originally established with YPD annotations) was further tested using the Gene Ontology (GO) annotations. We used the GO Term Finder tool from the SGD database to search for significant shared GO terms (or their parents) used to describe the genes of interest and to calculate a p value for the occurrence of common terms (for details see the GO Term Finder Help). Lists of genes constituting all PRODISTIN classes were successively processed with the GO Term Finder for the 'biological process' ontology. On average, for 87.3% of the PRODISTIN classes, the best hit, that is, the common GO term with the lowest p value, is in accordance with the class annotation proposed using YPD annotations. These terms are highly statistically significant, as a p value < 1e-6 is encountered for 83.63% of the classes. Moreover, these terms applied to 77% of the class members on average. As GO terms represent an independent source of functional annotation from YPD, these congruent results confirm that PRODISTIN efficiently clusters proteins having common or related cellular functions.
Table 3. Success rates for PRODISTIN vs majority rule (rows: totally in accordance; partially in accordance; number of proteins on which a prediction is possible)
We then tested PRODISTIN's performance on random networks of identical topologies in order to assess whether PRODISTIN clustering would have occurred by chance. For this, all protein names were reshuffled and randomly assigned to nodes in the network. The PRODISTIN analysis of such networks only allows the construction of a tiny number of classes (15 on average, instead of 63), consequently leading to a very low number of proteins for which a prediction is possible (51 on average instead of 389 in the current study). Finally, the prediction rate drops to 60%. This clearly indicates that random interaction networks never lead to both a high number of PRODISTIN classes and a correct prediction rate, as true networks do.
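The randomization control described above can be sketched as follows. This is a minimal Python illustration, assuming the network is held as a list of binary interactions; the function names `shuffle_labels` and `degree_sequence` and the toy network are hypothetical. Permuting the protein names over the nodes leaves the topology untouched, so any loss of functional coherence in the resulting classes reflects the labels alone.

```python
import random
from collections import Counter


def shuffle_labels(interactions, seed=None):
    """Reassign protein names to nodes at random, keeping the topology.

    Returns the relabelled interaction list and the name mapping used.
    """
    rng = random.Random(seed)
    proteins = sorted({p for pair in interactions for p in pair})
    shuffled = proteins[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(proteins, shuffled))
    return [(relabel[a], relabel[b]) for a, b in interactions], relabel


def degree_sequence(interactions):
    """Sorted list of node degrees, used here to check that topology is preserved."""
    degrees = Counter(p for pair in interactions for p in pair)
    return sorted(degrees.values())


# Toy network with hypothetical protein names
network = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("E", "F")]
randomized, mapping = shuffle_labels(network, seed=0)
```

Repeating the shuffle with different seeds and rerunning the classification on each relabelled network would give averages comparable to those quoted above.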
Protein-protein interactions as good indicators of protein cellular function
We present here a new bioinformatics method that is able to compute a functional clustering of proteins on the basis of protein-protein interaction data. When applied to the yeast interactome, our method classified 602 proteins, representing a significant part of the proteome (11%), into 64 classes of functionally related proteins.
Our method was based on the assumption that a distance formula (the Czekanovski-Dice distance) that uses information on shared interactors could potentially mirror a functional distance between proteins. The demonstration that the classification and the protein clustering resulting from PRODISTIN are essentially driven by the cellular function of proteins gives strong support to our initial assumption. This also may be explained by the fact that the chosen distance formula makes it possible to take into account not only the functional information carried by the nearest neighbors in the protein-protein network, but also by proteins two edges away. Therefore, the obtained distance values, once clustered, are able to highlight subgraphs in the network, such as those formed by proteins involved in the same pathway(s) or cellular process(es).
As we also showed that the PRODISTIN functional distance clusters proteins independently of their sequence similarities and their actual biochemical function, we now have the opportunity to quantify functional relationships between proteins in the same way that sequence alignments make it possible to quantify protein-sequence similarity. PRODISTIN thus represents a useful complement to sequence-comparison methods, which rather point towards proteins that have the same molecular function. It is interesting to note that the majority of proteins with the same biochemical function are not clustered in the tree despite their sequence similarity. This moderate dependence of cellular function on sequence similarities clearly means that many functional similarities are at present missed by sequence-based methods, emphasizing the importance of using other types of data than sequence and structure as a basis for function assessment.
Two major advantages result from the fact that PRODISTIN computes all interactions constitutive to an interaction network at once. First, it produces a large functional tree, allowing direct comparison in terms of cellular function for any pair or group of proteins. Second, it makes it possible to visualize a large number of cellular processes and their main actors in a single integrated view, thus offering the possibility of examining the links between cellular functions, and more broadly, the organization of cellular functions within the interaction network. In doing so, PRODISTIN functional trees can capture the essential part of the functional information buried in complex interaction networks, something which is at present impossible to deduce from the intricate graphical representations. Consequently, PRODISTIN can be considered to be one of the first cellular bioinformatics tools available that allows not only comparison of the function of individual proteins but also the ability to study cell function more globally. For instance, the dissection of given cellular functions into sub-functions visible at network level or the study of the functional relationships between known cellular functions can be investigated. As discussed in Results, PRODISTIN has shown that the 'vesicular transport' general function can be separated into distinct subfunctions. An analytic approach of this kind could be systematically undertaken for all known yeast cellular functions, as they are statistically represented in the tree, and later on for those of other organisms. As far as the second question of the relationships between functions is concerned, PRODISTIN could represent a valuable functional data-mining tool. It is, for instance, interesting to note that, although there exist 44 different YPD 'cellular roles' to describe the complete yeast proteome, of which 42 are represented by more than one protein in the tree, our PRODISTIN classes at present cover only 29 of them. 
Despite the existence of biases in the interaction dataset generally, due to a deeper investigation of certain proteins and to methodological flaws, this observation could suggest a predominant role for these 29 cellular functions in the organization of the network.
Comparison of the PRODISTIN method with recent functional prediction methods
Comparison of the results of PRODISTIN with those of other computational methods for assessing and comparing protein functions is not straightforward. Because of the lack of common interaction sets, functional annotations and evaluation tools, and the sometimes insufficient description of the algorithms used, no simple comparative benchmarking is yet possible. However, in an attempt to evaluate the relative advantages and disadvantages of the different methods, we compared their results when available. For this purpose, we evaluated PRODISTIN against the MRA and two network-based methods, the GOM and the Rives and Galitski method (RGM). We measured their relative behavior in terms of success rate in the prediction of the function of already known proteins (PRODISTIN vs MRA vs GOM), functional assignment of unclassified proteins (PRODISTIN vs GOM), and ability to cope with false-positive and false-negative interactions in the dataset (PRODISTIN vs GOM vs MRA).
Our results (see Table 3) and those of the GOM (Table 1 of the GOM publication) both agree that the MRA has a lower success rate than PRODISTIN or GOM in predicting the function of known proteins. When the ability of GOM and PRODISTIN to predict a function for 42 otherwise uncharacterized proteins is compared to recently published experimental results as a reference, PRODISTIN performs better (Table 2). We found that 58% of PRODISTIN predictions are in accordance with the literature, whereas only 31% of the predictions made by the GOM are.
Finally, when robustness towards the presence of false-positive and false-negative interactions is assayed by changing the topology of the network, the MRA again performs less efficiently than PRODISTIN (Figure 4). In addition, on random networks of identical topology, both PRODISTIN and the RGM (Table 1 of the RGM publication) show that clustering of proteins in true networks is always higher than clustering observed in random networks.
Unlike GOM, PRODISTIN and RGM produce functional trees as an output. But PRODISTIN goes one step further, by finding functional classes on the tree according to two parameters (the minimal number of annotated proteins for the same function in the class and their minimal representation in the class - 3 and 50%, respectively, in this study). This considerably facilitates the process of function assessment, as it minimizes the ambiguity inherent in tree representation. This class construction also has a positive buffering effect that limits the influence of false interactions on the classification and makes it possible to maintain high prediction rates, as already discussed. One may argue that constructing classes limits the number of proteins for which a prediction is possible. It is then important to note that PRODISTIN settings may be changed easily at different levels. Depending on the goal of the user (favoring class coverage of the tree, for instance), the number of proteins per class can be increased by juggling with the two parameters defining the PRODISTIN classes, but at the unavoidable price of a slight decrease in the overall accuracy of the predictions. Switching from the YPD annotation system to the GO system using GO slim categories also increases the number of classified proteins in the tree and consequently, of possible predictions (D.M., B.J. and C.B., unpublished data).
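The effect of the two class parameters can be illustrated with a short Python sketch. It assumes that a subtree qualifies as a class for a given cellular role when at least `min_count` of its members carry that role and when these members make up at least `min_fraction` of the annotated members of the subtree; whether the fraction is taken over annotated members or over all members is an assumption here, and the function name `class_annotation` and the toy data are hypothetical.

```python
from collections import Counter


def class_annotation(members, annotations, min_count=3, min_fraction=0.5):
    """Return the cellular roles for which a subtree qualifies as a class.

    members      -- protein names found in the subtree
    annotations  -- dict mapping a protein to its set of cellular roles
    The fraction is computed over annotated members only (assumption).
    """
    annotated = [p for p in members if annotations.get(p)]
    counts = Counter(role for p in annotated for role in annotations[p])
    return sorted(
        role
        for role, n in counts.items()
        if n >= min_count and n / len(annotated) >= min_fraction
    )


# Toy subtree: P5 is uncharacterized, P3 is doubly annotated
subtree = ["P1", "P2", "P3", "P4", "P5"]
roles = {
    "P1": {"vesicular transport"},
    "P2": {"vesicular transport"},
    "P3": {"vesicular transport", "cell polarity"},
    "P4": {"meiosis"},
}

default_result = class_annotation(subtree, roles)
stricter_result = class_annotation(subtree, roles, min_count=4)
```

In this toy case the default settings yield a 'vesicular transport' class, which would be the role proposed for the uncharacterized member P5, whereas raising the minimal count to 4 yields no class at all. This mirrors the trade-off between coverage and accuracy discussed above.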
As more interactions become available, the coverage of the proteome and the mean number of interactions per protein will increase, therefore improving the relevance of the protein clusters found by the PRODISTIN method. Noticeably, it can be anticipated that using interactions recently described in the literature as well as new interactions produced by large-scale approaches could rapidly lead to the classification of the majority of the yeast proteome. As far as the PRODISTIN method is concerned, work presently in progress in our laboratory will soon totally automate the tedious task of manually constructing PRODISTIN classes on the tree.
Finally, PRODISTIN can be applied not only to the proteomes of unicellular organisms (this study) but also to those of metazoans. The classification trees recently obtained on the Drosophila and the human proteome (C.B., S. Siret, P. Mouren and B.J., unpublished data) show protein clusters having a true biological significance. Furthermore, other types of interaction networks such as genetic interaction networks (A. Baudot, B.J., C.B., unpublished data) and transcriptional networks can also benefit from the application of our general method. These new developments will allow PRODISTIN to be applied to a large variety of biological questions, such as the evolutionary fate of duplicated genes, the functional aspects of horizontal transfer of genes from one species to another, the integration of signaling pathways and the evolutionary comparison of gene networks.
Materials and methods
Protein-protein interaction data sets
Yeast protein-protein interactions were extracted from the MIPS database. Only direct binary interactions were selected, based on the method used for their identification (two-hybrid, excluding high-throughput experiments, in vitro binding, far western, gel retardation and biochemical experiments). For high-throughput two-hybrid experiments, 948 interactions were taken from Uetz et al. and 839 from Ito's core data. This yielded a total of 2,946 interactions involving 2,139 proteins (average connectivity 2.6 interactions per protein). The 1,517 protein-protein interactions involving 730 proteins from Helicobacter pylori and their corresponding PBS categories were taken from Rain et al.
Only proteins involved in at least three binary interactions were selected for further classification. Taking into account that false-positive and false-negative interactions weigh more heavily on poorly connected proteins, and that the estimated number of interactions per protein is close to five [38, 39], we chose to rule out proteins for which the contribution of such false interactions may blur the analysis. Proteins in our dataset have 2.6 interactors on average. We thus chose to set the connectivity threshold for classification to 3, which means that proteins implicated in one or two interactions were not classified but were taken into account for the computation. First, a relation is defined between two proteins to be classified if they interact with each other and/or share at least one common interactor. Subsequently, a graph in which vertices are proteins and edges correspond to this relation was computed. The connected components were computed and the main one, containing almost all of the proteins, was selected. Second, the Czekanovski-Dice distance between all pairs of proteins of this component was then calculated. This classical distance on graphs corresponds to the formula
D(i,j) = #(Int(i) Δ Int(j))/ [#(Int(i) ∪ Int(j)) + #(Int(i) ∩ Int(j))]
in which i and j denote two proteins, Int(i) and Int(j) are the lists of their interactors plus themselves (to decrease the distance between proteins interacting with each other) and Δ is the symmetrical difference between the two sets. This distance was chosen because it increases the weight of the shared interactors by giving more weight to the similarities than to the differences; it is very close to an ultrametric distance because the vast majority of distance values between protein pairs is at a maximum (for two proteins that do not share any interactor, the distance value is 1, the highest value, whereas for two proteins interacting with each other and sharing exactly the same interactors, the distance value is 0, the lowest value). Consequently, the advantage of choosing this distance is that it permits the use of a tree representation. With such distance values, only one tree structure fits the initial distance values, independently of the chosen clustering algorithm. We have used the BioNJ algorithm to build a tree from our distance matrices. This is an improvement of the neighbor-joining algorithm, which takes into account the variance of the distance between proteins to evaluate the length of the branches in the tree. A circular classification tree was then drawn using the TreeDyn package.
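The distance formula can be made concrete with a minimal Python sketch on a toy network. The function names `build_neighbors` and `czekanovski_dice` and the protein names are hypothetical; the sketch covers only the distance computation, not the connectivity threshold or the tree building.

```python
from itertools import combinations


def build_neighbors(interactions):
    """Map each protein to Int(i): the set of its interactors plus itself."""
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, {a}).add(b)
        neighbors.setdefault(b, {b}).add(a)
    return neighbors


def czekanovski_dice(int_i, int_j):
    """D(i,j) = #(Int(i) Δ Int(j)) / [#(Int(i) ∪ Int(j)) + #(Int(i) ∩ Int(j))]."""
    sym_diff = len(int_i ^ int_j)  # symmetrical difference
    return sym_diff / (len(int_i | int_j) + len(int_i & int_j))


# Toy network with hypothetical protein names
interactions = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("E", "F")]
neighbors = build_neighbors(interactions)
distances = {
    (p, q): czekanovski_dice(neighbors[p], neighbors[q])
    for p, q in combinations(sorted(neighbors), 2)
}
```

In this toy network, A and B interact and share exactly the same interactors, so their distance is 0; A and E share no interactor, so their distance is 1; A and C differ only by C's extra partner D, which gives 1/7.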
Sequence alignments and analysis
Pairwise sequence alignments have been performed on the set of 602 protein sequences classified with the PRODISTIN method. Both Needleman-Wunsch (global alignment) and Smith-Waterman (local alignment) algorithms have been applied. The programs used for the two algorithms are available at  and , respectively. The chosen alignment matrix was BLOSUM50, and the gap-opening and gap-extension penalties were set to 12 and 2, respectively. The resulting 363,004 alignments have been processed to calculate the distance corresponding to the percentage of similarity for each protein pair in the global alignment and for the score in the local alignment.
Subtree robustness measurement
The robustness of each subtree was computed by measuring its homogeneity using a criterion based on topology. Considering triples made of two elements within a given subtree and one element outside it (possibly restricted to the sibling subtree), we evaluated the percentage of triples for which the two elements belonging to the same subtree are separated by the smallest distance value. This gave a class robustness index (CRI) for each inner branch, computed with the Qualitree program as a measure of the robustness (quality) of the class below that branch. The CRI may be considered functionally equivalent to the bootstrap index usually used to assess the quality of phylogenetic subtrees. CRI values for PRODISTIN classes are available in Additional data file 1. The average CRI per tree corresponds to the total number of triples for which the two elements belonging to the same subtree are separated by the smallest distance value, divided by the total number of possible triples.
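The triple criterion described above can be sketched as follows. This is not the Qualitree implementation: the function name, the toy distances and the handling of ties (a triple counts only when the within-class distance is strictly the smallest of the three) are assumptions made for illustration.

```python
from itertools import combinations


def class_robustness_index(members, outsiders, dist):
    """Percentage of triples (x, y in the class, z outside it) for which
    d(x, y) is strictly smaller than both d(x, z) and d(y, z)."""
    supported = total = 0
    for x, y in combinations(sorted(members), 2):
        for z in outsiders:
            total += 1
            if dist(x, y) < min(dist(x, z), dist(y, z)):
                supported += 1
    # No triple can be formed for a class with fewer than two members
    # or with no outside element.
    return 100.0 * supported / total if total else None


# Hypothetical toy distances; any pair not listed is at the maximum, 1.0.
toy = {
    frozenset("AB"): 0.10,
    frozenset("AC"): 0.20,
    frozenset("BC"): 0.30,
    frozenset("AE"): 0.15,
}


def d(x, y):
    return toy.get(frozenset((x, y)), 1.0)


cri = class_robustness_index({"A", "B", "C"}, {"D", "E"}, d)
```

Here five of the six possible triples support the class {A, B, C}. The triple (A, C, E) does not, because A is closer to E than to C. Restricting `outsiders` to the sibling subtree gives the restricted variant mentioned in the text.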
Annotation sources and functional tree visualization
We downloaded the 'cellular role', 'functional categories' and 'sub-cellular localization' annotation files for yeast proteins from YPD on 28 May 2002. The category labels were then loaded into TreeDyn for direct visualization of the classes on the trees, as displayed in Figure 2b.
Additional data files
Acknowledgements
We thank J.-C. Rain for providing the H. pylori data, A. Baudot, L. Fasano, S. Gangloff, A. Kissenpfennig, D. Nesic, E. Remy, L. Röder, J. Smith and D. Thieffry for carefully reading the manuscript and for helpful discussions, and Pierre Mouren for technical assistance. This project is supported by three Action Bioinformatique inter-EPST grants to A.G., F.C. and B.J., respectively. C.B. thanks Valigen SA and the Fondation pour la Recherche Médicale for financial support.
References
- Galperin MY, Koonin EV: Who's your neighbor? New computational approaches for functional genomics. Nat Biotechnol 2000, 18:609–613.
- Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction maps for complete genomes based on gene fusion events. Nature 1999, 402:86–90.
- Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D: Detecting protein function and protein-protein interactions from genome sequences. Science 1999, 285:751–753.
- Tamames J, Casari G, Ouzounis C, Valencia A: Conserved clusters of functionally related genes in two bacterial genomes. J Mol Evol 1997, 44:66–73.
- Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 1998, 23:324–328.
- Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 1999, 96:2896–2901.
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 1999, 96:4285–4288.
- Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D: A combined algorithm for genome-wide prediction of protein function. Nature 1999, 402:83–86.
- Bartel PL, Roecklein JA, SenGupta D, Fields S: A protein linkage map of Escherichia coli bacteriophage T7. Nat Genet 1996, 12:72–77.
- Flajolet M, Rotondo G, Daviet L, Bergametti F, Inchauspe G, Tiollais P, Transy C, Legrain P: A genomic approach of the hepatitis C virus generates a protein interaction map. Gene 2000, 242:369–379.
- Fromont-Racine M, Mayes AE, Brunet-Simon A, Rain JC, Colley A, Dix I, Decourty L, Joly N, Ricard F, Beggs JD, Legrain P: Genome-wide protein interaction screens reveal functional networks involving Sm-like proteins. Yeast 2000, 17:95–110.
- Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 2001, 98:4569–4574.
- McCraith S, Holtzman T, Moss B, Fields S: Genome-wide analysis of vaccinia virus protein-protein interactions. Proc Natl Acad Sci USA 2000, 97:4879–4884.
- Rain JC, Selig L, De Reuse H, Battaglia V, Reverdy C, Simon S, Lenzen G, Petel F, Wojcik J, Schachter V, et al.: The protein-protein interaction map of Helicobacter pylori. Nature 2001, 409:211–215.
- Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, et al.: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403:623–627.
- Walhout AJ, Sordella R, Lu X, Hartley JL, Temple GF, Brasch MA, Thierry-Mieg N, Vidal M: Protein interaction mapping in C. elegans using proteins involved in vulval development. Science 2000, 287:116–122.
- Rives AW, Galitski T: Modular organization of cellular networks. Proc Natl Acad Sci USA 2003, 100:1128–1133.
- Vazquez A, Flammini A, Maritan A, Vespignani A: Global protein function prediction from protein-protein interaction networks. Nat Biotechnol 2003, 21:697–700.
- Jacq B: Protein function from the perspective of molecular interactions and genetic networks. Brief Bioinform 2001, 2:38–50.
- Wood V, Rutherford KM, Ivens A, Rajandream M-A, Barrell B: A re-annotation of the Saccharomyces cerevisiae genome. Comp Funct Genomics 2001, 2:143–154.
- Malpertuy A, Tekaia F, Casaregola S, Aigle M, Artiguenave F, Blandin G, Bolotin-Fukuhara M, Bon E, Brottier P, de Montigny J, et al.: Genomic exploration of the hemiascomycetous yeasts: 19. Ascomycetes-specific genes. FEBS Lett 2000, 487:113–121.
- Guénoche A, Garreta H: Can we have confidence in a tree representation? Comput Biol 2001, 2066:45–56.
- Costanzo MC, Crawford ME, Hirschman JE, Kranz JE, Olsen P, Robertson LS, Skrzypek MS, Braun BR, Hopkins KL, Kondu P, et al.: YPD, PombePD and WormPD: model organism volumes of the BioKnowledge library, an integrated resource for protein information. Nucleic Acids Res 2001, 29:75–79.
- Brun C, Baudot A, Guénoche A, Jacq B: The use of protein-protein interaction networks for genome wide protein function comparisons and predictions. In Methods in Proteome and Protein Analysis (Edited by: Kamp RM, Calvete JJ, Choli-Papadopoulou T). Berlin Heidelberg: Springer-Verlag 2004, 103–124.
- Huhse B, Rehling P, Albertini M, Blank L, Meller K, Kunau WH: Pex17p of Saccharomyces cerevisiae is a novel peroxin and component of the peroxisomal protein translocation machinery. J Cell Biol 1998, 140:49–60.
- Patton EE, Willems AR, Tyers M: Combinatorial control in ubiquitin-dependent proteolysis: don't Skp the F-box hypothesis. Trends Genet 1998, 14:236–243.
- Rabitsch KP, Toth A, Galova M, Schleiffer A, Schaffner G, Aigner E, Rupp C, Penkner AM, Moreno-Borchart AC, Primig M, et al.: A screen for genes required for meiosis and spore formation based on whole-genome expression. Curr Biol 2001, 11:1001–1009.
- Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nat Biotechnol 2000, 18:1257–1261.
- Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415:141–147.
- Saccharomyces Genome Database [http://genome-www.stanford.edu/Saccharomyces]
- Calero M, Winand NJ, Collins RN: Identification of the novel proteins Yip4p and Yip5p as Rab GTPase interacting factors. FEBS Lett 2002, 515:89–98.
- Hettema EH, Lewis MJ, Black MW, Pelham HR: Retromer and the sorting nexins Snx4/41/42 mediate distinct retrieval pathways from yeast endosomes. EMBO J 2003, 22:548–557.
- He W, Parker R: Functions of Lsm proteins in mRNA degradation and splicing. Curr Opin Cell Biol 2000, 12:346–350.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25–29.
- SGD Gene Ontology Term Finder [http://genome-www4.stanford.edu/cgi-bin/SGD/GO/goTermFinder]
- Wojcik J, Boneca IG, Legrain P: Prediction, assessment and validation of protein interaction maps in bacteria. J Mol Biol 2002, 323:763–770.
- Mewes HW, Frishman D, Guldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Munsterkotter M, Rudd S, Weil B: MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2002, 30:31–34.
- Legrain P, Wojcik J, Gauthier JM: Protein-protein interaction maps: a lead towards cellular functions. Trends Genet 2001, 17:346–352.
- Grigoriev A: On the number of protein-protein interactions in the yeast proteome. Nucleic Acids Res 2003, 31:4157–4161.
- Gascuel O: BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol 1997, 14:685–695.
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4:406–425.
- Bioinformatics web site of Dr. Andrew C.R. Martin [http://www.bioinf.org.uk/software]
- The European Molecular Biology Open Software Suite [http://www.emboss.org]
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.