GOToolBox: functional analysis of gene datasets based on Gene Ontology
© Martin et al; licensee BioMed Central Ltd. 2004
Received: 13 April 2004
Accepted: 25 October 2004
Published: 26 November 2004
We have developed methods and tools based on the Gene Ontology (GO) resource allowing the identification of statistically over- or under-represented terms in a gene dataset; the clustering of functionally related genes within a set; and the retrieval of genes sharing annotations with a query gene. GO annotations can also be constrained to a slim hierarchy or a given level of the ontology. The source codes are available upon request, and distributed under the GPL license.
Since complete genome sequences have become available, the amount of annotated genes has increased dramatically. These advances have allowed the systematic comparison of the gene content of different organisms, leading to the conclusion that organisms share the majority of their genes with only relatively few species-specific genes. On this basis, one can develop strategies to infer gene annotations from model species to less experimentally tractable organisms. However, such functional inferences require the definition of species-independent annotation policies.
In this regard, the Gene Ontology consortium  has been created to develop a unified view of gene functional annotations for different model organisms. Three structured vocabularies (or ontologies) have been proposed, which allow the description of molecular functions, biological processes and cellular locations of any gene product, respectively. Whereas the majority of GO terms are common to several organisms, some of them are specific to a few organisms only, enabling the description of some aspects of gene function which are specific to few lineages only. Within each of these ontologies, the terms are organized in a hierarchical way, according to parent-child relationships in a directed acyclic graph (DAG). This allows a progressive functional description, matching the current level of experimental characterization of the corresponding gene product. The hierarchical organization of the gene ontology is particularly well adapted to computational processing and is used for the functional annotations of gene products of several model organisms such as budding yeast , Drosophila , mouse , nematode  and Arabidopsis . More recently, GO annotations for human genes have been proposed in the context of the GOA project .
In parallel, the recent development of new high-throughput methods has generated an enormous amount of functional data and has motivated the development of dedicated analysis tools. For instance, one might wonder whether genes detected as being coexpressed in a DNA chip experiment are related in terms of molecular or cellular function. In practical terms, we address here the following generic questions. First, are there statistically over- or under-represented GO terms associated with a given gene set, compared to the distribution of these terms among the annotations of the complete genome? Second, among a particular gene set, are there closely functionally related gene subsets? And third, are there genes having GO similarities with a given probe gene?
All the programs are written in PERL and use the CGI and DBI modules. All the ontology data and the gene-GO terms associations are taken from the GO consortium website. These data are structured in a PostGreSQL relational database, which is updated monthly. Statistics are calculated using the R statistical programming environment. The web implementation of the GOToolBox is accessible at .
In this section, we describe the five main functionalities of the GOToolBox suite. Two of them (GO-Proxy and GO-Family) are not encompassed by any other GO-processing tool currently available (see also 'Comparison of the GOToolBox with other GO-based analysis programs').
The first step in analyzing gene datasets consists in retrieving, for each individual gene of the dataset, all the corresponding GO terms and their parent terms using the Dataset creation program. The genomic frequency of each GO term associated with genes present in the dataset is then calculated. The resulting information is structured and stored in a data file, available for download on the GOToolBox server for one week. This file contains also the counts of terms within a reference gene dataset (genome or user-defined), and can then be used as an input for the GO-Stats and GO-Proxy programs described below.
An optional tool, GO-Diet, allows either the reduction of the term dataset to a slim GO hierarchy (either one proposed by the GO consortium or a user-defined one) or the restriction of the considered terms to a chosen depth of the ontology. It is also possible to filter terms based on the way these have been assigned to the gene products (evidence code). This tool is useful to decrease the number of GO terms associated with a gene dataset, thereby facilitating the analysis of the results of programs described below, particularly when the input gene list and/or the number of associated GO terms is large. Note that the GO-Diet program can generate a gene-term association file in the TLF format, allowing the use of GO terms as gene labels with the TreeDyn tree drawing program . The GO-Diet options are proposed in the Dataset-Creation form.
GO term statistics
Frequencies of terms within the dataset are calculated and compared with reference frequencies (for example with genomic frequencies or with the frequencies of these terms in the complete list of genes spotted on an array). This procedure allows the delineation of enrichments or depletions of specific terms in the dataset. The probability of obtaining by chance a number k of annotated genes for a given term among a dataset of size n, knowing that the reference dataset contains m such annotated genes out of N genes, is then calculated. This test follows the hypergeometric distribution described in Equation 1:
GO-based gene clustering
The goal of the GO-Proxy tool is to group together functionally related genes on the basis of their GO terms. The rationale sustaining our method is that the more genes have common GO terms, and the less they have specific associated terms, the more likely they are to be functionally related. For any two genes of the gene set, the program calculates an annotation-based distance between genes, taking into account all GO terms that are common to the pair and terms which are specific to each gene. Indeed, any two genes can have 0, 1 or several shared GO terms (common terms) and a variable number of terms specific for each gene (specific terms). This distance is based on the Czekanowski-Dice formula (Equation 2):
In this formula, x and y denote two genes, Terms(x) and Terms(y) are the lists of their associated GO terms, # stands for 'number of ' and Δ for the symmetrical difference between the two sets. This distance formula emphasizes the importance of the shared GO terms by giving more weight to similarities than to differences. Consequently, for two genes that do not share any GO terms, the distance value is 1, the highest possible value, whereas for two genes sharing exactly the same set of GO terms, the distance value is 0, the lowest possible value. All possible binary pairs of genes from the dataset are considered, resulting in a distance matrix.
Next this matrix is processed with a clustering algorithm, such as the WPGMA algorithm, and a functional classification tree is drawn, in which the leaves correspond to input genes. On the basis of this tree, classes can be defined, for instance by using partition rules, and the statistical relevance of the terms associated with each class is calculated using the method described for GO-Stats. The Czekanowski-Dice distance and the corresponding clustering have already proved their effectiveness in delineating protein functional classes derived from the analysis of protein-protein interaction graphs .
Finding GO-related genes
A last tool, GO-Family, aims at finding genes having shared GO terms with a user-entered gene, on the basis of a functional similarity calculation. It searches the genomes either of one or several supported species (five at the moment). Given an input gene name, the program retrieves the associated GO terms and compares them with those of all other genes by calculating a functional similarity percentage. The program then returns the list of similar genes, sorted by score. By similar genes, we mean either genes having more than one common associated term, or genes which have different associated terms but one or more common parent terms.
When measuring the similarity percentage S between the input gene A and another gene G, one can identify terms that are common to the two genes (Tc), and terms that are specific to A (Ta) and G (Tg). Three different similarity measures have been implemented and proposed to the user:
Si = (Tc/(Ta+Tc)) × 100 (3)
Sp = (Tc/(Ta+Tg+Tc)) × 100 (4)
Scd = (1 - ((Ta+Tg)/(Ta+Tg+2Tc))) × 100 (5)
Mining biological data with the GOToolBox
In this section, we provide two examples showing how combinations of several GO analysis tools can be used to validate or further delineate gene functional classifications.
Application of GOToolBox to the study of protein-protein interaction networks
The first is to generate the GO annotation set necessary to define the functional classes of proteins. In this particular study devoted to the yeast interactome, the term dataset was fitted to the fourth ontology level using GO-Diet. We chose to work at this particular level because it was previously shown to provide a good representation of the complexity of the cellular functions of the proteins described by the biological process annotations . The second step is to estimate the relevance of the annotations associated with the resulting classes using associated GO terms. The GO-Stats program can be used in this framework, using as reference dataset the list of proteins given as an input to PRODISTIN (Figure 4b).
Examples of class relevance evaluation
Original class annotations
Most relevant term among class annotations
Number of proteins in the class
Number of class proteins annotated for the term
Number of proteins annotated for the term in the reference set
Conjugation with cellular fusion; perception of abiotic stimulus; cell surface receptor linked signal transduction; sensory perception; response to pheromone during conjugation with cellular fusion
Conjugation with cellular fusion
Functional clustering of sets of transcriptional factor targets
GO can also be used to split gene sets into coherent functional subclasses on the basis of shared annotation terms. As an illustration, we have analyzed a gene set encompassing putative targets of the Engrailed transcription factor in Drosophila melanogaster. These genes were identified on the basis of in vivo UV cross-linking and chromatin immunoprecipitation experiments (X-ChIP) . These experiments led to the cloning and sequencing of several hundreds of DNA fragments, allowing the computational identification of a well conserved DNA pattern, which was closely related to the known engrailed consensus. In order to delineate potential functional biases among engrailed targets, we have used Go-Diet and Go-Proxy to cluster the corresponding genes on the basis of 'Biological Processes' GO annotations.
Classes found by GO-Proxy in a set of transcriptional putative targets and their statistical evaluation
Number of genes in the class
Nucleobase, nucleoside, nucleotide and nucleic acid metabolism
Cell growth and/or maintenance
Response to external stimulus
Comparison of GOToolBox with other GO-based analysis programs
Summary of the functionalities offered by GOToolBox and other GO processing tools
Multiple testing correction
Fisher exact test
Hypergeometric Binomial Chi Square
Fisher exact test
Fisher exact test
Hypergeometric Fisher exact test Binomial
Fisher exact test
Fisher exact test
GO Term-Finder (CPAN)
GO Term-Finder (SGD)
Hypergeometric Fisher exact test Binomial
Functionalities unique to the GOToolBox suite
First, it should be highlighted that, to the best of our knowledge, no other program performing all five functions proposed in GOToolBox exists at present. Furthermore, the GO-Proxy and GO-Family tasks are unique to GOToolBox. These two functionalities are potentially very useful to the biologist. Indeed, on the one hand, the GO-Proxy implementation of a gene-to-gene distance calculation based on several GO terms allows the determination of classes consisting of functionally related genes. This feature should prove useful in all cases where the user wishes to identify functional subgroups within a list of genes of interest. On the other hand, the ability to search for genes similar to a user-defined gene on the basis of related GO terms (GO-Family) is also unique among all GO processing tools. When used to find functionally similar genes within a given species, the GO-Family program is often able to find paralogs as well as other genes with related functions, independently of sequence similarities. Similarly, when used to find functionally similar genes in other species, the program can successfully identify genes with related functions, including orthologs. In addition, the GO-Family program could be very valuable in the context of genome annotation: it could be used by database annotators to verify the coherence of the annotations of genes with known related functions, which if correctly annotated, would indeed be expected to be detected by the program.
Because of the presence of these two programs in our suite, we are inclined to think that GOToolBox represents a major improvement over other GO-based Web tools.
Comparison of statistical analyses performed by all GO-based Web tools
Numerous programs have been developed to provide statistical evaluation of the occurrence of GO terms (Table 3). We compared these programs to GO-Stats at two levels: the statistics used to calculate the enrichment/depletion of GO terms, and the availability of different features, such as the output types and the GO terms filtering utilities to create the gene dataset.
As shown in Table 3 (column 3), four different approaches to calculating the probability of having x genes annotated for a given GO category have been implemented in various dedicated programs: hypergeometric distribution, binomial distribution, Fisher exact test and Chi-square test.
The two latter are non-parametric tests and are therefore less powerful than P-value calculations obtained with both the hypergeometric and the binomial distributions. In particular, the Chi-square test seems to be the less efficient, because it only gives valid results for large gene datasets, and it does not distinguish between over and under-represented terms .
The binomial distribution permits us to calculate the probability of obtaining x genes annotated for a given GO category when randomly picking k times one gene among N genes, leaving the possibility that one gene can be picked many times, which is not the correct situation in our case. It is important to note that when N is large, the hypergeometric distribution tends to give the same results as the binomial distribution. On average, the hypergeometric distribution seems to be both the most adapted model and the most powerful statistical test.
To compare the results obtained with the different methods for P-value calculation, we have implemented these methods in the GO-Stats module of GOToolBox, excepted the Chi-square test for reasons explained above. The implementation of these tests in GO-Stats permits us to compare the methods without having to deal with problems due to program-specific input formats, data update, and supported/unsupported organism species, as is often the case when using different programs. In addition, this gives great flexibility to the user, allowing he or she to use different statistical methods. We verified that (as might be expected) different programs using the same statistical methods give the same results. This was essentially true, with slight variations probably due to the use of different versions of GO by some programs (data not shown). Therefore, the comparison between programs mainly relies on the number of possible statistical tests that are available. As shown in Table 3, three programs (GOToolBox, GFINDer  and CLENCH ) propose the same three possible statistical tests, whereas all other programs have implemented only one method.
However, among these three programs, GOToolBox is the only one in which a multiple testing correction is implemented to adjust P-values and provide a correction for the occurrence of false positives. We choose the Bonferroni correction since it appears to be the most stringent in assessing the significance of enrichment/depletion
Comparison of other features proposed by GO-based web tools
In addition to the statistical tests used by the different programs, the presence of functional features offering flexibility to the end-user can also be considered as a criterion for program comparison. Features such as the GO terms filtering utilities and output types proposed by different programs are worth comparing (Table 3, last two columns).
The GO terms filtering functions allows one to restrict the number of GO terms associated with each gene in the dataset, to facilitate interpretation of the results. Many ways to perform this restriction are possible: either mapping the terms on a slim ontology or fitting the terms to a given level (depth) of the ontology hierarchy. As shown in Table 3, only GOToolBox allows the use of both these filtering methods. They have been implemented and are accessible under the 'Create Dataset' form. In addition, in GOToolBox it is possible to restrict the number of terms associated with each gene, by taking into account only terms inferred in a particular way (for instance, terms inferred from direct assay) and to combine the filtering methods with the slim mapping or the level fitting described above.
As far as the output types are concerned, several programs propose a tabulated output file with terms ranked according to their P-values, (with the exception of GoMiner  and GOTM , therefore precluding the interpretation of the results in these cases). However, a positive attribute of GO Term Finder , GOTM and GoMiner over GOToolBox is that they propose directed acyclic graph (DAG) graphics for visualization of results. At the moment, GO-Stats allows the visualization of relationships between terms in tabulated output only, but a future version of GOToolBox will also incorporate a DAG graphical output option.
In conclusion, the GOToolBox is a multipurpose, flexible and evolvable software suite that compares favorably to all existing GO-based web-analysis programs. Its two unique features, GO-Proxy and GO-Family, enable new kinds of analyses to be carried out, based on the functional annotations of gene datasets These new functionalities are likely to be very useful to many biologists wanting to extract novel and meaningful biological information from gene datasets.
The authors would like to thank Badih Ghattas for helpful discussions. This project is supported by two grants from the Action Bioinformatique inter-EPST, awarded to D.T. and B.J., respectively. D.M. and C.B. are respectively indebted to the French Ministère de l'Education, de la Recherche et de la Technologie, and to the Fondation pour la Recherche Médicale for financial support.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.PubMedPubMed CentralView ArticleGoogle Scholar
- Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, et al: SGD: Saccharomyces Genome Database. Nucleic Acids Res. 1998, 26: 73-79. 10.1093/nar/26.1.73.PubMedPubMed CentralView ArticleGoogle Scholar
- The FlyBase database of the Drosophila Genome Projects and community literature. The FlyBase Consortium. Nucleic Acids Res. 1999, 27: 85-88. 10.1093/nar/27.1.85.Google Scholar
- Blake JA, Eppig JT, Richardson JE, Davisson MT: The Mouse Genome Database (MGD): expanding genetic and genomic resources for the laboratory mouse. The Mouse Genome Database Group. Nucleic Acids Res. 2000, 28: 108-111. 10.1093/nar/28.1.108.PubMedPubMed CentralView ArticleGoogle Scholar
- Stein L, Sternberg P, Durbin R, Thierry-Mieg J, Spieth J: WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 2001, 29: 82-86. 10.1093/nar/29.1.82.PubMedPubMed CentralView ArticleGoogle Scholar
- Huala E, Dickerman AW, Garcia-Hernandez M, Weems D, Reiser L, LaFond F, Hanley D, Kiphart D, Zhuang M, Huang W, et al: The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Res. 2001, 29: 102-105. 10.1093/nar/29.1.102.PubMedPubMed CentralView ArticleGoogle Scholar
- Camon E, Barrell D, Lee V, Dimmer E, Apweiler R: The Gene Ontology Annotation (GOA) Database - an integrated resource of GO annotations to the UniProt Knowledgebase. In Silico Biol. 2004, 4: 5-6.PubMedGoogle Scholar
- GOToolBox. [http://gin.univ-mrs.fr/GOToolBox]
- TreeDyn, a dynamic graphics editor for exploring phylogenetic or classification trees. [http://viradium.mpl.ird.fr/treedyn]
- Brun C, Chevenet F, Martin D, Wojcik J, Guenoche A, Jacq B: Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biol. 2003, 5: R6-10.1186/gb-2003-5-1-r6.PubMedPubMed CentralView ArticleGoogle Scholar
- Solano PJ, Mugat B, Martin D, Girard F, Huibant JM, Ferraz C, Jacq B, Demaille J, Maschat F: Genome-wide identification of in vivo Drosophila Engrailed-binding DNA fragments and related target genes. Development. 2003, 130: 1243-1254. 10.1242/dev.00348.PubMedView ArticleGoogle Scholar
- GO Tools. [http://www.geneontology.org/GO.tools.html]
- Masseroli M, Martucci D, Pinciroli F: GFINDer: Genome Function Integrated Discoverer through dynamic annotation, statistical analysis, and mining. Nucleic Acids Res. 2004, 32: W293-W300. 10.1093/nar/gkh108.PubMedPubMed CentralView ArticleGoogle Scholar
- CLENCH. [http://www.personal.psu.edu/faculty/n/h/nhs109/Clench]
- Zeeberg B, Feng W, Wang G, Wang M, Fojo A, Sunshine M, Narasimhan S, Kane D, Reinhold W, Lababidi S, et al: GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol. 2003, 4: R28-10.1186/gb-2003-4-4-r28.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhang B, Schmoyer D, Kirov S, Snoddy J: GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies. BMC Bioinformatics. 2004, 5: 16-10.1186/1471-2105-5-16.PubMedPubMed CentralView ArticleGoogle Scholar
- GO Term finder. [http://genome-www4.stanford.edu/cgi-bin/SGD/GO/goTermFinder]
- eGOn. [http://nova2.idi.ntnu.no/egon]
- Al-Shahrour F, Diaz-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics. 2004, 20: 578-580. 10.1093/bioinformatics/btg455.PubMedView ArticleGoogle Scholar
- FuncAssociate. [http://llama.med.harvard.edu/cgi/func/funcassociate]
- Robinson M, Grigull J, Mohammad N, Hughes T: FunSpec: a web-based cluster interpreter for yeast. BMC Bioinformatics. 2002, 3: 35-10.1186/1471-2105-3-35.PubMedPubMed CentralView ArticleGoogle Scholar
- Castillo-Davis CI, Hartl DL: GeneMerge - post-genomic analysis, data mining, and hypothesis testing. Bioinformatics. 2003, 19: 891-892. 10.1093/bioinformatics/btg114.PubMedView ArticleGoogle Scholar
- Beissbarth T, Speed TP: GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics. 2004, 20: 1464-1465. 10.1093/bioinformatics/bth088.PubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.