GOTax: investigating biological processes and biochemical activities along the taxonomic tree

GOTax, a novel web-based platform that integrates protein annotation with protein family classification and taxonomy, allows for an extensive assessment of functional similarity between proteins and for comparing and analyzing the distribution of protein families and protein functions over different taxonomic groups.

which selects all GO terms from M. paratuberculosis that are not present in human, and limits the evaluation to biological processes.

Semantic comparison
A semantic comparison of two sets of GO terms is performed with a query like: GO WHERE <condition> RESTRICT <limit> SIM GO WHERE <condition> RESTRICT <limit>.
Again, the condition and limit clauses can be defined freely, enabling the selection of arbitrary sets of GO terms for the comparison. An example of such a query is: GO WHERE TAX:1770 AND NOT TAX:9606 SIM GO WHERE TAX:9606.
This query calculates semantic similarity scores between all GO annotations from M. paratuberculosis that are not present in human and all human GO annotations.

Functional comparison
The functional similarity between two sets of gene products is computed with a FUNSIM query. The result may consist of proteins, Pfam domains, or SMART domains, and the condition allows arbitrary sets of these entities to be selected. Therefore, not only complete proteomes can be compared, but also, for example, proteins or domains involved in specific biological processes or molecular functions. The query GENE WHERE TAX:1770 AND GO:GO:00006260 FUNSIM GENE WHERE TAX:9606 AND GO:GO:00006260 performs a functional comparison between proteins from M. paratuberculosis involved in DNA replication and human proteins involved in DNA replication.

Pfam set comparison
Two sets of Pfam domains can be compared with a dedicated query. The result is a list of the domains appearing in both sets, the domains unique to either one of the sets, and the domains present in neither set.
A complete discussion of the query language is given on the GOTaxExplorer homepage [11].
Analyzing the rfunSim score
The funSim score is a measure of the functional similarity of two gene products [2]. It uses the annotation of gene products with Gene Ontology terms and is based on the semantic similarity between two GO terms as used by Lord et al. [1]. Following [2], the funSim score is defined as

\[ \mathit{funSim} = \frac{1}{2}\left[\left(\frac{\mathit{MFscore}}{\max(\mathit{MFscore})}\right)^{2} + \left(\frac{\mathit{BPscore}}{\max(\mathit{BPscore})}\right)^{2}\right] \]

where max(MFscore) and max(BPscore) denote the maximal possible scores. It ranges between 0, indicating no functional similarity, and 1 for exactly matching functions. Because both normalized scores are squared in this definition, the funSim score of a pair of gene products is usually lower than the average of their MFscore and BPscore. Therefore, we define the rfunSim score as the square root of the funSim score,

\[ \mathit{rfunSim} = \sqrt{\mathit{funSim}}. \]

This results in higher functional similarity values for most gene product pairs. In the following, some examples of protein pairs illustrate the difference between funSim and rfunSim scores.
The stress response protein bis1 (O59793) from S. pombe is annotated with the function "protein heterodimerization activity" (GO:0046982) and the process "response to stress" (GO:0006950). The high pH protein 2 (P39734) from S. cerevisiae is involved in the same process but annotated with "protein binding" (GO:0005515) as function. The funSim score of these two proteins is 0.655 and the rfunSim score is 0.809. Since they are involved in the same process and "protein heterodimerization activity" is a descendant of "protein binding" in the GO graph, the rfunSim score seems to reflect the true functional similarity more accurately.
The glucan endo-1,3-alpha-glucosidase agn1 precursor (O13716) from S. pombe is involved in "cell septum edging catabolism" (GO:0030995) and shows the function "glucan endo-1,3-alpha-glucosidase activity" (GO:0051118). Protein EGT2 precursor (P42835) from S. cerevisiae has "cellulase activity" (GO:0008810) and is annotated with the process "cytokinesis" (GO:0000910). These two proteins have a funSim score of 0.364 and an rfunSim score of 0.603. Looking at the GO graph, it becomes evident that "cytokinesis" is an ancestor of "cell septum edging catabolism" and that the functions of the two proteins are related through the common ancestor "hydrolase activity, hydrolyzing O-glycosyl compounds" (GO:0004553). The close relationship between the functions and the processes annotated to the two proteins suggests that the rfunSim score captures the true relationship more accurately.
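As a quick check of the two examples above, the following Python sketch recomputes the rfunSim values from the reported funSim scores. The function names are ours, and the `funsim` helper assumes the definition from [2] with both partial scores already normalized to the range 0 to 1; only the square-root relation is taken directly from the text.

```python
import math


def funsim(bp_score, mf_score):
    # Assumed form of the funSim definition from [2]: mean of the squared
    # BPscore and MFscore, both already normalized to [0, 1].
    return 0.5 * (bp_score ** 2 + mf_score ** 2)


def rfunsim(funsim_score):
    # rfunSim is the square root of funSim, as stated in the text.
    return math.sqrt(funsim_score)


# funSim values reported for the two protein pairs discussed above.
bis1_hph2 = rfunsim(0.655)
agn1_egt2 = rfunsim(0.364)
print(round(bis1_hph2, 3), round(agn1_egt2, 3))

# Hypothetical partial scores, chosen only to illustrate the effect of squaring.
example = funsim(0.8, 0.6)
print(round(example, 3), round(rfunsim(example), 3))
```

With the hypothetical partial scores 0.8 and 0.6, the funSim score falls below their average of 0.7, whereas the rfunSim score lies slightly above it.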
In order to obtain a more objective assessment of the performance of the rfunSim score in comparison with the original funSim score, we used the sets of Inparanoid orthologs (IO) and of protein pairs without significant sequence similarity (NSS) from [2]. The funSim and the rfunSim scores of all protein pairs in both sets were computed and used for estimating prediction performance. Receiver operating characteristic (ROC) curves allow the performance of the scores in predicting true positives (protein pairs in IO) and true negatives (protein pairs in NSS) to be assessed. The ROC curves were calculated and visualized using the ROCR package [18] (Figure S1 and S2). The curves for the two scores are identical, because the square root is a monotonic transformation and therefore preserves the ranking of the protein pairs; the only difference is that the score cut-off at a given true positive and false positive rate is higher for the rfunSim score.
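The identity of the two ROC curves can be illustrated with a minimal Python sketch. It is not the ROCR-based analysis used here; the `roc_points` helper and the toy scores and labels are hypothetical and serve only to show that a monotonic transformation leaves the curve unchanged while raising the cut-offs.

```python
import math


def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points for all cut-offs."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Rank pairs by decreasing score; every distinct score acts as one cut-off.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Process all pairs sharing this score together so ties yield one point.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical toy data: 1 marks an IO pair, 0 marks an NSS pair.
funsim_scores = [0.81, 0.64, 0.49, 0.36, 0.25, 0.09]
labels = [1, 1, 0, 1, 0, 0]
rfunsim_scores = [math.sqrt(s) for s in funsim_scores]

roc_funsim = roc_points(funsim_scores, labels)
roc_rfunsim = roc_points(rfunsim_scores, labels)
print(roc_funsim == roc_rfunsim)
```

Because the square root never lowers a value between 0 and 1, each rfunSim cut-off is at least as high as the corresponding funSim cut-off.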
Figure S1: ROC curve for the rfunSim score. The curve shows the change in true positive rate with varying false positive rate.
The calibration error of a score measures how well the score coincides with the true class membership [20]. It is calculated as follows: first, all protein pairs are ordered according to their score. Then pairs 1-100 are put into a bin, and the percentage of true positives in this bin is calculated. Next, the mean prediction in the bin is calculated, and the absolute difference between the observed true positive frequency and the mean prediction gives the calibration error for this bin. This computation is repeated for protein pairs 2-101, 3-102, and so on. The final calibration error is the mean of all binned calibration errors. For example, protein pairs with a score of 0.6 should belong to IO in 60% of the cases and to NSS in 40% of the cases, and the calibration error measures the deviation from this ideal scenario. For this test, the funSim and rfunSim scores are interpreted as the probability that two proteins are functionally similar. ROCR was used for calculating and plotting the calibration error of both scores (Figure S2). The calibration error of the rfunSim score is smaller than the calibration error of the funSim score up to a score of approximately 0.75, and roughly equal thereafter. The results from the ROC curves and the calibration error analyses support the intuition that the rfunSim score gives better results.
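The sliding-window procedure for the calibration error can be sketched in Python as follows. This is a minimal illustration of the steps described in the text, not the ROCR implementation; the function name, the `window` parameter, and the toy inputs are our own assumptions.

```python
def calibration_error(scores, labels, window=100):
    """Mean absolute gap between observed positive rate and mean score per window."""
    # Order all pairs by score (labels: 1 for IO pairs, 0 for NSS pairs).
    ranked = sorted(zip(scores, labels))
    n = len(ranked)
    if n < window:
        raise ValueError("need at least `window` scored pairs")
    errors = []
    # Slide the window one pair at a time: pairs 1-100, 2-101, 3-102, ...
    for start in range(n - window + 1):
        chunk = ranked[start:start + window]
        mean_prediction = sum(score for score, _ in chunk) / window
        observed_rate = sum(label for _, label in chunk) / window
        errors.append(abs(observed_rate - mean_prediction))
    # The final calibration error is the mean over all windows.
    return sum(errors) / len(errors)


# Hypothetical toy inputs with a window of 2 instead of 100.
perfect = calibration_error([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 1], window=2)
biased = calibration_error([0.5, 0.5, 0.5, 0.5], [1, 1, 1, 1], window=2)
print(perfect, biased)
```

In the first toy case the scores match the class membership exactly, so the error vanishes; in the second, every pair is a true positive but scored 0.5, so each window is off by 0.5.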