Data
In the context of the MouseFunc competition, we were provided with several data sets describing features of the 21,603 genes in the mouse genome (as of February 2006). The datasets are described elsewhere [15]; we provide short descriptions here.
Three datasets, 'Zhang', 'Su', and 'SAGE', provide mRNA expression data. 'Zhang' contains 55 DNA microarray experiments covering 13,567 genes [34]. 'Su' contains 61 DNA microarray experiments covering 18,209 genes [35, 36]. 'SAGE' describes SAGE tag counts at the 'quality 99' cut-off for 139 SAGE libraries in the Mouse Atlas of Gene Expression. Two datasets, 'Pfam' and 'InterPro' provide protein domain data. Pfam contains 3,133 domains covering 15,569 genes [4]. 'InterPro' contains 5,404 domains covering 16,966 genes [37]. There is some overlap between the domains represented in each dataset, and both datasets are very sparse. One dataset, 'PPI', contains protein-protein interaction data from the OPHID database [38]. 'OMIM' contains human disease gene associations for 2,488 diseases/phenotypes and covers 1,938 genes [39]; additional mouse phenotypes were available from the Mouse Genome Database [2] ('MGI') containing 33 phenotypes spanning 3,439 genes, as were very limited phylogenetic profiles calculated using BioMart [40] (18 eukaryotic species) and InParanoid [41] (21 species), covering 15,940 genes and 15,703 genes, respectively.
Along with the data sets, some GO annotations were provided for 19,443 of the 21,603 known mouse genes, although fewer than half were annotated by any one major GO branch (BP, CC, or MF). Of the remaining 2,160 genes, 1,718 were designated as the 'test set', for which functions were to be predicted. Mouse genes have been annotated with 5,624 unique GO labels; however, only 2,816 of these were used for the functional prediction problem, selected for having between 3 and 300 positive examples in the data set. Of the 2,816, 1,726 were from BP, 763 from MF, and 326 from CC. After the conclusion of the MouseFunc competition, the organizers provided the annotations of the 1,718 withheld genes. Of the 1,718, 1,339 were annotated and 379 were unannotated.
Organization of experiments
We evaluated prediction performance in two ways: by average performance over ten-fold cross-validation, and by prediction of the annotations of the MouseFunc held-out genes. For ten-fold cross-validation, we randomly divided the 21,604 MouseFunc genes into 10 equal sized groups. Our evaluation involved ten rounds of holding out one of the groups, training the classifier and networks on the remaining groups, and predicting annotations for the held-out group. For the MouseFunc held-out genes, we trained the classifier and networks on the 19,443 annotated genes, and predicted the annotations for the 1,339 annotated MouseFunc test set genes.
Predicting Gene Ontology annotations by a family of naïve Bayes classifiers
As a representative example of the classification approach, we trained and applied 2,815 naïve Bayes classifiers, each to predict the presence or absence of one GO annotation. Feature vectors for the classifiers were constructed from the following binary MouseFunc datasets: Pfam, InterPro, PPI, Phenotype, phylogenetic_binary, and InParanoid, totaling 11,000 binary features. Use of binary features allowed training of the naïve Bayes classifiers by simple counting. We used the log-odds form of naïve Bayes, calculated as:
where c is the class assignment (that is, 1 if the gene has the annotation, 0 if not), f is a given binary feature, and i is a counter across the features. In this form, the log-likelihood on the left hand side will be positive if the features indicate c = 1, and negative if the features indicate c = 0. Missing features were omitted from the computation. We transformed the resulting log-odds ratio into a score by projecting the log-odds ratio L onto a sigmoid using the following transformation:
s = eL/(1 + eL)
We implemented naïve Bayes and the sigmoid transformation in MATLAB (The Mathworks, Natick, MA, USA).
Predicting Gene Ontology annotation by the association network
The network-based prediction comprised two steps: first, the probability (P) was predicted for each pair of genes to share a GO term using a naïve Bayes classifier based on all the available features. This produced a functional network, in which genes formed vertices, with gene pairs connected by edges of weight P. Second, the GO annotations of each target gene were inferred by propagating the GO annotations of the gene's network neighbors.
In calculating the network, 11 pairwise features were employed (all as provided from the MouseFunc competition): protein-protein interactions, shared interaction partners, shared phenotypes (MGI phenotypes, OMIM diseases), shared domains (Pfam, InterPro), mRNA co-expression (Zhang, Su and SAGE), and phylogenetic profiles (Ensembl, InParanoid). Each feature was scored using one of three scoring schemes: the hypergeometric probability, frequency, or Pearson correlation coefficient.
In the case of protein interactions, shared interaction partners and phylogenetic profiles, we calculated the hypergeometric probability of the partners occurring by random chance, calculated as:
where:
In this approach, P indicates how likely a protein pair (A, B) is to interact (or share at least k interaction partners, or show similar phylogenetic profiles, depending on data set) by chance. For protein-protein interactions, n and m equal the number of interactions in which each protein A and B is involved, respectively. N equals the total number of interactions, and k equals 1. In the case of shared interaction partners, n and m are as before, but N equals the total number of genes, and k equals the number of shared interaction partners between proteins A and B. For phylogenetic profiles, each profile (provided in the competition using both Ensembl and InParanoid) consisted of a binary vector of '1's and '0's, indicating the presence or absence of orthologous genes in the target organisms. In calculating the phylogenetic profile hypergeometric distribution, n and m equal the sum of each vector for the proteins A and B, respectively, N equals the total number of organisms used to construct the phylogenetic profiles, and k is the number of shared organisms in the phylogenetic profile. Finally, we calculate -log P as the score for each of these data sets for each pair of genes.
For the calculation of gene associations using the MGI mouse phenotype, OMIM disease, Pfam domain, and InterPro domain data sets, we employed a frequency model as follows. Each gene was considered to have a feature vector, v
i
, of length n, where each vector element represents the presence (1) or absence (0) of a phenotype, a disease, or a domain. We calculate P as the probability of sharing the observed number of vector elements by chance. -log P is taken as the score, calculated as:
where f
i
is the frequency of the element i among the N total genes.
Finally, for the gene expression data sets (Zhang, Su and, SAGE), we calculated the Pearson correlation coefficient of the mRNA expression levels between each pair of genes across each of the expression data sets.
Given this set of scores associated with each pair of genes, we calculated the probability for a gene pair to share GO terms as follows. The scores for each pairwise feature were discretized into bins (9 to 28 bins, depending upon data set). Bins were chosen by ranking the gene pairs by the given score, then selecting bin locations to separate the top scoring pairs of 10,000, 20,000, 50,000, 100,000, 200,000, 500,000 ... 50,000,000 and L (the total number of pairs) gene pairs. In the case of the mRNA expression data sets, a similar process was also performed after ranking the gene pairs in reverse order (capturing anti-correlation of the expression vectors). In the event that the number of bins was less than 7 after merging the same break points, the bin size was reduced by half until the number of bins became 7 or more. For each gene pair, the posterior probability of sharing at least one GO term was then predicted using a naïve Bayesian classifier based upon the discretized bins.
Finally, a gene association network was constructed between gene pairs using the top 10% probabilities of sharing GO terms as edge weights. GO terms were predicted for each target gene by propagating the GO terms of its neighbors in the network. For a target gene with k neighbors (n1, n2 ... nk), the prediction score, S
i
, for each GO term, G
i
, was calculated as:
where P
k
is the edge weight between the target gene and its k-th neighbor.
Combination of network and classifier predictions
The predicted scores from the networkfull and the classifier were combined by simply applying the given operation (minimum, maximum, mean) to each pair of predictions. If the prediction score by network is Snetwork and the score by classifier is Sclassifer, then the combined score will be min(Snetwork, Sclassifer), (Snetwork+Sclassifer)/2 and max(Snetwork, Sclassifer) for the minimum, the mean, and the maximum score, respectively.
Evaluation of prediction performance
To evaluate the prediction performance of the various algorithms, we calculated the AUC, or area under a ROC curve, for each classification approach as applied to each GO term, and then examined the mean, median, and distribution of the AUC values. An ROC curve plots the true positive rate (sensitivity, TP/[TP+FN]) versus the false positive rate (1 - specificity, FP/(FP+TN)), where TP is true positives, FN is false negatives, FP is false positives, and TN is true negatives. A randomly guessing classifier will have an equal true positive rate and false positive rate, yielding an AUC of 0.5. A classifier making predictions better than random will produce a curve above the diagonal, and an AUC greater than 0.5, with a perfect classifier having an AUC of 1.0. A classifier making predictions worse than random will produce a curve below the diagonal, yielding an AUC between 0 and 0.5. The AUC was calculated using the ROCR library [42], which is shown to generate nearly the same AUC values as the script used for the MouseFunc competition.
We calculated the AUC for the 1,339 annotated genes (out of the 1,714 total genes) in the MouseFunc test set, using the same February 2006 GO annotations used in the competition, and the identical held-out test set of annotated genes. Summary statistics for groups of GO terms (mean, median, standard deviation) include only annotations with at least one positive example. The mean for a given group - for example, 'BP 3 to 10' - is the mean of the AUCs computed for the 1,339 annotated test set genes, for all of the GO terms belonging to the group 'BP 3 to 10', where its GO terms belong to the biological process class and the term frequency is between 3 and 10 in the MouseFunc training data set.