 Method
 Open Access
GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function
Genome Biology, volume 9, Article number: S4 (2008)
Abstract
Background:
Most successful computational approaches for protein function prediction integrate multiple genomics and proteomics data sources to make inferences about the function of unknown proteins. The most accurate of these algorithms have long running times, making them unsuitable for real-time protein function prediction in large genomes. As a result, the predictions of these algorithms are stored in static databases that can easily become outdated. We propose a new algorithm, GeneMANIA, that is as accurate as the leading methods, while capable of predicting protein function in real time.
Results:
We use a fast heuristic algorithm, derived from ridge regression, to integrate multiple functional association networks and predict gene function from a single process-specific network using label propagation. Our algorithm is efficient enough to be deployed on a modern web server and is as accurate as, or more so than, the leading methods on the MouseFunc I benchmark and a new yeast function prediction benchmark; it is robust to redundant and irrelevant data and requires, on average, less than ten seconds of computation time on tasks from these benchmarks.
Conclusion:
GeneMANIA is fast enough to predict gene function on the fly while achieving state-of-the-art accuracy. A prototype version of a GeneMANIA-based web server is available at http://morrislab.med.utoronto.ca/prototype.
Introduction
The vast amount and increasing variety of genomic and proteomic data generated for model organisms creates an opportunity for in silico prediction of gene function through extrapolation of the functional properties of known genes. Genes with similar patterns of expression [1], synthetic lethality [2], or chemical sensitivity [3] often have similar functions. Additionally, function tends to be shared among genes whose gene products interact physically [4], are part of the same complex [5], or have similar three-dimensional structures [6]. Computational analyses have also revealed shared function among genes with similar phylogenetic profiles [7] or with shared protein domains [8]. More accurate predictions can be made by combining multiple heterogeneous sources of genomic and proteomic data [9]. Collectively, these observations have led to functional categorization of a number of previously uncharacterized genes using the so-called 'guilt-by-association' principle [10–12].
Algorithms that predict gene function using the guilt-by-association principle do so by extending a 'seed list' of genes known to have the given function by adding other genes highly associated with the seed list in one or more genomic and proteomic data sources. These algorithms typically compute a 'functional association network' to represent each dataset; in this network the nodes correspond to genes or proteins and the undirected links (or edges) are weighted according to evidence of co-functionality implied by the data source. Types of functional association networks include kernels used by support vector machines (SVMs) [9], functional linkage networks [13], and protein-protein linkage maps [14]. Individual functional association networks are often combined to generate a composite functional association network that summarizes all of the evidence of co-functionality. This network is then used as input to an algorithm that scores each gene based on its proximity to the genes in the seed list. When employed on multiple complementary data sources, these algorithms can accurately predict previously annotated gene functions in blind tests [15], suggesting that their predictions for unannotated genes are also quite accurate.
Despite these successes, guilt-by-association algorithms have yet to achieve widespread use in gene annotation or as sources of new hypotheses about gene function; to do so, their predictions need to become more accessible, more accurate, and more regularly updated. In principle, all available data should be used when generating hypotheses about gene function; however, compiling a large number of heterogeneous data sources, generating functional association networks to represent these sources, and then mapping gene identifiers among the networks is a complex and onerous task that is best handled by specialists. Centrally managed web-based 'prediction servers' are an efficient strategy to ensure that casual users have access to the best available predictions.
However, maintaining accurate and up-to-date prediction servers can be computationally prohibitive. Though a large number of algorithms have been developed to predict the function of unannotated genes by combining heterogeneous data sources (see [16] for a recent review), the most accurate of these algorithms have long running times, which can range from minutes [17] to hours [9] on yeast. Larger mammalian genomes increase the run time of these algorithms even more. As such, these algorithms cannot feasibly be run online and instead their predictions are made offline based on sets of predefined seed lists derived, for example, from Gene Ontology (GO) annotations [18]. However, because new data and annotations are being generated at a rapid rate, maintaining an up-to-date database of the best available predictions for all possible functions requires substantial and potentially unavailable computational resources.
Due to this limitation, most prediction servers sacrifice accuracy for speed by relying on a single, or a small number of, precomputed composite functional association networks and using simple heuristics to score genes based on a given seed list (for example, see [13, 14, 19]). While the scoring heuristics are fast enough to provide online predictions for arbitrary seed lists, we will show that their predictions are much less accurate than more advanced methods. Furthermore, by using a single precomputed network, these servers do not take advantage of the fact that different data sources are more relevant for different categories of gene function [2, 9] and are not extensible to new or user-supplied data sources.
Here we demonstrate that it is not necessary to surrender either accuracy or flexibility when building a prediction server by showing that GeneMANIA (Multiple Association Network Integration Algorithm) can, in seconds, generate genome-wide predictions that achieve state-of-the-art accuracy on arbitrary seed gene lists without relying on a pre-specified association network. We have achieved this goal through a series of algorithmic and technical advances that we have encapsulated in a new software package. With GeneMANIA, it is no longer necessary to maintain lists of in silico predictions of gene function because they can be recomputed as needed.
Background
Automated methods for predicting gene and protein function
Many algorithms for predicting gene function use a functional association network as a common representation of evidence of co-functionality for heterogeneous datasets. In this representation, nodes represent genes or proteins and the undirected edges (or links) between these nodes are weighted according to the evidence for co-functionality of the genes implied by the particular datasets. There are various techniques for deriving these weights depending on the source of data, and it is still unclear how best to extract evidence of co-functionality (see, for example, [20, 21]); as such, here we use a single method to calculate these weights (see Materials and methods for details). Here, we assume that all of the association weights are positive.
Predicting gene function from functional association networks
Following Lanckriet and coworkers [9], we pose the problem of predicting gene function as a binary classification problem and attempt to solve the problem of integrating multiple heterogeneous input data sources by assigning each functional association network derived from these data sources a positive weight that reflects its usefulness in predicting a given function of interest. Once these weights are calculated, we construct a function-specific association network by taking the weighted average of the individual association networks. In contrast to prior work [9, 17], in which the weights and discriminant values were optimized simultaneously using the same objective function, we use a separate objective function to fit the weights, leading to a simpler optimization problem and a substantial reduction in run time.
We predict gene function from the composite network using a variation of the Gaussian field label propagation algorithm [22, 23] that we adapted to be more suitable for unbalanced classification problems like gene function prediction. Label propagation algorithms, like most function prediction algorithms, assign a score to each node in the network, called the 'discriminant value', that reflects the computed degree of association that the node has to the seed list defining the given function. This value can be thresholded to make predictions. While there is a wide variety of label propagation algorithms (for example, those based on functional flow [24, 25] or Markov random fields [26]), our interest in the Gaussian field algorithm stems from the fact that it has a well-defined solution, it is derived from a principled framework, and it has recently been shown to perform as well as SVMs on gene function prediction tasks [17]. SVMs had previously been shown to be among the best algorithms for predicting gene function [27, 28].
Results
The GeneMANIA algorithm consists of two parts: an algorithm, based on linear regression, for calculating a single, composite functional association network from multiple networks derived from different genomic or proteomic data sources; and a label propagation algorithm for predicting gene function given this composite network. Below we describe the two parts of the GeneMANIA algorithm individually. We evaluate GeneMANIA on yeast benchmarks as well as the MouseFunc I benchmark.
GeneMANIA label propagation
Gaussian field label propagation algorithms for binary classification take as input an association network, a list of nodes with positive labels, possibly a list of nodes with negative labels, and initial label bias values. For gene function prediction, each gene is associated with a single node; nodes representing genes in the seed list (that is, positive genes) are assigned an initial label bias value of +1, and those representing genes that are deemed negative examples are assigned an initial bias value of −1. Below, we explore two strategies for defining negative examples; here we assume that these examples are provided. GeneMANIA label propagation differs from [22] in how we assign the label bias to unlabeled genes, which are those that appear neither on the seed list nor among the negative examples.
Given this input, label propagation algorithms assign a discriminant value to each node. These discriminant values are determined by letting the initial label bias of nodes propagate through the association network to nearby nodes. To account for potential noise in the initial labelings, GeneMANIA label propagation, like [22], allows the discriminant values assigned to positively and negatively labeled nodes to deviate from their initial biases.
Our label propagation algorithm assigns discriminant values by finding those that minimize a cost function that penalizes both differences between the discriminant values of neighboring nodes in the network as well as differences between the discriminant values of nodes and their label bias (see Materials and methods for details). This cost function allows information about the node labels to propagate through the network to affect the discriminant values of genes that are not directly connected to the seed list. As such, the initial bias values of labeled and unlabeled nodes are important in determining their discriminant values and those of their neighbors; to account for the fact that only a small portion of the genes are expected to be labeled as positive, in the GeneMANIA algorithm we set the initial bias of unlabeled nodes to be the average bias of the labeled nodes: $\frac{n^{+} - n^{-}}{n^{+} + n^{-}}$, where $n^{+}$ is the number of positive and $n^{-}$ is the number of negative examples.
When the number of positively and negatively labeled genes is equal, the GeneMANIA label propagation algorithm is identical to that of Zhou and coworkers [22] (hereafter ZBLWS), and similar to the original Gaussian fields label propagation algorithm [23], since they both assign an initial label bias of 0 to the unlabeled nodes. However, in gene function prediction problems, the number of positives is almost always only a very small proportion of the total number of genes and our method of setting these label biases makes a dramatic improvement in prediction accuracy in these cases (Figure 1).
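The scheme above can be illustrated with a small worked sketch. The following is a minimal, illustrative implementation, not the authors' code: it uses the unnormalized graph Laplacian and omits the normalization and regularization constants of the published algorithm. It assigns biases of +1 and −1 to labeled nodes and the average bias (n⁺ − n⁻)/(n⁺ + n⁻) to unlabeled ones, then solves (I + L)f = y directly for the discriminant values.

```python
def gauss_solve(A, b):
    """Solve the dense linear system A x = b by Gaussian elimination
    with partial pivoting (adequate for this tiny example)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            fac = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= fac * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def label_propagation(W, labels):
    """W: symmetric association weights with zero diagonal; labels: +1, -1,
    or None for unlabeled genes.  Minimizes
        sum_i (f_i - y_i)^2 + sum_ij W_ij (f_i - f_j)^2,
    whose minimizer solves (I + L) f = y with L = D - W the graph Laplacian."""
    n = len(W)
    n_pos = sum(1 for lab in labels if lab == 1)
    n_neg = sum(1 for lab in labels if lab == -1)
    bias = (n_pos - n_neg) / (n_pos + n_neg)  # average bias for unlabeled nodes
    y = [float(lab) if lab is not None else bias for lab in labels]
    deg = [sum(row) for row in W]
    A = [[((1.0 + deg[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    return gauss_solve(A, y)

# Four-node chain 0-1-2-3; node 0 is a positive example, node 3 a negative one.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
f = label_propagation(W, [1, None, None, -1])
# Discriminant values decay with distance from the positive seed.
```

Note how the unlabeled interior nodes receive discriminant values between those of the two labeled endpoints, reflecting their position in the network.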
Efficient implementation of GeneMANIA label propagation for large genomes
In GeneMANIA, label propagation on the composite association network is the most time-consuming step. Here we describe how we implemented this algorithm in order to make it more efficient.
The discriminant values resulting from GeneMANIA label propagation can be calculated by solving a system of linear equations, y = Af, for f, the vector of discriminant values, given A, the coefficient matrix (whose derivation from the association network weights is described in Materials and methods), and y, the vector of node label biases. While in principle this system can be solved by multiplying the inverse of the coefficient matrix by y, we instead employ a conjugate gradient (CG) method that makes efficient use of CPU processing time and computer memory resources, especially for large genomes.
To show why the CG method is well-suited for this problem, we will briefly describe the algorithm. The CG method iteratively improves an estimate, f^{t}, of a solution to the linear system as follows: at each iteration t, the current estimate, f^{t}, is multiplied by the matrix A. If the result of this matrix multiplication, y^{t} = Af^{t}, is equal to y, then f^{t} is a correct solution. On the other hand, if y^{t} does not equal y, then the CG method calculates a new estimate, f^{t+1}, based on the difference between y^{t} and y. Calculating f^{t+1} requires performing another matrix multiplication between A and a vector with the same number of elements as f^{t}. The new estimate, f^{t+1}, is guaranteed to be more accurate than f^{t} [29] and, starting from a random estimate of f, the CG method is guaranteed to converge to a correct solution after n iterations, where n is the number of nodes in the network. In practice, however, the CG method can converge in many fewer iterations.
By reducing the number, m, of nonzero elements in A, we can reduce the runtime of the CG method for label propagation. The runtime of each CG iteration is proportional to m, which is equal to the number of edges plus the number of nodes in the functional association network that A represents. Many types of functional association networks are sparsely connected, and those that are not, such as networks derived from molecular profiling data, can be made sparse with little resulting impact on the performance of GeneMANIA (Figure 2). In contrast to the CG method, the run times of standard methods for solving linear systems, like Gaussian elimination, depend on the number of potential connections, which can be orders of magnitude larger than the number of actual ones.
Using the CG method for label propagation leads to further time reductions because it is possible to get very close to the exact solution with only a fraction of the full number of required iterations. Indeed, we found that for networks with sizes ranging from 1,000 nodes to 20,000 nodes, on average less than 20 CG iterations were needed to derive a solution that is within the computer roundoff error of the exact solution (Figure 3). Because there does not appear to be any relationship between network size and number of CG iterations required to reach a solution in our application, in practice, the running time of our label propagation algorithm depends only upon m, in contrast to other methods for solving linear systems whose runtime in the worst case would depend on the product of m and n.
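The iteration described above can be sketched in a few lines. This is a generic textbook conjugate gradient solver for a symmetric positive definite system, not the authors' MATLAB implementation; the solver touches A only through a matrix-vector product, so for a sparse A each iteration's cost is proportional to the number m of nonzero entries.

```python
import math

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A, given only a
    function computing the product A @ v.  Each iteration performs one
    matrix-vector product."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual b - A x (x starts at zero)
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if math.sqrt(rs) < tol:   # residual small enough: converged
            break
        Ap = matvec(p)
        alpha = rs / sum(pi * Api for pi, Api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * Api for ri, Api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Example: a 2x2 system (I + L) f = y for a two-node network with one edge.
A = [[2.0, -1.0], [-1.0, 2.0]]
matvec = lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A]
f = conjugate_gradient(matvec, [1.0, 0.0])
# f is approximately [2/3, 1/3]
```

In a real deployment the matvec would iterate over a sparse adjacency structure rather than a dense matrix, which is exactly what makes the per-iteration cost proportional to m.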
In summary, to make our algorithm efficient for large genomes, we use a conjugate gradientbased procedure to solve for the discriminant values and we sparsify functional association networks so that only the most informative associations are retained. These two changes speed up the label propagation algorithm by orders of magnitude for large genomes.
GeneMANIA network integration
Like two recent methods for combining multiple functional association networks [9, 17], GeneMANIA builds a composite functional association network by taking a weighted average of the individual functional association networks. However, unlike these two methods, GeneMANIA optimizes the network weights and calculates the discriminant values separately. The advantage of this approach is that, unlike the algorithm described in Tsuda and coworkers [17] (hereafter TSS), which also uses Gaussian field label propagation, GeneMANIA needs to run the computationally intensive label propagation only once, whereas TSS runs it multiple times on different networks while computing the network weights. For example, on our five-network yeast benchmark (described in Materials and methods), across 400 GO functional categories, the TSS algorithm ran label propagation, on average, 68.8 times per category.
Our approach to computing network weights was inspired by recent work showing that when all available functional association networks are relevant for predicting gene function, a composite network generated by weighting each data source equally supports predictions as accurate as those derived using a composite network generated by an optimal choice of weights [17, 30], but that when some of the association networks are irrelevant, the prediction performance of the equal-weighting scheme is degraded [30]. These observations suggest that heuristic network-weighting methods that can identify and downweight irrelevant networks may be as accurate as optimal network weightings. As we show later, it is also important to be able to identify and downweight redundant functional association networks.
The regularized linear regression algorithm that GeneMANIA uses is, by design, robust to the inclusion of irrelevant and redundant networks. This property is especially important when the data sources cannot be carefully controlled; for example, in a web server that automatically downloads new data from web repositories or allows users to contribute their own data.
To set the network weights, we use ridge regression [31], which corresponds to finding a vector of network weights, α = [α_{1},..., α_{d}]^{T}, that minimizes the cost function (t − Ωα)^{T}(t − Ωα) + (α − α̃)^{T}S(α − α̃). Here α_{i} is the weight of the i^{th} network, t is a vector derived from the initial labels of the labeled nodes, Ω is a matrix whose columns correspond to the individual association networks, α̃ is the mean prior weight vector, and S is a diagonal precision matrix. When all the diagonal entries of S are set to zero, we say that the cost function is unregularized and solving for α becomes equivalent to unregularized linear regression. For versions of the GeneMANIA algorithm designed to be deployed on a web server, we have limited prior information about the networks and we set the mean prior weight of all networks to be equal. We call this the 'equal-weight' prior. However, when predicting membership in a GO functional category, we can improve prediction accuracy if we set the mean prior weight of each network to be the average of all the weight vectors calculated using unregularized linear regression on a large number of functional classes in the same branch of GO. We call this the 'branch-specific' weight prior (see Materials and methods for details).
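Setting the gradient of this ridge regression cost to zero gives the closed-form solution α = (Ω^{T}Ω + S)^{-1}(Ω^{T}t + Sα̃), where α̃ denotes the mean prior weight vector. The sketch below is an illustration of that closed form on a toy two-network example, not the authors' code; in particular, it omits the constraint that network weights be non-negative.

```python
def gauss_solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            fac = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= fac * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ridge_weights(omega, t, S_diag, alpha_prior):
    """Minimize (t - Omega a)^T (t - Omega a) + (a - a~)^T S (a - a~)
    by solving the normal equations (Omega^T Omega + S) a = Omega^T t + S a~.
    omega: rows are observations, columns are networks; S_diag: diagonal of S."""
    d = len(omega[0])
    A = [[sum(row[i] * row[j] for row in omega) + (S_diag[i] if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * ti for row, ti in zip(omega, t)) + S_diag[i] * alpha_prior[i]
         for i in range(d)]
    return gauss_solve(A, b)

# Two hypothetical networks; the target t is fully explained by the first.
omega = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
t = [2.0, 0.0, 2.0]
unregularized = ridge_weights(omega, t, [0.0, 0.0], [0.0, 0.0])
equal_prior = ridge_weights(omega, t, [1.0, 1.0], [0.5, 0.5])
```

With S = 0 the solution reduces to ordinary least squares and recovers the relevant network exactly; with the equal-weight prior, both weights are shrunk toward the prior mean, which is what protects small seed lists from overfitting.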
Gene function prediction in mouse
To evaluate GeneMANIA, we performed function prediction using the MouseFunc I benchmark data. To do so, we constructed ten association networks from the ten datasets (see Materials and methods for details) and used the GeneMANIA network integration and label propagation algorithms to predict gene function for each function class separately. When constructing individual association networks, we set the sparsity level to S = 50; that is, we kept only the top 50 association weights (links) for each gene and set the rest to zero. Setting S = 100 results in better performance on the most specific GO biological process (BP) classes, while it has minimal effect in other evaluation categories (supplementary Figure 2 in Additional data file 1).
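The top-S sparsification just described can be sketched as follows. This is an illustrative reconstruction, not the published procedure: in particular, the rule of retaining an edge if it survives in either endpoint's top-S list is our assumption for keeping the result symmetric (see Materials and methods of the paper for the exact method).

```python
def sparsify(W, S):
    """Keep the top-S association weights per gene (row) of a symmetric,
    zero-diagonal weight matrix W, then symmetrize by retaining an edge
    if either endpoint kept it."""
    n = len(W)
    kept = []
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i and W[i][j] > 0),
                            key=lambda j: W[i][j], reverse=True)
        kept.append(set(neighbours[:S]))
    return [[W[i][j] if (j in kept[i] or i in kept[j]) else 0.0
             for j in range(n)] for i in range(n)]

# Toy 4-gene network: with S = 1 each gene keeps only its strongest link.
W = [[0.0, 0.9, 0.2, 0.1],
     [0.9, 0.0, 0.3, 0.2],
     [0.2, 0.3, 0.0, 0.8],
     [0.1, 0.2, 0.8, 0.0]]
Ws = sparsify(W, 1)
# Only the (0,1) and (2,3) edges survive.
```

Because each gene retains at most S links (plus links kept by its neighbors), the number of nonzero entries in the resulting matrix grows linearly in the number of genes, which is what bounds the cost of each CG iteration.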
We follow the same procedure as [15] and compare prediction accuracies using average area under the receiver operating characteristic (ROC) curve (AUC) in twelve MouseFunc I evaluation categories. Briefly, the evaluation classes are created by grouping GO categories corresponding to all pairwise combinations of the three GO branches (BP, cellular component [CC], and molecular function [MF]) and four specificity levels, that is, number of annotations ([3 to 10], [11 to 30], [31 to 100], and [101 to 300]). We report prediction performance on both the 'test' and the 'novel' benchmarks (see [15] for details).
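For reference, the AUC used throughout this evaluation can be computed directly from pairwise score comparisons (the Wilcoxon-Mann-Whitney statistic); the snippet below is a standard formulation for illustration, not code from the paper, and uses a quadratic pairwise loop that is fine for small examples.

```python
def auc(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs in which the positive example receives the higher score;
    ties count as half.  Labels are 1 (positive) or 0 (negative)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranking gives AUC = 1 (error 1 - AUC = 0);
# an uninformative ranking gives AUC near 0.5.
perfect = auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])  # -> 1.0
chance = auc([0.9, 0.2, 0.6, 0.4], [1, 1, 0, 0])   # -> 0.5
```

Reporting 1 − AUC, as done below, simply turns this into an error measure where lower is better.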
We compare GeneMANIA's predictions in the first and second rounds of MouseFunc I (GeneMANIA^{Entry1} and GeneMANIA^{Entry2}, respectively) and the version of GeneMANIA designed to be implemented on a web server (GeneMANIA^{WS}) against the best-performing methods in MouseFunc I. For GeneMANIA^{WS}, we assume no knowledge about the source of the gene list, whereas in MouseFunc I the competing methods were provided with the name of the GO category from which each gene list was derived. Below we discuss the outcome of our evaluation on the 'test' benchmark and then on the 'novel' benchmark.
Figure 4 shows the prediction performance of three versions of GeneMANIA as well as the best performance that was achieved on the test benchmark; GeneMANIA^{Entry1} achieved the highest prediction performance on all evaluation classes containing GO categories with more than ten annotated genes, and GeneMANIA^{Entry2} achieved the highest prediction performance on the categories with ten or fewer annotated genes. We measure performance by using one minus the AUC (1 − AUC), so lower values indicate better performance. GeneMANIA^{Entry2} had significantly lower median 1 − AUC than all other methods in predicting BP and MF categories with 3 to 10 annotations (p < 0.05, Wilcoxon-Mann-Whitney), while GeneMANIA^{Entry1} and GeneMANIA^{WS} had significantly lower median 1 − AUC compared with GeneMANIA^{Entry2} in predicting BP categories with 31 to 100 and 101 to 300 annotations, CC categories with 101 to 300 annotations, and MF categories with 31 to 100 annotations.
The reduced performance of GeneMANIA^{Entry1} in the categories with the fewest annotations is likely due to overfitting, because we used unregularized linear regression to set the network weights. In GeneMANIA^{Entry2}, we switched to ridge regression with the branch-specific weight prior, which is less prone to overfitting. GeneMANIA^{WS}, which uses an equal-weight prior, also has improved performance on the smaller GO categories.
We suspect that one reason for the drop in prediction performance of GeneMANIA^{Entry2} compared with GeneMANIA^{Entry1} on the larger GO categories in the test benchmark is our definition of negative examples. In our first entry, we defined as negative examples all genes with any GO annotation but not one in the category being predicted. In our second entry, we refined this definition so that the negative examples for a given category were only those annotated to a sibling category, that is, one that shared a parent in the GO hierarchy. Although choosing the genes annotated to sibling categories of interest as negatives improved prediction performance on novel tasks (Figure 5), it degraded the prediction performance on the test set. This may be due to the reduction in the number of negative examples. One way to alleviate this effect is to assign label biases in the range [−1, 0] to genes that have GO annotations but are not annotated to the function of interest.
On the novel benchmark, in addition to the ten association networks, annotations for the test genes that were eight months out of date were also made available. Annotations in other categories of gene function can be informative for predicting new annotations in a given category (for example [32]); in our second entry to MouseFunc I, we investigated two ways in which old annotations may be helpful in predicting new annotations on the novel benchmark. First, we reasoned that genes previously annotated in parent categories but not sibling categories should be considered as potential new positive examples and as such, as described earlier, we used only genes annotated to sibling categories as negatives. Second, annotations in one branch of the GO often imply annotations in a different branch. For example, transcription factors (MF annotation) are located in the nucleus at least some of the time (CC annotation) and a number of transcription factors are involved in cell cycle regulation (BP annotation). To include this information, we constructed association networks from each branch of GO (see Materials and methods).
Figure 5 shows the prediction accuracies of three versions of GeneMANIA as well as the best accuracy obtained for each evaluation set on the 'novel' MouseFunc I benchmark. For the evaluation classes containing the GO categories with 3 to 10 or 11 to 30 annotations, GeneMANIA entries still had the lowest error on the MouseFunc I benchmark, with GeneMANIA^{Entry2} having significantly smaller median error in five of the six 'small-sized' evaluation categories (p < 0.05, Wilcoxon-Mann-Whitney). GeneMANIA did less well in the larger GO categories, possibly suggesting that we are not making full use of the older annotations. In contrast to the test benchmark, here GeneMANIA^{Entry2}'s improved performance can mostly be attributed to the choice of negative examples (data not shown), although the inclusion of the GO networks also slightly improves prediction performance (data not shown).
In summary, we observed that employing regularization when using linear regression to combine multiple networks results in a drastic improvement in prediction accuracy on the most specific functional classes. This is because, with only a few positive examples, identifying relevant networks is a challenging task that is prone to overfitting. One way of alleviating this effect is to estimate the relevance of each network based on its average weight over a large number of similar prediction tasks (for example, prediction of functional classes in the same branch of GO). We have also demonstrated that, in the binary classification of genes according to GO classes, the genes that are used as negative examples have a large impact on the prediction outcome with label propagation.
Performance of GeneMANIA on the yeast benchmark
Inspired by the performance of GeneMANIA on the MouseFunc I benchmark, we compared GeneMANIA's performance to that of the TSS algorithm and bioPIXIE [14] on yeast data. To compare GeneMANIA with the TSS algorithm, we used five yeast functional association networks (from [9]) and 400 yeast GO functional classes that we derived from a 2006 version of the GO annotations (see Materials and methods for details); the TSS algorithm had previously been shown to be as accurate as an SVM with optimized network weights while requiring orders of magnitude less computation time [17], enabling us to run much more extensive comparisons. We compared GeneMANIA with the bioPIXIE network because it is currently deployed on a popular website that provides gene function predictions. For bioPIXIE, we evaluated both the network and the probabilistic graph search algorithm published with it. Note that the bioPIXIE network was built using different data sources, which are not included in the five-network yeast benchmark, so the reported performance is with regard to the specific published network, not the Bayesian network algorithm used to derive it. In addition, to illustrate the accuracies attainable with additional datasets, we performed function prediction with GeneMANIA using a 15-network yeast benchmark data set (GeneMANIA^{15}) that we derived from recent genomics and proteomics data sources (see Materials and methods). However, we were unable to use these networks as input to the TSS algorithm due to its memory requirements.
Figure 6 shows the prediction performance on the yeast benchmark on BP and CC evaluation classes (see supplementary Figure 4 in Additional data file 1 for results on MF functional classes) for the TSS algorithm, bioPIXIE, and GeneMANIA on a variety of input collections of functional association networks. The comparative performance of GeneMANIA^{15} (GM15) to GeneMANIA label propagation applied to the bioPIXIE network (GMbiPx) and to the bioPIXIE probabilistic graph search (PGS) algorithm applied to the same network (PGSbiPx) underlines the value of the GeneMANIA algorithm for a web server. Given the bioPIXIE network as input, GeneMANIA has a significantly (p < 0.05, Wilcoxon-Mann-Whitney) lower median 1 − AUC than PGSbiPx in all evaluation classes, showing that using simple heuristics to assign discriminant values leads to a considerable loss in accuracy. The value of being able to tailor the composite functional association network to the input seed list is shown by the comparative performance of GMbiPx and GM15: while GMbiPx performs well compared with GM15 in predicting the smaller GO BP categories, for which the bioPIXIE network was optimized, GMbiPx has much higher error and significantly worse performance (p < 0.05, Wilcoxon-Mann-Whitney) for the largest GO BP categories and for GO categories from other branches of the hierarchy that the bioPIXIE network was not designed to predict. In our tests, GMbiPx outperforms GM15 on the smaller GO BP categories; however, the performance of GMbiPx is likely overestimated because the test sets used for evaluating GMbiPx are not independent of the training sets used to build the bioPIXIE network, since we were not able to recover the cross-validation folds used in [14].
We were not able to compare GeneMANIA with the TSS algorithm on the 15-network yeast benchmark; the TSS algorithm (as obtained from [33]), run on the 15 yeast networks, resulted in an out-of-memory error when solving the system of linear equations under MATLAB (version 7.4) on a quad-core Mac desktop with Intel Xeon 2.66 GHz processors and 4 GB of memory. However, when comparing GeneMANIA with TSS on the five yeast networks from [17] (TSS5 versus GM5), we see that GeneMANIA performs as well as or better than TSS; among the evaluation categories shown in Figure 6, GM5 has a significantly lower median 1 − AUC than TSS5 on BP and CC categories with 31 to 100 and 101 to 300 annotations (p < 0.05, Wilcoxon-Mann-Whitney).
Running time and prediction accuracy
Figure 7 demonstrates the trade-off between computation time and mean error of GeneMANIA, the TSS algorithm, and the bioPIXIE graph search algorithm on the yeast benchmark. GeneMANIA with 15 networks and the branch-specific weight prior achieves the lowest error in yeast and we use it as a benchmark for comparing the error of the other methods. While PGSbiPx was faster than GMbiPx, its median error (1 − AUC) was three times higher. We also note that the TSS5 algorithm was both slower and, on average, less accurate than GM5. The primary cause of this slowdown is that the TSS algorithm runs label propagation more than 50 times as often as GeneMANIA in order to optimize the network weights.
The effect of redundancy and random networks on equal weighting
To ensure that the GeneMANIA network integration scheme is robust to irrelevant and redundant networks, we evaluated GeneMANIA in their presence. To do so, we constructed 20 redundant yeast networks by adding a slight amount of noise to the PfamA network (Gaussian noise with mean zero and standard deviation 0.025). We also constructed two irrelevant networks by assigning association weights between 0 and 1, drawn from a uniform distribution, to a random 0.01% of the entries and setting the rest of the associations to zero. We then performed function prediction with the GeneMANIA network integration scheme and with a weighting scheme in which all networks are assigned equal weight. Figure 8 shows the results of function prediction with these two strategies on 300 yeast GO categories with 11 to 300 annotations. In the presence of irrelevant and redundant networks, the performance of the equal-weight scheme is drastically degraded, whereas the network integration scheme of GeneMANIA detects these redundant and irrelevant networks and, as a result, its performance is not affected.
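The construction of the redundant and irrelevant test networks can be sketched as follows. This is a toy-scale illustration rather than the code used in the study; the matrix size, the clipping of negative noisy weights, and the symmetrization step are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # toy number of genes (the paper uses the full yeast genome)

# A stand-in for the PfamA association network: symmetric, non-negative, zero diagonal.
upper = np.triu(rng.random((n, n)), 1)
pfam = upper + upper.T

def redundant_copy(w, sd=0.025):
    """A near-duplicate network: the input plus slight Gaussian noise
    (mean zero, standard deviation 0.025), re-symmetrized and clipped
    so association weights stay non-negative (clipping is our assumption)."""
    noisy = w + rng.normal(0.0, sd, w.shape)
    noisy = np.clip((noisy + noisy.T) / 2, 0.0, None)
    np.fill_diagonal(noisy, 0.0)
    return noisy

redundant = [redundant_copy(pfam) for _ in range(20)]

def random_network(n, density=1e-4):
    """An irrelevant network: uniform weights on a random 0.01% of the
    entries, zero elsewhere, made symmetric."""
    mask = rng.random((n, n)) < density
    w = np.triu(np.where(mask, rng.random((n, n)), 0.0), 1)
    return w + w.T

noise_net = random_network(n)
```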
Discussion and conclusion
We have shown that GeneMANIA is as accurate as, or more accurate than, leading gene function prediction algorithms on yeast and mouse, despite requiring, in some cases, orders of magnitude less computation time. We achieve the highest accuracy using a version of our algorithm that requires between 10 and 15 seconds of computation time on a modern desktop computer with slightly optimized MATLAB code; with more careful optimization, we expect that GeneMANIA will be even faster. We have thus demonstrated that it is possible to design a gene function prediction algorithm that performs on-demand function prediction with the most up-to-date annotation lists and data sources while achieving the same or better accuracy than the leading algorithms.
Our algorithm automatically combines multiple input data sources fast enough to be employed on a web server. With the availability of such an algorithm, it is no longer necessary to maintain static databases of gene function predictions that can quickly become out of date and are restricted to predicting function in predefined categories. As proof of this principle, we have generated a website [34] where users can define their own function categories by providing a gene list, and receive predictions in seconds.
Additionally, we have demonstrated that sparsifying data sources can increase prediction accuracy as well as decrease computation time. This suggests that, in the context of gene function prediction, many of the derived associations are uninformative and that a small selection of a gene's top neighbors is sufficient for accurate function determination.
Here we have not investigated the possibility of using a gene's prior annotations to predict new ones; this type of information can be useful [32], and it can easily be incorporated by using it to generate a functional association network, as we did in our MouseFunc I entry. We have also noticed that, in some cases, a network representation is not the most efficient encoding of the input data; for example, our yeast gene localization dataset has a sparse feature-based representation but a dense network-based representation. This can be addressed using techniques proposed in [35], which allow the propagation of label information on networks in which each node is also associated with a feature vector.
Materials and methods
Software
The GeneMANIA prototype webserver is available at [34].
Yeast benchmark datasets
To perform function prediction in yeast with the GeneMANIA algorithm, we used five yeast functional association networks (used in [9] and [17] and available from [33]), constructed an extended yeast benchmark consisting of 15 association networks and 400 GO functional classes, and used the bioPIXIE network (obtained from [36]).
We constructed 15 yeast association networks from various genomics data sources (supplementary Table 1 in Additional data file 2). In addition, we downloaded the GO association file from the Saccharomyces Genome Database on 1 June 2006 and obtained a set of 400 functional classes, which we grouped according to their specificity level and GO branch. In particular, we randomly selected 100 functional classes from each specificity level (3 to 10, 11 to 30, 31 to 100, and 101 to 300 annotations) and organized them by GO branch. In addition, the five yeast network benchmark consists of 13 Munich Information Center for Protein Sequences (MIPS) functional classes; we include the prediction performance on these functional classes in supplementary Figures 3 and 5 in Additional data file 1.
In addition to the above, we investigated the prediction performance of GeneMANIA using the bioPIXIE [14] network (we refer to this method as GeneMANIA^{bioPIXIE}). bioPIXIE is a composite association network constructed from 925 referenced data sources; it contains 15,551,081 functional associations between 7,034 yeast genes.
Zhang and coworkers' mouse benchmark data
We obtained the mouse tissue expression benchmark data from [37]. For comparison of GeneMANIA with the benchmark data from [12], following their experimental framework, we considered 9,499 genes and 992 GO categories.
MouseFunc I benchmark data
We constructed ten association networks from the MouseFunc I benchmark data (as detailed below). We used minor variations of GeneMANIA for our evaluations: GeneMANIA^{Entry1}, GeneMANIA^{Entry2}, and GeneMANIA^{WS}.
In GeneMANIA^{Entry1} we used unregularized linear regression to combine association networks, and we used all genes with any GO annotations that are not annotated to the functional class of interest as negative examples.
In GeneMANIA^{Entry2} we used ridge regression to combine association networks, and we set the prior weight for network h to the average weight that network h obtained when using ordinary linear regression in all predictions from the same GO branch. We defined negative examples for each class of interest g as those genes annotated to any class that shares a parent with g. In addition, we included three GO networks that were constructed from the three GO ontologies.
GeneMANIA^{WS} is the same as GeneMANIA^{Entry1} except that we used ridge regression, setting all of the elements of the prior mean weight vector to 1/d, where d is the number of interaction networks.
Building functional association networks
When constructing association networks, we distinguish two types of data: binary and continuous valued. In binary datasets, for a given feature, all zeros were replaced with −log(1 − β) and all ones were replaced with −log(β), where β is the proportion of examples for which the given feature has a value of 1. This emphasizes similarities between genes that share 'uncommon' features. Subsequently, for both types of data, we constructed similarity matrices using the Pearson correlation coefficient to measure pairwise similarities. To keep our association networks sparse, for each gene we set all of its links (association weights) to zero, except for its top 50 links.
Mouse GO networks
For predicting gene function on the MouseFunc I benchmark, in addition to the ten association networks that we constructed from the genomics and proteomics data, we constructed three GO networks. We represented a gene's annotations to a given branch of GO as a binary vector whose g^{th} element is set to '1' if the gene is annotated with GO category g, or any of the descendants of g in the GO hierarchy, and is set to '0' otherwise. We constructed one GO network for each branch of the hierarchy. However, when predicting GO category g in ontology BP (CC, or MF), we construct the corresponding GO network without including any annotations in g or its descendants.
Network normalization
We normalize all of our functional association networks, and the composite process-specific network, by dividing each entry W_{ij}, the association between node i and node j, by the square root of the product of the sum of the elements of row i and the sum of the elements of column j of the unnormalized network.
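A minimal sketch of this normalization, assuming a symmetric, non-negative input matrix (so row and column sums coincide):

```python
import numpy as np

def normalize_network(W):
    """Symmetric degree normalization: W_ij / sqrt(rowsum_i * colsum_j)."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)  # for a symmetric W, row and column sums are equal
    inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)  # isolated nodes stay zero
    return W * np.outer(inv_sqrt, inv_sqrt)
```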
Figure of merit for accuracy
We evaluated the performance of GeneMANIA using the AUC.
The ROC curve is a plot of the true positive rate (sensitivity) as a function of the false positive rate (1 − specificity) for a binary classifier as its discrimination threshold is varied; a perfect classifier yields an AUC of 1, and a random classifier yields an AUC of 0.5 [38]. We report 1 − AUC, which is a more appropriate measure because it corresponds to the expected proportion of negatives that score higher than a randomly chosen positive: while an improvement from 0.90 to 0.95 AUC may seem negligible, it corresponds to a twofold reduction in 1 − AUC.
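The pairwise interpretation of 1 − AUC can be checked directly with a small helper (illustrative only; the half-weighting of tied scores is a standard convention, not something stated in the paper):

```python
import numpy as np

def one_minus_auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the negative scores
    at least as high as the positive (ties count half) -- equivalently,
    1 minus the area under the ROC curve."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float((neg > pos).mean() + 0.5 * (neg == pos).mean())
```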
GeneMANIA label propagation algorithm
Here we denote the vector of discriminant values by f, the bias vector by y, and the matrix derived from the association network by W. We can represent an association network over n genes by a symmetric matrix W whose nonzero entries indicate the associations in the network. In particular, the (i, j)^{th} element of W, W_{ij}, is the association between genes i and j, with W_{ij} = 0 indicating no edge between genes i and j. To ensure that all associations are nonnegative, we set any negative associations to zero. For each binary classification task, we have l labeled genes and u unlabeled genes (n = l + u). We use these labels to specify a bias vector y, where y_{i} ∈ {+1, k, −1}, indicating that gene i is positive, unlabeled, or negative, respectively. In the GeneMANIA label propagation algorithm, the bias k assigned to unlabeled genes is the mean of the labels:

k = (n^{+} − n^{−})/(n^{+} + n^{−})

where n^{+} and n^{−} are the numbers of positive and negative genes, respectively. The discriminant values are computed by minimizing the following objective function:

f* = argmin_{f} Σ_{i}(f_{i} − y_{i})^{2} + Σ_{i<j}w_{ij}(f_{i} − f_{j})^{2}    (1)
which ensures that the discriminant values of positive and negative genes remain close to their label bias (the first term in the summation) and that the discriminant values of associated genes (genes with positive w_{ij}) are not too different from each other (the second term in the summation). Equation 1 can be written in matrix notation as:
f* = argmin_{f}(f − y)^{t}(f − y) + f^{t}Lf    (2)
where L = D − W is called the graph Laplacian matrix, D = diag(d_{i}) (that is, D is a diagonal matrix with D_{ii} = d_{i} and D_{ij} = 0 if i ≠ j), and d_{i} = Σ_{j}w_{ij}. Since the association matrix is symmetric, L is symmetric and positive semidefinite, and equation 2 is a quadratic optimization problem with a global minimum. In fact, the solution to equation 2 can be obtained by solving the sparse linear system y = (I + L)f.
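The label propagation step above can be sketched as follows. This is an illustrative reimplementation, not the authors' MATLAB code; we solve the sparse system with SciPy's conjugate gradient routine (consistent with the CG abbreviation defined in this article), and the unlabeled-gene bias is set to the mean label, which is our reading of the bias definition above.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, identity
from scipy.sparse.linalg import cg

def label_propagation(W, labels):
    """Solve (I + L) f = y for the discriminant values f, where L = D - W is
    the graph Laplacian of the (normalized) association network W and y is the
    label-bias vector. labels: +1 positive, -1 negative, 0 unlabeled."""
    labels = np.asarray(labels, dtype=float)
    n_pos = float(np.sum(labels > 0))
    n_neg = float(np.sum(labels < 0))
    k = (n_pos - n_neg) / (n_pos + n_neg)  # bias assigned to unlabeled genes
    y = np.where(labels == 0, k, labels)

    W = csr_matrix(W)
    L = diags(np.asarray(W.sum(axis=1)).ravel()) - W
    A = identity(W.shape[0], format="csr") + L
    f, info = cg(A, y)  # conjugate gradient: A is symmetric positive definite
    assert info == 0
    return f
```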
GeneMANIA network combination
Given d association networks encoded as matrices W_{1},..., W_{d}, we form a composite network as the weighted sum:

W* = Σ_{h} α_{h}W_{h}

where the vector of network weights, α = [α_{1}, α_{2},..., α_{d}], is customized to a particular prediction task by choosing it to maximize a form of kernel-target alignment [39] between the composite network and a 'target' network constructed from the class label vector y.
Specifically, to obtain the weight vector α, we solve the ridge regression problem:

α = argmin_{α'}(Ωα' − t)^{t}(Ωα' − t) + (α' − α̃)^{t}S(α' − α̃)    (3)

in which α̃ is a prior mean weight vector and S is a precision matrix, both specified below.
Each column Ω_{h} of Ω is constructed from the matrix W_{h} by collecting its elements W_{ij} for which either one of the genes i or j is labeled positive and the other negative, or both are labeled positive. Each element of the vector t also corresponds to either a pair of oppositely labeled genes or two positively labeled genes: elements of t corresponding to a positive and a negative are set to −n^{+} × n^{−}, and those corresponding to two positives are set to n^{−} × n^{−}, where n^{+} and n^{−} are the numbers of positively and negatively labeled genes, respectively. The same row ordering is used for all Ω_{h} and t, so that each row corresponds to a single (i, j) gene pair. When solving equation 3, we also include a bias term, α_{0}, by adding a column of '1's, Ω_{0}, to Ω; we discard this bias term when constructing the composite association network. Intuitively, we are attempting to choose network weights such that pairs of positively labeled genes have high similarity, pairs containing a positive and a negative gene have low similarity, and pairs of negatively labeled genes have no influence in determining the weights.
Given the prior mean vector α̃ and precision matrix S, the solution to equation 3 is:

α = (Ω^{t}Ω + S)^{−1}(Ω^{t}t + Sα̃)

where Ω^{t} denotes the transpose of Ω. In the text, we describe two different methods of setting α̃, and we specify S by setting all of its elements to zero except for its diagonal elements, which we set to S_{hh} = trace(W_{h}^{t}W_{h}). When solving the unregularized version of this problem, we simply set all of the elements of S to zero.
To avoid negative network weights, if after solving for α we find α_{h} < 0 for any h > 0, we set α_{h} = 0, remove Ω_{h} from Ω, and redo the ridge regression. If, through this procedure, α_{h} = 0 for all h > 0, we set α_{h} = 1/d for all h > 0.
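The weight-fitting procedure, including the pruning of negative weights and the equal-weight fallback, can be sketched as follows. This is a simplified version that omits the bias column Ω_{0}; the function and parameter names are our own.

```python
import numpy as np

def combine_networks(columns, target, prior_mean, prior_precision):
    """Ridge-regression network weights with non-negativity by pruning.

    columns: list of d vectors Omega_h (one entry per labeled gene pair);
    target: the vector t; prior_mean: the d prior weights (alpha-tilde);
    prior_precision: the d diagonal entries of S.
    Solve alpha = (O^t O + S)^-1 (O^t t + S a~); if any weight is negative,
    drop that network and re-solve; if all networks are dropped, fall back
    to equal weights 1/d."""
    d = len(columns)
    a_tilde = np.asarray(prior_mean, dtype=float)
    s = np.asarray(prior_precision, dtype=float)
    target = np.asarray(target, dtype=float)
    active = list(range(d))
    alpha = np.zeros(d)
    while active:
        O = np.column_stack([columns[h] for h in active])
        S = np.diag(s[active])
        a = np.linalg.solve(O.T @ O + S, O.T @ target + S @ a_tilde[active])
        if np.all(a >= 0):
            alpha[:] = 0.0
            alpha[active] = a
            return alpha
        active = [h for h, ah in zip(active, a) if ah >= 0]  # prune negatives
    return np.full(d, 1.0 / d)  # every weight was pruned: equal weighting
```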
The time complexity of this algorithm scales, at worst, quadratically in n: computing equation 3 requires O(n × n^{+} × d) time to calculate Ω^{t}Ω, and the matrix inversion can be done in O(d^{3}) time.
The TSS algorithm
We downloaded the TSS algorithm from [33]. There are three parameters to be set: C enforces an upper bound on the sum of the network weights (that is, Σ_{k}α_{k} ≤ C), C_{0} enforces an upper bound on the individual network weights (that is, 0 ≤ α_{k} ≤ C_{0}), and const is a parameter that adjusts for class imbalance. In the code provided by [33], the parameters C and C_{0} had previously been optimized for function prediction on the 13 MIPS categories with the five yeast network input. For the 400 GO category yeast benchmark, we set these parameters to their default values (that is, C_{0} = 0.4).
bioPIXIE probabilistic graph search algorithm
Given a seed list of positive genes, the probabilistic graph search algorithm adds two sets of genes to it. The first set contains the n_{1} genes with the highest total association weight to the seed list, and the second set contains the n_{2} genes with the highest total association weight to the first set. In [14], the authors set n_{1} to be between 10 and 20 and n_{2} to be 40 − n_{1}. In our implementation of [14], we set both n_{1} and n_{2} to 20. We chose these two parameters empirically, by comparing several settings in the suggested range; our reported results are robust to slight variations of these two parameters.
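The two-stage expansion can be sketched as follows. This is an illustrative reimplementation of the search described above, not the bioPIXIE code; the tie-breaking and exclusion of already-selected genes are our own assumptions.

```python
import numpy as np

def graph_search(W, seed, n1=20, n2=20):
    """Two-stage neighbor expansion: first add the n1 genes with the highest
    total association weight to the seed list, then add the n2 genes with the
    highest total association weight to that first set."""
    W = np.asarray(W, dtype=float)
    seed = set(seed)

    def top_neighbors(sources, exclude, k):
        total = W[list(sources)].sum(axis=0)  # total weight to the source set
        ranked = [int(g) for g in np.argsort(-total) if g not in exclude]
        return ranked[:k]

    first = top_neighbors(seed, seed, n1)
    second = top_neighbors(first, seed | set(first), n2)
    return first, second
```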
Abbreviations
AUC: area under the ROC curve
BP: biological process
CC: cellular component
CG: conjugate gradient
GO: Gene Ontology
MF: molecular function
MIPS: Munich Information Center for Protein Sequences
PGS: probabilistic graph search
ROC: receiver operating characteristic
SVM: support vector machine
TSS: Tsuda and coworkers [17] algorithm
ZBLWS: Zhou and coworkers [22] algorithm.
References
1. Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003, 302: 249-255.
2. Zhang LV, King OD, Wong SL, Goldberg DS, Tong AH, Lesage G, Andrews B, Bussey H, Boone C, Roth FP: Motifs, themes and thematic maps of an integrated Saccharomyces cerevisiae interaction network. J Biol. 2005, 4: 6.
3. Giaever G, Shoemaker DD, Jones TW, Liang H, Winzeler EA, Astromoff A, Davis RW: Genomic profiling of drug sensitivities via induced haploinsufficiency. Nat Genet. 1999, 21: 278-283.
4. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000, 403: 623-627.
5. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002, 417: 399-403.
6. Polacco BJ, Babbitt PC: Automated discovery of three-dimensional motifs for protein function annotation. Bioinformatics. 2006, 22: 723-730.
7. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA. 1999, 96: 4285-4288.
8. Hegyi H, Gerstein M: The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J Mol Biol. 1999, 288: 147-164.
9. Lanckriet GR, Deng M, Cristianini N, Jordan MI, Noble WS: Kernel-based data fusion and its application to protein function prediction in yeast. Pac Symp Biocomput. 2004, 300-311.
10. Walker MG, Volkmuth W, Sprinzak E, Hodgson D, Klingler T: Prediction of gene function by genome-scale expression analysis: prostate cancer-associated genes. Genome Res. 1999, 9: 1198-1203.
11. Wu LF, Hughes TR, Davierwala AP, Robinson MD, Stoughton R, Altschuler SJ: Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat Genet. 2002, 31: 255-265.
12. Zhang W, Morris QD, Chang R, Shai O, Bakowski MA, Mitsakakis N, Mohammad N, Robinson MD, Zirngibl R, Somogyi E, Laurin N, Eftekharpour E, Sat E, Grigull J, Pan Q, Peng WT, Krogan N, Greenblatt J, Fehlings M, van der Kooy D, Aubin J, Bruneau BG, Rossant J, Blencowe BJ, Frey BJ, Hughes TR: The functional landscape of mouse gene expression. J Biol. 2004, 3: 21.
13. Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D: A combined algorithm for genome-wide prediction of protein function. Nature. 1999, 402: 83-86.
14. Myers CL, Robson D, Wible A, Hibbs MA, Chiriac C, Theesfeld CL, Dolinski K, Troyanskaya OG: Discovery of biological networks from diverse functional genomic data. Genome Biol. 2005, 6: R114.
15. Peña-Castillo L, Tasan M, Myers CL, Lee H, Joshi T, Zhang C, Guan Y, Leone M, Pagnani A, Kim WK, Krumpelman C, Tian W, Obozinski G, Qi Y, Mostafavi S, Lin GN, Berriz GF, Gibbons FD, Lanckriet G, Qiu J, Grant C, Barutcuoglu Z, Hill DP, Warde-Farley D, Grouios C, Ray D, Blake JA, Deng M, Jordan MI, Noble WS: A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biol. 2008, 9 (Suppl 1): S2.
16. Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Mol Syst Biol. 2007, 3: 88.
17. Tsuda K, Shin H, Scholkopf B: Fast protein classification with multiple networks. Bioinformatics. 2005, 21 (Suppl 2): ii59-ii65.
18. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29.
19. von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, Kruger B, Snel B, Bork P: STRING 7 - recent developments in the integration and prediction of protein interactions. Nucleic Acids Res. 2007, 35 (Database issue): D358-D362.
20. Yona G, Dirks W, Rahman S, Lin DM: Effective similarity measures for expression profiles. Bioinformatics. 2006, 22: 1616-1622.
21. Wong SL, Zhang LV, Roth FP: Discovering functional relationships: biochemistry versus genetics. Trends Genet. 2005, 21: 424-427.
22. Zhou D, Bousquet O, Lal T, Weston J, Schoelkopf B: Learning with local and global consistency. Neural Information Processing Systems. 2003, [http://books.nips.cc/papers/files/nips16/NIPS2003_AA41.pdf]
23. Zhu X, Ghahramani Z, Lafferty J: Semi-supervised learning using Gaussian fields and harmonic functions. Proceedings of the Twentieth International Conference on Machine Learning: August 21-24, 2003; Washington, DC, USA. Edited by: Fawcett T, Mishra N. 2003, Menlo Park, CA: AAAI Press, 912-919.
24. Vazquez A, Flammini A, Maritan A, Vespignani A: Global protein function prediction from protein-protein interaction networks. Nat Biotechnol. 2003, 21: 697-700.
25. Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics. 2005, 21 (Suppl 1): i302-i310.
26. Deng M, Chen T, Sun F: An integrated probabilistic model for functional prediction of proteins. J Comput Biol. 2004, 11: 463-475.
27. Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA. 2000, 97: 262-267.
28. Pavlidis P, Weston J, Cai J, Noble WS: Learning gene functional classifications from multiple data types. J Comput Biol. 2002, 9: 401-411.
29. Golub GH, Van Loan CF: Matrix Computations. 1996, Baltimore, MD: Johns Hopkins University Press, 3rd edition.
30. Lewis DP, Jebara T, Noble WS: Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure. Bioinformatics. 2006, 22: 2753-2760.
31. Hastie T, Tibshirani R, Friedman JH: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2001, Berlin/Heidelberg: Springer-Verlag.
32. King OD, Foulger RE, Dwight SS, White JV, Roth FP: Predicting gene function from patterns of annotation. Genome Res. 2003, 13: 896-904.
33. ECCB'05 paper supplement. [http://www.kyb.tuebingen.mpg.de/bs/people/tsuda/eccb05.html]
34. GeneMANIA prediction server prototype. [http://morrislab.med.utoronto.ca/prototype]
35. Zhang T, Popescul A, Dom B: Linear prediction models with graph regularization for web-page categorization. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: August 20-23, 2006; Philadelphia, PA, USA. New York, NY: ACM, 821-826.
36. bioPIXIE. [http://pixie.princeton.edu/pixie/]
37. The functional landscape of mouse gene expression. [http://hugheslab.ccbr.utoronto.ca/supplementarydata/Zhang/]
38. Gribskov M, Robinson NL: Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput Chem. 1996, 20: 25-33.
39. Cristianini N, Elisseeff A, Shawe-Taylor J, Kandola J: On kernel-target alignment. Proceedings of Neural Information Processing Systems. 2001, [http://books.nips.cc/papers/files/nips14/LT17.pdf]
Acknowledgements
SM, DWF, and DR gratefully acknowledge the support from NSERC USRAs and an operating grant assigned to QM. CG and SM were partially supported by a Genome Canada grant administered by the Ontario Genomics Institute.
This article has been published as part of Genome Biology Volume 9 Supplement 1, 2008: Quantitative inference of gene function from diverse largescale datasets. The full contents of the supplement are available online at http://genomebiology.com/supplements/9/S1
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
QM conceived of the project. QM, DR, and SM designed the GeneMANIA algorithm. SM and DWF implemented the GeneMANIA algorithm. DWF implemented the GeneMANIA webserver. SM collected the data, designed the association networks, and performed the experiments. CG contributed code to create association networks. SM and QM wrote the manuscript. QM supervised the project and provided feedback.
Electronic supplementary material
Rights and permissions
About this article
Published
DOI
Keywords
 Gene Ontology
 Additional Data File
 Conjugate Gradient Method
 Association Network
 Label Propagation