Technical supplement to \Consistent probabilistic outputs for protein function prediction"

Protein function prediction, in the context of the Gene Ontology, is a task that consists of answering, for a (cid:12)xed protein X , a large number of binary questions of the form: \Does protein X belong to GO term Y ?" Those binary classi(cid:12)cation problems are strongly related because the ontology consists of nested classes. Two natural requirements for this prediction problem are

• that high-confidence predictions can be produced with a quantified confidence level.
Methods of structured classification proposed in machine learning Taskar et al. [2003] could in theory be used to tackle this problem.However, two practical difficulties that need to be surmounted are the large amount of missing data and the large scale of the problem.To address both issues, our approach consists of first making partial predictions for each term given each data type and then integrating all partial predictions to form a set of consistent predictions whose confidence level can be estimated.We use calibration methods to assign a confidence level to partial predictions and subsequently consider several reconciliation methods to produce, for the whole ontology, a set of confidence levels that are consistent in the sense that they naturally yield consistent predictions.
The presentation of the methods is organized as follows: in Sec. 1 the notions of calibration as it applies to our problem, the algorithm we use to obtain calibrated partial predictions and generally relevant notations are introduced.In Sec. 2 the Bayesian formulations as well as an efficient variational formulation to perform inference are presented.In Sec. 4, we present algorithms based on projections of probability distributions.

Calibration
In statistics, the calibration of a prediction for the binary variable Y is a procedure which, given some evidence X, returns a probability value in [0, 1] that reflects the confidence that the predicted value is Y =1.From a Bayesian point of view a natural candidate is P (Y |X), as obtained by Bayes' rule, where X is some input variable that the prediction is based on.Frequentists, on the other hand, define well-calibrated predictions as those that are good estimates of their own success probability.Formally, S is a well-calibrated prediction if P (Y = 1|S = p) = p.We refer the reader to the relevant literature Cohen and Goldszmidt [2004].The simplest frequentist method for calibration is the logistic regression, which as suggested by Platt [1999] , can be used in combination with support vector machines (SVMs) [Boser et al., 1992].
In the present work we are interested in calibrating jointly several related GO term predictions.Extending the Bayesian approach to this structured problem is fairly natural and has been explored for function prediction by Barutcuoglu et al. [2006].Similarly, logistic regression can be extended to the structured case by conditional random fields (CRFs).We pursue yet another direction, in which the calibrated values obtained from individual logistic regressions are combined to yield a set of consistent calibrated values.This type of problem has been studied in the multiclass classification setting, where several class specific binary classifiers are combined to predict a set of mutually exclusive classes Wu et al. [2003].An advantage of this approach is that the missing data are dealt with on a per term basis, which is easier.

Constrained calibration and notations
Whereas the structure that is generally exploited in the multiclass setting is that classes are mutually exclusive, here we wish to exploit the inclusion relations between GO terms.The consequence of such structure on calibrated values is that confidence should decrease along any lineage of the ontology.More formally, if in the ontology G, there is an edge i → j corresponding to GO term i being a parent of GO term j, by which we mean that GO term j is included in GO term i, and if (p i ) i∈I are confidence values, then we should have p i ≥ p j .Notice that the ontology graph then defines a partial order on (p i ) i∈I .This can be interpreted simply as the fact that confidence should decrease as one makes more precise predictions; but because ideal calibrated values p i can be interpreted as probabilities, one can further consider them as originating from a joint probability distribution P on a set of binary random variables Y = (Y i ) i∈I that are indicators for each term, such that p i = P (Y i = 1) and, finally, such that for all edges i → j the implication (Y j = 1) ⇒ (Y i = 1) translates in probabilistic terms as P (Y i = 0, Y j = 1) = 0.The set of distributions satisfying the implication (or inclusion) constraints of the graph is then A subset that is easier to parameterize is the set of distributions in P ⇒ that factorize according to the graph G.To formally define this set, we introduce some notation that we will use throughout this appendix.Denote by π i , c i , A i and D i respectively the set of parents, children, ancesters and descendants of a node i in G and denote by Y π i = j∈π i Y j as well as y π i = j∈π i y j where y j ∈ {0, 1} is the value of a realization of Y j .The distributions that factorize according to G can be defined formally as An element of P G is completely characterized by its set of conditional distributions: ).The marginal probabilities (our calibrated values) can then be computed immediately from the the conditionals: and vice versa: the conditionals q i = p i pπ i are easy to obtain from the marginals.Several of the methods we consider require computing the entropy of the distribution.For distributions in P G , the entropy has a simple analytical expression: and similarly for y c i , then that set can be written as We omit further details on P G −1 here.
1 Although both sets P ⇒ and P ≤ have the same marginals.

The Bayesian approach
Binary label SVM output The principle of the Bayesian approach is to turn the decision problem "Does protein j have function i?" into an inference problem.The answer to the question is encoded as a binary number which is treated as a random variable.Initially, a prior joint distribution for the binary Y i variables is chosen, and one assumes that given Y i = y i some evidence x i is observed independently for each GO term according to p(x i |Y i = y i ) = L i (y i ), y i ∈ {0, 1}.Subsequently, using Bayes' rule and appropriate computational methods the quantity P (Y|X = x) is computed and used as the calibrated value.In the approach of Barutcuoglu et al. [2006], the evidence is the output of an SVM classifier.It is natural to choose the prior (1 − q i0 ) yπ i −y i 1 {y i ≤yπ i } , and the posterior distribution is Unless G is a tree, the posterior distribution is not in P G (or respectively in P G −1 if the prior was a tree).The Bayesian formulation takes us naturally outside of the class of models considered.A computational consequence is that, except in the tree case, calculating the marginal probabilities, i.e., performing exact inference, becomes expensive, and approximate inference is necessary.In Barutcuoglu et al. [2006], the authors use exact inference because they limit their analysis to small graphs (personal communication).Approximate inference can typically be performed using loopy belief propagation or some other variational method [Wainwright and Jordan, 2003].We present next an efficient variational inference algorithm that fits well within our framework.

Variational inference in P G
The principle of variational inference is to write an optimization problem whose minimizer is the set of marginal probabilities of the distribution, when the unnormalized exponential form of that distribution is available, and use optimization algorithms to solve for the minimum [Wainwright and Jordan, 2003].Typically, if the inference involves finding the marginals, say m, of a distribution Q ∈ P, where the latter is some exponential family, then the unnormalized log-likelihood is a dot product θ, φ(y) between the vector of parameters θ and some vector of sufficient statistics φ(y) such that m = E φ(Y), and the entropy of the distribution can be written as a function of m: H(m).To a given P corresponds a set of possible marginals M. The variational inference problem can then be formulated as the optimization problem: max m∈M H(m) + m, θ(x) In the Bayesian inference situation presented above, taking the log of (1) we get the log of the unnormalized log-likelihood and identify φ(y), θ with, to match (1), η i1 = log q i0 , η i0 = log(1−q i0 ) and i (k) = log L i (k).
Taking expectations of (2) with respect to any distribution in P ⇒ we get At this point, we still have to define the set of possible marginals.One of the difficulties here is that the posterior distribution whose marginal we are after is in P ⇒ , but it is in general not in P G .If we denote by M ⇒ (resp.M G ) the set of marginals obtainable from joint distributions in P ⇒ (resp.P G ), then we can write our optimization problem as max m∈M ⇒ H(m) + m, θ One typically appeals to variational inference in cases where the optimization problem is intractable.An approximate variational inference method, instead of finding the exact set of marginals, finds the closest set of marginals in a simpler distribution class.It turns out that M ⇒ is not easy to deal with: in particular the expression of H(m) for m ∈ M ⇒ is in general intractable.By contrast, P G is an easier distribution class for which entropy is computed easily as we argued in Sec.1.1.Therefore, we consider the approximate inference problem max Parametrizing this optimization problem with the conditionals q i and setting the gradient to 0 we get the following fixed point equations: . Note that, with the notations of (1), if we define qi through a simple Bayes rule, we obtain qi 1−q i = q i0 L i (1) (1−q i0 )L i (0) then f i = log qi 1−q i .This suggest a modification of the BPAL algorithm where this log-odds ratio is set with the output of a logistic regression using qi = pi ; we call that algorithm BPLR.Notice also that the equation for q i given the other variables (4) is in closed form because q i cancels in the numerator and denominator of pπ k p i = j∈A k \(A i ∪{i}) q j .Because the function considered is strictly concave with respect to each q i , enforcing the fixed point equation iteratively on the coordinates performs coordinate ascent on the function and therefore converges to a local maximum.

Cascaded logistic regression
A major weakness of the Bayesian approach is that it requires the models for conditional densities p(x i |Y i = y i ) to be nearly correct, which is in general unlikely to be the case.In other words, the method is not robust to model specification.The advantage of logistic regression is precisely that it models P (Y i = y i |X i = x i ) directly.A way to construct a model that approximates P (Y = y|X = x) in the case where the dependencies between terms are taken into account is as follows.Assume that P (Y = y|X = x) is in P G so that it factorizes according to the graph.This implies that ) is then a logistic regression.Fitting this model is very similar to fitting independent logistic regressions except that only examples of proteins having all parents GO terms are used to fit the model.That is also a weakness of that model: some training sets have very few negative examples.

Projection methods
The methods presented so far to reconcile the predictions for different GO terms either appeal to a generative model (e.g., the model of Barutcuoglu et al. [2006]) and model the distribution of SVM outputs by some mixture of densities, which is quite far from optimal from a calibration point of view (naive Bayes from the same densities is inferior to logistic regression) or appeal to the more sophisticated machinery of CRFs, which are difficult to implement and require a further step of learning.
In contrast, in this section, we propose methods that make direct use of the calibrated values obtained from the logistic regressions and try to find the closest set of values that are consistent with the ontology.

Isotonic regression
Denote by (p i ) i∈I the set of calibrated values obtained from the logistic regressions.We propose to find a set of marginal probabilities (p i ) i∈I in M ⇒ such that pi and p i are close together.A first formulation could consist in choosing the 2 distance as a measure of closeness.This approach yields the quadratic program (QP) min p i , i∈I i∈I This QP is known in the statistical literature as the isotonic regression problem.When the inequality constraints correspond to a total order, then the problem is simpler, and an efficient algorithm, PAVA [Barlow et al., 1972], is known to solve it.More generally, isotonic regression can be solved using an interior-point solver, provided the number of edges in the graph is not too large.An approximate algorithm with complexity O(n2 ) was recently proposed by Burdakov et al. [2006].We use a simple algorithm based on iterations of PAVA which approximates the solution.Instead of using the Euclidian distance, we consider the Kullback-Leibler divergence D(p i q i ) = p i log p i q i + (1 − p i ) log 1−p i 1−q i , which is more natural for probability distributions.There are two ways to minimize simultaneously all term specific KL-divergences, because the latter is not symmetric.
These two problems turn out to be closely related to the previous one: the solutions to (IR) and (IR.2) are actually the same.The KKT conditions of (LO-IR) show that its solution can be obtained by solving an isotonic regression on the log-odds ratios log pi 1−p i with the same inequalities and then mapping the obtained log-odds ratios back to probabilities.

Projections on P G
Note that in the previous section, even though we minimize a sum of KL-divergences between pairs of marginals, we are actually not minimizing a KL-divergence between joint distributions.Another way of formulating a projection is as follows: define a joint distribution P on the GO terms, where each individual term is independently Bernoulli distributed Ber(p i ) and find a joint distribution P which is close and satisfies the constraints, i.e., P ∈ P ⇒ .For instance, the problem can be stated 2 as min D(P P ) s.t.P ∈ P ⇒ .
However, a generic distribution from P ⇒ is not tractable for the reasons outlined previously.On the other hand, if we consider P G , then the problem becomes tractable.Using a parameterization with conditional distributions q i the problem can be written as: [p i log pi + (1 − p i ) log(1 − pi ) + h(q i )p i ] with p i = j∈A i q j .We differentiate the above expression to find a stationary point: Because ∂p k ∂q i = p k q i = j∈A k \{i} q j does not depend on q i , coordinate descent has the closed form updates log with f i = log pi 1−p i .Notice that the update rules for ( 5) and ( 4) are quite similar.
min P ∈P G D(P P ) = min P ∈P G i∈I E P [X i log pi + (1 − X i ) log(1 − pi )] + H(P ) = min q i ∈[0,1] n i∈I