Design principles of molecular networks revealed by global comparisons and composite motifs

A global comparison of the four basic molecular networks in yeast - regulatory, co-expression, interaction and metabolic - reveals general design principles.


Topological statistics of networks (excerpt from [14])
The topological analysis of the networks provides quantitative insight into their basic organization. Four topological statistics of particular interest in network analysis are: (1) Average degree (K). The degree of a node is the number of links that this node has with other nodes. The average degree of the whole network is the average of the degrees of all its individual nodes.
(2) Clustering coefficient (C). This is defined as the ratio of the number of existing links between a node's neighbors and the maximum possible number of links between them (similar to an odds ratio). The clustering coefficient of a network is the average of all its individual coefficients. This statistic can be used to determine the completeness of the network.
(3) Characteristic path length (L). The graph theoretical distance between two nodes is the minimum number of edges that is necessary to traverse from one node to the other. The characteristic path length of a network is the average of these minimum distances. It gives a measure of how close nodes are connected within the network.
(4) Diameter (D). The diameter of a network is the longest graph theoretical distance between any two nodes in the graph. Table 1c explains, in detail, the formulas that are used to calculate these statistics.
Until recently, classical random network theory was used to model complex networks. This was introduced by Erdös and Rényi. It assumes that any two nodes in the network are connected with random probability p and the degrees of the nodes follow a Poisson distribution, which has a strong peak at the average degree, K. Most random networks are highly homogenous, in that most nodes have the same number of links (degree), k i ≈ K, where k i is the degree of the ith node. The chance of having nodes with k links falls off exponentially for large k (i.e. P(k) ≈ e -k ), meaning that it is very unlikely that there will be any nodes of degree significantly larger than average.
To explain the heterogeneous nature of complex networks, Barabási and colleagues recently proposed a "scale-free" model in which the degree distribution in many large networks follows a power-law (P(k) ≈ k -r ). A remarkable point about this distribution is that most of the nodes within these networks have very few links, with only a few of them (hubs) being highly connected. Many aspects of genomic biology have such a scalefree structure. Concurrently, Watts & Strogatz found that many networks can also be described as having a "small-world" property, i.e. they are defined as being both highly clustered and containing small characteristic path lengths (large C and small L).

Calculation of likelihood ratios
Jansen et al created a set of gold standard positives (truly interacting proteins) and a set of gold standard negatives (truly non-interacting proteins) for the prediction of interactions [43]. The likelihood ratio L for a certain feature f (e.g. two genes being co-regulated) is then defined as ( ) where P represents the total number of gold standard positives. N represents the total number of gold standard negatives. TP is the number of true positives (i.e. protein pairs overlapping with the gold standard positives). FP is the number of false positives (i.e. protein pairs overlapping with the gold standard negatives).
The significance of a certain likelihood ratio value can be estimated using the hypergeometric distribution: Suppose that the total sample space is N. In the first round, a sample of size S 1 is randomly selected without replacement from N. Then, the entire initial sample (S 1 ) is subsequently returned to the sample space. In the second round, another sample of size S 2 is randomly selected without replacement from N. The size of the overlap (denoted as i) between the two samples (S 1 and S 2 ) is a hyper-geometric random variable. The probability of observing an overlap of a given size X or greater is calculated by the formula: To calculate the statistical significance of the overlap between the gold-standard positives and our predictions, we first need to determine the total sample space (N). Because there are about 6000 yeast proteins, the total number of possible interacting pairs is 1.8x10 7 (6000x6000/2 = 1.8x10 7 ). Therefore, the sample space (N) is 1.8x10 7 ; S 1 = 8250; S 2 = 23295; X = 7889. Please note that the same likelihood ratio may correspond to different significance values, because the likelihood ratio only depends on the ratio of TP/FP, but the P-value depends on the actual values of TP and FP.

All networks are significantly related
As discussed in the main text, we first used Pearson correlation coefficient to measure the association between different networks. The results show that most of the networks are, in fact, closely related. However, one problem with correlation coefficient is that it takes into account all pairs of proteins within the whole network. However, since the majority of pairs have a distance larger than the diameter, the coefficient actually measures the correlation between distant pairs. Supplementary Table 1A shows that all networks are significantly related (P values are all smaller than 10 -4 ), although the specific correlations are not very significant.
In total, we used seven different methods to measure the association between networks:

Cramer's V (V).
We divided all pairs of proteins in a network into three bins: 1) connected pairs; 2) close pairs (distance=2); 3) distant pairs (distance≥3). The χ 2 statistics were calculated on the 3x3 contingency table between any two networks.
The formula for Cramer's V is where I and J are the numbers of rows and columns (in our case, I = J = 3), and N is the total number of gene pairs. Cramer's V lies between zero and one inclusive, equals zero when there is no association, and equals one when the association is perfect.

Pearson correlation coefficient (CC)
Given two random variables, X and Y, the Pearson correlation coefficient CC(X; Y) is calculated by the formula: where X records the distances for all protein pairs in one network. Y records the distances for the corresponding pairs in another network. We only include protein pairs that show up in both networks for this calculation.
The Pearson correlation coefficient lies between -1 and 1.

Mutual information (I)
Given two random variables, X and Y, the Mutual Information I(X; Y) between them measures how much information does one variable conveys about the other one. It is defined as the relative entropy (or Kullback-Leibler distance) between the joint distribution and the product distribution of X and Y, that is: where P(x, y) indicates the joint distribution of X and Y and P(x) and P(y) their marginal distributions. It is easy to prove that: where H(X) and H(Y) are the entropies of X and Y, and H(X| Y) and H(Y| X) are the conditional entropies of X given Y and Y given X respectively. This states that the information that Y conveys about X is the reduction in uncertainty about X due to the knowledge of Y (and vice-versa).

Contingency coefficient (C)
The formula for C is It lies between zero and one. But it will never be equal to one.

Association score (A)
Two proteins are defined as "associated" if their distance is smaller than or equal to 2. Given two networks, the fraction of associated pairs in one network that are also associated in the other was calculated. Then, we also calculated the random expectation. The association score is just the ratio of the observed fraction over the random expectation.

Pearson correlation coefficient on binned data (Binned CC)
We calculated Pearson correlation coefficients between networks using the binned data that were produced when we calculated Cramer's V.

Mutual information on binned data (Binned I)
We calculated mutual information between networks using the binned data that were produced when we calculated Cramer's V.
All methods produce similar results as indicated by their correlations with Cramer's V used in the main text (see supplementary Table 1B).

Expectation of the essentiality of composite hubs
We use "h1" and "h2" to denote hubs in two networks and use "e" to denote essential genes. P(e) = 0.2 for there are about 1000 essential genes in yeast genome (~6000 genes). Because we define hubs as the top 20% of the nodes with the highest degrees, P(h1) = P(h2) = 0.2. Since we know that hubs tend to be essential genes, let us suppose P(e|h1) = P(e|h2) = 0.4 (the fraction of essential hubs in the interaction network is about 40%).

Expectation of the overlap of hubs between scale-free networks
Above, we have shown that P(h1&h2) = 0.05, given the fact that hubs in scale-free networks tend to be essential. If we assume that this were not the case, then P'(h1&h2) = P(h1) * P(h2) = 0.2 * 0.2 = 0.04 < P(h1&h2).
Therefore, hubs in scale-free networks should tend to overlap just because of their essentiality.

Defining hubs with different cutoffs does not affect the results
In order to determine whether different definitions of hubs using different cutoffs will affect the final results, we repeated all calculations in Figures 2 and 3 using different cutoffs. As shown in SupFig. 3A, even though there is some fluctuation, the overlaps between the regulatory networks and action networks (particularly interaction and expression networks) are always very similar to random expectation (P values > 0.05), whereas, the overlaps between action networks are always significantly higher than random expectation (P values < 0.001) when the cutoff is smaller than 70%.
SupFig 3B clearly shows that hubs in different network have a strong tendency to be essential (except those in target population). Furthermore, tri-hubs and bi-hubs have even higher tendencies than normal hubs at all cutoffs. The fluctuation at the low cutoff end is due to the limited statistics, i.e., there are only few tri-hubs at those cutoffs.

Regulator bridges generate time-delayed expression relationships
Bridge motif: We have constructed the following approximate chemical kinetic model for the bridge motif (see above) in the transcriptional regulatory network: . At the same time, TF 1 binds to a second DNA motif and activates the transcription of another transcription factor TF 2 . After translation and translocation, the TF 2 protein then binds to a third DNA motif and activates the transcription of the second target 2 mRNA T . We assume that the binding between TF and the corresponding DNA motif is fast and reversible, with an association constant of K A . k 1 is the rate constant for transcription, k 2 is the combined rate constant for translation and translocation. We assume that K A and k 1 are the same for both transcription factors. NTP, PPi, and aa-tRNA represent nucleoside triphosphate, pyrophosphate, and aminoacyl-tRNA, respectively.
The above model corresponds to the following set of equations: ( ) Here we use [A] to denote the concentration of molecular species A. We assume that there is one copy of each DNA motif in the genome, so that the concentration for each DNA motif is the same as the concentration of DNA molecules in the nucleus, c D . We also assume that the concentrations of NTPs are roughly constant, so that the rate of transcription only depends on TF binding to DNA motif. Similarly, we assume that the concentrations of aa-tRNAs are roughly constant, so that the rate of translation only depends on the concentration of mRNA. Furthermore, we initialize the parameters in the following way: These parameters can be understood in the following intuitive way. When t<0, TF 1 is absent in the nucleus. When t≥0, the number of TF 1 molecules in the nucleus rises to a constant of 1000 per DNA molecule. The association constant between TF and the corresponding DNA motif is defined such that there need to be at least 100 free TF molecules per DNA molecule to ensure that the DNA motif is bound to TF half the time. Furthermore, we assume that the combined rate constant of translation and translocation is ten times slower than the rate constant of transcription.
We numerically simulate the time evolution of the concentration of two target mRNAs: 1 mRNA T , and 2 mRNA T ( Supplementary Fig.9). It is clear from the simulation that the two targets have a time-delayed expression relationship.
Our model can be improved by incorporating additional processes such as mRNA and protein degradation, and by considering stochastic gene expression. However, the conclusion that the two targets have a time-delayed expression relationship will still hold.

Disconnected pairs in the metabolic network tend to have expression relationships other than co-expression
We obtained expression profiles of yeast genes through two complete cell cycles [6]. Between the expression profiles of pairs of genes, we used a local clustering method to calculate four types of temporal relationships: correlated (i.e. co-expressed), time-shifted, inverted, and inverted time-shifted [45]. To find these relationships, expression levels must be assessed over a time-course, with many measurements, at small and uniform intervals. Most available datasets do not satisfy these conditions, being only suitable for simple correlation calculations (i.e., co-expression); thus, we can only conduct detailed analysis on the cell-cycle datasets [6,34].
We examined the occurrences of co-expression and shifted motifs using three cell-cycle datasets [34]: (1) Alpha (2)Cdc-15 (3) Elutriation. Supplementary Figure 10 shows that the results in all three datasets are very similar to those in the main text, i.e. disconnected pairs in the metabolic network tend to have expression relationships other than coexpression. However, similar results could not be found in the interaction dataset (see supplementary Figure 11).

Defining expression networks with different cutoffs does not affect the results
In order to determine whether different definitions of expression networks using different cutoffs will affect the final results, we repeated all calculations in Figures 3, 4 and 5 using different cutoffs. Supplementary Figures 12 and 13 clearly shows that all results remain the same.

P values by cumulative binomial distribution
Most P values in our analysis measure whether the difference is significant between the testing and control groups. They are calculated using the cumulative binomial distribution: where N is the total number of possible gene pairs in the data; c o is the number of observed pairs with a specific relationship in the testing group; and p is the probability of finding a gene pair with the same relationship in the control group. In this manner, we are testing whether gene pairs with a specific relationship are over-represented as compared to the control group. If they are under-represented, then P(c < c o ) = 1 -P(c ≥ c o ). Tables   Supplementary Table 1 13E-03 correlation 1.00E+00 9.14E-01 9.92E-01 1.00E+00 9.57E-01 9.91E-01 9.97E-01

Supplementary Table 2. TFs regulating the same genes tend to interact and co-express with each other
A. Likelihood ratios for TFs in the same motif to interact #: P-values measure the significance of the difference between the observed fraction of bridges and the random expectation at different distances. *: The P-value for overall measures the significance of the difference between the observed fraction of bridges and the random expectation for all disconnected enzyme pairs.

Figure captions
Supplementary Figure 1. Essentiality of bi-hubs. We created randomized networks where the tendency for hubs to be essential is conserved, and compared observed essentiality enrichment in bi-hubs with calculations based on the randomized networks. All other calculations are the same as in Figure 2B.

Supplementary Figure 2. Fold enrichments of hub overlaps (O) between two
networks relative to random expectation. We created randomized networks where the tendency for hubs to be essential is conserved, and compared observed essentiality enrichment in composite-hubs with calculations based on the randomized networks. All other calculations are the same as in Figure 3B. It is clear that, as more and more hubs in the metabolic network are removed, the L's of interaction and expression networks go down very quickly (i.e., the networks begin to collapse), as compared to that of the regulatory network, which remains around 6.3. (B) Hubs in the regulatory network are removed. As more and more hubs are removed, the L of the regulatory network goes down very quickly, whereas the L's of all three action networks stay stable. These results confirms our conclusion that separation of hubs between regulatory and action networks provides stability to the cell. Figure 6A. Examples of triangles. In all panels of supplementary ACS2 and ERG10 are two enzymes involved in pyruvate metabolism. ACS2 turns acetyladenylate into acetyl-CoA, which is subsequently turned into acetoacetyl-CoA by ERG10. ACS2 and ERG10 have a co-expression relationship (the local clustering score is 14.6). (C) A shifted motif in Met-Exp. HEM13 and HEM14 are two enzymes involved in porphyrin and chlorophyll metabolism. HEM13 turns coproporphyrinogen III into protoporphyrinogen IX, which is subsequently turned into protoporphyrin IX by ERG10. HEM13 and HEM14 have a time-shifted relationship (the local clustering score is 14.9). | | indicates a time-shift relationship.