Cross-species modules in a multi-layer network
A co-association network is a representation of certain types of genomics data. The data can be rather simple, such as protein binding profiles, in which two genes are connected if their corresponding proteins can physically interact. In many cases, it can be high dimensional, such as genome-wide expression profiles. In this scenario, two genes are linked in a mathematically abstract way if their expression values across a variety of conditions are highly correlated. Despite the origin of the network, from a topological standpoint, a module is an interconnected region of the network where the density of edges is higher than the average density of the whole graph. Constituents of a module are presumably genes working in a coordinated fashion, that is, sharing a common function. We combined the co-association networks from different species to form a network with two types of edges representing two types of functional similarities. Mathematically this structure is a multi-layer network [13]. Genes in a species are connected if they are co-associated, whereas genes from different species are connected if they are orthologs. Figure 1 shows a simple example of such a multi-layer network. We extended the concept of modules used in co-association networks of individual species in a novel cross-species fashion. Here a module may comprise genes from multiple species, characterized by the two types of functional similarity in a cross-species manner. Within a module, from a molecular viewpoint, genes from the same species most likely share the same function as they are co-associated, co-expressed or physically bound together. Orthologs across different species (by definition homologs descended from the same ancestral gene), because of their sequence similarity, might have similar biological function from an evolutionary standpoint. Intuitively, a module should consist of nodes that form clique-like structures within a co-association network, as well as nodes that are linked by orthology relationships between layers of co-association networks. Nevertheless, as illustrated in Figure 1, it is entirely possible that a module in the multi-layer network consists of genes from a single species. In fact, this is the case when a novel function emerges for a particular species and the genes corresponding to the specific function do not have corresponding orthologs.
Overview of OrthoClust
Figure 2 shows the three major steps of OrthoClust: construction of the multi-layer network, defining the cost function of the system and assigning nodes to modules by multiple runs of simulated annealing.
Construction of the multi-layer network
The inputs of OrthoClust are the co-association networks of two or more species, and the orthology relationships between genes of the species of interest. Of course, co-association networks are derived from raw data, and there are various ways to arrive at the networks depending on the specific data and biological purpose. OrthoClust combines individual layers of co-association networks by connecting genes in different species via their orthology relationships. To account for the fact that many orthologous pairs are not one-to-one but many-to-many, orthologous links are weighted such that the weights are normalized by the number of orthologs of each node (Materials and methods).
Defining the cost function in the multi-layer network
OrthoClust defines a cost function in order to detect modules in a multi-layer network. Specifically, every node can take a discrete label σ ranging from 1 to q. Nodes with the same label will be assigned to the same module. q is therefore a parameter chosen to be the maximum number of modules allowed in the system. If the network has M nodes, there will be Mq ways (configurations) to assign nodes to modules. In general, OrthoClust can work for N species. For the case N = 2, each configuration is characterized by a cost function H defined as:
Here, S1 and S2 are the sets of genes for the two species, respectively. Λ
ij
= A
ij
- k
i
k
j
/2 m, with k
i
= ∑
j
A
ij
and m = ∑
i
k
i
/2. As A is a network adjacency matrix, the subtracted term is the expected number of links between nodes i and j in an ensemble of random graphs with the same degree distribution [14, 15]. Its presence in H is to reduce the contribution of links between nodes with higher degree (that is, hubs). The superscripts (1 or 2) correspond to the networks of two species. The value of the Kronecker delta equals one if nodes i and j have the same label and zero otherwise. The first two terms of the cost function H are essentially the modularity functions of two individual networks [16]. In the standard modularity function, a network with high modularity means there is a high number of links between nodes in the same module, and a low number of links between nodes in different modules. The novelty of OrthoClust is the last term regarding the orthologous links between nodes in different layers of the co-association networks. It sums over O(S1, S2), that is, all the orthologous pairs between S1 and S2. As mentioned above, each pair of orthologs is weighted by w
ij
to take into account the many-to-many orthology relationships (Materials and methods). Configurations in which orthologs have the same label will lower the cost function. The relative contribution between co-association edges and orthologous edges is controlled by a coupling constant κ (for determination of the constant, see below). In the language of statistical physics, the entire framework can be interpreted as a spin system called a q-state Potts model [17], which is a generalization of the Ising model. The cost function characterizes the energy of the spin (label) system and the optimal assignment of nodes to different modules is equivalent to the ground state of the Potts model.
Assigning nodes to modules by multiple runs of simulated annealing
To optimize the cost function, OrthoClust employs a standard simulated annealing procedure similar to the one used in [18]. Labels are randomly assigned initially, and updated via a heat bath algorithm. The temperature of the system is gradually lowered until the flipping rate of labels is lower than a certain threshold (Materials and methods). Although the labels have divided nodes of the network into modules, we do not directly use the resultant configuration due to the probabilistic nature of simulated annealing, but perform the annealing process R times. By summarizing the results using a co-appearance matrix (a matrix whose elements (i,j) represents how often the two nodes i and j co-appear in the same module), OrthoClust arrives at a set of modules by thresholding the co-appearance frequency and looking for nodes that co-appear often (Materials and methods). Often the sizes of the modules follow a power law distribution; tiny modules are therefore neglected (Materials and methods). OrthoClust is generally not very sensitive to the value of q. This is because, even though the system starts with many different labels (a high value of q), the large range of states will coalesce into a few modules and only a few labels will remain to cover the appropriate number of modules as the system cools down. In other words, the exact value of q is not very important as long as q is chosen to be large.
Using OrthoClust for integrating expression profiles across species
A particular application for OrthoClust is to cluster expression profiles across species. Since OrthoClust is a network framework, raw expression profiles should be transformed into individual co-expression networks. Many algorithms have been proposed for this purpose based on calculating the N by N Pearson correlation matrix [19–22]. For our application, we found that a rank-based algorithm in which each gene is connected to the top d genes with the highest (absolute) Pearson correlation works best for resolving modules [19] (Materials and methods). It is well known that co-expression networks in many different species are modular, meaning that a subset of genes (a module) have a specific function [5, 23–25]; therefore, it is interesting to explore how these modules emerge in a cross-species fashion. Like various co-association networks constructed by correlating high-dimensional data, a co-expression edge can be assigned to have either a positive (+1) or a negative sign (-1) based on the sign of the correlation coefficient between two genes. Since anti-correlated genes do not work together, it is instructive to separate them into two different modules. This can be achieved by modifying the original cost function to separate the sets of positive and negative links in each species as specified by the superscripts (+ or -), that is:
The minus sign in front of the negative links means the effects of the negative links are opposite to those of the positive links, meaning that, in the favorable configurations, nodes in the same module are likely to be connected by positive links and nodes from different modules tend to be connected by negative links [26].
Simultaneous clustering of expression profiles in worm and fly via OrthoClust
As a demonstration, we applied OrthoClust to the transcriptomes of worm and fly generated by the modENCODE consortium [26]. In this analysis, the initial number of spin states q was chosen to be 250. We summarized the results for R = 32 annealing runs (more details in the section 'Robustness analysis' below) using a M-by-M co-appearance matrix, where M is the size of the multi-layer network (the total number of genes in worm (20,377) plus fly (13,623) in this case). As shown in Figure 3A, there are blocks of worm and fly genes along the diagonal. These blocks consist of genes that co-appear often in various runs of annealing, representing different worm and fly modules. Of particular interests are the blocks of worm and fly genes that co-appear with high frequency in the off-diagonal positions. For instance, as highlighted in Figure 3A, a block of worm genes and a block of fly genes form a conserved module. As expected, they share a significant fraction of Gene Ontology (GO) terms (P = 3.3 × 10-16, hypergeometric test). Figure 3B shows the common GO terms between a set of worm genes and a set of fly genes in the conserved module. Most of the common GO terms refer to fundamental biological functions like RNA processing and cell cycles processes. On the other hand, blocks that do not overlap in the off-diagonal positions correspond to specific worm and fly modules. For instance, GO terms related to chitin (main component of exoskeletons of arthropods) activities were found in certain fly-specific modules. At a global level, we found that the size of modules follows a power-law distribution with an exponent of -1.89 (Figure S1 in Additional file 1). The power-law distribution observed includes certain large modules. Practically, one could implement extra steps to break down these large modules recursively.
Separation of modules in the Gene Ontology space
As OrthoClust divides genes into modules based on how they are separated topologically in the multi-layer network, it is instructive to examine systematically how these modules are separated in functional space as defined by GO terms. To do so, we employed a metric to quantify the semantic similarity between all worm and fly genes (both intra-species and inter-species) based on the overlap of GO terms in a vector space model [28]. As shown in Figure 4, for the 150 modules obtained by clustering all worm and fly genes, the overlap between genes within a module is much higher than the overlap between genes across modules (P = 3 × 10-83, Wilcoxon test). Nevertheless, since two orthologous genes tend to have very similar GO terms, we further investigated whether such a high level of overlap between genes within a module is merely the consequence of orthology. We therefore repeated the analysis by removing all orthologs inside the modules. We found that the overlap between genes within modules is still significantly higher than the overlap across modules (P = 1.5 × 10-45, Wilcoxon test; Figure S2 in Additional file 1). Thus, we conclude that, in terms of GO annotation, OrthoClust has separated genes with different functions into different modules.
Benchmarking modules based on co-regulation patterns
Apart from GO analysis in Figure 4, we further tested whether genes inside a module are indeed more functionally relevant by examining the number of common regulators they possess. We identified the binding targets of a set of worm and fly transcription factors based on ChIP-Seq experiments also generated by the modENCODE consortium [29] (Materials and methods). These ChIP-Seq experiments and the RNA-Seq experiments for expression profiles were performed in the same developmental stages. For all pairwise combinations of modules, we then counted the number of common transcription factors for each pair of genes (Figure S3 in Additional file 1). We found that pairs of genes within a module, on average, have more common transcription factors than pairs of genes in different modules (a 2.6-fold increase in worm and 1.6-fold increase in fly, P < 0.001 under permutation test). This result is consistent with general observations that a transcription factor tends to regulate targets sharing similar biological functions.
Effects and the determination of the coupling constant κ
The cost function of OrthoClust takes into account two types of edges: co-association edges and orthology relationships. The coupling constant κ determines the relative contribution of the co-association links within a species and the orthologous links across species. A low value of κ means networks are likely to be clustered independently whereas a high value of κ means orthology links are more important and the label of a node tends to be aligned with its ortholog rather than its neighbors in the same network. In the clustering of gene expression profiles, we employed two independent methods to quantify the effects of tuning κ and determine its optimal value. First of all, we made use of a set of 1,288 metagenes obtained from [23] as our gold standard. These metagenes were constructed based on orthologs whose expression relationships are conserved across multiple species, including worm, fly and human. A metagene consists of genes from different genomes (worm and fly in this case) that presumably share the same function by considering their expression values across different conditions. Unlike our clustering approach, which is based on the optimization of a global cost function, metagenes were inferred by examining the likelihood that individual co-expression edges are transferred from one species to another in a local manner. This complementarity makes the set of metagenes a good gold standard for validation. Following our clustering framework, the constituents of a metagene should appear in the same module. As shown in Figure 5A, for a low value of κ, clustering was performed independently and it was rare that the worm and fly components of a metagene fall into the same module. Nevertheless, both the worm and fly networks have high modularity, meaning the two networks were independently well separated into modules. On the other hand, for a high value of κ, most of the metagenes satisfied the criterion whereas the resultant modularity of individual networks became low. The value of κ can be tuned so as to balance this tradeoff.
Another method we used to examine the effects of κ is the similarity measure between genes based on their GO annotation as described in the previous section. The similarity scores between each pair of the 34,000 worm and fly genes define a weighted network W, where the nodes are the genes and the edges are weighted by the pair-wise scores. Since the weighted network is not a multi-layer network but a single-layer network, its modularity can be quantified by a more traditional modularity function for weighted network defined as: , where k
i
= ∑
j
W
ij
, m = ∑
i
k
i
/2 and σ
i
is the module label of node i[30]. A high modularity score means highly similar genes (in terms of GO annotation) are grouped in a module whereas distant genes are separated. In principle, this weighted network based on GO annotation serves as a benchmark for the multi-layer network defined by OrthoClust. A favorable way of assigning nodes to modules by OrthoClust therefore should also be a favorable way to divide the weighted network into modules. As shown in Figure 5B, for each value of κ, we found the way to assign nodes to the modules by OrthoClust and then calculated the corresponding modularity score of the weighted network. When the value of κ is too high or too low, the modularity score of the weighted network is low. The κ that maximizes the modularity score of the weighted network should therefore be the optimal κ for OrthoClust. Combing Figure 5A and Figure 5B, we picked κ = 3 as our optimal value.
Weights associated with the orthology relationships
Orthology relationships between species connect layers of co-association networks. While the coupling parameter κ defines the overall relative contribution between intra-species and inter-species connections, the weight of each orthology connection could, in principle, be adjusted. It is very common that in eukaryotes many orthologs are many-to-many instead of one-to-one, mathematically forming various bipartite cliques in the multi-layer network. We tested OrthoClust by treating all the orthologous pairs equally in the cost function. We found that most of the cliques cannot be resolved, and their nodes got assigned to a single module (Figure S4 in Additional file 1). This implies the cost function favors very large cliques and is biased against the conserved clusters that are linked by one-to-one orthologs. To account for this effect, OrthoClust therefore weights down the orthologous link of a node by the number of orthologs it possesses. As shown in Figure S4 in Additional file 1, the weighted approach works better in resolving the huge cliques.
Comparison with single-species clustering
The aim of OrthoClust is to perform clustering across multiple species in an integrated fashion. Naively, one could construct a cross-species module by performing clustering on individual species separately and concatenate the modules of different species by the orthologs they share. To examine this alternative approach, we performed single species clustering on the expression profiles of worm and fly separately using various standard methods (Materials and methods). We then tested for each combination of worm and fly modules, whether or not there is an enrichment of orthologs between them based on a simple hypergeometric test (Materials and methods). We found that even though there are certain combined worm-fly modules with significant enrichment of orthologous gene pairs, the enrichment is lower than the cross-species constructed by OrthoClust (Figure 6). This is of course not surprising because OrthoClust takes into account the orthology relationships in the algorithm. Nevertheless, this analysis suggested that, by using merely the co-expression data, it is in general less effective in finding the corresponding sets of genes in two species responsible for the same function. To show the result is not a consequence of the particular mathematical form of the cost function imposed by OrthoClust, we ran OrthoClust with κ = 0. As there was no coupling between two species in the cost function, the resultant sets of worm fly modules were essentially clustered independently. Again, we found that the combined worm-fly modules have lower enrichment of orthologous pairs compared with the case with optimal κ = 3. Interestingly, this analysis also manifests how the coupling term in the cost function brings two sets of independent modules closer together in terms of the sharing of orthologs.
Comparison with network alignment
The concatenation of networks using orthology relationships resembles the problem of cross-species network alignment [12]. To compare OrthoClust with network alignment, we applied IsoRank [12] to align the worm and fly co-expression networks (Materials and methods). Again, using the metagenes obtained from [23] as a gold-standard, we found that 88% of metagenes were aligned by IsoRank (Figure S5 in Additional file 1), compared with 81% by OrthoClust. Although IsoRank slightly outperformed OrthoClust in identifying the corresponding functional genes between two species, it does not immediately report how these pairs form clusters. Motivated by [31], we looked for co-expression edges in two networks whose nodes are aligned by IsoRank. By connecting such edges in the network, we generated aligned subgraphs that could potentially be interpreted as modules conserved across two species. Among the gene-pairs that are predicted to be in the same module, we found that 43% are consistent with OrthoClust. The percentage is probably reasonable because aligned subgraphs do not really possess the properties of clusters signified by the dense connections between genes within a species.
Robustness analysis
Simulated annealing was employed to optimize the cost function defined by OrthoClust. To reduce the effects of the stochastic nature of simulated annealing, we constructed the co-appearance matrix by repeating the annealing process R times. To determine R, we ran independent trials of R runs, resulting at different co-appearance matrices and thus different sets of clusters. We then compared the consistency between two sets of clusters by considering if two genes have been assigned to the same module by trial 1, whether or not they are assigned to a common module in trial 2. This is essentially done by calculating a confusion matrix (Materials and methods). By pairwise comparison of independent trials, we found that the overlap between trials increases as R increases (Figure S6 in Additional file 1). More specifically, the overlap increases from 46% for R = 8 to 65% for R = 32, and 75% for R = 64. The statistically significant results for R = 32 in the previous analysis show that the value offers a reasonable compromise between computational cost and robustness. We then further superposed different trials to construct a co-appearance matrix with 128 runs, and thus a set of 'most accurate' clusters. We then calculated the consistency between the ultimate set with sets constructed with smaller values of R (Figure S6 in Additional file 1). We found that the average consistency between clusters generated with R = 32 and the ultimate set is 76%.
Mapping uncharacterized elements to modules
Apart from understanding the modular nature of biological processes, clustering expression profiles is very useful for inferring the putative functions of uncharacterized proteins [32] as well as ncRNAs [33, 34]. The essence of this approach is 'guilt by association': if an uncharacterized element is highly co-expressed with a core set of genes, one can infer the function of the gene based on the functions of genes within the core set. Nevertheless, most core sets were constructed by single-species clustering. The cross-species modules constructed by OrthoClust can potentially serve as an anchor to relate uncharacterized but analogous elements from different species. To explore this avenue, we constructed modules using a set of core worm and fly genes (worm-fly orthologs) by OrthoClust (Materials and methods; Figure S7 in Additional file 1), arriving at a set of 21 core worm-fly modules with similar proportions of worm and fly genes (Additional file 2). We further investigated the functions of these modules based on their enriched GO terms (Materials and methods). For each module, by clustering the enriched GO terms, we assigned a list of representative keywords as their characteristic functions (Figure 7). For instance, module 1 is signified by neurological system process and module 2 by cellular lipid metabolism. As expected, many genes in these modules have orthologous partners within the module. In 18 out of the 21 modules, the fraction of genes with orthologous partners is higher than 80%.
We then mapped worm and fly ncRNAs to the 21 modules based on their expression profiles (Materials and methods). Though there is no gold standard available to evaluate systematically the performance of the mapping, we found examples suggesting that ncRNAs from different species could be linked together in terms of their potential functions. For instance, sphinx, the fly long non-coding RNA (lncRNA) expressed in the brain, was shown to be involved in regulation of male courtship behavior [35]. In our analysis, this lncRNA was mapped to module 1, which is characterized by neurological system process and behavior. On the other hand, linc-10 and linc-104, worm large intergenic non-coding RNAs (lincRNAs) that are highly expressed in male-related stages [36], were mapped to the same module.
In addition to the mapping, we also found that some modules are enriched with different classes of ncRNAs (Figure S8 in Additional file 1). The list of worm and fly ncRNAs we tested and the modules they mapped to can be found in Additional file 3.
Generalization to N species
OrthoClust is a general framework not only applicable to the clustering of expression profiles but in general other genomics data in the form of co-association networks. In addition, the framework can be readily applied to more than two species by modifying the cost function. In general, for N species, the cost function will have N terms for the co-association networks, and N(N - 1)/2 terms for the orthologs between all pairs of species. For instance, if N = 3, the cost function can be written as:
Here, S
1
, S
2
, S
3
stand for three different species. The inner summation is the modularity function for the network of a single species. The outer summation sums the three networks together. The extra terms represent the coupling (with coupling constant κ) between three pairwise combinations, namely the orthologous pairs represented by O(S
1
, S
2
), O(S1, S3) and O(S2, S3). The coupling constant κ can be determined by the same approach we explained in the example of two species. Of particular interest is the third-order coupling term for the 1-1-1 triplets, O(S1, S2, S3). A triplet consists of three genes from three species that are orthologous to one another in a one-to-one fashion; that is, for triplet , apart from , no other gene in S
1
is orthologous to and vice versa. In this cost function, the third-order term favors a 1-1-1 triplet to have the same label. The 1-1-1 triplets are of particular importance among all the orthologous triplets because they correspond to particularly conserved biological function. Genes performing less conserved functions are more likely to undergo gene duplication and end up with many-to-many orthologs.