Modular architecture of protein structures and allosteric communications: potential implications for signaling proteins and regulatory linkages

A new method for studying signal transmission between functional sites by decomposing protein structures into modules demonstrates that protein domains consist of modules interconnected by residues that mediate signaling through the shortest pathways.


Background
Allosteric communications play crucial roles in many cellular signaling processes. Perturbations caused by factors such as ligand binding at one functional site affect a distant site, thereby regulating binding affinity and catalytic activity [1,2]. Since the allosteric model proposed by Monod and coworkers [1], decades of research have extended the common view of allostery associated with multi-domain proteins to single domain proteins. The allosteric behavior displayed by single domain proteins, such as myoglobin [3], called into question the existing allosteric dogma. In the 'new view' of protein allostery, all proteins are potentially allosteric when thought of in terms of population redistribution upon ligand binding causing conformational change in a second binding site [1].
Dynamic models have been proposed to explain the conformational changes involved in signal transmission between functional sites [4,5]. In particular, the role of the pre-existing equilibrium of conformational sub-states in allostery, proposed over 20 years ago [6], is receiving increasing attention, emphasizing the key role of protein dynamics in this process [1,7-9]. Although experimental methods such as double mutant cycle analysis [10] have provided insights into allosteric communications, understanding the general principles of the transmission of information between distant functional surfaces remains a challenge in structural biology. Several theoretical methods based on sequence and structural considerations have been proposed for the identification of key amino acids for long-range communications [11-13]. Among these, an interesting sequence-based approach has been proposed by Ranganathan and coworkers [14,15] for estimating the thermodynamic coupling between amino acids in several protein families. Recently, we introduced a model based on a network representation of protein structures. The model allows us to determine fold centrally conserved residues (FCCRs). These residues are responsible for maintaining the shortest pathways between all amino acids and, thus, play key roles in signal transmission [13]. Analysis of several protein families showed an agreement between our results and experimental data, illustrating the importance of protein topology in network communications. If protein structures are viewed as information processing networks, it is reasonable to assume that mutations of amino acids crucial for network communications could impair signal transmission.
The rationale for modular organization of proteins in allosteric behavior has been discussed previously [16-18]. Modular domains can act cooperatively, leading to new input (and output) relationships. The Src family proteins constitute a clear example of this modular architecture: these proteins contain amino-terminal SH3 and SH2 domains, which flank a kinase domain by intra-molecular SH3-binding and SH2-binding sites [16]. It is further known that modular functional units display certain degrees of functional specificity in a number of proteins. In several cases of protein-protein interactions involved in cell signaling, some parts of the interacting interface participate in the information transfer, whereas other interacting regions appear to contribute solely to binding affinity [19]. Examples of proteins exhibiting this modular binding site configuration include myosin, the C5a receptor, and the protein kinase R activator PACT, among others [19]. Here, we aim to obtain the modular decomposition of allosteric proteins and to explore a relationship between the modules and the allosteric activity. We expect that such a relationship, if it exists, would lead to deeper insight into functional mechanisms. We develop a new approach for decomposing protein structures into modules using their residue network representations. Our methodology is based on the edge-betweenness clustering algorithm proposed by Newman and Girvan [20,21], which has been previously applied to a wide variety of problems [22-25]. This method uses edge centrality to detect module boundaries and finds the assignment of nodes to modules [20].
The small-world topology of protein structures suggests that the key amino acids for signal transmission should lie in the shortcuts linking different regions of the structure. The removal of the most central contacts forming these shortcuts divides the structure into modules. We characterize these modules from a structural point of view. Our results, derived from a non-redundant dataset of multi-domain proteins, reveal that, in the vast majority of the cases, modules tend to be located within rather than across domains. Therefore, modules can be considered as sub-domains. Further analysis shows that the percentage of long-range interactions at the modular boundaries is much higher than that in non-boundary regions. Residues forming inter-modular contacts fluctuate less than those participating only in intra-modular interactions. One possible explanation of this finding is that the most central residues, which have been shown to be important for allosteric communications, are located at the inter-modular interfaces and, therefore, tend to be more rigid to maintain their contacts. Inspection of 13 allosteric proteins shows that functionally annotated regions exhibit a modular architecture, with modules interconnected by FCCRs, which are responsible for mediating the shortest pathways between all amino acids and, thus, play crucial roles in allosteric communications [13]. Functional sites are often contained in one module; however, there are also examples of functional sites shared by two or more modules. Some of these cases correspond to binding sites divided into two modules belonging to different domains. The Gαs subunit and P450 cytochromes are examples of functional sites shared between modules. Interestingly, the modular decomposition of the Gαs subunit reflects the partitioning of the binding site into regions with different sub-functional specializations: general binding and information transfer [26].
The P450eryF active site is divided into a module containing the ligand-binding site and a module comprising the effector-binding site, whereas the P450cam substrate binds to one module and the product binds mainly to another module. A detailed analysis of a large dataset of proteins with functional annotations revealed that modules exhibiting high modularity tend to include functional sites.
Our results lead us to propose that the modular architecture of protein structures yields a more efficient performance of the functional activity. Modules may possess a certain functional independence, and they are interconnected through amino acids previously shown to mediate signaling in proteins. Modules consist of groups of highly cooperative residues. Evolution has organized proteins as systems consisting of modules linked by amino acids that maintain the shortest pathways between all amino acids and are, thus, crucial for signal transmission, leading to robust and efficient communication networks. This organization is advantageous and, as such, has been conserved by evolution.

Results and discussion
Here we propose a novel way to decompose protein structures into modules based on their representation as residue interacting networks (see Materials and methods). Our approach relies on the edge-betweenness clustering algorithm presented by Newman and Girvan [20,21]. Modular decomposition allows us to identify functionally important regions in proteins.

Structural properties of modules
We carried out the modular decomposition of protein structures of a non-redundant dataset of 100 multi-domain proteins (described in Materials and methods). Results show that the majority of the modules have most of their residues in one domain (Figure 1). That is, modules tend to be located within rather than across domains, and hence may be considered as sub-domains. Comparison of contacts between amino acids belonging to different modules (inter-modular contacts) and those between amino acids belonging to the same module (intra-modular contacts) revealed that the percentage of long-range interactions is larger in the inter-modular contacts (Figure 2). This finding is in agreement with the rationale that long-range interactions often mediate the shortest pathways between most residues in the protein.
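The comparison of inter-modular and intra-modular contacts can be illustrated with a short Python sketch. It is not the code used in this study: the function name `long_range_percentages`, the `module_of` mapping, and the input format are assumptions made for illustration. The ten-residue sequence separation follows the long-range definition given in Materials and methods.

```python
from typing import Dict, Iterable, Tuple


def long_range_percentages(
    contacts: Iterable[Tuple[int, int]],
    module_of: Dict[int, int],
    min_separation: int = 10,
) -> Dict[str, float]:
    """Percentage of long-range contacts among inter- and intra-modular contacts.

    contacts: pairs (i, j) of residue sequence positions that are in contact.
    module_of: hypothetical mapping from residue position to module label.
    """
    counts = {"inter": [0, 0], "intra": [0, 0]}  # [long-range, total]
    for i, j in contacts:
        kind = "intra" if module_of[i] == module_of[j] else "inter"
        counts[kind][1] += 1
        # Long range: ten or more residues apart in the sequence.
        if abs(i - j) >= min_separation:
            counts[kind][0] += 1
    return {
        kind: (100.0 * long_range / total if total else 0.0)
        for kind, (long_range, total) in counts.items()
    }
```

Applied to each protein in turn, the two returned percentages correspond to the pair of values compared per protein in Figure 2.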
A detailed analysis of 115 proteins (described in Materials and methods) with available structures in different conformational states and temperature B-factors showed that residues with inter-modular contacts fluctuate less than those forming exclusively intra-modular contacts. Figure 3 illustrates this situation: the normalized root mean square deviation (RMSD) values and the B-factors of the residues involved in inter-modular interactions tend to be lower than those of the residues involved in intra-modular interactions. This result suggests that intra-modular regions, which include most of the protein or ligand binding sites, absorb conformational changes due to perturbations. In contrast, the boundaries between modules are more rigid, allowing them to maintain key residue contacts for the integration and transmission of the information between modules.

Modularity of protein function
The modular decomposition of protein structures provides information about functional sites and signal transmission. We selected a dataset of 13 allosteric proteins based on previously analyzed examples [13] and new examples with experimental information. A detailed study of these proteins revealed that many modules contain functional regions, which are interconnected by residues mediating the shortest pathways between most amino acids in the structure (FCCRs). A majority (72%) of the FCCRs connect modules (Additional data file 1). Table 1 summarizes the analyzed examples, including the assignment of functional sites to modules (detailed information is provided in Table 3 of Additional data file 1).

Figure 2. Percentage of long-range interactions for each protein of the multi-domain protein dataset. The interactions were calculated separately for the set of the inter-modular residues and for the set of intra-modular residues. The ordinate axis shows the percentage of long-range interactions for the inter-modular interfaces (in red) and for the intra-modular regions (in blue).

Modular division of functional sites
Functional sites can be decomposed into modules. In some cases, the modules are located in different domains. An illustrative example of this situation is the pyruvate kinase (PDB ID 1liu, chain A) [...]

[...] shows that the adenylyl cyclase-binding site is divided into two modules: one of the modules contains the switch I and switch II regions, and the other module comprises the α3-β5 loop (Figure 4). Thus, in this example we find a correspondence between the modular decomposition of the binding site and its partition into signal-transfer and general binding regions.

[...] The modular decomposition of this protein indicates that the two modules share the active site. Each of these modules contains one of the two androstenedione-binding sites (Figure 5a).

P450cam (Pseudomonas putida)
The camphor monooxygenase P450cam catalyzes the 5-exo hydroxylation of camphor [33]. Its active site may be considered to have two functionally different subsites: the substrate binding region (site I) and the L6 position of the iron to which oxygen binds upon reduction (site II) [33]. Allosteric interactions between these subsites are reflected in the fact that site I binding can inhibit site II ligation and vice versa. Furthermore, the presence of the product 5-exo-OH camphor inhibits binding of the substrate camphor (and vice versa) [33]. The modular decomposition of the P450cam structure (PDB ID 1noo) shows that the substrate (camphor) and product (5-exo-OH camphor) binding sites are mainly located in different modules, sharing common central residues, which are likely to be important for the allosteric communication between these sites. Figure 5b shows that residues comprising the 5-exo-OH camphor binding site tend to be located closest to the heme central ion, whereas amino acids forming the camphor binding site tend to be positioned distal from the heme group.
These examples suggest that the modular design of functional sites might be related to their sub-functional specialization. Each module contains a portion of the active site and is mainly involved in a specific sub-function, such as the binding of the substrate, the product or an allosteric ligand.

Modularity and functional significance of modules
Analysis of the previously studied dataset of 115 proteins with functional site annotations (described in Materials and methods) indicates that modules exhibiting high modularity values tend to comprise functional sites. The analysis of all modules shows that a large percentage of modules comprising functional regions exhibit above-average modularity values (Figure 6a). Figure 6b illustrates the correlation between the percentage of functional modules and the modularity values.

Conclusion
In signaling proteins, modular domains can act as switches mediating activation, repression and integration of diverse input functions. Experimental studies confirm that interdomain linker regions are crucial for the domain coupling required for the information transfer [16]. Our approach decomposes protein structures into modules, allowing us to study functional sites linked by signal transmission. To detect module peripheries, we rely on the identification and removal of the most central residue contacts, assuming that the interactions of these amino acids are crucial for information transfer. Our results show that modules, which often characterize functional sites, can be considered as building blocks of protein domains. Hence, the question arises: how is the transmission between distinct modules achieved? Although this is a very complex process that is not fully understood, our findings suggest that inter-modular boundaries are essential for integrating and transmitting the information between functional regions. The majority of the fold centrally conserved residues, recently shown to play a key role in signal transmission by maintaining the short path lengths between all residues in the structure [12], are those responsible for the inter-modular interactions. Furthermore, boundary residues are rigid, sustaining key amino acid interactions for the communication between modules. On the other hand, intra-modular regions, which include most of the protein or ligand binding sites, form a flexible cushion. Most of the inter-modular residue interactions form long-range contacts, which are predominantly involved in mediating signaling. A detailed study of 13 allosteric proteins showed that functional sites are often contained within one module. However, there are cases of active sites divided into two or more modules.
The analysis of the Gαs subunit and of cytochromes P450eryF and P450cam illustrates that the modular architecture of the active site may relate to its sub-functions. Modules containing functional sites display high modularity, suggesting that modularity can be used to identify functional modules.
To conclude, our approach decomposes protein domains into modules. Mapping annotated functional regions onto the decomposed structures illustrates that the modules characterize functional sites. We observe that most inter-modular boundary residues provide the shortcuts in the communication wires. These residues maintain the shortest pathways between all amino acids, leading to robust and efficient communication networks for signal transmission. Functional specificity and regulation rely on the communication between modules. This advantageous organization has been conserved by evolution. Furthermore, due to the possible functional independence of modules, changes in boundary residues may lead to new functions or to functional alterations, as might be needed in a changing environment. Therefore, a modular configuration might allow signaling proteins to increase their regulatory links and to expand the range of control mechanisms, either via new modular combinations or through modulation of inter-modular linkages. Since our results indicate that boundary residues are crucial in efficient short communication pathways, both mechanisms appear possible.

Materials and methods

Protein datasets
A non-redundant dataset of 100 multi-domain proteins was selected from NCBI [34]. The domain information was extracted from the CATH database [35,36]. This dataset was used to analyze the distribution of protein modules into domains and to calculate the distribution of the long-range interactions at the inter-modular interfaces and in the intra-modular regions. Using the definition of Green and Higman [37], we considered interactions as long range if they occur between amino acid residues that are ten or more residues apart in the sequence. While residues close in sequence are close in space, we adopt this standard notation, which has been used in numerous studies. The analyses of flexibility and modularity of modules were based on a different dataset of 115 proteins with conformers. This dataset was compiled using the Database of Macromolecular Movements [38-40] and comprises proteins undergoing distinct molecular motions. Only conformers with more than 60% sequence identity were chosen. The annotations of functional sites were taken from PDBsum [41,42]. We annotated a module as functional if more than 30% of its residues belong to a functional site. We selected 13 examples of proteins displaying allosteric activities with existing PDB structures. All protein structure images were created using DS ViewerPro 6.0 [43].
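The functional annotation rule for modules can be written as a small Python sketch. The names `functional_modules`, `modules` and `functional_residues` are hypothetical and the data layout is assumed; only the 30% threshold is taken from the text.

```python
from typing import Dict, Set


def functional_modules(
    modules: Dict[str, Set[int]],
    functional_residues: Set[int],
    threshold: float = 0.30,
) -> Dict[str, bool]:
    """Flag a module as functional if MORE than `threshold` of its residues
    belong to an annotated functional site."""
    flags = {}
    for name, residues in modules.items():
        if residues:
            fraction = len(residues & functional_residues) / len(residues)
        else:
            fraction = 0.0  # an empty module cannot be functional
        flags[name] = fraction > threshold
    return flags
```

Note the strict inequality: under this reading of the rule, a module with exactly 30% of its residues in a functional site is not annotated as functional.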

Network analysis of protein structures
Each protein structure was modeled as an undirected graph, where amino acid residues corresponded to vertices, and their contacts were represented as edges. Residues i and j were considered to be in contact if at least one atom of residue i was at a distance of less than or equal to 5.0 Å from an atom of residue j. This value approximates the upper limit for attractive London-van der Waals forces [12,37].
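The contact definition above can be sketched in Python as follows. This is a minimal illustration under assumed inputs: residues are given as a dictionary from residue number to a list of atomic coordinates in Å, and the function name `contact_edges` is hypothetical. Parsing of PDB files is deliberately left out.

```python
from itertools import combinations
from math import dist
from typing import Dict, List, Set, Tuple

Atom = Tuple[float, float, float]


def contact_edges(
    residues: Dict[int, List[Atom]], cutoff: float = 5.0
) -> Set[Tuple[int, int]]:
    """Return undirected edges (i, j), i < j, between residues in contact.

    Two residues are in contact if at least one pair of their atoms lies
    within `cutoff` angstroms (inclusive) of each other.
    """
    edges = set()
    for (i, atoms_i), (j, atoms_j) in combinations(sorted(residues.items()), 2):
        if any(dist(a, b) <= cutoff for a in atoms_i for b in atoms_j):
            edges.add((i, j))
    return edges
```

The all-pairs comparison is quadratic in the number of atoms, which is acceptable for a sketch; a spatial grid or k-d tree would be the usual choice for large structures.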
FCCRs were calculated as in del Sol et al. [13]. Protein networks were decomposed into modules using the edge-betweenness clustering algorithm of Girvan and Newman [21], which is based on the iterative removal of the edges with the highest betweenness. We used the parallel implementation PEBC (parallel edge betweenness clustering) [44] of the Girvan and Newman algorithm. We modified the program to obtain the modular decomposition after removing 80% of the network edges. This cutoff was obtained empirically by optimizing the correspondence in the mapping of functional sites into modules. Based on the expression of network modularity introduced by Guimerà and Nunes Amaral [45], we defined the modularity of protein modules $Q_m$ as follows:

$$Q_m = \frac{l_m}{L} - \left(\frac{d_m}{2L}\right)^2$$

where $L$ is the number of edges in the network, $l_m$ is the number of edges between nodes in module $m$, and $d_m$ is the sum of the degrees of the nodes in module $m$. The rationale for this modularity measure is as follows: modules with high modularity values must contain many within-module links and as few as possible between-module links. The equation above imposes $Q_m = 0$ in cases when the module comprises the whole network or if nodes are placed randomly into modules.

Figure 4. Binding site of the G-protein αs subunit (PDB ID 1azs) divided into two modules. This division coincides with the specialized regions of this binding site for ligand binding only (pink module) and ligand binding and information transfer (blue module). The binding site residues are depicted in spacefill. Modular regions not involved in the binding site are depicted in green.
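The edge-removal procedure and the modularity measure can be illustrated with a compact, serial Python sketch. It is not the PEBC program used in this work. The function names are hypothetical, edge betweenness is computed with a Brandes-style breadth-first accumulation, and two points are assumptions: modules are taken to be the connected components that remain after the removal step, and node degrees in the modularity are those of the original network.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph.
    Every node pair is counted in both directions, which doubles all scores
    but does not change their ranking."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        dist = {source: 0}
        sigma = {v: 0.0 for v in adj}  # number of shortest paths from source
        sigma[source] = 1.0
        preds = {v: [] for v in adj}
        order, queue = [], deque([source])
        while queue:  # breadth-first search from the source
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate dependencies back to source
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return score


def decompose(edges, removal_fraction=0.8):
    """Remove the given fraction of edges, always deleting the edge with the
    highest current betweenness, and return the remaining components."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n_edges = sum(len(nb) for nb in adj.values()) // 2
    for _ in range(int(removal_fraction * n_edges)):
        score = edge_betweenness(adj)  # recomputed after every removal
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
    modules, seen = [], set()
    for start in adj:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adj[node] - component)
        seen |= component
        modules.append(component)
    return modules


def module_modularity(edges, module):
    """Q_m = l_m / L - (d_m / (2 L))**2 for one module (a set of nodes)."""
    edge_set = {frozenset(e) for e in edges}
    L = len(edge_set)
    l_m = sum(1 for e in edge_set if e <= module)  # edges inside the module
    d_m = sum(len(e & module) for e in edge_set)   # summed degrees of module nodes
    return l_m / L - (d_m / (2 * L)) ** 2
```

For example, for two triangles joined by a single bridge edge, the bridge carries every shortest path between the triangles and is removed first, so a small `removal_fraction` already separates the two triangles into distinct modules. Recomputing betweenness after each removal makes this sketch slow for large networks, which is presumably why a parallel implementation is useful in practice.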

Protein flexibility analysis
The analysis was carried out over the dataset of 115 proteins with conformers in two ways. We first calculated the averaged main chain residue RMSD considering all pairs of structurally aligned conformers. The structural alignments were obtained using MultiProt [46,47]. We also calculated the main chain temperature B-factor of each residue. The normalizations of the RMSDs and B-factors were calculated using the standard definition of the Z-score values.
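The normalization step can be sketched as follows. The text only refers to the standard Z-score definition, so the use of the population standard deviation and the helper name `z_scores` are assumptions made for this illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_scores(values: Sequence[float]) -> List[float]:
    """Z-score normalization: (x - mean) / standard deviation.

    Assumes the population standard deviation; returns zeros when all
    values are identical, since the Z-score is undefined in that case.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

Normalizing per protein in this way puts per-residue RMSD values and B-factors on a common scale, so that residues with inter-modular and intra-modular contacts can be compared across structures.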

Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 contains figures with additional examples of protein modularity and tables with the data sets used for the analyses.