Determinants of protein function revealed by combinatorial entropy optimization
© Reva et al.; licensee BioMed Central Ltd. 2007
Received: 18 July 2007
Accepted: 1 November 2007
Published: 01 November 2007
We use a new algorithm (combinatorial entropy optimization [CEO]) to identify specificity residues and functional subfamilies in sets of proteins related by evolution. Specificity residues are conserved within a subfamily but differ between subfamilies, and they typically encode functional diversity. We obtain good agreement between predicted specificity residues and experimentally known functional residues in protein interfaces. Such predicted functional determinants are useful for interpreting the functional consequences of mutations in natural evolution and disease.
The diversity of biologic phenomena arises from the complexity and specificity of biomolecular interactions. Nucleic acid and protein polymers encode and express biologic information through the specific sequence of polymer units (residues). The sequences and corresponding molecular structures are under selective constraints in evolution. At specific sequence positions, changes in sequence alter intermolecular communication, affect the phenotype, and can lead to disease [1–6]. Detailed understanding (quantitative and predictive description) of how such molecular changes affect cellular and organismic function lies at the heart of molecular and systems biology. Our ability to predict the biologic and medical consequences of human genetic variation and to design therapeutic interventions can benefit hugely from such detailed understanding. We are therefore motivated to develop further our ability to identify functionally specific residues in protein molecules.
Identifying interaction sites on protein molecules is difficult, both experimentally and theoretically. Most proteins have complicated three-dimensional shapes with interaction sites that are composed of contributions from nonsequential residues. Even with the three-dimensional structure known, however, the sites of functionally important interactions may not be obvious. Mutational experiments to probe the contributions of individual residues to such interactions are expensive. Computational methods to simulate the interactions of biologic macromolecules in molecular detail do not yet have adequate power and accuracy. Fortunately, biologic evolution has recorded rich and highly specific information in genetic sequences. For proteins, this provides the opportunity to analyze conservation patterns in amino acid sequences and extract valuable information about specific protein-partner interactions. In particular, residues in protein active sites and protein binding sites are under sufficiently strong selective pressure to allow their identification from an analysis of protein family alignments.
In a sufficiently diverse family, globally conserved residues (residues conserved in most or all family members) are easily identified and are likely to be conserved as a result of strong selective constraints. A number of research groups have developed sophisticated methods to identify additional key residues that are involved in protein structure and function, especially residues that are strongly conserved within each subfamily but differ between subfamilies [7–18]. If subfamily specific conservation patterns were perfect, then these methods would probably yield identical lists of functional residues. However, real conservation patterns can be considerably more complicated for a variety of reasons, for instance because of superimposition of multiple evolutionary constraints involving several interaction partners. In addition, current sequence collections are incomplete, for example with respect to species representation, and particular protein families are often not evenly sampled. Finally, results depend on the level of subfamily granularity (the number of subfamilies defined in a given protein family). Consequently, the extraction of biologically relevant conservation signals from multiple sequence alignments remains a challenging problem.
We present a new algorithm with which to solve the combinatorially complex problem of identifying specificity residues and, simultaneously, the corresponding optimal division into subfamilies. In our approach, called combinatorial entropy optimization (CEO), we optimize a conservation contrast function over different assignments (clusterings) of proteins to subfamilies. Hierarchical clustering is used to explore the space of alternative clusterings over a diverse set of clustering trajectories to reach an optimum. Given an optimal clustering, individual residue positions (columns) vary considerably in the value of the combinatorial entropy. The distribution of column entropy values is a z-shaped curve and, reassuringly, is drastically different from the corresponding distribution for randomized alignments. Different entropy values are interpreted to reflect different residue-specific functional constraints, and residues with lowest entropy values are predicted to be functional.
We validate the method by comparing sets of predicted specificity residues with sets of experimentally known functional residues, such as interaction residues observed in three-dimensional macromolecular complexes, and we obtain good agreement between prediction and observation. Interestingly, the predictive power of the method goes beyond protein-protein interactions and is applicable to any functional constraint that conserves specific residue types in particular positions across all members of a protein subfamily.
Parameter choice and robustness of results
The clustering algorithm partitions the sequences of a protein family into subfamilies and simultaneously selects a set of characteristic residues. The value of the contrast function, which is optimized; the number of subfamilies; and the set of the characteristic residues, which constitute the resulting optimal configuration, depend on the value of the parameter A (see Materials and methods, below [Equation 7]). We tested the robustness of the results with respect to parameter changes. To explore the choice of A, we conducted tests in a number of protein families with A ranging from 0.0 to 1.0, in 0.001 increments. Ideally, the selected set of characteristic residues varies slowly with A in a region of suboptimal A. The tests identified A = 0.6 to 0.9 as the optimal range, and we tested all local minima of ΔS0(A) in this range. We tested the robustness of the results for many protein families, with representative results for two protein families in Additional data file 1. We conclude that the assignment of sequences to subfamilies is reasonably consistent with prior biologic knowledge (which in itself is incomplete and not formally defined) and that the selection of characteristic residues is reasonably stable in the range A = 0.6 to 0.9. For example, for protein kinases, of the top 30 characteristic residues at the overall minimum (A = 0.68), ranked by the column-specific difference entropies, 26 are in the top 30 at the second best local minimum (A = 0.72); similarly, for ras-like small GTPases, of the top 20 residues at A = 0.833, 19 are in the top 20 at A = 0.85.
As a practical consequence of these tests, for a given protein family alignment the current software implementation of the algorithm scans the values A = 0.6 to 0.9 in increments of 0.025 and reports results for the value of A for which ΔS0(A) is minimum. For typical protein families this procedure yields results that resonate well with the biologic intuition of protein family experts (the reported protein subfamilies are not too fine grained nor trivially unified), and the selection of characteristic residues is a good starting point for detailed analysis and design of mutational experiments. After an initial scan, users can of course select any range of granularity parameter A as input and obtain more fine grained or more unified families as output.
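The scan over A described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: `scan_granularity` and `toy_contrast` are hypothetical names, and `toy_contrast` merely stands in for a full clustering run that would return the minimal ΔS0 reached at a given value of A.

```python
from typing import Callable, Tuple


def scan_granularity(contrast_at: Callable[[float], float],
                     a_min: float = 0.6, a_max: float = 0.9,
                     step: float = 0.025) -> Tuple[float, float]:
    """Scan A on a regular grid and return (A, Delta S0) at the lowest contrast value."""
    n_steps = round((a_max - a_min) / step)
    # Round grid points so that accumulated floating-point error does not leak into A.
    grid = [round(a_min + j * step, 6) for j in range(n_steps + 1)]
    best_a = min(grid, key=contrast_at)
    return best_a, contrast_at(best_a)


def toy_contrast(a: float) -> float:
    """Hypothetical stand-in for a clustering run: a smooth curve with its minimum at A = 0.7."""
    return -100.0 + 400.0 * (a - 0.7) ** 2


best_a, best_value = scan_granularity(toy_contrast)
print(best_a, best_value)
```

In a real run each grid point is an independent clustering, so the 13 evaluations between 0.6 and 0.9 could be computed in parallel and cached before the minimum is selected.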
Validation: subfamilies and key residues of ras-like GTPases
To illustrate typical results of the CEO algorithm applied to families of amino acid sequences, we chose the small GTPases, a large and functionally diverse protein domain family with members, probably, in all eukaryotes. These GTPases are molecular switches, timed by their rate of GTP hydrolysis, which is regulated by a number of interaction partners. GTPase activating proteins accelerate the GTPase by several orders of magnitude; guanine nucleotide exchange factors catalyze the binding of nucleotide after dissociation; and guanine nucleotide dissociation inhibitors stabilize the prenylated form of the GTPase in the cytoplasm and slow down dissociation of nucleotide. The switch is read out in its active form by interaction with downstream effectors, such as raf kinase for ras and rho kinase for rho.
Small GTPases as testing ground
These multiple functional interactions provide an ideal testing ground for specificity analysis. A plausible evolutionary scenario involves repeated genomic duplication of an evolutionary ancestor and subsequent selection of variants, following mutation, in which the new family members have taken on a specific function. For the more than 100 distinct small GTPases in, for instance, mammalian genomes, many functions are known but our knowledge is far from complete. It is therefore interesting to analyze in which way our specificity analysis agrees with known divisions into functional protein subfamilies and to make explicit predictions pointing to candidate residues for mutational functional experiments.
Results for ras-like G-domains
Agreement with known functional subfamilies
Because the analysis only used amino acid sequences and did not use any functional information, the concentration of similar functional names and annotations in the computed subfamilies immediately indicates successful functional classification (Additional data file 2). For example, all Ras and Rho proteins (as far as names have been assigned in the literature) are in distinct subfamilies. Finer levels of classification also appear to agree with known functional classifications; for example, Rab5A, Rab5B, and Rab5C are in a subfamily distinct from that of Rab6A, Rab6B, and Rab6C. As a result of systematic focus on specificity conservation patterns in our method, the implied functional distinctions between subfamilies constitute predictions when the protein class is known but functional details are not yet known.
Agreement with known functional residues
Prediction of as yet uncharacterized functional residues
Given the excellent agreement of the set of specificity residues derived from sequence family information with sets of functional residues reported as the result of detailed experiments, we are encouraged to identify potential functional residues in prediction mode. The simple hypothesis, following detailed analysis, is that all computed specificity residues have a functional implication, defined either as an observed phenotypic consequence upon changing the amino acid type or as direct observation of specific interactions (above nonspecific background) with other biologic molecules. Although such detailed predictions may be the subject of a subsequent analysis, we propose here that the following residues in the ras-type GTPases that are not in the 'switch' regions and have not been observed in protein-protein contacts in three-dimensional structures are particularly interesting (Figure 2): G75, E76, F78, K104, and A155. We propose mutational experiments for these residues within the context of carefully chosen available functional assays.
Validation: prediction of binding sites
Various functional constraints can give rise to patterns of specificity residues, including macromolecular interfaces. To assess the predictive utility of the method for the prediction of interactions, we compared the overlap between the set of predicted specificity residues with known binding sites in several protein complexes. Although evolutionary constraints on specificity residues can be the result of any kind of functional interaction, residues in protein-protein interactions and protein-nucleic acid (NA) interactions are particularly well defined in three-dimensional structures of macromolecular complexes. A strong overlap of predicted specificity residues with binding sites would indicate that the method correctly identifies functional constraints on binding site residues. If that is the case, then one would expect a reasonable fraction of specificity residues to be binding site residues. We therefore assess the predictive potential of the implied prediction method, aware of the risk for over-prediction in cases in which other functional constraints operate outside binding sites.
Statistical significance and accuracy of prediction
Statistical significance of the presence of predicted specificity residues in known interfaces of protein-protein and protein-DNA/RNA complexes
Column headings include P S&I, P C&I, and P (S+C)&I. Each line below is one table entry, with fields separated by ' | '.
1wq1R1 (1 to 166) | P-loop containing nucleoside triphosphate hydrolases | Superfamily (human) 156/0.90/0.90 | 1wq1G, GDP, Mg, AF3
1wq1G2 (718 to 1,037) | GTPase activation domain, GAP | Superfamily (human) 20/0.90/0.90 | 1wq1R, GDP, Mg, AF3
1fvuA3 (1 to 133) | Superfamily (swiss) 64/0.90/0.90
1fvuB4 (401 to 525) | Superfamily (swiss) 136/0.90/0.90
1a2kA5 (10 to 121) | 1a2kD, GDP, Mg
1a2kD6 (12 to 170) | P-loop containing nucleoside triphosphate hydrolases | Superfamily (human) 170/0.90/0.90 | 1a2kA, GDP, Mg
1i2mB7 (24 to 417) | Superfamily (nrd90) 77/0.90/0.90
1i2mA8 (12 to 170) | P-loop containing nucleoside triphosphate hydrolases | Superfamily (human) 170/0.90/0.90
1rrpB9 (17 to 150) | Superfamily (nrd90+swiss) 59/0.90/0.90
1rrpA10 (12 to 170) | P-loop containing nucleoside triphosphate hydrolases | Superfamily (human) 170/0.90/0.90 | 1rrpB, GNP, Mg
1blxB11 (41 to 72) | PFAM (human) 1043/0.95/0.95
1blxB11 (73 to 105) | PFAM (human) 1043/0.95/0.95
1blxB11 (106 to 137) | PFAM (human) 1043/0.95/0.95
1blxA12 (5 to 309) | Protein kinase-like (PK-like) | Superfamily (human) 81/0.90/0.95
2cciA13 (4 to 286) | Protein kinase-like (PK-like) | Protein Kinase Resource 390 | 1h27B1, 1h27B2, TPO
2cciB114 (181 to 307) | Pfam N-cyclin 379/0.95/0.90 | 2cciA, 2cciF, TPO
2cciB215 (309 to 431) | Pfam C-cyclin 238/95/90
1n7tA21 (14 to 98) | Erbin PDZ domain | PFAM (human) 237/0.90/0.90
1g4dA16 (13 to 81) | Repressor protein C | Putative DNA-binding domain | Superfamily (nrd90) 244/0.90/0.95
1e3oC17 (104 to 160) | lambda repressor-like DNA-binding domains | Superfamily (swiss) 397/0.90/0.90
2up1A18 (10 to 92) | Hnrnp A1, Up1 | RNA-binding domain (RBD) | Superfamily (swiss) 552/0.90/1.0
1ec6A19 (4 to 90) | Eukaryotic type KH-domain (KH-domain type I) | Superfamily (nrd90+swiss) 463/0.90/0.80
1serB20 (501 to 610) | Seryl tRNA synthetase | Superfamily (swiss) 96/0.90/0.90
Example: interactions of cell cycle kinases
Specificity residues computed from family alignments reflect functional constraints. The distribution of specificity residues is particularly interesting for proteins engaged in multiple interactions. An example is the cyclin-dependent kinase CDK2, which plays a key role in the cell cycle (phases S and G2) in all eukaryotes. CDK2 forms complexes with cyclins (E and A) and specifically phosphorylates numerous substrates, such as retinoblastoma protein (pRb), retinoblastoma-like protein 1 (p107), cell division control protein CDC6, cyclin-dependent kinase inhibitor p27, tumor suppressor p53, and transcription factor E2F1. Currently, 72 proteins are reported in the Human Protein Reference Database as interacting with CDK2. CDK2 is tightly regulated; it requires specific activating phosphorylation at position Thr160 by a CDK-activating enzymatic complex (CAK); it can be inhibited by the Ink4 and Cip1/Kip1 families of cell cycle inhibitors or by phosphorylation in the glycine-rich loop by the Wee1 or Myt1 kinase. To derive specificity residues in CDK2, we used 390 sequences of protein kinases related to CDK2. We also derived specificity residues for cyclin A (379 sequences for domain N and 238 sequences for domain C).
The CEO algorithm is motivated by the observation that functional constraints in many cases give rise to a position-specific signature of amino acid residue types in protein sequences. Given a protein family alignment, the algorithm developed and tested here solves the challenging computational problem of detecting functional protein subfamilies and, at the same time, identifying a functional residue signature. This signature is a set of key residues (sequence positions) that vary characteristically between subfamilies but are conserved within each subfamily. The computational procedure ranks the key residues by their contribution to the optimal value of the contrast function, defined in terms of combinatorial entropy. One can use this residue ranking to prioritize further analysis and design experiments. The method also provides a signal-to-background criterion that is used to automatically classify all residues into three broad classes: specificity residues, conserved residues, and 'neutral' residues.
Alternative solution to a complicated problem
As far as we know, the first algorithmic approaches to the problem of identification of specificity residues appeared in the mid-1990s, from the groups of Sander and Cohen. (See Background, above, for references to additional methods.) The current approach is sufficiently different from previous approaches to offer an alternative solution to this complicated problem. We cannot, however, claim superior performance relative to other approaches, because no 'gold standard' of experimentally determined specificity residues exists against which to validate different methods. In practice, we see a number of advantages relative to our own first approach, which was based on multivariate correspondence analysis, especially the automated definition of the resulting set of specificity residues and corresponding protein subfamilies, with granularity of subfamily division depending on a single adjustable parameter.
Method refinement and advanced use
The algorithm performs well in practice and has been tested in many protein families in consultation with domain experts. In the future, one interesting refinement of the algorithm would be a strict distinction between paralogous (same species) and orthologous (different species) variation, provided that enough sequences are available. We are also interested in applying the method to signal enhancement in the derivation of evolutionary trees by restricting phylogenetic analysis to the subset of functionally constrained residues. Our earlier work has demonstrated the way in which evolutionary trees of this type appear less noisy and potentially reach further back in evolutionary time. In another interesting application, joint specificity analysis across two protein families of potential interaction partners may lead to successful prediction of matched residue sets that are involved in protein-protein interactions [7, 28]. The kernel of the CEO method may also be applicable to the analysis of gene expression patterns, patterns of gene copy number changes, and large-scale genotyping datasets. This may lead to the discovery of novel subtypes of tissues and samples, and to the derivation of characteristic genetic and molecular patterns corresponding to different developmental and disease phenotypes (Reva B, Antipin Y, Sander C, unpublished).
Our results and examples demonstrate that the method can be used to identify functionally important residues from sequence information alone, without the use of three-dimensional structure or experimental functional annotation. Multiple applications are possible. The ability to locate functional determinants will be useful for the identification of residues in active sites that determine binding specificity; for the prediction of binding sites of protein complexes with other proteins, NAs, or other biomolecules; for assessing the biologic or medical significance of nonsynonymous single nucleotide polymorphisms; and for planning sharply focused mutation experiments to explore protein function. A particularly valuable application may be the design of therapeutic compounds that are highly specific to one (or a select few) of a series of paralogous proteins.
The method is publicly accessible via a web server hosted in the Computational Biology Center of Memorial Sloan Kettering Cancer Center.
Materials and methods
Definition of the algorithmic problem
On the intuitive level, the algorithmic problem is as follows. First, divide a given multiple sequence alignment into subfamilies (also called sequence clusters) such that each subfamily has a characteristic conservation signature at a number of sequence positions. Then, optimize the information in the subfamily division to achieve a reasonable compromise between the number of proteins in a subfamily and the number of characteristic residue positions used to distinguish the subfamilies from each other (the larger the number of proteins per subfamily, the smaller the number of characteristic residue positions, and vice versa; the two extremes of 'one sequence per subfamily' and 'all sequences in a single subfamily' are uninformative).
The combinatorial entropy of alignment column i in subfamily k is $S_{i,k} = \ln \left( N_k! / \prod_{\alpha} N_{\alpha,i,k}! \right)$. Here $N_k$ is the number of sequences in subfamily k; $N_{\alpha,i,k}$ is the number of residues of the type α in column i of subfamily k. (Gaps are taken into account as a separate residue type; α = 21 corresponds to a gap.) The numerator is the total number of permutations of $N_k$ symbols and the product in the denominator divides out the number of indistinguishable permutations for each residue type α.
The total entropy $S = \sum_i \sum_k S_{i,k}$ is an additive measure (both in terms of alignment columns and subfamilies) for comparing different distributions of residues. The statistical entropy depends on subfamily size. The entropy of the union of two subfamilies is always greater than or equal to the sum of entropies of the individual subfamilies. The entropy is equal to zero when all sequences are separated into subfamilies of a single sequence each (maximal fragmentation); the entropy is maximal when all sequences are united in one family (maximal unification). The dependence of the statistical entropy on subfamily sizes allows one to formulate an optimization problem, namely find the distribution of sequences into subfamilies that is maximally different from a random distribution of sequences. Subfamilies of sequences with many conserved residue patterns (which change across subfamilies) will contribute the most to the optimal solution.
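The limiting cases described above can be checked numerically. The sketch below assumes that the entropy of one column in one subfamily is the logarithm of the multinomial coefficient described in the text (permutations of N k symbols divided by the indistinguishable permutations for each residue type); the function names and the four-sequence alignment are hypothetical.

```python
from collections import Counter
from math import lgamma
from typing import Sequence


def column_entropy(residues: Sequence[str]) -> float:
    """ln( N! / prod_alpha N_alpha! ) for one column of one subfamily.

    Any gap symbol present is simply counted as one more residue type.
    """
    n = len(residues)
    counts = Counter(residues)
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts.values())


def total_entropy(subfamilies: Sequence[Sequence[str]]) -> float:
    """Sum of column entropies over all subfamilies and all alignment columns."""
    total = 0.0
    for seqs in subfamilies:
        for column in zip(*seqs):  # iterate over alignment columns
            total += column_entropy(column)
    return total


toy = ["ACD", "ACE", "GCD", "GCE"]  # hypothetical 4-sequence, 3-column alignment
fragmented = total_entropy([[s] for s in toy])   # one sequence per subfamily
split = total_entropy([toy[:2], toy[2:]])        # two subfamilies of two
unified = total_entropy([toy])                   # all sequences in one family
print(fragmented, split, unified)
```

For this toy alignment the entropy rises from zero (maximal fragmentation) through the two-subfamily split to its largest value for the single unified family, in line with the ordering stated in the text.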
We define specificity residues (also called characteristic or key residues) as residues that are conserved in a subfamily but differ between subfamilies. Thus, one is challenged to determine simultaneously the best division of the set of sequences into subfamilies and the subset of residues that best discriminates between these subfamilies. 'Best' is defined in terms of a contrast function that aims to measure the degree to which the specificity residues are distinctly different in each subfamily. The value of the contrast function is minimal for the best solution, with the result reported as a set of specificity residues and corresponding sequence subfamilies. The sections below describe the contrast function, the meaning of 'best', the optimization algorithm, and a criterion for selecting the top-ranked specificity residues.
Definition of the contrast function in terms of combinatorial entropy
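The combinatorial entropy of a uniformly mixed ('random') distribution, $\tilde{S}_{i,k}$, has the same form as the combinatorial entropy $S_{i,k}$ of column i in subfamily k, with the observed residue counts replaced by the expected counts $\tilde{N}_{\alpha,i,k} = N_k N_{\alpha,i} / N$, where $N_k$ is the number of sequences in subfamily k, $N_{\alpha,i}$ is the number of residues of type α in column i, and N is the total number of sequences (lines) in the alignment. (Because $\tilde{N}_{\alpha,i,k}$ can be noninteger numbers, $\tilde{N}_{\alpha,i,k}!$ is computed using the relation X! = Γ(X + 1).)
The combinatorial entropy difference $\Delta S_0 = \sum_i \sum_k \left( S_{i,k} - \tilde{S}_{i,k} \right)$ (Equation 6) is the contrast function to be minimized in the process of finding the best decomposition into subfamilies. (Because ΔS0 is a negative number, this means that the absolute value of ΔS0 is maximized.)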
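A per-column version of the contrast function can be sketched as below. It rests on an assumption that should be checked against the original equations: the expected count of residue type α in column i of subfamily k under uniform mixing is taken to be N k · Nα,i / N, and factorials of these noninteger counts are evaluated through the Gamma function as the text describes. All names and the toy data are hypothetical.

```python
from collections import Counter
from math import lgamma
from typing import Iterable, List, Sequence


def log_arrangements(n: float, counts: Iterable[float]) -> float:
    """ln( n! / prod c! ), with X! evaluated as Gamma(X + 1)."""
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)


def column_contrasts(subfamilies: Sequence[Sequence[str]]) -> List[float]:
    """Per-column contribution to Delta S0 for a given division into subfamilies."""
    all_seqs = [s for fam in subfamilies for s in fam]
    n_total = len(all_seqs)
    contributions = []
    for i in range(len(all_seqs[0])):
        family_counts = Counter(s[i] for s in all_seqs)       # N_{alpha,i}
        delta = 0.0
        for fam in subfamilies:
            n_k = len(fam)
            observed = Counter(s[i] for s in fam).values()    # N_{alpha,i,k}
            # Assumed uniform-mixing expectation; values may be noninteger.
            expected = [n_k * c / n_total for c in family_counts.values()]
            delta += log_arrangements(n_k, observed) - log_arrangements(n_k, expected)
        contributions.append(delta)
    return contributions


toy_good = [["ACD", "ACE"], ["GTD", "GTE"]]  # columns 0 and 1 separate the subfamilies
toy_poor = [["ACD", "GTD"], ["ACE", "GTE"]]  # only column 2 separates them
good = column_contrasts(toy_good)
poor = column_contrasts(toy_poor)
ranking = sorted(range(len(good)), key=lambda i: good[i])  # most negative first
print(sum(good), sum(poor), ranking)
```

Summing the list gives ΔS0 for the whole partition, while sorting the columns by their individual contributions gives a ranking of candidate specificity positions of the kind used in the paper to prioritize residues.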
The optimization algorithm
A straightforward solution to the optimization problem would be to enumerate all possible partitionings of the set of sequences into subfamilies, calculate the combinatorial entropy difference (the contrast function) as in Equation 6, and then choose the partitioning with the lowest value of ΔS0. The only problem with this approach is that the number of partitionings of N sequences into K clusters is astronomically large for all but very small values of N and K. One therefore needs an effective strategy for exploring a reasonable subset of partitionings with the aim of finding one with a value of the contrast function close to the global optimum. Often such complex value landscapes are explored using stochastic algorithms, which can be used in future implementations. In this report we use a simple deterministic hierarchical clustering method with each clustering step guided by evaluation of a guide function (Equation 7) for all alternative choices in that step.
Starting from N clusters, each containing one sequence, in each clustering step all pairs of clusters are considered as merger candidates. The pair of clusters with the lowest value of the guide function is merged into one cluster. The merger steps are repeated until all sequences are in one cluster. At this stage the result is a complete trajectory of merger steps, which can be represented as a tree (not shown), and the task is to choose the best partitioning (tree level). The best partitioning is defined as the one with the minimal value of ΔS0, or the maximal absolute value of the combinatorial entropy difference between the actual and uniformly mixed ('random') distribution of residue types (Equation 6). The complexity of the hierarchical clustering algorithm is O(N² ln N), where N is the number of sequences in the multiple alignment.
To explore different partitionings of sequences into subfamilies, the guide function includes a penalty term. The penalty term affects the clustering trajectory by favoring mergers that result in smaller clusters over those that result in larger clusters. To explore a larger space of alternative partitionings, we perform hierarchical clustering for different relative weights of the penalty term.
averaged over all L columns of the alignment.
Where $N_k$ and $N_m$ are the numbers of sequences in the corresponding clusters k and m.
$\Delta S'_{k,m}$ is the maximal possible value of the combinatorial entropy (per column) after merging clusters of size $N_k$ and $N_m$. This second term simply captures the mere size contribution to the entropy and counteracts the tendency toward trajectories with early emergence of dominant large clusters. This tendency is due to the fact that the entropy of a larger system is always greater than the sum of the entropy values of its subsystems. Whatever the trajectories explored and whatever the devices used to guide the exploration of trajectory space, the evaluation of best partitioning is exclusively based on the combinatorial entropy difference of Equation 6.
Note that although the guide function determines the details of each clustering step, the final optimum is chosen as the minimum of the combinatorial entropy difference (Equation 6) in the two-dimensional space of two variables, the clustering step l, and the penalty weight (1 - A). Typical optimal values of A in tests for diverse protein families range between 0.6 and 0.9.
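The overall procedure (greedy mergers followed by selection of the tree level with minimal ΔS0) can be sketched as below. Two assumptions are made and should not be read as the authors' method: the guide function is replaced by the plain entropy increase caused by a merger, without the penalty term weighted by (1 - A), and the uniformly mixed reference uses expected counts N k · Nα,i / N. The pair search is also naive and does not attempt to reach the O(N² ln N) complexity quoted in the text. All names and the toy alignment are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import lgamma
from typing import List, Sequence, Tuple


def entropy(seqs: Sequence[str]) -> float:
    """Combinatorial entropy of one cluster, summed over alignment columns."""
    n = len(seqs)
    return sum(lgamma(n + 1) - sum(lgamma(c + 1) for c in Counter(col).values())
               for col in zip(*seqs))


def mixed_entropy(n_k: int, column_counts: List[Counter], n_total: int) -> float:
    """Entropy of a cluster of size n_k under uniform mixing (assumed form)."""
    return sum(lgamma(n_k + 1)
               - sum(lgamma(n_k * c / n_total + 1) for c in counts.values())
               for counts in column_counts)


def merge_cost(a: List[str], b: List[str]) -> float:
    """Stand-in guide function: entropy increase caused by merging a and b."""
    return entropy(a + b) - entropy(a) - entropy(b)


def cluster(seqs: Sequence[str]) -> Tuple[float, List[List[str]]]:
    """Greedy agglomerative clustering; returns (Delta S0, partition) at the best level."""
    n_total = len(seqs)
    column_counts = [Counter(col) for col in zip(*seqs)]
    clusters = [[s] for s in seqs]
    best = (float("inf"), clusters)
    while True:
        delta = sum(entropy(c) - mixed_entropy(len(c), column_counts, n_total)
                    for c in clusters)
        if delta < best[0]:
            best = (delta, [list(c) for c in clusters])
        if len(clusters) == 1:
            return best
        # Merge the pair of clusters with the lowest value of the stand-in guide function.
        k, m = min(combinations(range(len(clusters)), 2),
                   key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]]))
        merged = clusters[k] + clusters[m]
        clusters = [c for j, c in enumerate(clusters) if j not in (k, m)] + [merged]


toy = ["ACK", "ACK", "ACK", "GTK", "GTK", "GTK"]  # two obvious subfamilies
best_delta, best_partition = cluster(toy)
print(best_delta, best_partition)
```

Repeating `cluster` with a guide function that includes the penalty term, once for each value of A, and keeping the overall minimum of ΔS0 would correspond to the two-dimensional search over clustering step and penalty weight described above.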
Evidence for selective pressure and selection of specificity residues
We compared entropy plots for the original alignment with the entropy plot for a randomized alignment (for details, see Figure 7). The differences between the original and the randomized entropy plots are drastic; there are no downturn and upturn regions in the entropy plots for randomized alignments, and the absolute values of the entropy differences produced for the randomized alignments are several times smaller than those of the original alignments.
Here, $\langle s \rangle_i = -\sum_{\alpha} f_{\alpha,i} \ln f_{\alpha,i}$ is the average entropy per residue for the residue distribution in alignment column i; $f_{\alpha,i}$ is the fraction of residues of type α in column i (α = 21 for gaps). We require $\langle s \rangle_i < 0.03$ and $f_{21,i} < 0.5$ for globally conserved columns; mathematical details related to Equations 10 and 11 are provided in Additional data file 3.
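The two thresholds for globally conserved columns can be applied as in the following sketch. It assumes that the average entropy per residue is the Shannon entropy of the column's residue fractions computed with the natural logarithm, and that gaps are written as '-'; both are assumptions, and the function names and toy alignment are hypothetical.

```python
from collections import Counter
from math import log
from typing import List, Sequence

GAP = "-"  # assumed gap symbol


def column_shannon_entropy(column: Sequence[str]) -> float:
    """-sum_alpha f_alpha ln f_alpha over the residue types present in the column."""
    n = len(column)
    return -sum((c / n) * log(c / n) for c in Counter(column).values())


def globally_conserved(alignment: Sequence[str],
                       max_entropy: float = 0.03,
                       max_gap_fraction: float = 0.5) -> List[int]:
    """Indices of columns with entropy below max_entropy and gap fraction below max_gap_fraction."""
    conserved = []
    for i, column in enumerate(zip(*alignment)):
        gap_fraction = column.count(GAP) / len(column)
        if column_shannon_entropy(column) < max_entropy and gap_fraction < max_gap_fraction:
            conserved.append(i)
    return conserved


toy = ["ACK", "GCK", "ACK", "GCK"]  # column 0 varies; columns 1 and 2 are invariant
conserved_columns = globally_conserved(toy)
print(conserved_columns)
```

Columns that pass this filter would be set aside as conserved residues before specificity residues are selected from the remaining positions.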
Test application: prediction of contact residues and evaluation of accuracy
Specificity residues - and, of course, globally conserved residues - reflect functional constraints that operate in evolution. They are an informational fossil record, most clearly visible over large evolutionary intervals during which the background distribution may vary considerably. The constraints can be of diverse origin, but it is plausible that all constraints can be traced to the requirements of intermolecular interactions that are important for survival. Therefore, prediction of specificity residues has broad applicability for the identification of functional interactions and, as a consequence, for ranking genetic variation, for planning mutation experiments, or for the molecular design of specificity.
Here, we test one particular application of the identification of specificity residues from multiple sequence alignments: the prediction of intermolecular interfaces. We use known three-dimensional structures of protein and DNA complexes from the Protein Data Bank (PDB) as defining experimental reality against which predictions are compared. A key limitation is that there may be several such interfaces in a given protein family and that the complexes in the PDB contain only a subset of these. Nonetheless, it is instructive to see the extent to which specificity residues, interpreted as predicted interface residues, overlap with known intermolecular interfaces. A large overlap indicates good prediction accuracy, but over-prediction (false positives) is expected.
Where the numerator represents the number of all possible assignments for which the sets of size S and L have A or more common residues; and the denominator represents the total number of all possible assignments up to complete overlap of the two sets. To correct for the $N_c$ globally conserved residues, which by definition are excluded from being identified as specificity residues, we use $N - N_c$ in Equation 12 in place of N.
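A P value of this kind can be computed as a hypergeometric tail probability. The sketch below is one standard way to count overlaps between two residue sets and is assumed, not confirmed, to correspond to Equation 12; the parameter names are hypothetical, with `n_positions` standing for N - N c, `n_predicted` for the number of predicted specificity residues, `n_interface` for the number of interface residues, and `n_common` for the observed overlap.

```python
from math import comb


def overlap_p_value(n_positions: int, n_predicted: int,
                    n_interface: int, n_common: int) -> float:
    """Probability that a random set of n_predicted positions shares n_common or
    more positions with a fixed set of n_interface positions."""
    upper = min(n_predicted, n_interface)
    # Numerator: assignments with j >= n_common shared positions.
    tail = sum(comb(n_interface, j) * comb(n_positions - n_interface, n_predicted - j)
               for j in range(n_common, upper + 1))
    # Denominator: all ways to choose n_predicted positions out of n_positions.
    total = comb(n_positions, n_predicted)
    return tail / total


# Hypothetical numbers: 10 eligible positions, 3 predicted, 3 in the interface.
p_all_three = overlap_p_value(10, 3, 3, 3)
p_any = overlap_p_value(10, 3, 3, 0)
print(p_all_three, p_any)
```

Because the counts are exact integers, the result does not suffer from the rounding problems of factorial-based formulas, even for domains of several hundred residues.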
Choice of multiple sequence alignments
The multiple sequence alignments are the only source of information used in the predictions. Predictions are best for accurate, nonredundant alignments of diverse sequences without significant gap regions. In the interface prediction tests, we used alignments from the 'Superfamily' and PFAM collections, as well as the Homology-Derived Secondary Structure of Proteins database and curated alignments of human protein kinases from the Protein Kinase Resource. As needed, the original alignments were prepared for specificity analysis by trimming deletions and insertions across the whole alignment so as to preserve the continuity of the main sequence (the sequence of a given protein); removing redundant sequences (typically at the level of about 95% identical residues for large alignments) using the MView program [38, 39]; and removing sequences with many gaps (for example, with more than about 10% to 20% gaps compared with the main sequence). Finally, the total number of sequences in the alignment must be large (>100).
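The three preparation steps can be illustrated with a short sketch. It is not a substitute for MView, which the authors used for redundancy removal: the identity measure, the greedy order of filtering, the '-' gap symbol, and all function names are assumptions chosen for illustration.

```python
from typing import List

GAP = "-"  # assumed gap symbol


def identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def prepare_alignment(alignment: List[str],
                      max_identity: float = 0.95,
                      max_gap_fraction: float = 0.2) -> List[str]:
    """Trim to the main (first) sequence, then drop gappy and redundant sequences."""
    main = alignment[0]
    keep_cols = [i for i, r in enumerate(main) if r != GAP]  # keep main sequence continuous
    trimmed = ["".join(seq[i] for i in keep_cols) for seq in alignment]
    kept = [trimmed[0]]
    for seq in trimmed[1:]:
        if seq.count(GAP) / len(seq) > max_gap_fraction:
            continue  # too many gaps relative to the main sequence
        if any(identity(seq, other) >= max_identity for other in kept):
            continue  # redundant with a sequence that is already kept
        kept.append(seq)
    return kept


toy = [
    "ACDE-FGHIKL",  # main sequence with one insertion column
    "ACDE-FGHIKL",  # exact duplicate of the main sequence
    "ACDEWFGHIKV",  # 90% identical after trimming
    "A-----GHIKL",  # 40% gaps after trimming
    "MCDE-YGHLKV",  # more divergent homolog
]
prepared = prepare_alignment(toy)
print(prepared)
```

A check that at least about 100 sequences remain after filtering would complete the preparation described above.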
Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is a table summarizing the results of a robustness analysis of the method, as described in the main text. Additional data file 2 is a table summarizing the results of optimal clustering of 126 GTPases of human Ras superfamily. Additional data file 3 is a tutorial section that explains the link between the common notion of probability entropy (information entropy) and the less well known formulation of combinatorial entropy.
Source code of the core method is available on request from the authors, subject to acceptance of a public domain license.
CEO: combinatorial entropy optimization
PDB: Protein Data Bank
We thank two anonymous reviewers for challenging questions and comments. We thank Joanne Edington, Maureen Higgins, and Alex Lash for helpful suggestions and support, and Daniel Eisenbud for comparison of methods. This work was funded in part by the Alfred W Bressler Scholars Endowment Fund and by Atlantic Philanthropies.
- Hussain SP, Hofseth LJ, Harris CC: Tumor suppressor genes: at the crossroads of molecular carcinogenesis, molecular epidemiology and human risk assessment. Lung Cancer. 2001, Suppl 2: S7-S15. 10.1016/S0169-5002(01)00339-7.
- Heo WD, Meyer T: Switch-of-function mutants based on morphology classification of Ras superfamily small GTPases. Cell. 2003, 113: 315-328. 10.1016/S0092-8674(03)00315-5.
- Yang Z, Ro S, Rannala B: Likelihood models of somatic mutation and codon substitution in cancer genes. Genetics. 2003, 165: 695-705.
- Greenblatt MS, Beaudet JG, Gump JR, Godin KS, Trombley L, Koh J, Bond JP: Detailed computational study of p53 and p16: using evolutionary sequence analysis and disease-associated mutations to predict the functional consequences of allelic variants. Oncogene. 2003, 22: 1150-1163. 10.1038/sj.onc.1206101.
- Xi T, Jones IM, Mohrenweiser HW: Many amino acid substitution variants identified in DNA repair genes during human population screenings are predicted to impact protein function. Genomics. 2004, 83: 970-979. 10.1016/j.ygeno.2003.12.016.
- Buchholz TA, Weil MM, Ashorn CL, Strom EA, Sigurdson A, Bondy M, Chakraborty R, Cox JD, McNeese MD, Story MD: A Ser49Cys variant in the ataxia telangiectasia, mutated, gene that is more common in patients with breast carcinoma compared with population controls. Cancer. 2004, 100: 1345-1351. 10.1002/cncr.20133.
- Casari G, Sander C, Valencia A: A method to predict functional residues in proteins. Nat Struct Biol. 1995, 2: 171-178. 10.1038/nsb0295-171.
- Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol. 1996, 257: 342-358. 10.1006/jmbi.1996.0167.
- Mihalek I, Res I, Lichtarge O: A family of evolution-entropy hybrid methods for ranking protein residues by importance. J Mol Biol. 2004, 336: 1265-1282. 10.1016/j.jmb.2003.12.078.
- Mirny LA, Shakhnovich EI: Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J Mol Biol. 1999, 291: 177-196. 10.1006/jmbi.1999.2911.
- Afonnikov DA, Oshchepkov DY, Kolchanov NA: Detection of conserved physico-chemical characteristics of proteins by analyzing clusters of positions with co-ordinated substitutions. Bioinformatics. 2001, 17: 1035-1046. 10.1093/bioinformatics/17.11.1035.
- Oliveira L, Paiva AC, Vriend G: Correlated mutation analyses on very large sequence families. Chembiochem. 2002, 3: 1010-1017. 10.1002/1439-7633(20021004)3:10<1010::AID-CBIC1010>3.0.CO;2-T.
- Goh CS, Cohen FE: Co-evolutionary analysis reveals insights into protein-protein interactions. J Mol Biol. 2002, 324: 177-192. 10.1016/S0022-2836(02)01038-0.
- Lockless SW, Ranganathan R: Evolutionarily conserved pathways of energetic connectivity in protein families. Science. 1999, 286: 295-299. 10.1126/science.286.5438.295.
- Suel GM, Lockless SW, Wall MA, Ranganathan R: Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat Struct Biol. 2003, 10: 59-69. 10.1038/nsb881.
- Kalinina OV, Mironov AA, Gelfand MS, Rakhmaninova AB: Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families. Protein Sci. 2004, 13: 443-456. 10.1110/ps.03191704.
- Donald JE, Shakhnovich EI: Predicting specificity-determining residues in two large eukaryotic transcription factor families. Nucleic Acids Res. 2005, 33: 4455-4465. 10.1093/nar/gki755.
- Marttinen P, Corander J, Törönen P, Holm L: Bayesian search of functionally divergent protein subgroups and their function specific residues. Bioinformatics. 2006, 22: 2466-2474. 10.1093/bioinformatics/btl411.
- Everitt BS, Landau S, Leese M: Cluster Analysis. 2001, Arnold Publishers, Oxford University Press, US, 4th edition. ISBN 0340761199.
- Predicts functional residues in a protein. Based on entropy analysis of a multiple sequence alignment. [http://proteinfunction.org]
- Jaffe AB, Hall A: Rho GTPases: biochemistry and biology. Annu Rev Cell Dev Biol. 2005, 21: 247-269. 10.1146/annurev.cellbio.21.020604.150721.
- Hall BE, Yang SS, Boriack-Sjodin PA, Kuriyan J, Bar-Sagi D: Structure-based mutagenesis reveals distinct functions for Ras switch 1 and switch 2 in Sos-catalyzed guanine nucleotide exchange. J Biol Chem. 2001, 276: 27629-27637. 10.1074/jbc.M101727200.
- Li R, Zheng Y: Residues of the Rho family GTPases Rho and Cdc42 that specify sensitivity to Dbl-like guanine nucleotide exchange factors. J Biol Chem. 1997, 272: 4671-4679. 10.1074/jbc.272.8.4671.
- Elliot-Smith AE, Mott HR, Lowe PN, Laue ED, Owen D: Specificity determinants on Cdc42 for binding its effector protein ACK. Biochemistry. 2005, 44: 12373-12383. 10.1021/bi0506021.
- Karnoub AE, Symons M, Campbell SL, Der CJ: Molecular basis for Rho GTPase signaling specificity. Breast Cancer Res Treat. 2004, 84: 61-71. 10.1023/B:BREA.0000018427.84929.5c.
- Stenmark H, Valencia A, Martinez O, Ullrich O, Goud B, Zerial M: Distinct structural elements of rab5 define its functional specificity. EMBO J. 1994, 13: 575-583.
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995, 247: 536-540. 10.1006/jmbi.1995.0159.
- Pazos F, Helmer-Citterich M, Ausiello G, Valencia A: Correlated mutations contain information about protein-protein interaction. J Mol Biol. 1997, 271: 511-523. 10.1006/jmbi.1997.1198.
- Landau LD, Lifshitz EM: Statistical Physics, part 1. 1996, Oxford, UK: Butterworth-Heinemann, 3rd edition.
- Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical Recipes in C. 1992, Cambridge, UK: Cambridge University Press.
- Manning CD, Raghavan P, Schütze H: Introduction to Information Retrieval. 2007, Cambridge, UK: Cambridge University Press.
- Reva BA, Rykunov DS, Finkelstein AV, Skolnick J: Optimization of protein structure on lattices using a self-consistent field approach. J Comput Biol. 1998, 5: 531-538.
- Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of Hidden Markov Models that represent all proteins of known structure. J Mol Biol. 2001, 313: 903-919. 10.1006/jmbi.2001.5080.
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, et al: The Pfam protein families database. Nucleic Acids Res. 2004, 32 (Database issue): D138-D141. 10.1093/nar/gkh121.
- Schneider R, Sander C: The HSSP database of protein structure-sequence alignments. Nucleic Acids Res. 1996, 24: 201-205. 10.1093/nar/24.1.201.
- Hanks S, Quinn AM: Protein kinase catalytic domain sequence database: identification of conserved features of primary structure and classification of family members. Methods Enzymol. 1991, 200: 38-62.
- Smith C, Shindyalov IN, Veretnik S, Gribskov M, Taylor S, Ten Eyck LF, Bourne PE: The Protein Kinase Resource. Trends Biochem Sci. 1997, 22: 444-446. 10.1016/S0968-0004(97)01131-6.
- Brown NP, Leroy C, Sander C: MView: a web compatible database search or multiple alignment viewer. Bioinformatics. 1998, 14: 380-381. 10.1093/bioinformatics/14.4.380.
- Hobohm U, Sander C: A sequence property approach to searching protein databases. J Mol Biol. 1995, 251: 390-399. 10.1006/jmbi.1995.0442.
- Sayle R, Bissell A: RasMol: a program for fast realistic rendering of molecular structures with shadows. Proceedings of the 10th Eurographics UK '92 Conference, University of Edinburgh, Scotland. 1992.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.